Concept
SQL on Top of a Distributed Key-Value Store
Spanner presents a standard SQL interface to application developers, but underneath every SQL query is a key-value store operation routed to one or more shards. Understanding this translation is key to writing efficient Spanner queries and to understanding how Spanner achieves horizontal scalability without sacrificing SQL semantics.
Key-Range Partitioning
Spanner uses key-range (range) sharding, not hash sharding. Each shard owns a contiguous, non-overlapping interval of the key space: [start_key, end_key). The full set of shard ranges forms a partition of the key space with no gaps and no overlaps.
Range sharding has a critical advantage over hash sharding for analytical workloads: range scans are local. A query like SELECT * WHERE key BETWEEN 100 AND 400 only touches the shards whose ranges overlap [100, 400]. With hash sharding, any range query must contact all shards because consecutive keys are randomly distributed. The cost is that range sharding can create hot spots if data is accessed sequentially (e.g., all inserts going to the last shard). Spanner mitigates this by automatically splitting and rebalancing shards.
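To make the locality contrast concrete, here is a toy sketch (not Spanner's actual placement logic) comparing which shards a range query over keys 100 through 400 would touch under each scheme; the shard map and hash function below are illustrative assumptions.

# Toy shard map: three shards covering the key space [0, 1000) with
# contiguous half-open ranges.
shard_map = [("shard-a", 0, 200), ("shard-b", 200, 500), ("shard-c", 500, 1000)]

def range_scan_targets(lo, hi):
    # Range sharding: only shards whose [start, end) overlaps [lo, hi] are touched.
    return [name for (name, start, end) in shard_map if start <= hi and end > lo]

def hash_scan_targets(lo, hi, num_shards=3):
    # Hash sharding: consecutive keys scatter, so the scan must contact every
    # shard that any key in [lo, hi] hashes to (in practice, all of them).
    return sorted({hash(k) % num_shards for k in range(lo, hi + 1)})

print(range_scan_targets(100, 400))  # ['shard-a', 'shard-b']: two adjacent shards
print(hash_scan_targets(100, 400))   # [0, 1, 2]: every shard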
The Shard Map
The query router maintains a sorted list of (shard_name, start_key, end_key) entries. To route a point query at key k, the router does a binary search for the shard satisfying start ≤ k < end. To route a range scan [lo, hi], it finds all shards that overlap: any shard where shard.start ≤ hi and shard.end > lo.
def find_shards_for_range(lo, hi, shard_map):
    result = []
    for (name, start, end) in shard_map:
        if start > hi:
            break  # shards are sorted by start; none further can overlap
        if end > lo:
            result.append(name)
    return result
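For the point-query side of the routing described above, here is a minimal sketch of the binary search, assuming the same sorted list of (name, start, end) tuples; returning None stands in for the NO_SHARD case discussed later.

import bisect

def find_shard_for_key(k, shard_map):
    # Binary search for the rightmost shard whose start <= k, then confirm
    # k < end, since ranges are half-open [start, end).
    starts = [start for (_, start, _) in shard_map]
    i = bisect.bisect_right(starts, k) - 1
    if i < 0:
        return None  # k precedes the first shard's range
    name, start, end = shard_map[i]
    return name if k < end else None  # None stands in for NO_SHARD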
Cross-Shard Range Scans
When a SCAN query spans multiple shards, the router issues sub-queries to each relevant shard in parallel and then merges the results. Because each shard returns rows sorted by key (key-range sharding preserves order within a shard), the merge is a simple k-way merge that produces a globally sorted result in a single pass, in O(n log k) time for n rows across k shards. This is the same merge strategy used by distributed sort-merge joins and by the sorted-merge phase of MapReduce.
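A minimal sketch of that merge step, assuming each shard has already returned its rows sorted by key; heapq.merge keeps one cursor per shard stream and always emits the smallest head next.

import heapq

def merge_shard_results(per_shard_rows):
    # per_shard_rows: one list of (key, value) pairs per shard, each already
    # sorted by key. Yields rows in global key order without a full re-sort.
    yield from heapq.merge(*per_shard_rows, key=lambda row: row[0])

# Two shards return locally sorted rows; the merged stream is globally sorted.
shard1 = [(1, "a"), (4, "d")]
shard2 = [(2, "b"), (3, "c")]
print(list(merge_shard_results([shard1, shard2])))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]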
In production Spanner, the shard boundaries are dynamic — the system automatically splits a shard that is too large or too hot, and merges adjacent shards that are too small. The query router uses a metadata table (itself stored in Spanner) to look up the current shard map before routing each query.
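At the shard-map level, a split can be pictured as replacing one entry with two adjacent half-open ranges. The sketch below illustrates only that bookkeeping; the split-point choice and the -left/-right names are assumptions, not how Spanner picks or names splits.

def split_shard(shard_map, name, split_key):
    # Replace (name, start, end) with two ranges [start, split_key) and
    # [split_key, end) that cover exactly the same interval, so partition
    # completeness is preserved.
    new_map = []
    for (n, start, end) in shard_map:
        if n == name:
            assert start < split_key < end, "split key must fall strictly inside the range"
            new_map.append((f"{n}-left", start, split_key))
            new_map.append((f"{n}-right", split_key, end))
        else:
            new_map.append((n, start, end))
    return new_map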
INSERT and Point SELECT
An INSERT at key k is routed to the single shard owning k. The router writes data[shard][k] = value and annotates the output with the shard name for observability. A SELECT at key k follows the same routing logic — if the key is not found in the owning shard's data, the result is NOT_FOUND. If the key falls outside all defined shard ranges (a gap or beyond the end of the last shard), the result is NO_SHARD.
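A minimal sketch of that routing, reusing find_shard_for_key from the point-lookup sketch above; data is a plain per-shard dict, and the status strings are illustrative stand-ins for the real system's return values.

def route_insert(k, value, shard_map, data):
    shard = find_shard_for_key(k, shard_map)
    if shard is None:
        return "NO_SHARD"  # k falls in a gap or beyond the last shard
    data.setdefault(shard, {})[k] = value
    return f"OK (shard={shard})"  # shard name included for observability

def route_select(k, shard_map, data):
    shard = find_shard_for_key(k, shard_map)
    if shard is None:
        return "NO_SHARD"
    value = data.get(shard, {}).get(k)
    return value if value is not None else "NOT_FOUND"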
Correctness Invariants
- Partition completeness: every valid key must fall within exactly one shard. Gaps in the shard map would make some keys unroutable; overlaps would route the same key to multiple shards, potentially creating split-brain inconsistency.
- Stable sort on scan: the merged output of a cross-shard scan must be sorted by key. Each individual shard returns its results sorted; the merge must preserve this ordering globally.
- Boundary semantics: shard ranges are half-open intervals [start, end). The key equal to end belongs to the next shard, not the current one. This must be applied consistently in both INSERT routing and SCAN range filtering (a validation sketch for the shard map follows this list).
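Here is a small sketch of how the completeness and boundary invariants can be checked directly against a shard map, assuming the map is the same sorted list of (name, start, end) tuples used above and that the valid key space is [min_key, max_key):

def validate_shard_map(shard_map, min_key, max_key):
    # Verify that the sorted shard map covers [min_key, max_key) with
    # half-open ranges that leave no gaps and never overlap.
    expected_start = min_key
    for (name, start, end) in shard_map:
        assert start == expected_start, f"gap or overlap before shard {name}"
        assert start < end, f"shard {name} has an empty or inverted range"
        expected_start = end
    assert expected_start == max_key, "shard map does not cover the end of the key space"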
How Spanner's F1 SQL Layer Uses This
The F1 SQL engine (described in the F1 paper, Shute et al., VLDB 2013) is a full distributed SQL engine built on top of Spanner's key-value API. F1 supports joins, aggregations, subqueries, and secondary indexes, all implemented as combinations of key-range scans and distributed hash joins. The query planner generates a query plan that minimizes cross-shard communication — for example, by pushing predicates down to individual shards so that each shard filters its own rows before sending results to the coordinator for merging. This principle — pushing computation to data rather than pulling all data to a central node — is the fundamental design principle of distributed query execution.
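As a toy illustration of predicate pushdown in the same shard-map model (reusing find_shards_for_range from the routing code and merge_shard_results from the merge sketch; the function name and shapes are assumptions, not F1's API): each shard filters its own rows, and the coordinator merges only the survivors.

def scan_with_pushdown(lo, hi, predicate, shard_map, data):
    # Push the filter down: each shard evaluates the predicate on its own
    # rows and ships only matching rows, so the coordinator merges far less.
    per_shard = []
    for shard in find_shards_for_range(lo, hi, shard_map):
        rows = sorted(data.get(shard, {}).items())
        per_shard.append([(k, v) for (k, v) in rows
                          if lo <= k <= hi and predicate(k, v)])
    return list(merge_shard_results(per_shard))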