Concept
MVCC: Reading the Past Without Locks
Multi-Version Concurrency Control (MVCC) is the storage engine mechanism that makes Spanner's snapshot reads possible. Instead of maintaining a single current value for each key, the database keeps a complete history of values, each stamped with the commit timestamp of the transaction that wrote it. A read at any snapshot timestamp can then retrieve the correct historical value without acquiring any lock.
The Version Chain
For each key, Spanner maintains an ordered list of (commit_timestamp, value) pairs. When a write transaction commits at timestamp ts, it appends the pair (ts, new_value) to the chain. The chain is always kept sorted by timestamp in ascending order. A snapshot read at timestamp t scans the chain backward and returns the value at the largest commit_ts ≤ t:
def read_at(key, snapshot_ts):
versions = store[key] # sorted ascending by ts
result = None
for (ts, val) in versions:
if ts <= snapshot_ts:
result = (ts, val) # keep updating — want latest ts <= snapshot
else:
break
return result or NOT_FOUND
This is a pure read with no locks, no barriers, and no coordination with concurrent writers. The snapshot timestamp acts as a "time machine" — the query is answered as if the database were frozen at that instant.
Snapshot Reads vs Read-Write Transactions
Spanner supports two fundamentally different transaction modes:
- Snapshot (read-only) transactions: choose a snapshot timestamp, read from MVCC, acquire zero locks, can execute on any replica whose safe time exceeds the snapshot timestamp. These are the workhorses of analytical and reporting queries.
- Read-write transactions: read the latest version of each key (acquiring read locks to detect conflicts), buffer writes in memory, then commit via 2PC at a TrueTime-derived commit timestamp. Each written key gets a new version appended at the commit timestamp.
The crucial insight is that snapshot readers never block writers and writers never block snapshot readers. The two access patterns are completely independent, which is why Spanner can serve millions of reads per second from read replicas while write transactions proceed on the primary.
Choosing the Snapshot Timestamp
A snapshot read must choose its timestamp carefully. Spanner supports several modes:
- Strong read: the client calls
TT.now().latestto get a timestamp that is guaranteed to reflect all committed writes. The client must then wait for the chosen replica's safe time to advance past this timestamp before the read can execute. This provides linearizable reads. - Bounded staleness read: the client specifies a maximum allowed staleness (e.g., "no older than 10 seconds"). The system picks the most recent timestamp for which data is already available locally, avoiding any wait. This is much faster but may miss very recent writes.
- Exact staleness read: the client provides a specific past timestamp. Useful for batch processing where the exact snapshot time is known.
Garbage Collection
MVCC version chains grow indefinitely if never trimmed. Spanner garbage-collects versions older than the oldest active snapshot transaction. In practice, this means versions older than approximately 1 hour are eligible for GC (Spanner's default maximum staleness for bounded staleness reads). The GC process is coordinated across replicas to ensure that no live transaction is pointing at a version that gets deleted.
Correctness Invariants
- Version ordering: commit timestamps must be strictly increasing per key. Two transactions writing the same key must be serialized so their timestamps are distinct.
- Safe time before reads: a replica must not serve a snapshot read at timestamp
tuntil its safe time exceedst. Serving a stale read would violate the snapshot's consistency guarantee. - No retroactive writes: a write at timestamp
tsmust not be applied to a replica that has already served reads at timestamps≥ ts. This is enforced by the prepare timestamp rules in 2PC and TrueTime commit wait.
How MVCC Connects to TrueTime
TrueTime commit wait ensures that when a transaction with commit timestamp s is released to clients, the true wall time is provably greater than s. Any client that learns about this commit and starts a new snapshot read will call TT.now().latest, which will return a value strictly greater than s. This means the new read's snapshot timestamp is greater than s, so it is guaranteed to see the committed write. MVCC and TrueTime together create a system where reads are always consistent with the causal order of the real world.
Sign in to run and submit code.