ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/projects/mini-spanner/tasks/spanner-t3-s1-mvcc
DS

MVCC and Snapshot Reads

Mini-Spanner / Distributed Transactions
hard

Concept

MVCC Version Chain for Key "x" key "x" ts=1 : val="alice" (oldest version) ts=5 : val="bob" ts=9 : val="carol" (latest version) READ T1 @ ts=4 sees ts=1 (latest ≤ 4) → "alice" READ T2 @ ts=7 sees ts=5 (latest ≤ 7) → "bob" READ T3 @ ts=9 sees ts=9 (latest ≤ 9) → "carol" All reads are lock-free — each transaction reads its own consistent snapshot Writes create new versions; reads scan backward from snapshot ts to find latest ≤ ts

MVCC: Reading the Past Without Locks

Multi-Version Concurrency Control (MVCC) is the storage engine mechanism that makes Spanner's snapshot reads possible. Instead of maintaining a single current value for each key, the database keeps a complete history of values, each stamped with the commit timestamp of the transaction that wrote it. A read at any snapshot timestamp can then retrieve the correct historical value without acquiring any lock.

The Version Chain

For each key, Spanner maintains an ordered list of (commit_timestamp, value) pairs. When a write transaction commits at timestamp ts, it appends the pair (ts, new_value) to the chain. The chain is always kept sorted by timestamp in ascending order. A snapshot read at timestamp t scans the chain backward and returns the value at the largest commit_ts ≤ t:

def read_at(key, snapshot_ts):
    versions = store[key]   # sorted ascending by ts
    result = None
    for (ts, val) in versions:
        if ts <= snapshot_ts:
            result = (ts, val)   # keep updating — want latest ts <= snapshot
        else:
            break
    return result or NOT_FOUND

This is a pure read with no locks, no barriers, and no coordination with concurrent writers. The snapshot timestamp acts as a "time machine" — the query is answered as if the database were frozen at that instant.

Snapshot Reads vs Read-Write Transactions

Spanner supports two fundamentally different transaction modes:

  • Snapshot (read-only) transactions: choose a snapshot timestamp, read from MVCC, acquire zero locks, can execute on any replica whose safe time exceeds the snapshot timestamp. These are the workhorses of analytical and reporting queries.
  • Read-write transactions: read the latest version of each key (acquiring read locks to detect conflicts), buffer writes in memory, then commit via 2PC at a TrueTime-derived commit timestamp. Each written key gets a new version appended at the commit timestamp.

The crucial insight is that snapshot readers never block writers and writers never block snapshot readers. The two access patterns are completely independent, which is why Spanner can serve millions of reads per second from read replicas while write transactions proceed on the primary.

Choosing the Snapshot Timestamp

A snapshot read must choose its timestamp carefully. Spanner supports several modes:

  • Strong read: the client calls TT.now().latest to get a timestamp that is guaranteed to reflect all committed writes. The client must then wait for the chosen replica's safe time to advance past this timestamp before the read can execute. This provides linearizable reads.
  • Bounded staleness read: the client specifies a maximum allowed staleness (e.g., "no older than 10 seconds"). The system picks the most recent timestamp for which data is already available locally, avoiding any wait. This is much faster but may miss very recent writes.
  • Exact staleness read: the client provides a specific past timestamp. Useful for batch processing where the exact snapshot time is known.

Garbage Collection

MVCC version chains grow indefinitely if never trimmed. Spanner garbage-collects versions older than the oldest active snapshot transaction. In practice, this means versions older than approximately 1 hour are eligible for GC (Spanner's default maximum staleness for bounded staleness reads). The GC process is coordinated across replicas to ensure that no live transaction is pointing at a version that gets deleted.

Correctness Invariants

  • Version ordering: commit timestamps must be strictly increasing per key. Two transactions writing the same key must be serialized so their timestamps are distinct.
  • Safe time before reads: a replica must not serve a snapshot read at timestamp t until its safe time exceeds t. Serving a stale read would violate the snapshot's consistency guarantee.
  • No retroactive writes: a write at timestamp ts must not be applied to a replica that has already served reads at timestamps ≥ ts. This is enforced by the prepare timestamp rules in 2PC and TrueTime commit wait.

How MVCC Connects to TrueTime

TrueTime commit wait ensures that when a transaction with commit timestamp s is released to clients, the true wall time is provably greater than s. Any client that learns about this commit and starts a new snapshot read will call TT.now().latest, which will return a value strictly greater than s. This means the new read's snapshot timestamp is greater than s, so it is guaranteed to see the committed write. MVCC and TrueTime together create a system where reads are always consistent with the causal order of the real world.

main.py
python

Sign in to run and submit code.