ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/projects/mini-spanner/tasks/spanner-t2-s1-commit-wait
DS

Commit Wait and Safe Time

Mini-Spanner / TrueTime and External Consistency
hard

Concept

Commit Wait Timeline time t=100 t=105 t=111 COMMIT WAIT hold until TT.after(105) = (t - eps) > 105 COMMIT: assign s = TT.now().latest = 105 RELEASED — safe_time advances to 105 replicas can serve READ_AT 105 epsilon = 5ms s = current + eps = 100 + 5 = 105 wait_ms = s + eps + 1 - t = 105 + 5 + 1 - 100 = 11ms TT.after(105) = TRUE 111 - 5 = 106 > 105

Commit Wait: Turning Uncertainty Into a Guarantee

TrueTime uncertainty is not a flaw to be minimized — it is the raw material from which Spanner constructs external consistency. The commit wait protocol converts a bounded clock uncertainty window into an ironclad ordering guarantee: any transaction that starts after another commits will be assigned a strictly higher timestamp.

The External Consistency Invariant

Spanner's external consistency guarantee is formally stated in the Spanner paper as: if the commit of T1 ends before the start of T2 in real time, then commit_ts(T1) < commit_ts(T2). This is strictly stronger than serializability. A serializable system only promises that transactions can be ordered consistently — not that the order matches real-world causality across clients on different continents.

How Commit Wait Works: Step by Step

  1. The transaction coordinator assigns commit timestamp s = TT.now().latest. This is the pessimistic upper bound: the true current time is at most s.
  2. The coordinator begins the commit wait: it holds the transaction in a "pending" state, not yet visible to readers.
  3. The coordinator polls until TT.after(s) is true. This happens when TT.now().earliest > s, i.e., current_time - epsilon > s, i.e., current_time > s + epsilon.
  4. The minimum wait duration is 2 * epsilon + 1 time units (covering the full uncertainty window on both sides).
  5. Once TT.after(s) is true, the transaction is released. Safe time on this node advances to s.

After the release, any client that learns about this commit and immediately starts a new transaction will call TT.now(). Since real time has passed at least the uncertainty window since s was assigned, the new transaction's TT.now().latest will be strictly greater than s. This is the external consistency proof by construction.

Safe Time and Linearizable Reads

A replica can serve a snapshot read at timestamp t only once its safe time exceeds t. Safe time is the maximum commit timestamp among all fully committed transactions that have been applied to this replica's state. In the Spanner paper, safe time is formally the minimum of:

  • Paxos safe time: one less than the smallest timestamp of any Paxos write that is prepared but not yet committed on this replica.
  • Transaction manager safe time: one less than the smallest prepare timestamp of any in-flight (prepared but not committed) transaction.

Advancing safe time without gaps requires that there are no "holes" — uncommitted transactions with timestamps below a committed one. Spanner enforces this via the prepare timestamp rule in two-phase commit: each participant's prepare timestamp is at least TT.now().latest at prepare time, ensuring that any future commit will be assigned a timestamp above the current safe time.

Correctness Invariants

  • Commit timestamp assignment: s must equal TT.now().latest at the time of assignment — not any earlier or later value. Using an older timestamp would break the after-predicate wait calculation.
  • Release only after TT.after(s): releasing a transaction before TT.after(s) is true allows another transaction on the same or a remote node to be assigned a timestamp s' < s, violating external consistency.
  • Safe time monotonicity: safe time must only advance, never retreat. A safe time that decreases would allow a replica to serve stale reads for timestamps it already confirmed as readable.

How Spanner Uses This in Production

With epsilon typically under 7ms, commit wait adds at most 14ms of latency per write. Google considers this acceptable for globally consistent writes — especially since the alternative (a global lock server or Paxos-based timestamp oracle) would add comparable or worse latency with additional availability risk. Read-only transactions (snapshot reads) pay zero commit wait cost: they choose a snapshot timestamp and immediately execute on any replica with sufficient safe time, making them extremely fast for the common read-heavy workload.

main.py
python

Sign in to run and submit code.