ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/projects/mini-spanner/tasks/spanner-t3-s1-two-phase-commit

Two-Phase Commit

Mini-Spanner / Distributed Transactions · Difficulty: hard

Concept

[Diagram: Two-phase commit with a coordinator and three participant shards. The coordinator (transaction manager) sends PREPARE T1 to shards S1, S2, and S3. S1 replies VOTE_COMMIT with prepare timestamp 105 and S3 replies VOTE_COMMIT with prepare timestamp 120, but S2 replies VOTE_ABORT due to a lock conflict. Because S2 voted abort, the coordinator sends ABORT T1 to all three shards.]

Two-Phase Commit in Spanner

Two-Phase Commit (2PC) is the protocol that gives Spanner atomic cross-shard writes. When a transaction modifies data in multiple shards, 2PC ensures that either every shard applies the change or none do — eliminating partial commits that would leave the database in an inconsistent state.

The Two Phases

Phase 1 — Prepare: the transaction coordinator sends PREPARE txn_id to every participant shard. Each participant:

  1. Acquires write locks on all rows it must modify.
  2. Writes a redo log entry (so it can recover if it crashes mid-commit).
  3. Assigns a prepare timestamp prepare_ts = TT.now().latest — TrueTime's upper bound on the true current time, as observed by that shard.
  4. Responds VOTE_COMMIT txn_id prepare_ts or VOTE_ABORT txn_id reason.
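The four steps above can be sketched as a participant's prepare handler. This is a minimal single-shard model: `now_latest` stands in for `TT.now().latest`, and `log` is any append-only list-like object modeling the durable redo log — both are illustrative assumptions, not Spanner's real interfaces.

```python
class PrepareParticipant:
    """Sketch of one shard's Phase 1 (prepare) handler in 2PC."""

    def __init__(self, now_latest, log):
        self.now_latest = now_latest   # callable returning an upper bound on true time
        self.log = log                 # appends here model durable redo-log writes
        self.locks = set()             # rows currently write-locked on this shard

    def on_prepare(self, txn_id, write_set):
        write_set = set(write_set)
        # Step 1: acquire write locks; a conflict forces a VOTE_ABORT.
        if self.locks & write_set:
            return ("VOTE_ABORT", txn_id, "lock conflict")
        self.locks |= write_set
        # Step 3: choose prepare_ts = TT.now().latest for this shard.
        prepare_ts = self.now_latest()
        # Step 2: persist the redo entry and vote *before* replying (vote persistence).
        self.log.append(("PREPARED", txn_id, sorted(write_set), prepare_ts))
        # Step 4: reply with the vote and the prepare timestamp.
        return ("VOTE_COMMIT", txn_id, prepare_ts)
```

Note that the log append happens before the reply is returned: once the vote is on stable storage, a crash-and-restart cannot "forget" that this shard promised to commit.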

Phase 2 — Commit or Abort: the coordinator collects all votes. If all participants voted commit, it computes commit_ts = max(all prepare_ts) and sends COMMIT txn_id commit_ts to all participants. If any participant voted abort, it sends ABORT txn_id to all. Participants apply the decision, release locks, and acknowledge.
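The coordinator's decision rule is small enough to state directly in code. A hedged sketch, assuming votes arrive as tuples shaped like the messages above:

```python
def decide(votes):
    """Phase 2 decision rule. `votes` is a list of participant replies shaped
    like ("VOTE_COMMIT", txn_id, prepare_ts) or ("VOTE_ABORT", txn_id, reason).
    Returns the message the coordinator broadcasts to every participant."""
    txn_id = votes[0][1]
    if all(v[0] == "VOTE_COMMIT" for v in votes):
        # All participants voted commit: commit_ts = max(all prepare_ts).
        commit_ts = max(v[2] for v in votes)
        return ("COMMIT", txn_id, commit_ts)
    # Any abort vote aborts the whole transaction.
    return ("ABORT", txn_id)
```

With the votes from the diagram (S1 at 105, S3 at 120, S2 aborting), this returns `("ABORT", "T1")`; had S2 voted commit at any timestamp ≤ 120, it would return `("COMMIT", "T1", 120)`.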

Why commit_ts = max(prepare_ts)?

Each shard's prepare timestamp is chosen to be greater than any timestamp previously committed or prepared on that shard. This ensures that when the commit is applied, it does not violate the shard's local MVCC ordering — every new write version is strictly newer than all prior versions. Taking the maximum across all participants' prepare timestamps produces a commit timestamp that satisfies this requirement on every shard simultaneously. This is also the mechanism by which Spanner ensures that the final commit timestamp is valid globally, not just locally.

Coordinator Failure and Recovery

If the coordinator crashes after Phase 1 but before Phase 2, participants are in a blocking state — they hold locks but cannot proceed without a commit or abort decision. Spanner solves this durably: the coordinator is itself a Paxos group, so its intent (the set of votes collected and the decision reached) is replicated across a majority of coordinator replicas. A new Paxos leader can read the coordinator's replicated state and resume the protocol.
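Recovery can be sketched as a new leader replaying the replicated coordinator state. The shape of `replicated_state` and the `send` callback are hypothetical; the point is that the decision is recomputed (or re-read) from durable state, never invented fresh:

```python
def resume_coordinator(replicated_state, send):
    """Sketch: a new Paxos leader resumes 2PC from replicated coordinator state,
    which holds the votes collected so far and any decision already persisted."""
    txn_id = replicated_state["txn_id"]
    votes = replicated_state["votes"]          # shard_id -> vote tuple
    decision = replicated_state["decision"]    # None if no decision was persisted
    if decision is None:
        all_in = len(votes) == replicated_state["num_participants"]
        if all_in and all(v[0] == "VOTE_COMMIT" for v in votes.values()):
            decision = ("COMMIT", txn_id, max(v[2] for v in votes.values()))
        else:
            # Missing or abort votes: no commit decision was ever persisted,
            # so the new leader may safely abort.
            decision = ("ABORT", txn_id)
    # Re-sending a decision is safe: participants apply it idempotently.
    for shard in replicated_state["participants"]:
        send(shard, decision)
    return decision
```

The key property is that re-sending is idempotent, so a recovered leader can always broadcast the decision again without risking divergence.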

If the coordinator crashes before any participant reaches Phase 1, the transaction is simply lost — participants never acquired locks, so no blocking occurs. The client times out and retries.

The All-or-Nothing Guarantee

2PC achieves atomicity through the combination of vote collection and participant durability. Once a participant votes COMMIT, it has persisted its redo log and is committed to applying the transaction if the coordinator sends commit. It cannot unilaterally abort. The coordinator, having received all commit votes, is committed to sending commit — not abort. This mutual commitment creates the atomic all-or-nothing outcome.

Correctness Invariants

  • Vote persistence: participants must write their vote to stable storage before sending it. A crash-and-restart that forgets a commit vote could allow the participant to wrongly respond "I don't know" and potentially abort an already-decided transaction.
  • Single decision: once the coordinator decides (commit or abort), it cannot change its mind. The decision must be persisted before being sent to participants.
  • Prepare timestamp ordering: prepare_ts must be strictly greater than any timestamp the shard has previously used. Spanner enforces this by having each Paxos leader track the maximum timestamp it has ever issued.
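The third invariant can be demonstrated with a small timestamp oracle. A sketch, where `now_latest` again stands in for `TT.now().latest`:

```python
class TimestampOracle:
    """Sketch of prepare-timestamp ordering: a leader tracks the maximum
    timestamp it has ever issued and never issues one at or below it."""

    def __init__(self, now_latest):
        self.now_latest = now_latest
        self.max_issued = 0   # highest timestamp ever issued by this leader

    def next_prepare_ts(self):
        # Strictly greater than anything previously issued, and no earlier
        # than TrueTime's current upper bound.
        ts = max(self.now_latest(), self.max_issued + 1)
        self.max_issued = ts
        return ts
```

Even if the underlying clock reading regresses between calls, issued timestamps remain strictly increasing, which preserves the shard's MVCC version ordering.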

How Spanner Integrates 2PC with Paxos

In Spanner, each participant shard is a Paxos group. When a participant shard receives a PREPARE, its Paxos leader runs a Paxos round to replicate the prepare decision across the shard's replicas. Similarly, the commit is replicated via Paxos before being acknowledged to the coordinator. This means that 2PC in Spanner is not just a two-machine protocol — it is a two-phase protocol where each "step" is itself a distributed consensus round, providing fault tolerance at every level of the hierarchy.
