ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/coordinator/article

Distributed Transactions: Why Agreeing on a Commit Is Harder Than It Looks

A transaction that spans multiple machines has to either commit everywhere or roll back everywhere. Getting that right, when any participant can fail at any moment, is one of the hardest problems in distributed systems.


On a single database, transactions are handled by locking, a write-ahead log, and a recovery process that replays the log after a crash. The complexity is real but manageable. When your transaction touches data on three different machines, you need all three machines to agree on whether it committed — and that agreement has to survive the failure of any participant, including the machine coordinating the agreement itself.

The Transfer Problem

The canonical example is a money transfer. Deduct 100 dollars from account A (on server 1) and add 100 dollars to account B (on server 2). Both operations must succeed, or neither should.

If you just execute both operations independently, you have a window where the deduction has happened but the credit has not. If server 2 crashes in that window, you have lost 100 dollars. If you credit first and then debit, and server 1 crashes, you have created 100 dollars.
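The failure window is easy to see in code. This is a minimal sketch of the naive two-write transfer; the `Account` class and the simulated crash are illustrative assumptions, not taken from any real banking system:

```python
# Naive transfer: two independent writes with no atomicity guarantee.
# Account and CrashError are hypothetical stand-ins for two servers.

class Account:
    def __init__(self, balance):
        self.balance = balance

class CrashError(Exception):
    pass

def naive_transfer(src, dst, amount, crash_between=False):
    src.balance -= amount           # debit on server 1
    if crash_between:               # server 2 crashes in the window
        raise CrashError("server 2 unreachable")
    dst.balance += amount           # credit on server 2

a, b = Account(500), Account(500)
try:
    naive_transfer(a, b, 100, crash_between=True)
except CrashError:
    pass

# The debit happened but the credit did not: 100 dollars vanished.
print(a.balance + b.balance)  # 900, not 1000
```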

Single-machine databases solve this with ACID transactions. Multi-machine systems need a protocol that gives the same guarantee across the network boundary. The standard protocol is two-phase commit.

Two-Phase Commit

Two-phase commit (2PC) works exactly as its name suggests. There is a coordinator (usually the machine that initiated the transaction) and participants (all the machines whose data the transaction touches).

In the first phase (prepare), the coordinator asks every participant: "Can you commit this transaction?" Each participant writes the transaction changes to a durable redo log, locks the relevant rows, and responds "yes" or "no." Responding "yes" is a promise: I can commit if you tell me to, and I will not change my mind.

In the second phase (commit or abort), if every participant said yes, the coordinator sends commit to everyone. If any participant said no, or if a participant is unreachable, the coordinator sends abort to everyone.

This protocol guarantees atomicity: either all participants commit or none do. The key mechanism is the promise. Once a participant says "yes" in the prepare phase, it cannot unilaterally roll back. It holds its locks and keeps its redo log until it hears from the coordinator.
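The two phases can be sketched in a few lines. This is an illustrative in-process model: the `Participant` class, the state names, and the log structure are assumptions made for the sketch, not any real database's API.

```python
# Minimal two-phase commit sketch with in-process "participants".

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self, txn):
        if self.can_commit:
            self.state = "prepared"   # redo log written, rows locked
            return "yes"              # the promise: I will not change my mind
        return "no"

    def commit(self, txn):
        self.state = "committed"

    def abort(self, txn):
        self.state = "aborted"

def two_phase_commit(coordinator_log, participants, txn):
    # Phase 1: collect votes from every participant.
    votes = [p.prepare(txn) for p in participants]
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    coordinator_log.append((txn, decision))   # durable decision point
    # Phase 2: broadcast the decision.
    for p in participants:
        p.commit(txn) if decision == "commit" else p.abort(txn)
    return decision

log = []
ps = [Participant("server1"), Participant("server2")]
assert two_phase_commit(log, ps, "t1") == "commit"

ps = [Participant("server1"), Participant("server2", can_commit=False)]
assert two_phase_commit(log, ps, "t2") == "abort"
```

Note where the log append sits: the coordinator makes its decision durable before telling anyone, which is what recovery later relies on.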

The Blocking Problem

Two-phase commit has one serious flaw: participants are blocked waiting for the coordinator's second-phase message.

Suppose the coordinator crashes after receiving all "yes" votes but before sending commit. Each participant is now stuck. It has said yes, it is holding locks, and it does not know whether to commit or abort. It cannot make progress independently because it might commit while another participant aborts (violating atomicity). It cannot abort independently because another participant might commit.

This is the blocking problem. 2PC is a blocking protocol in the presence of coordinator failure. The participants are hostage to the coordinator's availability.

In practice, the coordinator is usually replicated with its own consensus protocol (like Raft) to reduce this window. When the coordinator recovers, it checks its own log to see whether it had decided to commit or abort before crashing, and resumes from there. But there is always a window between the coordinator's decision and the commit messages reaching all participants where a crash produces uncertainty.
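Assuming a log of (transaction, decision) records like the one described above (a hypothetical structure for the sketch, not any specific database's log format), recovery reduces to a lookup:

```python
# Coordinator recovery sketch: resolve in-flight transactions from the
# durable decision log written before phase 2.

def recover(log, in_flight):
    """A transaction with a logged decision resumes that decision.
    One with no logged decision is safe to abort, because the decision
    is logged before any commit message is sent: if it is absent, no
    participant can have been told to commit."""
    decided = dict(log)
    return {txn: decided.get(txn, "abort") for txn in in_flight}

log = [("t1", "commit")]           # decided, crashed before broadcasting
print(recover(log, ["t1", "t2"]))  # {'t1': 'commit', 't2': 'abort'}
```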

Google's Percolator, which builds transactional semantics on top of Bigtable, addresses this by storing the transaction state in the data store itself. The coordinator writes a "commit intent" to a designated primary lock before sending commit messages. Any recovering coordinator or conflicting transaction can read that intent and resolve the uncertainty without human intervention.

Three-Phase Commit: The Theoretical Fix

Three-phase commit (3PC) adds an intermediate phase to remove the blocking problem. After collecting "yes" votes, the coordinator sends a "prepare to commit" message before the final commit. Only after all participants acknowledge "prepare to commit" does the coordinator send the actual commit.

This extra round means that after the coordinator crashes, the surviving participants can query each other to determine the outcome. If any participant has reached the "prepare to commit" state, all of them can safely commit, because the coordinator must have received every yes vote before crashing. If no participant has seen "prepare to commit," they can all safely abort.
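That termination rule can be sketched as a pure function over the states of the reachable participants (the state names here are illustrative, following the description above):

```python
# 3PC termination sketch: surviving participants pool their states
# after a coordinator crash and decide together.

def terminate(participant_states):
    """Decide commit or abort from the states of reachable participants."""
    if "committed" in participant_states or "prepare_to_commit" in participant_states:
        # Someone saw "prepare to commit", so the coordinator had
        # collected every yes vote: committing is safe.
        return "commit"
    # No one reached "prepare to commit": aborting is safe.
    return "abort"

print(terminate(["prepared", "prepare_to_commit"]))  # commit
print(terminate(["prepared", "prepared"]))           # abort
```

The catch described next is that this rule assumes the survivors can actually reach each other, which a network partition breaks.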

The problem with 3PC is network partitions. In a partition, participants cannot reach each other to resolve uncertainty. They may make conflicting decisions. 3PC solves coordinator failure but introduces split-brain under partition.

This is the tradeoff at the heart of distributed transactions: any commit protocol that is non-blocking under crash failure becomes unsafe under network partition. The FLP impossibility result lurks here too.

The Coordinator Pattern in Practice

Most production distributed databases do not use pure 2PC or 3PC. They layer 2PC on top of per-shard consensus for durability, and use techniques like MVCC (Multi-Version Concurrency Control) to reduce the time that locks are held.

Google Spanner uses two-phase commit for cross-shard transactions, with Paxos groups providing durability for each shard. The coordinator writes its decision to a Paxos group before sending it to participants. CockroachDB follows the same pattern with Raft in place of Paxos. Spanner achieves external consistency (transactions appear to occur at a single real point in time) using TrueTime clock synchronization; CockroachDB approximates similar ordering guarantees with hybrid logical clocks.

The coordinator pattern also appears in distributed sagas, which are used in microservices architectures. A saga breaks a multi-step transaction into a sequence of local transactions, each with a compensating transaction that undoes its effect. The coordinator executes the steps in order. If any step fails, it executes the compensating transactions in reverse order.

Sagas trade atomicity for availability. A compensating transaction is not the same as a rollback — the intermediate states are visible. But for many business processes, this is acceptable. Booking a hotel, flight, and car rental can use a saga: if the car rental fails, cancel the hotel and flight. The customer was never in a confirmed-booked state for all three, but the system converged correctly.
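A saga runner can be sketched as a loop over (action, compensation) pairs. The booking functions below are placeholders for the hotel/flight/car example, not a real API:

```python
# Saga sketch: run steps in order; on failure, run the compensations
# of already-completed steps in reverse order.

def run_saga(steps):
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):
                comp()   # undo completed steps; intermediate states were visible
            return "compensated"
    return "completed"

booked = []
def book(name):   booked.append(name)
def cancel(name): booked.remove(name)
def fail():       raise RuntimeError("no cars available")

steps = [
    (lambda: book("hotel"),  lambda: cancel("hotel")),
    (lambda: book("flight"), lambda: cancel("flight")),
    (fail,                   lambda: None),   # car rental fails
]
print(run_saga(steps), booked)  # compensated []
```

Unlike a 2PC abort, the hotel and flight bookings really existed for a while and were then cancelled; that visibility is exactly the atomicity sagas give up.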

What This Track Builds

The Coordinator track asks you to implement a two-phase commit protocol in Maelstrom. Your coordinator node receives transaction requests from clients, runs the prepare phase across participant nodes, collects votes, and drives the commit or abort decision.

The tricky parts are timeout handling (what to do when a participant does not respond in the prepare phase) and recovery (what the coordinator does when it restarts after a crash and has unfinished transactions in its log).
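For the prepare-phase timeout, one safe policy is to treat silence as a "no" vote, since aborting is always safe before any commit message has gone out. A sketch of that policy follows; the message-passing interface here is an assumption for illustration, not something the track or Maelstrom prescribes:

```python
# Prepare-phase timeout sketch: a missing vote counts as "no".

def collect_votes(send_prepare, participants, timeout_s):
    """send_prepare returns 'yes', 'no', or None on timeout.
    Aborting on silence is safe: no commit message has been sent yet."""
    votes = {}
    for p in participants:
        vote = send_prepare(p, timeout=timeout_s)
        votes[p] = vote if vote is not None else "no"
    return "commit" if all(v == "yes" for v in votes.values()) else "abort"

replies = {"n1": "yes", "n2": None}   # n2 never responds
decision = collect_votes(lambda p, timeout: replies[p], ["n1", "n2"], 1.0)
print(decision)  # abort
```

The second-phase timeout is the genuinely hard one: once a commit message may have been delivered to anyone, the coordinator can no longer fall back to abort, which is why recovery has to replay the logged decision instead.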

Implementing this teaches you more about distributed systems than almost anything else in the curriculum, because it forces you to confront failure scenarios concretely. You cannot hand-wave "the transaction either commits or aborts" — you have to write the code that makes it so.

The systems that you use every day that promise durability — Postgres, MySQL, Oracle — have their own transaction coordinators built in. What you are implementing is the distributed version of what every database does internally. The concepts are identical; the failure modes are more interesting.


Ready to build it? The Coordinator track builds a complete two-phase commit implementation. You will handle the full lifecycle: prepare, vote collection, commit or abort decision, and coordinator recovery after crash. The same protocol powers Spanner, CockroachDB, and every distributed database that promises cross-shard atomicity.

Build it yourself

Reading about distributed systems is useful. Building them is how you actually learn.

Start The Coordinator track