ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/consensus

The Consensus

Advanced
Consensus | 20 tasks

Complete your Raft implementation with log replication. You will implement log matching, track commitment, and apply entries to state machines, producing a fully functional consensus layer.

Subtracks & Tasks

Interview Prep

Common interview questions for Distributed Systems Engineer roles that map directly to what you build in this track. Each question below is paired with a model answer.

How can real systems reach consensus if the FLP result proves it impossible?

Model Answer

FLP (Fischer, Lynch, Paterson 1985) proves that no deterministic algorithm can solve consensus in an asynchronous system if even one process can fail. "Asynchronous" means no upper bound on message delay. In practice, all real systems have timing assumptions (TCP timeouts, election timeouts), making them partially synchronous. Raft and Paxos work in the partially synchronous model — they may block during asynchronous periods but are always safe and make progress when synchrony resumes. FLP is a theoretical bound, not a practical barrier.
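In practice, the timing assumption usually surfaces as a randomised election timeout. A minimal Go sketch (the helper and constant are illustrative, not from any particular implementation):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout picks a random duration in [base, 2*base). The timeout
// itself is the partial-synchrony assumption: we bet that, eventually,
// heartbeats arrive faster than this. Randomisation makes split votes rare.
func electionTimeout(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base)))
}

func main() {
	// Typical Raft values: 150-300ms, well above the expected network RTT.
	fmt.Println(electionTimeout(150 * time.Millisecond))
}
```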

Why is Paxos considered hard to implement, and what does Raft do differently?

Model Answer

Paxos defines single-value consensus (single-decree Paxos). Multi-Paxos for a replicated log requires engineering decisions that are not specified in the original paper: how to pipeline proposals, how to handle leader changes, how to handle gaps in the log. Ongaro & Ousterhout found that real Paxos-style implementations (Chubby, ZooKeeper's ZAB) diverged significantly from the paper and from each other. Raft explicitly specifies leader election, log replication, and membership changes as a single coherent algorithm, making the design space tractable.

How would you build a distributed lock service, and what failure modes would you handle?

Model Answer

Use a Raft-replicated state machine where the lock state is a key-value entry. Acquire: propose "set lock = clientId" if lock is unset. Release: propose "set lock = nil" if lock == clientId. Add a TTL/lease to handle client crashes. Failure modes: (1) client crash while holding lock — use a lease that expires, (2) network partition — the leader might think it has the lock but is partitioned from the cluster. Use lease-based reads: the leader only serves reads while its lease is valid, which requires clock synchrony assumptions. HashiCorp Consul's session system implements exactly this.
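A minimal Go sketch of the lock as a replicated state machine command, with hypothetical names (`Cmd`, `LockState`) and the Raft layer itself elided:

```go
package main

import (
	"fmt"
	"time"
)

// Cmd is a proposed log entry; it reaches Apply only after Raft commits it.
type Cmd struct {
	Op       string // "acquire" or "release"
	ClientID string
	Lease    time.Duration
}

// LockState is the replicated state machine's view of the lock.
type LockState struct {
	Holder  string
	Expires time.Time
}

// Apply runs on every replica in log order, so the conditional checks yield
// the same result everywhere. `now` must be agreed via the log (e.g. stamped
// into the command by the leader), not read from each replica's local clock,
// or replicas would diverge.
func (s *LockState) Apply(c Cmd, now time.Time) bool {
	switch c.Op {
	case "acquire":
		if s.Holder == "" || now.After(s.Expires) { // free, or lease expired
			s.Holder, s.Expires = c.ClientID, now.Add(c.Lease)
			return true
		}
	case "release":
		if s.Holder == c.ClientID {
			s.Holder = ""
			return true
		}
	}
	return false // rejected; the client must retry or wait
}

func main() {
	s := &LockState{}
	now := time.Now()
	fmt.Println(s.Apply(Cmd{"acquire", "client-1", 10 * time.Second}, now)) // true
	fmt.Println(s.Apply(Cmd{"acquire", "client-2", 10 * time.Second}, now)) // false: held
	fmt.Println(s.Apply(Cmd{Op: "release", ClientID: "client-1"}, now))     // true
}
```

Note the determinism caveat in the sketch: the timestamp used for lease expiry must travel through the log rather than come from each replica's own clock.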

What happens when a Raft leader is partitioned away but still believes it is leader?

Model Answer

A stale leader (on the minority side of a partition) still believes it is leader because it has not heard about the new higher-term election. Both the stale leader and new leader exist simultaneously. This is safe: the stale leader cannot commit any writes because it cannot reach a majority. Any write it tries will be rejected when the partition heals. The key safety property is that a stale leader's uncommitted writes are overwritten — only committed writes (replicated to majority) survive.
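The fencing mechanism is the term check carried on every RPC. A sketch in Go, with hypothetical type and field names:

```go
package raft

// Hypothetical types; only the term check matters for this point.
type AppendEntriesArgs struct {
	Term     int
	LeaderID int
	// prevLogIndex, prevLogTerm, entries elided
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

type Follower struct {
	currentTerm int
}

// HandleAppendEntries fences stale leaders: a leader stranded on the
// minority side keeps sending heartbeats, but its term is lower than the
// one established by the new election, so followers reject it. The higher
// term in the reply tells the stale leader to step down.
func (f *Follower) HandleAppendEntries(args AppendEntriesArgs) AppendEntriesReply {
	if args.Term < f.currentTerm {
		return AppendEntriesReply{Term: f.currentTerm, Success: false}
	}
	f.currentTerm = args.Term
	// ... log consistency check and append elided ...
	return AppendEntriesReply{Term: f.currentTerm, Success: true}
}
```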

What is the difference between linearizability and eventual consistency?

Model Answer

Linearizability: every operation appears to take effect instantaneously at a single point between its invocation and completion. A read always returns the result of the most recent write that completed before the read started. This matches single-machine memory semantics. Eventual consistency: if no new writes are made, all replicas will eventually converge to the same value. There is no guarantee about what a read returns during the convergence period. Tradeoff: linearizability requires coordination (quorum reads or leader reads), adding latency. Eventual consistency can serve reads locally, with lower latency but potentially stale data.

Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.

Common Mistakes

The top 5 mistakes builders make in this track, with the root cause and the fix for each.

Mistake 1: Applying entries before they are committed

Why it happens

An entry is only safe to apply once it has been replicated to a majority. Applying earlier risks executing entries that the cluster will never agree on and that may later be overwritten.

The fix

Track `commitIndex` separately from `lastLogIndex`. Only advance the state machine to `commitIndex`, which the leader advances only after receiving acknowledgements from a majority.
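A rough Go sketch of the leader side, with hypothetical field names (`matchIndex`, `applyCh`):

```go
package raft

type Entry struct {
	Term    int
	Command any
}

type Leader struct {
	currentTerm int
	log         []Entry // log[i-1] holds the entry at (1-based) index i
	commitIndex int     // highest index known committed
	lastApplied int     // highest index fed to the state machine
	matchIndex  []int   // per follower: highest index known replicated there
	applyCh     chan Entry
}

// maybeAdvanceCommit implements the Raft commitment rule: find the largest
// N > commitIndex that a majority has replicated and whose entry is from
// the current term, then apply everything up to it, and nothing beyond.
func (l *Leader) maybeAdvanceCommit() {
	for n := len(l.log); n > l.commitIndex; n-- {
		if l.log[n-1].Term != l.currentTerm {
			continue // never commit-count entries from older terms directly
		}
		count := 1 // the leader's own copy
		for _, m := range l.matchIndex {
			if m >= n {
				count++
			}
		}
		if count > (len(l.matchIndex)+1)/2 { // strict majority of the cluster
			l.commitIndex = n
			break
		}
	}
	// Apply strictly up to commitIndex, never up to lastLogIndex.
	for l.lastApplied < l.commitIndex {
		l.lastApplied++
		l.applyCh <- l.log[l.lastApplied-1]
	}
}
```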

Mistake 2: Sending the wrong prevLogIndex or prevLogTerm

Why it happens

The AppendEntries consistency check requires the follower to already hold the entry immediately preceding the new batch. Sending the wrong index or term causes spurious rejections.

The fix

Maintain a `nextIndex[]` per follower. On rejection, decrement `nextIndex[follower]` and retry with the corrected `prevLogIndex` and `prevLogTerm`.
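A simplified Go sketch of that retry loop, with hypothetical names and the RPC transport elided:

```go
package raft

type Entry struct {
	Term    int
	Command any
}

type AppendEntriesArgs struct {
	Term         int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []Entry
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

type Leader struct {
	currentTerm int
	log         []Entry
	nextIndex   []int // per follower: next log index to send
	matchIndex  []int // per follower: highest index known replicated
	// call performs the AppendEntries RPC; the transport is elided here.
	call func(peer int, args AppendEntriesArgs) AppendEntriesReply
}

// sendEntries retries until the follower's log matches. On each rejection
// the leader backs nextIndex up by one; production code uses the follower's
// conflict term/index hints to skip a whole term per round trip.
func (l *Leader) sendEntries(peer int) {
	for {
		prevIndex := l.nextIndex[peer] - 1
		prevTerm := 0
		if prevIndex > 0 {
			prevTerm = l.log[prevIndex-1].Term
		}
		args := AppendEntriesArgs{
			Term:         l.currentTerm,
			PrevLogIndex: prevIndex,
			PrevLogTerm:  prevTerm,
			Entries:      l.log[prevIndex:], // everything past the match point
		}
		reply := l.call(peer, args)
		if reply.Success {
			l.matchIndex[peer] = prevIndex + len(args.Entries)
			l.nextIndex[peer] = l.matchIndex[peer] + 1
			return
		}
		if l.nextIndex[peer] > 1 {
			l.nextIndex[peer]-- // consistency check failed: back up and retry
		}
	}
}
```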

Mistake 3: Letting the log grow without bound

Why it happens

Without snapshots and log truncation, every write is kept forever, so memory use and restart-replay time grow without bound.

The fix

Implement snapshotting: periodically write the full state machine state and the index/term of the last included entry, then truncate the log prefix up to that point.
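A minimal Go sketch of the compaction step, with hypothetical names (`Snapshot`, `takeSnapshot`); persistence and the InstallSnapshot RPC for lagging followers are elided:

```go
package raft

type Entry struct {
	Term    int
	Command any
}

// Snapshot captures the full state machine plus the identity of the last
// log entry it reflects, so the consistency check still works after the
// prefix is discarded.
type Snapshot struct {
	LastIncludedIndex int
	LastIncludedTerm  int
	State             []byte // serialised state machine
}

type Node struct {
	log               []Entry // entries after lastIncludedIndex only
	lastIncludedIndex int
	lastIncludedTerm  int
	lastApplied       int
}

// takeSnapshot compacts everything up to lastApplied. Log indices must be
// adjusted for the prefix already discarded by earlier snapshots.
func (n *Node) takeSnapshot(state []byte) Snapshot {
	cut := n.lastApplied - n.lastIncludedIndex // entries covered by the snapshot
	if cut <= 0 {
		return Snapshot{n.lastIncludedIndex, n.lastIncludedTerm, state}
	}
	snap := Snapshot{
		LastIncludedIndex: n.lastApplied,
		LastIncludedTerm:  n.log[cut-1].Term,
		State:             state,
	}
	n.log = append([]Entry(nil), n.log[cut:]...) // truncate prefix, free memory
	n.lastIncludedIndex = snap.LastIncludedIndex
	n.lastIncludedTerm = snap.LastIncludedTerm
	return snap
}
```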

Mistake 4: Holding the mutex across RPC calls

Why it happens

RPCs can take hundreds of milliseconds or time out entirely. Holding a lock for the duration of an RPC blocks all other message processing.

The fix

Release the mutex before sending RPCs. Collect the work to do while holding the lock, then release and do the network I/O.
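The pattern in Go, with hypothetical fields; the point is where the lock boundary sits, not the RPC details:

```go
package raft

import "sync"

type Node struct {
	mu          sync.Mutex
	currentTerm int
	peers       []string
}

// broadcastHeartbeat snapshots everything the RPC needs while holding the
// lock, then releases it before touching the network, so a slow peer cannot
// stall every other handler on this node.
func (n *Node) broadcastHeartbeat() {
	n.mu.Lock()
	term := n.currentTerm
	peers := append([]string(nil), n.peers...) // copy so we can unlock safely
	n.mu.Unlock()

	for _, p := range peers {
		go sendHeartbeat(p, term) // network I/O happens without the lock
	}
}

// sendHeartbeat stands in for the real RPC; it may take hundreds of
// milliseconds or time out, which is exactly why the lock is released first.
func sendHeartbeat(peer string, term int) {}
```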

Mistake 5: Waiting synchronously on slow or partitioned followers

Why it happens

A partitioned or overloaded follower may stop responding entirely. Blocking on it prevents the leader from advancing `commitIndex` for the healthy majority.

The fix

Use async / parallel AppendEntries. The leader should send to all followers concurrently and advance `commitIndex` as soon as it counts a majority of acknowledgements, without waiting for slow or unreachable followers.
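A minimal Go sketch of the fan-out, with the per-follower RPC abstracted behind a hypothetical `send` callback:

```go
package raft

// replicateToMajority fans an AppendEntries round out to every follower
// concurrently and returns as soon as a majority of the cluster (leader
// included) has acknowledged; stragglers and partitioned nodes are ignored.
// send stands in for one synchronous RPC to one follower.
func replicateToMajority(peers []string, send func(peer string) bool) bool {
	acks := make(chan bool, len(peers)) // buffered: late replies never block
	for _, p := range peers {
		p := p
		go func() { acks <- send(p) }() // one goroutine per follower
	}
	needed := (len(peers)+1)/2 + 1 // majority of cluster size len(peers)+1
	got := 1                       // the leader's own log counts
	if got >= needed {
		return true // single-node cluster commits immediately
	}
	for range peers {
		if <-acks {
			got++
		}
		if got >= needed {
			return true // commit point: stop waiting for the slow minority
		}
	}
	return false // no majority; the entry stays uncommitted for now
}
```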

Comparison Mode

Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track.

| Dimension | Raft | Multi-Paxos | Viewstamped Replication |
|---|---|---|---|
| First published | 2014 | 1989 (Paxos), extended later | 1988 |
| Log replication | Leader-driven, term-based | Leader (distinguished proposer) + prepare once per leader epoch | Primary-driven, view-stamped ops |
| Leader election | Randomised timeouts, majority vote | Phase 1 Paxos (any node can propose) | View change protocol, majority vote |
| Reconfiguration | Joint consensus or single-server changes | Requires new Paxos round | View change protocol |
| Specification clarity | Formal TLA+ spec, extended Raft paper | Informal; many incompatible variants | Revisited 2012 paper is clear |
| Production use | etcd, CockroachDB, TiKV, Consul | Google Chubby, Spanner | Academic; influenced PBFT |

Verdict: Raft is the go-to for new systems. Multi-Paxos is entrenched in Google's stack. VR is worth studying for its clarity on view changes and recovery.
| Dimension | Linearisable | Sequential | Eventual |
|---|---|---|---|
| Ordering guarantee | Real-time order respected: if op A finishes before op B starts, A is visible before B | All nodes see the same order, but not necessarily real-time | No ordering guarantee; any order is valid temporarily |
| Read-your-writes | Yes | Yes (within a session) | Not guaranteed |
| Latency | High (requires quorum / leader round-trip) | Medium | Low (local reads/writes) |
| Availability under partition | No (CAP: CP) | Partial | Yes (CAP: AP) |
| Use cases | Locks, leader election, financial txns | Collaborative editing, gaming | DNS, social media likes, caches |
| Examples | etcd, ZooKeeper, Spanner | Some multi-leader DB modes | Cassandra, DynamoDB (default), S3 |

Verdict: Linearisability is the gold standard for correctness but costs availability. Use eventual consistency only when you can handle stale reads and resolve conflicts.

Concepts Covered

log replication, AppendEntries, log matching, consistency check, conflict resolution, commitment, majority, quorum, state machine, apply, determinism, election restriction, safety, up-to-date, majority replication, commitIndex, apply channel, committed entries, deterministic replay, no-op entry, leader change, uncommitted entries, snapshot, log compaction, InstallSnapshot RPC, state transfer, linearizability, network partition, Maelstrom, end-to-end correctness, Paxos, single-decree, Prepare, Promise, proposal number, Accept, Accepted, value selection, consensus, safety proof, invariant, chosen value, consensus immutability, Multi-Paxos, infinite log, Phase 1 skip, stable leader, Raft, message complexity, leader change cost, understandability, Byzantine fault, crash fault, malicious node, bit flip, CFT vs BFT, PBFT, pre-prepare, prepare, commit, three-phase protocol, equivocation, contradictory messages, evidence collection, Byzantine detection, fault threshold, 3f+1, impossibility result, Byzantine quorum, Tendermint, Cosmos, blockchain, voting rounds, lock mechanism

Prerequisites

It is recommended to complete the previous tracks before starting this one. Concepts build progressively throughout the curriculum.