Subtracks & Tasks
Raft Log Replication
Implement Log Replication
Implement Raft log replication from leader to followers: 1. Leader receives client commands, appends to local log 2. Leader sends AppendEntries to al...
Ensure Log Matching Property
Implement the Log Matching safety property: On the follower side: 1. Receive AppendEntries with prevLogIndex, prevLogTerm 2. If entry at prevLogIndex...
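The task text above is truncated, but the follower-side consistency check it describes is standard Raft. A minimal sketch, with illustrative type names (`LogEntry`, `AppendEntriesArgs`) that are assumptions rather than part of the task:

```go
package raft

// Assumes a 1-indexed log where log[0] is a dummy entry.
type LogEntry struct {
	Term    int
	Command interface{}
}

type AppendEntriesArgs struct {
	Term         int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []LogEntry
	LeaderCommit int
}

// consistencyCheck is the Log Matching check: the follower must hold an entry
// at PrevLogIndex whose term equals PrevLogTerm, otherwise it rejects.
func consistencyCheck(log []LogEntry, args AppendEntriesArgs) bool {
	if args.PrevLogIndex >= len(log) {
		return false // follower's log is too short
	}
	return log[args.PrevLogIndex].Term == args.PrevLogTerm
}
```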
Implement Entry Commitment
Implement log entry commitment: 1. Leader tracks matchIndex for each follower 2. For each index N, count how many nodes have matchIndex >= N 3. If ma...
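A sketch of the counting step, reusing the `LogEntry` type from the Log Matching sketch; the current-term restriction is included because Raft only counts replicas for entries from the leader's own term:

```go
package raft

// advanceCommitIndex returns the new commitIndex on the leader. matchIndex[i]
// is the highest index known to be replicated on peer i; the leader counts itself.
func advanceCommitIndex(log []LogEntry, matchIndex []int, commitIndex, currentTerm, me int) int {
	for n := len(log) - 1; n > commitIndex; n-- {
		if log[n].Term != currentTerm {
			continue // entries from older terms commit only indirectly
		}
		count := 1 // the leader's own copy
		for i, m := range matchIndex {
			if i != me && m >= n {
				count++
			}
		}
		if count*2 > len(matchIndex) { // strict majority of the cluster
			return n
		}
	}
	return commitIndex
}
```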
Apply Committed Entries to State Machine
Apply committed log entries to the state machine: 1. Track lastApplied - highest entry applied to state machine 2. When commitIndex > lastApplied, ap...
Implement Election Restriction for Safety
Implement the Election Restriction to ensure safety: A candidate must have an "up-to-date" log to win election: 1. Voter compares own log to candidat...
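The comparison the voter performs can be written in a few lines. A sketch, with argument names chosen here for illustration:

```go
package raft

// candidateUpToDate implements the "up-to-date" rule: the candidate's last
// entry must have a higher term, or the same term and an index at least as
// large as the voter's own last entry.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}
```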
Commitment and Application
Implement the Raft Commitment Rule
Implement the Raft commitment rule: an entry is committed when a majority of nodes have it in their log. The leader uses `matchIndex[]` to determine w...
Implement the Apply Channel for State Machine
Implement an apply channel that feeds committed log entries to the state machine. The state machine is a simple key-value store that processes `Put`, ...
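The task text is truncated after `Put`; the sketch below assumes a Put/Append/Get command set and a hypothetical `ApplyMsg` type carrying committed commands, which may differ from the actual task:

```go
package kvraft

type Op struct {
	Kind  string // "Put", "Append", or "Get" (assumed command set)
	Key   string
	Value string
}

type ApplyMsg struct {
	CommandIndex int
	Command      Op
}

// runStateMachine consumes committed entries from the apply channel in log order.
func runStateMachine(applyCh <-chan ApplyMsg) {
	store := make(map[string]string)
	for msg := range applyCh {
		switch msg.Command.Kind {
		case "Put":
			store[msg.Command.Key] = msg.Command.Value
		case "Append":
			store[msg.Command.Key] += msg.Command.Value
		}
		// Gets are answered only after their entry applies, so a read reflects
		// every write committed before it in the log.
	}
}
```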
Handle Leader Changes with No-Op on Election
When a new leader is elected, it must not apply uncommitted entries from previous terms. The "no-op on election" trick solves this: the new leader imm...
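The mechanism itself is tiny. A sketch, reusing the `LogEntry` type from the Log Matching sketch:

```go
package raft

// appendNoOp: on winning an election, the new leader immediately appends a
// no-op entry in its own term. Committing the no-op also commits every earlier
// entry (by Log Matching), so old-term entries never need to be counted directly.
func appendNoOp(log []LogEntry, currentTerm int) []LogEntry {
	return append(log, LogEntry{Term: currentTerm, Command: nil})
}
```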
Add Snapshot Support for Log Compaction
Add snapshot support to compact the Raft log. When the log grows beyond a threshold, take a snapshot of the state machine. Followers that fall behind ...
Pass Linearizable KV with Network Partitions
The ultimate test: pass a linearizable key-value workload with 5 nodes under network partitions. This combines everything: leader election, log replic...
Paxos
Implement Single-Decree Paxos Phase 1 (Prepare/Promise)
Implement Phase 1 of single-decree Paxos (agree on one value). Phase 1 (Prepare/Promise): 1. Proposer selects a unique proposal number `n` 2. Propose...
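A sketch of the acceptor side of Phase 1, with illustrative type names (`Acceptor`, `Promise`) that are not taken from the task text:

```go
package paxos

type Acceptor struct {
	promisedN int         // highest proposal number promised so far
	acceptedN int         // proposal number of the accepted value, 0 if none
	acceptedV interface{} // accepted value, nil if none
}

type Promise struct {
	OK        bool
	AcceptedN int
	AcceptedV interface{}
}

// HandlePrepare promises to ignore proposals numbered below n, and reports any
// previously accepted value so the proposer is forced to adopt it.
func (a *Acceptor) HandlePrepare(n int) Promise {
	if n <= a.promisedN {
		return Promise{OK: false} // already promised an equal or higher proposal
	}
	a.promisedN = n
	return Promise{OK: true, AcceptedN: a.acceptedN, AcceptedV: a.acceptedV}
}
```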
Implement Paxos Phase 2 (Accept/Accepted)
Implement Phase 2 of Paxos. After getting a majority of promises in Phase 1, the proposer sends `Accept(n, v)` to acceptors. Value selection rule: if...
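Continuing the Phase 1 sketch above, a sketch of the value-selection rule and the acceptor side of Accept:

```go
package paxos

// chooseValue: if any promise carries a previously accepted value, propose the
// one with the highest accepted proposal number; otherwise use our own value.
func chooseValue(promises []Promise, myValue interface{}) interface{} {
	bestN, value := 0, myValue
	for _, p := range promises {
		if p.OK && p.AcceptedN > bestN {
			bestN, value = p.AcceptedN, p.AcceptedV
		}
	}
	return value
}

// HandleAccept accepts (n, v) unless a strictly higher proposal has been promised.
func (a *Acceptor) HandleAccept(n int, v interface{}) bool {
	if n < a.promisedN {
		return false
	}
	a.promisedN, a.acceptedN, a.acceptedV = n, n, v
	return true
}
```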
Prove Paxos Safety: Chosen Values Are Immutable
Prove that once a value is chosen in Paxos, all future proposals will also choose the same value. This is the core safety property. Implement a simul...
Implement Multi-Paxos for an Infinite Log
Extend single-decree Paxos to Multi-Paxos: an infinite log where each slot is a separate Paxos instance. Key optimization: once leadership is establi...
Compare Raft vs Multi-Paxos
Compare Raft and Multi-Paxos across multiple dimensions. Both solve the same problem (replicated log) but make different design tradeoffs. ```json Re...
Byzantine Fault Tolerance
Understand Byzantine Faults with Real-World Examples
What is a Byzantine fault? Unlike crash faults (node simply stops), Byzantine faults allow arbitrary misbehavior. A faulty node can lie, send contradi...
Implement Simplified PBFT with 4 Nodes
Implement a simplified PBFT (Practical Byzantine Fault Tolerance) with 4 nodes (f=1 Byzantine fault). PBFT Three-Phase Protocol: 1. **Pre-prepare**: ...
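A sketch of just the quorum thresholds for the N=4, f=1 case; message dissemination, digests, and view changes are omitted, and the names below are illustrative:

```go
package pbft

const f = 1 // number of tolerated Byzantine replicas in a 4-node cluster

type key struct {
	view, seq int
	digest    string
}

// prepared: the replica has the pre-prepare plus 2f matching Prepare messages.
func prepared(prepares map[key]int, k key, havePrePrepare bool) bool {
	return havePrePrepare && prepares[k] >= 2*f
}

// committedLocal: the replica is prepared and has 2f+1 matching Commit messages.
func committedLocal(commits map[key]int, k key, isPrepared bool) bool {
	return isPrepared && commits[k] >= 2*f+1
}
```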
Detect and Handle Equivocation Attacks
Simulate a Byzantine node that sends contradictory messages to different peers (equivocation). Show that PBFT correctly handles this. ```json Request...
Prove the N >= 3f+1 Byzantine Fault Threshold
Prove that tolerating f Byzantine faults requires at least N >= 3f+1 nodes. Then verify empirically. The proof: - We need two quorums to overlap in a...
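The empirical part can be a few lines of arithmetic. A sketch of one way to check it:

```go
package main

import "fmt"

// With N nodes, a quorum can be at most N-f (f nodes may never respond).
// Two quorums of size N-f overlap in at least 2(N-f)-N = N-2f nodes; for that
// overlap to contain at least one correct node we need N-2f > f, i.e. N >= 3f+1.
func quorumsSafe(n, f int) bool {
	quorum := n - f
	overlap := 2*quorum - n
	return overlap > f
}

func main() {
	for f := 1; f <= 3; f++ {
		fmt.Printf("f=%d  N=%d -> %v   N=%d -> %v\n",
			f, 3*f, quorumsSafe(3*f, f), 3*f+1, quorumsSafe(3*f+1, f))
	}
}
```

Running it shows N = 3f fails the overlap test while N = 3f+1 passes, for each f.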
Implement Tendermint-Style BFT Voting Rounds
Research and implement Tendermint-style BFT voting (used in Cosmos blockchain). Tendermint simplifies PBFT with clear round structure and a rotating p...
Interview Prep
Common interview questions for Distributed Systems Engineer roles that map directly to what you build in this track.
Model Answer
FLP (Fischer, Lynch, Paterson 1985) proves that no deterministic algorithm can solve consensus in an asynchronous system if even one process can fail. "Asynchronous" means no upper bound on message delay. In practice, all real systems have timing assumptions (TCP timeouts, election timeouts), making them partially synchronous. Raft and Paxos work in the partially synchronous model — they may block during asynchronous periods but are always safe and make progress when synchrony resumes. FLP is a theoretical bound, not a practical barrier.
Model Answer
Paxos defines single-value consensus (single-decree Paxos). Multi-Paxos for a replicated log requires engineering decisions that are not specified in the original paper: how to pipeline proposals, how to handle leader changes, how to handle gaps in the log. Ongaro & Ousterhout found that real Paxos-derived systems (Google's Chubby, ZooKeeper's Paxos-like ZAB) diverged significantly from the paper and from each other. Raft explicitly specifies leader election, log replication, and membership changes as a single coherent algorithm, making the design space tractable.
Model Answer
Use a Raft-replicated state machine where the lock state is a key-value entry. Acquire: propose "set lock = clientId" if lock is unset. Release: propose "set lock = nil" if lock == clientId. Add a TTL/lease to handle client crashes. Failure modes: (1) client crash while holding lock — use a lease that expires, (2) network partition — the leader might think it has the lock but is partitioned from the cluster. Use lease-based reads: the leader only serves reads while its lease is valid, which requires clock synchrony assumptions. HashiCorp Consul's session system implements exactly this.
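A minimal sketch of the lock as a replicated state machine command, assuming every command is first committed through Raft and then applied in log order; the struct and field names here are illustrative:

```go
package lockservice

type LockState struct {
	Holder  string // "" when the lock is free
	Expires int64  // lease expiry in unix millis
}

type Cmd struct {
	Kind     string // "Acquire" or "Release"
	ClientID string
	Now      int64 // proposer's timestamp, applied deterministically on all replicas
	LeaseMS  int64
}

// Apply runs on every replica in the same order, so all replicas agree on the holder.
func (s *LockState) Apply(c Cmd) bool {
	free := s.Holder == "" || c.Now > s.Expires // an expired lease counts as free
	switch c.Kind {
	case "Acquire":
		if free {
			s.Holder, s.Expires = c.ClientID, c.Now+c.LeaseMS
			return true
		}
	case "Release":
		if s.Holder == c.ClientID {
			s.Holder = ""
			return true
		}
	}
	return false
}
```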
Model Answer
A stale leader (on the minority side of a partition) still believes it is leader because it has not heard about the new higher-term election. Both the stale leader and new leader exist simultaneously. This is safe: the stale leader cannot commit any writes because it cannot reach a majority. Any write it tries will be rejected when the partition heals. The key safety property is that a stale leader's uncommitted writes are overwritten — only committed writes (replicated to majority) survive.
Model Answer
Linearizability: every operation appears to take effect instantaneously at a single point between its invocation and completion. A read always returns the result of the most recent write that completed before the read started. This matches single-machine memory semantics. Eventual consistency: if no new writes are made, all replicas will eventually converge to the same value. There is no guarantee about what a read returns during the convergence period. Tradeoff: linearizability requires coordination (quorum reads or leader reads), adding latency. Eventual consistency can serve reads locally, with lower latency but potentially stale data.
Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.
Common Mistakes
The top 5 mistakes builders make in this track — and exactly how to fix them.
Why it happens
An entry is only safe to apply once it has been replicated to a majority. Applying earlier risks executing entries that the cluster may never agree on.
The fix
Track `commitIndex` separately from `lastLogIndex`. Only apply entries up to `commitIndex`, and only advance `commitIndex` after the leader has received acknowledgements from a majority.
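A sketch of an apply loop bounded by `commitIndex`, reusing the `LogEntry` type from the Log Matching sketch:

```go
package raft

// applyCommitted advances lastApplied only up to commitIndex, never to the end
// of the log, and pushes each newly committed command to the state machine.
func applyCommitted(log []LogEntry, commitIndex int, lastApplied *int, applyCh chan<- interface{}) {
	for *lastApplied < commitIndex {
		*lastApplied++
		applyCh <- log[*lastApplied].Command
	}
}
```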
Why it happens
The consistency check requires the follower to have the entry immediately before the new batch. Sending the wrong index or term causes spurious rejections.
The fix
Maintain a `nextIndex[]` per follower. On rejection, decrement `nextIndex[follower]` and retry with the corrected `prevLogIndex` and `prevLogTerm`.
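A sketch of the reply handler, with argument names chosen for illustration:

```go
package raft

// onAppendEntriesReply updates per-follower indices. On success, record how far
// the follower's log matches; on rejection, back off nextIndex so the next
// AppendEntries uses a smaller prevLogIndex/prevLogTerm.
func onAppendEntriesReply(nextIndex, matchIndex []int, follower, sentUpTo int, success bool) {
	if success {
		matchIndex[follower] = sentUpTo
		nextIndex[follower] = sentUpTo + 1
		return
	}
	if nextIndex[follower] > 1 {
		nextIndex[follower]-- // retry with prevLogIndex = nextIndex[follower] - 1
	}
}
```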
Why it happens
Without snapshots and log truncation, every write is kept forever.
The fix
Implement snapshotting: periodically write the full state machine state and the index/term of the last included entry, then truncate the log prefix up to that point.
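A sketch of the snapshot metadata and log truncation, reusing the `LogEntry` type from the Log Matching sketch and assuming (for simplicity) that slice position equals absolute log index:

```go
package raft

// Snapshot records the serialized state machine plus the index and term of the
// last entry it covers; everything up to that entry can be discarded from the log.
type Snapshot struct {
	LastIncludedIndex int
	LastIncludedTerm  int
	State             []byte
}

// truncateLog keeps only entries after the snapshot point. A placeholder entry
// carrying LastIncludedTerm replaces the dummy at position 0 so the next
// AppendEntries consistency check still has a prevLogTerm to compare against.
func truncateLog(log []LogEntry, snap Snapshot) []LogEntry {
	newLog := []LogEntry{{Term: snap.LastIncludedTerm}}
	return append(newLog, log[snap.LastIncludedIndex+1:]...)
}
```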
Why it happens
RPCs can take hundreds of milliseconds or time out. Holding a lock during an RPC blocks all other message processing.
The fix
Release the mutex before sending RPCs. Collect the work to do while holding the lock, then release and do the network I/O.
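A sketch of the pattern, reusing the `AppendEntriesArgs` type from the Log Matching sketch; `buildArgs` and `send` stand in for whatever your implementation uses:

```go
package raft

import "sync"

// sendHeartbeat copies everything the RPC needs while holding the lock, then
// releases the lock before doing any network I/O.
func sendHeartbeat(mu *sync.Mutex, buildArgs func() AppendEntriesArgs, send func(AppendEntriesArgs)) {
	mu.Lock()
	args := buildArgs() // snapshot of leader state taken under the lock
	mu.Unlock()

	send(args) // the slow network call runs with the lock released
}
```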
Why it happens
A partitioned node is still reachable but not responding. Blocking on it prevents the leader from advancing `commitIndex` for the healthy majority.
The fix
Use async / parallel AppendEntries. The leader should send to all followers concurrently and advance `commitIndex` as soon as it counts a majority of acknowledgements, without waiting for slow or unreachable followers.
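A sketch of the fan-out and majority count, with `send` abstracting the actual AppendEntries RPC:

```go
package raft

// replicate sends to all followers concurrently and counts acknowledgements on
// a buffered channel; the leader commits as soon as a majority (including
// itself) has the entry, without waiting for slow or unreachable followers.
func replicate(peers int, send func(peer int) bool, onMajority func()) {
	acks := make(chan bool, peers)
	for p := 0; p < peers; p++ {
		go func(p int) { acks <- send(p) }(p) // one goroutine per follower
	}
	count := 1 // the leader's own copy
	for i := 0; i < peers; i++ {
		if <-acks {
			count++
		}
		if count*2 > peers+1 { // majority of the full cluster (peers + leader)
			onMajority()
			return
		}
	}
}
```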
Comparison Mode
Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track.
| Dimension | Raft | Multi-Paxos | Viewstamped Replication |
|---|---|---|---|
| First published | 2014 | 1998 (circulated as a tech report since 1989), extended later | 1988 |
| Log replication | Leader-driven, term-based | Leader (distinguished proposer) + prepare once per leader epoch | Primary-driven, view-stamped ops |
| Leader election | Randomised timeouts, majority vote | Phase 1 Paxos (any node can propose) | View change protocol, majority vote |
| Reconfiguration | Joint consensus or single-server changes | Requires new Paxos round | View change protocol |
| Specification clarity | Formal TLA+ spec, extended Raft paper | Informal; many incompatible variants | Revisited 2012 paper is clear |
| Production use | etcd, CockroachDB, TiKV, Consul | Google Chubby, Spanner | Academic; influenced PBFT |
| Dimension | Linearisable | Sequential | Eventual |
|---|---|---|---|
| Ordering guarantee | Real-time order respected — if op A finishes before op B starts, A is visible before B | All nodes see the same order, but not necessarily real-time | No ordering guarantee — any order is valid temporarily |
| Read-your-writes | Yes | Yes (within a session) | Not guaranteed |
| Latency | High (requires quorum / leader round-trip) | Medium | Low (local reads/writes) |
| Availability under partition | No (CAP: CP) | Partial | Yes (CAP: AP) |
| Use cases | Locks, leader election, financial txns | Collaborative editing, gaming | DNS, social media likes, caches |
| Examples | etcd, ZooKeeper, Spanner | Some multi-leader DB modes | Cassandra, DynamoDB (default), S3 |
Concepts Covered
Prerequisites
It is recommended to complete the previous tracks before starting this one. Concepts build progressively throughout the curriculum.