The Consensus: Getting Nodes to Agree on Anything

What consensus actually means formally, why it is provably impossible in a pure async system, how Paxos solved it for practical systems, and why Raft replaced Paxos as the implementation standard.

What Consensus Actually Means

Before discussing how to achieve consensus, it is worth being precise about what the word means in a distributed systems context. Informally, consensus means "the nodes agree on something." Formally, it has three requirements:

Agreement: No two correct nodes decide different values. If node A decides that the committed value is x, and node B also reaches a decision, then B decides x as well.

Validity: The decided value must be a value that was actually proposed by some node. The system cannot invent a value from thin air. Consensus on a nonsensical or external value does not count.

Termination: Every correct node eventually decides. The system cannot stall forever. This is the liveness requirement.

Agreement and validity are safety properties: they say bad things will not happen. Termination is a liveness property: it says something good will eventually happen.
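
To make the two safety properties concrete, here is a minimal sketch, in Python, of how you might check them over a finished run of some consensus protocol. The decisions and proposed collections are hypothetical bookkeeping, not part of any particular algorithm.

def check_agreement(decisions):
    # Agreement: no two correct nodes decided different values.
    # decisions maps node id -> decided value, for the nodes that decided.
    return len(set(decisions.values())) <= 1

def check_validity(decisions, proposed):
    # Validity: every decided value was actually proposed by some node.
    return all(value in proposed for value in decisions.values())

# Termination cannot be checked on a finite trace; it is a property of the
# algorithm itself: every correct node must eventually decide.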

The challenge is that in a distributed system, nodes can crash and messages can be delayed. Achieving all three properties simultaneously is harder than it appears, and there is a famous proof that it is actually impossible in the most general formulation.

FLP: Why Consensus Is Impossible (And Why That Does Not Matter)

In 1985, Fischer, Lynch, and Paterson published a result that shook the distributed systems community. Their paper "Impossibility of Distributed Consensus with One Faulty Process" proved that in an asynchronous system (one where there are no bounds on message delivery time), it is impossible to guarantee consensus even if only one node can fail.

The intuition is this: in a system with no timing assumptions, you cannot distinguish between a slow node and a dead node. If you are waiting for a response from node A and it does not arrive, you have no way to know if node A is dead or if the message is just delayed. Any algorithm that requires waiting for node A to respond might wait forever. And any algorithm that does not wait for node A might decide a value while A is about to respond with a conflicting value.

The FLP result does not say consensus is useless or that we cannot build systems that agree on things. It says that in a purely asynchronous model, you cannot simultaneously guarantee safety (no two nodes decide differently) and liveness (every correct node decides) in the presence of failures. If you want safety, you might have to give up on termination in some cases. If you want termination, you might have to make timing assumptions.

In practice, all real consensus systems make one of two moves to escape the FLP impossibility:

They add timing assumptions. Real networks are not purely asynchronous; messages usually arrive within some time bound. Algorithms like Paxos and Raft add timeouts: if a node does not hear from the leader within a timeout period, it assumes the leader is dead and starts an election. Timeouts cannot make termination provable in the pure asynchronous model, but they work in practice because real networks do have timing properties.

They accept that the algorithm might not terminate under some conditions. Paxos famously can livelock: two proposers can alternate issuing ever-higher proposal numbers, preempting each other indefinitely. In practice, this rarely happens, and randomized backoff prevents it from persisting.
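
As a rough illustration of both moves, here is a sketch of the randomized election timeout that Raft-style systems use. The constants and the Node skeleton are illustrative choices of mine, not taken from any particular implementation.

import random
import time

ELECTION_TIMEOUT_MIN = 0.150  # seconds; illustrative values
ELECTION_TIMEOUT_MAX = 0.300

class Node:
    def reset_election_timer(self):
        # Pick a fresh random deadline every time we hear from a valid leader.
        # Randomization makes it unlikely that two nodes keep timing out in
        # lockstep and blocking each other's elections indefinitely.
        self.election_deadline = time.monotonic() + random.uniform(
            ELECTION_TIMEOUT_MIN, ELECTION_TIMEOUT_MAX)

    def tick(self):
        # The timing assumption: silence past the deadline is treated as failure,
        # even though the asynchronous model says we cannot know for sure.
        if time.monotonic() >= self.election_deadline:
            self.start_election()

    def start_election(self):
        # Transition to candidate, increment the term, request votes (not shown).
        ...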

The FLP result is important to understand because it tells you that any system claiming to solve consensus must be making some assumption about time or accepting some failure mode. There is no free lunch.

Paxos: Correct but Hard to Implement

Leslie Lamport introduced Paxos in 1989, finally published it in 1998, and then rewrote it in simpler form in 2001 as "Paxos Made Simple." Paxos is the foundational consensus algorithm. Understanding its structure is useful even if you never implement it directly.

Paxos proceeds in rounds. Each round has two phases. In the Prepare phase, a proposer sends a prepare message with a proposal number n to a majority of acceptors. Each acceptor, if n is higher than any proposal number it has seen, promises not to accept any proposal with a lower number and returns the highest-numbered proposal it has already accepted, if any.

If the proposer collects promises from a majority, it moves to the Accept phase. It sends an accept message with number n and a value v. The value v is the value from the highest-numbered accepted proposal returned in any promise response, or the proposer's own value if no acceptor has accepted anything. An acceptor accepts this proposal unless it has since promised a higher-numbered prepare.

If a majority of acceptors accept the proposal, the value is decided. A separate Learner role can be notified of the decided value.

The insight behind Paxos is in how the value is chosen during the Accept phase. By using the highest-numbered already-accepted value, the protocol ensures that if any value was previously accepted by a majority (and therefore potentially decided), that value will be chosen again. This is the property that makes Paxos safe: once a value is decided by a majority, subsequent rounds will re-decide the same value.
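
Here is a minimal sketch of the acceptor side of single-decree Paxos and the proposer's value-selection rule. The in-memory state and dictionary-shaped messages are my own invention; Lamport's papers do not prescribe a concrete format.

class Acceptor:
    def __init__(self):
        self.promised_n = 0        # highest proposal number promised so far
        self.accepted_n = None     # number of the proposal accepted, if any
        self.accepted_value = None

    def on_prepare(self, n):
        # Phase 1: promise to ignore lower-numbered proposals, and report the
        # highest-numbered proposal this acceptor has already accepted.
        if n > self.promised_n:
            self.promised_n = n
            return {'promised': True,
                    'accepted_n': self.accepted_n,
                    'accepted_value': self.accepted_value}
        return {'promised': False}

    def on_accept(self, n, value):
        # Phase 2: accept unless a higher-numbered prepare arrived in the meantime.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return {'accepted': True}
        return {'accepted': False}

def choose_value(promises, own_value):
    # The proposer's rule: adopt the value of the highest-numbered proposal any
    # promising acceptor has accepted, falling back to its own value otherwise.
    accepted = [p for p in promises if p.get('accepted_n') is not None]
    if not accepted:
        return own_value
    return max(accepted, key=lambda p: p['accepted_n'])['accepted_value']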

The problem with Paxos is that this description covers only single-value consensus (sometimes called "single-decree Paxos"). Building a log of commands that a replicated state machine can execute requires "Multi-Paxos," which layers additional mechanisms on top: leader election, log management, and membership changes. Lamport's papers describe the single-value case cleanly, but Multi-Paxos is underspecified.

Google's 2007 paper "Paxos Made Live: An Engineering Perspective" documents their experience building Chubby, Google's distributed lock service, on top of Paxos. They describe a litany of issues not addressed by the theory: handling disk corruption, multi-master scenarios, non-voting replicas, cluster membership changes, snapshotting. The paper's title is a wink at the gap between Paxos theory and Paxos implementation. Chubby underpins GFS, BigTable, and Spanner, so Google had a strong incentive to get it right, and they report that doing so was significantly harder than the theoretical papers suggest.

Raft: Designed to Be Understood

Diego Ongaro and John Ousterhout's 2014 paper, "In Search of an Understandable Consensus Algorithm," is a work of engineering communication as much as computer science. Ongaro observed that most engineers who needed to implement consensus ended up struggling with Paxos because the full algorithm (multi-decree, with leader election, log replication, and membership changes) was never written down in one place. They had to reconstruct it from first principles.

Raft's design goal was understandability. Ongaro decomposed consensus into three relatively independent subproblems: leader election, log replication, and safety. Each subproblem has a clear specification. The full algorithm is the composition of these three parts.

Leader election: Raft guarantees at most one leader per term. Time is divided into terms, numbered with consecutive integers. Each term starts with an election. A node starts an election by incrementing its term number, transitioning to candidate state, and sending RequestVote RPCs to other nodes. A node grants its vote if the candidate's term is at least as large as its own, it has not already voted for a different candidate in that term, and the candidate's log is at least as up-to-date as its own. If a candidate receives votes from a majority, it becomes leader for that term.

Log replication: The leader receives commands from clients, appends them to its log, and sends AppendEntries RPCs to all followers. A follower accepts the entries if its log matches the leader's log at the preceding entry. If not, it rejects, and the leader steps back through the follower's log until it finds the point of agreement, then resends entries from that point. Once a majority of nodes have written an entry to their logs, it is committed. The leader advances its commit index and notifies followers in the next AppendEntries.

Safety: The election restriction ensures that only a node with an up-to-date log can be elected leader. "At least as up-to-date" means: a higher last log term, or the same last log term and a log at least as long. This guarantees that any elected leader has all committed entries.
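
A sketch of the follower side of RequestVote ties the election rules and the election restriction together. It assumes the same hypothetical helpers as the AppendEntries handler below, plus a get_last_term accessor and a voted_for field.

def handle_request_vote(self, rv):
    # rv contains: term, candidate_id, last_log_index, last_log_term

    # Reject candidates from stale terms.
    if rv['term'] < self.current_term:
        return {'vote_granted': False, 'term': self.current_term}

    # A higher term resets our vote for the new term.
    if rv['term'] > self.current_term:
        self.current_term = rv['term']
        self.voted_for = None

    # Election restriction: the candidate's log must be at least as up-to-date.
    last_term = self.log.get_last_term()
    last_index = self.log.get_last_index()
    log_ok = (rv['last_log_term'] > last_term or
              (rv['last_log_term'] == last_term and rv['last_log_index'] >= last_index))

    # One vote per term: grant only if we have not already voted for someone else.
    if log_ok and self.voted_for in (None, rv['candidate_id']):
        self.voted_for = rv['candidate_id']
        self.reset_election_timer()
        return {'vote_granted': True, 'term': self.current_term}

    return {'vote_granted': False, 'term': self.current_term}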

The Log Matching property is a key invariant: if two logs have an entry with the same index and term, they are identical up to that index. This follows from two rules: a leader creates at most one entry per index in a given term, and a follower accepts AppendEntries only when its own log matches the leader's at the preceding entry's index and term.

Let us look at a concrete AppendEntries interaction:

def handle_append_entries(self, ae):
    # ae contains: term, leader_id, prev_log_index, prev_log_term, entries, leader_commit

    # Check term: reject if leader's term is stale
    if ae['term'] < self.current_term:
        return {'success': False, 'term': self.current_term}

    # A higher term means we are behind: adopt it (and step down if we were
    # a candidate or leader)
    if ae['term'] > self.current_term:
        self.current_term = ae['term']

    # Reset election timer: we have a valid leader
    self.reset_election_timer()

    # Check log consistency at prev_log_index
    prev_entry = self.log.get_entry(ae['prev_log_index'])
    if ae['prev_log_index'] > 0 and (prev_entry is None or prev_entry.term != ae['prev_log_term']):
        return {'success': False, 'term': self.current_term, 'match_index': self.log.get_last_index()}

    # Append new entries, truncating any conflicts
    for i, entry in enumerate(ae['entries']):
        idx = ae['prev_log_index'] + 1 + i
        existing = self.log.get_entry(idx)
        if existing and existing.term != entry['term']:
            self.log.truncate_from(idx)
        if idx > self.log.get_last_index():
            self.log.append(entry)

    # Advance commit index
    if ae['leader_commit'] > self.commit_index:
        self.commit_index = min(ae['leader_commit'], self.log.get_last_index())

    return {'success': True, 'term': self.current_term, 'match_index': self.log.get_last_index()}

The truncation step is important. If the follower has log entries that conflict with the leader's entries (different term at the same index), those entries must be deleted. This can happen when a node was briefly leader, appended entries that were not committed before it crashed, and then another leader appended different entries at the same indices. The new leader's entries win.

Commitment and the Current-Term Rule

An entry is committed when the leader knows it is stored on a majority of nodes. But there is a subtle safety requirement: a leader can only directly commit entries from its own term.

The reason is a scenario like this: a leader from term 2 has an entry at index 3 that it replicated to a majority, but it crashed before committing it. A new leader is elected in term 3, and this new leader has that entry in its log. It might look safe for the new leader to commit that term-2 entry simply by observing that a majority store it, but it is not, and Raft does not commit it directly.

Instead, the new leader appends a new entry in term 3 and commits it. As part of committing that entry, all entries before it in the log (including the term-2 entry) are implicitly committed. This is the "indirect commit" of previous terms.

The reason for this rule is subtle: without it, you can construct scenarios where an entry appears to be "safely replicated" on a majority but can still be overwritten. With the rule, once a current-term entry commits, everything before it is locked in.
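
A sketch of the leader side shows both the step-back retry from the log replication section and the current-term commitment rule. The next_index and match_index maps, the become_follower helper, and the way the leader remembers what it sent to each follower are my own bookkeeping choices; the AppendEntries handler above is the follower counterpart.

def handle_append_entries_response(self, follower_id, resp, sent_prev_index, sent_count):
    # A stale leader steps down as soon as it sees a higher term.
    if resp['term'] > self.current_term:
        self.current_term = resp['term']
        self.become_follower()
        return

    if resp['success']:
        # The follower now matches us through the last entry we sent it.
        self.match_index[follower_id] = sent_prev_index + sent_count
        self.next_index[follower_id] = self.match_index[follower_id] + 1
        self.maybe_advance_commit_index()
    else:
        # Consistency check failed: step back and retry from an earlier point.
        self.next_index[follower_id] = max(1, self.next_index[follower_id] - 1)

def maybe_advance_commit_index(self):
    cluster_size = len(self.match_index) + 1  # followers plus the leader itself
    for idx in range(self.log.get_last_index(), self.commit_index, -1):
        replicated = 1 + sum(1 for m in self.match_index.values() if m >= idx)
        # Current-term rule: only commit by counting replicas if the entry is
        # from this leader's own term; earlier entries commit implicitly.
        if replicated * 2 > cluster_size and self.log.get_entry(idx).term == self.current_term:
            self.commit_index = idx
            break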

Linearizability: What the Guarantees Mean for Clients

A Raft cluster provides linearizability, which is the strongest consistency guarantee for a distributed system. Linearizability means that the system behaves like a single machine: every operation appears to take effect atomically at some point between when the client issued it and when the client received a response.

Formally, linearizability requires that you can find a total order of all operations such that each operation appears to take effect at some point within its real-time interval, and the order is consistent with the real-time order of non-overlapping operations.

For reads specifically, linearizability requires that a read returns the value written by the most recent write. A simple implementation would route all reads through the Raft log, just like writes: propose a read as a log entry, wait for it to commit, then execute the read from the state machine. This is correct but adds unnecessary latency.

Raft's paper describes the ReadIndex optimization: before serving a read, the leader records the current commit index, sends heartbeats to confirm it is still the leader (so it is not a partitioned stale leader serving reads while a new leader is accepting writes elsewhere), waits until the state machine has applied up to the recorded commit index, and then serves the read from local state. This avoids writing a log entry for reads while maintaining linearizability.
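
A sketch of that flow, with illustrative helper names for the heartbeat round and the apply loop (none of these are etcd's actual API):

class NotLeaderError(Exception):
    pass

def linearizable_read(self, key):
    # 1. Record the commit index as of the moment the read arrives.
    read_index = self.commit_index

    # 2. Confirm we are still the leader by hearing from a majority; a stale,
    #    partitioned leader fails here instead of serving a stale read.
    if not self.broadcast_heartbeat_and_wait_for_majority():
        raise NotLeaderError()

    # 3. Wait for the state machine to catch up to the recorded index.
    while self.last_applied < read_index:
        self.apply_next_committed_entry()

    # 4. Serve the read from local state; no log entry was written.
    return self.state_machine.get(key)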

etcd, the distributed key-value store used by Kubernetes for all cluster state, implements this. Every Kubernetes API server write and read goes through etcd, which runs a Raft cluster. The linearizability guarantee means that if you write a pod spec and then immediately read it back, you will see your write. This is a correctness requirement for a system that coordinates thousands of containers across a cluster.

Raft Versus Paxos in Practice

Raft and Multi-Paxos are equivalent in what they can achieve, but they differ in what they specify. Raft specifies a complete algorithm for log replication including leader election, log matching, commitment, and log compaction (snapshots). Multi-Paxos specifies the core safety properties but leaves many engineering decisions to the implementer.

The result is that Raft implementations tend to look similar to each other. etcd, TiKV, CockroachDB, and RethinkDB all implement something recognizable as Raft. They share the same high-level structure: one leader, AppendEntries for replication, RequestVote for elections, a commit index, and an apply loop.

Paxos-style implementations (Google's Chubby, and Apache ZooKeeper's ZAB, which is Paxos-like but not Paxos proper) tend to diverge more from each other because the specification is looser. This makes them harder to reason about and audit for correctness.

Raft is not simpler than Paxos in terms of the underlying logic. The safety argument for Raft requires careful reasoning about election restrictions, log matching invariants, and the current-term commitment rule. What Raft is is more completely specified: you do not have to fill in the gaps, because the paper tells you exactly what to build.

That is why, if you are building a system that needs consensus, you will almost certainly implement Raft. And when you understand what Raft is doing and why each piece exists, you will have a solid foundation for understanding every production consensus system you encounter.
