Leader Election: How Distributed Systems Choose Who Is in Charge
Every distributed system that needs consistency needs a leader. How you elect that leader, and what happens when it fails, determines everything about your system's behavior under pressure.
Leader election sounds like a solved problem. Pick the node with the lowest ID. Done. But that only works if you can assume perfect communication and no failures. In a real distributed system, you cannot assume either of those things. Nodes crash. Networks partition. Messages get delayed. The algorithm that elects a leader when everything is working is not the hard part. The hard part is electing exactly one leader when things go wrong — and knowing when it is safe to proceed.
Why One Leader?
Before getting into the mechanics, it is worth asking why distributed systems want a single leader at all. The answer is consistency.
If two nodes can independently accept writes to the same key, those writes might conflict. Which one wins? Without coordination, you might get different answers on different nodes — a split brain. For a few use cases, like eventually consistent stores, this is acceptable. For anything that needs strict ordering or linearizable reads, it is not.
A single leader solves this by being the serialization point for all writes. Every write goes through the leader. The leader orders them. All nodes apply them in the same order. The system has a single, consistent view of the world.
This is the tradeoff at the heart of the CAP theorem. Consistency (everyone agrees on the same value) and availability (every request gets a response) are in tension when the network partitions. A leader-based system chooses consistency: during a partition, only the partition that contains the leader can make progress. The other partition becomes read-only or unavailable.
The Problem With Naive Election
The simplest leader election algorithm is: whoever has the lowest node ID wins. Nodes send heartbeats. If you have not heard from a lower-ID node in a while, declare yourself leader.
This works if node failures are clean. If node n1 (the leader) crashes, n2 eventually stops hearing heartbeats and declares itself leader. Clean.
Now imagine n1 does not crash but merely gets slow. Maybe it is stuck in a long GC pause. Maybe a network hiccup delayed its heartbeats by a few hundred milliseconds. n2, seeing the heartbeat gap, elects itself leader. n1 comes back, still believing it is the leader. Now you have two leaders: a split brain.
Every serious leader election algorithm has to solve this problem. The solution, in general, is terms.
Terms: The Key Abstraction
Raft, the consensus algorithm this track builds toward, introduces the concept of terms. A term is a monotonically increasing integer that acts as a logical clock for the election process.
Every message in the Raft protocol includes the sender's current term. If a node receives a message with a higher term than its own, it immediately updates its term and reverts to follower status. This is how stale leaders get displaced: when a new election happens, the winning candidate's term is higher than the old leader's term. When the old leader hears about the new term, it knows it is no longer in charge.
This single rule — "highest term wins, always" — is what prevents split brain in Raft. The old slow leader sends messages with term 5. The new leader is on term 6. Every node that receives a term-6 message from the new leader will stop following the old term-5 leader. The cluster converges to one leader per term.
Terms also give you a clean way to detect stale information. A vote request from term 3 can be safely ignored if you are already on term 5. A log entry from term 2 is older than one from term 4. The term number is a total order on every piece of information in the system.
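The "highest term wins" rule can be sketched in a few lines. This is a minimal illustration, not etcd's or any real implementation's API; the class and method names are invented for this example.

```python
# Sketch of Raft's term rule: any message carrying a higher term forces
# the receiver to adopt that term and revert to follower.
FOLLOWER, CANDIDATE, LEADER = "follower", "candidate", "leader"

class Node:
    def __init__(self, term=0, state=FOLLOWER):
        self.term = term
        self.state = state

    def observe_term(self, msg_term):
        """Apply 'highest term wins' to an incoming message's term.

        Returns True if the message is current and should be processed,
        False if it is stale and can be ignored.
        """
        if msg_term > self.term:
            self.term = msg_term      # adopt the newer term
            self.state = FOLLOWER     # any leader or candidate steps down
            return True
        return msg_term == self.term  # equal terms are current; lower are stale
```

A stale leader on term 5 that receives any term-6 message immediately becomes a follower on term 6; a leftover term-3 vote request is simply ignored.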
The Election Process
When a follower has not heard from the leader for too long (the election timeout), it starts an election:
- Increment your term counter.
- Transition from follower to candidate.
- Vote for yourself.
- Send RequestVote messages to all other nodes.
Other nodes grant their vote if the candidate's term is at least as high as their own (adopting the higher term first, if needed) and they have not already voted for a different candidate in that term. A candidate wins if it gets votes from a majority of nodes (more than half).
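Both sides of this exchange, the candidate starting an election and a peer deciding whether to grant its vote, can be sketched as follows. The class and message shapes are illustrative assumptions, not a real library's API, and this omits the log up-to-date check that full Raft adds.

```python
class Elector:
    def __init__(self, node_id, peers):
        self.id = node_id
        self.peers = peers        # ids of the other nodes in the cluster
        self.term = 0
        self.state = "follower"
        self.voted_for = {}       # term -> candidate id we voted for

    def start_election(self):
        """The steps from the list above: bump term, become candidate, self-vote."""
        self.term += 1
        self.state = "candidate"
        self.voted_for[self.term] = self.id
        # A real node would now send this RequestVote to every peer.
        return {"type": "RequestVote", "term": self.term, "candidate": self.id}

    def handle_request_vote(self, term, candidate):
        """Grant a vote if the term is current and we have not voted for
        a different candidate in that term."""
        if term < self.term:
            return False              # stale candidate, ignore
        if term > self.term:
            self.term = term          # adopt the newer term
            self.state = "follower"
        if self.voted_for.get(self.term) in (None, candidate):
            self.voted_for[self.term] = candidate
            return True
        return False                  # already voted for someone else this term
```

The `voted_for` map is why one node cannot hand out two votes in the same term, which is the building block of the majority-overlap argument below.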
The majority requirement is the key safety property. At any given time, at most one candidate can win a majority, because the sets that form majorities overlap. If there are five nodes and candidate A gets votes from n1, n2, n3, candidate B cannot get votes from n3, n4, n5 — because n3 already voted for A in this term.
This is why Raft (and Paxos) deployments conventionally use odd cluster sizes: 3 nodes, 5 nodes, 7 nodes. With 3 nodes, a majority is 2, so the cluster can tolerate one failure. With 5 nodes, a majority is 3, so it can tolerate two. In general, the number of tolerable failures is (cluster size - 1) / 2, rounded down, which is why even sizes buy nothing: a 4-node cluster tolerates only one failure, same as a 3-node cluster, while adding one more node that can misbehave.
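The majority arithmetic is worth writing down explicitly, since it explains why even cluster sizes are a poor deal:

```python
def majority(n):
    """Votes needed to win an election in an n-node cluster."""
    return n // 2 + 1

def tolerable_failures(n):
    """Nodes that can fail while the remainder can still elect a leader."""
    return (n - 1) // 2
```

Going from 3 nodes to 4 raises the majority from 2 to 3 without raising the number of tolerable failures, so the fourth node adds load and failure surface but no resilience.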
Randomized Timeouts: The Livelock Solution
There is a subtle problem with the election algorithm as described: what if multiple candidates start an election at the same time?
With five nodes and simultaneous elections, you might get two candidates each getting two votes and then waiting for more that never come. Neither has a majority. Another election starts. Another split vote. The system never makes progress.
This is called livelock: nodes are alive and communicating, but making no progress.
Raft solves this with randomized election timeouts. Each node picks its election timeout from a random range (the paper suggests 150ms to 300ms). The first node to time out starts an election. If it wins quickly, the other nodes see the new leader's heartbeats and reset their timeouts before they fire. If it does not win quickly, another node times out and tries.
Randomization does not guarantee progress in any single election round. It makes simultaneous conflicts increasingly unlikely with each round. In practice, Raft clusters almost always elect a leader within one or two rounds.
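The mechanism is small enough to sketch directly. The function names here are illustrative; the 150-300ms range is the one the Raft paper suggests.

```python
import random

def election_timeout(rng=random, low_ms=150, high_ms=300):
    """Pick a fresh randomized election timeout (Raft paper's suggested range)."""
    return rng.uniform(low_ms, high_ms)

def first_to_fire(timeouts):
    """Index of the node whose timer expires first; it starts the election.

    With timeouts drawn from a continuous range, ties are vanishingly
    rare, which is exactly how the symmetry between identical nodes breaks.
    """
    return min(range(len(timeouts)), key=lambda i: timeouts[i])
```

Each node redraws a new random timeout whenever its timer resets, so even after a split vote the next round is unlikely to collide again.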
This is a recurring pattern in distributed systems: when you need to break symmetry between identical nodes, add randomness. Ethernet's CSMA/CD uses randomized backoff after collisions the same way. Bitcoin's proof-of-work lottery works the same way.
Heartbeats and Lease Renewal
Once elected, a leader must continuously prove it is still alive by sending heartbeat messages to all followers. Heartbeats are empty AppendEntries RPCs — the same message used for log replication, sent with no entries just to reset followers' election timers.
This creates a continuous cycle: leader sends heartbeats, followers reset timers, followers stay followers. As long as the leader is alive and connected, this cycle runs indefinitely. When the leader fails, followers' timers eventually expire, and a new election begins.
The heartbeat interval has to be shorter than the election timeout. The Raft paper suggests heartbeats at roughly 1/10th of the election timeout. If election timeouts are 150-300ms, heartbeats fire every 15ms. This gives the leader plenty of time to get its heartbeat in before any follower's timer fires, while keeping latency low enough to detect failures quickly.
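The follower side of this cycle is just a resettable deadline. A minimal sketch, with illustrative names (an empty `entries` tuple stands in for a heartbeat AppendEntries):

```python
import time

HEARTBEAT_MS = 15   # roughly 1/10th of the 150ms minimum election timeout

class Follower:
    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.deadline = time.monotonic() + timeout_ms / 1000

    def on_append_entries(self, entries=()):
        """Any AppendEntries, even an empty heartbeat, resets the election timer."""
        self.deadline = time.monotonic() + self.timeout_ms / 1000

    def timed_out(self, now=None):
        """True once the deadline passes with no word from the leader."""
        if now is None:
            now = time.monotonic()
        return now > self.deadline
```

As long as heartbeats arrive well inside `timeout_ms`, `timed_out` never returns True and the follower stays a follower.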
There is a real tradeoff here. Short election timeouts mean fast failure detection and fast failover. They also mean more false elections triggered by transient network blips. Long election timeouts mean fewer false elections, but longer unavailability windows when real failures happen. Systems like etcd tune these based on their deployment environment and their availability requirements.
Pre-Vote: A Practical Extension
The Raft paper describes pre-vote as an optional extension, but most production implementations include it. Here is the problem it solves.
Imagine a node that gets partitioned from the rest of the cluster. It stops hearing heartbeats, so it starts elections. Because it is partitioned, its vote requests never arrive. Its term counter keeps incrementing with each failed election. Term 10, 20, 100.
When the partition heals, this node rejoins with term 100. The current leader, on term 5, receives a message with term 100 and immediately reverts to follower (per the "highest term wins" rule). A new election starts, disrupting the cluster even though the returning node was isolated, not authoritative.
Pre-vote adds a check before a node starts a real election: "Would anyone vote for me if I asked?" If the node is still partitioned and isolated, no one will respond, and it stays at its current term. Only if pre-vote succeeds does the node increment its term and start a real election.
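The essential point, that the term is only incremented after the pre-vote poll succeeds, can be sketched like this. The class and its fields are invented for illustration and simplify away the actual RPC exchange:

```python
class PreVoteCandidate:
    """Sketch of the pre-vote check: poll peers WITHOUT bumping the term."""

    def __init__(self, term, cluster_size):
        self.term = term
        self.cluster_size = cluster_size

    def would_win(self, prevote_responses):
        """prevote_responses: booleans from reachable peers. A partitioned
        node simply collects too few True responses."""
        granted = 1 + sum(prevote_responses)   # count our own implicit yes
        return granted > self.cluster_size // 2

    def maybe_campaign(self, prevote_responses):
        if self.would_win(prevote_responses):
            self.term += 1                     # only now does the term move
            return "candidate"
        return "follower"                      # term unchanged; no disruption
```

A partitioned node keeps failing the pre-vote, so when the partition heals it rejoins at its old term instead of term 100, and the current leader is left undisturbed.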
This extension costs one extra round-trip per election but dramatically reduces unnecessary elections from partitioned nodes. etcd enabled pre-vote by default in version 3.4. CockroachDB includes it in their Raft implementation.
The Split Brain Guarantee
One of the most counterintuitive things about Raft is that it can have at most one leader per term, but it can have leaders from different terms simultaneously during a partition.
Consider a five-node cluster that partitions into groups of two and three. The majority partition (three nodes) will elect a new leader, say on term 6. The minority partition (two nodes) might still have the old leader, on term 5. Both are running simultaneously.
Is this safe? Yes, for one reason: the minority partition cannot make progress. It does not have a majority, so any write the old leader tries to commit will fail (it cannot replicate to a majority of the cluster). The old leader is running but stuck. When the partition heals, the old leader will receive messages with term 6 and immediately step down.
This is the crucial safety property. Raft guarantees that you will never have two active leaders simultaneously making committed writes. You might have a stale leader that is alive but unable to commit. That is fine.
Building It in Practice
The tasks in this track ask you to implement node states, heartbeats, randomized timeouts, and vote handling. This is the core of the Raft election protocol without the log replication part.
The trickiest part is getting the state transitions right. Every node starts as a follower. It transitions to candidate when its election timer fires. It transitions to leader when it wins a majority of votes. It transitions back to follower when it hears from a leader with a higher term, or when it receives a vote request from a node with a higher term.
These transitions need to be atomic in the sense that they should not be interrupted by a concurrent message. In practice, this usually means serializing all state transitions through a single event loop or using locks carefully.
The other tricky part is the timer management. You need per-node election timers that reset when you hear from a leader. You need to cancel the timer when you become leader (leaders do not wait for elections) and restart it when you become a follower. Getting this right without race conditions is the implementation challenge.
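Both tricky parts, serialized state transitions and timer lifecycle, can be combined in one sketch. This is a minimal illustration under the assumptions above (a single lock serializes transitions, `threading.Timer` stands in for a real event loop); the names are not etcd's API, and vote counting is elided.

```python
import random
import threading

class RaftNode:
    """Minimal sketch of the follower/candidate/leader transitions."""

    def __init__(self):
        self.state = "follower"
        self.term = 0
        self.lock = threading.Lock()   # serializes all state transitions
        self.timer = None
        self._reset_timer()

    def _reset_timer(self):
        """Cancel any pending election timer and arm a fresh randomized one."""
        if self.timer:
            self.timer.cancel()
        self.timer = threading.Timer(random.uniform(0.150, 0.300), self._on_timeout)
        self.timer.daemon = True
        self.timer.start()

    def _on_timeout(self):
        with self.lock:
            if self.state != "leader":   # leaders never start elections
                self.term += 1
                self.state = "candidate"
                self._reset_timer()      # retry if this election splits

    def on_won_majority(self):
        with self.lock:
            if self.state == "candidate":
                self.state = "leader"
                self.timer.cancel()      # leaders do not wait for elections

    def on_higher_term(self, term):
        """Hearing a higher term always demotes us back to follower."""
        with self.lock:
            if term > self.term:
                self.term = term
                self.state = "follower"
                self._reset_timer()
```

Note the asymmetry the text describes: becoming leader cancels the timer, while reverting to follower restarts it. Forgetting either is a classic source of stuck or flapping clusters.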
What This Enables
Leader election is not interesting by itself. It becomes interesting because of what you build on top of it.
With a stable leader, you can implement log replication: the leader accepts writes, appends them to its log, and replicates them to followers. A write is committed when the leader has confirmed replication to a majority. This is Raft's log replication, which comes in the next track.
With log replication, you can implement a replicated key-value store. That is what etcd is. It is what CockroachDB's range replication is. It is what ZooKeeper's ZAB protocol does.
etcd is the concrete example that makes this real. Every Kubernetes cluster runs etcd. When you run kubectl apply, your desired cluster state goes into etcd. The Kubernetes API server reads from etcd to make scheduling decisions. The reliability of your Kubernetes cluster depends directly on etcd's leader election working correctly.
The engineers who built etcd implemented exactly what this track builds: term-based elections, randomized timeouts, majority votes, heartbeat-driven lease renewal. The code you write here is structurally similar to what they wrote.
Ready to build it? The Elector track starts with implementing the three node states and transitions between them. By the end, your implementation will handle the full election lifecycle: timeouts, vote requests, vote granting, leader heartbeats, and term-based conflict resolution. The same concepts power etcd, CockroachDB, Consul, and TiKV.
Build it yourself
Reading about distributed systems is useful. Building them is how you actually learn.
Start the Elector track