Raft Leader Election for Partition Leadership
Raft Leader Election
Raft divides time into terms — monotonically increasing integers. Each term begins with an election. At most one leader can exist per term. This single constraint — one leader per term — is the foundation of Raft's safety guarantee: no two nodes will ever simultaneously believe they are the authoritative leader for the same term.
The Three Node States
- Follower: The default state. Followers accept log entries from the leader and vote in elections. They do not initiate anything. If a follower stops hearing from the leader, its election timer fires and it transitions to Candidate.
- Candidate: A follower that has timed out. It increments its term, votes for itself, and broadcasts `RequestVote` RPCs to all other nodes. It either wins (becomes leader), loses (reverts to follower), or times out again and retries with an incremented term.
- Leader: The node that won the election. It handles all client writes and sends periodic `AppendEntries` (heartbeat) RPCs to suppress other nodes' election timers. There is at most one leader per term across the entire cluster.
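The three states and the moves between them can be sketched as a small lookup table. This is only an illustration: the event names (`election_timeout`, `won_majority`, `lost_election`, `saw_higher_term`) are hypothetical labels, not part of any Raft or Kafka API, and the step-down on a higher term is the rule described under Term Safety.

```python
from enum import Enum


class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# Hypothetical event names; the transitions mirror the state descriptions.
TRANSITIONS = {
    (NodeState.FOLLOWER, "election_timeout"): NodeState.CANDIDATE,
    (NodeState.CANDIDATE, "won_majority"): NodeState.LEADER,
    (NodeState.CANDIDATE, "lost_election"): NodeState.FOLLOWER,
    # A candidate that times out stays a candidate and retries in a new term.
    (NodeState.CANDIDATE, "election_timeout"): NodeState.CANDIDATE,
    (NodeState.CANDIDATE, "saw_higher_term"): NodeState.FOLLOWER,
    (NodeState.LEADER, "saw_higher_term"): NodeState.FOLLOWER,
}


def next_state(state: NodeState, event: str) -> NodeState:
    """Return the state after `event`; events that do not apply change nothing."""
    return TRANSITIONS.get((state, event), state)


print(next_state(NodeState.FOLLOWER, "election_timeout"))  # NodeState.CANDIDATE
```

Note that there is no direct Follower-to-Leader entry: a node can only become leader by first standing as a candidate.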
The Election Protocol Step by Step
- A follower's election timeout expires (it has not received a heartbeat from the current leader within the timeout window).
- The follower transitions to Candidate, increments its current term, and votes for itself.
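- It broadcasts `RequestVote(term=T, candidateId=X)` to all other nodes.
- Each receiving node grants the vote if: (a) the candidate's term is greater than or equal to the voter's current term, and (b) the voter has not already voted for a different candidate in this term. (Full Raft adds a third check, that the candidate's log is at least as up-to-date as the voter's; the simplified model here leaves it out.)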
- If the candidate accumulates votes from a strict majority — more than N/2 nodes, counting its own self-vote — it declares itself Leader and immediately begins sending heartbeats.
- If the candidate does not reach majority (another candidate won first, or votes are split), it reverts to Follower for this term.
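The steps above can be sketched in a few lines of Python. This is a minimal in-memory model, not a network implementation: the `Node` class, `handle_request_vote`, and `start_election` are hypothetical names chosen for illustration, RPCs are plain method calls, and the log comparison used by full Raft is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: str
    term: int = 0
    voted_for: Optional[str] = None  # candidate this node voted for in `term`
    state: str = "follower"

    def handle_request_vote(self, term: int, candidate_id: str) -> bool:
        """Voter side: apply vote-granting rules (a) and (b)."""
        if term < self.term:
            return False                   # (a) candidate's term is stale
        if term > self.term:
            self.term = term               # newer term: adopt it, forget the old vote
            self.voted_for = None
            self.state = "follower"
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id  # (b) at most one vote per term
            return True
        return False


def start_election(candidate: Node, peers: list) -> bool:
    """Candidate side: bump the term, self-vote, ask peers, count votes."""
    candidate.state = "candidate"
    candidate.term += 1
    candidate.voted_for = candidate.node_id
    votes = 1  # the self-vote
    for peer in peers:
        if peer.handle_request_vote(candidate.term, candidate.node_id):
            votes += 1
    cluster_size = len(peers) + 1
    won = votes > cluster_size // 2        # strict majority
    candidate.state = "leader" if won else "follower"
    return won


cluster = [Node("a"), Node("b"), Node("c")]
print(start_election(cluster[0], cluster[1:]))  # True: "a" collects all three votes
```

A real node would send the requests in parallel and stop waiting once a majority is reached or a heartbeat from another leader arrives; the sequential loop only keeps the sketch short.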
Term Safety: The Core Invariant
The term number is the single most important safety mechanism in Raft. Every message carries the sender's current term. The rule is absolute:
    on receive(any_message with term T):
        if T > self.term:
            self.term = T            # adopt the higher term
            self.state = FOLLOWER    # immediately step down
            self.voted_for = {}      # clear vote record for old terms
This rule ensures that a stale leader that was partitioned from the rest of the cluster immediately steps down the moment it reconnects and sees a higher term. The cluster can never have two active leaders for the same term.
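The stale-leader scenario can be played out with a runnable version of the rule. The `RaftNode` class and `observe_term` function are hypothetical names for this sketch; `voted_for` is kept as a term-to-candidate dictionary to match the pseudocode.

```python
from dataclasses import dataclass, field


@dataclass
class RaftNode:
    term: int = 0
    state: str = "follower"
    voted_for: dict = field(default_factory=dict)  # term -> candidate id


def observe_term(node: RaftNode, message_term: int) -> bool:
    """Apply the step-down rule; return True if a higher term was adopted."""
    if message_term > node.term:
        node.term = message_term   # adopt the higher term
        node.state = "follower"    # step down, whatever the previous state
        node.voted_for = {}        # votes from older terms no longer matter
        return True
    return False


# A leader from term 3 was partitioned away while the cluster moved on to term 5.
stale = RaftNode(term=3, state="leader", voted_for={3: "self"})
observe_term(stale, 5)             # first message seen after reconnecting
print(stale.state, stale.term)     # follower 5
```

Messages carrying an equal or lower term leave the node untouched, which is why the function returns `False` in that case.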
Why Majority Quorums Prevent Split-Brain
The majority requirement is not just a convention; it is a mathematical guarantee. Consider N=5 nodes with majority=3. For two candidates A and B to both win the same term, A needs 3 votes and B needs 3 votes, for a total of 6 votes. But there are only 5 nodes and each node can vote for at most one candidate per term, so it is arithmetically impossible for two candidates to both win the same term. In general, any two majorities of the same N nodes must share at least one node, and that node can vote only once. The same arithmetic sets the fault-tolerance limit: a cluster can lose up to floor((N-1)/2) nodes and still have a majority left to elect a leader.
| Cluster size | Majority needed | Failures tolerated |
|---|---|---|
| 3 nodes | 2 votes | 1 failure |
| 5 nodes | 3 votes | 2 failures |
| 7 nodes | 4 votes | 3 failures |
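The table rows follow directly from two integer formulas, which the short sketch below recomputes. The function names `majority` and `failures_tolerated` are illustrative only.

```python
def majority(n: int) -> int:
    """Smallest number of votes that is strictly more than half of n."""
    return n // 2 + 1


def failures_tolerated(n: int) -> int:
    """Nodes that can be lost while a majority of the original n remains."""
    return (n - 1) // 2


for n in (3, 5, 7):
    # Two winners in one term would need 2 * majority(n) votes, more than n exist.
    assert 2 * majority(n) > n
    print(n, majority(n), failures_tolerated(n))
# prints:
# 3 2 1
# 5 3 2
# 7 4 3
```

The same formulas show why even cluster sizes buy nothing: `failures_tolerated(4)` is 1, the same as for 3 nodes, while the majority rises to 3.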
How Kafka Uses Raft (KRaft Mode)
Before KRaft, Kafka relied on ZooKeeper, a separate coordination service that runs its own consensus protocol, ZAB (ZooKeeper Atomic Broadcast, which is closely related to Raft), to store cluster metadata and elect the controller that assigns partition leaders. KIP-500 introduced KRaft mode, first shipped as an early-access feature in Kafka 2.8, which replaces ZooKeeper with a Raft quorum of Kafka controller nodes. The metadata log (which broker is leader for which partition, which topics exist, etc.) is replicated via Raft directly in Kafka. This eliminates the operational complexity of running a separate ZooKeeper ensemble and is intended to let Kafka scale to millions of partitions.
Edge Cases and Invariants
- A node cannot vote for itself twice: The self-vote counts as the candidate's one vote for that term. If another candidate requests a vote for the same term, it must be denied.
- A node that already voted for A must deny B in the same term: Even if B has the same term. Once voted, the choice is locked for that term.
- Higher term always wins: A node receiving a `RequestVote` with a higher term than its own must grant the vote (and step down if it was leader or candidate).
- Ties resolve by retry: If two candidates each get exactly half the votes (possible with even N), neither wins. Both revert to follower and restart with a new term. Randomised election timeouts make repeated ties extremely unlikely.
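Randomised timeouts can be sketched as follows. The 150 to 300 ms range is an assumption borrowed from the example values commonly quoted for Raft, not something this lesson specifies, and `election_timeout_ms` and `first_to_time_out` are hypothetical helper names.

```python
import random


def election_timeout_ms(rng: random.Random, low: int = 150, high: int = 300) -> int:
    """Pick a timeout uniformly from [low, high] milliseconds (assumed range)."""
    return rng.randint(low, high)


def first_to_time_out(node_ids, rng: random.Random) -> list:
    """Return the node(s) whose timer fires first; one entry means no tie."""
    timeouts = {node_id: election_timeout_ms(rng) for node_id in node_ids}
    earliest = min(timeouts.values())
    return [node_id for node_id, t in timeouts.items() if t == earliest]


rng = random.Random()  # unseeded: every run draws different timeouts
print(first_to_time_out(["a", "b", "c", "d"], rng))
```

Because each node redraws its timeout after a failed election, two candidates that tied once are unlikely to start their next attempt at the same moment, so one of them usually gathers a majority before the other wakes up.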