Building Distributed Systems from Scratch

In-Sync Replicas and the High Watermark

Kafka's durability guarantee rests on two interlocking concepts: the In-Sync Replica (ISR) set and the high watermark (HW). Together they define a commitment boundary — the highest offset that is guaranteed to survive any single broker failure without data loss.

What the ISR Set Represents

The ISR is the set of replicas (leader plus followers) that are sufficiently caught up with the leader's log. A replica is in-sync when its replication lag — measured as log_end_offset - replica_fetch_offset — is within a configurable threshold (replica.lag.time.max.ms in production, or MAX_LAG offsets in this exercise). A replica that falls too far behind is removed from the ISR. When it catches up again, it is re-admitted.

lag = leader.log_end_offset - replica.fetch_offset

if lag <= MAX_LAG:
    isr.add(replica)        # within threshold — re-admit or retain
else:
    isr.discard(replica)    # too far behind — remove from ISR

The leader itself is always in the ISR; its "fetch offset" equals its log end offset, so its lag is always 0.

The High Watermark

The high watermark is the minimum fetch offset across all current ISR members. Its meaning is precise: every replica in the ISR has confirmed it has received all messages up to offset HW-1. Therefore, even if the leader crashes right now and a new leader is elected from the ISR, no message with offset less than HW will be lost.

high_watermark = min(fetch_offsets[node] for node in isr)

Consumers can only read messages with offset less than the high watermark. Messages above the HW are "not yet committed" — they may or may not survive a leader failover. This guarantee protects consumers from reading data that could be rolled back.

The acks=all Producer Write Path

When a producer sends a message with acks=all (the strongest durability setting), the following sequence occurs:

The leader appends the message to its own log, advancing log_end_offset.
Each ISR follower fetches the new message, advancing its fetch_offset.
The leader recalculates the high watermark: HW = min(fetch_offsets across ISR).
Once HW advances past the new message's offset, the message is committed.
The leader sends the ACK back to the producer.

This means acks=all adds latency proportional to replication round-trip time. In exchange, a committed message is guaranteed to survive the loss of any subset of ISR nodes.

Leader Failover and the ISR Guarantee

When the current leader crashes, Kafka's controller elects a new leader — always from the ISR. This is safe because every ISR member, by definition, has all data up to the high watermark. No committed message is lost, regardless of which ISR member becomes the new leader.

The dangerous alternative is unclean leader election: electing a replica that is not in the ISR. This restores availability after a total ISR failure (all ISR members crashed simultaneously), but at the cost of losing messages that were committed to the old leader but not yet replicated to the new one. The configuration flag unclean.leader.election.enable defaults to false in modern Kafka.

Why Kafka Uses This Approach

Kafka's ISR mechanism is a deliberate alternative to Raft's majority-quorum writes. In Raft, every write must be acknowledged by a majority before it is committed — the cluster always tolerates floor(N/2) failures. Kafka's approach allows writes to succeed as long as any ISR member is alive (including just the leader), which gives higher availability. The trade-off is that if all ISR members simultaneously fail, some committed data may be lost unless unclean election is disabled.

Invariants to Maintain

The leader is always in the ISR: lag of the leader is 0 by definition. Never remove the leader from its own ISR.
HW never exceeds the leader's log end offset: The leader cannot have committed more data than it has written.
HW is monotonically non-decreasing: The high watermark only advances forward. It never moves backward, even if a follower temporarily loses progress.
A replica joining the ISR does not lower the HW: A newly re-admitted replica might have a low fetch offset, but this should not retract the HW — the HW only considers current ISR members, and the new member's position reflects where it currently is, which may lower the HW temporarily as it catches up.

Track the In-Sync Replica Set and High Watermark