Concept
Sparse Index Design
As a Kafka partition grows to millions of messages, a consumer seeking to a specific offset cannot afford to scan from the beginning. Kafka's solution is a sparse offset index: a compact sorted array of (offset, file_position) pairs sampled every N messages. A seek uses binary search on this small array to land near the target, then scans forward by at most N messages to find the exact record.
Why Sparse Rather Than Dense?
A dense index, with one entry per message, would grow linearly with the log: even at 8 bytes per entry, a partition holding billions of messages would carry gigabytes of index, too large to keep pinned in memory. The sparse design makes an explicit trade-off:
| Index density | Index size | Max scan per seek | Memory cost |
|---|---|---|---|
| Every message (dense) | O(N) | O(1) | Very high |
| Every 100 messages | O(N/100) | O(100) | Moderate |
| Every 4096 bytes (Kafka default) | Very small | At most 4 KB of log data | Fits in RAM |
In production Kafka, the .index file for a 1 GB log segment contains only around 250,000 entries (one per 4 KB of data). The entire index for a typical broker fits in memory, making seeks extremely fast.
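The sizing claim is easy to check with back-of-the-envelope arithmetic, assuming Kafka's 8-byte offset-index entries (a 4-byte relative offset plus a 4-byte file position):

```python
import math

SEGMENT_BYTES = 1 << 30   # 1 GiB log segment
INTERVAL_BYTES = 4096     # one index entry per 4 KB of log data
ENTRY_BYTES = 8           # 4-byte relative offset + 4-byte file position

entries = SEGMENT_BYTES // INTERVAL_BYTES
index_bytes = entries * ENTRY_BYTES
comparisons = math.ceil(math.log2(entries))

print(entries)                        # 262144 entries per full segment
print(index_bytes // (1024 * 1024))   # 2 MB of index per 1 GB of log
print(comparisons)                    # at most 18 binary-search comparisons
```

Two megabytes of index per gigabyte of log is a 0.2% overhead, which is why keeping every active index in memory is practical.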
The Two-Phase Seek Algorithm
Every seek involves exactly two phases:
```python
def seek(target_offset):
    # Phase 1: Binary search the sparse index.
    # Find the largest indexed_offset that is <= target_offset.
    lo, hi = 0, len(index) - 1
    best_pos = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if index[mid].offset <= target_offset:
            best_pos = index[mid].position
            lo = mid + 1
        else:
            hi = mid - 1

    # Phase 2: Linear scan from best_pos.
    # Walk forward until we hit the exact target_offset.
    for pos in range(best_pos, len(log)):
        if log[pos].offset == target_offset:
            return log[pos].message
    return "ERROR: offset not found"
```
The binary search runs in O(log(N/interval)). The linear scan runs in at most O(interval) steps. Together they give sub-millisecond seeks even on logs with hundreds of millions of messages.
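Phase 1 can also be written with Python's standard `bisect` module instead of a hand-rolled loop (a sketch, assuming the index is a sorted list of `(offset, position)` pairs that always contains an entry for offset 0; the function name is illustrative):

```python
import bisect

def find_scan_start(index, target_offset):
    """Phase 1 via the standard library: return the file position of the
    largest indexed offset <= target_offset."""
    offsets = [offset for offset, _ in index]
    # bisect_right returns the insertion point past any equal offset,
    # so the entry we want sits one slot to the left.
    i = bisect.bisect_right(offsets, target_offset) - 1
    return index[i][1]

index = [(0, 0), (100, 4096), (200, 8210)]
print(find_scan_start(index, 150))   # 4096: scan forward from there
```

Leaning on `bisect` keeps the off-by-one logic in one well-tested place, which is the usual choice in production Python code.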
Index Construction During Append
The index is built incrementally as messages are appended — there is no separate indexing pass. The rule is simple: if offset % interval == 0, append an entry to the index. This means the index always reflects exactly the messages in the log, with no staleness.
```python
def append(self, message):
    offset = len(self.messages)
    self.messages.append(message)
    if offset % self.interval == 0:
        # In this toy model a record's offset doubles as its file position.
        self.index.append((offset, offset))  # (indexed_offset, position)
    return offset
```
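The append and seek snippets can be combined into one runnable toy (a self-contained sketch; the class name `SparseLog` is illustrative, and list indices stand in for file positions):

```python
class SparseLog:
    """Toy log with a sparse offset index. Messages live in a Python
    list, so a record's offset doubles as its 'file position'."""

    def __init__(self, interval=100):
        self.messages = []
        self.index = []            # (indexed_offset, position) pairs
        self.interval = interval

    def append(self, message):
        offset = len(self.messages)
        self.messages.append(message)
        if offset % self.interval == 0:
            self.index.append((offset, offset))
        return offset

    def seek(self, target_offset):
        # Phase 1: binary search for the largest indexed offset <= target.
        lo, hi = 0, len(self.index) - 1
        best_pos = 0
        while lo <= hi:
            mid = (lo + hi) // 2
            if self.index[mid][0] <= target_offset:
                best_pos = self.index[mid][1]
                lo = mid + 1
            else:
                hi = mid - 1
        # Phase 2: scan forward at most `interval` records.
        for pos in range(best_pos, len(self.messages)):
            if pos == target_offset:
                return self.messages[pos]
        return "ERROR: offset not found"


log = SparseLog(interval=100)
for i in range(1000):
    log.append(f"msg-{i}")

print(log.seek(537))    # msg-537, found by scanning forward from offset 500
print(log.seek(5000))   # ERROR: offset not found
```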
Why Kafka Uses This Approach
Kafka retains messages for days or weeks. Without an index, a consumer resuming from offset 50,000,000 in a partition with 100,000,000 messages would need to scan half the log just to find its starting position. The sparse index reduces this to a binary search over the in-memory index followed by a scan of at most one index interval, regardless of total log size.
Segment Rollover and Multiple Index Files
In production Kafka, the log is divided into segments. Each segment has its own .log file and a corresponding .index file covering only offsets within that segment. When a segment reaches its maximum size (default 1 GB), Kafka closes it and starts a new one. Old segments can be deleted, archived, or compacted independently without affecting active writes. This design means no single file ever grows unboundedly, and old indexes can be evicted from memory when not in use.
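Picking the right segment is itself a "largest base offset <= target" lookup, the same shape as Phase 1. A hypothetical sketch (the segment list and function name are illustrative; Kafka names each segment file after its base offset, e.g. `00000000000000000000.log`):

```python
import bisect

# Sorted base offsets of the partition's segments.
segment_base_offsets = [0, 1_000_000, 2_000_000, 3_000_000]

def pick_segment(target_offset):
    """Return the base offset of the segment containing target_offset:
    the largest base offset that is <= target_offset."""
    i = bisect.bisect_right(segment_base_offsets, target_offset) - 1
    return segment_base_offsets[i]

print(pick_segment(2_500_000))   # 2000000: continue in that segment's .index
```

A full seek is therefore three cheap steps: pick the segment, binary-search that segment's sparse index, then scan forward within one interval.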
Edge Cases to Handle
- Seeking offset 0: The index always has an entry at offset 0. The scan starts at position 0 and terminates immediately.
- Seeking the last offset (TAIL - 1): The binary search lands on the nearest indexed entry before it, then scans forward to the last message.
- Seeking an offset greater than or equal to TAIL: No such record exists. Return `ERROR: offset not found` before even consulting the index.
- Empty log: No index entries exist. Any seek on an empty log is immediately an error.
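All four cases can be exercised directly (a self-contained sketch that restates the toy two-phase seek, including the early guard for out-of-range targets, so it runs on its own):

```python
def seek(messages, index, target_offset):
    """Compact two-phase seek for edge-case testing; offsets double as
    positions because the log is a plain list."""
    if target_offset >= len(messages):   # also covers the empty log
        return "ERROR: offset not found"
    lo, hi, best_pos = 0, len(index) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if index[mid][0] <= target_offset:
            best_pos = index[mid][1]
            lo = mid + 1
        else:
            hi = mid - 1
    for pos in range(best_pos, len(messages)):
        if pos == target_offset:
            return messages[pos]
    return "ERROR: offset not found"

msgs = [f"m{i}" for i in range(250)]          # TAIL = 250
idx = [(o, o) for o in range(0, 250, 100)]    # entries at 0, 100, 200

assert seek(msgs, idx, 0) == "m0"        # offset 0: immediate hit
assert seek(msgs, idx, 249) == "m249"    # TAIL - 1: scan forward from 200
assert seek(msgs, idx, 250) == "ERROR: offset not found"   # >= TAIL
assert seek([], [], 0) == "ERROR: offset not found"        # empty log
print("all edge cases pass")
```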