ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/projects/mini-kafka/tasks/kafka-t1-s1-offset-index

Build a Sparse Offset Index for O(log n) Seeks

Mini-Kafka / Partition Log
intermediate

Concept

[Figure: sparse offset index with interval=4. A log of 8 messages (off=0 through off=7) yields an index of 2 entries (N/interval): (off=0, pos=0) and (off=4, pos=4). Seeking offset=6: (1) binary search the index for the largest entry with off <= 6, landing on (off=4, pos=4); (2) scan forward from pos=4 through off=4, off=5, off=6 and return log[6]. Binary search O(log(N/interval)) + linear scan O(interval) = O(log n) total.]

Sparse Index Design

As a Kafka partition grows to millions of messages, a consumer seeking to a specific offset cannot afford to scan from the beginning. Kafka's solution is a sparse offset index: a compact sorted array of (offset, file_position) pairs sampled every N messages. A seek uses binary search on this small array to land near the target, then scans forward by at most N messages to find the exact record.

Why Sparse Rather Than Dense?

A dense index — one entry per message — grows linearly with the log: a segment holding millions of messages would need millions of index entries, far too many to pin in memory for every partition on a broker. The sparse design makes an explicit trade-off:

Index density                       Index size       Max scan per seek    Memory cost
Every message (dense)               O(N)             O(1)                 Very high
Every 100 messages                  O(N/100)         O(100) messages      Moderate
Every 4096 bytes (Kafka default)    O(bytes/4096)    O(4 KB) of data      Fits in RAM

In production Kafka, the .index file for a 1 GB log segment contains only around 250,000 entries (one per 4 KB of data). The entire index for a typical broker fits in memory, making seeks extremely fast.
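The sizing claim above is easy to verify with back-of-envelope arithmetic. The sketch below assumes a 1 GiB segment, one index entry per 4096 bytes, and 8 bytes per entry (Kafka's .index format stores a 4-byte relative offset plus a 4-byte file position):

```python
# Back-of-envelope sizing for a sparse index (assumed parameters:
# 1 GiB segment, entry every 4096 bytes, 8 bytes per entry).
segment_bytes = 1 * 1024**3        # 1 GiB log segment
interval_bytes = 4096              # one index entry per 4 KB of data
entry_bytes = 8                    # 4-byte relative offset + 4-byte position

entries = segment_bytes // interval_bytes   # number of index entries
index_bytes = entries * entry_bytes         # total index size on disk

print(entries)       # 262144 (≈ 250,000)
print(index_bytes)   # 2097152 bytes ≈ 2 MiB
```

A ~2 MiB index per 1 GiB segment is why the whole index comfortably stays memory-resident.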

The Two-Phase Seek Algorithm

Every seek involves exactly two phases:

def seek(target_offset):
    # Phase 1: Binary search the sparse index
    #   Find the largest indexed_offset that is <= target_offset
    lo, hi = 0, len(index) - 1
    best_pos = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if index[mid].offset <= target_offset:
            best_pos = index[mid].position
            lo = mid + 1
        else:
            hi = mid - 1

    # Phase 2: Linear scan from best_pos
    #   Offsets are sorted, so stop as soon as we pass the target;
    #   the scan therefore covers at most `interval` records.
    for pos in range(best_pos, len(log)):
        if log[pos].offset == target_offset:
            return log[pos].message
        if log[pos].offset > target_offset:
            break
    return "ERROR: offset not found"

The binary search runs in O(log(N/interval)). The linear scan runs in at most O(interval) steps. Together they give sub-millisecond seeks even on logs with hundreds of millions of messages.
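Phase 1 is the classic "rightmost entry <= target" search, which Python's stdlib bisect module can express directly. A sketch, assuming the index is a sorted list of (offset, position) tuples (the helper name find_floor_entry is illustrative):

```python
import bisect

def find_floor_entry(index, target_offset):
    """Return the largest (offset, position) entry with offset <= target_offset."""
    # bisect_right over the offsets gives the insertion point just past
    # any entry equal to target_offset; the floor entry sits one slot left.
    offsets = [off for off, _ in index]
    i = bisect.bisect_right(offsets, target_offset)
    if i == 0:
        return None  # target precedes the first indexed offset
    return index[i - 1]

# Index sampled every 4 messages; position == offset in this in-memory model.
index = [(0, 0), (4, 4), (8, 8)]
print(find_floor_entry(index, 6))   # (4, 4)
print(find_floor_entry(index, 8))   # (8, 8)
```

Rebuilding the offsets list on every call is O(N/interval); a real implementation would keep the offsets in their own array (as Kafka's mmap'd .index file effectively does) so only the O(log) search remains.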

Index Construction During Append

The index is built incrementally as messages are appended — there is no separate indexing pass. The rule is simple: if offset % interval == 0, append an entry to the index. This means the index always reflects exactly the messages in the log, with no staleness.

def append(self, message):
    offset = len(self.messages)
    self.messages.append(message)
    if offset % self.interval == 0:
        # In this in-memory model the list index doubles as the file
        # position, so indexed_offset == position.
        self.index.append((offset, offset))
    return offset
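Putting append and seek together gives a complete, runnable in-memory version. A minimal sketch — the class name SparseLog and interval=4 are illustrative, not part of the task's required API:

```python
class SparseLog:
    def __init__(self, interval=4):
        self.interval = interval
        self.messages = []   # position == offset in this in-memory model
        self.index = []      # sparse, sorted list of (offset, position)

    def append(self, message):
        offset = len(self.messages)
        self.messages.append(message)
        if offset % self.interval == 0:
            self.index.append((offset, offset))
        return offset

    def seek(self, target_offset):
        # Out-of-range and empty-log seeks fail before touching the index.
        if not 0 <= target_offset < len(self.messages):
            return "ERROR: offset not found"
        # Phase 1: binary search for the largest indexed offset <= target.
        lo, hi = 0, len(self.index) - 1
        best_pos = 0
        while lo <= hi:
            mid = (lo + hi) // 2
            if self.index[mid][0] <= target_offset:
                best_pos = self.index[mid][1]
                lo = mid + 1
            else:
                hi = mid - 1
        # Phase 2: scan forward at most `interval` records. Since
        # position == offset here, the scan compares positions directly.
        for pos in range(best_pos, len(self.messages)):
            if pos == target_offset:
                return self.messages[pos]
        return "ERROR: offset not found"

log = SparseLog(interval=4)
for i in range(10):
    log.append(f"msg-{i}")
print(log.index)      # [(0, 0), (4, 4), (8, 8)]
print(log.seek(6))    # msg-6
print(log.seek(42))   # ERROR: offset not found
```

Ten appends with interval=4 produce three index entries — exactly N/interval rounded up — and every seek touches at most four records after the binary search.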

Why Kafka Uses This Approach

Kafka retains messages for days or weeks. Without an index, a consumer resuming from offset 50,000,000 in a partition with 100,000,000 messages would need to scan half the log just to find its starting position. The sparse index reduces this to a binary search over a few thousand entries followed by a scan of at most a few thousand bytes — regardless of total log size.

Segment Rollover and Multiple Index Files

In production Kafka, the log is divided into segments. Each segment has its own .log file and a corresponding .index file covering only offsets within that segment. When a segment reaches its maximum size (default 1 GB), Kafka closes it and starts a new one. Old segments can be deleted, archived, or compacted independently without affecting active writes. This design means no single file ever grows unboundedly, and old indexes can be evicted from memory when not in use.
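Routing a seek to the right segment is itself a floor search, one level above the per-segment index. A hedged sketch, assuming segments are named by their base offset as Kafka names its files (e.g. 00000000000050000000.log); the list below is hypothetical data:

```python
import bisect

# Hypothetical broker state: base offsets of the partition's segments,
# kept sorted (Kafka derives these from the segment file names).
segment_base_offsets = [0, 1_000_000, 2_000_000, 3_000_000]

def segment_for(target_offset):
    """Index of the segment whose offset range contains target_offset."""
    # Floor search: rightmost base offset <= target_offset.
    return bisect.bisect_right(segment_base_offsets, target_offset) - 1

print(segment_for(0))            # 0
print(segment_for(1_500_000))    # 1
print(segment_for(3_000_000))    # 3
```

Only the chosen segment's .index file needs to be consulted (and paged in), which is what lets cold segments fall out of memory.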

Edge Cases to Handle

  • Seeking offset 0: The index always has an entry at offset 0. The scan starts at position 0 and terminates immediately.
  • Seeking the last offset (TAIL - 1): The binary search lands on the nearest indexed entry before it, then scans forward to the last message.
  • Seeking an offset greater than or equal to TAIL: No entry exists — return ERROR: offset not found before even consulting the index.
  • Empty log: No index entries exist. Any seek on an empty log is immediately an error.
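The last two cases reduce to a single bounds check that runs before the index lookup. A sketch, where tail is the next offset to be assigned (the helper name validate_seek is illustrative):

```python
def validate_seek(target_offset, tail):
    # tail == number of messages appended so far, so valid offsets are
    # 0 .. tail - 1. An empty log has tail == 0, rejecting every seek.
    return 0 <= target_offset < tail

print(validate_seek(0, 8))    # True  — offset 0 on a non-empty log
print(validate_seek(7, 8))    # True  — last offset (TAIL - 1)
print(validate_seek(8, 8))    # False — offset >= TAIL
print(validate_seek(0, 0))    # False — empty log
```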