ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/projects/mini-kafka/tasks/kafka-t1-s1-append-log

Implement an Append-Only Partition Log

Mini-Kafka / Partition Log (beginner)

Concept

[Diagram: an append-only partition log holding records at offsets 0 through 3 ("hello", "world", "kafka", "stream") with TAIL=4. A producer appends at the tail; consumer A reads at offset 0 and consumer B at offset 3. Follower replicas (ISR) replica-1, replica-2, and replica-3 mirror the same log at their own replication lag.]

The Partition Log

Most databases store data in mutable structures — B-trees, hash maps — where updates overwrite old values. Kafka takes the opposite approach: the log is the database. All writes are appends. Nothing is ever overwritten. A partition is simply an ordered, immutable sequence of records, each addressed by a monotonically increasing integer called an offset.
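To make this concrete, here is a minimal in-memory sketch of such a log in Python. The class name PartitionLog and its method names are illustrative assumptions for this lesson, not the exact API the task requires.

  class PartitionLog:
      """Minimal in-memory append-only log for one partition (illustrative sketch)."""

      def __init__(self):
          self._records = []                # records kept in append order, never mutated

      def append(self, record) -> int:
          """Append a record and return the offset assigned to it."""
          offset = len(self._records)       # next offset == number of records written so far
          self._records.append(record)
          return offset

      def read(self, offset: int):
          """Return the record stored at a previously assigned offset."""
          if offset < 0 or offset >= self.tail:
              raise IndexError(f"offset {offset} out of range (TAIL={self.tail})")
          return self._records[offset]

      @property
      def tail(self) -> int:
          """Next offset to be assigned; equals the number of records written."""
          return len(self._records)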

Why Append-Only?

The append-only constraint is not an arbitrary restriction — it is the source of almost every performance and correctness property Kafka provides:

  • Sequential I/O: Modern disks — both spinning and SSD — deliver dramatically higher throughput for sequential writes than random writes. Kafka consistently reaches near-hardware-limit disk throughput because it never seeks backward for writes. In benchmarks, a single Kafka broker can sustain hundreds of MB/s of throughput on commodity hardware.
  • Deterministic replay: Because the log is immutable, any consumer can seek to any past offset and re-read exactly what was written. This is the foundation of event sourcing, stream processing re-computation, and consumer group rebalancing — all of which depend on being able to re-process history (see the sketch after this list).
  • Zero-copy reads: Kafka uses the sendfile(2) system call to transfer bytes from the log file directly to the network socket, bypassing the kernel-to-userspace-to-kernel copy cycle and sharply reducing CPU and memory-bandwidth cost for high-throughput reads.
  • Concurrent readers at no cost: Because the log is read-only from the consumer's perspective, any number of consumers can read the same partition simultaneously without coordination, locking, or any read amplification.
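A short usage sketch of the deterministic-replay and concurrent-reader points, reusing the illustrative PartitionLog class above:

  log = PartitionLog()
  for word in ["hello", "world", "kafka", "stream"]:
      log.append(word)                      # offsets 0..3 assigned in order, TAIL becomes 4

  # Two independent readers; each tracks only its own next offset to read.
  offset_a, offset_b = 0, 3
  print(log.read(offset_a))                 # "hello"
  print(log.read(offset_b))                 # "stream"
  print(log.read(offset_a))                 # re-reading offset 0 returns exactly the same record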

Offsets: The Permanent Address of Every Message

Every message in a Kafka partition has exactly one offset — an immutable, monotonically increasing integer assigned at write time. Offsets are:

  • Assigned by the leader broker, starting at 0 for each partition
  • Never reused, even if the message is deleted during log compaction
  • Used by consumers as a bookmark — a consumer persists its offset and resumes from there on restart
  • The basis for the high watermark — the maximum offset visible to consumers (covered in Track 2)

The TAIL of the log is the next offset to be assigned — it equals the total number of messages written so far. A newly created partition has TAIL=0.
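With the illustrative PartitionLog sketch from above, this relationship can be checked directly:

  log = PartitionLog()
  assert log.tail == 0                      # newly created partition: TAIL=0

  for i, msg in enumerate(["a", "b", "c"]):
      assert log.append(msg) == i           # offsets assigned 0, 1, 2 in order

  assert log.tail == 3                      # TAIL == total messages written so far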

Multiple Consumers, One Log

A critical insight: Kafka does not delete messages when they are consumed. Each consumer group maintains its own offset pointer, independently tracking where it is in the log. Two consumer groups reading the same topic do not interfere. A slow consumer does not block a fast one. This decoupling is what allows Kafka to serve as both a real-time event bus and a historical data store simultaneously.

Partition 0 — retained for 7 days:
  offset 0: {"event":"login","user":"alice"}
  offset 1: {"event":"purchase","user":"bob"}
  offset 2: {"event":"login","user":"carol"}   <-- TAIL=3

Consumer group "fraud-detector":  committed offset = 3  (fully caught up)
Consumer group "analytics":       committed offset = 1  (2 messages behind)
Consumer group "audit-log":       committed offset = 0  (needs full replay)

Why Kafka Uses This Approach

Traditional message queues (RabbitMQ, ActiveMQ) delete messages after delivery to a consumer. This works for task queues but creates problems at scale:

  • You cannot add a new consumer and have it process historical data
  • A slow consumer creates backpressure that affects all other consumers
  • The broker must track per-message acknowledgement state, which is expensive

Kafka's log-centric model shifts the complexity to the consumer: each consumer is responsible for its own offset. The broker's job is simply to store and serve the log efficiently.

Invariants to Maintain

  • Offset monotonicity: The offset assigned to message N+1 must always be greater than that of message N.
  • Immutability: Once written, a message's content and offset must never change.
  • TAIL = number of messages written: On an empty log, TAIL=0. After appending N messages, TAIL=N.
  • READ out-of-range is an error: Accessing an offset that does not yet exist (>= TAIL) or is negative must return an error, not a crash (exercised in the sketch below).
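These invariants translate directly into checks against the illustrative PartitionLog sketch from earlier:

  log = PartitionLog()
  first = log.append({"event": "login", "user": "alice"})
  second = log.append({"event": "purchase", "user": "bob"})

  assert second > first                                            # offset monotonicity
  assert log.read(first) == {"event": "login", "user": "alice"}    # content at an offset never changes
  assert log.tail == 2                                             # TAIL == messages written

  # Out-of-range reads must fail cleanly, not crash or return garbage.
  for bad in (-1, log.tail, log.tail + 10):
      try:
          log.read(bad)
      except IndexError as err:
          print(f"read({bad}) rejected: {err}")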