The Store: Building a Linearizable Key-Value Store on Raft
How replicated key-value stores work above the consensus layer, the quorum model for reads and writes, why linearizable reads are more complex than they look, and how etcd and DynamoDB make fundamentally different consistency choices.
From Log to Store
Raft gives you a replicated log. Every node agrees on the same sequence of entries, and those entries are durable across failures. But a log is not a useful application interface on its own. Users want to put a key, get a key, delete a key. Building a key-value store on top of Raft requires connecting the consensus layer to a state machine that interprets the log entries as operations.
The basic architecture is straightforward. When a client issues a write, the request goes to the Raft leader. The leader wraps the operation (say, PUT x=5) in a log entry, appends it, and replicates it to followers. When a majority of nodes have acknowledged the entry, it is committed. The leader then applies the entry to the state machine (which maintains the actual key-value map) and returns the result to the client.
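A minimal sketch of that write path, assuming a Raft layer that exposes is_leader() and a blocking propose_and_wait() helper (both names are illustrative, not a real library's API):

class KVStateMachine:
    def __init__(self):
        self.data = {}

    def apply(self, entry):
        if entry['op'] == 'PUT':
            self.data[entry['key']] = entry['value']
        elif entry['op'] == 'DELETE':
            self.data.pop(entry['key'], None)

    def get(self, key):
        return self.data.get(key)

class KVServer:
    def __init__(self, raft):
        self.raft = raft                       # consensus layer (assumed interface)
        self.state_machine = KVStateMachine()  # the actual key-value map

    def put(self, key, value):
        # Writes must go through the leader; followers redirect the client.
        if not self.raft.is_leader():
            return {'error': 'not_leader'}
        entry = {'op': 'PUT', 'key': key, 'value': value}
        # Blocks until a majority of nodes has acknowledged (committed) the entry.
        self.raft.propose_and_wait(entry)
        # Apply the committed entry to the state machine, then respond.
        self.state_machine.apply(entry)
        return {'ok': True}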
Read requests are where the interesting consistency tradeoffs live.
The Quorum Model
The quorum model generalizes beyond the Raft-specific case and underpins the design of many distributed storage systems. The key insight is captured in a simple inequality.
Given N replicas, a write quorum W (how many must acknowledge a write), and a read quorum R (how many must respond to a read), the system provides strong consistency when W + R > N.
With N=3, W=2, R=2: any write is acknowledged by at least 2 replicas. Any read contacts at least 2 replicas. Those two sets must overlap by at least one node. That one overlapping node has the most recent write. As long as you take the highest-versioned response among your read quorum, you will see the latest write.
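As a sketch of how a client-side quorum read uses that overlap (not how any particular system implements it; the replica.get interface and the version field are assumptions here):

def quorum_read(replicas, key, R=2):
    # Ask replicas in turn and stop once R have answered (synchronous for clarity).
    responses = []
    for replica in replicas:
        resp = replica.get(key)   # assumed to return {'value': ..., 'version': ...} or None
        if resp is not None:
            responses.append(resp)
        if len(responses) >= R:
            break
    if len(responses) < R:
        raise TimeoutError('could not reach a read quorum')
    # The overlap with the write quorum guarantees at least one response carries
    # the latest committed version, so take the highest-versioned one.
    return max(responses, key=lambda r: r['version'])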
Amazon's Dynamo paper popularized this model; its commonly cited configuration is N=3, W=2, R=2. The coordinator sends writes to all 3 replicas and waits for 2 acknowledgments; reads contact all 3 and wait for 2 responses. The paper calls this "eventual consistency," but with quorum overlap it is stronger than naive eventual consistency: as long as you contact a quorum of replicas, you will always see any write that a quorum has acknowledged.
DynamoDB offers both strongly consistent and eventually consistent reads; the eventually consistent reads (effectively R=1) are faster because they need only one response, but they may return stale data if the queried replica has not yet received recent writes.
The same inequality applies to Raft, though Raft's implementation makes it less explicit. Raft's write quorum is a majority (at least ⌊N/2⌋ + 1 nodes). Raft's read quorum, for the ReadIndex optimization, is also a majority: the leader confirms it is still the leader by getting acknowledgment from a majority before serving a read. These two majorities are guaranteed to overlap by at least one node.
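The overlap argument is simple arithmetic: two majorities of the same N nodes together count more than N members, so by the pigeonhole principle they must share at least one node. A quick check:

def majority(n):
    return n // 2 + 1

# Two majorities always overlap: 2 * (n // 2 + 1) >= n + 1 > n.
for n in (3, 5, 7):
    assert majority(n) + majority(n) > n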
Why Reading from the Leader Is Not Obviously Safe
You might think that if you are the Raft leader, you can just read your local state machine. You have all the committed entries. What could go wrong?
The problem is network partitions. Suppose you have 5 nodes: n1, n2, n3, n4, n5. n1 is the leader. A network partition splits the cluster: n1 and n2 are on one side, n3, n4, and n5 on the other. n1 cannot reach the majority (3 nodes), so it cannot commit new log entries. On the other side, n3, n4, and n5 can elect a new leader, say n3.
Now two leaders exist. n1 is a stale leader; it just does not know it yet. If n1 serves reads from its local state, it might serve state that was current before the partition but that a client has since overwritten via n3. Say x=1 before the partition. A client writes x=2 through n3 (which commits successfully with the n3, n4, n5 majority), then reads from n1, which still shows x=1. This violates linearizability.
The ReadIndex protocol addresses this directly. Before serving a read, the leader must confirm that it is still the leader. It does this by broadcasting heartbeats and waiting for a majority to respond. If it receives majority acknowledgment, it is definitely still the leader (the other partition would need a majority to elect a new leader, and if the current leader already has a majority, the other side cannot). Then it records the current commit index, waits until the state machine has applied up to that index, and executes the read.
This means that a read in a Raft-based system might not be instant. If the state machine is lagging behind the commit index (it is applying entries asynchronously), the read must wait. And the heartbeat round adds latency. For a strongly consistent read, you are paying for a network round trip's worth of latency even if you go to the leader directly.
def read_via_index(self, key):
    if not self.raft.is_leader():
        return {'error': 'not_leader'}
    # Capture current commit index
    read_index = self.raft.get_commit_index()
    # Confirm we are still leader (heartbeat round)
    if not self._confirm_leadership():
        return {'error': 'lost_leadership'}
    # Wait for state machine to apply up to read_index
    if not self._wait_applied(read_index):
        return {'error': 'timeout'}
    # Now read from local state machine
    return self.state_machine.get(key)
A simpler but slower approach is to treat reads like writes: log the read operation, wait for it to commit, then execute it. This is correct but wasteful. You are using Raft's replication machinery just to confirm you are the leader.
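If you do take that route, a sketch looks like this, reusing the hypothetical propose_and_wait and _wait_applied helpers from the earlier sketches (propose_and_wait is assumed here to return the committed entry's index):

def read_via_log(self, key):
    # Commit a no-op GET entry through the log, then serve the value once the
    # state machine has applied everything up to (and including) that entry.
    entry = {'op': 'GET', 'key': key}
    index = self.raft.propose_and_wait(entry)   # blocks until committed
    self._wait_applied(index)
    return self.state_machine.get(key)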
Lease-Based Reads
Some systems use leader leases to avoid the heartbeat round on reads. The leader tracks when it last sent successful heartbeats to a majority. If that was less than one election timeout ago, it knows it is still the leader: no other node could have been elected without first timing out the current leader's heartbeat, and that timeout has not elapsed yet.
A lease-based read looks like: if current time minus last successful heartbeat is less than election timeout, serve the read locally with no network round trip.
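A sketch of that check, assuming the leader records self.last_quorum_heartbeat (a monotonic-clock timestamp) each time a heartbeat round reaches a majority, and that self.election_timeout holds the configured timeout (both fields are illustrative):

import time

def read_with_lease(self, key):
    # If a majority acknowledged us within the last election timeout, no other
    # node can have won an election yet, so a purely local read is safe.
    lease_valid = (time.monotonic() - self.last_quorum_heartbeat) < self.election_timeout
    if self.raft.is_leader() and lease_valid:
        self._wait_applied(self.raft.get_commit_index())
        return self.state_machine.get(key)
    # Lease expired (or we are not the leader): fall back to the ReadIndex path.
    return self.read_via_index(key)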
This is faster, but it relies on bounded clock skew. If node A's clock runs fast relative to node B's clock, and A believes it still has a valid lease while B has already started a new election (from B's perspective, the lease expired), A might serve stale reads. In practice, lease-based reads require that your nodes' clocks are synchronized to within a small bound, and the election timeout must exceed the maximum possible clock skew.
Google Spanner uses TrueTime, which provides explicit bounds on clock uncertainty. When a transaction commits, Spanner waits out the clock uncertainty bound before reporting the transaction as committed. This "commit wait" ensures that causally later transactions, even across nodes with different clocks, will always see causally earlier ones. It is an elegant solution that comes with a real latency cost (tens of milliseconds of wait on writes) in exchange for the ability to safely use lease-based reads globally.
etcd Versus DynamoDB: A Study in Different Tradeoffs
etcd and DynamoDB are both distributed key-value stores, but they make diametrically opposite consistency choices.
etcd is the storage system for Kubernetes cluster state. Every pod specification, every service endpoint, every configuration value lives in etcd. The correctness requirement for Kubernetes is strict: if you write a pod spec and then immediately read it, you must see your write. If two API servers race to update a resource, one of them must win cleanly, with no partial updates. etcd uses Raft and provides linearizable reads by default: every read goes through the ReadIndex protocol or the full log path. (etcd does offer an opt-in "serializable" read mode that can return stale data, but it is not the default.) This is correct and appropriate for cluster coordination, where stale reads could cause a controller to make wrong decisions.
The cost is that etcd is not fast in the throughput sense. It is designed for relatively low write rates (thousands per second, not millions) and strongly consistent reads. The Kubernetes use case fits: cluster state changes are infrequent compared to user-facing traffic, but consistency is critical.
DynamoDB is designed for applications where you want millions of requests per second across a global fleet. It makes eventual consistency the default for reads and charges extra (in latency and cost) for strongly consistent reads. The quorum model allows DynamoDB to spread read traffic across replicas, increasing read throughput horizontally. A DynamoDB table can serve millions of reads per second by adding more replicas and spreading reads across them.
The DynamoDB design accepts that some reads might be slightly stale. For many web applications, this is the right tradeoff. A user adding an item to a shopping cart does not need to see that addition reflected within the same millisecond on every server they hit. If they refresh the page a second later and the item is there, the user is happy.
Where these systems differ is not in their correctness, but in what they optimize for. etcd optimizes for correctness guarantees that simplify distributed coordination. DynamoDB optimizes for throughput and global availability, accepting that application developers must reason about eventual consistency.
Client Sessions and Deduplication
One issue that arises in building a linearizable KV store is retry safety. Suppose a client sends a PUT request to the leader; the leader commits the entry and applies it to the state machine, but crashes before sending the response to the client. The client times out and retries. The retry goes to the new leader. If the new leader applies the PUT again, the operation executes twice.
For idempotent operations (GET, PUT with the same value), this is harmless. For non-idempotent operations (increment a counter, append to a list), duplicate execution causes incorrect results.
The standard solution is client sessions with sequence numbers. Each client has a unique client ID. Each request has a monotonically increasing sequence number. The server tracks, for each client, the highest sequence number it has processed and the result of that operation.
When a new request arrives, the server checks: have I already processed this (client_id, sequence_number) pair? If yes, return the cached result. If no, process it and cache the result.
The deduplication state must be stored in the Raft log, not just in memory, so it survives leader failover. When a new leader is elected and a client retries its request, the new leader can check the replicated deduplication table and return the cached result if the operation was already committed.
def handle_request(self, client_id, seq, operation):
    # Duplicate? Return the cached result instead of re-executing.
    is_dup, cached = self.tracker.check(client_id, seq)
    if is_dup:
        return cached
    # Not seen before: run the operation through Raft, then remember the result.
    # (In a full implementation the (client_id, seq) pair rides along in the log
    # entry so followers update their trackers when they apply it.)
    result = self.submit_to_raft_and_wait(operation)
    self.tracker.record(client_id, seq, result)
    return result
Periodically, you garbage-collect old session entries. If a client has not made a request in some time, its session entry can be removed. This requires that clients do not retry requests indefinitely and that the garbage collection window is larger than the client's maximum retry period.
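A minimal expiry sweep over that state might look like the following; session_timeout, last_active, and remove are illustrative names, not part of any particular library:

def gc_sessions(self, now):
    # Drop sessions that have been idle longer than the timeout. The timeout
    # must exceed the longest window over which a client will still retry.
    expired = [client_id
               for client_id, last_seen in self.tracker.last_active.items()
               if now - last_seen > self.session_timeout]
    for client_id in expired:
        self.tracker.remove(client_id)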
The Cost of Linearizability
Linearizability is not free. Every strongly consistent read requires either a log entry (slow, burns Raft's replication capacity) or a heartbeat round (one network round trip plus the apply-lag wait). Write latency includes replication to a majority, so every write takes at least one round trip to the slowest of the followers needed to complete that majority.
For a three-node cluster with nodes in the same datacenter, this adds maybe 1-2 milliseconds to every write. For a five-node cluster spanning datacenters across the US, a majority means the leader plus two followers, and if those followers are on different coasts, you are looking at 50-100 ms of latency on every write.
This is why systems like Apache Cassandra make eventual consistency the default. Cassandra uses a tunable consistency model: you can request quorum reads and writes (strongly consistent, slower) or single-replica reads and writes (eventually consistent, fast). Whether a given combination is strongly consistent follows from the same W + R > N rule, applied to your replication factor and the consistency levels you pick. You choose the consistency you need per operation.
The fundamental constraint is that in a distributed system, if you want strong consistency, you are bounded by the speed of communication between nodes. Any read that must consult a majority cannot complete faster than the network allows. Eventual consistency breaks this constraint by not requiring majority confirmation: reads can be served immediately from any replica, and writes can be acknowledged as soon as they are written locally. The price is that different clients might see different values for the same key at the same time.
For building your store in this track, you are implementing the hard version: full linearizability. Every read must observe every write that completed before the read began. Every write must be committed by a majority before it is acknowledged. The implementation is more complex, but the resulting system has the strongest possible guarantees. Understanding what it takes to build this correctly makes you much better positioned to judge when weaker guarantees are acceptable for a specific application.
Build it yourself
Reading about distributed systems is useful. Building them is how you actually learn.
Start The Store track