Track 07: The Store

Advanced · Consensus · 15 tasks

Build a linearizable key-value store on top of your Raft implementation. Handle client requests, manage sessions, ensure read consistency, and implement client retry logic.

Subtracks & Tasks

Interview Prep

Common interview questions for Distributed Systems / Backend Engineer roles that map directly to what you build in this track.

How would you design a distributed key-value store that stays consistent when the network partitions?

Model Answer

Use a Raft-replicated state machine: all writes go through the leader, which replicates to a majority before acknowledging. Leader reads are linearizable provided the leader confirms it still holds leadership, either by committing a no-op through the log or via the ReadIndex protocol, which avoids the log round-trip. Partition tolerance: the majority partition continues serving both reads and writes; the minority partition becomes unavailable for writes (choosing CP in CAP terms). This is the etcd / CockroachDB model. For higher availability at the cost of consistency, use quorum reads and writes with W + R > N, in the style of the Dynamo paper.
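A minimal Python sketch of the leader's write path under these rules. `LeaderKVStore` and `replicate_to_followers` are hypothetical names, and the replication call is a synchronous stand-in for real AppendEntries RPCs:

```python
# Sketch: the leader's write path in a Raft-replicated KV store.
class LeaderKVStore:
    def __init__(self, cluster_size: int):
        self.cluster_size = cluster_size
        self.log = []            # replicated log of (key, value) entries
        self.state = {}          # the applied key-value state machine
        self.commit_index = -1

    def replicate_to_followers(self, entry) -> int:
        """Hypothetical stand-in for AppendEntries RPCs.
        Returns how many nodes (including the leader) stored the entry."""
        return self.cluster_size          # assume every follower acks in this sketch

    def put(self, key, value) -> str:
        entry = (key, value)
        self.log.append(entry)
        acks = self.replicate_to_followers(entry)
        if acks <= self.cluster_size // 2:   # no majority: cannot commit
            return "retry"                   # client retries, possibly elsewhere
        self.commit_index = len(self.log) - 1
        self.state[key] = value              # apply only after commit
        return "ok"                          # ack only after commit

store = LeaderKVStore(cluster_size=3)
print(store.put("x", 1))   # "ok" once a majority has the entry
```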

Why does W + R > N guarantee that a quorum read sees the latest write?

Model Answer

With N replicas, requiring W write acknowledgments and R read responses such that W + R > N ensures every read quorum intersects every write quorum by at least one node. Example: N=3, W=2, R=2 (2+2>3). A write must reach 2 of 3 nodes. A read queries 2 of 3 nodes. At least 1 node is in both sets — that node has the latest write, and a read that returns the highest version across R responses will return it. This does not guarantee linearizability (concurrent writes can still cause conflicts), but prevents reading a value older than the last quorum write.
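A toy Python demonstration of the overlap with N=3, W=2, R=2. The names `replicas`, `write`, and `read` are illustrative, and real systems track versions with vector clocks or hybrid timestamps rather than a bare integer:

```python
# W + R > N: every read quorum intersects every write quorum.
N, W, R = 3, 2, 2
replicas = [{} for _ in range(N)]   # each replica maps key -> (version, value)

def write(key, value, version):
    for node in replicas[:W]:        # reach any W of N replicas
        node[key] = (version, value)

def read(key):
    responses = [node.get(key) for node in replicas[N - R:]]   # any R of N
    return max(r for r in responses if r is not None)          # highest version wins

write("x", "old", version=1)
write("x", "new", version=2)
print(read("x"))   # (2, 'new'): the read quorum overlaps the write quorum
```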

How does DynamoDB handle concurrent writes to the same item?

Model Answer

DynamoDB uses conditional writes and optimistic locking via version numbers. Without a condition, the last writer wins (based on wall-clock time, which is susceptible to clock skew). With a conditional expression (e.g., version = expected_version), the write is rejected if the stored version does not match, forcing the client to re-read and retry. DynamoDB's original design used vector clocks (described in the 2007 Dynamo paper), but the production service moved to simpler conflict resolution because clients found vector clock reconciliation complex to implement correctly.
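A generic sketch of the conditional-write pattern in Python, deliberately not DynamoDB's actual API; `ConditionalStore` and `put_if_version` are hypothetical names for illustration:

```python
# Optimistic locking: a write succeeds only if the caller's expected
# version matches the stored version; otherwise re-read and retry.
class ConditionalStore:
    def __init__(self):
        self.items = {}   # key -> (version, value)

    def get(self, key):
        return self.items.get(key, (0, None))

    def put_if_version(self, key, value, expected_version) -> bool:
        current_version, _ = self.get(key)
        if current_version != expected_version:
            return False                             # condition failed: caller retries
        self.items[key] = (current_version + 1, value)
        return True

store = ConditionalStore()
version, _ = store.get("balance")
assert store.put_if_version("balance", 100, version)       # first writer wins
assert not store.put_if_version("balance", 200, version)   # stale version: rejected
```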

What is the difference between a key-value store and a distributed cache, and where does Redis fit?

Model Answer

A key-value store persists data durably and provides strong consistency guarantees (e.g., etcd, DynamoDB). A distributed cache stores data in memory for fast access, typically with eviction and optional durability. Redis can function as both: with AOF (append-only file) and/or RDB snapshots enabled, Redis provides durability, but its replication is asynchronous, so replicas can lag and its consistency model differs from systems like etcd. For critical data, use a durable store with replication; use Redis as a cache in front of it.
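A sketch of the resulting cache-aside pattern, with plain dicts standing in for Redis and the durable store; `TTL_SECONDS` and the invalidate-on-write policy are illustrative choices, not the only option:

```python
# Cache-aside: the durable store is the source of truth; the cache is a
# best-effort, expiring copy in front of it.
import time

durable_store = {"user:1": "alice"}   # stand-in for etcd / DynamoDB / etc.
cache = {}                            # stand-in for Redis: key -> (expires_at, value)
TTL_SECONDS = 60

def get(key):
    entry = cache.get(key)
    if entry and entry[0] > time.time():       # cache hit, not expired
        return entry[1]
    value = durable_store.get(key)             # miss: fall through to the store
    if value is not None:
        cache[key] = (time.time() + TTL_SECONDS, value)
    return value

def put(key, value):
    durable_store[key] = value    # write the source of truth first
    cache.pop(key, None)          # invalidate; the next read repopulates
```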

What is read repair and what problem does it solve?

Model Answer

Read repair happens when a quorum read returns different values from different replicas. The coordinator detects the inconsistency and sends a write to the stale replicas to bring them up to date; the repair piggybacks on normal read traffic. It solves replica divergence caused by temporary node failures: if a node was down during a write, it missed the update; the next quorum read that includes this node detects the divergence and repairs it. Cassandra also runs anti-entropy repair (Merkle tree-based) as a scheduled background process to catch divergence on keys that reads never touch.
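A minimal Python sketch of read repair, assuming replicas store `(version, value)` pairs and the highest version is taken as freshest (real systems use vector clocks or timestamps for this comparison):

```python
# Read repair: a quorum read detects divergent replicas and writes the
# freshest value back to the stale ones.
replicas = [
    {"x": (2, "new")},   # up to date
    {"x": (1, "old")},   # missed the last write (was down)
    {"x": (2, "new")},
]

def quorum_read_with_repair(key, r=2):
    contacted = replicas[:r]                            # any R replicas
    responses = [node.get(key) for node in contacted]
    freshest = max(resp for resp in responses if resp is not None)
    for node, resp in zip(contacted, responses):
        if resp != freshest:
            node[key] = freshest                        # repair the stale replica
    return freshest

print(quorum_read_with_repair("x"))   # (2, 'new'); replica 1 is now repaired
```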

Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.

Common Mistakes

The top 5 mistakes builders make in this track, and exactly how to fix them.

Serving reads from a single node and returning stale data

Why it happens

A single-node read returns whatever that node has. If the latest write only reached a subset of nodes, a node that missed it returns the old value.

The fix

For linearizable reads, either route all reads through the leader (which already has the latest committed value) or perform a read quorum (W + R > N).

Letting concurrent writes to the same key race with no defined winner

Why it happens

Without a total order on writes (e.g., via a log), two concurrent writes to the same key have an undefined winner.

The fix

Linearize writes through the Raft log, or use a last-write-wins rule with a deterministic tie-breaker (e.g., physical timestamp with node ID as the tiebreak). If you need to detect conflicts rather than silently pick a winner, track causality with vector clocks instead.
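A sketch of the deterministic tie-breaker: every write carries `(timestamp, node_id)`, and every replica compares the same tuple, so all replicas converge on one winner even for identical timestamps. The names are illustrative:

```python
# Last-write-wins with a deterministic tie-breaker.
def lww_merge(a, b):
    """Each write is (timestamp, node_id, value); the larger tuple wins."""
    return a if (a[0], a[1]) >= (b[0], b[1]) else b

w1 = (1700000000.0, "node-a", "v1")
w2 = (1700000000.0, "node-b", "v2")   # same timestamp, different origin
print(lww_merge(w1, w2))               # node-b wins, on every replica
print(lww_merge(w2, w1))               # same winner regardless of argument order
```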

Acknowledging writes before they are durable

Why it happens

Acknowledging a write before it is persisted (to disk or a majority of nodes) violates durability.

The fix

Only reply `ok` after the entry is committed (replicated to a majority in Raft, or written to stable storage in simpler designs).

Letting deleted keys come back to life after a merge

Why it happens

In eventual-consistency designs, a simple "copy the newest value" merge reinstates deleted keys if the deleting node has not propagated its state everywhere.

The fix

Use tombstones: mark deleted keys with a special sentinel and a timestamp. Merge logic must recognize tombstones and never overwrite a tombstone with an older value.
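A minimal merge sketch with tombstones, assuming each replica stores `key -> (timestamp, value_or_TOMBSTONE)`; a real system would also eventually garbage-collect old tombstones, which this sketch omits:

```python
# A delete is recorded as (timestamp, TOMBSTONE) rather than removing the
# key, so an older live value can never win against a newer delete.
TOMBSTONE = object()   # sentinel marking a deleted key

def merge(local, remote):
    merged = dict(local)
    for key, (ts, value) in remote.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)   # newer entry wins, tombstone or not
    return merged

a = {"x": (5, TOMBSTONE)}   # deleted x at t=5
b = {"x": (3, "stale")}     # still holds the old value from t=3
print(merge(a, b)["x"][1] is TOMBSTONE)   # True: the delete survives the merge
```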

Scanning every key on every read

Why it happens

A linear scan over all keys on every read is only acceptable for toy sizes.

The fix

Use a `HashMap` / `dict` keyed by the key string for O(1) average-case reads. If range queries are needed, add a sorted index (e.g., `BTreeMap`).
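A sketch of the hybrid in Python: a `dict` for O(1) point reads plus a bisect-maintained sorted key list standing in for a `BTreeMap`-style index (insertion into the sorted list is O(n) here, which a real B-tree avoids):

```python
# dict for point lookups, sorted key list for range scans.
import bisect

class IndexedKV:
    def __init__(self):
        self.data = {}          # key -> value, O(1) average point lookups
        self.sorted_keys = []   # kept sorted for range scans

    def put(self, key, value):
        if key not in self.data:
            bisect.insort(self.sorted_keys, key)
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def range(self, lo, hi):
        """All (key, value) pairs with lo <= key < hi, in key order."""
        start = bisect.bisect_left(self.sorted_keys, lo)
        end = bisect.bisect_left(self.sorted_keys, hi)
        return [(k, self.data[k]) for k in self.sorted_keys[start:end]]

kv = IndexedKV()
for k in ["banana", "apple", "cherry"]:
    kv.put(k, len(k))
print(kv.range("a", "c"))   # [('apple', 5), ('banana', 6)]
```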

Comparison Mode

Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track.

| Dimension | B-Tree | LSM-Tree |
| --- | --- | --- |
| Write pattern | In-place updates (random I/O) | Append-only writes (sequential I/O) |
| Write amplification | Low (one disk write per page update) | High (data rewritten multiple times during compaction) |
| Read amplification | Low (O(log n) page reads per lookup) | High (must check memtable + multiple SSTable levels) |
| Space amplification | Low (updates are in-place) | High (stale versions exist until compaction) |
| Write throughput | Limited by random I/O | Very high (sequential writes) |
| Crash recovery | WAL (write-ahead log) required | Memtable WAL required; SSTables are immutable |
| Used in | PostgreSQL, MySQL InnoDB, SQLite | LevelDB, RocksDB, Cassandra, HBase |

Verdict: B-Tree for read-heavy workloads with random access. LSM-Tree for write-heavy workloads where sequential throughput matters.
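To make the read-amplification row concrete, here is a toy LSM write/read path in Python: a memtable that flushes into immutable sorted runs, with reads checking the memtable first and then each run newest-first. There is no WAL or compaction, and the constants and names are illustrative:

```python
# Toy LSM-tree: writes hit the memtable; flushes create immutable runs.
MEMTABLE_LIMIT = 2

memtable = {}
sstables = []   # immutable sorted runs, newest last

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.append(dict(sorted(memtable.items())))   # sequential flush
        memtable.clear()

def get(key):
    if key in memtable:              # 1. newest data lives in the memtable
        return memtable[key]
    for run in reversed(sstables):   # 2. then each run, newest first
        if key in run:
            return run[key]
    return None                      # read amplification: many places to check

put("a", 1); put("b", 2)   # flush happens here
put("a", 3)                # newer "a" shadows the flushed version
print(get("a"))            # 3: found in the memtable before the older SSTable
```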

Concepts Covered

key-value API, operations, leader routing, redirect, client sessions, linearizable reads, read index, lease, idempotency, deduplication, snapshot, log compaction, recovery, heartbeat confirmation, leader lease, lease reads, clock assumption, zero network round trips, stale read risk, follower reads, bounded staleness, read scalability, consistency tradeoff, read-your-writes, session consistency, commit index tracking, client token, throughput benchmark, read/write ratio, latency comparison, scalability, multi-key transaction, atomic batch, log entry, all-or-nothing, OCC, version check, conflict detection, abort and retry, MVCC, versioned storage, snapshot isolation, readers never block writers, TiKV, regions, range partitioning, Raft per region, multi-region, contention, abort rate, throughput, OCC vs MVCC, hot key

Prerequisites

It is recommended to complete the previous tracks before starting this one. Concepts build progressively throughout the curriculum.