Building Distributed Systems from Scratch

Merkle Trees for Efficient Replica Diff

Dynamo replicates each key to N nodes. Over time, replicas diverge: a node that was temporarily partitioned misses writes; a hinted handoff fails to deliver; a disk corruption corrupts a handful of values. The system must continuously detect and repair these divergences in the background without transferring the entire dataset between replicas.

Why Naive Comparison Is Too Expensive

Comparing every key-value pair between two replicas means transferring O(K) data — the entire dataset. At millions of keys this is impractical as a continuous background job. Anti-entropy runs constantly; it cannot afford to scan everything on every pass.

The Merkle Tree Structure

A Merkle tree (also called a hash tree) partitions the key-space into leaf buckets. Each leaf bucket has a hash of all the key-value pairs it contains. Internal nodes have hashes of their children's hashes. The root hash summarises the entire dataset in a single value.

To find which keys differ between two replicas, compare hashes top-down:

Compare root hashes. If they match, the replicas are in sync — done in one message.
If roots differ, compare the two children of the root. One or both will differ.
Recurse only into the subtrees that differ, until you reach the leaf buckets.
Exchange only the key-value pairs in the identified leaf buckets.

With B leaf buckets and tree height H = log2(B):
  - O(H) round-trips to identify diverged buckets
  - Then transfer only keys in those buckets

Example: 1 million keys, 1024 buckets, 1 diverged bucket
  - Only ~1000 keys transferred instead of 1 million

Bucket Hash Function

In this task the bucket assignment uses the first character of the key: bucket(key) = ord(key[0]) % bucket_count. The bucket hash is computed by sorting all key:value strings in the bucket, concatenating them, and summing the character ordinals modulo a large prime:

def bucket_hash(pairs):
    if not pairs:
        return 0
    s = ''.join(sorted(f'{k}:{v}' for k, v in pairs))
    return sum(ord(c) for c in s) % 1000000007

Two buckets whose hash values match are guaranteed to have the same content (assuming no hash collisions). Two buckets with different hashes have diverged and their keys need to be exchanged.

Why It Matters in Dynamo

Dynamo runs Merkle tree anti-entropy continuously between replicas. Even if a node was offline for hours, the tree comparison quickly identifies the small fraction of keys that diverged — without scanning the full dataset. This is the mechanism that underpins eventual consistency: replicas that diverge will always reconverge, bounded by the frequency of anti-entropy rounds.

Common Pitfalls

Empty buckets must hash to the same value on both sides: a bucket with no keys should hash to 0 on both replicas. If one side has a key and the other does not, the bucket hashes will differ, which is correct behavior.
Sort before hashing: the bucket hash must be deterministic regardless of insertion order. Always sort the key-value pairs before concatenating.
Bucket count is shared state: both replicas must use the same bucket count, otherwise the same key will land in different buckets on each side and the comparison will be meaningless.