
The Sharder: Scaling Beyond One Machine

Why vertical scaling hits a wall, how consistent hashing solves the rebalancing problem, why virtual nodes matter for even distribution, and what cross-shard transactions cost you.

The Wall You Hit

Every storage system starts on one machine. One machine is fast, simple to operate, and easy to reason about. Transactions are free. Consistency is easy. You add more RAM and faster disks and it keeps working.

Then you hit the wall. A single machine has a maximum amount of RAM, disk, and network bandwidth. The highest-end servers available today can hold a few terabytes of RAM and tens of terabytes of NVMe storage. If your dataset exceeds what fits on the largest machines you can buy, you have to spread it across multiple machines. This is horizontal scaling, or sharding.

Replication, which is what Raft does, is not the answer here. Replication copies data across machines to provide fault tolerance and high read throughput. Every replica holds the full dataset. Adding replicas increases availability and read capacity, but it does not help when the problem is that the dataset is larger than any single machine can hold. Every replica would be just as full.

Sharding partitions the data. Each node holds a subset of the keys. To read or write a key, you need to find which node owns that key, and only talk to that node. The total storage capacity is the sum of all shards. The total write throughput is also roughly the sum, since different keys go to different machines.

The Naive Approach and Its Problem

The first approach most engineers reach for is modulo hashing. You have N shards. To find which shard owns key k, compute hash(k) % N. Consistent, simple, no coordination required.

The problem appears when you add or remove shards. Suppose you have 4 shards and you add a fifth. Now the hash function is hash(k) % 5 instead of hash(k) % 4. For almost every key, hash(k) % 4 and hash(k) % 5 produce different results. Nearly every key needs to be moved to a different shard. If you have 100 million keys distributed across 4 shards, adding one shard requires moving roughly 80 million of them.
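
To see the scale of the churn concretely, here is a small sketch (not production code; the key format and counts are arbitrary) that measures how many keys change shards when moving from 4 to 5 shards under modulo placement:

import hashlib

def shard_for(key, n):
    # Modulo placement: hash the key, take the result modulo the shard count.
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of keys move going from 4 to 5 shards")
# Prints a figure close to 80%.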

This is not just slow. During the migration, your system needs to handle reads and writes that may target keys currently in transit. You need complex coordination to ensure that reads do not go to the old shard after a key has moved, and writes do not go to the new shard before the key has arrived. The operational complexity is significant.

Consistent hashing was invented specifically to solve this problem.

Consistent Hashing: The Ring

David Karger and colleagues published the consistent hashing algorithm in 1997 as a solution to distributing load across web caches. The core idea is elegant.

Imagine the output space of your hash function as a ring. If you are using a 32-bit hash, the ring goes from 0 to 2^32 - 1, wrapping around. Both keys and shards are hashed to positions on this ring.

To find which shard owns a key, hash the key to get its position on the ring. Then walk clockwise from that position until you find a shard. That shard owns the key.

When you add a new shard, it takes ownership of the keys between itself and the previous shard walking counter-clockwise. When you remove a shard, its keys are taken over by the next shard clockwise. In both cases, only the keys in the immediate neighborhood of the changed shard need to move. All other keys stay where they are.

Mathematically, with K keys and N shards, adding or removing one shard moves on the order of K/N keys. With 100 million keys, adding a fifth shard to a 4-shard cluster moves only about 20 million keys (the keys previously owned by the shard immediately clockwise of the new one). The other 80 million keys are undisturbed.

Here is a minimal implementation:

import hashlib
import bisect

class ConsistentHash:
    def __init__(self, replicas=100):
        self.replicas = replicas
        self.ring = {}        # hash position -> shard_id
        self.sorted_keys = [] # sorted list of hash positions

    def _hash(self, key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add_shard(self, shard_id):
        for i in range(self.replicas):
            h = self._hash(f"{shard_id}:{i}")
            self.ring[h] = shard_id
            bisect.insort(self.sorted_keys, h)

    def get_shard(self, key):
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect.bisect(self.sorted_keys, h)
        if idx >= len(self.sorted_keys):
            idx = 0  # wrap around the ring
        return self.ring[self.sorted_keys[idx]]

The replicas parameter is the number of virtual nodes per shard. We will get to that in a moment.
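
As a rough check that the ring delivers what it promises, the snippet below (assuming the ConsistentHash class above; shard names and key counts are arbitrary) adds a fifth shard and counts how many keys change owners:

ring = ConsistentHash(replicas=100)
for shard in ["shard-a", "shard-b", "shard-c", "shard-d"]:
    ring.add_shard(shard)

keys = [f"user:{i}" for i in range(100_000)]
before = {k: ring.get_shard(k) for k in keys}

ring.add_shard("shard-e")  # add a fifth shard
moved = sum(1 for k in keys if ring.get_shard(k) != before[k])
print(f"{moved / len(keys):.0%} of keys moved")
# Expect a figure near 20%: the share of the key space shard-e now owns.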

Virtual Nodes: Solving the Distribution Problem

Basic consistent hashing with one ring position per shard has a problem: the distribution is uneven. With 4 shards placed randomly on a ring, some shards might own 40% of the key space while others own 10%. The shards with larger arcs get more keys and more load.

The solution is virtual nodes: instead of placing each shard at one position on the ring, place it at many positions. Each physical shard has 100 or 200 virtual nodes spread around the ring. The key space is divided into smaller, more numerous segments. By the law of large numbers, with enough virtual nodes, each shard ends up owning approximately equal portions of the key space.
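
One way to see the effect, using the ConsistentHash class above (the replica counts and shard names are illustrative):

from collections import Counter

def ownership(replicas):
    ring = ConsistentHash(replicas=replicas)
    for shard in ["a", "b", "c", "d"]:
        ring.add_shard(shard)
    # Count what percentage of 100,000 sample keys land on each shard.
    counts = Counter(ring.get_shard(f"user:{i}") for i in range(100_000))
    return {s: round(100 * n / 100_000) for s, n in counts.items()}

print(ownership(1))    # one position per shard: often badly skewed
print(ownership(200))  # 200 virtual nodes per shard: close to 25% each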

Virtual nodes also handle heterogeneous hardware. If you have a mix of powerful and weak machines, you can give the powerful machines more virtual nodes, proportional to their capacity. A machine with twice the RAM and CPU gets twice as many virtual nodes and ends up owning twice as many keys. The ring model naturally handles this without special-casing.
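
A weighted variant is a small change. The sketch below adds a hypothetical weight parameter (not part of the class above) that scales a shard's virtual node count by its relative capacity:

import bisect

def add_shard_weighted(ch, shard_id, weight=1.0):
    # Hypothetical helper: a shard with weight 2.0 gets twice the virtual
    # nodes, and therefore roughly twice the key space, of a weight 1.0 shard.
    for i in range(int(ch.replicas * weight)):
        h = ch._hash(f"{shard_id}:{i}")
        ch.ring[h] = shard_id
        bisect.insort(ch.sorted_keys, h)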

When a shard leaves the cluster, its virtual nodes are spread across the ring, so its keys are redistributed among all the remaining shards rather than all flowing to one neighbor. This prevents the surviving shard from suddenly being overloaded.

Amazon's Dynamo and Apache Cassandra both use consistent hashing with virtual nodes. The original Dynamo paper (describing the system that preceded DynamoDB) places each node at multiple random positions on the ring, with the number of positions proportional to the node's capacity. Cassandra took a similar approach, initially assigning virtual node tokens at random and later adding a token allocation algorithm to make ownership more even and easier to reason about.

Hotspots and Why the Ring Does Not Eliminate Them

Consistent hashing distributes keys evenly across shards in expectation, but access patterns are not uniform. Some keys are accessed far more frequently than others. The key trending:post:12345678 in a social network might receive a million reads per second. It does not matter that the other 99 million keys are evenly distributed if that one key is hammering a single shard.

This is a hotspot: a specific key or range of keys that receives disproportionate traffic. Consistent hashing does not help with hotspots because the hot key always maps to the same shard.

Production systems use several strategies to mitigate hotspots.

The simplest is caching. A cache in front of the storage layer absorbs the hot reads. The shard itself sees only a small fraction of the traffic. This works well for read hotspots but not write hotspots.

For write hotspots, you can use write buffering: accept writes into a distributed buffer and apply them to the shard in batches. This smooths out bursty write traffic but adds latency.

Another approach is to artificially shard a hot key. Instead of storing trending:post:12345678 at a single location, store it as trending:post:12345678:0, trending:post:12345678:1, ..., trending:post:12345678:9. Writes increment a random suffix. Reads check all suffixes and aggregate. This spreads load across 10 shards at the cost of more complex read and write logic.
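
In code, the pattern might look like the sketch below; the store interface (increment/get) and the suffix count of 10 are assumptions for illustration:

import random

SUFFIXES = 10  # number of sub-keys the hot key is split into (assumed)

def write_hot_counter(store, key, delta):
    # Spread writes across sub-keys by picking a random suffix.
    store.increment(f"{key}:{random.randrange(SUFFIXES)}", delta)

def read_hot_counter(store, key):
    # Reads must fan out to every sub-key and aggregate the results.
    return sum(store.get(f"{key}:{i}") or 0 for i in range(SUFFIXES))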

DynamoDB's documentation calls this "write sharding" and recommends it when a small number of partition keys receive a disproportionate share of the write traffic. You pay roughly a 10x overhead on reads (checking all suffixes) to gain roughly a 10x improvement in write throughput for the hot key.

Cassandra's approach to detecting hotspots is monitoring per-node throughput. If one node is handling significantly more operations per second than others, it is likely a hotspot destination. The operator can then investigate and apply mitigation strategies.

Cross-Shard Transactions and Why They Are Expensive

Sharding breaks the transaction model. On a single machine or single Raft group, you can atomically update multiple keys. Either both changes commit or neither does. With multiple shards, those keys might live on different machines.

Consider a bank transfer: debit account A, credit account B. If A is on shard 1 and B is on shard 3, you need both shards to agree to apply their respective changes atomically. If shard 1 applies its change and then shard 3 fails before applying its change, the money has left account A but never arrived in account B.

The standard solution is two-phase commit (2PC). The coordinator sends a "prepare" message to all participating shards. Each shard writes its change to a log (but does not apply it yet) and responds with either "prepared" or "abort." If all shards respond "prepared," the coordinator sends "commit" to all shards, and they apply their changes. If any shard responds "abort," the coordinator sends "abort" to all.
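
A coordinator following this protocol might look roughly like the sketch below; the participant interface (prepare/commit/abort), the transaction object, and the coordinator log are assumptions for illustration, not a specific system's API:

def two_phase_commit(coordinator_log, participants, txn):
    # Phase 1: every participant durably logs its changes and votes.
    votes = [p.prepare(txn) for p in participants]

    # The decision is logged before it is announced, so a recovering
    # coordinator can tell participants whether to commit or abort.
    if all(v == "prepared" for v in votes):
        coordinator_log.append(("commit", txn.id))
        for p in participants:
            p.commit(txn)   # Phase 2: apply the prepared changes
        return "committed"
    coordinator_log.append(("abort", txn.id))
    for p in participants:
        p.abort(txn)        # Phase 2: discard the prepared changes
    return "aborted"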

2PC is correct, but it has serious drawbacks. The coordinator is a single point of failure. If the coordinator crashes after sending "prepare" but before sending "commit," the participating shards are stuck in the prepared state indefinitely, holding locks on the affected keys. Recovery requires a separate protocol for participants to determine whether to commit or abort. This is the "blocking" in "2PC is a blocking protocol."

The performance cost is also significant. A 2PC transaction requires two network round trips (prepare phase and commit phase) instead of one. If the shards are in different datacenters, each round trip might be 50-100ms. A single-shard transaction completes in one round trip; a cross-shard 2PC requires two.

For these reasons, most sharded systems are designed to minimize cross-shard transactions. Data modeling choices matter enormously. If you model your data so that logically related records are colocated on the same shard, you avoid the need for 2PC: a user's account data lives on the same shard as all of that user's posts, an order on the same shard as all of its line items. The goal is that 95%+ of transactions are single-shard.

Google Spanner takes the opposite approach: it provides full cross-shard transactions with external consistency, using 2PC coordinated with TrueTime. Spanner accepts the latency cost of 2PC (tens to hundreds of milliseconds for cross-shard transactions) in exchange for the correctness guarantees of global ACID transactions. For many applications, this is the right tradeoff.

CockroachDB, which is modeled on Spanner but built on commodity hardware, makes the same choice. It uses a Raft group per range of keys, and cross-range transactions use 2PC. The system is designed to make cross-range transactions fast enough to be acceptable, but they are still more expensive than single-range transactions.

The Shard Controller

In a sharded system, someone needs to know which shard owns which keys. This is the shard controller. The controller maintains the mapping from key ranges (or shard IDs) to replica groups.

The controller needs to be highly available: if it is unreachable, no one can find the right shard for a key. The standard approach is to run the controller as a small Raft group (typically 3 or 5 nodes). Configuration changes (adding a replica group, moving a shard, removing a replica group) are proposed as Raft log entries and committed by the majority.

Configuration changes must be applied by replica groups in version order. If a replica group at configuration version 3 receives a configuration update to version 5, it cannot skip to version 5 directly. It must fetch and apply version 4 first. Skipping versions could cause a shard to be missed: version 4 might have assigned a shard to this group, and if the group skips to version 5 without applying version 4, it never knows it is supposed to own that shard.
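
Sketched in code, the ordering rule is simply a loop that never skips a version; the controller and replica group interfaces here are illustrative:

def catch_up_config(group, controller, target_version):
    # Apply configurations strictly one version at a time. Skipping a
    # version could skip a shard assignment made in that configuration.
    while group.config_version < target_version:
        next_config = controller.fetch_config(group.config_version + 1)
        group.apply_config(next_config)  # may start shard migrations
        group.config_version += 1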

The shard migration protocol is the most operationally complex part of a sharded system. When a configuration change moves shard s from group A to group B:

1. Group A stops serving requests for shard s.
2. Group A creates a snapshot of the shard data.
3. Group A sends the snapshot to group B.
4. Group B installs the snapshot and starts serving requests for shard s.
5. Group B notifies group A.
6. Group A deletes its copy of the shard.

Each of these steps can fail. The protocol must be idempotent: if any step is repeated (due to retries), the result is the same. Client sessions (the deduplication table) must also be migrated with the shard data, or clients might see duplicate operations after a migration.
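
An idempotent install on the receiving side might look like this sketch; the field names and snapshot structure are assumptions:

def install_shard(group, shard_id, snapshot, config_version):
    # If this shard was already installed at (or past) this configuration
    # version, a retried transfer is a no-op.
    if group.shard_versions.get(shard_id, 0) >= config_version:
        return "already-installed"
    group.data[shard_id] = snapshot.kv_pairs
    # The client deduplication table moves with the shard, so requests
    # retried across the migration are still recognized as duplicates.
    group.sessions[shard_id] = snapshot.client_sessions
    group.shard_versions[shard_id] = config_version
    return "installed"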

The shard controller in this track is a simplified version of what systems like MIT's ShardKV lab and Google Spanner implement. Understanding the controller and its interaction with replica groups is what gives you the intuition to design sharded systems that are both correct and operationally manageable.

Putting It Together

Sharding is the final piece in the scaling picture. Replication (the Raft layer) provides fault tolerance and read throughput by copying data across machines. Sharding provides horizontal write throughput and capacity by partitioning data across machines.

The combination is a standard production architecture: multiple Raft groups, each owning a subset of the key space, with a Raft-backed shard controller managing the configuration. Each Raft group can survive the failure of a minority of its members. The overall system can survive the complete failure of any shard (with data loss for that shard's keys, unless you have cross-shard replication).

The engineering challenges are in the transitions: adding nodes, removing nodes, moving shards. Consistent hashing minimizes the data movement during transitions. The configuration versioning protocol ensures all nodes apply changes in the right order. The migration protocol ensures data moves atomically and consistently.

Building this correctly is one of the harder engineering problems in distributed systems. But the fundamental ideas (hash rings, virtual nodes, configuration versioning, two-phase migration) are not mysterious. They are careful engineering in response to specific failure modes. Once you understand the failure modes, the designs follow.

Build it yourself

Reading about distributed systems is useful. Building them is how you actually learn.

Start The Sharder track