Subtracks & Tasks
Naive Broadcast (Flooding)
Implement Basic Broadcast to All Nodes
Implement a broadcast system where messages sent to any node eventually reach all nodes. Handle three message types: 1. topology: Tells you your neighbors...
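A minimal sketch of the flooding approach in plain Go, kept independent of any particular Maelstrom client library (the `send` field below is a stand-in for whatever transport you wire up):

```go
package broadcast

import "sync"

// Node holds the state flooding needs: the messages seen so far and the
// neighbor list delivered by the topology message.
type Node struct {
	mu        sync.Mutex
	seen      map[int]bool
	neighbors []string
	send      func(dst string, msg int) // transport hook; assumed, not shown
}

// HandleBroadcast stores a new message and floods it to every neighbor.
// The seen-set check is what stops the flood from looping forever.
func (n *Node) HandleBroadcast(src string, msg int) {
	n.mu.Lock()
	if n.seen[msg] {
		n.mu.Unlock()
		return // already processed; do not re-flood
	}
	n.seen[msg] = true
	peers := append([]string(nil), n.neighbors...)
	n.mu.Unlock()

	for _, p := range peers {
		if p != src { // no need to echo back to the sender
			n.send(p, msg)
		}
	}
}

// HandleRead returns every message this node has seen so far.
func (n *Node) HandleRead() []int {
	n.mu.Lock()
	defer n.mu.Unlock()
	out := make([]int, 0, len(n.seen))
	for m := range n.seen {
		out = append(out, m)
	}
	return out
}
```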
Build Flat Tree Topology Gossip
Optimize your broadcast by using a tree topology. Instead of flooding to all neighbors, organize nodes into a spanning tree where each message travels...
Implement Peer-to-Peer Gossip with Random Neighbors
Implement a gossip protocol where each node randomly selects neighbors to share information with. This provides robustness against node failures while...
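A sketch of the peer-selection step, assuming you already have the full node list; only the random sampling logic is shown:

```go
package gossip

import "math/rand"

// pickPeers returns up to k distinct peers chosen uniformly at random,
// excluding this node's own ID. k is the protocol's fanout.
func pickPeers(self string, all []string, k int) []string {
	candidates := make([]string, 0, len(all))
	for _, id := range all {
		if id != self {
			candidates = append(candidates, id)
		}
	}
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if k > len(candidates) {
		k = len(candidates)
	}
	return candidates[:k]
}
```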
Add Message Batching to Reduce Network Overhead
Reduce network overhead by batching multiple messages into single transmissions. Instead of sending immediately, buffer messages and flush periodically...
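One way to structure the buffering, sketched with a hypothetical `send` callback that delivers a whole batch per destination:

```go
package gossip

import (
	"sync"
	"time"
)

// Batcher buffers outgoing messages per destination and flushes on a
// timer, trading a little latency for far fewer network sends.
type Batcher struct {
	mu      sync.Mutex
	pending map[string][]int              // destination -> buffered messages
	send    func(dst string, batch []int) // transport hook; assumed, not shown
}

func NewBatcher(send func(string, []int), every time.Duration) *Batcher {
	b := &Batcher{pending: map[string][]int{}, send: send}
	go func() {
		for range time.Tick(every) { // sketch only: this ticker never stops
			b.Flush()
		}
	}()
	return b
}

// Queue records a message for later delivery instead of sending it now.
func (b *Batcher) Queue(dst string, msg int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[dst] = append(b.pending[dst], msg)
}

// Flush sends each destination one batch containing everything queued.
func (b *Batcher) Flush() {
	b.mu.Lock()
	batches := b.pending
	b.pending = map[string][]int{}
	b.mu.Unlock()
	for dst, msgs := range batches {
		b.send(dst, msgs)
	}
}
```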
Handle Network Partition Healing and Resynchronization
Handle the scenario where network partitions heal and previously isolated nodes reconnect. Implement anti-entropy mechanisms to synchronize message sets...
Gossip Protocol
Implement Gossip Fanout with Random Peer Selection
Gossip protocols spread information probabilistically: instead of broadcasting to all nodes, each node forwards to K random peers. This is more resilient...
Calculate Minimum Fanout for Reliable Delivery
What fanout K guarantees that all N nodes receive a message with probability >= 0.99? The theory says that after R rounds of gossip with fanout K, the...
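As a back-of-the-envelope model (an idealization; the task's exact formula is the one to use for the probability bound), each round roughly multiplies the informed population by K + 1, so covering N nodes needs about log base (K+1) of N rounds:

```go
package gossip

import "math"

// roundsToCover estimates the gossip rounds needed to inform all n nodes
// under the idealized growth model informed ≈ (k+1)^r. Real runs need a
// safety margin, since later rounds waste sends on already-informed nodes.
func roundsToCover(n, k int) int {
	return int(math.Ceil(math.Log(float64(n)) / math.Log(float64(k+1))))
}
```

For example, roundsToCover(25, 3) = ceil(ln 25 / ln 4) = ceil(2.32) = 3 rounds for a 25-node cluster.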
Add Periodic Gossip Rounds on a Timer
Instead of only gossiping when a new broadcast arrives, add **periodic gossip rounds**: every interval, pick K random peers and send them all your known messages...
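A sketch of the timer loop; the `round` callback is where your peer selection and send logic would go:

```go
package gossip

import "time"

// StartGossipLoop fires a gossip round every interval, independent of
// incoming broadcasts, so messages dropped earlier are eventually
// re-sent. It returns a stop function that shuts the loop down.
func StartGossipLoop(interval time.Duration, round func()) (stop func()) {
	ticker := time.NewTicker(interval)
	done := make(chan struct{})
	go func() {
		for {
			select {
			case <-ticker.C:
				round() // e.g. send known state to K random peers
			case <-done:
				ticker.Stop()
				return
			}
		}
	}()
	return func() { close(done) }
}
```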
Implement Anti-Entropy with Digest Comparison
Full state sync wastes bandwidth when nodes are mostly in sync. **Anti-entropy** optimizes this: first exchange compact digests. Only transfer full state...
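One way to build the digest, assuming message IDs are integers: hash the sorted IDs, and skip the full transfer when two digests match.

```go
package gossip

import (
	"crypto/sha256"
	"encoding/binary"
	"sort"
)

// digest hashes the sorted message IDs. Two nodes with equal digests are
// (almost certainly) in sync and can skip the full state exchange.
func digest(seen map[int]bool) [32]byte {
	ids := make([]int, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Ints(ids) // sort so every node hashes the same byte stream
	h := sha256.New()
	buf := make([]byte, 8)
	for _, id := range ids {
		binary.BigEndian.PutUint64(buf, uint64(id))
		h.Write(buf)
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}
```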
Tune Gossip Parameters for Maelstrom Broadcast
The Maelstrom broadcast workload requires messages-per-op < 30 under network partitions. Your task is to implement configurable gossip parameters and ...
Topology-Aware Gossip
Implement Tree-Based Broadcast Overlay
Random gossip wastes messages because nodes may receive duplicates. A **spanning tree** ensures each node receives exactly one copy: the root broadcasts...
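One coordination-free way to build the tree (an assumed convention, not the only one): every node sorts the same ID list and reads off a k-ary heap layout, so all nodes agree on parent and children without exchanging any messages.

```go
package gossip

import "sort"

// treePeers maps sorted node IDs onto an implicit k-ary tree and returns
// this node's parent and children. Index 0 is the root; the parent of
// index i is (i-1)/k and its children are i*k+1 .. i*k+k.
func treePeers(self string, all []string, k int) (parent string, children []string) {
	ids := append([]string(nil), all...)
	sort.Strings(ids)
	me := sort.SearchStrings(ids, self)
	if me > 0 {
		parent = ids[(me-1)/k]
	}
	for c := me*k + 1; c <= me*k+k && c < len(ids); c++ {
		children = append(children, ids[c])
	}
	return parent, children
}
```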
Handle Tree Node Failure with Direct Fallback
Tree broadcast is efficient but fragile: if a node crashes, all its descendants lose connectivity. Your task is to add **failure detection** with direct fallback...
Implement Hybrid Tree and Gossip Broadcast
Pure tree broadcast is fast but fragile. Pure gossip is reliable but slow and wasteful. A **hybrid** approach uses tree for the first hop (fast, efficient...
Simulate Network Partition and Healing
Network partitions split the cluster into isolated groups. After the partition heals, gossip must merge the diverged states. Your task is to simulate ...
Epidemic Algorithms and CRDT Gossip
Implement Grow-Only Set (G-Set) with Gossip
A **Grow-only Set (G-Set)** is the simplest CRDT. Elements can be added but never removed. Merge is set union, which is commutative, associative, and idempotent...
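A minimal G-Set in Go; the whole CRDT is a map plus a union-based Merge:

```go
package crdt

// GSet is a grow-only set: Add and Merge, never Remove. Merge is set
// union, which is commutative, associative, and idempotent, so replicas
// can merge in any order, any number of times, and still converge.
type GSet struct{ elems map[string]bool }

func NewGSet() *GSet { return &GSet{elems: map[string]bool{}} }

func (s *GSet) Add(e string)           { s.elems[e] = true }
func (s *GSet) Contains(e string) bool { return s.elems[e] }

// Merge folds another replica's elements into this one (set union).
func (s *GSet) Merge(other *GSet) {
	for e := range other.elems {
		s.elems[e] = true
	}
}
```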
Implement Two-Phase Set (2P-Set)
A **2P-Set** (two-phase set) supports both add and remove by maintaining two G-Sets: the add-set and the remove-set (tombstones). An element is in the...
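A sketch building directly on the G-Set idea; note how Remove only ever adds a tombstone, which is why a removed element can never come back:

```go
package crdt

// TwoPSet supports add and remove using two grow-only sets. Once an
// element is tombstoned it can never be re-added: remove wins forever,
// which is the key limitation of this design.
type TwoPSet struct {
	added   map[string]bool
	removed map[string]bool // tombstones
}

func NewTwoPSet() *TwoPSet {
	return &TwoPSet{added: map[string]bool{}, removed: map[string]bool{}}
}

func (s *TwoPSet) Add(e string)    { s.added[e] = true }
func (s *TwoPSet) Remove(e string) { s.removed[e] = true }

// Contains: present means added and not tombstoned.
func (s *TwoPSet) Contains(e string) bool {
	return s.added[e] && !s.removed[e]
}

// Merge unions both underlying G-Sets independently.
func (s *TwoPSet) Merge(o *TwoPSet) {
	for e := range o.added {
		s.added[e] = true
	}
	for e := range o.removed {
		s.removed[e] = true
	}
}
```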
Implement Last-Writer-Wins Key-Value Store
A **Last-Writer-Wins (LWW)** register resolves conflicts by always keeping the value with the latest timestamp. This is simple but can lose concurrent writes...
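A sketch of an LWW map; the node-ID tie-breaker is an assumption added here so that replicas with equal timestamps still converge deterministically:

```go
package crdt

// lwwEntry pairs a value with the timestamp of the write that produced
// it. Higher timestamp wins; ties break on node ID so every replica
// picks the same winner.
type lwwEntry struct {
	Value  string
	TS     int64  // writer's timestamp
	NodeID string // deterministic tie-breaker
}

// LWWMap is a last-writer-wins key-value store.
type LWWMap struct{ data map[string]lwwEntry }

func NewLWWMap() *LWWMap { return &LWWMap{data: map[string]lwwEntry{}} }

func (m *LWWMap) Put(key, value string, ts int64, node string) {
	m.maybeReplace(key, lwwEntry{value, ts, node})
}

// Merge applies every entry from another replica under the same rule, so
// gossip in any order converges to the same map.
func (m *LWWMap) Merge(o *LWWMap) {
	for k, e := range o.data {
		m.maybeReplace(k, e)
	}
}

func (m *LWWMap) maybeReplace(key string, e lwwEntry) {
	cur, ok := m.data[key]
	if !ok || e.TS > cur.TS || (e.TS == cur.TS && e.NodeID > cur.NodeID) {
		m.data[key] = e
	}
}
```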
Demonstrate LWW Data Loss with Version Vectors
LWW silently loses data when two clients write concurrently. Your task is to demonstrate this and implement a version-vector alternative that detects ...
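The heart of the version-vector alternative is the comparison: unlike timestamps, two vectors can come back "concurrent", which is exactly the case LWW silently papers over. A sketch:

```go
package crdt

// VersionVector maps node ID -> count of writes observed from that node.
type VersionVector map[string]int64

// Compare reports the causal relationship between two vectors. When each
// side has seen writes the other has not, the writes are concurrent, and
// a store should keep both values (siblings) instead of dropping one.
func Compare(a, b VersionVector) string {
	aAhead, bAhead := false, false
	for node, av := range a {
		if av > b[node] { // missing keys read as 0
			aAhead = true
		}
	}
	for node, bv := range b {
		if bv > a[node] {
			bAhead = true
		}
	}
	switch {
	case aAhead && bAhead:
		return "concurrent"
	case aAhead:
		return "a dominates b"
	case bAhead:
		return "b dominates a"
	default:
		return "equal"
	}
}
```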
Benchmark Gossip KV Store Performance
Benchmark your gossip KV store to measure real-world performance characteristics: 1. **Convergence time**: How long until all replicas have the same ...
Interview Prep
Common interview questions for Distributed Systems / Infrastructure Engineer roles that map directly to what you build in this track.
Model Answer
Cassandra uses a gossip protocol where each node gossips with three random peers per second. It exchanges Gossip Digest (node + version) in three phases: SYN (send digest), ACK (send missing info), ACK2 (confirm receipt). Convergence time is O(log N) rounds. It provides eventual consistency for cluster state, not strong consistency.
Model Answer
Use SWIM-style gossip-based failure detection: periodic ping with indirect probing (if direct ping fails, ask k peers to probe). Tunable: ping interval, ping timeout, k for indirect probing. 10-second detection with low false positives requires ping interval around 1-2s. Phi accrual failure detector (used by Akka/Cassandra) maintains a sliding window of heartbeat intervals and computes a suspicion level rather than a binary up/down.
Model Answer
Gossip: decentralized, no single point of failure, O(log N) convergence, eventual consistency, scales to thousands of nodes. Used when you can tolerate eventual consistency for cluster state. ZooKeeper: strongly consistent, linearizable reads, immediate failover detection, but requires a quorum of ZK nodes, has lower write throughput. Use ZooKeeper for configuration, leader election, and distributed locks where strong consistency is required. Use gossip for membership and health state where eventual consistency is acceptable.
Model Answer
With fanout f and k rounds, the number of informed nodes grows as f^k. For 3 rounds to reach 1000 nodes: f^3 ≈ 1000, so f ≈ 10. In practice you need to account for the proportion already informed, so the actual fanout needed is a bit higher. Typical gossip systems use fanout of 3-5 with more rounds.
Model Answer
Use push-pull gossip with redundancy: each node gossips to more than one peer per round (fanout > 1). With f = 3 and a 20% failure rate, the probability of any single path being blocked is 0.2. After k rounds, the probability that a node has received the message along no path at all decreases exponentially. Combine with anti-entropy: nodes periodically ask random peers for messages they may have missed (pull). Add message IDs for deduplication, and use Bloom filters on the receiver side to quickly check "have I seen this ID?"
Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.
Common Mistakes
The top 5 mistakes builders make in this track — and exactly how to fix them.
Why it happens
If every node fans out to every other node each round, you get n*(n-1) messages per round. Gossip is only efficient when each node fans out to a small, random subset (fanout of 2-3 is typical).
The fix
Pick a random subset of neighbours on each gossip tick (fanout = 2 or 3). Full convergence still occurs in O(log n) rounds while keeping per-round message count at O(n * fanout).
Why it happens
Without tracking which message IDs have been processed, a node relays a message it has already seen, which creates cycles.
The fix
Maintain a `seen` set of message IDs. Only gossip a message to a neighbour if you have not received confirmation that the neighbour already has it (or simply only forward to neighbours who have not yet gossiped it back to you).
Why it happens
Gossip with no retries assumes reliable delivery. In Maelstrom's partition workloads, some sends are silently dropped.
The fix
Track which neighbours have acknowledged receipt of each message and retry until acknowledged. The gossip tick should re-send unacknowledged messages, not just new ones.
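One shape this bookkeeping can take (names are illustrative): track unacknowledged message IDs per neighbour and have the gossip tick re-send them.

```go
package gossip

import "sync"

// ackTracker remembers, per neighbour, which messages have not been
// acknowledged. Each gossip tick re-sends everything still pending, so
// sends dropped during a partition are retried until an ack arrives.
type ackTracker struct {
	mu      sync.Mutex
	pending map[string]map[int]bool // neighbour -> unacked message IDs
}

func (t *ackTracker) markSent(peer string, msg int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.pending == nil {
		t.pending = map[string]map[int]bool{}
	}
	if t.pending[peer] == nil {
		t.pending[peer] = map[int]bool{}
	}
	t.pending[peer][msg] = true
}

func (t *ackTracker) markAcked(peer string, msg int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pending[peer], msg)
}

// unacked returns the messages to re-send to peer on this tick.
func (t *ackTracker) unacked(peer string) []int {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make([]int, 0, len(t.pending[peer]))
	for m := range t.pending[peer] {
		out = append(out, m)
	}
	return out
}
```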
Why it happens
Blocking RPCs inside the periodic tick prevent the node from processing its message queue. The tick fires again before the first one completes, causing queuing and timeouts.
The fix
Use fire-and-forget sends or async callbacks for gossip propagation. Only block on RPCs inside dedicated RPC handler goroutines.
Why it happens
The gossip interval directly controls the trade-off between convergence speed and message overhead.
The fix
Start with a 50-100 ms interval for a 5-node cluster. Measure p95 latency in Maelstrom output and tune the interval to meet the constraint without exceeding the message budget.
Comparison Mode
Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track.
| Dimension | Push | Pull | Push-Pull |
|---|---|---|---|
| How it works | Node sends its state to a random peer | Node requests state from a random peer | Node sends its state AND requests peer state in one round |
| Convergence speed | Fast early (new info spreads quickly) | Slow early, fast later | Fastest overall |
| Bandwidth usage | Sender pays full cost | Receiver triggers sender to pay | Both pay, but convergence in fewer rounds |
| Message count per round | 1 message | 2 messages (request + response) | 2 messages, but no extra round needed |
| Latency to convergence | O(log n) rounds | O(log n) rounds (slower constant) | O(log n) rounds (best constant) |
| Used in | Memcached, epidemic broadcast | Anti-entropy in Cassandra | Serf, SWIM protocol |
Concepts Covered
Prerequisites
It is recommended to complete the previous tracks before starting this one. Concepts build progressively throughout the curriculum.