ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/gossiper

The Gossiper

Intermediate
Foundations | 19 tasks

Implement efficient information propagation across a cluster. You will build broadcast protocols from basic flooding to optimized gossip with batching, learning how distributed systems share information without central coordination.

Subtracks & Tasks

Interview Prep

Common interview questions for Distributed Systems / Infrastructure Engineer roles that map directly to what you build in this track.

Model Answer

Cassandra uses a gossip protocol in which each node gossips once per second with up to three peers chosen at random. Peers exchange gossip digests (node ID + version) in three phases: SYN (send digest), ACK (send the information the digest shows is missing), ACK2 (confirm receipt). Convergence takes O(log N) rounds. Gossip provides eventual consistency for cluster state, not strong consistency.
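The digest-comparison step can be sketched as follows. This is a minimal illustration under simplified assumptions (a flat node-to-version map), not Cassandra's actual implementation; the `Digest` type and `diff` function are hypothetical names.

```go
package main

import "fmt"

// Digest maps node ID -> highest state version seen for that node.
// Illustrative sketch only, not Cassandra's real data structures.
type Digest map[string]int

// diff compares a peer's digest (received in a SYN) against local state:
// request lists nodes where the peer is ahead (ask for their newer state),
// send lists nodes where we are ahead (ship ours back in the ACK).
func diff(local, remote Digest) (request, send []string) {
	for node, remoteVer := range remote {
		if localVer, ok := local[node]; !ok || localVer < remoteVer {
			request = append(request, node)
		}
	}
	for node, localVer := range local {
		if remoteVer, ok := remote[node]; !ok || remoteVer < localVer {
			send = append(send, node)
		}
	}
	return request, send
}

func main() {
	local := Digest{"n1": 5, "n2": 3}
	remote := Digest{"n1": 4, "n3": 7}
	request, send := diff(local, remote)
	fmt.Println(request, send)
}
```

The same comparison runs on both sides, which is why a single SYN/ACK/ACK2 exchange leaves both peers fully up to date on each other's entries.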

Model Answer

Use SWIM-style gossip-based failure detection: periodic pings with indirect probing (if a direct ping fails, ask k peers to probe the target). Tunable parameters: ping interval, ping timeout, and k for indirect probing. Detecting failures within 10 seconds at a low false-positive rate requires a ping interval of roughly 1-2 s. A phi accrual failure detector (used by Akka and Cassandra) maintains a sliding window of heartbeat intervals and computes a continuous suspicion level rather than a binary up/down verdict.

Model Answer

Gossip: decentralized, no single point of failure, O(log N) convergence, eventual consistency, scales to thousands of nodes. Used when you can tolerate eventual consistency for cluster state. ZooKeeper: strongly consistent, linearizable reads, immediate failover detection, but requires a quorum of ZK nodes, has lower write throughput. Use ZooKeeper for configuration, leader election, and distributed locks where strong consistency is required. Use gossip for membership and health state where eventual consistency is acceptable.

Model Answer

With fanout f and k rounds, the number of informed nodes grows as f^k. For 3 rounds to reach 1000 nodes: f^3 ≈ 1000, so f ≈ 10. In practice you need to account for the proportion already informed, so the actual fanout needed is a bit higher. Typical gossip systems use fanout of 3-5 with more rounds.
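The arithmetic above can be checked with a small helper. The `minFanout` name is illustrative; the function implements the idealized f^k bound, which assumes every gossip message reaches a node that has not yet heard the rumor.

```go
package main

import "fmt"

// minFanout returns the smallest integer fanout f with f^rounds >= n.
// This is the idealized bound: real deployments need slightly more fanout
// (or more rounds) because some gossips land on already-informed nodes.
func minFanout(n, rounds int) int {
	for f := 1; ; f++ {
		informed := 1
		for r := 0; r < rounds; r++ {
			informed *= f
		}
		if informed >= n {
			return f
		}
	}
}

func main() {
	fmt.Println(minFanout(1000, 3)) // 10, matching the estimate above
	// With the typical fanout of 3, the same 1000 nodes need more rounds:
	fmt.Println(minFanout(1000, 7)) // fanout 3 suffices at 7 rounds (3^7 = 2187)
}
```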

Model Answer

Use push-pull gossip with redundancy: each node gossips to more than one peer per round (fanout > 1). With f = 3 and a 20% drop rate, any single send is lost with probability 0.2, but all three sends in a round are lost with probability 0.2^3 ≈ 0.008. After k rounds, the probability that a node has still not received the message via any path decreases exponentially. Combine this with anti-entropy: nodes periodically ask random peers for messages they may have missed (pull). Add message IDs for deduplication; a Bloom filter on the receiver side gives a fast "have I seen this ID?" check.

Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.

Common Mistakes

The top 5 mistakes builders make in this track — and exactly how to fix them.

Why it happens

If every node fans out to every other node each round, you get n*(n-1) messages per round. Gossip is only efficient when each node fans out to a small, random subset (fanout of 2-3 is typical).

The fix

Pick a random subset of neighbours on each gossip tick (fanout = 2 or 3). Full convergence still occurs in O(log n) rounds while keeping per-round message count at O(n * fanout).

Why it happens

Without tracking which message IDs have been processed, a node relays a message it has already seen, which creates cycles.

The fix

Maintain a `seen` set of message IDs. Only gossip a message to a neighbour if you have not received confirmation that the neighbour already has it (or simply only forward to neighbours who have not yet gossiped it back to you).
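A minimal sketch of the `seen`-set check (type and method names are illustrative):

```go
package main

import "fmt"

// Node tracks which broadcast message IDs it has already processed.
type Node struct {
	seen map[string]bool
}

// handle returns true exactly once per message ID; callers forward the
// message to neighbours only on that first true, which breaks relay cycles.
func (n *Node) handle(msgID string) bool {
	if n.seen[msgID] {
		return false // duplicate: deliver nothing, forward nothing
	}
	n.seen[msgID] = true
	return true
}

func main() {
	n := &Node{seen: map[string]bool{}}
	fmt.Println(n.handle("m1"), n.handle("m1")) // true false
}
```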

Why it happens

Gossip with no retries assumes reliable delivery. In Maelstrom's partition workloads, some sends are silently dropped.

The fix

Track which neighbours have acknowledged receipt of each message and retry until acknowledged. The gossip tick should re-send unacknowledged messages, not just new ones.
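One way to structure the ack bookkeeping, as a sketch (the `Gossiper` type and its methods are hypothetical names, and real code would add locking for concurrent handlers):

```go
package main

import "fmt"

// Gossiper tracks, per neighbour, the message IDs not yet acknowledged.
type Gossiper struct {
	pending map[string]map[string]bool // neighbour -> unacked message IDs
}

// enqueue records that a neighbour still owes us an ack for msgID.
func (g *Gossiper) enqueue(neighbour, msgID string) {
	if g.pending[neighbour] == nil {
		g.pending[neighbour] = map[string]bool{}
	}
	g.pending[neighbour][msgID] = true
}

// onAck clears a message once the neighbour confirms receipt.
func (g *Gossiper) onAck(neighbour, msgID string) {
	delete(g.pending[neighbour], msgID)
}

// tick returns everything still unacked, so each gossip round re-sends old
// messages alongside new ones until every neighbour has acknowledged.
func (g *Gossiper) tick() map[string][]string {
	out := map[string][]string{}
	for neighbour, ids := range g.pending {
		for id := range ids {
			out[neighbour] = append(out[neighbour], id)
		}
	}
	return out
}

func main() {
	g := &Gossiper{pending: map[string]map[string]bool{}}
	g.enqueue("n2", "m1")
	g.enqueue("n2", "m2")
	g.onAck("n2", "m1")
	fmt.Println(g.tick()) // only m2 remains to re-send
}
```

Because the tick re-sends everything pending, messages dropped during a partition are retried automatically once the partition heals.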

Why it happens

Blocking RPCs inside the periodic tick prevent the node from processing its message queue. The tick fires again before the first one completes, causing queuing and timeouts.

The fix

Use fire-and-forget sends or async callbacks for gossip propagation. Only block on RPCs inside dedicated RPC handler goroutines.

Why it happens

The gossip interval directly controls the trade-off between convergence speed and message overhead.

The fix

Start with a 50-100 ms interval for a 5-node cluster. Measure p95 latency in Maelstrom output and tune the interval to meet the constraint without exceeding the message budget.

Comparison Mode

Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track.

| Dimension | Push | Pull | Push-Pull |
|---|---|---|---|
| How it works | Node sends its state to a random peer | Node requests state from a random peer | Node sends its state AND requests peer state in one round |
| Convergence speed | Fast early (new info spreads quickly) | Slow early, fast later | Fastest overall |
| Bandwidth usage | Sender pays full cost | Receiver triggers sender to pay | Both pay, but convergence takes fewer rounds |
| Messages per round | 1 message | 2 messages (request + response) | 2 messages, but no extra round needed |
| Latency to convergence | O(log n) rounds | O(log n) rounds (slower constant) | O(log n) rounds (best constant) |
| Used in | Memcached, epidemic broadcast | Anti-entropy in Cassandra | Serf, SWIM protocol |

Verdict: Push-Pull has the best convergence per round. Use Push for initial epidemic spread; add Pull for anti-entropy repair of stale state.
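A single push-pull exchange can be sketched over versioned state (the `pushPull` name and max-version merge are illustrative assumptions; any state-merge rule with the same "both directions in one round" shape works):

```go
package main

import "fmt"

// pushPull performs one exchange between two nodes' versioned state:
// a pushes its entries to b, then pulls b's entries back, leaving both
// with the element-wise maximum version after a single round trip.
func pushPull(a, b map[string]int) {
	for k, v := range a { // push: a's newer entries flow to b
		if b[k] < v {
			b[k] = v
		}
	}
	for k, v := range b { // pull: b's newer entries flow back to a
		if a[k] < v {
			a[k] = v
		}
	}
}

func main() {
	a := map[string]int{"n1": 5}
	b := map[string]int{"n1": 3, "n2": 7}
	pushPull(a, b)
	fmt.Println(a, b) // both now hold n1:5 and n2:7
}
```

Pure push would need b to gossip back in a later round before a learned about n2; the combined exchange does both directions in one round, which is the "best constant" in the table above.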

Concepts Covered

broadcast, flooding, message propagation, tree topology, spanning tree, efficient propagation, gossip protocol, random selection, probabilistic broadcast, batching, throughput optimization, latency tradeoff, network partitions, resynchronization, anti-entropy, fanout, random peer selection, probability, gossip reliability, fanout analysis, convergence, periodic gossip, convergence time, pull gossip, digest, set reconciliation, bandwidth optimization, parameter tuning, messages-per-op, gossip optimization, tree broadcast, overlay network, message forwarding, fault tolerance, tree failure, ack timeout, direct delivery, hybrid broadcast, tree overlay, gossip fallback, convergence speed, network partition, partition healing, split brain, G-Set, CRDT, set union, eventual consistency, 2P-Set, tombstone set, add-remove semantics, LWW register, conflict resolution, timestamp ordering, gossip replication, LWW limitation, data loss, version vectors, conflict detection, benchmarking, message overhead, consistency

Prerequisites

It is recommended to complete the previous tracks before starting this one. Concepts build progressively throughout the curriculum.