Subtracks & Tasks
Naive Broadcast (Flooding)
Implement Basic Broadcast to All Nodes
Implement a broadcast system where messages sent to any node eventually reach all nodes. Handle three message types: 1. topology: Tells you your neighbors...
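A minimal sketch of the flooding approach in plain Go, kept independent of any particular Maelstrom client library (the `send` field below is a stand-in for whatever transport you wire up):

```go
package broadcast

import "sync"

// Node holds the state flooding needs: the messages seen so far and the
// neighbor list delivered by the topology message.
type Node struct {
	mu        sync.Mutex
	seen      map[int]bool
	neighbors []string
	send      func(dst string, msg int) // transport hook; assumed, not shown
}

// HandleBroadcast stores a new message and floods it to every neighbor.
// The seen-set check is what stops the flood from looping forever.
func (n *Node) HandleBroadcast(src string, msg int) {
	n.mu.Lock()
	if n.seen[msg] {
		n.mu.Unlock()
		return // already processed; do not re-flood
	}
	n.seen[msg] = true
	peers := append([]string(nil), n.neighbors...)
	n.mu.Unlock()

	for _, p := range peers {
		if p != src { // no need to echo back to the sender
			n.send(p, msg)
		}
	}
}

// HandleRead returns every message this node has seen so far.
func (n *Node) HandleRead() []int {
	n.mu.Lock()
	defer n.mu.Unlock()
	out := make([]int, 0, len(n.seen))
	for m := range n.seen {
		out = append(out, m)
	}
	return out
}
```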
Build Flat Tree Topology Gossip
Optimize your broadcast by using a tree topology. Instead of flooding to all neighbors, organize nodes into a spanning tree where each message travels...
Implement Peer-to-Peer Gossip with Random Neighbors
Implement a gossip protocol where each node randomly selects neighbors to share information with. This provides robustness against node failures while...
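A sketch of the peer-selection step, assuming you already have the full node list; only the random sampling logic is shown:

```go
package gossip

import "math/rand"

// pickPeers returns up to k distinct peers chosen uniformly at random,
// excluding this node's own ID. k is the protocol's fanout.
func pickPeers(self string, all []string, k int) []string {
	candidates := make([]string, 0, len(all))
	for _, id := range all {
		if id != self {
			candidates = append(candidates, id)
		}
	}
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if k > len(candidates) {
		k = len(candidates)
	}
	return candidates[:k]
}
```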
Add Message Batching to Reduce Network Overhead
Reduce network overhead by batching multiple messages into single transmissions. Instead of sending immediately, buffer messages and flush periodically...
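One way to structure the buffering, sketched with a hypothetical `send` callback that delivers a whole batch per destination:

```go
package gossip

import (
	"sync"
	"time"
)

// Batcher buffers outgoing messages per destination and flushes on a
// timer, trading a little latency for far fewer network sends.
type Batcher struct {
	mu      sync.Mutex
	pending map[string][]int              // destination -> buffered messages
	send    func(dst string, batch []int) // transport hook; assumed, not shown
}

func NewBatcher(send func(string, []int), every time.Duration) *Batcher {
	b := &Batcher{pending: map[string][]int{}, send: send}
	go func() {
		for range time.Tick(every) { // sketch only: this ticker never stops
			b.Flush()
		}
	}()
	return b
}

// Queue records a message for later delivery instead of sending it now.
func (b *Batcher) Queue(dst string, msg int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[dst] = append(b.pending[dst], msg)
}

// Flush sends each destination one batch containing everything queued.
func (b *Batcher) Flush() {
	b.mu.Lock()
	batches := b.pending
	b.pending = map[string][]int{}
	b.mu.Unlock()
	for dst, msgs := range batches {
		b.send(dst, msgs)
	}
}
```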
Handle Network Partition Healing and Resynchronization
Handle the scenario where network partitions heal and previously isolated nodes reconnect. Implement anti-entropy mechanisms to synchronize message sets...
Gossip Protocol
Implement Gossip Fanout with Random Peer Selection
Gossip protocols spread information probabilistically: instead of broadcasting to all nodes, each node forwards to K random peers. This is more resilient...
Calculate Minimum Fanout for Reliable Delivery
What fanout K guarantees that all N nodes receive a message with probability >= 0.99? The theory says that after R rounds of gossip with fanout K, the...
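As a back-of-the-envelope model (an idealization; the task's exact formula is the one to use for the probability bound), each round roughly multiplies the informed population by K + 1, so covering N nodes needs about log base (K+1) of N rounds:

```go
package gossip

import "math"

// roundsToCover estimates the gossip rounds needed to inform all n nodes
// under the idealized growth model informed ≈ (k+1)^r. Real runs need a
// safety margin, since later rounds waste sends on already-informed nodes.
func roundsToCover(n, k int) int {
	return int(math.Ceil(math.Log(float64(n)) / math.Log(float64(k+1))))
}
```

For example, roundsToCover(25, 3) = ceil(ln 25 / ln 4) = ceil(2.32) = 3 rounds for a 25-node cluster.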
Add Periodic Gossip Rounds on a Timer
Instead of only gossiping when a new broadcast arrives, add **periodic gossip rounds**: every interval, pick K random peers and send them all your known messages...
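A sketch of the timer loop; the `round` callback is where your peer selection and send logic would go:

```go
package gossip

import "time"

// StartGossipLoop fires a gossip round every interval, independent of
// incoming broadcasts, so messages dropped earlier are eventually
// re-sent. It returns a stop function that shuts the loop down.
func StartGossipLoop(interval time.Duration, round func()) (stop func()) {
	ticker := time.NewTicker(interval)
	done := make(chan struct{})
	go func() {
		for {
			select {
			case <-ticker.C:
				round() // e.g. send known state to K random peers
			case <-done:
				ticker.Stop()
				return
			}
		}
	}()
	return func() { close(done) }
}
```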
Implement Anti-Entropy with Digest Comparison
Full state sync wastes bandwidth when nodes are mostly in sync. **Anti-entropy** optimizes this: first exchange compact digests. Only transfer full state...
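One way to build the digest, assuming message IDs are integers: hash the sorted IDs, and skip the full transfer when two digests match.

```go
package gossip

import (
	"crypto/sha256"
	"encoding/binary"
	"sort"
)

// digest hashes the sorted message IDs. Two nodes with equal digests are
// (almost certainly) in sync and can skip the full state exchange.
func digest(seen map[int]bool) [32]byte {
	ids := make([]int, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Ints(ids) // sort so every node hashes the same byte stream
	h := sha256.New()
	buf := make([]byte, 8)
	for _, id := range ids {
		binary.BigEndian.PutUint64(buf, uint64(id))
		h.Write(buf)
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}
```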
Tune Gossip Parameters for Maelstrom Broadcast
The Maelstrom broadcast workload requires messages-per-op < 30 under network partitions. Your task is to implement configurable gossip parameters and ...
Topology-Aware Gossip
Implement Tree-Based Broadcast Overlay
Random gossip wastes messages because nodes may receive duplicates. A **spanning tree** ensures each node receives exactly one copy: the root broadcasts...
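One coordination-free way to build the tree (an assumed convention, not the only one): every node sorts the same ID list and reads off a k-ary heap layout, so all nodes agree on parent and children without exchanging any messages.

```go
package gossip

import "sort"

// treePeers maps sorted node IDs onto an implicit k-ary tree and returns
// this node's parent and children. Index 0 is the root; the parent of
// index i is (i-1)/k and its children are i*k+1 .. i*k+k.
func treePeers(self string, all []string, k int) (parent string, children []string) {
	ids := append([]string(nil), all...)
	sort.Strings(ids)
	me := sort.SearchStrings(ids, self)
	if me > 0 {
		parent = ids[(me-1)/k]
	}
	for c := me*k + 1; c <= me*k+k && c < len(ids); c++ {
		children = append(children, ids[c])
	}
	return parent, children
}
```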
Handle Tree Node Failure with Direct Fallback
Tree broadcast is efficient but fragile: if a node crashes, all its descendants lose connectivity. Your task is to add **failure detection** with direct fallback...
Implement Hybrid Tree and Gossip Broadcast
Pure tree broadcast is fast but fragile. Pure gossip is reliable but slow and wasteful. A **hybrid** approach uses tree for the first hop (fast, efficient...
Simulate Network Partition and Healing
Network partitions split the cluster into isolated groups. After the partition heals, gossip must merge the diverged states. Your task is to simulate ...
Epidemic Algorithms and CRDT Gossip
Implement Grow-Only Set (G-Set) with Gossip
A **Grow-only Set (G-Set)** is the simplest CRDT. Elements can be added but never removed. Merge is set union, which is commutative, associative, and idempotent...
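A minimal G-Set in Go; the whole CRDT is a map plus a union-based Merge:

```go
package crdt

// GSet is a grow-only set: Add and Merge, never Remove. Merge is set
// union, which is commutative, associative, and idempotent, so replicas
// can merge in any order, any number of times, and still converge.
type GSet struct{ elems map[string]bool }

func NewGSet() *GSet { return &GSet{elems: map[string]bool{}} }

func (s *GSet) Add(e string)           { s.elems[e] = true }
func (s *GSet) Contains(e string) bool { return s.elems[e] }

// Merge folds another replica's elements into this one (set union).
func (s *GSet) Merge(other *GSet) {
	for e := range other.elems {
		s.elems[e] = true
	}
}
```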
Implement Two-Phase Set (2P-Set)
A **2P-Set** (two-phase set) supports both add and remove by maintaining two G-Sets: the add-set and the remove-set (tombstones). An element is in the...
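A sketch building directly on the G-Set idea; note how Remove only ever adds a tombstone, which is why a removed element can never come back:

```go
package crdt

// TwoPSet supports add and remove using two grow-only sets. Once an
// element is tombstoned it can never be re-added: remove wins forever,
// which is the key limitation of this design.
type TwoPSet struct {
	added   map[string]bool
	removed map[string]bool // tombstones
}

func NewTwoPSet() *TwoPSet {
	return &TwoPSet{added: map[string]bool{}, removed: map[string]bool{}}
}

func (s *TwoPSet) Add(e string)    { s.added[e] = true }
func (s *TwoPSet) Remove(e string) { s.removed[e] = true }

// Contains: present means added and not tombstoned.
func (s *TwoPSet) Contains(e string) bool {
	return s.added[e] && !s.removed[e]
}

// Merge unions both underlying G-Sets independently.
func (s *TwoPSet) Merge(o *TwoPSet) {
	for e := range o.added {
		s.added[e] = true
	}
	for e := range o.removed {
		s.removed[e] = true
	}
}
```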
Implement Last-Writer-Wins Key-Value Store
A **Last-Writer-Wins (LWW)** register resolves conflicts by always keeping the value with the latest timestamp. This is simple but can lose concurrent writes...
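A sketch of an LWW map; the node-ID tie-breaker is an assumption added here so that replicas with equal timestamps still converge deterministically:

```go
package crdt

// lwwEntry pairs a value with the timestamp of the write that produced
// it. Higher timestamp wins; ties break on node ID so every replica
// picks the same winner.
type lwwEntry struct {
	Value  string
	TS     int64  // writer's timestamp
	NodeID string // deterministic tie-breaker
}

// LWWMap is a last-writer-wins key-value store.
type LWWMap struct{ data map[string]lwwEntry }

func NewLWWMap() *LWWMap { return &LWWMap{data: map[string]lwwEntry{}} }

func (m *LWWMap) Put(key, value string, ts int64, node string) {
	m.maybeReplace(key, lwwEntry{value, ts, node})
}

// Merge applies every entry from another replica under the same rule, so
// gossip in any order converges to the same map.
func (m *LWWMap) Merge(o *LWWMap) {
	for k, e := range o.data {
		m.maybeReplace(k, e)
	}
}

func (m *LWWMap) maybeReplace(key string, e lwwEntry) {
	cur, ok := m.data[key]
	if !ok || e.TS > cur.TS || (e.TS == cur.TS && e.NodeID > cur.NodeID) {
		m.data[key] = e
	}
}
```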
Demonstrate LWW Data Loss with Version Vectors
LWW silently loses data when two clients write concurrently. Your task is to demonstrate this and implement a version-vector alternative that detects ...
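The heart of the version-vector alternative is the comparison: unlike timestamps, two vectors can come back "concurrent", which is exactly the case LWW silently papers over. A sketch:

```go
package crdt

// VersionVector maps node ID -> count of writes observed from that node.
type VersionVector map[string]int64

// Compare reports the causal relationship between two vectors. When each
// side has seen writes the other has not, the writes are concurrent, and
// a store should keep both values (siblings) instead of dropping one.
func Compare(a, b VersionVector) string {
	aAhead, bAhead := false, false
	for node, av := range a {
		if av > b[node] { // missing keys read as 0
			aAhead = true
		}
	}
	for node, bv := range b {
		if bv > a[node] {
			bAhead = true
		}
	}
	switch {
	case aAhead && bAhead:
		return "concurrent"
	case aAhead:
		return "a dominates b"
	case bAhead:
		return "b dominates a"
	default:
		return "equal"
	}
}
```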
Benchmark Gossip KV Store Performance
Benchmark your gossip KV store to measure real-world performance characteristics: 1. **Convergence time**: How long until all replicas have the same ...
Interview Prep
Common interview questions for Distributed Systems / Infrastructure Engineer roles that map directly to what you build in this track.
Model Answer
Cassandra uses a gossip protocol where each node gossips with three random peers per second. It exchanges Gossip Digest (node + version) in three phases: SYN (send digest), ACK (send missing info), ACK2 (confirm receipt). Convergence time is O(log N) rounds. It provides eventual consistency for cluster state, not strong consistency.
Model Answer
Use SWIM-style gossip-based failure detection: periodic ping with indirect probing (if direct ping fails, ask k peers to probe). Tunable: ping interval, ping timeout, k for indirect probing. 10-second detection with low false positives requires ping interval around 1-2s. Phi accrual failure detector (used by Akka/Cassandra) maintains a sliding window of heartbeat intervals and computes a suspicion level rather than a binary up/down.
Model Answer
Gossip: decentralized, no single point of failure, O(log N) convergence, eventual consistency, scales to thousands of nodes. Used when you can tolerate eventual consistency for cluster state. ZooKeeper: strongly consistent, linearizable reads, immediate failover detection, but requires a quorum of ZK nodes, has lower write throughput. Use ZooKeeper for configuration, leader election, and distributed locks where strong consistency is required. Use gossip for membership and health state where eventual consistency is acceptable.
Model Answer
With fanout f and k rounds, the number of informed nodes grows as f^k. For 3 rounds to reach 1000 nodes: f^3 ≈ 1000, so f ≈ 10. In practice you need to account for the proportion already informed, so the actual fanout needed is a bit higher. Typical gossip systems use fanout of 3-5 with more rounds.
Model Answer
Use push-pull gossip with redundancy: each node gossips to more than one peer per round (fanout > 1). With f = 3 and a 20% failure rate, the probability of any single path being blocked is 0.2. After k rounds, the probability that a node has received the message along no path at all decreases exponentially. Combine with anti-entropy: nodes periodically ask random peers for messages they may have missed (pull). Add message IDs for deduplication, and use Bloom filters on the receiver side to quickly check "have I seen this ID?"
Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.
Common Mistakes
The top 5 mistakes builders make in this track — and exactly how to fix them.
Why it happens
If every node fans out to every other node each round, you get n*(n-1) messages per round. Gossip is only efficient when each node fans out to a small, random subset (fanout of 2-3 is typical).
The fix
Pick a random subset of neighbours on each gossip tick (fanout = 2 or 3). Full convergence still occurs in O(log n) rounds while keeping per-round message count at O(n * fanout).
Why it happens
Without tracking which message IDs have been processed, a node relays a message it has already seen, which creates cycles.
The fix
Maintain a `seen` set of message IDs. Only gossip a message to a neighbour if you have not received confirmation that the neighbour already has it (or simply only forward to neighbours who have not yet gossiped it back to you).
Why it happens
Gossip with no retries assumes reliable delivery. In Maelstrom's partition workloads, some sends are silently dropped.
The fix
Track which neighbours have acknowledged receipt of each message and retry until acknowledged. The gossip tick should re-send unacknowledged messages, not just new ones.
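One shape this bookkeeping can take (names are illustrative): track unacknowledged message IDs per neighbour and have the gossip tick re-send them.

```go
package gossip

import "sync"

// ackTracker remembers, per neighbour, which messages have not been
// acknowledged. Each gossip tick re-sends everything still pending, so
// sends dropped during a partition are retried until an ack arrives.
type ackTracker struct {
	mu      sync.Mutex
	pending map[string]map[int]bool // neighbour -> unacked message IDs
}

func (t *ackTracker) markSent(peer string, msg int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.pending == nil {
		t.pending = map[string]map[int]bool{}
	}
	if t.pending[peer] == nil {
		t.pending[peer] = map[int]bool{}
	}
	t.pending[peer][msg] = true
}

func (t *ackTracker) markAcked(peer string, msg int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pending[peer], msg)
}

// unacked returns the messages to re-send to peer on this tick.
func (t *ackTracker) unacked(peer string) []int {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make([]int, 0, len(t.pending[peer]))
	for m := range t.pending[peer] {
		out = append(out, m)
	}
	return out
}
```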
Why it happens
Blocking RPCs inside the periodic tick prevent the node from processing its message queue. The tick fires again before the first one completes, causing queuing and timeouts.
The fix
Use fire-and-forget sends or async callbacks for gossip propagation. Only block on RPCs inside dedicated RPC handler goroutines.
Why it happens
The gossip interval directly controls the trade-off between convergence speed and message overhead.
The fix
Start with a 50-100 ms interval for a 5-node cluster. Measure p95 latency in Maelstrom output and tune the interval to meet the constraint without exceeding the message budget.
Comparison Mode
Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track.
| Dimension | Push | Pull | Push-Pull |
|---|---|---|---|
| How it works | Node sends its state to a random peer | Node requests state from a random peer | Node sends its state AND requests peer state in one round |
| Convergence speed | Fast early (new info spreads quickly) | Slow early, fast later | Fastest overall |
| Bandwidth usage | Sender pays full cost | Receiver triggers sender to pay | Both pay, but convergence in fewer rounds |
| Message count per round | 1 message | 2 messages (request + response) | 2 messages, but no extra round needed |
| Latency to convergence | O(log n) rounds | O(log n) rounds (slower constant) | O(log n) rounds (best constant) |
| Used in | Memcached, epidemic broadcast | Anti-entropy in Cassandra | Serf, SWIM protocol |
Concepts Covered
Prerequisites
It is recommended to complete the previous tracks before starting this one. Concepts build progressively throughout the curriculum.