ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/messenger

The Messenger

Beginner | Foundations | 15 tasks

Build the foundation of distributed communication. You will implement a Maelstrom node that handles JSON messages, processes initialization, and responds to echo requests. This track teaches the fundamental protocol that underlies all subsequent challenges.

Subtracks & Tasks

Interview Prep

Common interview questions for Distributed Systems / Backend Engineer roles that map directly to what you build in this track.

Model Answer

Message queues (Kafka, SQS) for async decoupling; gRPC/HTTP for synchronous RPCs. Each message must be self-contained. Discuss at-least-once vs at-most-once delivery, idempotency keys to handle retries, and correlation IDs for request tracing across services.

Model Answer

Correlation/request IDs: the sender attaches a unique ID to each outgoing request. The receiver echoes this ID in its response. The sender maps incoming response IDs to pending callbacks. This is how HTTP/2 stream IDs, Kafka consumer group offsets, and Maelstrom msg_ids all work.
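A minimal sketch of this bookkeeping in Python (the `RpcClient` name and its methods are illustrative, not from any specific framework):

```python
import itertools

class RpcClient:
    """Illustrative sketch of correlation-ID tracking."""

    def __init__(self):
        self._next_id = itertools.count(1)  # monotonically increasing msg_ids
        self._pending = {}                  # msg_id -> callback awaiting a response

    def send(self, body, callback):
        # Attach a unique ID so the eventual response can be matched back.
        msg_id = next(self._next_id)
        body["msg_id"] = msg_id
        self._pending[msg_id] = callback
        return body  # a real client would now serialise this onto the wire

    def on_response(self, body):
        # The responder echoed our ID in in_reply_to; fire the pending callback.
        callback = self._pending.pop(body["in_reply_to"], None)
        if callback is not None:
            callback(body)
```

This is single-threaded; a concurrent client would guard `_pending` with a lock.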

Model Answer

Use timeouts with exponential backoff and jitter. Retry only idempotent operations (GET, PUT with full replacement) or operations with idempotency keys. Use circuit breakers to stop retrying against a consistently failing downstream. Distinguish between 503 (retry) and 400/404 (do not retry).
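Sketched in Python with "full jitter" (the function name and default parameters are illustrative):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base=0.1, cap=5.0):
    """Retry an idempotent operation with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep a random amount up to the capped exponential.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

In a real system the `except` clause would catch only retryable errors (timeouts, 503s), never client errors like 400/404.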

Model Answer

A message broker (Kafka, RabbitMQ) provides async, durable message delivery with decoupled producers and consumers. A service mesh (Istio, Linkerd) handles synchronous service-to-service traffic with features like mTLS, retries, circuit breaking, and observability. Use a broker when you need temporal decoupling or fan-out; use a mesh for sync RPC with cross-cutting network concerns.

Model Answer

Check for duplicate delivery: is the queue at-least-once? Is the consumer crashing after processing but before acknowledging? Add idempotency: track processed message IDs in a store and skip duplicates. Use exactly-once semantics in Kafka (requires transactions + idempotent producers) for critical flows. Log message IDs at each processing step to trace duplicates.
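A minimal in-memory dedup sketch in Python (the class name is illustrative; a production system would keep the seen-set in a durable store so it survives restarts):

```python
class IdempotentConsumer:
    """Skip messages whose ID has already been processed."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # processed message IDs

    def consume(self, message):
        msg_id = message["msg_id"]
        if msg_id in self._seen:
            return False  # duplicate delivery: drop it
        self._handler(message)
        self._seen.add(msg_id)  # record only after successful processing
        return True
```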

Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.

Common Mistakes

The top five mistakes builders make in this track, and exactly how to fix them.

Why it happens

Maelstrom sends one message per line over a long-lived stdin pipe. Reading all of stdin at once (e.g. `sys.stdin.read()`) blocks until EOF, which never arrives while the process is running, so the node never handles a single message.

The fix

Read stdin line by line in a loop using a buffered reader. Each newline-delimited JSON object is one complete message.

Why it happens

Most languages buffer stdout by default. The written bytes sit in the runtime's userspace buffer and are never flushed to the pipe that Maelstrom is reading.

The fix

Explicitly flush stdout after every `json.dump` / `fmt.Println` call. In Python use `sys.stdout.flush()`. In Go, `os.Stdout` is unbuffered by default, but wrapping it in a `bufio.Writer` requires a manual flush.

Why it happens

The reply must route back to the requester: `dest` = the incoming message's `src`, and `src` = this node's own ID. Copying `src` and `dest` unchanged from the incoming message addresses the reply to yourself instead of the original sender.

The fix

Always set `reply.dest = incoming.src` and `reply.src = self.node_id`. Build a dedicated `reply()` helper so this is never done ad-hoc.
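A minimal Python sketch of such a helper, using the Maelstrom envelope fields:

```python
def reply(node_id, incoming, body):
    """Build a response envelope routed back to the requester."""
    return {
        "src": node_id,           # this node is the new sender
        "dest": incoming["src"],  # route back to the original sender
        "body": body,
    }
```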

Why it happens

The protocol requires `body.in_reply_to` to equal the original `body.msg_id`. Without it Maelstrom cannot correlate responses to requests.

The fix

Copy `incoming.body.msg_id` into `reply.body.in_reply_to`. Most node frameworks do this automatically; if writing raw JSON, do it explicitly.
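If writing raw JSON, the correlation step can be sketched in Python as follows (the helper name is illustrative):

```python
def reply_body(incoming, **fields):
    """Build a response body correlated to the request via in_reply_to."""
    body = dict(fields)
    # Echo the request's msg_id so Maelstrom can match response to request.
    body["in_reply_to"] = incoming["body"]["msg_id"]
    return body
```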

Why it happens

Maelstrom may run a workload that fires many simultaneous RPCs. Any shared map or counter accessed from multiple goroutines / threads without synchronization is a data race.

The fix

Protect every shared data structure with a mutex (Go `sync.Mutex`) or a thread lock (Python `threading.Lock`). Alternatively, funnel all state mutations through a single goroutine or event loop so handlers never touch shared mutable state concurrently.
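A Python sketch of lock-protected shared state (the `Counter` class is illustrative):

```python
import threading

class Counter:
    """A counter safe to call from multiple handler threads.

    The lock makes the read-modify-write atomic; without it, two threads
    can both read the same value and one increment is lost.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:
            self._value += 1

    def value(self):
        with self._lock:
            return self._value
```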

Comparison Mode

Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track. Expand any comparison to see a detailed breakdown.

| Dimension        | JSON                         | Protobuf                       | MessagePack                  |
|------------------|------------------------------|--------------------------------|------------------------------|
| Human readable   | Yes                          | No (binary)                    | No (binary)                  |
| Payload size     | Large (field names repeated) | Small (field tags, no names)   | Medium (compact binary JSON) |
| Parse speed      | Slow                         | Fast                           | Fast                         |
| Schema required  | No                           | Yes (.proto file)              | No                           |
| Schema evolution | Flexible but fragile         | Excellent (field numbers)      | Fragile (no field names)     |
| Debugging ease   | Easy (curl, browser)         | Hard (need protoc)             | Hard (binary)                |
| Best for         | APIs, prototyping            | gRPC, high-throughput services | Cache serialisation, Redis   |

Verdict: Start with JSON for correctness, migrate to Protobuf when payload size or parse latency becomes a bottleneck.
| Dimension             | TCP                                     | UDP                                           |
|-----------------------|-----------------------------------------|-----------------------------------------------|
| Delivery guarantee    | Exactly once (at the OS level)          | Best effort (packets may be lost)             |
| Ordering              | In-order delivery guaranteed            | Out-of-order delivery possible                |
| Latency               | Higher (handshake, retransmit, ACKs)    | Lower (fire and forget)                       |
| Head-of-line blocking | Yes (one lost packet blocks the stream) | No (each datagram is independent)             |
| Connection setup      | 3-way handshake required                | Connectionless; first packet sent immediately |
| Typical use           | HTTP, databases, file transfer          | DNS, video streaming, gaming, QUIC            |

Verdict: TCP for reliability by default. UDP only when you implement your own reliability layer (like QUIC) or can tolerate loss (metrics, video).

Concepts Covered

JSON parsing, stdin/stdout, message format, initialization, node identity, cluster topology, RPC, request-response, message handling, validation, error handling, defensive programming, concurrency, event loop, async processing, synchronous communication, timeout, blocking calls, retry logic, fault tolerance, at-least-once delivery, asynchronous programming, callbacks, non-blocking I/O, event-driven, resource cleanup, memory leaks, periodic tasks, garbage collection, exponential backoff, jitter, congestion control, load management, serialization, deserialization, schema design, type safety, logging, observability, message tracing, timestamps, idempotency, deduplication, LRU cache, at-most-once delivery, benchmarking, throughput, latency, profiling, performance, chaos engineering, fault injection, resilience testing, network partitions