ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/queues

Queues

Advanced · Scalability · 10 tasks

Build message queue systems for decoupling producers and consumers. Implement various delivery guarantees from at-most-once to exactly-once, learning the patterns that enable reliable asynchronous communication.

Interview Prep

Common interview questions for Backend / Data Infrastructure Engineer roles that map directly to what you build in this track. Each question is followed by a model answer.

Q: Explain the difference between at-most-once, at-least-once, and exactly-once delivery guarantees.

Model Answer

At-most-once: messages may be lost, never duplicated. Producer fires and forgets; no retries. At-least-once: messages are never lost but may be duplicated. Producer retries on failure; the consumer may process the same message twice. Kafka producers achieve at-least-once with acks=all and retries enabled. Exactly-once: each message is processed exactly once. Kafka supports this since 0.11 via idempotent producers (deduplication by sequence number) and Kafka Streams transactions. It requires careful configuration and is slower. Most production systems use at-least-once plus idempotent consumers rather than exactly-once semantics in the broker.
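To make the trade-offs concrete, here is a minimal sketch of how the three modes map to producer settings, using confluent-kafka (librdkafka) option names; the broker address and transactional ID are placeholders, not part of the track.

```python
# Sketch: the three delivery modes as producer configuration
# (confluent-kafka / librdkafka option names; address and IDs are placeholders).
from confluent_kafka import Producer

# At-most-once: fire and forget. No broker ack, no retries -- a lost
# message stays lost, but nothing is ever duplicated.
at_most_once = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "0",
    "retries": 0,
})

# At-least-once: wait for all in-sync replicas and retry on failure.
# A lost ack causes a duplicate, so consumers must be idempotent.
at_least_once = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",
    "retries": 5,
})

# Exactly-once (producer side): the idempotent producer dedups retries
# by sequence number; transactions extend the guarantee across topics.
exactly_once = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,
    "transactional.id": "order-service-1",  # placeholder ID
})
```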

Q: How would you design an ingestion pipeline for 10 million IoT events per day?

Model Answer

10M/day is ~116 events/second — modest. Kafka handles millions/second. Architecture: IoT devices -> MQTT broker (lightweight protocol for constrained devices) -> Kafka (durable event log, partitioned by device ID for ordering) -> stream processor (Flink/Kafka Streams) for real-time aggregations -> time-series DB (InfluxDB/TimescaleDB) for storage -> dashboard. Key decisions: partition by device ID to ensure ordering per device; use compacted topics for latest-value semantics; set retention based on replay needs. Scale: a single Kafka broker handles this load comfortably; add replicas for durability.
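A sketch of the arithmetic and the keying decision, assuming confluent-kafka; the topic name, device ID, and payload are hypothetical.

```python
# Sketch: keying by device ID so Kafka preserves per-device ordering
# (confluent-kafka; topic and payload are illustrative).
import json
from confluent_kafka import Producer

events_per_day = 10_000_000
print(events_per_day / 86_400)  # ~115.7 events/second on average

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_reading(device_id: str, reading: dict) -> None:
    # Kafka hashes the key to choose a partition, so all events from one
    # device land on the same partition and keep their order.
    producer.produce("iot.readings",
                     key=device_id,
                     value=json.dumps(reading).encode())

publish_reading("device-42", {"temp_c": 21.5, "ts": 1714300000})
producer.flush()
```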

Q: A consumer group of 3 consumers reads a 3-partition Kafka topic. One consumer dies. What happens?

Model Answer

Kafka triggers a consumer group rebalance. The group coordinator detects the consumer's death via missed heartbeats within session.timeout.ms (default 10s). A new partition assignment is computed: the dead consumer's partition moves to one of the two survivors, so one consumer now owns 2 partitions and the other 1. During the rebalance, consumption pauses. The reassigned partition resumes from its last committed offset — any messages the dead consumer processed but had not yet committed are reprocessed (at-least-once delivery). To minimize rebalance disruption, use incremental cooperative rebalancing (KIP-429), available since Kafka 2.4.
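A sketch of the relevant consumer settings, assuming confluent-kafka (librdkafka option names); the group ID and topic are hypothetical.

```python
# Sketch: consumer settings for failure detection and incremental
# cooperative rebalancing (confluent-kafka / librdkafka option names).
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-processors",  # hypothetical group
    # Missed heartbeats inside this window mark the consumer dead and
    # trigger the rebalance described above.
    "session.timeout.ms": 10000,
    # KIP-429: only the dead consumer's partitions are reassigned; the
    # surviving consumers keep consuming through the rebalance.
    "partition.assignment.strategy": "cooperative-sticky",
    # Commit after processing for at-least-once semantics.
    "enable.auto.commit": False,
})
consumer.subscribe(["orders"])
```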

Q: What is a dead letter queue, and why do you need one?

Model Answer

A dead letter queue (DLQ) receives messages that failed processing after a configured number of retries. Scenario: an order-processing service receives a malformed order payload that always raises a JSON parsing exception. Without a DLQ, the message is retried indefinitely, blocking the queue (if FIFO) or wasting consumer capacity on futile retries. With a DLQ: after 3 failures, the message moves to the DLQ. Normal processing continues. An operator inspects the DLQ, identifies the malformed order, fixes the bug, and replays the message. DLQs are essential in any at-least-once delivery system for separating transient failures (retry) from poison pill messages (DLQ).
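A minimal sketch of the retry-then-DLQ decision, independent of any particular broker; `process`, the queue/DLQ handles, and the exception types are hypothetical stand-ins.

```python
# Sketch: retry-then-dead-letter routing. `process` and the queue/DLQ
# handles are hypothetical stand-ins for a real broker client.
MAX_ATTEMPTS = 3

class TransientError(Exception):
    """Timeout or downstream blip: retrying may succeed."""

class PoisonError(Exception):
    """Deterministic failure (e.g. malformed payload): retrying never helps."""

def handle(message, queue, dlq):
    attempts = message.headers.get("delivery-attempts", 0) + 1
    message.headers["delivery-attempts"] = attempts
    try:
        process(message)  # hypothetical business logic; may raise
        queue.ack(message)
    except PoisonError:
        dlq.publish(message)  # park it for an operator to inspect and replay
        queue.ack(message)    # keep the main queue moving
    except TransientError:
        if attempts >= MAX_ATTEMPTS:
            dlq.publish(message)  # retries exhausted
            queue.ack(message)
        else:
            queue.nack(message, requeue=True)  # redeliver and try again
```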

Q: How do you handle backpressure when producers outpace consumers?

Model Answer

Backpressure signals the producer to slow down when the consumer is overwhelmed. Approaches: (1) Bounded queues: when the queue is full, the producer blocks or receives an error. Producer reduces rate or sheds load. (2) Pull-based consumption: consumers pull at their own rate (Kafka's model). Consumer lag grows visibly; auto-scaling can add consumers. (3) Rate limiting at ingestion: API gateway applies rate limits based on consumer group lag metrics. (4) Reactive Streams: back-pressure protocol built into the stream abstraction (RxJava, Project Reactor, Akka Streams). In practice: monitor consumer lag with alerts, auto-scale consumers, and set circuit breakers on the producer side to shed non-critical events under sustained lag.
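A minimal sketch of approach (1) using Python's standard library: a bounded buffer that refuses new events when full, forcing the producer to slow down or shed load; `handle` is a hypothetical processing function.

```python
# Sketch: backpressure via a bounded in-process queue (approach 1).
import queue

buffer = queue.Queue(maxsize=1000)  # bounded: a full buffer pushes back

def produce(event) -> bool:
    try:
        buffer.put_nowait(event)  # raises queue.Full instead of growing memory
        return True
    except queue.Full:
        # Consumer is overwhelmed: shed this event, or call
        # buffer.put(event, timeout=...) to block and slow the producer.
        return False

def consume():
    while True:
        event = buffer.get()  # pull-based: the consumer sets its own pace
        handle(event)         # hypothetical processing function
        buffer.task_done()
```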

Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.

Common Mistakes

The top 5 mistakes builders make in this track — and exactly how to fix them. Each one is broken down into its root cause and the correct approach.

Mistake 1: ACKing messages before processing completes

Why it happens

An early ACK tells the broker the message is done. If the consumer then crashes, the message is gone — the broker will not redeliver it.

The fix

ACK only after all processing and any downstream writes are complete. Use at-least-once delivery semantics with idempotent processing to handle redeliveries.
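A sketch of the ack-after-processing loop with an idempotency guard; the consumer and database interfaces here are hypothetical stand-ins for whatever client you use.

```python
# Sketch: ACK only after all processing and downstream writes complete.
# `consumer` and `db` are hypothetical stand-ins for real clients.
def consume_loop(consumer, db):
    for message in consumer:
        # Idempotency guard: a redelivery (crash after processing but
        # before the ACK) is recognized by its unique ID and skipped.
        if db.already_processed(message.id):
            consumer.ack(message)
            continue
        db.apply(message)              # downstream writes happen first
        db.mark_processed(message.id)  # record the ID for deduplication
        consumer.ack(message)          # ACK last: a crash before this line
                                       # only causes a harmless redelivery
```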

Mistake 2: No dead-letter queue for poison messages

Why it happens

A message that causes the consumer to crash is redelivered indefinitely (or until a configured maximum delivery count, if any, is exceeded).

The fix

Implement a dead-letter queue (DLQ). After N failed delivery attempts, route the message to the DLQ for manual inspection rather than retrying forever.

Mistake 3: Running more consumers than partitions

Why it happens

In Kafka-style partitioned queues, each partition is assigned to exactly one consumer in a group. Extra consumers beyond the number of partitions have nothing to consume.

The fix

Keep consumer count <= partition count. If you need more throughput, increase the number of partitions first, then scale consumers to match.
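As a sanity check before scaling, compare the planned consumer count against the topic's partition count. A sketch assuming confluent-kafka's AdminClient; the topic name and counts are illustrative.

```python
# Sketch: check the partition count before adding consumers
# (confluent-kafka AdminClient; names and counts are illustrative).
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
metadata = admin.list_topics(topic="orders", timeout=10)
partitions = len(metadata.topics["orders"].partitions)

desired_consumers = 8
if desired_consumers > partitions:
    idle = desired_consumers - partitions
    print(f"{partitions} partitions: {idle} of {desired_consumers} "
          "consumers would sit idle -- increase partitions first")
```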

Mistake 4: No TTL on time-sensitive messages

Why it happens

Messages that are no longer actionable (e.g., a location update from 2 hours ago) waste consumer capacity and may cause incorrect behaviour if processed late.

The fix

Set a per-message TTL or a queue-level TTL for time-sensitive messages. The broker discards messages that expire before being consumed.
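Where the broker offers no TTL, a consumer-side staleness guard approximates it. A sketch assuming each message carries a produce timestamp; `process` is a hypothetical handler.

```python
# Sketch: consumer-side staleness guard for time-sensitive messages.
# Assumes each message carries a produce timestamp; broker-level TTLs
# are preferable where the queue supports them.
import time

MAX_AGE_SECONDS = 300  # a location update older than 5 minutes is useless

def handle_if_fresh(message: dict):
    age = time.time() - message["produced_at"]
    if age > MAX_AGE_SECONDS:
        return None          # expired: discard rather than act on stale data
    return process(message)  # hypothetical handler for fresh messages
```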

Mistake 5: Publishing to the broker synchronously in the request path

Why it happens

Waiting for a publish ACK from the broker adds the broker's round-trip latency to every API response.

The fix

Publish asynchronously using a local outbox pattern: write the event to a local DB table (outbox) atomically with the business transaction, then have a background process relay it to the broker.
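A minimal sketch of the outbox pattern using sqlite3 from the standard library; the table names and the `publish` callback are illustrative. The key property is that the order row and the outbox row commit in one transaction.

```python
# Sketch: transactional outbox with sqlite3 (stdlib); table names and the
# publish callback are illustrative. The order row and the outbox row
# commit in a single transaction, so no event is recorded without its order.
import json
import sqlite3

db = sqlite3.connect("app.db")
db.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE IF NOT EXISTS outbox ("
           "id INTEGER PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id: str, total: float) -> None:
    with db:  # one transaction: both inserts commit or neither does
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"event": "order_placed", "order_id": order_id}),))

def relay(publish) -> None:
    # Background relay: drain unpublished events to the broker, then mark
    # them published. A crash between publish() and the UPDATE yields a
    # duplicate, which idempotent consumers absorb (at-least-once).
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)  # hypothetical broker publish function
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
```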

Comparison Mode

Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track.

| Dimension | Point-to-Point | Pub-Sub | Log-Based |
| --- | --- | --- | --- |
| Message routing | One producer → one consumer (competing consumers OK) | One producer → all subscribers of the topic | One producer → all consumer groups (each gets every message) |
| Message deleted after consume | Yes | Yes (after all subscribers ACK) | No — retained for a configurable period |
| Replayability | No | No | Yes — seek to any offset and replay |
| Ordering guarantee | FIFO per queue | Per-partition in most systems | Total order within a partition |
| Consumer scaling | Add consumers to the same queue | Add subscriber queues | Add partitions; consumer groups scale independently |
| Examples | RabbitMQ queues, Amazon SQS | Google Pub/Sub, SNS, Redis Pub/Sub | Apache Kafka, Apache Pulsar, AWS Kinesis |

Verdict: Point-to-point for work queues (job processing). Pub-Sub for event fan-out (notifications). Log-based for event sourcing and analytics pipelines where replay and multiple consumer groups are required.

Concepts Covered

queue, producer-consumer, FIFO, consumer groups, partitioning, parallel processing, at-least-once, acknowledgment, redelivery, exactly-once, idempotency, deduplication, DLQ, poison message, error handling, exactly-once semantics, message delivery, at-most-once, duplicate processing, idempotent consumers, message deduplication, unique message IDs, processed message tracking, idempotency keys, transactional processing, atomic operations, database transactions, message acknowledgment, outbox pattern, event publishing, message reliability, eventual consistency, publisher reliability, two-phase commit, 2PC, distributed transactions, queue coordination, database coordination, atomic commit

Prerequisites

It is recommended to complete the previous tracks before starting this one. Concepts build progressively throughout the curriculum.