ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/caches

Caches

Intermediate · Scalability · 5 tasks

Learn how caching improves read performance in distributed systems. Build request node caches, global caches, and distributed caches with consistent hashing. Master cache eviction and invalidation strategies.


Interview Prep

Common interview questions for Backend / Infrastructure Engineer roles that map directly to what you build in this track, each followed by a model answer.

Model Answer: the cache-aside pattern and its stale-read race

Cache-aside: reads check the cache first; on a miss they fetch from the DB and populate the cache; writes update the DB and invalidate the cache entry. Race condition: a read misses and fetches from the DB; before it populates the cache, a write updates the DB and deletes the cache key; the read then populates the cache with the now-stale value. Mitigations: (1) TTL-based expiry as a backstop; (2) lease tokens (Facebook's memcached approach): the cache issues a token on a miss, a write invalidates outstanding tokens, and a read whose token is no longer valid aborts its population; (3) shorter TTLs for frequently written keys.
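The lease mechanism can be sketched in-process. Here a plain dict stands in for memcached; `LeaseCache` and its method names are illustrative, not a real client API:

```python
import itertools

class LeaseCache:
    """Toy stand-in for a memcached-style cache with lease tokens."""

    def __init__(self):
        self._data = {}      # key -> cached value
        self._leases = {}    # key -> outstanding lease token
        self._tokens = itertools.count(1)

    def get(self, key):
        """Return (value, lease). On a miss, hand out a lease token."""
        if key in self._data:
            return self._data[key], None
        token = next(self._tokens)
        self._leases[key] = token
        return None, token

    def set_with_lease(self, key, value, token):
        """Populate only if the lease is still valid; otherwise abort."""
        if self._leases.get(key) == token:
            self._data[key] = value
            del self._leases[key]
            return True
        return False

    def invalidate(self, key):
        """Called on write: delete the value and void any outstanding lease."""
        self._data.pop(key, None)
        self._leases.pop(key, None)

cache = LeaseCache()
db = {"user:1": "alice"}

value, lease = cache.get("user:1")   # miss -> lease issued
loaded = db["user:1"]                # slow DB fetch happens here
# Meanwhile a writer updates the DB and invalidates the cache,
# which voids the reader's lease:
db["user:1"] = "alice-v2"
cache.invalidate("user:1")
# The reader's populate is rejected, so the stale value never lands.
assert cache.set_with_lease("user:1", loaded, lease) is False
```

The key property: the write does not need to know a read is in flight; voiding the token is enough to make the racing population a no-op.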

Model Answer: cache stampede (thundering herd) on key expiry

Scenario: a popular cached item (trending article, flash sale product) expires. Thousands of requests simultaneously miss the cache and hit the database. The DB is overwhelmed; many requests time out; the system appears to go down even though the underlying data is fine. Solutions: (1) Probabilistic early expiration: as TTL approaches, a percentage of requests refresh the cache before expiry, spreading the load, (2) Lock-based: first miss acquires a lock and fetches from DB; others wait and read the freshly populated value, (3) Stale-while-revalidate: serve stale content while background refresh happens, (4) Longer TTLs with explicit invalidation instead of time-based expiry.
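Probabilistic early expiration (option 1) is commonly implemented with the XFetch rule from Vattani et al.: refresh when `now - delta * beta * ln(u) >= expiry`, where `delta` is the cost of the last recompute and `u` is uniform random. A minimal sketch — the function name and defaults are illustrative:

```python
import math
import random
import time

def should_refresh_early(expiry_ts, delta, beta=1.0, now=None):
    """XFetch-style probabilistic early expiration.

    expiry_ts: when the cached value's TTL ends (seconds).
    delta:     how long the last recompute took (seconds).
    beta:      > 1 refreshes more eagerly, < 1 less eagerly.

    Returns True with probability approaching 1 as expiry nears, so
    concurrent readers refresh at scattered times instead of all at once.
    """
    now = time.time() if now is None else now
    # -log(1 - U) is an Exponential(1) sample; scale it by delta * beta.
    gap = -delta * beta * math.log(1.0 - random.random())
    return now + gap >= expiry_ts
```

A request that gets `True` recomputes the value and resets the TTL; every other request keeps serving the cached copy, so the origin sees at most a handful of refreshes instead of a stampede.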

Model Answer: designing for cache failure

Without resilience: all requests hit the DB, which may be sized for cache-hit rates (e.g., only handling 10% of traffic directly). DB becomes overloaded, requests time out, cascading failure. With resilience: circuit breaker opens after Redis failures exceed threshold, routing all traffic to DB with a reduced feature set. Cache should be a performance optimization, not a reliability dependency for correct operation. Design for cache failure: implement request coalescing, DB connection pool limits, and graceful degradation. Use Redis Sentinel or Cluster for high availability.
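A circuit breaker around the cache client can be sketched as follows. The thresholds, timings, and the `get_user` helper are illustrative assumptions, and a plain dict stands in for the DB in the usage path:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `threshold` consecutive
    failures, half-opens (allows probes) after `reset_after` seconds."""

    def __init__(self, threshold=5, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """Is a call to the cache permitted right now?"""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            return True    # half-open: allow probes through
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

def get_user(key, cache_get, db_get, breaker):
    """Cache-aside read that skips the cache while the breaker is open."""
    if breaker.allow():
        try:
            value = cache_get(key)
            breaker.record_success()
            if value is not None:
                return value
        except ConnectionError:
            breaker.record_failure()
    return db_get(key)   # degraded path: serve directly from the DB
```

When the breaker is open, requests go straight to the DB without paying the cache timeout on every call, which is what turns a Redis outage into degraded latency instead of cascading failure.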

Model Answer: write-through vs write-back

Write-through: every write goes to both cache and DB synchronously. Cache is always consistent with DB. Higher write latency (two writes per operation). Safe — no data loss on cache failure. Use when read/write ratio is balanced and you cannot tolerate stale reads. Write-back (write-behind): writes go to cache immediately; DB write is deferred/async. Lower write latency. Risk: data in cache but not DB is lost if cache fails. Use when write throughput is critical and you can tolerate some data loss (metrics, analytics, IoT sensor data). Not suitable for financial data.
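The two policies differ only in when the backing-store write happens, which a toy pair of classes makes concrete (class names and the dict-backed store are illustrative):

```python
class WriteThroughCache:
    """Synchronous write to both cache and store: always consistent,
    but every write pays the backing-store latency."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def write(self, key, value):
        self.store[key] = value   # store write on the request path
        self.cache[key] = value

class WriteBackCache:
    """Write to the cache only; dirty keys are flushed later.
    Anything in `dirty` is lost if the process dies before flush()."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.dirty = set()

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)       # store write is deferred

    def flush(self):
        """Background flush: persist all dirty keys, e.g. on a timer."""
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

The window between `write` and `flush` is exactly the data-loss exposure described above — acceptable for metrics, not for financial data.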

Model Answer: rate limiting with Redis

Fixed window: INCR key; if count > limit, reject; set TTL = window duration on the first INCR. Problem: boundary burst (up to 2x the limit across a window boundary). Sliding window with a sorted set: ZADD key timestamp member; ZREMRANGEBYSCORE key 0 (now - window); ZCARD key; if count < limit, add and proceed — wrap the sequence in a Lua script or MULTI/EXEC so it executes atomically (pipelining alone batches round trips but does not guarantee atomicity). Token bucket with Redis: store (tokens, last_refill_time); on each request, compute the tokens accrued since the last refill, cap at the maximum, and subtract 1 if one is available; implement with EVALSHA for atomicity. Cloudflare's production rate limiter approximates a sliding window cheaply by weighting the previous fixed window's count by how much of it still overlaps the current window, correcting the boundary problem without per-request sorted-set storage.
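The token-bucket variant translates directly into code. This in-memory sketch mirrors the (tokens, last_refill_time) state that would live in a Redis hash, where the same logic would run inside a Lua script for atomicity; names and defaults are illustrative:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter. `rate` tokens accrue per second up to
    `capacity` (the burst size); each allowed request spends one token."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full
        self.last_refill = clock()
        self.clock = clock

    def allow(self):
        now = self.clock()
        # Refill based on elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In Redis, read-modify-write of the two fields must be atomic across clients, which is why the answer above reaches for EVALSHA; in a single process, the same logic would need a lock around `allow`.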

Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.

Common Mistakes

The top 5 mistakes builders make in this track — and exactly how to fix them, each with its root cause and the correct approach.

Mistake 1: Cache stampede when a hot key expires

Why it happens

Every request finds the key missing, decides to fetch from origin, and all fire at once before any one of them has repopulated the cache.

The fix

Use probabilistic early expiration (PER) or a lock-based approach: the first miss acquires a lock and fetches the value; other misses wait for the lock holder to populate the cache.

Mistake 2: Plain LRU wiped out by sequential scans

Why it happens

A pure LRU treats a one-time sequential scan the same as a repeated hot key. The scan promotes millions of cold entries to the top of the LRU list, flushing the actual hot set.

The fix

Use LRU-K (LRU based on the K-th most recent access) or a segmented LRU (SLRU) that requires an entry to be accessed twice before it enters the protected segment.
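A segmented LRU is small enough to sketch directly. Here `OrderedDict` implements each segment; the segment sizes and the promotion/demotion policy are the illustrative parts:

```python
from collections import OrderedDict

class SLRUCache:
    """Segmented LRU: new keys enter a probationary segment; only a
    second access promotes them to the protected segment, so a one-pass
    scan cannot evict the hot set."""

    def __init__(self, probation_size, protected_size):
        self.probation = OrderedDict()
        self.protected = OrderedDict()
        self.probation_size = probation_size
        self.protected_size = protected_size

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)
            self._promote(key, value)      # second access: promote
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        else:
            self.probation.pop(key, None)
            self._insert_probation(key, value)

    def _promote(self, key, value):
        self.protected[key] = value
        if len(self.protected) > self.protected_size:
            # Demote the protected LRU back to probation, don't drop it.
            old_key, old_value = self.protected.popitem(last=False)
            self._insert_probation(old_key, old_value)

    def _insert_probation(self, key, value):
        self.probation[key] = value
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)   # evict probation LRU
```

A scan's write-once keys churn through probation and are evicted there; the hot set, having been touched twice, sits untouched in the protected segment.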

Mistake 3: Mass invalidation triggers a reload storm

Why it happens

Invalidating a cache entry shared by many clients causes all of them to miss simultaneously and race to reload it.

The fix

Use an atomic set-if-absent primitive (e.g., Redis `SET key value NX PX ms`) as a short-lived lock so that only one client fetches from the origin. Others should serve a slightly stale value while the reload is in flight.

Mistake 4: Unbounded cache growth

Why it happens

If TTLs are never set or set to very large values, keys accumulate without bound.

The fix

Every key must have a TTL or the cache must enforce a maximum capacity with an eviction policy. Prefer short TTLs for mutable data and longer TTLs for immutable data.

Mistake 5: Cache and database disagree after a failed write

Why it happens

In a write-through design, the write must succeed on both the cache and the database. If the database write fails after the cache write succeeds, the cache holds an uncommitted value.

The fix

Write to the database first, then invalidate (not update) the cache. This "cache-aside" or "write-invalidate" pattern ensures the cache never holds a value the database does not have.

Comparison Mode

Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track, each with a detailed breakdown and a verdict.

| Dimension | Write-Through | Write-Back | Write-Around |
| --- | --- | --- | --- |
| How it works | Write to cache and backing store synchronously | Write to cache only; flush to backing store later | Write directly to backing store, bypass cache |
| Write latency | High (waits for backing store) | Low (write to cache only) | Medium (backing store latency, no cache overhead) |
| Data loss risk | None | High — dirty cache lost on crash | None |
| Read-after-write from cache | Yes | Yes | No — first read is a cache miss |
| Cache pollution from write-once data | Yes | Yes | No — large writes do not pollute the cache |
| Best for | Read-heavy, consistency-critical data | Write-heavy workloads (DB buffer pools) | Large batch writes, log storage |

Verdict: Write-through for safety. Write-back for write-heavy throughput with durability handled at the storage layer. Write-around when written data is rarely re-read.
| Dimension | LRU | LFU | ARC |
| --- | --- | --- | --- |
| Evicts | Least recently used entry | Least frequently used entry | Dynamically balances recency vs frequency |
| Implementation complexity | Simple (doubly-linked list + hashmap) | Complex (frequency buckets) | Complex (four LRU lists) |
| Handles scan resistance | No — sequential scan evicts hot entries | Yes — hot entries have high frequency | Yes — ghost lists absorb scan traffic |
| Cold start | Good — recency bootstraps quickly | Poor — all new entries have frequency = 1 | Good — recency arm covers cold start |
| Adapts to access pattern changes | Yes (recent entries favoured) | Slow — old high-frequency entries stick around | Yes — adapts T1/T2 split dynamically |
| Used in | Redis (allkeys-lru), most caches | Redis (allkeys-lfu, Redis 4.0+) | ZFS ARC, IBM storage systems |

Verdict: LRU is the default and works well for most workloads. LFU or ARC when you have mixed hot/cold access patterns and scan traffic that pollutes the LRU list.
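For reference, the "doubly-linked list + hashmap" LRU in the table reduces to a few lines in Python, since `OrderedDict` combines both structures. A sketch only — note that Redis itself uses approximate LRU via random sampling rather than an exact list:

```python
from collections import OrderedDict

class LRUCache:
    """Exact LRU: OrderedDict plays the role of the doubly-linked
    list + hashmap pair, giving O(1) get, put, and eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```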

Concepts Covered

caching · local cache · TTL · shared cache · cache coherence · single point of truth · consistent hashing · partitioning · horizontal scaling · LRU · eviction · cache capacity · invalidation · consistency · write-through · write-behind

Prerequisites

It is recommended to complete the previous tracks before starting this one. Concepts build progressively throughout the curriculum.