ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/logger/tasks/task-10-2-4-compaction
TASK

Implementation

Compaction is the background process that keeps the LSM tree healthy. Without it, reads degrade because more and more SSTables accumulate, each requiring a disk seek.

The compaction algorithm (leveled compaction):

  1. Trigger: when L0 exceeds 4 SSTables, or when a level exceeds its size limit
  2. Select: choose SSTables from the source level whose key ranges overlap with the target level
  3. Merge: read all selected SSTables, merge-sort by key, deduplicate (keep newest version of each key)
  4. Write: write new sorted, non-overlapping SSTables to the target level
  5. Cleanup: delete the old source SSTables

Key metric - write amplification: a single user write may be rewritten 10-30x across levels as data is compacted. This is the fundamental cost of LSM trees. RocksDB's leveled compaction has write amplification of approximately 10 * (number_of_levels - 1).

Request:  {"type": "lsm_compact", "msg_id": 1, "source_level": 0, "target_level": 1}
Response: {"type": "lsm_compact_ok", "in_reply_to": 1, "sstables_merged": 4, "new_sstables": 2, "keys_deduplicated": 500, "bytes_read": 16000000, "bytes_written": 14000000, "write_amplification": 2.3}

Request:  {"type": "lsm_compact_status", "msg_id": 2}
Response: {"type": "lsm_compact_status_ok", "in_reply_to": 2, "pending_compactions": 0, "total_compactions": 5, "total_bytes_compacted": 100000000}

Sample Test Cases

Compact L0 into L1Timeout: 5000ms
Input
{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1"]}}
{"src":"c1","dest":"n1","body":{"type":"lsm_compact","msg_id":2,"source_level":0,"target_level":1}}
Expected Output
{"src": "n1", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1, "msg_id": 0}}
Compaction deduplicates keysTimeout: 5000ms
Input
{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1"]}}
{"src":"c1","dest":"n1","body":{"type":"lsm_compact","msg_id":2,"source_level":0,"target_level":1}}
Expected Output
{"src": "n1", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1, "msg_id": 0}}

Hints

Hint 1
When L0 has >4 SSTables, merge them into L1 (compaction trigger)
Hint 2
Compaction = merge sort: read overlapping SSTables from both levels, merge by key, write new SSTables to the target level
Hint 3
Deduplication: if the same key appears in multiple SSTables, keep only the newest version
Hint 4
Write amplification = total bytes written to disk / total bytes of new data. LSM trees have high write amp (10-30x)
Hint 5
After compaction, delete the old SSTables that were merged
OVERVIEW

Theoretical Hub

Concept overview coming soon

Key Concepts

compactionmerge sortdeduplicationwrite amplificationspace reclamation
main.py
python
Implement LSM Compaction with Merge Sort - The Logger | Build Distributed Systems