ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/filesystem/tasks/task-12-2-2-re-replication
TASK

Implementation

When a chunk server dies, its chunks become under-replicated. The master must automatically schedule re-replication to restore the target replication factor.

Re-replication algorithm:

  1. Detect: master notices missing heartbeats from a server and marks it dead
  2. Scan: identify all chunks that were on the dead server — they now have fewer replicas
  3. Prioritize: chunks with RF=1 have a single surviving replica, so one more failure means data loss. Re-replicate them first.
  4. Schedule: for each under-replicated chunk, pick a healthy server that does NOT already hold the chunk
  5. Copy: instruct an existing replica to send the chunk data to the new server
  6. Update: add the new server to the chunk's location list in the master's metadata
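
The scan, prioritize, and schedule steps above can be sketched as a pure planning function. This is a minimal sketch, not the reference solution: `plan_re_replication`, the in-memory `chunk_locations` map, and the choice of the alphabetically first source and targets are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

TARGET_RF = 3  # assumption: target replication factor used in the examples


def plan_re_replication(
    chunk_locations: Dict[str, Set[str]],
    live_servers: Set[str],
    dead_server: str,
    target_rf: int = TARGET_RF,
) -> List[Tuple[str, str, str]]:
    """Return (chunk, source, target) copy tasks, most urgent chunk first.

    Note: this mutates chunk_locations by removing the dead server.
    """
    # Scan: forget the dead server and collect chunks below target_rf.
    under_replicated = []
    for chunk, holders in chunk_locations.items():
        holders.discard(dead_server)
        if len(holders) < target_rf:
            under_replicated.append(chunk)

    # Prioritize: fewest surviving replicas first, so RF=1 chunks lead.
    under_replicated.sort(key=lambda c: (len(chunk_locations[c]), c))

    tasks = []
    for chunk in under_replicated:
        holders = chunk_locations[chunk]
        sources = sorted(holders & live_servers)
        if not sources:
            continue  # no live replica left to copy from
        # Schedule: only healthy servers that do NOT already hold the chunk.
        candidates = sorted(live_servers - holders)
        missing = target_rf - len(holders)
        for target in candidates[:missing]:
            tasks.append((chunk, sources[0], target))
    return tasks


locations = {
    "ch_001": {"cs1", "cs2", "cs3"},
    "ch_005": {"cs1", "cs3"},
}
tasks = plan_re_replication(locations, {"cs1", "cs2", "cs4"}, dead_server="cs3")
for task in tasks:
    print(task)
```

With `cs3` dead, `ch_005` is left with one replica and is scheduled before `ch_001`. The location list is not updated here; in this sketch that belongs to step 6, after a copy succeeds.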

Message format:

Request:  {"type": "check_replication", "msg_id": 1}
Response: {"type": "check_replication_ok", "in_reply_to": 1, "under_replicated": [
    {"chunk": "ch_001", "current_rf": 2, "target_rf": 3, "missing_on": ["cs3"]},
    {"chunk": "ch_005", "current_rf": 1, "target_rf": 3, "priority": "critical"}
]}

Request:  {"type": "replicate_chunk", "msg_id": 2, "chunk": "ch_005", "source": "cs1", "target": "cs4"}
Response: {"type": "replicate_chunk_ok", "in_reply_to": 2, "chunk": "ch_005", "new_rf": 2, "bytes_copied": 67108864}
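
One way to produce these two replies from in-memory metadata is sketched below. The `Master` class, its `locations` and `sizes` maps, and the `error` reply shape are hypothetical; the task text does not define error handling, and the `missing_on` field is left out because the source does not say how it is derived. No data is actually copied: `bytes_copied` is simply read from the assumed `sizes` map.

```python
from typing import Any, Dict, Set

TARGET_RF = 3  # assumption: matches target_rf in the example messages


class Master:
    """Toy master holding chunk -> replica-set metadata (hypothetical)."""

    def __init__(self, locations: Dict[str, Set[str]], sizes: Dict[str, int]):
        self.locations = locations  # chunk -> servers holding a replica
        self.sizes = sizes          # chunk -> size in bytes

    def handle(self, body: Dict[str, Any]) -> Dict[str, Any]:
        reply: Dict[str, Any] = {"in_reply_to": body["msg_id"]}
        if body["type"] == "check_replication":
            report = []
            # Most urgent first: fewest replicas, then chunk name.
            order = sorted(self.locations, key=lambda c: (len(self.locations[c]), c))
            for chunk in order:
                rf = len(self.locations[chunk])
                if rf >= TARGET_RF:
                    continue
                entry = {"chunk": chunk, "current_rf": rf, "target_rf": TARGET_RF}
                if rf == 1:
                    entry["priority"] = "critical"
                report.append(entry)
            reply.update(type="check_replication_ok", under_replicated=report)
        elif body["type"] == "replicate_chunk":
            chunk, source, target = body["chunk"], body["source"], body["target"]
            holders = self.locations.get(chunk, set())
            # The source must hold the chunk; the target must not.
            if source not in holders or target in holders:
                reply.update(type="error", text="invalid source or target")
                return reply
            holders.add(target)  # step 6: record the new replica
            reply.update(
                type="replicate_chunk_ok",
                chunk=chunk,
                new_rf=len(holders),
                bytes_copied=self.sizes[chunk],
            )
        else:
            reply.update(type="error", text="unknown message type")
        return reply


master = Master(
    {"ch_001": {"cs1", "cs2"}, "ch_005": {"cs1"}},
    {"ch_001": 67108864, "ch_005": 67108864},
)
check = master.handle({"type": "check_replication", "msg_id": 1})
copied = master.handle(
    {"type": "replicate_chunk", "msg_id": 2, "chunk": "ch_005", "source": "cs1", "target": "cs4"}
)
print(check)
print(copied)
```

Rejecting a target that already holds the chunk keeps `new_rf` honest: adding the same server twice would not raise the number of distinct replicas.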

Sample Test Cases

Check replication identifies under-replicated chunks
Timeout: 5000ms
Input
{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1"]}}
{"src":"c1","dest":"n1","body":{"type":"check_replication","msg_id":2}}
Expected Output
{"src": "n1", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1, "msg_id": 0}}
Replicate chunk to new server
Timeout: 5000ms
Input
{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1","n2","n3","n4"]}}
{"src":"c1","dest":"n1","body":{"type":"replicate_chunk","msg_id":2,"chunk":"ch_005","source":"n2","target":"n4"}}
Expected Output
{"src": "n1", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1, "msg_id": 0}}
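
The samples suggest one JSON message per line, with replies wrapped in a `src`/`dest`/`body` envelope and reply `msg_id` values starting at 0. The sketch below handles only `init`, because that is the only reply the samples show; the `process` function and its structure are assumptions, and handlers for `check_replication` and `replicate_chunk` would plug in where marked.

```python
import json
from typing import Iterable, Iterator


def process(lines: Iterable[str]) -> Iterator[str]:
    """Yield one JSON reply line per handled input line."""
    node_id = ""
    next_msg_id = 0  # the sample output starts reply ids at 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        body = msg["body"]
        if body["type"] == "init":
            node_id = body["node_id"]
            reply_body = {"type": "init_ok"}
        else:
            # check_replication / replicate_chunk handlers would go here.
            continue
        reply_body["in_reply_to"] = body["msg_id"]
        reply_body["msg_id"] = next_msg_id
        next_msg_id += 1
        yield json.dumps({"src": node_id, "dest": msg["src"], "body": reply_body})


sample = '{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1"]}}'
replies = list(process([sample]))
print(replies[0])
```

To run it as a node, pass `sys.stdin` to `process` and print each reply with `flush=True` so the test harness sees output before the timeout.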

Hints

Hint 1
When a server dies, its chunks drop below the target replication factor (3 in the examples)
Hint 2
The master scans for under-replicated chunks and schedules re-replication
Hint 3
Pick a healthy server that does NOT already hold the chunk to be the new replica
Hint 4
Copy chunk data from an existing replica to the new server
Hint 5
Prioritize chunks with replication factor 1: they are one server failure away from data loss
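
The hints start from "a server dies"; step 1 of the algorithm says the master infers this from missing heartbeats. A minimal sketch of that detection follows, assuming a fixed timeout. The `FailureDetector` class and the 10-second value are illustrative, not taken from the task. Timestamps are passed in explicitly so the logic stays deterministic; a real node would likely pass `time.monotonic()`.

```python
from typing import Dict, List

HEARTBEAT_TIMEOUT = 10.0  # seconds; assumed value, not from the task


class FailureDetector:
    """Tracks the last heartbeat per chunk server (hypothetical helper)."""

    def __init__(self, timeout: float = HEARTBEAT_TIMEOUT) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, server: str, now: float) -> None:
        # Record the most recent time this server was heard from.
        self.last_seen[server] = now

    def dead_servers(self, now: float) -> List[str]:
        # A server is considered dead once it has been silent for
        # longer than the timeout.
        return sorted(
            server
            for server, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


detector = FailureDetector()
detector.heartbeat("cs1", now=0.0)
detector.heartbeat("cs3", now=0.0)
detector.heartbeat("cs1", now=8.0)  # cs1 keeps reporting, cs3 goes silent
dead = detector.dead_servers(now=12.0)
print(dead)
```

Each server returned here would then be handed to the scan step, which removes it from the location lists and finds the chunks it left under-replicated.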
OVERVIEW

Key Concepts

re-replication, under-replicated chunks, replication factor, failure recovery
Implement Automatic Re-Replication - The Filesystem | Build Distributed Systems