Implement Automatic Re-Replication - The Filesystem

Implementation

When a chunk server dies, its chunks become under-replicated. The master must automatically schedule re-replication to restore the target replication factor.

Re-replication algorithm:

1. Detect: the master notices missing heartbeats from a server and marks it dead.
2. Scan: identify every chunk that was stored on the dead server; each now has fewer live replicas than before.
3. Prioritize: chunks with a single remaining replica (RF=1) are critical, because one more failure means data loss. Re-replicate them first.
4. Schedule: for each under-replicated chunk, pick a healthy server that does NOT already hold the chunk.
5. Copy: instruct a server holding an existing replica to send the chunk data to the new server.
6. Update: add the new server to the chunk's location list in the master's metadata.
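The steps above can be sketched in a few small functions. This is a minimal illustration, not the reference solution: the function names, the 10-second heartbeat timeout, and the in-memory dictionaries are all assumptions made for the example. The target replication factor of 3 is taken from the sample responses.

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; assumed value, not given in the problem
TARGET_RF = 3             # target replication factor used in the examples


def find_dead_servers(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Detect: servers whose last heartbeat is older than the timeout."""
    return {s for s, seen in last_heartbeat.items() if now - seen > timeout}


def plan_re_replication(chunk_locations, live_servers, target_rf=TARGET_RF):
    """Scan, prioritize, and schedule. Returns (chunk, source, target) copies."""
    # Scan: count only replicas that sit on live servers.
    under = []
    for chunk, holders in chunk_locations.items():
        alive = sorted(s for s in holders if s in live_servers)
        # A chunk with no live replica cannot be copied, so it is skipped here.
        if 0 < len(alive) < target_rf:
            under.append((chunk, alive))

    # Prioritize: fewest live replicas first (RF=1 before RF=2).
    under.sort(key=lambda item: (len(item[1]), item[0]))

    # Schedule: choose live servers that do NOT already hold the chunk.
    plan = []
    for chunk, alive in under:
        candidates = sorted(s for s in live_servers if s not in alive)
        needed = target_rf - len(alive)
        for target in candidates[:needed]:
            plan.append((chunk, alive[0], target))
    return plan


def record_copy(chunk_locations, chunk, target):
    """Update: add the new server to the chunk's location list."""
    chunk_locations[chunk].add(target)
```

The sketch always picks the first candidate in sorted order, so it does not balance load across servers; a fuller version might prefer the server holding the fewest chunks.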

Request:  {"type": "check_replication", "msg_id": 1}
Response: {"type": "check_replication_ok", "in_reply_to": 1, "under_replicated": [
    {"chunk": "ch_001", "current_rf": 2, "target_rf": 3, "missing_on": ["cs3"]},
    {"chunk": "ch_005", "current_rf": 1, "target_rf": 3, "priority": "critical"}
]}
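One possible way to build the check_replication_ok body from the master's metadata is sketched below. The problem statement does not say when "missing_on" and "priority" should appear, so this sketch assumes that "missing_on" lists the dead servers that used to hold the chunk and that "priority" is set to "critical" when at most one live replica remains. The function name and argument shapes are hypothetical.

```python
def check_replication_body(msg_id, chunk_locations, live_servers, target_rf=3):
    """Build a check_replication_ok body listing under-replicated chunks."""
    under = []
    for chunk in sorted(chunk_locations):
        holders = chunk_locations[chunk]
        alive = [s for s in holders if s in live_servers]
        if len(alive) >= target_rf:
            continue  # fully replicated, nothing to report

        entry = {"chunk": chunk, "current_rf": len(alive), "target_rf": target_rf}

        # Assumption: report dead holders under "missing_on" when there are any.
        missing = sorted(s for s in holders if s not in live_servers)
        if missing:
            entry["missing_on"] = missing

        # Assumption: one live replica or fewer counts as critical.
        if len(alive) <= 1:
            entry["priority"] = "critical"

        under.append(entry)

    return {
        "type": "check_replication_ok",
        "in_reply_to": msg_id,
        "under_replicated": under,
    }
```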

Request:  {"type": "replicate_chunk", "msg_id": 2, "chunk": "ch_005", "source": "cs1", "target": "cs4"}
Response: {"type": "replicate_chunk_ok", "in_reply_to": 2, "chunk": "ch_005", "new_rf": 2, "bytes_copied": 67108864}
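A handler for replicate_chunk might look like the following sketch. Several details are assumptions rather than part of the specification: chunks are taken to be 64 MiB (67108864 bytes, matching "bytes_copied" in the example) unless a size is known, a chunk the master has never seen is assumed to live on the named source, and invalid requests raise ValueError because the error reply format is not specified. The names handle_replicate_chunk and chunk_bytes are hypothetical.

```python
DEFAULT_CHUNK_BYTES = 64 * 1024 * 1024  # 67108864; assumed default chunk size


def handle_replicate_chunk(body, chunk_locations, chunk_bytes=None):
    """Record a copy of a chunk from source to target and build the reply body."""
    chunk, source, target = body["chunk"], body["source"], body["target"]

    # Assumption: an unknown chunk is treated as held only by the source.
    holders = chunk_locations.setdefault(chunk, {source})

    if source not in holders:
        raise ValueError(f"{source} does not hold {chunk}")
    if target in holders:
        raise ValueError(f"{target} already holds {chunk}")

    # The actual data transfer between chunk servers is out of scope here;
    # the master only updates its location metadata.
    holders.add(target)

    size = (chunk_bytes or {}).get(chunk, DEFAULT_CHUNK_BYTES)
    return {
        "type": "replicate_chunk_ok",
        "in_reply_to": body["msg_id"],
        "chunk": chunk,
        "new_rf": len(holders),
        "bytes_copied": size,
    }
```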

Sample Test Cases

Check replication identifies under-replicated chunks
Timeout: 5000ms

Input

{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1"]}}
{"src":"c1","dest":"n1","body":{"type":"check_replication","msg_id":2}}

Expected Output

{"src": "n1", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1, "msg_id": 0}}

Replicate chunk to new server
Timeout: 5000ms

Input

{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1","n2","n3","n4"]}}
{"src":"c1","dest":"n1","body":{"type":"replicate_chunk","msg_id":2,"chunk":"ch_005","source":"n2","target":"n4"}}

Expected Output

{"src": "n1", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1, "msg_id": 0}}

Hints

Hint 1

When a server dies, every chunk it held drops below the target replication factor (3 in the examples above).

Hint 2

The master scans for under-replicated chunks and schedules re-replication.

Hint 3

Pick a healthy server that does NOT already hold the chunk to be the new replica.

Hint 4

Copy the chunk data from an existing replica to the new server.

Hint 5

Prioritize chunks with replication factor 1: they are one server failure away from data loss.
