Implement Master Failover with Shadow Master - The Filesystem

Implementation

The master is a single point of failure. A shadow master mitigates this by continuously replaying the primary's WAL, staying nearly synchronized.

Failover process:

Normal operation: primary master handles all requests; shadow replays WAL entries from a shared storage
Failure detection: if the primary misses heartbeats for 10 seconds, the shadow initiates takeover
WAL catchup: shadow replays any remaining WAL entries not yet processed
Promote: shadow becomes the new primary, starts accepting requests
Redirect: chunk servers and clients are notified to use the new master

Request:  {"type": "shadow_status", "msg_id": 1}
Response: {"type": "shadow_status_ok", "in_reply_to": 1, "primary": "master1", "shadow": "master2", "wal_lag_entries": 5, "wal_lag_ms": 200}

Request:  {"type": "trigger_failover", "msg_id": 2, "failed_master": "master1"}
Response: {"type": "trigger_failover_ok", "in_reply_to": 2, "new_primary": "master2", "wal_entries_replayed": 5, "failover_time_ms": 450}

Sample Test Cases

Shadow status shows WAL lagTimeout: 5000ms

Input

{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1","n2"]}}
{"src":"c1","dest":"n1","body":{"type":"shadow_status","msg_id":2}}

Expected Output

{"src": "n1", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1, "msg_id": 0}}

Failover promotes shadowTimeout: 5000ms

Input

{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1","n2"]}}
{"src":"c1","dest":"n1","body":{"type":"trigger_failover","msg_id":2,"failed_master":"n1"}}

Expected Output

{"src": "n1", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1, "msg_id": 0}}

Hints

Hint 1▾

A shadow master replays the primary WAL periodically to stay nearly up-to-date

Hint 2▾

On primary failure, the shadow takes over with minimal data loss (only recent WAL entries)

Hint 3▾

The shadow cannot serve queries while the primary is alive — it is a hot standby

Hint 4▾

Failover time = time to detect failure + time to replay remaining WAL entries

Hint 5▾

After failover, chunk servers redirect heartbeats to the new master

Implementation

Sample Test Cases

Hints

Resources

Theoretical Hub

Key Concepts