Implement WAL Recovery on Startup - The Logger

                      TASK
                    

Implementation

WAL recovery is the other half of the durability story. When a node restarts after a crash, it must scan the WAL and replay all valid entries to reconstruct its state.

The recovery algorithm:

Open the WAL file and read entries sequentially
For each entry, validate the checksum (recompute CRC32 and compare)
If the checksum matches, replay the entry (apply the operation to the state machine)
If the checksum is invalid, this is a torn write — the entry was partially written before the crash. Skip it and all subsequent entries.
After replay, the state machine reflects the last consistent state

This is how every database recovers from crashes: PostgreSQL replays its WAL, SQLite replays its journal, and Raft replays its log.

Request:  {"type": "wal_recover", "msg_id": 1, "wal_entries": [
    {"offset": 0, "payload": "set x=1", "checksum": "valid"},
    {"offset": 1, "payload": "set y=2", "checksum": "valid"},
    {"offset": 2, "payload": "set z=", "checksum": "invalid"}
]}
Response: {"type": "wal_recover_ok", "in_reply_to": 1, "entries_replayed": 2, "entries_skipped": 1, "state": {"x": "1", "y": "2"}}

Sample Test Cases

Recovery replays valid entries and skips corruptedTimeout: 5000ms

Input

{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1"]}}
{"src":"c1","dest":"n1","body":{"type":"wal_recover","msg_id":2,"wal_entries":[{"offset":0,"payload":"set x=1","checksum":"valid"},{"offset":1,"payload":"set y=2","checksum":"valid"},{"offset":2,"payload":"set z=","checksum":"invalid"}]}}

Expected Output

{"src": "n1", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1, "msg_id": 0}}
{"src": "n1", "dest": "c1", "body": {"type": "wal_recover_ok", "in_reply_to": 2, "entries_replayed": 2, "entries_skipped": 1, "msg_id": 1}}

Recovery with all valid entries replays everythingTimeout: 5000ms

Input

{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1"]}}
{"src":"c1","dest":"n1","body":{"type":"wal_recover","msg_id":2,"wal_entries":[{"offset":0,"payload":"set a=10","checksum":"valid"},{"offset":1,"payload":"set b=20","checksum":"valid"},{"offset":2,"payload":"set c=30","checksum":"valid"}]}}

Expected Output

{"src": "n1", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1, "msg_id": 0}}
{"src": "n1", "dest": "c1", "body": {"type": "wal_recover_ok", "in_reply_to": 2, "entries_replayed": 3, "entries_skipped": 0, "msg_id": 1}}

Hints

Hint 1▾

On startup, scan the WAL sequentially from the beginning

Hint 2▾

For each entry, recompute the CRC32 checksum and compare to the stored one

Hint 3▾

Valid entries are replayed to reconstruct state; invalid entries are skipped

Hint 4▾

A torn write (crash mid-write) produces a partial entry with an invalid checksum

Hint 5▾

After recovery, the state machine is identical to the last consistent pre-crash state

Resources

ARIES Recovery Algorithm

The classic ARIES algorithm for WAL-based crash recovery in databases

                      OVERVIEW
                    

Theoretical Hub

Concept overview coming soon

Key Concepts

crash recoverylog replaychecksum validationtorn writesstate reconstruction

main.py

python