TASK
Implementation
The Google File System (GFS) architecture is the foundation of modern distributed storage. It separates metadata (managed by a master) from data (stored on chunk servers), enabling petabyte-scale storage across thousands of machines.
Architecture:
- Master node: stores all metadata in memory — the namespace (file/directory tree) and the mapping of each file to its chunks and their locations. Metadata changes are logged to a WAL for durability.
- Chunk servers: store 64MB data chunks on local disks. Each chunk is replicated to 3 servers.
- Clients: contact the master to discover chunk locations, then read/write directly to chunk servers.
Key design decisions:
- Large chunks (64MB): reduces metadata size and the number of master interactions
- Replication factor 3: tolerates 2 simultaneous server failures
- Master out of data path: the master only handles metadata; data flows directly between clients and chunk servers
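The master state described above can be sketched with two in-memory maps (a sketch only; all names here are illustrative, not part of the protocol):

```python
# Illustrative sketch of the master's in-memory metadata (names hypothetical).
CHUNK_SIZE_MB = 64
REPLICATION_FACTOR = 3

# namespace: file path -> ordered list of chunk handles
namespace = {}
# chunk_locations: chunk handle -> list of chunk server ids (first one is the primary)
chunk_locations = {}

def create_file(path, chunk_servers):
    """Allocate one initial chunk for a new file and record its replicas."""
    handle = f"ch_{len(chunk_locations) + 1:03d}"
    replicas = chunk_servers[:REPLICATION_FACTOR]
    namespace[path] = [handle]
    chunk_locations[handle] = replicas
    return {"chunk_handle": handle, "chunk_servers": replicas, "primary": replicas[0]}

print(create_file("/data/logs/2024.log", ["cs1", "cs2", "cs3"]))
# -> {'chunk_handle': 'ch_001', 'chunk_servers': ['cs1', 'cs2', 'cs3'], 'primary': 'cs1'}
```

Because all of this fits in memory, every metadata lookup is a dictionary access; only chunk data ever touches disk on the chunk servers.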
Request: {"type": "dfs_create_file", "msg_id": 1, "path": "/data/logs/2024.log", "chunk_size_mb": 64, "replication_factor": 3}
Response: {"type": "dfs_create_file_ok", "in_reply_to": 1, "chunks": [
{"chunk_handle": "ch_001", "chunk_servers": ["cs1", "cs2", "cs3"], "primary": "cs1"}
]}
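The msg_id / in_reply_to pairing shown in this exchange can be captured in a small helper (a sketch; `reply` is a hypothetical name, not part of the protocol):

```python
import json

def reply(request_body, reply_type, **fields):
    """Build a response body echoing the request's msg_id as in_reply_to."""
    body = {"type": reply_type, "in_reply_to": request_body["msg_id"]}
    body.update(fields)
    return body

req = json.loads('{"type": "dfs_create_file", "msg_id": 1, "path": "/data/logs/2024.log"}')
resp = reply(req, "dfs_create_file_ok",
             chunks=[{"chunk_handle": "ch_001",
                      "chunk_servers": ["cs1", "cs2", "cs3"],
                      "primary": "cs1"}])
print(json.dumps(resp))
```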
Request: {"type": "dfs_file_info", "msg_id": 2, "path": "/data/logs/2024.log"}
Response: {"type": "dfs_file_info_ok", "in_reply_to": 2, "size_bytes": 67108864, "chunks": 1, "replication_factor": 3}
Sample Test Cases
Create file returns chunk allocation
Timeout: 5000ms
Input
{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1","n2","n3"]}}
{"src":"c1","dest":"n1","body":{"type":"dfs_create_file","msg_id":2,"path":"/data/test.log","chunk_size_mb":64,"replication_factor":3}}
Expected Output
{"src": "n1", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1, "msg_id": 0}}
File info returns metadata
Timeout: 5000ms
Input
{"src":"c0","dest":"n1","body":{"type":"init","msg_id":1,"node_id":"n1","node_ids":["n1"]}}
{"src":"c1","dest":"n1","body":{"type":"dfs_file_info","msg_id":2,"path":"/data/test.log"}}
Expected Output
{"src": "n1", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1, "msg_id": 0}}
Hints
Hint 1
The architecture has two components: a single master (metadata) and many chunk servers (data)
Hint 2
Files are split into fixed-size 64MB chunks — large to minimize metadata overhead
Hint 3
Each chunk is replicated to 3 chunk servers for fault tolerance
Hint 4
The master stores the mapping: filename -> list of (chunk_handle, [chunk_server_addresses])
Hint 5
Clients talk to the master for metadata and directly to chunk servers for data — the master is never in the data path
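The fixed 64MB chunk size makes the chunk count for any file a one-line calculation, which the master can use when allocating chunks (a sketch under those assumptions):

```python
import math

CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64MB chunks, as described above

def chunks_needed(file_size_bytes):
    """Number of fixed-size chunks required to hold a file (at least one)."""
    return max(1, math.ceil(file_size_bytes / CHUNK_SIZE))

print(chunks_needed(67108864))       # exactly 64MB -> 1 chunk
print(chunks_needed(200 * 1024**2))  # 200MB -> 4 chunks
```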
OVERVIEW
Theoretical Hub
Concept overview coming soon
Key Concepts
GFS architecture, master node, chunk server, 64MB chunks, replication factor
main.py
python
#!/usr/bin/env python3
import sys
import json

def main():
    # Your implementation here: parse each message and reply per the protocol above
    for line in sys.stdin:
        msg = json.loads(line)
        print(json.dumps(msg), flush=True)

if __name__ == "__main__":
    main()