TASK
Implementation
Moving a job to where its data lives is usually cheaper than shipping large data over the network. Locality-aware scheduling scores workers by their proximity to the data, then selects the best-scoring worker, preferring the less loaded one when scores are equal.
Implement a node that makes locality-aware scheduling decisions:
// node-1 hosts the data but is 90% loaded; node-2 is in the same rack, 30% loaded
{ "type": "submit_job", "msg_id": 1,
  "job": {"id":"job1","inputs":["data.csv"]},
  "topology": {"rack1":["node-1","node-2"]},
  "data_location": "node-1",
  "node-1_utilization": 0.9,
  "node-2_utilization": 0.3 }
-> { "type": "job_assigned", "in_reply_to": 1,
     "worker": "node-2",
     "reason": "Same rack as data (rack1) and less loaded than node-1" }
// Data moves to node-5 -> future jobs follow it
{ "type": "update_data_location", "msg_id": 1,
  "file": "data.csv", "old_location": "node-1", "new_location": "node-5" }
{ "type": "submit_job", "msg_id": 2,
  "job": {"id":"job1","inputs":["data.csv"]} }
-> { "type": "job_assigned", "in_reply_to": 2,
     "worker": "node-5", "reason": "Data moved to node-5" }
Sample Test Cases
Locality-aware scheduling
Timeout: 5000ms
Input
{"src":"client","dest":"scheduler","body":{"type":"init","msg_id":1,"workers":["node-1","node-2","node-3"],"data_map":{"file1.txt":["node-1"],"file2.txt":["node-2"]}}}
{"src":"client","dest":"scheduler","body":{"type":"submit_job","msg_id":2,"job":{"id":"job1","inputs":["file1.txt","file2.txt"]}}}
Expected Output
{"src": "scheduler", "dest": "client", "body": {"type": "init_ok", "in_reply_to": 1}}
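The exchange above suggests a reply envelope that swaps `src` and `dest` and echoes the request's `msg_id` as `in_reply_to`. The following is a minimal sketch of handling `init` under that reading. The names `make_reply`, `handle_init`, and the `state` dict are illustrative assumptions, not part of the task.

```python
import json


def make_reply(request: dict, body: dict) -> dict:
    """Build a reply envelope: swap src/dest and echo msg_id as in_reply_to."""
    reply_body = dict(body)
    reply_body["in_reply_to"] = request["body"]["msg_id"]
    return {"src": request["dest"], "dest": request["src"], "body": reply_body}


def handle_init(state: dict, request: dict) -> dict:
    """Record the workers and the file -> nodes map, then acknowledge."""
    body = request["body"]
    # Assumption: state is a plain dict owned by the scheduler process.
    state["workers"] = list(body.get("workers", []))
    state["data_map"] = {
        name: list(nodes) for name, nodes in body.get("data_map", {}).items()
    }
    return make_reply(request, {"type": "init_ok"})


if __name__ == "__main__":
    line = (
        '{"src":"client","dest":"scheduler","body":{"type":"init","msg_id":1,'
        '"workers":["node-1","node-2","node-3"],'
        '"data_map":{"file1.txt":["node-1"],"file2.txt":["node-2"]}}}'
    )
    scheduler_state: dict = {}
    print(json.dumps(handle_init(scheduler_state, json.loads(line))))
```

With the default `json.dumps` separators, the printed reply has the same spacing as the expected output shown above.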
Rack-aware scheduling
Timeout: 5000ms
Input
{"src":"client","dest":"scheduler","body":{"type":"submit_job","msg_id":1,"job":{"id":"job1","inputs":["data.csv"]},"topology":{"rack1":["node-1","node-2"],"rack2":["node-3","node-4"]},"data_location":"node-1","node-1_utilization":0.9,"node-2_utilization":0.3}}
Expected Output
{"src": "scheduler", "dest": "client", "body": {"type": "job_assigned", "in_reply_to": 1, "worker": "node-2", "reason": "Same rack as data (rack1) and less loaded than node-1"}}
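The `submit_job` body in this test carries its own topology and per-node load fields, so it may help to normalize them before choosing a worker. The sketch below shows one way to do that. It assumes load is always reported under keys of the form `<node>_utilization` and that an unreported load counts as 0.0. The helper names are hypothetical, and only the reason string is taken from the expected output.

```python
from typing import Dict, List, Optional

UTILIZATION_SUFFIX = "_utilization"


def extract_utilization(body: dict) -> Dict[str, float]:
    """Collect '<node>_utilization' fields into a node -> load mapping."""
    return {
        key[: -len(UTILIZATION_SUFFIX)]: float(value)
        for key, value in body.items()
        if key.endswith(UTILIZATION_SUFFIX)
    }


def find_rack(topology: Dict[str, List[str]], node: str) -> Optional[str]:
    """Return the rack that lists the node, or None if it is not in the topology."""
    for rack, members in topology.items():
        if node in members:
            return rack
    return None


def same_rack_reason(
    topology: Dict[str, List[str]],
    utilization: Dict[str, float],
    worker: str,
    data_node: str,
) -> Optional[str]:
    """Return the reason text when worker shares the data's rack and is less loaded."""
    rack = find_rack(topology, data_node)
    if rack is None or worker == data_node or worker not in topology[rack]:
        return None
    # Assumption: a node with no reported utilization is treated as idle (0.0).
    if utilization.get(worker, 0.0) >= utilization.get(data_node, 0.0):
        return None
    return f"Same rack as data ({rack}) and less loaded than {data_node}"
```

For the test input, `extract_utilization` yields loads for node-1 and node-2 only, and `same_rack_reason` for worker node-2 produces the reason string in the expected output.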
Hints
Hint 1
Score each worker: +2 if it hosts all input files, +1 if it is in the same rack as the data, 0 otherwise
Hint 2
Among workers with equal locality scores, prefer the one with the lower current utilization
Hint 3
Rack-aware: prefer a same-rack worker over a cross-rack one, even when the node that holds the data is overloaded
Hint 4
update_data_location changes the data map; subsequent jobs use the new location
Hint 5
locality_aware balances speed and fairness: it is faster than pure load balancing and spreads work more evenly than locality-only scheduling
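Hints 1 through 4 can be combined into a small scoring routine. The sketch below is one possible reading, not a reference solution. The task never says when a node counts as overloaded, so the 0.8 threshold is an assumption, as is the rule that an overloaded data host drops to the same-rack score. All function names are hypothetical.

```python
from typing import Dict, List, Optional

OVERLOAD_THRESHOLD = 0.8  # assumption: the task does not define "overloaded"


def rack_of(topology: Dict[str, List[str]], node: str) -> Optional[str]:
    """Return the rack that lists the node, or None if it is unknown."""
    for rack, members in topology.items():
        if node in members:
            return rack
    return None


def locality_score(worker, inputs, data_map, topology) -> int:
    """Hint 1: 2 if worker hosts every input, 1 if it shares a rack with data, else 0."""
    holders = [set(data_map.get(name, [])) for name in inputs]
    if holders and all(worker in nodes for nodes in holders):
        return 2
    data_racks = {rack_of(topology, node) for nodes in holders for node in nodes}
    data_racks.discard(None)
    return 1 if rack_of(topology, worker) in data_racks else 0


def pick_worker(workers, inputs, data_map, topology, utilization) -> str:
    """Pick the highest-scoring worker; ties go to the lower load (Hint 2)."""
    def sort_key(worker):
        load = utilization.get(worker, 0.0)  # assumption: unreported load is idle
        score = locality_score(worker, inputs, data_map, topology)
        # Hint 3 (assumed rule): an overloaded data host competes as a same-rack node.
        if score == 2 and load >= OVERLOAD_THRESHOLD:
            score = 1
        return (-score, load, worker)  # worker name makes ties deterministic
    return min(workers, key=sort_key)


def move_data(data_map, file, old_location, new_location) -> None:
    """Hint 4: update the data map in place so later jobs see the new location."""
    nodes = [node for node in data_map.get(file, []) if node != old_location]
    if new_location not in nodes:
        nodes.append(new_location)
    data_map[file] = nodes
```

Under these assumptions, node-1 at 0.9 utilization loses its host bonus and ties with node-2 on locality, so node-2 wins on load. After `move_data` relocates the file, the new host scores 2 and is chosen for later jobs. `pick_worker` raises `ValueError` on an empty worker list, so a caller would need to handle that case.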
OVERVIEW
Key Concepts
data locality, rack awareness, worker scoring, dynamic data placement, load vs locality tradeoff
main.py
#!/usr/bin/env python3
import sys
import json

def main():
    # Your implementation here
    for line in sys.stdin:
        msg = json.loads(line)
        print(json.dumps(msg), flush=True)

if __name__ == "__main__":
    main()