Implement Dynamic Scheduling with Locality Awareness - The Scheduler

Implementation

Moving a job to where its data lives is cheaper than shipping large data over the network. Locality-aware scheduling scores workers based on data proximity, then selects the best-scoring, least-loaded worker.

Implement a node that makes locality-aware scheduling decisions:

// node-1 hosts the data but is 90% loaded; node-2 is in the same rack and 30% loaded
{ "type": "submit_job", "msg_id": 1,
  "job": {"id":"job1","inputs":["data.csv"]},
  "topology": {"rack1":["node-1","node-2"]},
  "data_location": "node-1",
  "node-1_utilization": 0.9,
  "node-2_utilization": 0.3 }
-> { "type": "job_assigned", "in_reply_to": 1,
    "worker": "node-2",
    "reason": "Same rack as data (rack1) and less loaded than node-1" }
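
The request above carries per-node load as flat keys such as node-1_utilization rather than as a nested object. A minimal sketch of pulling those fields out, assuming the suffix is always _utilization and that each node is listed in at most one rack (the helper names extract_utilization and rack_of are hypothetical, not part of the task):

```python
from __future__ import annotations

SUFFIX = "_utilization"  # assumed to be the only suffix used for load fields


def extract_utilization(body: dict) -> dict[str, float]:
    """Collect per-node load from keys shaped like '<node>_utilization'."""
    return {
        key[: -len(SUFFIX)]: float(value)
        for key, value in body.items()
        if key.endswith(SUFFIX)
    }


def rack_of(topology: dict[str, list[str]], node: str) -> str | None:
    """Return the rack that lists `node`, or None if no rack does."""
    for rack, members in topology.items():
        if node in members:
            return rack
    return None
```

Nodes that have no utilization key are simply absent from the result, so the caller has to decide how to treat them (for example, as idle).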

// Data moves to node-5 -> future jobs follow it
{ "type": "update_data_location", "msg_id": 1,
  "file": "data.csv", "old_location": "node-1", "new_location": "node-5" }
{ "type": "submit_job", "msg_id": 2,
  "job": {"id":"job1","inputs":["data.csv"]} }
-> { "type": "job_assigned", "in_reply_to": 2,
    "worker": "node-5", "reason": "Data moved to node-5" }

Sample Test Cases

Locality-aware scheduling (Timeout: 5000ms)

Input

{"src":"client","dest":"scheduler","body":{"type":"init","msg_id":1,"workers":["node-1","node-2","node-3"],"data_map":{"file1.txt":["node-1"],"file2.txt":["node-2"]}}}
{"src":"client","dest":"scheduler","body":{"type":"submit_job","msg_id":2,"job":{"id":"job1","inputs":["file1.txt","file2.txt"]}}}

Expected Output

{"src": "scheduler", "dest": "client", "body": {"type": "init_ok", "in_reply_to": 1}}
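
The messages in this test case use a src/dest/body envelope, and the reply echoes the request's msg_id as in_reply_to. A minimal sketch of building such a reply and handling init, assuming the scheduler keeps its state in a plain dict (make_reply, handle_init, and the state layout are hypothetical names chosen for illustration):

```python
import json


def make_reply(request: dict, body: dict) -> dict:
    """Swap src/dest and link the reply to the request's msg_id."""
    return {
        "src": request["dest"],
        "dest": request["src"],
        "body": {**body, "in_reply_to": request["body"]["msg_id"]},
    }


def handle_init(state: dict, request: dict) -> dict:
    """Record the worker list and the file -> nodes map, then acknowledge."""
    body = request["body"]
    state["workers"] = list(body.get("workers", []))
    state["data_map"] = {
        name: list(nodes) for name, nodes in body.get("data_map", {}).items()
    }
    return make_reply(request, {"type": "init_ok"})


if __name__ == "__main__":
    line = (
        '{"src":"client","dest":"scheduler","body":{"type":"init","msg_id":1,'
        '"workers":["node-1","node-2","node-3"],'
        '"data_map":{"file1.txt":["node-1"],"file2.txt":["node-2"]}}}'
    )
    state = {}
    print(json.dumps(handle_init(state, json.loads(line))))
```

A full node would presumably read one JSON message per line from standard input and dispatch on body.type; that loop is omitted here.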

Rack-aware scheduling (Timeout: 5000ms)

Input

{
  "src": "client",
  "dest": "scheduler",
  "body": {
    "type": "submit_job",
    "msg_id": 1,
    "job": {
      "id": "job1",
      "inputs": [
        "data.csv"
      ]
    },
    "topology": {
      "rack1": [
        "node-1",
        "node-2"
      ],
      "rack2": [
        "node-3",
        "node-4"
      ]
    },
    "data_location": "node-1",
    "node-1_utilization": 0.9,
    "node-2_utilization": 0.3
  }
}

Expected Output

{"src": "scheduler", "dest": "client", "body": {"type": "job_assigned", "in_reply_to": 1, "worker": "node-2", "reason": "Same rack as data (rack1) and less loaded than node-1"}}

Hints

Hint 1

Score each worker: +2 if it hosts all of the job's input files, +1 if it is in the same rack as the data, 0 otherwise

Hint 2

Among equal locality scores, prefer the worker with lower current utilization
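
Hints 1 and 2 together can be sketched as a scoring function plus a tie-break on utilization. This is only an illustration: it assumes "same rack" means sharing a rack with any node that holds any input file, treats missing utilization as 0.0, and breaks remaining ties by worker name. The names locality_score, pick_worker, and node_rack are hypothetical. It does not cover the overloaded-data-node case from the rack-aware test.

```python
def locality_score(worker, inputs, data_map, node_rack):
    """+2 if the worker hosts every input, +1 if it shares a rack with the data, else 0."""
    holders = [set(data_map.get(name, [])) for name in inputs]
    if holders and all(worker in nodes for nodes in holders):
        return 2
    # Racks that contain at least one copy of at least one input file.
    data_racks = {node_rack.get(node) for nodes in holders for node in nodes}
    data_racks.discard(None)
    worker_rack = node_rack.get(worker)
    if worker_rack is not None and worker_rack in data_racks:
        return 1
    return 0


def pick_worker(workers, inputs, data_map, node_rack, utilization):
    """Highest locality score wins; lower utilization breaks ties, then the name."""
    return min(
        workers,
        key=lambda w: (
            -locality_score(w, inputs, data_map, node_rack),
            utilization.get(w, 0.0),
            w,
        ),
    )
```

Sorting on a tuple keeps the priority order explicit: locality first, load second.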

Hint 3

Rack-aware: when the node that holds the data is overloaded, prefer a worker in the same rack over a cross-rack worker
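
One possible reading of this hint, sketched below: keep the job on the data node unless it is overloaded, and otherwise fall back to the least-loaded peer in the same rack. The task does not say when a node counts as overloaded, so the 0.8 cutoff is purely an assumption, as are the function name choose_rack_aware and the reason strings other than the one shown in the rack-aware test.

```python
OVERLOAD_THRESHOLD = 0.8  # assumed cutoff; the task does not define "overloaded"


def choose_rack_aware(data_node, topology, utilization):
    """Return (worker, reason) for a job whose data lives on `data_node`."""
    data_load = utilization.get(data_node, 0.0)
    if data_load < OVERLOAD_THRESHOLD:
        return data_node, f"Data is local to {data_node}"

    # Find the rack that contains the data node, then its other members.
    rack = next((r for r, nodes in topology.items() if data_node in nodes), None)
    peers = [n for n in topology.get(rack, []) if n != data_node]
    if peers:
        # Missing utilization is treated as 0.0 (idle), which is an assumption.
        best = min(peers, key=lambda n: (utilization.get(n, 0.0), n))
        if utilization.get(best, 0.0) < data_load:
            return best, f"Same rack as data ({rack}) and less loaded than {data_node}"

    return data_node, f"No less-loaded worker in the rack of {data_node}"


if __name__ == "__main__":
    topology = {"rack1": ["node-1", "node-2"], "rack2": ["node-3", "node-4"]}
    print(choose_rack_aware("node-1", topology, {"node-1": 0.9, "node-2": 0.3}))
```

Cross-rack workers are never considered in this sketch; a fuller version would fall back to them only when the whole rack is saturated.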

Hint 4

update_data_location changes the data map; subsequent jobs use the new location
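
A minimal sketch of that update, assuming the data map has the file -> list-of-nodes shape used by the init message (apply_location_update is a hypothetical helper name):

```python
def apply_location_update(data_map, file, old_location, new_location):
    """Move `file` from old_location to new_location in the data map, in place."""
    holders = data_map.setdefault(file, [])
    if old_location in holders:
        holders.remove(old_location)
    if new_location not in holders:
        holders.append(new_location)
    return data_map


if __name__ == "__main__":
    data_map = {"data.csv": ["node-1"]}
    apply_location_update(data_map, "data.csv", "node-1", "node-5")
    print(data_map)  # {'data.csv': ['node-5']}
```

Because later submit_job messages read the same map, they see node-5 as the holder without any extra bookkeeping.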

Hint 5

The locality_aware strategy balances speed and fairness: it is faster than pure load balancing and spreads work more evenly than locality-only placement

Resources

Data Locality in Hadoop

How Hadoop uses data locality for scheduling decisions

Key Concepts

data locality, rack awareness, worker scoring, dynamic data placement, load vs locality tradeoff
