ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/scheduler/tasks/task-22-2-5-dynamic-scheduling
TASK

Implementation

Moving a job to where its data lives is cheaper than shipping large data over the network. Locality-aware scheduling scores workers based on data proximity, then selects the best-scoring, least-loaded worker.

Implement a node that makes locality-aware scheduling decisions:

// node-1 hosts the data but is 90% loaded; node-2 is in the same rack and 30% loaded
{ "type": "submit_job", "msg_id": 1,
  "job": {"id":"job1","inputs":["data.csv"]},
  "topology": {"rack1":["node-1","node-2"]},
  "data_location": "node-1",
  "node-1_utilization": 0.9,
  "node-2_utilization": 0.3 }
-> { "type": "job_assigned", "in_reply_to": 1,
    "worker": "node-2",
    "reason": "Same rack as data (rack1) and less loaded than node-1" }
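One way to reproduce the decision above is sketched below. The names `OVERLOAD_THRESHOLD`, `rack_of`, `utilization`, and `choose_worker` are hypothetical. The 0.8 overload cutoff and the treatment of a missing utilization as 0.0 are assumptions: the task does not say how loaded a node must be before the scheduler gives up exact locality.

```python
# Assumption: a node at or above this utilization counts as overloaded.
OVERLOAD_THRESHOLD = 0.8


def rack_of(node, topology):
    """Return the rack name containing `node`, or None if it is unknown."""
    for rack, members in topology.items():
        if node in members:
            return rack
    return None


def utilization(body, node):
    """Read the "<node>_utilization" field; assumption: missing means 0.0."""
    return body.get(f"{node}_utilization", 0.0)


def choose_worker(body):
    """Pick the data node unless it is overloaded, else a same-rack peer."""
    topology = body["topology"]
    data_node = body["data_location"]
    if utilization(body, data_node) < OVERLOAD_THRESHOLD:
        return data_node
    rack = rack_of(data_node, topology)
    same_rack = [n for n in topology.get(rack, []) if n != data_node]
    if same_rack:
        return min(same_rack, key=lambda n: utilization(body, n))
    # No same-rack peer: fall back to the least-loaded node anywhere.
    others = [n for members in topology.values() for n in members if n != data_node]
    return min(others, key=lambda n: utilization(body, n), default=data_node)


body = {
    "topology": {"rack1": ["node-1", "node-2"]},
    "data_location": "node-1",
    "node-1_utilization": 0.9,
    "node-2_utilization": 0.3,
}
print(choose_worker(body))
```

With the message above, node-1 is over the assumed cutoff, so the sketch falls through to its only rack1 peer, node-2.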

// Data moves to node-5 -> future jobs follow it
{ "type": "update_data_location", "msg_id": 1,
  "file": "data.csv", "old_location": "node-1", "new_location": "node-5" }
{ "type": "submit_job", "msg_id": 2,
  "job": {"id":"job1","inputs":["data.csv"]} }
-> { "type": "job_assigned", "in_reply_to": 2,
    "worker": "node-5", "reason": "Data moved to node-5" }
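The relocation exchange above implies that the scheduler keeps a mutable file-to-nodes map. A minimal sketch follows; the `DataMap` class and its method names are hypothetical, and keeping a list of nodes per file mirrors the `data_map` shape used in the `init` message of the sample tests.

```python
class DataMap:
    """Tracks which nodes host each file (hypothetical helper)."""

    def __init__(self, initial=None):
        # Copy the lists so later updates do not mutate the caller's data.
        self._locations = {f: list(nodes) for f, nodes in (initial or {}).items()}

    def update(self, file, old_location, new_location):
        """Apply an update_data_location message to the map."""
        nodes = self._locations.setdefault(file, [])
        if old_location in nodes:
            nodes.remove(old_location)
        if new_location not in nodes:
            nodes.append(new_location)

    def locations(self, file):
        """Return the nodes currently hosting `file` (empty if unknown)."""
        return list(self._locations.get(file, []))


dm = DataMap({"data.csv": ["node-1"]})
dm.update("data.csv", old_location="node-1", new_location="node-5")
print(dm.locations("data.csv"))
```

Any job submitted after the update looks up `data.csv` and sees only node-5, which is why the second `submit_job` is assigned there.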

Sample Test Cases

Locality-aware scheduling
Timeout: 5000ms
Input
{"src":"client","dest":"scheduler","body":{"type":"init","msg_id":1,"workers":["node-1","node-2","node-3"],"data_map":{"file1.txt":["node-1"],"file2.txt":["node-2"]}}}
{"src":"client","dest":"scheduler","body":{"type":"submit_job","msg_id":2,"job":{"id":"job1","inputs":["file1.txt","file2.txt"]}}}
Expected Output
{"src": "scheduler", "dest": "client", "body": {"type": "init_ok", "in_reply_to": 1}}
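The expected output shows the reply envelope: `src` and `dest` are swapped relative to the request, and the body carries `in_reply_to` set to the request's `msg_id`. A sketch of that convention is below; `make_reply`, `handle_init`, and the `state` dictionary are hypothetical names, not part of the task's interface.

```python
import json


def make_reply(request, body):
    """Build a reply envelope addressed back to the request's sender."""
    reply_body = dict(body)
    reply_body["in_reply_to"] = request["body"]["msg_id"]
    return {"src": request["dest"], "dest": request["src"], "body": reply_body}


def handle_init(request, state):
    """Record the workers and data map from init, then acknowledge."""
    body = request["body"]
    state["workers"] = list(body.get("workers", []))
    state["data_map"] = {f: list(n) for f, n in body.get("data_map", {}).items()}
    return make_reply(request, {"type": "init_ok"})


line = (
    '{"src":"client","dest":"scheduler","body":{"type":"init","msg_id":1,'
    '"workers":["node-1","node-2","node-3"],'
    '"data_map":{"file1.txt":["node-1"],"file2.txt":["node-2"]}}}'
)
state = {}
reply = handle_init(json.loads(line), state)
print(json.dumps(reply))
```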
Rack-aware scheduling
Timeout: 5000ms
Input
{
  "src": "client",
  "dest": "scheduler",
  "body": {
    "type": "submit_job",
    "msg_id": 1,
    "job": {
      "id": "job1",
      "inputs": [
        "data.csv"
      ]
    },
    "topology": {
      "rack1": [
        "node-1",
        "node-2"
      ],
      "rack2": [
        "node-3",
        "node-4"
      ]
    },
    "data_location": "node-1",
    "node-1_utilization": 0.9,
    "node-2_utilization": 0.3
  }
}
Expected Output
{"src": "scheduler", "dest": "client", "body": {"type": "job_assigned", "in_reply_to": 1, "worker": "node-2", "reason": "Same rack as data (rack1) and less loaded than node-1"}}

Hints

Hint 1
Score each worker: +2 if it hosts all input files, +1 if it is in the same rack as the data, 0 otherwise
Hint 2
Among equal locality scores, prefer the worker with lower current utilization
Hint 3
Rack-aware: when the node that hosts the data is overloaded, prefer a worker in the same rack over a worker in another rack
Hint 4
update_data_location changes the data map; subsequent jobs use the new location
Hint 5
locality_aware trades off speed and fairness: jobs run faster than under pure load balancing, and load is spread more evenly than under locality-only placement
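Hints 1 and 2 can be sketched as a score plus a tie-break. The function names and the `utilization` dictionary are hypothetical. Two readings are assumptions: "same rack" is taken to mean sharing a rack with any node that hosts one of the inputs, and a worker with no reported utilization is treated as 0.0.

```python
def locality_score(worker, inputs, data_map, topology):
    """+2 if worker hosts every input, +1 if it shares a rack with the data, else 0."""
    hosts = [set(data_map.get(f, [])) for f in inputs]
    if hosts and all(worker in h for h in hosts):
        return 2
    data_nodes = set().union(*hosts)
    for members in topology.values():
        if worker in members and data_nodes & set(members):
            return 1
    return 0


def rank_workers(workers, inputs, data_map, topology, utilization):
    """Sort by locality score (high first), then by utilization (low first)."""
    return sorted(
        workers,
        key=lambda w: (
            -locality_score(w, inputs, data_map, topology),
            utilization.get(w, 0.0),
        ),
    )


data_map = {"a.csv": ["node-1"], "b.csv": ["node-1"]}
topology = {"rack1": ["node-1", "node-2", "node-4"], "rack2": ["node-3"]}
utilization = {"node-1": 0.5, "node-2": 0.6, "node-3": 0.1, "node-4": 0.2}
ranked = rank_workers(
    ["node-1", "node-2", "node-3", "node-4"],
    ["a.csv", "b.csv"], data_map, topology, utilization,
)
print(ranked)
```

This ranking alone always puts a worker that hosts all inputs first, regardless of its load. The overloaded-data-node behavior from Hint 3 would have to be applied as a separate rule, for example by filtering out overloaded workers before ranking.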
OVERVIEW

Key Concepts

data locality, rack awareness, worker scoring, dynamic data placement, load vs. locality tradeoff
Implement Dynamic Scheduling with Locality Awareness - The Scheduler | Build Distributed Systems