ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/scheduler/tasks/task-22-2-2-work-partitioning
TASK

Implementation

Large datasets are split into partitions so multiple workers can process them simultaneously. The partitioning strategy controls how evenly work is distributed, which directly affects total job time.
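As a starting point, an even split can be sketched like this (a minimal helper, assuming partition sizes are measured in whole GB; earlier partitions absorb any remainder):

```python
def partition_sizes(total_size_gb: int, num_partitions: int) -> list[int]:
    """Split a dataset into near-equal partitions.

    When the size does not divide evenly, the first `extra`
    partitions each take one additional GB so no data is dropped.
    """
    base, extra = divmod(total_size_gb, num_partitions)
    return [base + (1 if i < extra else 0) for i in range(num_partitions)]

# 1 TB (1024 GB) across 4 workers -> four 256 GB partitions
print(partition_sizes(1024, 4))  # [256, 256, 256, 256]
```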

Implement a node that partitions work and handles stragglers:

// Split 1TB input evenly across 4 workers
{ "type": "submit_mapreduce", "msg_id": 1,
  "input": "s3://data/*.txt", "map": "wordcount", "num_partitions": 4 }
-> { "type": "job_submitted", "in_reply_to": 1,
    "job_id": "mr1",
    "partition_sizes": [256, 256, 256, 256], "workers": [1,2,3,4] }

// Hash beats range on skewed keys
{ "type": "benchmark_partitioning", "msg_id": 2,
  "strategies": ["range","hash"], "data": "keys_with_skew" }
-> results: range skew_factor=5.0, hash skew_factor=1.2

// Slow partition -> speculative execution on backup worker
{ "type": "submit_mapreduce", ...,
  "straggler_partition": 2, "straggler_slowdown": 5 }
-> { "type": "straggler_detected",
    "partition": 2, "action": "speculative_execution", "backup_worker": 5 }

Sample Test Cases

Partition input data (Timeout: 5000ms)
Input
{
  "src": "client",
  "dest": "mr_cluster",
  "body": {
    "type": "submit_mapreduce",
    "msg_id": 1,
    "input": "s3://data/*.txt",
    "map": "wordcount",
    "num_partitions": 4
  }
}
Expected Output
{"src": "mr_cluster", "dest": "client", "body": {"type": "job_submitted", "in_reply_to": 1, "job_id": "mr1", "partition_sizes": [256, 256, 256, 256], "workers": [1,2,3,4]}}
Hash partitioning distribution (Timeout: 10000ms)
Input
{
  "src": "client",
  "dest": "mr_cluster",
  "body": {
    "type": "benchmark_partitioning",
    "msg_id": 1,
    "strategies": [
      "range",
      "hash"
    ],
    "data": "keys_with_skew",
    "num_partitions": 4
  }
}
Expected Output
{"src": "mr_cluster", "dest": "client", "body": {"type": "benchmark_complete", "in_reply_to": 1, "results": {"range": {"skew_factor": 5.0}, "hash": {"skew_factor": 1.2}}}}
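The benchmark can be reproduced in miniature. The sketch below uses an illustrative skewed key set (not the grader's `keys_with_skew` data): most keys share a common prefix, so range partitioning by first letter piles them into one partition while hashing the whole key spreads them out. `hashlib.md5` is used instead of the builtin `hash` so results are stable across runs:

```python
import hashlib


def skew_factor(sizes: list[int]) -> float:
    """max partition size / average partition size; 1.0 is perfectly balanced."""
    return max(sizes) / (sum(sizes) / len(sizes))


def partition_counts(keys, num_partitions, assign):
    sizes = [0] * num_partitions
    for k in keys:
        sizes[assign(k)] += 1
    return sizes


# Skewed workload: 70% of keys start with "a" (illustrative data, assumed here)
keys = ([f"a{i}" for i in range(700)]
        + [f"m{i}" for i in range(150)]
        + [f"z{i}" for i in range(150)])

# Range: assign by which quarter of the alphabet the first letter falls in
range_sizes = partition_counts(
    keys, 4, lambda k: min((ord(k[0]) - ord("a")) * 4 // 26, 3))

# Hash: stable hash of the whole key, modulo the partition count
hash_sizes = partition_counts(
    keys, 4, lambda k: int(hashlib.md5(k.encode()).hexdigest(), 16) % 4)

print(skew_factor(range_sizes))  # 2.8 -- all "a" keys land in one partition
print(skew_factor(hash_sizes))   # close to 1.0
```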

Hints

Hint 1
Hash partitioning: partition_id = hash(key) % num_partitions; this spreads keys evenly across partitions even when the key distribution is skewed
Hint 2
Skew factor = max_partition_size / avg_partition_size — closer to 1.0 means more balanced
Hint 3
Range partitioning produces unequal partitions when key distribution is skewed
Hint 4
A straggler is a partition whose runtime exceeds 2x the median of completed partitions
Hint 5
Speculative execution: launch a backup copy on another worker and take whichever finishes first
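Hints 4 and 5 combine into a small detection-plus-backup routine. This is a sketch under assumed conventions (worker IDs run 1..num_workers, and the reply shape mirrors the `straggler_detected` message above):

```python
import statistics


def find_stragglers(running: dict[int, float], completed: dict[int, float]) -> list[int]:
    """Hint 4's rule: a running partition is a straggler once its elapsed
    time exceeds 2x the median runtime of already-completed partitions."""
    if not completed:
        return []  # no baseline yet, nothing to compare against
    threshold = 2 * statistics.median(completed.values())
    return [p for p, elapsed in running.items() if elapsed > threshold]


def speculate(partition: int, busy_workers: set[int], num_workers: int) -> dict:
    """Launch a backup copy on the first idle worker (IDs assumed 1..num_workers)."""
    backup = next(w for w in range(1, num_workers + 1) if w not in busy_workers)
    return {"type": "straggler_detected", "partition": partition,
            "action": "speculative_execution", "backup_worker": backup}


# Partitions 0, 1, 3 finished in ~10s; partition 2 has been running for 52s
done = {0: 10.0, 1: 11.0, 3: 9.5}
running = {2: 52.0}
for p in find_stragglers(running, done):  # median 10s, threshold 20s -> partition 2
    print(speculate(p, busy_workers={1, 2, 3, 4}, num_workers=5))
```

Whichever copy finishes first wins; the other is cancelled and its result discarded, so speculation trades extra compute for lower tail latency.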
OVERVIEW

Theoretical Hub

Concept overview coming soon

Key Concepts

hash partitioning, range partitioning, data skew, straggler mitigation, speculative execution