TASK
Implementation
Large datasets are split into partitions so multiple workers can process them simultaneously. The partitioning strategy controls how evenly work is distributed, which directly affects total job time.
Implement a node that partitions work and handles stragglers:
// Split 1TB input evenly across 4 workers
{ "type": "submit_mapreduce", "msg_id": 1,
"input": "s3://data/*.txt", "map": "wordcount", "num_partitions": 4 }
-> { "type": "job_submitted", "in_reply_to": 1,
"job_id": "mr1",
"partition_sizes": [256, 256, 256, 256], "workers": [1,2,3,4] }
// Hash beats range on skewed keys
{ "type": "benchmark_partitioning", "msg_id": 2,
"strategies": ["range","hash"], "data": "keys_with_skew" }
-> results: range skew_factor=5.0, hash skew_factor=1.2
// Slow partition -> speculative execution on backup worker
{ "type": "submit_mapreduce", ...,
"straggler_partition": 2, "straggler_slowdown": 5 }
-> { "type": "straggler_detected",
"partition": 2, "action": "speculative_execution", "backup_worker": 5 }Sample Test Cases
Partition input dataTimeout: 5000ms
Input
{
"src": "client",
"dest": "mr_cluster",
"body": {
"type": "submit_mapreduce",
"msg_id": 1,
"input": "s3://data/*.txt",
"map": "wordcount",
"num_partitions": 4
}
}Expected Output
{"src": "mr_cluster", "dest": "client", "body": {"type": "job_submitted", "in_reply_to": 1, "job_id": "mr1", "partition_sizes": [256, 256, 256, 256], "workers": [1,2,3,4]}}
Hash partitioning distributionTimeout: 10000ms
Input
{
"src": "client",
"dest": "mr_cluster",
"body": {
"type": "benchmark_partitioning",
"msg_id": 1,
"strategies": [
"range",
"hash"
],
"data": "keys_with_skew",
"num_partitions": 4
}
}Expected Output
{"src": "mr_cluster", "dest": "client", "body": {"type": "benchmark_complete", "in_reply_to": 1, "results": {"range": {"skew_factor": 5.0}, "hash": {"skew_factor": 1.2}}}}
Hints
Hint 1▾
Hash partitioning: partition_id = hash(job_id) % num_partitions — distributes evenly even for skewed keys
Hint 2▾
Skew factor = max_partition_size / avg_partition_size — closer to 1.0 means more balanced
Hint 3▾
Range partitioning produces unequal partitions when key distribution is skewed
Hint 4▾
A straggler is a partition whose runtime exceeds 2x the median of completed partitions
Hint 5▾
Speculative execution: launch a backup copy on another worker and take whichever finishes first
OVERVIEW
Theoretical Hub
Concept overview coming soon
Key Concepts
hash partitioningrange partitioningdata skewstraggler mitigationspeculative execution
main.py
python
1
2
3
4
5
6
7
8
9
10
11
12
13
#!/usr/bin/env python3
import sys
import json
def main():
# Your implementation here
for line in sys.stdin:
msg = json.loads(line)
print(json.dumps(msg), flush=True)
if __name__ == "__main__":
main()