TASK
Implementation
Raw traces tell you what happened. Trace analysis tells you why it was slow and where errors are concentrated. By aggregating many traces, you surface bottlenecks, error hot-spots, service dependencies, and latency outliers.
Implement a node that analyses trace data and surfaces insights:
// Identify bottleneck (db takes 94% of trace time)
{ "type": "analyze_traces", "msg_id": 1,
"traces": [{"trace_id":"t1","duration_ms":5000,
"spans":[{"service":"web","duration":100},
{"service":"api","duration":200},
{"service":"db","duration":4700}]}] }
-> { "type": "insights", "in_reply_to": 1,
"bottlenecks": ["db"], "critical_path": "web->api->db",
"optimization_suggestion": "Add caching for database queries" }
// Error rate per service
{ "type": "analyze_errors", "msg_id": 2,
"traces": [{"trace_id":"t1","has_error":true,"service":"payment-service"},
{"trace_id":"t2","has_error":false},
{"trace_id":"t3","has_error":true,"service":"payment-service"}] }
-> { "type": "error_analysis", "in_reply_to": 2,
"error_rate_by_service": {"payment-service": "66.7%"},
"total_errors": 2 }Sample Test Cases
Performance analysisTimeout: 10000ms
Input
{
"src": "analyzer",
"dest": "insights",
"body": {
"type": "analyze_traces",
"msg_id": 1,
"time_range": "1h",
"traces": [
{
"trace_id": "t1",
"duration_ms": 5000,
"spans": [
{
"service": "web",
"duration": 100
},
{
"service": "api",
"duration": 200
},
{
"service": "db",
"duration": 4700
}
]
}
]
}
}Expected Output
{"type": "insights", "in_reply_to": 1, "bottlenecks": ["db"], "critical_path": "web->api->db", "optimization_suggestion": "Add caching for database queries"}Error rate analysisTimeout: 5000ms
Input
{
"src": "analyzer",
"dest": "insights",
"body": {
"type": "analyze_errors",
"msg_id": 1,
"traces": [
{
"trace_id": "t1",
"has_error": true,
"service": "payment-service"
},
{
"trace_id": "t2",
"has_error": false
},
{
"trace_id": "t3",
"has_error": true,
"service": "payment-service"
}
]
}
}Expected Output
{"type": "error_analysis", "in_reply_to": 1, "error_rate_by_service": {"payment-service": "66.7%"}, "total_errors": 2}Hints
Hint 1▾
Bottleneck: the span with the largest share of total trace duration
Hint 2▾
Critical path: the chain of spans from root to leaf with the maximum total duration
Hint 3▾
Error rate per service = error traces for that service / total traces for that service
Hint 4▾
Service map edges: parent span service -> child span service
Hint 5▾
Anomaly: trace duration > N * baseline p50 (e.g. 100x = high severity)
OVERVIEW
Theoretical Hub
Concept overview coming soon
Key Concepts
bottleneck detectioncritical patherror rateservice mapanomaly detection
main.py
python
1
2
3
4
5
6
7
8
9
10
11
12
13
#!/usr/bin/env python3
import sys
import json
def main():
# Your implementation here
for line in sys.stdin:
msg = json.loads(line)
print(json.dumps(msg), flush=True)
if __name__ == "__main__":
main()