ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/tracer/tasks/task-23-1-4-trace-analysis
TASK

Implementation

Raw traces tell you what happened. Trace analysis tells you why it was slow and where errors are concentrated. By aggregating many traces, you surface bottlenecks, error hot-spots, service dependencies, and latency outliers.

Implement a node that analyses trace data and surfaces insights:

// Identify bottleneck (db takes 94% of trace time)
{ "type": "analyze_traces", "msg_id": 1,
  "traces": [{"trace_id":"t1","duration_ms":5000,
               "spans":[{"service":"web","duration":100},
                         {"service":"api","duration":200},
                         {"service":"db","duration":4700}]}] }
-> { "type": "insights", "in_reply_to": 1,
    "bottlenecks": ["db"], "critical_path": "web->api->db",
    "optimization_suggestion": "Add caching for database queries" }

// Error rate per service
{ "type": "analyze_errors", "msg_id": 2,
  "traces": [{"trace_id":"t1","has_error":true,"service":"payment-service"},
              {"trace_id":"t2","has_error":false},
              {"trace_id":"t3","has_error":true,"service":"payment-service"}] }
-> { "type": "error_analysis", "in_reply_to": 2,
    "error_rate_by_service": {"payment-service": "66.7%"},
    "total_errors": 2 }

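Both request types share the same reply convention: the response echoes the request's "msg_id" as "in_reply_to". A minimal sketch of that routing step follows, assuming the node may receive either a bare body or the src/dest/body envelope used in the test cases. The names dispatch, unwrap, and the "error" reply shape are hypothetical and are not defined by the task.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]


def unwrap(message: dict) -> dict:
    """Accept either a bare body or a src/dest/body envelope."""
    return message.get("body", message)


def dispatch(body: dict, handlers: Dict[str, Handler]) -> dict:
    """Route a request body to its handler and stamp the reply with in_reply_to."""
    handler = handlers.get(body.get("type"))
    if handler is None:
        # Hypothetical error shape: the task does not specify one.
        return {
            "type": "error",
            "in_reply_to": body.get("msg_id"),
            "text": "unknown request type",
        }
    reply = handler(body)
    reply["in_reply_to"] = body["msg_id"]
    return reply
```

Keeping the handlers in a dictionary keyed by "type" means a new request type can be added without touching the routing code.
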
Sample Test Cases

Performance analysis (Timeout: 10000ms)
Input
{
  "src": "analyzer",
  "dest": "insights",
  "body": {
    "type": "analyze_traces",
    "msg_id": 1,
    "time_range": "1h",
    "traces": [
      {
        "trace_id": "t1",
        "duration_ms": 5000,
        "spans": [
          {
            "service": "web",
            "duration": 100
          },
          {
            "service": "api",
            "duration": 200
          },
          {
            "service": "db",
            "duration": 4700
          }
        ]
      }
    ]
  }
}
Expected Output
{"type": "insights", "in_reply_to": 1, "bottlenecks": ["db"], "critical_path": "web->api->db", "optimization_suggestion": "Add caching for database queries"}
Error rate analysis (Timeout: 5000ms)
Input
{
  "src": "analyzer",
  "dest": "insights",
  "body": {
    "type": "analyze_errors",
    "msg_id": 1,
    "traces": [
      {
        "trace_id": "t1",
        "has_error": true,
        "service": "payment-service"
      },
      {
        "trace_id": "t2",
        "has_error": false
      },
      {
        "trace_id": "t3",
        "has_error": true,
        "service": "payment-service"
      }
    ]
  }
}
Expected Output
{"type": "error_analysis", "in_reply_to": 1, "error_rate_by_service": {"payment-service": "66.7%"}, "total_errors": 2}

Hints

Hint 1
Bottleneck: the span with the largest share of total trace duration
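One way to read this hint, sketched under the assumption that spans carry the "service" and "duration" fields shown in the sample input: sum the span durations per service, divide by the trace duration, and pick the largest share. The function names service_shares and find_bottleneck are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


def service_shares(trace: dict) -> Dict[str, float]:
    """Fraction of the trace duration spent in each service."""
    spans = trace.get("spans", [])
    # Prefer the reported trace duration; fall back to the sum of span durations.
    total = trace.get("duration_ms") or sum(s["duration"] for s in spans)
    if total <= 0:
        return {}
    totals = defaultdict(float)
    for span in spans:
        totals[span["service"]] += span["duration"]
    return {service: spent / total for service, spent in totals.items()}


def find_bottleneck(trace: dict) -> Optional[Tuple[str, float]]:
    """Return (service, share) for the service with the largest share, if any."""
    shares = service_shares(trace)
    if not shares:
        return None
    service = max(shares, key=shares.get)
    return service, shares[service]
```

For the sample trace, the db span accounts for 4700 of 5000 ms, which is the 94% quoted in the task description.
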
Hint 2
Critical path: the chain of spans from root to leaf with the maximum total duration
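The sample spans carry no parent links, so the expected "web->api->db" matches the listed span order. The sketch below assumes hypothetical "span_id" and "parent_id" fields for the tree case and falls back to the listed order when they are absent. Neither field name is defined by the task.

```python
from collections import defaultdict
from typing import List, Tuple


def critical_path(spans: List[dict]) -> str:
    """Heaviest root-to-leaf chain of services, joined with '->'."""
    if not spans:
        return ""
    in_order = "->".join(s["service"] for s in spans)
    if not all("span_id" in s for s in spans):
        # No parent links (as in the sample): treat the spans as one chain.
        return in_order

    ids = {s["span_id"] for s in spans}
    children = defaultdict(list)
    roots = []
    for span in spans:
        parent = span.get("parent_id")  # hypothetical field
        if parent in ids:
            children[parent].append(span)
        else:
            roots.append(span)
    if not roots:
        # Malformed input (every span claims a parent): fall back to order.
        return in_order

    def heaviest(span: dict) -> Tuple[float, List[str]]:
        # Total duration and service names of the heaviest path from this span.
        options = [heaviest(child) for child in children[span["span_id"]]]
        if not options:
            return span["duration"], [span["service"]]
        duration, path = max(options, key=lambda option: option[0])
        return span["duration"] + duration, [span["service"]] + path

    _, best = max((heaviest(root) for root in roots), key=lambda option: option[0])
    return "->".join(best)
```
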
Hint 3
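Error rate per service = error traces for that service / total traces for that service. Note that the sample test expects 66.7% for payment-service (2 errors out of 3 traces), so the trace without a "service" field counts toward the denominator there.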
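A sketch of the calculation, assuming the trace shape from the sample input. By default it divides by all traces in the request, which reproduces the sample's 66.7% (2 of 3). The per_service_denominator flag is a hypothetical switch that applies the formula in the hint literally instead.

```python
from collections import Counter
from typing import List


def error_analysis(traces: List[dict], per_service_denominator: bool = False) -> dict:
    """Build an error_analysis body from a list of trace summaries."""
    # Error count per service, skipping traces that name no service.
    errors = Counter(
        t["service"] for t in traces if t.get("has_error") and "service" in t
    )
    # Trace count per service, used only for the literal per-service formula.
    seen = Counter(t["service"] for t in traces if "service" in t)

    rates = {}
    for service, count in errors.items():
        denominator = seen[service] if per_service_denominator else len(traces)
        rates[service] = f"{100 * count / denominator:.1f}%"

    return {
        "type": "error_analysis",
        "error_rate_by_service": rates,
        "total_errors": sum(1 for t in traces if t.get("has_error")),
    }
```

With the sample input, the default mode yields "66.7%" for payment-service, while the per-service mode yields "100.0%" because both payment-service traces failed.
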
Hint 4
Service map edges: parent span service -> child span service
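The sample spans do not include parent links, so this sketch assumes hypothetical "span_id" and "parent_id" fields. It counts how often each parent service calls each child service and skips calls that stay within one service.

```python
from collections import Counter
from typing import List


def service_map(traces: List[dict]) -> Counter:
    """Count (parent service, child service) edges across all traces."""
    edges = Counter()
    for trace in traces:
        # Index spans by id so each child can look up its parent.
        by_id = {s["span_id"]: s for s in trace.get("spans", []) if "span_id" in s}
        for span in by_id.values():
            parent = by_id.get(span.get("parent_id"))  # hypothetical field
            if parent is not None and parent["service"] != span["service"]:
                edges[(parent["service"], span["service"])] += 1
    return edges
```

The counts can serve as edge weights when the map is rendered, so frequently used dependencies stand out.
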
Hint 5
Anomaly: trace duration > N * baseline p50 (e.g. 100x = high severity)
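A sketch of the threshold check, assuming the baseline p50 is the median of a separate list of known-good durations. Only the 100x "high" level comes from the hint. The default factor of 10 and the "medium" label are illustrative assumptions.

```python
from statistics import median
from typing import List


def detect_anomalies(
    traces: List[dict],
    baseline_durations_ms: List[float],
    factor: float = 10.0,        # assumed flagging threshold (N in the hint)
    high_factor: float = 100.0,  # 100x = high severity, per the hint
) -> List[dict]:
    """Flag traces whose duration exceeds factor * baseline p50."""
    if not baseline_durations_ms:
        return []
    p50 = median(baseline_durations_ms)
    if p50 <= 0:
        return []
    anomalies = []
    for trace in traces:
        ratio = trace["duration_ms"] / p50
        if ratio > factor:
            anomalies.append({
                "trace_id": trace["trace_id"],
                "ratio": ratio,
                "severity": "high" if ratio > high_factor else "medium",
            })
    return anomalies
```

Using the median as the baseline keeps a few slow traces in the baseline window from hiding later outliers, which a mean would not.
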
OVERVIEW

Key Concepts

bottleneck detection, critical path, error rate, service map, anomaly detection
Implement Trace Analysis and Insights - The Tracer | Build Distributed Systems