Implement Trace Analysis and Insights - The Tracer

Implementation

Raw traces tell you what happened. Trace analysis tells you why it was slow and where errors are concentrated. By aggregating many traces, you surface bottlenecks, error hot-spots, service dependencies, and latency outliers.

Implement a node that analyses trace data and surfaces insights:

// Identify bottleneck (db takes 94% of trace time)
{ "type": "analyze_traces", "msg_id": 1,
  "traces": [{"trace_id":"t1","duration_ms":5000,
               "spans":[{"service":"web","duration":100},
                         {"service":"api","duration":200},
                         {"service":"db","duration":4700}]}] }
-> { "type": "insights", "in_reply_to": 1,
    "bottlenecks": ["db"], "critical_path": "web->api->db",
    "optimization_suggestion": "Add caching for database queries" }
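A minimal sketch of how the reply above could be produced. It rests on several assumptions the task does not state: the spans of a trace are listed in call order (the sample carries no parent IDs), the bottleneck is the service with the largest summed span duration, and the suggestion text comes from a hypothetical lookup table, `SUGGESTIONS`, whose only known entry is the one shown in the example.

```python
from typing import Any

# Hypothetical lookup table; the task only shows the suggestion for "db".
SUGGESTIONS = {"db": "Add caching for database queries"}


def analyze_traces(body: dict[str, Any]) -> dict[str, Any]:
    """Build an `insights` reply body for one `analyze_traces` request body."""
    totals: dict[str, float] = {}  # service -> summed span duration
    order: list[str] = []          # services in first-seen order

    for trace in body.get("traces", []):
        for span in trace.get("spans", []):
            service = span["service"]
            if service not in totals:
                order.append(service)
            totals[service] = totals.get(service, 0) + span["duration"]

    reply: dict[str, Any] = {"type": "insights", "in_reply_to": body["msg_id"]}
    if not totals:
        # No spans at all: reply with empty insights rather than failing.
        reply.update(bottlenecks=[], critical_path="", optimization_suggestion="")
        return reply

    worst = max(totals, key=totals.get)
    reply.update(
        bottlenecks=[worst],
        # Assumption: span order in the input is the call order.
        critical_path="->".join(order),
        # The fallback text is an assumption for services not in the table.
        optimization_suggestion=SUGGESTIONS.get(
            worst, f"Investigate slow spans in {worst}"
        ),
    )
    return reply
```

With the request from the example, `db` accounts for 4700 of the 5000 ms, so it is reported as the single bottleneck.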

// Error rate per service
{ "type": "analyze_errors", "msg_id": 2,
  "traces": [{"trace_id":"t1","has_error":true,"service":"payment-service"},
              {"trace_id":"t2","has_error":false},
              {"trace_id":"t3","has_error":true,"service":"payment-service"}] }
-> { "type": "error_analysis", "in_reply_to": 2,
    "error_rate_by_service": {"payment-service": "66.7%"},
    "total_errors": 2 }
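The error reply can be sketched in the same way. The 66.7% in the example equals 2 error traces out of the 3 traces in the request, so this sketch divides by the total trace count; that denominator is inferred from the sample and is not stated by the task. Grouping error traces that lack a `service` field under `"unknown"` is also an assumption.

```python
from collections import Counter
from typing import Any


def analyze_errors(body: dict[str, Any]) -> dict[str, Any]:
    """Build an `error_analysis` reply body for one `analyze_errors` request body."""
    traces = body.get("traces", [])

    # Count error traces per service; "unknown" is a hypothetical fallback key.
    errors = Counter(
        trace.get("service", "unknown")
        for trace in traces
        if trace.get("has_error")
    )

    total = len(traces)
    # Assumption: the rate is errors for the service over all traces in the
    # request, formatted with one decimal place as in the example.
    rates = {
        service: f"{100 * count / total:.1f}%"
        for service, count in errors.items()
    }

    return {
        "type": "error_analysis",
        "in_reply_to": body["msg_id"],
        "error_rate_by_service": rates,
        "total_errors": sum(errors.values()),
    }
```

An empty `traces` list yields an empty rate map and zero errors, because the division only runs for services that have at least one error trace.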

Sample Test Cases

Performance analysis (Timeout: 10000ms)

Input

{
  "src": "analyzer",
  "dest": "insights",
  "body": {
    "type": "analyze_traces",
    "msg_id": 1,
    "time_range": "1h",
    "traces": [
      {
        "trace_id": "t1",
        "duration_ms": 5000,
        "spans": [
          {
            "service": "web",
            "duration": 100
          },
          {
            "service": "api",
            "duration": 200
          },
          {
            "service": "db",
            "duration": 4700
          }
        ]
      }
    ]
  }
}

Expected Output

{"type": "insights", "in_reply_to": 1, "bottlenecks": ["db"], "critical_path": "web->api->db", "optimization_suggestion": "Add caching for database queries"}

Error rate analysis (Timeout: 5000ms)

Input

{
  "src": "analyzer",
  "dest": "insights",
  "body": {
    "type": "analyze_errors",
    "msg_id": 1,
    "traces": [
      {
        "trace_id": "t1",
        "has_error": true,
        "service": "payment-service"
      },
      {
        "trace_id": "t2",
        "has_error": false
      },
      {
        "trace_id": "t3",
        "has_error": true,
        "service": "payment-service"
      }
    ]
  }
}

Expected Output

{"type": "error_analysis", "in_reply_to": 1, "error_rate_by_service": {"payment-service": "66.7%"}, "total_errors": 2}

Hints

Hint 1

Bottleneck: the span with the largest share of total trace duration

Hint 2

Critical path: the chain of spans from root to leaf with the maximum total duration
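When spans form a tree rather than a flat list, the critical path can be found with a depth-first search. The sketch below assumes each span carries `span_id` and `parent_id` fields; both names are hypothetical, since the sample spans only have `service` and `duration`.

```python
from collections import defaultdict


def critical_path(spans):
    """Return (total_duration, [services]) for the heaviest root-to-leaf chain."""
    by_id = {span["span_id"]: span for span in spans}
    children = defaultdict(list)
    roots = []

    for span in spans:
        parent = span.get("parent_id")
        if parent in by_id:
            children[parent].append(span)
        else:
            # No parent, or a parent missing from this trace: treat as a root.
            roots.append(span)

    def heaviest_from(span):
        # Heaviest chain that starts at `span` and ends at a leaf below it.
        tails = [heaviest_from(child) for child in children[span["span_id"]]]
        tail_total, tail_path = max(tails, key=lambda t: t[0], default=(0, []))
        return span["duration"] + tail_total, [span["service"]] + tail_path

    if not roots:
        return 0, []
    return max((heaviest_from(root) for root in roots), key=lambda t: t[0])
```

Joining the returned services with `"->"` gives a string in the same form as `critical_path` in the reply. The recursion assumes trace trees are shallow; very deep traces would need an iterative traversal.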

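Hint 3

Error rate per service = error traces for that service / total traces in the request. In the sample, payment-service has 2 error traces out of 3 traces, which gives 66.7%; the error-free trace carries no service field, so dividing by the traces tagged with that service would give 100% instead.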

Hint 4

Service map edges: parent span service -> child span service
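A sketch of deriving service map edges from parent/child span links. As above, the `span_id` and `parent_id` field names are assumptions. Dropping edges where parent and child belong to the same service is a design choice of this sketch, not a requirement of the task.

```python
from collections import Counter


def service_map(spans):
    """Count calls between services as (parent_service, child_service) edges."""
    service_of = {span["span_id"]: span["service"] for span in spans}
    edges = Counter()

    for span in spans:
        parent_service = service_of.get(span.get("parent_id"))
        if parent_service is None:
            continue  # root span, or parent not present in this trace
        if parent_service == span["service"]:
            continue  # internal call within one service; not a map edge here
        edges[(parent_service, span["service"])] += 1

    return edges
```

The counts give a rough edge weight, which helps when the map is aggregated over many traces.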

Hint 5

Anomaly: trace duration > N * baseline p50 (e.g. 100x = high severity)
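The threshold rule can be sketched as follows. Only the "100x = high" tier comes from the hint; the `medium` and `low` multipliers, the function name, and the shape of the returned records are assumptions made for illustration. The baseline p50 is taken as the median of a non-empty list of historical durations.

```python
from statistics import median

# Multipliers of the baseline p50, checked from most to least severe.
# Only (100, "high") is taken from the hint; the other tiers are assumptions.
SEVERITY_TIERS = [(100, "high"), (10, "medium"), (3, "low")]


def find_anomalies(traces, baseline_durations_ms):
    """Flag traces whose duration exceeds a multiple of the baseline p50."""
    p50 = median(baseline_durations_ms)  # raises StatisticsError if empty
    anomalies = []

    for trace in traces:
        for factor, severity in SEVERITY_TIERS:
            # Strict comparison, matching "duration > N * baseline p50".
            if trace["duration_ms"] > factor * p50:
                anomalies.append({
                    "trace_id": trace["trace_id"],
                    "severity": severity,
                    "baseline_p50_ms": p50,
                })
                break  # keep only the most severe matching tier

    return anomalies
```

Because the tiers are checked in descending order, each trace is reported at most once, at its highest matching severity.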

Resources

Distributed Tracing with Jaeger

Jaeger docs on trace analysis and root cause investigation

Key Concepts

bottleneck detection, critical path, error rate, service map, anomaly detection
