ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/orchestrator/tasks/task-26-2-5-observability

Implementation

Observability in a service mesh means collecting metrics, traces, and logs from every sidecar proxy automatically, without changing application code. Together, these three pillars show how each service is behaving and how requests flow between services, which makes problems faster to diagnose.

Implement a node that handles four message types covering these signals (record, trace, query, and generate):

// Record a request metric (latency + status code)
{ "type": "record", "msg_id": 1,
  "service": "api", "duration_ms": 150, "status": 200 }
-> { "type": "metrics_recorded", "in_reply_to": 1,
    "service": "api", "request_count": 1 }

// Create a distributed trace with a span
{ "type": "trace", "msg_id": 2,
  "operation": "GET /api/users" }
-> { "type": "trace_created", "in_reply_to": 2,
    "trace_id": "<uuid>", "span_id": "<uuid>" }

// Query access logs by service
{ "type": "query", "msg_id": 3,
  "filter": {"source_service": "api", "target_service": "database"} }
-> { "type": "access_logs", "in_reply_to": 3,
    "logs": [{"source_service": "api", "target_service": "database", "count": 50}] }

// Generate service dependency graph
{ "type": "generate", "msg_id": 4 }
-> { "type": "service_graph", "in_reply_to": 4,
    "nodes": ["api", "database", "cache"],
    "edges": [{"source": "api", "target": "database", "request_count": 500}] }

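The four message types above share one pattern: read the type, run the matching handler, and echo msg_id back as in_reply_to. A minimal sketch of that routing follows. The name make_dispatcher, the placeholder handlers, and the shape of the error reply are all assumptions for illustration; the task does not define what to return for an unknown type.

```python
def make_dispatcher(handlers):
    """Return a function that routes a message body to the handler for its type."""

    def dispatch(body):
        handler = handlers.get(body["type"])
        if handler is None:
            # Assumption: the task does not specify an error reply; this shape is invented.
            return {
                "type": "error",
                "in_reply_to": body["msg_id"],
                "text": f"unsupported type: {body['type']}",
            }
        reply = handler(body)
        # Every reply in the examples carries the request's msg_id as in_reply_to.
        reply["in_reply_to"] = body["msg_id"]
        return reply

    return dispatch


# Placeholder handlers that only name the reply type; real ones would fill in the data.
handlers = {
    "record": lambda body: {"type": "metrics_recorded"},
    "trace": lambda body: {"type": "trace_created"},
    "query": lambda body: {"type": "access_logs"},
    "generate": lambda body: {"type": "service_graph"},
}
dispatch = make_dispatcher(handlers)
```

Keeping in_reply_to in the dispatcher means individual handlers cannot forget it.
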
Sample Test Cases

Collect service metrics (Timeout: 5000ms)
Input
{
  "src": "sidecar",
  "dest": "metrics",
  "body": {
    "type": "record",
    "msg_id": 1,
    "service": "api",
    "duration_ms": 150,
    "status": 200
  }
}
Expected Output
{"type": "metrics_recorded", "in_reply_to": 1, "service": "api", "request_count": 1}
Create distributed trace (Timeout: 5000ms)
Input
{
  "src": "service-a",
  "dest": "tracer",
  "body": {
    "type": "trace",
    "msg_id": 1,
    "operation": "GET /api/users"
  }
}
Expected Output
{"type": "trace_created", "in_reply_to": 1, "trace_id": ".*", "span_id": ".*"}
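
The two sample tests can be satisfied by something like the sketch below. It is not the reference solution. The class name MetricsAndTraces, the in-memory dictionaries, and the use of uuid4 for identifiers are assumptions. The sketch also assumes that the reply is the bare body, as the expected outputs show, and that ".*" in the expected trace output is a wildcard that accepts any identifier.

```python
import uuid
from collections import defaultdict


class MetricsAndTraces:
    """Hypothetical in-memory store for the record and trace messages."""

    def __init__(self):
        # service name -> list of (duration_ms, status) samples
        self.samples = defaultdict(list)
        # trace_id -> list of spans, each span a dict
        self.traces = {}

    def handle(self, envelope):
        """Take a full {src, dest, body} envelope and return the reply body."""
        body = envelope["body"]
        if body["type"] == "record":
            return self.record(body)
        if body["type"] == "trace":
            return self.trace(body)
        raise ValueError(f"unsupported type: {body['type']}")

    def record(self, body):
        service = body["service"]
        self.samples[service].append((body["duration_ms"], body["status"]))
        return {
            "type": "metrics_recorded",
            "in_reply_to": body["msg_id"],
            "service": service,
            # One sample is stored per record call, so the length is the count.
            "request_count": len(self.samples[service]),
        }

    def trace(self, body):
        trace_id = str(uuid.uuid4())
        span_id = str(uuid.uuid4())
        # A new trace starts with a single span for the requested operation.
        self.traces[trace_id] = [{"span_id": span_id, "operation": body["operation"]}]
        return {
            "type": "trace_created",
            "in_reply_to": body["msg_id"],
            "trace_id": trace_id,
            "span_id": span_id,
        }
```

Storing raw samples rather than a bare counter keeps latency and status codes available for later aggregation, at the cost of memory that grows with traffic.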

Hints

Hint 1
record stores one request metric: service name, duration_ms, and HTTP status code
Hint 2
trace creates a trace with a unique trace_id and a span for the given operation
Hint 3
query returns access logs filtered by source_service and/or target_service
Hint 4
generate builds a service graph: nodes are service names, edges are (source, target, count) pairs
Hint 5
request_count increments by 1 for every record call for that service
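
Hints 3 and 4 can be sketched with a single counter keyed by (source, target) pairs. The task does not say how access-log entries reach the node, so the log_call helper below is purely hypothetical, as are the class name AccessLog and the choice to list nodes in first-seen order.

```python
from collections import Counter


class AccessLog:
    """Hypothetical store of call counts per (source_service, target_service) pair."""

    def __init__(self):
        self.calls = Counter()  # (source, target) -> number of requests

    def log_call(self, source, target, count=1):
        # Assumption: the task does not specify how log entries arrive;
        # this helper exists only to populate the sketch.
        self.calls[(source, target)] += count

    def handle_query(self, body):
        flt = body.get("filter", {})
        logs = [
            {"source_service": s, "target_service": t, "count": n}
            for (s, t), n in self.calls.items()
            # A key missing from the filter matches every entry ("and/or" in Hint 3).
            if flt.get("source_service", s) == s
            and flt.get("target_service", t) == t
        ]
        return {"type": "access_logs", "in_reply_to": body["msg_id"], "logs": logs}

    def handle_generate(self, body):
        nodes = []
        for pair in self.calls:
            for name in pair:
                if name not in nodes:
                    nodes.append(name)  # keep first-seen order, no duplicates
        edges = [
            {"source": s, "target": t, "request_count": n}
            for (s, t), n in self.calls.items()
        ]
        return {
            "type": "service_graph",
            "in_reply_to": body["msg_id"],
            "nodes": nodes,
            "edges": edges,
        }
```

Because both replies are derived from the same counter, the graph edges and the access-log counts cannot drift apart.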

Key Concepts

metrics, distributed tracing, access logs, service graph, golden signals