ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/tracer/tasks/task-23-2-2-alerting-rules
TASK

Implementation

An alerting rules engine evaluates metric conditions and fires notifications when thresholds are breached. It routes alerts to the right channel based on severity, groups duplicate alerts to prevent storms, and auto-resolves when conditions return to normal.

Implement a node that evaluates alert rules and manages notifications:

// Error rate above threshold for 5 minutes -> WARNING
{ "type": "evaluate", "msg_id": 1,
  "metric": "error_rate", "value": 0.08,
  "threshold": 0.05, "duration_sec": 300 }
-> { "type": "alert_triggered", "in_reply_to": 1,
    "rule": "High error rate", "severity": "WARNING", "value": 0.08 }

// Service down -> CRITICAL -> page PagerDuty
{ "type": "evaluate", "msg_id": 2,
  "metric": "up", "value": 0, "threshold": 0,
  "duration_sec": 60, "service": "api" }
{ "routing": {"channels": ["pagerduty"]} }
-> { "type": "alert_triggered", "in_reply_to": 2,
    "severity": "CRITICAL", "action": "page_sent", "service": "api" }

// Metric returns to normal -> auto-resolve
{ "type": "evaluate", "msg_id": 3,
  "metric": "error_rate", "value": 0.01,
  "threshold": 0.05, "alert_resolved": true }
-> { "type": "alert_resolved", "in_reply_to": 3,
    "rule": "High error rate", "resolution": "Value returned to normal" }

Sample Test Cases

Threshold alert triggered (Timeout: 5000ms)
Input
{
  "src": "metrics",
  "dest": "alerter",
  "body": {
    "type": "evaluate",
    "msg_id": 1,
    "metric": "error_rate",
    "value": 0.08,
    "threshold": 0.05,
    "duration_sec": 300
  }
}
Expected Output
{"type": "alert_triggered", "in_reply_to": 1, "rule": "High error rate", "severity": "WARNING", "value": 0.08}
Alert routing to PagerDuty (Timeout: 5000ms)
Input
{
  "src": "metrics",
  "dest": "alerter",
  "body": {
    "type": "evaluate",
    "msg_id": 1,
    "metric": "up",
    "value": 0,
    "threshold": 0,
    "duration_sec": 60,
    "service": "api"
  },
  "routing": {
    "channels": [
      "pagerduty"
    ]
  }
}
Expected Output
{"type": "alert_triggered", "in_reply_to": 1, "severity": "CRITICAL", "action": "page_sent", "service": "api"}

Hints

Hint 1
Fire an alert when the metric exceeds its threshold for at least duration_sec seconds
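One way to track the duration condition, sketched with an injectable clock so the logic is testable (the `BreachTracker` name and the per-metric state layout are assumptions):

```python
import time

class BreachTracker:
    """Track how long each metric has been above its threshold (sketch)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock          # injectable for testing
        self.breached_since = {}    # metric -> timestamp of first breach

    def should_fire(self, metric, value, threshold, duration_sec):
        now = self.clock()
        if value > threshold:
            # Record the first breach time; fire once it has persisted.
            start = self.breached_since.setdefault(metric, now)
            return now - start >= duration_sec
        # Condition cleared: reset so a new breach restarts the timer.
        self.breached_since.pop(metric, None)
        return False
```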
Hint 2
Route CRITICAL severity to PagerDuty (pager); WARNING to Slack or email
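A routing table covering this hint might look like the sketch below; channel names other than "pagerduty" (which the task's routing override uses) are illustrative, as is the default fallback.

```python
# Hypothetical severity-to-channel defaults; only "pagerduty" is named in the task.
DEFAULT_ROUTES = {
    "CRITICAL": ["pagerduty"],
    "WARNING": ["slack", "email"],
}

def route(severity, overrides=None):
    """Pick notification channels, honoring a per-message routing override."""
    if overrides and overrides.get("channels"):
        return overrides["channels"]
    return DEFAULT_ROUTES.get(severity, ["email"])
```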
Hint 3
Grouping: alerts with the same fingerprint are merged into one notification
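A fingerprint can simply be the tuple of identifying fields; the exact field set below is an assumption, as is carrying a duplicate count on the merged notification:

```python
from collections import defaultdict

def fingerprint(alert):
    """Identity of an alert: same metric + service + rule merge together."""
    return (alert.get("metric"), alert.get("service"), alert.get("rule"))

class AlertGrouper:
    """Merge duplicate alerts into one pending notification per fingerprint."""

    def __init__(self):
        self.groups = defaultdict(list)

    def add(self, alert):
        self.groups[fingerprint(alert)].append(alert)

    def flush(self):
        """Emit one notification per group, carrying the duplicate count."""
        out = [{"alert": alerts[-1], "count": len(alerts)}
               for alerts in self.groups.values()]
        self.groups.clear()
        return out
```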
Hint 4
Resolution: fire alert_resolved when the metric returns below threshold
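Resolution only makes sense for an alert that is currently firing, so one sketch is to keep active alerts keyed by metric and pop them when the value clears (the `active_alerts` dict shape is an assumption):

```python
def maybe_resolve(active_alerts, metric, value, threshold):
    """Return an alert_resolved body if this metric's active alert has cleared.

    active_alerts: dict mapping metric name -> rule name of the firing alert.
    """
    if metric in active_alerts and value <= threshold:
        rule = active_alerts.pop(metric)  # alert is no longer active
        return {"type": "alert_resolved", "rule": rule,
                "resolution": "Value returned to normal"}
    return None
```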
Hint 5
Severity is determined by which threshold band the value falls in
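The task fixes only WARNING at the base threshold, so the band boundaries below are assumptions; the 2x multiplier for CRITICAL is illustrative.

```python
def severity_for(value, threshold):
    """Map a value to a severity band (sketch; 2x multiplier is an assumption)."""
    if value > 2 * threshold:
        return "CRITICAL"
    if value > threshold:
        return "WARNING"
    return None  # at or below threshold: no alert
```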
OVERVIEW

Theoretical Hub

Concept overview coming soon

Key Concepts

alert rules, threshold evaluation, alert routing, alert grouping, auto-resolution