ARCHIVED from builddistributedsystem.com on 2026-04-28 — URL: https://builddistributedsystem.com/tracks/loadbalancers/tasks/task-20-3-5-thundering-herd
TASK

Implementation

The thundering herd problem occurs when a large number of clients simultaneously retry after a backend failure, overwhelming the remaining backends and causing cascading failures.

The problem:

Normal state:
  backend-1: 1000 req/s
  backend-2: 1000 req/s
  backend-3: 1000 req/s
  backend-4: 1000 req/s

backend-4 fails:
  Its 1000 req/s of traffic fails, and clients retry immediately
  With no backoff, rapid-fire retries amplify this to ~4000 req/s
  backend-1, 2, 3 now receive ~2333 req/s each (overload!)
  They fail under the extra load
  All backends down = outage
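The arithmetic above can be checked with a short sketch. All numbers are the hypothetical ones from this example, and `retry_amplification` is our own simplifying assumption: it models clients retrying each failed request several times in quick succession.

```python
def load_after_failure(total_rps, healthy, retry_amplification):
    """Per-surviving-backend load after one backend fails.

    total_rps: aggregate request rate before the failure (4000 here)
    healthy: number of surviving backends (3 here)
    retry_amplification: how many requests each failed request turns into
    when clients retry immediately; with no backoff this multiplies the
    displaced load.
    """
    per_backend_before = total_rps / (healthy + 1)         # 1000 req/s
    surviving_load = total_rps - per_backend_before        # 3000 req/s
    retry_load = per_backend_before * retry_amplification  # displaced load
    return (surviving_load + retry_load) / healthy

# Immediate retries (~4x amplification): each survivor sees ~2333 req/s.
print(round(load_after_failure(4000, 3, 4)))  # 2333
# With circuit breaking + backoff (no amplification): ~1333 req/s.
print(round(load_after_failure(4000, 3, 1)))  # 1333
```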

Circuit breaking + exponential backoff:

Circuit breaker:
  - Detect backend-4 failure rate > 50%
  - Open circuit: stop sending requests to backend-4
  - Traffic redistributes gradually (not all at once)
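A minimal failure-rate circuit breaker can look like the following sketch. The class and parameter names are our own, and the sliding-window bookkeeping is deliberately simplified; a production breaker would also track a half-open probe budget.

```python
import time

class CircuitBreaker:
    """Illustrative failure-rate circuit breaker.

    Opens when the failure rate over the last `window` calls exceeds
    `threshold`; after `cooldown_s` it half-opens and lets one probe
    request through. A successful probe closes the circuit again.
    """
    def __init__(self, threshold=0.5, window=20, cooldown_s=5.0):
        self.threshold = threshold
        self.window = window
        self.cooldown_s = cooldown_s
        self.results = []      # recent outcomes: True = success
        self.opened_at = None  # None while the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True        # half-open: allow a probe request
        return False           # open: shed load away from this backend

    def record(self, success):
        self.results = (self.results + [success])[-self.window:]
        failures = self.results.count(False)
        if len(self.results) >= self.window and \
                failures / len(self.results) > self.threshold:
            self.opened_at = time.monotonic()  # trip: open the circuit
        elif success and self.opened_at is not None:
            self.opened_at = None              # probe succeeded: close
            self.results = []
```

The load balancer keeps one breaker per backend; a backend whose breaker is open is simply skipped during backend selection, so its traffic redistributes without hammering it with doomed requests.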

Exponential backoff (client-side):
  - Retry 1: 100ms delay
  - Retry 2: 200ms delay
  - Retry 3: 400ms delay
  - Retry 4: 800ms delay
  - Spreads out retry load over time
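The schedule above is a doubling delay starting at 100 ms. A sketch, with "full jitter" (a uniform pick in [0, computed delay]) added so clients don't retry in lockstep; the constants mirror the table and are illustrative:

```python
import random

def backoff_delay_ms(attempt, base_ms=100, cap_ms=10_000, jitter=True):
    """Delay before retry `attempt` (1-based): base * 2^(attempt-1), capped.

    With jitter, returns a uniform value in [0, delay] so simultaneous
    clients spread their retries instead of synchronizing.
    """
    delay = min(cap_ms, base_ms * 2 ** (attempt - 1))
    return random.uniform(0, delay) if jitter else delay

# Deterministic schedule matches the table above: 100, 200, 400, 800 ms.
print([backoff_delay_ms(n, jitter=False) for n in range(1, 5)])
```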

Result:
  - backend-1, 2, 3 handle 1333 req/s each (manageable)
  - No cascading failure

Example thundering herd simulation:

Request:  {"type": "simulate_thundering_herd", "msg_id": 1, "backends": ["b1", "b2", "b3", "b4"], "requests_per_second": 4000, "fail_backend": "b4", "duration_ms": 10000, "with_circuit_breaker": false}
Response: {"type": "simulation_complete", "in_reply_to": 1, "results": {"without_circuit_breaker": {"cascading_failures": true, "backends_down": ["b1", "b2", "b3"], "p99_latency_ms": 5000, "error_rate": 1.0}}}

Request:  {"type": "simulate_thundering_herd", "msg_id": 2, "backends": ["b1", "b2", "b3", "b4"], "requests_per_second": 4000, "fail_backend": "b4", "duration_ms": 10000, "with_circuit_breaker": true, "with_exponential_backoff": true}
Response: {"type": "simulation_complete", "in_reply_to": 2, "results": {"with_protections": {"cascading_failures": false, "backends_down": ["b4"], "p99_latency_ms": 150, "error_rate": 0.25}}}

Metrics to track:

  • Requests per second (per backend)
  • Error rate (5xx responses)
  • Latency percentiles (p50, p99, p99.9)
  • Circuit breaker state transitions
  • Retry attempt distribution
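For the latency percentiles, a nearest-rank computation over collected samples is enough for this simulation. The sample values below are made up for illustration:

```python
def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of a list of latencies (ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies = [12, 15, 11, 300, 14, 13, 16, 12, 11, 5000]  # made-up samples
print(percentile(latencies, 50), percentile(latencies, 99))  # 13 5000
```

Comparing p99 before and after the injected failure is what distinguishes the two simulation runs: with protections it stays near baseline, without them it blows up toward the timeout.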

Sample Test Cases

Thundering herd without protections
Timeout: 15000ms
Input
{
  "src": "client",
  "dest": "simulator",
  "body": {
    "type": "simulate_thundering_herd",
    "msg_id": 1,
    "backends": [
      "b1",
      "b2",
      "b3",
      "b4"
    ],
    "requests_per_second": 4000,
    "fail_backend": "b4",
    "with_circuit_breaker": false
  }
}
Expected Output
{"src": "simulator", "dest": "client", "body": {"type": "simulation_complete", "in_reply_to": 1, "results": {"cascading_failures": true, "all_backends_failed": true, "error_rate": 1.0}}}
Circuit breaking prevents cascade
Timeout: 15000ms
Input
{
  "src": "client",
  "dest": "simulator",
  "body": {
    "type": "simulate_thundering_herd",
    "msg_id": 1,
    "backends": [
      "b1",
      "b2",
      "b3",
      "b4"
    ],
    "requests_per_second": 4000,
    "fail_backend": "b4",
    "with_circuit_breaker": true
  }
}
Expected Output
{"src": "simulator", "dest": "client", "body": {"type": "simulation_complete", "in_reply_to": 1, "results": {"cascading_failures": false, "failed_backends": ["b4"], "error_rate": 0.25}}}

Hints

Hint 1
Simulate: backend-4 fails, 10,000 connections suddenly redistribute to remaining backends
Hint 2
Without circuit breaking: remaining backends overload and fail
Hint 3
With circuit breaking: failed backend is removed from pool immediately
Hint 4
With exponential backoff: clients retry with increasing delays
Hint 5
Measure: requests/sec, error rate, latency p99 before and after failure
OVERVIEW

Theoretical Hub

Concept overview coming soon

Key Concepts

  • thundering herd
  • cascading failures
  • exponential backoff
  • graceful degradation
  • circuit breaking