Add Health Checks and Failover - Load Balancers

TASK

Implementation

Add health checking to your load balancer:

Periodically send health check requests to servers
Track consecutive failures per server
Mark server unhealthy after N failures
Exclude unhealthy servers from selection
Re-add servers after successful health checks

Support both active (probing) and passive (observing failures) checks.
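
The steps above can be sketched as a small tracker. This is a minimal sketch under stated assumptions, not the reference solution: the class name `HealthTracker` is hypothetical, and the default threshold of 3 simply mirrors the sample test case.

```python
import threading


class HealthTracker:
    """Tracks consecutive failures per server (hypothetical helper)."""

    def __init__(self, servers, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = {s: 0 for s in servers}
        self.healthy = {s: True for s in servers}
        self.lock = threading.Lock()

    def mark_failure(self, server):
        # Count the failure; reaching the threshold takes the server
        # out of rotation.
        with self.lock:
            self.failures[server] += 1
            if self.failures[server] >= self.failure_threshold:
                self.healthy[server] = False

    def mark_success(self, server):
        # A single success resets the streak and re-adds the server.
        with self.lock:
            self.failures[server] = 0
            self.healthy[server] = True

    def get_healthy_servers(self):
        # Only servers currently marked healthy are eligible for selection.
        with self.lock:
            return [s for s, ok in self.healthy.items() if ok]


tracker = HealthTracker(["s1", "s2", "s3"])
for _ in range(3):
    tracker.mark_failure("s1")
print(tracker.get_healthy_servers())  # prints ['s2', 's3']
```

Both active probes and passive observations can feed the same `mark_failure` / `mark_success` pair, so the selection logic does not need to know which kind of check produced a result.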

Sample Test Cases

Exclude unhealthy server (timeout: 5000ms)

Input

{"src":"c0","dest":"lb","body":{"type":"init","msg_id":1,"node_id":"lb","node_ids":["lb","s1","s2","s3"]}}
{"src":"c1","dest":"lb","body":{"type":"health_status","msg_id":2,"server":"s1","consecutive_failures":5,"threshold":3}}
{"src":"c2","dest":"lb","body":{"type":"get_healthy_servers","msg_id":3}}

Expected Output

{"src":"lb","dest":"c0","body":{"type":"init_ok","in_reply_to":1,"msg_id":0}}
{"src":"lb","dest":"c1","body":{"type":"health_status_ok","in_reply_to":2,"msg_id":1,"status":"unhealthy"}}
{"src":"lb","dest":"c2","body":{"type":"get_healthy_servers_ok","in_reply_to":3,"msg_id":2,"healthy":["s2","s3"]}}
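
One way to produce the exchange above is a handler that builds each reply from the request type. This is a hedged sketch: the class name `LoadBalancerNode` is hypothetical, the `"healthy"` status string is an assumption (the sample only shows `"unhealthy"`), and treating every node other than the balancer as a backend server is inferred from the sample, not stated by the task.

```python
import json


class LoadBalancerNode:
    """Hypothetical handler for the sample message exchange."""

    def __init__(self):
        self.node_id = None
        self.servers = []
        self.unhealthy = set()
        self.next_msg_id = 0

    def handle(self, msg):
        body = msg["body"]
        kind = body["type"]
        reply = {
            "type": kind + "_ok",
            "in_reply_to": body["msg_id"],
            "msg_id": self.next_msg_id,
        }
        self.next_msg_id += 1
        if kind == "init":
            self.node_id = body["node_id"]
            # Assumption: all other nodes are backend servers.
            self.servers = [n for n in body["node_ids"] if n != self.node_id]
        elif kind == "health_status":
            # Assumption: reaching the threshold counts as unhealthy.
            failed = body["consecutive_failures"] >= body["threshold"]
            if failed:
                self.unhealthy.add(body["server"])
            else:
                self.unhealthy.discard(body["server"])
            reply["status"] = "unhealthy" if failed else "healthy"
        elif kind == "get_healthy_servers":
            reply["healthy"] = [s for s in self.servers if s not in self.unhealthy]
        return {"src": self.node_id, "dest": msg["src"], "body": reply}


SAMPLE_INPUT = [
    '{"src":"c0","dest":"lb","body":{"type":"init","msg_id":1,"node_id":"lb","node_ids":["lb","s1","s2","s3"]}}',
    '{"src":"c1","dest":"lb","body":{"type":"health_status","msg_id":2,"server":"s1","consecutive_failures":5,"threshold":3}}',
    '{"src":"c2","dest":"lb","body":{"type":"get_healthy_servers","msg_id":3}}',
]

node = LoadBalancerNode()
for line in SAMPLE_INPUT:
    # Compact separators match the formatting of the expected output.
    print(json.dumps(node.handle(json.loads(line)), separators=(",", ":")))
```

In a real submission the same `handle` method would be driven by lines read from standard input instead of the hard-coded sample list.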

Hints

Hint 1

Periodically probe each server

Hint 2

Mark unhealthy after consecutive failures

Hint 3

Remove from rotation until healthy

OVERVIEW

Theoretical Hub

Health Checking

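Health checks detect failed servers so the load balancer can stop routing requests to them. Active checks send periodic probes (for example, HTTP GET /health) and catch failures even when no traffic is flowing. Passive checks observe the outcome of real requests and add no extra load, but they only notice a failure after a client request has already hit it.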
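
The two styles of check can be sketched side by side. The function names (`http_probe`, `active_round`, `passive_call`), the 2-second timeout, and the rule that only a 2xx reply counts as healthy are all assumptions made for illustration; the `on_success` and `on_failure` callbacks stand in for whatever failure tracking the balancer uses.

```python
import urllib.request


def http_probe(base_url, timeout=2.0):
    """Active check: GET <base_url>/health, healthy only on a 2xx reply."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib raises HTTPError, an OSError subclass, for 4xx/5xx).
        return False


def active_round(servers, probe, on_success, on_failure):
    """Probe every server once and report each result."""
    for server in servers:
        if probe(server):
            on_success(server)
        else:
            on_failure(server)


def passive_call(server, send, on_success, on_failure):
    """Passive check: forward a real request and record how it went."""
    try:
        result = send(server)
    except Exception:
        on_failure(server)
        raise  # the caller still sees the error and may retry elsewhere
    on_success(server)
    return result
```

A background thread could call `active_round` once per check interval, while the request path wraps each forwarded request in `passive_call`.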

Graceful Degradation

When servers fail, the load balancer redistributes their traffic across the remaining healthy servers. Reintroducing a recovered server slowly, by ramping its share of traffic up over time, avoids overwhelming it while it warms up.
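
Slow re-introduction can be sketched as a weight that grows with the time since a server recovered. The linear ramp, the 30-second warm-up window, the 0.1 starting weight, and the function names are all assumptions chosen for illustration; the exercise itself does not prescribe a ramp policy.

```python
import random


def ramp_weight(seconds_since_recovery, warmup_seconds=30.0, floor=0.1):
    """Relative weight for a server: linear ramp from `floor` up to 1.0."""
    if seconds_since_recovery >= warmup_seconds:
        return 1.0
    fraction = max(seconds_since_recovery, 0.0) / warmup_seconds
    return floor + (1.0 - floor) * fraction


def pick_server(recovered_for, rng=random):
    """Weighted random pick; `recovered_for` maps server -> seconds healthy."""
    servers = list(recovered_for)
    weights = [ramp_weight(recovered_for[s]) for s in servers]
    return rng.choices(servers, weights=weights, k=1)[0]


# s1 has just recovered, so it receives far less traffic than s2 and s3.
print(pick_server({"s1": 0.0, "s2": 600.0, "s3": 600.0}))
```

With these numbers a freshly recovered server starts at one tenth of the weight of a fully warmed-up one and reaches full weight after 30 seconds.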

Key Concepts

health check, failover, liveness

main.py

python

#!/usr/bin/env python3
import sys
import json
import threading
import time


class HealthCheckLB:
    def __init__(self, servers, check_interval=10, failure_threshold=3):
        self.servers = servers
        self.check_interval = check_interval
        self.failure_threshold = failure_threshold
        self.healthy = {s: True for s in servers}
        self.failures = {s: 0 for s in servers}
        self.lock = threading.Lock()
    
    def _health_check(self, server):
        # TODO: Check if server is healthy
        pass
    
    def _health_check_loop(self):
        # TODO: Periodically check all servers
        pass
    
    def mark_failure(self, server):
        # TODO: Increment failures, mark unhealthy if threshold
        pass
    
    def mark_success(self, server):
        # TODO: Reset failures, mark healthy
        pass
    
    def get_healthy_servers(self):
        # TODO: Return list of healthy servers
        pass


if __name__ == "__main__":
    pass