Design Circuit Breaker
The circuit breaker pattern prevents cascade failures by detecting failures and short-circuiting requests to failing services. This design covers state machines, health monitoring, and fallback strategies.
- Problem — Cascade failures from slow/unresponsive services
- Solution — Short-circuit requests after failure threshold
- States — Closed, Open, Half-Open
Circuit breakers are the electrical fuses of software: they break the circuit before the damage spreads.
What Is a Circuit Breaker?
Definition: Circuit Breaker
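A circuit breaker monitors calls to a downstream service and counts the failures. When failures exceed a threshold, the circuit opens and requests fail fast without calling the service. After a timeout, the circuit enters the half-open state and lets a trial request through to test recovery: a success closes the circuit again, and a failure re-opens it.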
State Machine
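The three states and the transitions between them can be sketched as a small lookup table. This is only an illustration: the event names (`threshold_exceeded`, `timeout_elapsed`, `trial_succeeded`, `trial_failed`) and the `next_state` helper are assumptions made for this sketch, not part of any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"        # normal operation, calls pass through
    OPEN = "OPEN"            # calls fail fast without reaching the service
    HALF_OPEN = "HALF_OPEN"  # a trial call probes whether the service recovered


# (current state, event) -> next state. Event names are illustrative only.
TRANSITIONS = {
    (State.CLOSED, "threshold_exceeded"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "trial_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "trial_failed"): State.OPEN,
}


def next_state(state, event):
    """Return the state after `event`; unknown pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in one table makes it easy to see that the only way back to Closed is through a successful trial call in Half-Open.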
Implementation
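    import time


    class CircuitOpenError(Exception):
        """Raised when a call is rejected because the circuit is open."""


    class CircuitBreaker:
        def __init__(self, failure_threshold=5, recovery_timeout=30):
            self.state = "CLOSED"
            self.failure_count = 0
            self.failure_threshold = failure_threshold  # consecutive failures before opening
            self.recovery_timeout = recovery_timeout    # seconds to wait before a trial call
            self.last_failure_time = None

        def call(self, func, *args, **kwargs):
            if self.state == "OPEN":
                # time.monotonic() is unaffected by wall-clock adjustments
                if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                    self.state = "HALF_OPEN"  # let a trial call through
                else:
                    raise CircuitOpenError("Circuit is open")
            try:
                result = func(*args, **kwargs)
            except Exception:
                self._on_failure()
                raise
            self._on_success()
            return result

        def _on_success(self):
            # Any success closes the circuit and clears the failure streak
            self.state = "CLOSED"
            self.failure_count = 0

        def _on_failure(self):
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed trial call re-opens at once; otherwise open at the threshold
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"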
Configuration Parameters
Circuit Breaker Thresholds
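A typical rule opens the circuit when the failure rate measured over the time window reaches the threshold. The parameters are:
- Number of failures: failed calls counted in the current window
- Failure rate threshold: share of failed calls that trips the circuit (e.g., 50%)
- Time window: period over which calls are counted (e.g., 60 seconds)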
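One way these parameters might combine is a sliding time window that tracks the outcome of each call. The sketch below is hedged: `FailureRateWindow`, the `min_calls` guard (which avoids tripping on a handful of calls), and the injectable `clock` parameter are assumptions introduced here for illustration.

```python
import time
from collections import deque


class FailureRateWindow:
    """Tracks call outcomes over a sliding time window (hypothetical helper)."""

    def __init__(self, rate_threshold=0.5, window_seconds=60, min_calls=10,
                 clock=time.monotonic):
        self.rate_threshold = rate_threshold  # e.g. 0.5 means 50% of calls failed
        self.window_seconds = window_seconds  # e.g. 60 seconds
        self.min_calls = min_calls            # assumed guard against tiny samples
        self.clock = clock                    # injectable so tests can fake time
        self.outcomes = deque()               # (timestamp, succeeded), oldest first

    def record(self, succeeded):
        self.outcomes.append((self.clock(), succeeded))

    def _evict_expired(self):
        cutoff = self.clock() - self.window_seconds
        while self.outcomes and self.outcomes[0][0] <= cutoff:
            self.outcomes.popleft()

    def should_open(self):
        """True when enough calls were seen and the failure rate hits the threshold."""
        self._evict_expired()
        total = len(self.outcomes)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, ok in self.outcomes if not ok)
        return failures / total >= self.rate_threshold
```

A breaker could call `record()` after every request and consult `should_open()` instead of comparing a raw failure count against a fixed number.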
Fallback Strategies
Definition: Graceful Degradation
When the circuit is open, the system should degrade gracefully: (1) Return cached data, (2) Use default values, (3) Show partial results, (4) Queue for later processing. The goal is to maintain partial functionality.
A circuit breaker without a fallback is just a fancy way to return errors. Always pair circuit breakers with meaningful fallback strategies.
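As a rough illustration of the first two strategies (cached data, then a default value), the sketch below wraps a call that is assumed to be protected by a breaker. The function `get_recommendations`, the in-memory `_cache`, and the `"bestsellers"` default are all hypothetical; `CircuitOpenError` is redefined here only so the example is self-contained.

```python
class CircuitOpenError(Exception):
    """Stand-in for the error a breaker raises while it is open."""


_cache = {}  # last good response per user; stands in for a real cache


def get_recommendations(user_id, fetch):
    """Call `fetch` (assumed to be wrapped by a breaker) and degrade on rejection."""
    try:
        result = fetch(user_id)
    except CircuitOpenError:
        # Strategy 1: serve cached data; strategy 2: fall back to a default value
        return _cache.get(user_id, ["bestsellers"])
    _cache[user_id] = result  # remember the last good response for later fallbacks
    return result
```

This sketch only handles the fast-fail error. Whether other exceptions should also trigger the fallback depends on the service: a recommendation list can tolerate stale data, while a payment call usually cannot.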
Hystrix vs Resilience4j
| Feature | Hystrix | Resilience4j |
|---|---|---|
| Status | Maintenance mode | Active development |
| Thread Isolation | Thread pool (default) or semaphore | Semaphore or thread pool (via bulkhead) |
| Circuit Breaker | Yes | Yes |
| Rate Limiter | No | Yes |
| Bulkhead | Thread pool or semaphore | Semaphore or thread pool |
| Metrics | RxJava | Micrometer |
Practice Exercises
- Design: Implement a circuit breaker with exponential backoff for recovery timeout.
- Fallback: Design fallback strategies for: (a) Payment service, (b) Recommendation service, (c) Search service.
- Monitoring: How would you alert when circuit breakers are opening frequently?
- Testing: Design chaos engineering tests for circuit breaker validation.
Key Takeaways:
- Circuit breakers prevent cascade failures by short-circuiting requests
- Three states: Closed (normal), Open (fail fast), Half-Open (test recovery)
- Always pair circuit breakers with meaningful fallback strategies
- Configure thresholds based on failure rate, not just count
- Monitor circuit breaker state changes for operational visibility
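For the last takeaway, one possible starting point is to record every state transition and log it. The sketch below is an assumption-laden illustration: `StateChangeLog`, its `on_state_change` hook, and the logger name are hypothetical, and a breaker would need to call the hook whenever it changes state.

```python
import logging

logger = logging.getLogger("circuit_breaker")


class StateChangeLog:
    """Records breaker state transitions (hypothetical helper)."""

    def __init__(self):
        self.transitions = []  # (breaker name, old state, new state)

    def on_state_change(self, name, old_state, new_state):
        if old_state == new_state:
            return  # not a transition, nothing to record
        self.transitions.append((name, old_state, new_state))
        logger.warning("breaker %s: %s -> %s", name, old_state, new_state)

    def open_count(self, name):
        """Number of times the named breaker has transitioned to OPEN."""
        return sum(1 for n, _, new in self.transitions
                   if n == name and new == "OPEN")
```

A count of transitions to OPEN per breaker is a simple signal to export to a metrics system and alert on.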
What to Learn Next
- Back Pressure: Load management and flow control.
- Retry Patterns: Resilient retry with backoff.
- Sidecar Pattern: Service mesh circuit breaking.
- Saga Pattern: Distributed transaction resilience.
- Design Netflix: Chaos engineering and resilience.
- Design Uber: Service resilience at scale.