Design Retry Patterns
Retries are essential for handling transient failures, but naive retries can cause thundering herds and cascade failures. This design covers exponential backoff, jitter, and retry budgets.
- Problem: Transient failures require automatic recovery
- Solution: Smart retries with backoff and jitter
- Goal: Maximize success without overwhelming the system
Retries are a double-edged sword: they improve individual request success but can amplify system load.
Why Retry?
Most failures in distributed systems are transient: network blips, temporary overload, DNS timeouts. Retries handle these automatically. However, retries during sustained outages create thundering herds that make things worse.
Exponential Backoff
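Exponential Backoff Formula
$$\text{delay}(n) = \min\left(\text{base} \cdot 2^{n},\ \text{max}\right)$$
Here,
- $\text{base}$ = Initial delay (e.g., 100ms)
- $n$ = Retry attempt number, starting at 0 for the first retry
- $\text{max}$ = Maximum delay cap (e.g., 30s)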
Backoff Sequence
Base = 100ms, Max = 30s:
- Attempt 1: 100ms
- Attempt 2: 200ms
- Attempt 3: 400ms
- Attempt 4: 800ms
- Attempt 5: 1600ms
- Attempt 6: 3200ms
- Attempt 7: 6400ms
- Attempt 8: 12800ms
- Attempt 9: 25600ms
- Attempt 10: 30000ms (capped)
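The sequence above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `backoff_delays_ms` and its parameters are illustrative and not taken from any library.

```python
def backoff_delays_ms(base_ms=100, max_ms=30_000, attempts=10):
    """Return the capped exponential delay, in milliseconds, for each attempt."""
    # Attempt 1 uses exponent 0, so the first delay equals base_ms.
    return [min(base_ms * 2 ** n, max_ms) for n in range(attempts)]


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays_ms(), start=1):
        print(f"Attempt {attempt}: {delay}ms")
```

Without the cap, attempt 10 would wait 51200ms; the `min` keeps it at 30000ms.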
Adding Jitter
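Definition: Full Jitter
Without jitter, clients that fail together retry at the same intervals, creating synchronized bursts. Full jitter randomizes the delay: delay = random(0, min(base * 2^n, max)). Each client picks its own point inside the backoff window, which spreads retries across time instead of concentrating them at the end of the window.
Full Jitter Formula
$$\text{delay}(n) = \text{random}\left(0,\ \min\left(\text{base} \cdot 2^{n},\ \text{max}\right)\right)$$
Here,
- $\text{random}(0, x)$ = Uniform random between 0 and x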
Without jitter, exponential backoff causes "thundering herds" where all clients retry at the same time. Jitter is essential for distributed systems.
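To make the difference concrete, the sketch below compares plain capped backoff with full jitter for many clients on the same attempt. The function names and the seeded generator are assumptions made for the demo, not part of any library.

```python
import random


def full_jitter_delay(attempt, base=0.1, max_delay=30.0, rng=random):
    """Return a delay in seconds drawn uniformly from [0, capped backoff]."""
    ceiling = min(base * (2 ** attempt), max_delay)
    return rng.uniform(0, ceiling)


def no_jitter_delay(attempt, base=0.1, max_delay=30.0):
    """Return the plain capped exponential delay for comparison."""
    return min(base * (2 ** attempt), max_delay)


if __name__ == "__main__":
    rng = random.Random(42)  # Seeded only to make the demo repeatable.
    attempt = 3  # The backoff ceiling is 0.1 * 2**3 = 0.8 seconds.
    plain = {no_jitter_delay(attempt) for _ in range(1000)}
    jittered = {round(full_jitter_delay(attempt, rng=rng), 2) for _ in range(1000)}
    print(f"distinct retry times without jitter: {len(plain)}")
    print(f"distinct retry times with jitter (10ms buckets): {len(jittered)}")
```

Without jitter, all 1000 simulated clients land on a single retry time; with full jitter they scatter across the whole window.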
Retry Budget
Definition: Retry Budget
A retry budget limits the total number of retries per time window. If the retry budget is exhausted, new requests fail fast instead of retrying. This prevents retry storms during sustained outages.
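Retry Budget Formula
$$R_{\text{new}} + R_{\text{retry}} \le R_{\max}$$
Here,
- $R_{\text{new}}$ = New requests per second
- $R_{\text{retry}}$ = Retry requests per second
- $R_{\max}$ = Maximum total requests per second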
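One way to enforce such a budget is to count new requests and admitted retries over a sliding window, and to refuse a retry once their sum reaches the maximum. The sketch below assumes a single-process, in-memory counter. `RetryBudget`, `max_total`, `window_seconds`, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time
from collections import deque


class RetryBudget:
    """Admit retries only while new + retry traffic in the window stays under a cap."""

    def __init__(self, max_total=100, window_seconds=1.0, clock=time.monotonic):
        self.max_total = max_total
        self.window = window_seconds
        self.clock = clock
        self._new = deque()      # Timestamps of first-attempt requests.
        self._retries = deque()  # Timestamps of retries that were admitted.

    def _expire(self, now):
        # Drop events that have fallen out of the sliding window.
        for events in (self._new, self._retries):
            while events and now - events[0] >= self.window:
                events.popleft()

    def record_request(self):
        """Call once for every new (non-retry) request."""
        self._new.append(self.clock())

    def try_acquire_retry(self):
        """Return True if a retry may be sent now; False means fail fast."""
        now = self.clock()
        self._expire(now)
        if len(self._new) + len(self._retries) >= self.max_total:
            return False
        self._retries.append(now)
        return True
```

A caller would check `try_acquire_retry()` before each retry and return the error immediately when it is `False`. New requests are never blocked here; only retries yield when traffic is at the cap.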
Retry Strategies
| Strategy | Use Case | Risk |
|---|---|---|
| Immediate | Local operations | Thundering herd |
| Fixed Interval | Simple recovery | Synchronized bursts |
| Exponential Backoff | Network calls | Still synchronized |
| Exponential + Jitter | Distributed systems | Lowest of these (best practice), but retries still add load in sustained outages |
| Retry Budget | High-scale systems | Complex configuration |
Implementation
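import random
import time


class TransientError(Exception):
    """Placeholder for a retryable failure; substitute your client's exception type."""


def retry_with_backoff(func, max_retries=5, base_delay=0.1, max_delay=30):
    # max_retries counts total attempts, including the first call.
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Exponential backoff, capped at max_delay (seconds).
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Full jitter: sleep a uniform random time in [0, delay].
            jittered = random.uniform(0, delay)
            time.sleep(jittered)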
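A retry helper like the one above is easier to exercise if the sleep function and random source can be swapped out. The following is a self-contained sketch of that idea with a simulated flaky call. `retry_call`, `make_flaky`, and the `sleep` and `rng` parameters are hypothetical additions, not part of the original function.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_call(func, max_attempts=5, base_delay=0.1, max_delay=30.0,
               sleep=time.sleep, rng=random):
    """Call func, retrying on TransientError with capped backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Attempts exhausted: let the caller see the failure.
            ceiling = min(base_delay * (2 ** attempt), max_delay)
            sleep(rng.uniform(0, ceiling))


def make_flaky(failures):
    """Return a function that raises TransientError `failures` times, then succeeds."""
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError(f"simulated failure {state['calls']}")
        return f"ok after {state['calls']} calls"

    return flaky


if __name__ == "__main__":
    slept = []  # Record delays instead of actually sleeping.
    result = retry_call(make_flaky(2), sleep=slept.append)
    print(result, [round(s, 3) for s in slept])
```

Passing `slept.append` as the sleep function lets a test inspect the chosen delays without waiting for them.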
Practice Exercises
- Design: Implement a retry mechanism with exponential backoff and full jitter.
- Budget: Design a retry budget that limits retries to 20% of total traffic.
- Idempotency: How do retries interact with idempotency? Design a system that handles both.
- Monitoring: Design a dashboard that shows retry rates, success rates, and latency impact.
Key Takeaways:
- Exponential backoff spaces retries further apart after each failure, giving a struggling service time to recover
- Jitter prevents thundering herds by randomizing retry timing
- Retry budgets limit total retry load during sustained outages
- Always pair retries with idempotency for safe repetition
- Monitor retry rates to detect systemic issues vs transient failures
What to Learn Next
- Circuit Breaker: Preventing cascade failures.
- Back Pressure: Load management.
- Idempotency: Safe retry semantics.
- Saga Pattern: Retry in distributed transactions.
- Sidecar Pattern: Service mesh retry handling.
- Design Netflix: Resilient microservices.