Infrastructure
Rate Limiting
Rate limiting controls the rate of requests a client can send to a service. It protects against abuse, ensures fair resource usage, and maintains system stability under load.
- Protection β Prevent overload and denial-of-service attacks
- Fairness β Ensure equitable resource allocation
- Stability β Maintain predictable performance under load
Rate limiting is the first line of defense for any production service.
Why Rate Limit?
Uncontrolled traffic can cascade failures through a system. Rate limiting provides a controlled degradation path.
DfRate Limiting
Rate limiting is a technique that controls the number of requests a client can make within a given time window. When the limit is exceeded, requests are rejected (HTTP 429), queued, or degraded. Rate limiting protects services from traffic spikes, ensures fair usage, and prevents cascading failures.
Types of Limits
| Limit Type | Description | Example |
|---|---|---|
| Per-client | Limit per individual user/IP | 100 requests/minute per user |
| Global | System-wide limit | 10,000 requests/second total |
| Per-endpoint | Limit per API endpoint | 10 writes/second per user |
| Tiered | Different limits per plan | Free: 100/min, Pro: 1000/min |
Token Bucket
The most widely used rate limiting algorithm.
DfToken Bucket
The token bucket algorithm adds tokens at a fixed rate (r tokens/second) to a bucket with capacity C. Each request consumes one token. If the bucket is empty, the request is rejected. Bursts up to C are allowed, but sustained rate is limited to r.
Token Bucket
Here,
- =Bucket capacity (max burst size)
- =Token refill rate (tokens per second)
- =Tokens remaining at last check
- =Time elapsed since last check
Token Bucket Calculation
Bucket capacity C = 10 tokens, refill rate r = 2 tokens/second.
- t=0: Bucket has 10 tokens
- Request at t=0: 1 token consumed, 9 remaining
- Request at t=0.5: 1 token consumed, 8 + 1 = 9 (refilled 1)
- No requests until t=5: Bucket refills to 10 (capped)
- 12 requests at t=5: 10 accepted, 2 rejected (bucket empty)
Sliding Window Log
Tracks exact timestamps of each request in a sorted set.
DfSliding Window Log
The sliding window log stores timestamps of each request in a sorted set (e.g., Redis sorted set). On each new request, expired entries are removed, and if the count is below the limit, the request is allowed. Provides exact counting but requires O(n) memory per client.
Sliding Window Check
Here,
- =Set of request timestamps in current window
- =Maximum allowed requests in window
The sliding window log provides exact rate limiting but is memory-intensive. For high-traffic systems, consider the sliding window counter (hybrid approach) which approximates using fixed windows with interpolation.
Sliding Window Counter
A hybrid approach combining fixed windows with interpolation.
DfSliding Window Counter
The sliding window counter divides time into fixed windows (e.g., 1 minute) and counts requests in the current and previous window. The count is interpolated based on how far into the current window we are. This provides smooth limiting with O(1) memory per window.
Sliding Window Counter
Here,
- =Request count in previous window
- =Request count in current window
- =Time elapsed in current window
- =Window duration
Sliding Window Calculation
Window = 60 seconds, limit = 100 requests.
- Previous window (0-60s): 80 requests
- Current window (60-90s): 25 requests (30 seconds elapsed)
Count = 80 Γ (1 - 30/60) + 25 = 80 Γ 0.5 + 25 = 65
Since 65 < 100, the request is allowed.
Fixed Window
The simplest approach: count requests in fixed time periods.
DfFixed Window
The fixed window algorithm divides time into fixed periods (e.g., 1 minute). Each request increments a counter for the current window. When the window expires, the counter resets. Simple but prone to burst issues at window boundaries.
Fixed windows have a boundary problem: a client can make limit requests at the end of one window and limit at the start of the next, achieving 2Γ the intended rate in a short period. Use sliding window or token bucket for production systems.
Distributed Rate Limiting
Rate limiting across multiple service instances.
DfDistributed Rate Limiting
Distributed rate limiting coordinates limits across multiple service instances. Each instance checks a centralized store (e.g., Redis) before allowing a request. This ensures global limits are enforced regardless of which instance handles the request.
Distributed Challenges
| Challenge | Solution |
|---|---|
| Race conditions | Use atomic operations (Redis INCR with EXPIRE) |
| Network latency | Local rate limiting with periodic sync |
| Redis failure | Fallback to local limiting or fail open |
| Clock skew | Use logical timestamps or server-side time |
Use Redis Lua scripts for atomic distributed rate limiting. The script checks and increments the counter in a single atomic operation, preventing race conditions between instances.
Algorithm Comparison
| Algorithm | Accuracy | Memory | Burst Handling | Complexity |
|---|---|---|---|---|
| Token Bucket | High | O(1) | Allows bursts up to C | Low |
| Sliding Window Log | Exact | O(n) per client | No bursts | Medium |
| Sliding Window Counter | Approximate | O(1) | Smooth | Low |
| Fixed Window | Low | O(1) | Boundary bursts | Very Low |
| Leaky Bucket | High | O(1) | Smooth output | Low |
Practice Exercises
-
Design: Design a rate limiting system for an API gateway handling 1M requests/second with per-user limits of 100 requests/minute. Include Redis architecture and failover.
-
Analysis: Compare token bucket and sliding window counter for a video streaming service that allows 10-second bursts of high bitrate.
-
Distributed: Implement distributed rate limiting using Redis with a Lua script. Handle the case where Redis is unavailable.
-
Edge Case: A user has a rate limit of 10 requests/minute. They make 10 requests at 12:00:59 and 10 more at 12:01:01. How does each algorithm handle this?
Key Takeaways:
- Rate limiting protects services from abuse and maintains stability
- Token bucket allows bursts while enforcing sustained rate limits
- Sliding window algorithms provide smoother limiting than fixed windows
- Distributed rate limiting requires atomic operations and fault tolerance
- Choose algorithms based on accuracy requirements, memory constraints, and burst tolerance
- Always return HTTP 429 with Retry-After header when rate limiting
What to Learn Next
-> Proxy and Reverse Proxy Forward proxy, Nginx, HAProxy, and SSL termination.
-> API Design REST, GraphQL, gRPC, and API gateway patterns.
-> CDN Edge caching, DNS routing, and content distribution.
-> Load Balancing Distribution algorithms and L4 vs L7 load balancing.
-> Security Patterns Authentication, authorization, encryption, and mTLS.
-> Observability Logging, metrics, tracing, and monitoring.