Infrastructure

Rate Limiting

Rate limiting controls the rate of requests a client can send to a service. It protects against abuse, ensures fair resource usage, and maintains system stability under load.

Protection — Prevent overload and denial-of-service attacks
Fairness — Ensure equitable resource allocation
Stability — Maintain predictable performance under load

Rate limiting is the first line of defense for any production service.

Why Rate Limit?

Uncontrolled traffic can cascade failures through a system. Rate limiting provides a controlled degradation path.

DfRate Limiting

Rate limiting is a technique that controls the number of requests a client can make within a given time window. When the limit is exceeded, requests are rejected (HTTP 429), queued, or degraded. Rate limiting protects services from traffic spikes, ensures fair usage, and prevents cascading failures.

Types of Limits

Limit Type	Description	Example
Per-client	Limit per individual user/IP	100 requests/minute per user
Global	System-wide limit	10,000 requests/second total
Per-endpoint	Limit per API endpoint	10 writes/second per user
Tiered	Different limits per plan	Free: 100/min, Pro: 1000/min

Token Bucket

The most widely used rate limiting algorithm.

DfToken Bucket

The token bucket algorithm adds tokens at a fixed rate (r tokens/second) to a bucket with capacity C. Each request consumes one token. If the bucket is empty, the request is rejected. Bursts up to C are allowed, but sustained rate is limited to r.

Token Bucket

tokens(t) = \min(C, tokens(t_{last}) + r \times (t - t_{last}))

Here,

$C$ =Bucket capacity (max burst size)
$r$ =Token refill rate (tokens per second)
$tokens(t_{last})$ =Tokens remaining at last check
$t - t_{last}$ =Time elapsed since last check

Token Bucket Calculation

Bucket capacity C = 10 tokens, refill rate r = 2 tokens/second.

t=0: Bucket has 10 tokens
Request at t=0: 1 token consumed, 9 remaining
Request at t=0.5: 1 token consumed, 8 + 1 = 9 (refilled 1)
No requests until t=5: Bucket refills to 10 (capped)
12 requests at t=5: 10 accepted, 2 rejected (bucket empty)

Sliding Window Log

Tracks exact timestamps of each request in a sorted set.

DfSliding Window Log

The sliding window log stores timestamps of each request in a sorted set (e.g., Redis sorted set). On each new request, expired entries are removed, and if the count is below the limit, the request is allowed. Provides exact counting but requires O(n) memory per client.

Sliding Window Check

allowed = (count(requests \text{ in window}) < limit)

Here,

$requests$ =Set of request timestamps in current window
$limit$ =Maximum allowed requests in window

The sliding window log provides exact rate limiting but is memory-intensive. For high-traffic systems, consider the sliding window counter (hybrid approach) which approximates using fixed windows with interpolation.

Sliding Window Counter

A hybrid approach combining fixed windows with interpolation.

DfSliding Window Counter

The sliding window counter divides time into fixed windows (e.g., 1 minute) and counts requests in the current and previous window. The count is interpolated based on how far into the current window we are. This provides smooth limiting with O(1) memory per window.

Sliding Window Counter

count = prev\_count \times (1 - \frac{elapsed}{window}) + current\_count

Here,

$prev_count$ =Request count in previous window
$current_count$ =Request count in current window
$elapsed$ =Time elapsed in current window
$window$ =Window duration

Sliding Window Calculation

Window = 60 seconds, limit = 100 requests.

Previous window (0-60s): 80 requests
Current window (60-90s): 25 requests (30 seconds elapsed)

Count = 80 × (1 - 30/60) + 25 = 80 × 0.5 + 25 = 65

Since 65 < 100, the request is allowed.

Fixed Window

The simplest approach: count requests in fixed time periods.

DfFixed Window

The fixed window algorithm divides time into fixed periods (e.g., 1 minute). Each request increments a counter for the current window. When the window expires, the counter resets. Simple but prone to burst issues at window boundaries.

Fixed windows have a boundary problem: a client can make limit requests at the end of one window and limit at the start of the next, achieving 2× the intended rate in a short period. Use sliding window or token bucket for production systems.

Distributed Rate Limiting

Rate limiting across multiple service instances.

DfDistributed Rate Limiting

Distributed rate limiting coordinates limits across multiple service instances. Each instance checks a centralized store (e.g., Redis) before allowing a request. This ensures global limits are enforced regardless of which instance handles the request.

Distributed Challenges

Challenge	Solution
Race conditions	Use atomic operations (Redis INCR with EXPIRE)
Network latency	Local rate limiting with periodic sync
Redis failure	Fallback to local limiting or fail open
Clock skew	Use logical timestamps or server-side time

Use Redis Lua scripts for atomic distributed rate limiting. The script checks and increments the counter in a single atomic operation, preventing race conditions between instances.

Algorithm Comparison

Algorithm	Accuracy	Memory	Burst Handling	Complexity
Token Bucket	High	O(1)	Allows bursts up to C	Low
Sliding Window Log	Exact	O(n) per client	No bursts	Medium
Sliding Window Counter	Approximate	O(1)	Smooth	Low
Fixed Window	Low	O(1)	Boundary bursts	Very Low
Leaky Bucket	High	O(1)	Smooth output	Low

Practice Exercises

Design: Design a rate limiting system for an API gateway handling 1M requests/second with per-user limits of 100 requests/minute. Include Redis architecture and failover.
Analysis: Compare token bucket and sliding window counter for a video streaming service that allows 10-second bursts of high bitrate.
Distributed: Implement distributed rate limiting using Redis with a Lua script. Handle the case where Redis is unavailable.
Edge Case: A user has a rate limit of 10 requests/minute. They make 10 requests at 12:00:59 and 10 more at 12:01:01. How does each algorithm handle this?

Key Takeaways:

Rate limiting protects services from abuse and maintains stability
Token bucket allows bursts while enforcing sustained rate limits
Sliding window algorithms provide smoother limiting than fixed windows
Distributed rate limiting requires atomic operations and fault tolerance
Choose algorithms based on accuracy requirements, memory constraints, and burst tolerance
Always return HTTP 429 with Retry-After header when rate limiting

What to Learn Next

-> Proxy and Reverse Proxy Forward proxy, Nginx, HAProxy, and SSL termination.

-> API Design REST, GraphQL, gRPC, and API gateway patterns.

-> CDN Edge caching, DNS routing, and content distribution.

-> Load Balancing Distribution algorithms and L4 vs L7 load balancing.

-> Security Patterns Authentication, authorization, encryption, and mTLS.

-> Observability Logging, metrics, tracing, and monitoring.

Rate Limiting

Rate Limiting

Why Rate Limit?

DfRate Limiting

Types of Limits

Token Bucket

DfToken Bucket

Token Bucket

Token Bucket Calculation

Sliding Window Log

DfSliding Window Log

Sliding Window Check

Sliding Window Counter

DfSliding Window Counter

Sliding Window Counter

Sliding Window Calculation

Fixed Window

DfFixed Window

Distributed Rate Limiting

DfDistributed Rate Limiting

Distributed Challenges

Algorithm Comparison

Practice Exercises

What to Learn Next

Premium Content

Need Expert System Design Help?