πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Design a Rate Limiter

System Design ProblemsRate Limiting🟒 Free Lesson

Advertisement

System Design Problems

Design a Rate Limiter

A rate limiter controls the rate of requests a client can send to an API. It protects services from abuse, ensures fair resource usage, and prevents cascading failures. Systems like Cloudflare, AWS API Gateway, and Kong implement distributed rate limiting at massive scale.

  • Protection β€” Prevent DoS attacks and API abuse
  • Fairness β€” Ensure equitable resource allocation across clients
  • Throttling β€” Gracefully degrade service under overload

A rate limiter sits between the client and the server, deciding whether to allow or reject each request based on the client's recent request history.

Requirements

Functional Requirements

  • Limit requests per client per time window
  • Support different rate limits per API endpoint
  • Support different limits per client tier (free, pro, enterprise)
  • Return HTTP 429 (Too Many Requests) when limit exceeded
  • Provide rate limit headers in responses
  • Support distributed rate limiting across multiple servers

Non-Functional Requirements

  • Latency: Rate limit check in < 1ms
  • Accuracy: Allow slight over-limit (within 1%) rather than blocking valid requests
  • Availability: Rate limiter failure should fail open (allow requests)
  • Scalability: Handle millions of clients and thousands of API endpoints

Rate limiters can be deployed as a middleware in the API gateway, as a sidecar proxy, or as a separate service. The choice depends on the deployment architecture.

API Design

Architecture Diagram
GET /api/v1/rate-limit/status
Response: {
  "client_id": "user_123",
  "limit": 1000,
  "remaining": 742,
  "reset_at": "2026-06-20T11:00:00Z",
  "retry_after": 0
}

// Response headers
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1687267200
Retry-After: 30  // Seconds until next request allowed

Rate Limiting Algorithms

Token BucketBucket of TokensAdd tokens at fixed rateRemove 1 per requestBurst-friendlySliding Window LogLog of Request TimestampsCount in windowEvict old entriesMemory-intensiveFixed WindowCounter per WindowSimple implementationReset at boundaryBoundary spike issueSliding WindowCounterWeighted AverageSmooth boundaryLow memoryBest balance βœ“

Token Bucket

DfToken Bucket Algorithm

A token bucket has a fixed capacity of tokens. Tokens are added at a constant rate. Each request consumes one token. If the bucket is empty, the request is rejected. Allows bursts up to the bucket size.

Token Bucket

tokens=min⁑(capacity,tokens+rateΓ—Ξ”t)βˆ’costtokens = \min(capacity, tokens + rate \times \Delta t) - cost

Here,

  • capacitycapacity=Maximum tokens in the bucket
  • raterate=Token refill rate (tokens/second)
  • DeltatDelta t=Time since last request
  • costcost=Tokens consumed (usually 1)

Token Bucket in Action

Bucket capacity: 10 tokens, refill rate: 2 tokens/second

  • t=0: 10 tokens, request β†’ 9 tokens (allowed)
  • t=0.1: 9 tokens, request β†’ 8 tokens (allowed)
  • ... 8 rapid requests ...
  • t=0.8: 2 tokens, request β†’ 1 token (allowed)
  • t=0.9: 1 token, request β†’ 0 tokens (allowed)
  • t=1.0: 0 tokens, request β†’ REJECTED (429)
  • t=1.5: 1 token, request β†’ 0 tokens (allowed, refilled 1)

Sliding Window Counter

DfSliding Window Counter

The sliding window counter combines the current window count with a weighted portion of the previous window count to smooth boundary effects. This provides a more accurate count than fixed windows.

Sliding Window Count

count=prev_countΓ—overlap_ratio+current_countcount = prev\_count \times overlap\_ratio + current\_count

Here,

  • prevcountprev_count=Request count in previous window
  • currentcountcurrent_count=Request count in current window
  • overlapratiooverlap_ratio=Fraction of previous window still in the sliding window

Sliding Window Calculation

Window size: 1 minute, current time: 10:45:30

  • Previous window (10:44:00-10:45:00): 80 requests
  • Current window (10:45:00-10:46:00): 30 requests
  • Overlap ratio: 30/60 = 0.5

Sliding count = 80 Γ— 0.5 + 30 = 70 requests

With a limit of 100, the request is allowed.

Algorithm Comparison

AlgorithmMemoryAccuracyBurst HandlingComplexity
Token BucketO(1)GoodAllows burstsSimple
Sliding Window LogO(n)ExactNo burstsComplex
Fixed WindowO(1)Boundary issuesAllows burstsSimplest
Sliding Window CounterO(1)GoodSmoothModerate

For most use cases, the sliding window counter offers the best balance of accuracy, memory efficiency, and simplicity. Use token bucket when burst handling is important.

Distributed Rate Limiting

Single vs Distributed

For single-server rate limiting, an in-memory counter suffices. For distributed systems, use Redis:

ClientServer 1Server 2RedisAtomic CountersAPIDistributed rate limiting with Redis

Redis Rate Limit Check (Lua Script)

current=INCR(key);\nifcurrent==1:\nEXPIRE(key,window);\nifcurrent>limit:\nREJECT;\nelse:\nALLOWcurrent = INCR(key);\nif current == 1:\n EXPIRE(key, window);\nif current > limit:\n REJECT;\nelse:\n ALLOW

Here,

  • keykey=Redis key: rate:{client}:{window}
  • limitlimit=Maximum requests per window
  • windowwindow=Window duration in seconds

Use Redis Lua scripts for atomic rate limit checks. The INCR + EXPIRE pattern is atomic within a Lua script, preventing race conditions between multiple API servers.

Race Condition Handling

In distributed systems, race conditions can allow more requests than the limit:

Race Condition Window

over_limit=Nserversβˆ’1window_sizeΓ—limitover\_limit = \frac{N_{servers} - 1}{window\_size} \times limit

Here,

  • NserversN_{servers}=Number of API servers
  • windowsizewindow_size=Rate limit window
  • limitlimit=Request limit

Accept a small margin of over-limit (1-2%) rather than adding distributed locking. The simplicity gain outweighs the slight inaccuracy.

Practice Exercises

  1. Design: How would you implement rate limiting that accounts for different API endpoint costs (e.g., a search query costs 5 units, a simple GET costs 1 unit)?

  2. Distributed: If you have 10 API servers and a rate limit of 1000 requests/minute per client, what is the worst-case over-limit due to race conditions? How would you mitigate it?

  3. Trade-offs: Compare token bucket and sliding window counter for an API that needs to support occasional bursts (10x normal rate for 5 seconds).

  4. Edge Case: How would you handle rate limiting for a client that uses multiple IP addresses (e.g., behind a NAT)? Design a client identification strategy.

Key Takeaways:

  • Token bucket is best for burst-friendly rate limiting; sliding window counter is best for smooth limiting
  • Redis Lua scripts provide atomic rate limit checks in distributed systems
  • Sliding window counter smooths boundary effects of fixed windows with minimal memory
  • Accept slight over-limit (1-2%) rather than adding distributed locking complexity
  • Rate limiters should fail open (allow requests) when the limiter itself fails

What to Learn Next

-> Rate Limiting Deep dive into rate limiting algorithms and distributed implementation.

-> Load Balancing Distributing requests across multiple API servers.

-> API Design REST conventions, error handling, and HTTP status codes.

-> Caching Strategies Redis caching patterns for distributed counters.

-> Proxy and Reverse Proxy Rate limiting at the proxy layer (nginx, Envoy).

-> Microservices Service mesh rate limiting and circuit breaking.

⭐

Premium Content

Design a Rate Limiter

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert System Design Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement