System Design Problems

Design a Distributed Cache

A distributed cache stores frequently accessed data in memory across a cluster of servers, providing sub-millisecond read latency. Redis Cluster and Memcached are the most widely used distributed caches, forming the performance backbone of nearly every large-scale system.

Sub-millisecond Latency — Memory-based storage with no disk I/O
Horizontal Scalability — Distribute data across thousands of nodes
High Availability — Replication and automatic failover

The fundamental trade-off in caching is between cache hit ratio (accuracy) and memory cost. A 99% cache hit ratio means 99% of requests never touch the database.

Requirements

Functional Requirements

put(key, value, ttl): Store key-value pair with optional TTL
get(key): Retrieve value by key
delete(key): Remove key from cache
Support for different data types (strings, hashes, lists, sets)
Cache eviction policies (LRU, LFU, TTL-based)
Cluster mode with automatic data distribution

Non-Functional Requirements

Latency: Read/write in < 1ms (p99)
Throughput: 1M operations/second per node
Availability: 99.99% (cache failure should not crash the system)
Scalability: 100 TB total cache size across cluster
Durability: Optional persistence (RDB/AOF for Redis)

Distributed caches are typically used as a cache-aside layer: the application checks the cache first, falls back to the database on miss, and populates the cache with the result.

Back-of-the-Envelope Estimation

Distributed Cache Capacity

100 TB total cache, 100 nodes = 1 TB per node
1M QPS per node × 100 nodes = 100M QPS cluster-wide
Average key size: 50 bytes, value: 1 KB
Keys per node: 1 TB / 1.05 KB ≈ 1 billion keys/node
Total cluster: 100 billion keys

Data Distribution

Consistent Hashing

DfConsistent Hashing for Caching

Consistent hashing maps both keys and cache nodes to positions on a hash ring. A key is assigned to the first node found clockwise. When a node is added/removed, only keys between adjacent nodes are redistributed.

Use 150-200 virtual nodes per physical node to ensure even distribution. Without virtual nodes, some nodes may receive 2-3x more keys than others.

Replication

DfCache Replication

Each key is replicated to N-1 additional nodes for fault tolerance. If a node fails, replicas serve cached data without database fallback.

Replication Factor

replicas = RF - 1

Here,

$RF$ =Replication factor (typically 3)

With RF=3, each key exists on 3 nodes. If one node fails, the remaining 2 replicas serve reads. When the failed node recovers, it catches up via anti-entropy (Merkle tree comparison).

Cache Architecture

Redis Cluster Slot Management

DfHash Slots

Redis Cluster divides the key space into 16384 hash slots. Each node is responsible for a subset of slots. Keys are assigned to slots using CRC16(key) % 16384.

Slot Assignment

slot = CRC16(key) \mod 16384

Here,

$CRC16$ =16-bit cyclic redundancy check hash
$16384$ =Total number of hash slots

Slot Distribution

3 nodes with 16384 slots:

Node A: slots 0-5460 (5461 slots)
Node B: slots 5461-10922 (5462 slots)
Node C: slots 10923-16383 (5461 slots)

When adding Node D, rebalance by migrating slots from A, B, C.

Eviction Policies

When memory is full, evict keys based on policy:

Policy	Description	Use Case
LRU	Evict least recently used	General purpose
LFU	Evict least frequently used	Access pattern varies
TTL	Evict expired keys	Time-sensitive data
Random	Evict random keys	Simple, no tracking
No-eviction	Return error when full	Critical data only

Redis supports allkeys-lru (evict any LRU key) and volatile-lru (evict only keys with TTL). Use volatile-lru when some keys must never be evicted.

Cache Patterns

DfCache-Aside Pattern

Application checks cache for key
Cache hit → return value
Cache miss → query database, write to cache, return value
On write: update database, invalidate cache

Cache Hit Ratio Impact

effective\_latency = hit\_ratio \times latency_{cache} + (1 - hit\_ratio) \times latency_{db}

Here,

$hit_ratio$ =Fraction of reads served from cache
$latency_{cache}$ =Cache read latency (typically 0.5ms)
$latency_{db}$ =Database read latency (typically 5ms)

Cache Efficiency

With 99% hit ratio: effective_latency = 0.99 × 0.5ms + 0.01 × 5ms = 0.495 + 0.05 = 0.545ms

Without cache: 5ms Improvement: 9.2x faster

Practice Exercises

Design: How would you handle cache stampede (thundering herd) when a popular key expires? Design a mechanism to prevent all requests from hitting the database simultaneously.
Scale: If the cluster has 100 nodes and 100 TB of data, estimate the memory overhead for consistent hashing metadata and replication information.
Consistency: Design a cache invalidation strategy that ensures eventual consistency between the cache and database. What happens during network partitions?
Recovery: When a failed node recovers, how do you synchronize its data with the current cluster state? Compare Merkle tree anti-entropy vs. full sync.

Key Takeaways:

Consistent hashing distributes keys across nodes with minimal redistribution on scaling
Redis Cluster uses 16384 hash slots; each node owns a subset
Replication (RF=3) provides fault tolerance without database fallback
Cache-aside with high hit ratios (>99%) reduces database latency by 10x+
Virtual nodes (150-200 per physical node) ensure even distribution

What to Learn Next

-> Caching Strategies Cache-aside, write-through, and cache invalidation.

-> Consistent Hashing Deep dive into hash rings and virtual nodes.

-> Design Key-Value Store Distributed KV store with replication and consistency.

-> Data Replication Leader-follower replication and conflict resolution.

-> Load Balancing Distributing cache requests across cluster nodes.

-> Databases Cache invalidation strategies for database-backed caches.

Design a Distributed Cache

Design a Distributed Cache

Requirements

Functional Requirements

Non-Functional Requirements

Back-of-the-Envelope Estimation

Distributed Cache Capacity

Data Distribution

Consistent Hashing

DfConsistent Hashing for Caching

Replication

DfCache Replication

Replication Factor

Cache Architecture

Redis Cluster Slot Management

DfHash Slots

Slot Assignment

Slot Distribution

Eviction Policies

Cache Patterns

DfCache-Aside Pattern

Cache Hit Ratio Impact

Cache Efficiency

Practice Exercises

What to Learn Next

Premium Content

Need Expert System Design Help?