πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Design a Distributed Cache

System Design ProblemsCaching Infrastructure🟒 Free Lesson

Advertisement

System Design Problems

Design a Distributed Cache

A distributed cache stores frequently accessed data in memory across a cluster of servers, providing sub-millisecond read latency. Redis Cluster and Memcached are the most widely used distributed caches, forming the performance backbone of nearly every large-scale system.

  • Sub-millisecond Latency β€” Memory-based storage with no disk I/O
  • Horizontal Scalability β€” Distribute data across thousands of nodes
  • High Availability β€” Replication and automatic failover

The fundamental trade-off in caching is between cache hit ratio (accuracy) and memory cost. A 99% cache hit ratio means 99% of requests never touch the database.

Requirements

Functional Requirements

  • put(key, value, ttl): Store key-value pair with optional TTL
  • get(key): Retrieve value by key
  • delete(key): Remove key from cache
  • Support for different data types (strings, hashes, lists, sets)
  • Cache eviction policies (LRU, LFU, TTL-based)
  • Cluster mode with automatic data distribution

Non-Functional Requirements

  • Latency: Read/write in < 1ms (p99)
  • Throughput: 1M operations/second per node
  • Availability: 99.99% (cache failure should not crash the system)
  • Scalability: 100 TB total cache size across cluster
  • Durability: Optional persistence (RDB/AOF for Redis)

Distributed caches are typically used as a cache-aside layer: the application checks the cache first, falls back to the database on miss, and populates the cache with the result.

Back-of-the-Envelope Estimation

Distributed Cache Capacity

  • 100 TB total cache, 100 nodes = 1 TB per node
  • 1M QPS per node Γ— 100 nodes = 100M QPS cluster-wide
  • Average key size: 50 bytes, value: 1 KB
  • Keys per node: 1 TB / 1.05 KB β‰ˆ 1 billion keys/node
  • Total cluster: 100 billion keys

Data Distribution

Consistent Hashing

DfConsistent Hashing for Caching

Consistent hashing maps both keys and cache nodes to positions on a hash ring. A key is assigned to the first node found clockwise. When a node is added/removed, only keys between adjacent nodes are redistributed.

N1N2N3N4N5N6K1K2K3Consistent hash ring with 6 nodes

Use 150-200 virtual nodes per physical node to ensure even distribution. Without virtual nodes, some nodes may receive 2-3x more keys than others.

Replication

DfCache Replication

Each key is replicated to N-1 additional nodes for fault tolerance. If a node fails, replicas serve cached data without database fallback.

Replication Factor

replicas=RFβˆ’1replicas = RF - 1

Here,

  • RFRF=Replication factor (typically 3)

With RF=3, each key exists on 3 nodes. If one node fails, the remaining 2 replicas serve reads. When the failed node recovers, it catches up via anti-entropy (Merkle tree comparison).

Cache Architecture

ClientClient LibraryCache ClusterHash Ring RouterSlot MigrationFailover ManagerMaster NodeSlots 0-5460ReplicaSlots 5461-10922Master NodeSlots 10923-16383Redis Cluster Architecture (16384 hash slots)

Redis Cluster Slot Management

DfHash Slots

Redis Cluster divides the key space into 16384 hash slots. Each node is responsible for a subset of slots. Keys are assigned to slots using CRC16(key) % 16384.

Slot Assignment

slot=CRC16(key)mod  16384slot = CRC16(key) \mod 16384

Here,

  • CRC16CRC16=16-bit cyclic redundancy check hash
  • 1638416384=Total number of hash slots

Slot Distribution

3 nodes with 16384 slots:

  • Node A: slots 0-5460 (5461 slots)
  • Node B: slots 5461-10922 (5462 slots)
  • Node C: slots 10923-16383 (5461 slots)

When adding Node D, rebalance by migrating slots from A, B, C.

Eviction Policies

When memory is full, evict keys based on policy:

PolicyDescriptionUse Case
LRUEvict least recently usedGeneral purpose
LFUEvict least frequently usedAccess pattern varies
TTLEvict expired keysTime-sensitive data
RandomEvict random keysSimple, no tracking
No-evictionReturn error when fullCritical data only

Redis supports allkeys-lru (evict any LRU key) and volatile-lru (evict only keys with TTL). Use volatile-lru when some keys must never be evicted.

Cache Patterns

DfCache-Aside Pattern

  1. Application checks cache for key
  2. Cache hit β†’ return value
  3. Cache miss β†’ query database, write to cache, return value
  4. On write: update database, invalidate cache

Cache Hit Ratio Impact

effective_latency=hit_ratioΓ—latencycache+(1βˆ’hit_ratio)Γ—latencydbeffective\_latency = hit\_ratio \times latency_{cache} + (1 - hit\_ratio) \times latency_{db}

Here,

  • hitratiohit_ratio=Fraction of reads served from cache
  • latencycachelatency_{cache}=Cache read latency (typically 0.5ms)
  • latencydblatency_{db}=Database read latency (typically 5ms)

Cache Efficiency

With 99% hit ratio: effective_latency = 0.99 Γ— 0.5ms + 0.01 Γ— 5ms = 0.495 + 0.05 = 0.545ms

Without cache: 5ms Improvement: 9.2x faster

Practice Exercises

  1. Design: How would you handle cache stampede (thundering herd) when a popular key expires? Design a mechanism to prevent all requests from hitting the database simultaneously.

  2. Scale: If the cluster has 100 nodes and 100 TB of data, estimate the memory overhead for consistent hashing metadata and replication information.

  3. Consistency: Design a cache invalidation strategy that ensures eventual consistency between the cache and database. What happens during network partitions?

  4. Recovery: When a failed node recovers, how do you synchronize its data with the current cluster state? Compare Merkle tree anti-entropy vs. full sync.

Key Takeaways:

  • Consistent hashing distributes keys across nodes with minimal redistribution on scaling
  • Redis Cluster uses 16384 hash slots; each node owns a subset
  • Replication (RF=3) provides fault tolerance without database fallback
  • Cache-aside with high hit ratios (>99%) reduces database latency by 10x+
  • Virtual nodes (150-200 per physical node) ensure even distribution

What to Learn Next

-> Caching Strategies Cache-aside, write-through, and cache invalidation.

-> Consistent Hashing Deep dive into hash rings and virtual nodes.

-> Design Key-Value Store Distributed KV store with replication and consistency.

-> Data Replication Leader-follower replication and conflict resolution.

-> Load Balancing Distributing cache requests across cluster nodes.

-> Databases Cache invalidation strategies for database-backed caches.

⭐

Premium Content

Design a Distributed Cache

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert System Design Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement