System Design — Data Systems
Caching Strategies
Caching is the single most impactful performance optimization in system design. By storing frequently accessed data in fast storage (memory), you can reduce latency from milliseconds to microseconds and dramatically reduce database load.
- Cache-Aside — Application manages cache explicitly
- Write-Through — Writes go to cache and database simultaneously
- Redis vs Memcached — When to use each caching technology
There are only two hard things in computer science: cache invalidation and naming things. — Phil Karlton
What Is Caching?
DfCaching
Caching is the technique of storing a copy of data in a faster-access storage layer (typically memory) so that future requests for that data can be served without accessing the original, slower data source. The fundamental trade-off is space and consistency in exchange for speed.
Where Caches Live
Cache Topologies
DfLocal Cache
A local cache (in-process) stores data in the application's memory space. It's the fastest option (nanosecond access) but is limited to a single process and doesn't share state across instances.
DfDistributed Cache
A distributed cache stores data across multiple networked servers. It provides shared state across application instances and can scale independently, but introduces network latency (microsecond to millisecond access).
| Property | Local Cache | Distributed Cache |
|---|---|---|
| Latency | Nanoseconds | Microseconds-milliseconds |
| Scope | Single process | All instances |
| Invalidation | Simple | Complex (broadcast needed) |
| Memory | Limited by process | Scales independently |
| Consistency | Eventual | Configurable |
Cache Eviction Policies
When the cache is full, eviction policies determine which entries to remove:
| Policy | Description | Trade-off |
|---|---|---|
| LRU | Least Recently Used | Good general-purpose, may evict rarely-used but important items |
| LFU | Least Frequently Used | Better for skewed distributions, slower to update |
| FIFO | First In First Out | Simplest, no frequency/recency awareness |
| TTL | Time To Live | Expires after fixed duration, may evict useful data |
| Random | Random eviction | Surprisingly effective, no tracking overhead |
Cache Hit Ratio
Here,
- =Requests served from cache
- =Requests that required fetching from origin
Impact of Hit Ratio on Latency
Consider a system with:
- Database latency: 10ms
- Cache latency: 0.5ms
- 95% cache hit ratio
Average latency = 0.95 × 0.5ms + 0.05 × 10ms = 0.475ms + 0.5ms = 0.975ms
Without caching: 10ms average With caching: ~1ms average — a 10x improvement
At 99% hit ratio: 0.99 × 0.5 + 0.01 × 10 = 0.595ms — a 17x improvement
Cache-Aside (Lazy Loading)
The most common caching pattern:
Flow:
- Application checks cache for data
- Cache hit: Return data directly from cache
- Cache miss: Fetch from database, store in cache, return data
- On write: Update database, invalidate cache (don't update cache)
In cache-aside, always invalidate (delete) the cache on writes rather than updating it. This avoids race conditions where a read-modify-write cycle on the cache overwrites a concurrent database update.
Write-Through
Writes go to both cache and database simultaneously:
Pros: Cache is always consistent, simple to implement Cons: Write latency is cache + DB, cache may be filled with data never read
Write-Back (Write-Behind)
Writes go to cache first, then asynchronously to database:
Pros: Very fast writes, batched database writes Cons: Data loss risk if cache fails before flush, complex implementation
Redis vs Memcached
| Feature | Redis | Memcached |
|---|---|---|
| Data structures | Strings, lists, sets, hashes, sorted sets, streams | Strings only |
| Persistence | RDB + AOF | None |
| Replication | Built-in primary-replica | None |
| Cluster mode | Redis Cluster | Client-side sharding |
| Memory efficiency | Higher overhead per key | Very memory efficient |
| Pub/Sub | Built-in | None |
| Lua scripting | Supported | Not supported |
| Best for | Complex data, persistence, pub/sub | Simple key-value caching |
Redis is more than a cache—it's an in-memory data store with rich data structures. Use Redis when you need sorted sets (leaderboards), lists (queues), or pub/sub. Use Memcached when you need simple, memory-efficient key-value caching at massive scale.
Cache Patterns in Practice
Cache Stampede Prevention
When a popular cache entry expires, many requests simultaneously try to rebuild it:
Solutions to cache stampede:
- Locking: Only one thread rebuilds the cache; others wait
- Early refresh: Refresh cache before TTL expires (e.g., at 80% TTL)
- Probabilistic early expiration: Randomly trigger refresh before TTL
- Request coalescing: Deduplicate concurrent requests for the same key
Cache Invalidation Strategies
| Strategy | Description | Consistency |
|---|---|---|
| TTL-based | Expires after fixed duration | Eventual |
| Event-driven | Invalidate on write events | Near real-time |
| Version-based | Cache key includes version | Strong |
| Tag-based | Group related keys for bulk invalidation | Configurable |
Practice Exercises
-
Design: Design a caching strategy for a news website with 10M daily visitors. What would you cache? What TTL would you use? How do you handle breaking news?
-
Analysis: Compare cache-aside, write-through, and write-back for a banking application. Which pattern is most appropriate and why?
-
Debugging: A system has a 99% cache hit ratio but still shows high latency at p99. What could cause this? How would you diagnose it?
-
Architecture: Design a distributed caching layer for a global e-commerce platform. Consider: data locality, consistency, failure handling, and warming the cache.
Key Takeaways:
- Caching reduces latency by orders of magnitude; a 99% hit ratio can yield 10-20x improvement
- Cache-aside is the most common pattern; always invalidate on writes, don't update
- Write-through provides consistency; write-back provides performance; choose based on requirements
- Redis is a feature-rich data store; Memcached is simple, memory-efficient caching
- Cache stampede, invalidation, and consistency are the hardest problems in caching
What to Learn Next
-> Databases SQL vs NoSQL, indexing, replication, and sharding.
-> Load Balancing Algorithms, health checks, and L4 vs L7.
-> Message Queues Kafka, RabbitMQ, event-driven architecture.
-> CAP Theorem Consistency models, availability, and partition tolerance.
-> Microservices Service decomposition, discovery, and API gateways.
-> Scalability Fundamentals Vertical vs horizontal scaling and capacity planning.