Data Systems
Consistent Hashing
When adding or removing nodes from a distributed system, naive hashing forces massive key redistribution. Consistent hashing minimizes key movement, making scaling operations efficient.
- Hash Ring — A circular key space mapped to nodes
- Virtual Nodes — Multiple positions per physical node for balance
- Minimal Disruption — Only K/N keys move when a node is added
Consistent hashing is the foundation of distributed caches, databases, and CDNs.
The Problem with Simple Hashing
With standard modulo hashing, adding a new node invalidates nearly every key's mapping.
DfModulo Hashing
With N nodes, key k is assigned to node k mod N. When a node is added (N → N+1), approximately (N/(N+1)) × total keys must be redistributed — nearly all keys move.
Key Redistribution with Modulo Hashing
- 3 servers, 1000 keys → each server holds ~333 keys
- Add 1 server (4 total) → keys redistribute: 1000 × 3/4 = 750 keys move
This causes massive cache misses and database thrashing during scaling.
Hash Ring
Consistent hashing maps both keys and servers onto a circular ring using a hash function.
DfConsistent Hash Ring
A hash ring is a circular key space [0, 2^m - 1]. Both keys and servers are hashed onto this ring. A key is assigned to the first server encountered clockwise from its position. When a server is added or removed, only keys between the new/removed server and its counterclockwise neighbor are affected.
How Keys Are Assigned
Each key hashes to a position on the ring. Travel clockwise from that position until you hit a server — that server owns the key.
Key Assignment
Here,
- =The key being assigned
- =Hash value of key k on the ring
Impact of Node Changes
Adding a Node
With 3 servers on a ring, adding a 4th server only affects keys between the new server and its counterclockwise neighbor.
- Keys between new server and previous counterclockwise neighbor move: ~1/N of all keys
- With 4 servers: ~25% of keys move (vs 75% with modulo hashing)
- With 100 servers: ~1% of keys move
Virtual Nodes
Physical servers are mapped to multiple positions on the ring to improve load distribution.
DfVirtual Nodes
Virtual nodes (vnodes) replicate each physical server onto multiple points on the hash ring. A physical server with V virtual nodes occupies V positions. This smooths out uneven distribution that occurs with few physical servers, ensuring each server owns approximately 1/N of the key space.
With virtual nodes, even a small number of physical servers achieves near-uniform distribution. Cassandra uses 256 vnodes per node by default. DynamoDB uses a similar approach with configurable virtual node count.
Virtual Node Load
Here,
- =Fraction of keys assigned to physical server i
- =Number of virtual nodes per physical server
Load Distribution Analysis
Uniformity with Virtual Nodes
Consider 3 physical servers with different capacities: S1 (40%), S2 (35%), S3 (25%).
Without virtual nodes: Assign 3 positions proportional to capacity.
- S1: positions 0, 144, 288 (3 positions)
- S2: positions 100, 244 (2 positions)
- S3: positions 200 (1 position)
With virtual nodes (100 vnodes): Each server gets 100, 87, 63 vnodes respectively. Distribution approaches capacity ratios within ~5% variance.
Coefficient of Variation
Here,
- =Coefficient of variation of load across servers
- =Standard deviation of load
- =Mean load per server
Range-Based Consistent Hashing
Some systems use range-based approaches for ordered data.
Range-based partitioning assigns contiguous key ranges to servers. This enables efficient range queries but creates hotspots for sequential workloads. Consistent hashing with virtual nodes balances the load while preserving range locality within each server's assigned space.
Real-World Applications
| System | Use Case | Virtual Nodes |
|---|---|---|
| Amazon DynamoDB | Partition assignment | ~1000 per partition |
| Apache Cassandra | Data distribution | 256 per node |
| Memcached | Client-side sharding | Configurable |
| Akamai CDN | Request routing | millions of positions |
| Riak | Distributed KV store | 64 per node |
Implementation Considerations
When implementing consistent hashing, use a binary search tree (or sorted array) to store virtual node positions. Key lookup is O(log(V×N)) where V is virtual nodes per physical node and N is the number of physical nodes.
Replication
Consistent hashing naturally supports replication by assigning keys to V consecutive servers clockwise on the ring.
Replication Factor
Here,
- =Replication factor
- =i-th replica server
Practice Exercises
-
Calculation: Draw a hash ring with 4 servers at positions 10, 50, 120, 200 (ring size 256). Where do keys at positions 5, 55, 150, 210 map to?
-
Design: Design a consistent hashing scheme for a cache cluster with 5 servers having weights 3, 2, 2, 1, 1. How many virtual nodes should each server have?
-
Analysis: Compare the key redistribution when removing a node with 100 vs 1000 virtual nodes. What are the trade-offs?
-
Implementation: Implement a simple consistent hash ring in Python with virtual nodes. Test that adding/removing nodes moves approximately 1/N of keys.
Key Takeaways:
- Consistent hashing maps keys and servers onto a circular ring, minimizing key movement during scaling
- Virtual nodes ensure uniform load distribution across heterogeneous servers
- Only K/N keys move when a node is added (vs K with modulo hashing)
- Used by DynamoDB, Cassandra, Memcached, CDNs, and distributed caches
- Natural support for replication by assigning keys to consecutive servers on the ring
What to Learn Next
-> Data Partitioning Horizontal partitioning, range vs hash partitioning, and rebalancing.
-> Data Replication Leader-follower, multi-leader, and conflict resolution.
-> CAP Theorem Consistency models, availability, and partition tolerance.
-> Databases SQL vs NoSQL, indexing, replication, and sharding.
-> Load Balancing Distribution algorithms and L4 vs L7 load balancing.
-> Distributed Consensus Raft, Paxos, and leader election algorithms.