🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Consistent Hashing

Data SystemsDistributed Data🟢 Free Lesson

Advertisement

Data Systems

Consistent Hashing

When adding or removing nodes from a distributed system, naive hashing forces massive key redistribution. Consistent hashing minimizes key movement, making scaling operations efficient.

  • Hash Ring — A circular key space mapped to nodes
  • Virtual Nodes — Multiple positions per physical node for balance
  • Minimal Disruption — Only K/N keys move when a node is added

Consistent hashing is the foundation of distributed caches, databases, and CDNs.

The Problem with Simple Hashing

With standard modulo hashing, adding a new node invalidates nearly every key's mapping.

DfModulo Hashing

With N nodes, key k is assigned to node k mod N. When a node is added (N → N+1), approximately (N/(N+1)) × total keys must be redistributed — nearly all keys move.

Key Redistribution with Modulo Hashing

  • 3 servers, 1000 keys → each server holds ~333 keys
  • Add 1 server (4 total) → keys redistribute: 1000 × 3/4 = 750 keys move

This causes massive cache misses and database thrashing during scaling.

Hash Ring

Consistent hashing maps both keys and servers onto a circular ring using a hash function.

DfConsistent Hash Ring

A hash ring is a circular key space [0, 2^m - 1]. Both keys and servers are hashed onto this ring. A key is assigned to the first server encountered clockwise from its position. When a server is added or removed, only keys between the new/removed server and its counterclockwise neighbor are affected.

S1hash: 0S2hash: 60S3hash: 120S4hash: 200S5hash: 280K1K2K3Clockwise assignment: K1→S2, K2→S3, K3→S5

How Keys Are Assigned

Each key hashes to a position on the ring. Travel clockwise from that position until you hit a server — that server owns the key.

Key Assignment

server(k)=first server clockwise from hash(k)server(k) = \text{first server clockwise from } hash(k)

Here,

  • kk=The key being assigned
  • hash(k)hash(k)=Hash value of key k on the ring

Impact of Node Changes

Adding a Node

With 3 servers on a ring, adding a 4th server only affects keys between the new server and its counterclockwise neighbor.

  • Keys between new server and previous counterclockwise neighbor move: ~1/N of all keys
  • With 4 servers: ~25% of keys move (vs 75% with modulo hashing)
  • With 100 servers: ~1% of keys move

Virtual Nodes

Physical servers are mapped to multiple positions on the ring to improve load distribution.

DfVirtual Nodes

Virtual nodes (vnodes) replicate each physical server onto multiple points on the hash ring. A physical server with V virtual nodes occupies V positions. This smooths out uneven distribution that occurs with few physical servers, ensuring each server owns approximately 1/N of the key space.

With virtual nodes, even a small number of physical servers achieves near-uniform distribution. Cassandra uses 256 vnodes per node by default. DynamoDB uses a similar approach with configurable virtual node count.

Virtual Node Load

load(si)=keys_on_ring_owned_by_s_itotal_keysload(s_i) = \frac{|keys\_on\_ring\_owned\_by\_s\_i|}{total\_keys}

Here,

  • load(si)load(s_i)=Fraction of keys assigned to physical server i
  • VV=Number of virtual nodes per physical server

Load Distribution Analysis

Uniformity with Virtual Nodes

Consider 3 physical servers with different capacities: S1 (40%), S2 (35%), S3 (25%).

Without virtual nodes: Assign 3 positions proportional to capacity.

  • S1: positions 0, 144, 288 (3 positions)
  • S2: positions 100, 244 (2 positions)
  • S3: positions 200 (1 position)

With virtual nodes (100 vnodes): Each server gets 100, 87, 63 vnodes respectively. Distribution approaches capacity ratios within ~5% variance.

Coefficient of Variation

CV=σμCV = \frac{\sigma}{\mu}

Here,

  • CVCV=Coefficient of variation of load across servers
  • σ\sigma=Standard deviation of load
  • μ\mu=Mean load per server

Range-Based Consistent Hashing

Some systems use range-based approaches for ordered data.

Range-based partitioning assigns contiguous key ranges to servers. This enables efficient range queries but creates hotspots for sequential workloads. Consistent hashing with virtual nodes balances the load while preserving range locality within each server's assigned space.

Real-World Applications

SystemUse CaseVirtual Nodes
Amazon DynamoDBPartition assignment~1000 per partition
Apache CassandraData distribution256 per node
MemcachedClient-side shardingConfigurable
Akamai CDNRequest routingmillions of positions
RiakDistributed KV store64 per node

Implementation Considerations

When implementing consistent hashing, use a binary search tree (or sorted array) to store virtual node positions. Key lookup is O(log(V×N)) where V is virtual nodes per physical node and N is the number of physical nodes.

Replication

Consistent hashing naturally supports replication by assigning keys to V consecutive servers clockwise on the ring.

Replication Factor

replicas(k)={s1,s2,...,sR} where si is i-th server clockwise from hash(k)replicas(k) = \{s_1, s_2, ..., s_R\} \text{ where } s_i \text{ is i-th server clockwise from } hash(k)

Here,

  • RR=Replication factor
  • sis_i=i-th replica server

Practice Exercises

  1. Calculation: Draw a hash ring with 4 servers at positions 10, 50, 120, 200 (ring size 256). Where do keys at positions 5, 55, 150, 210 map to?

  2. Design: Design a consistent hashing scheme for a cache cluster with 5 servers having weights 3, 2, 2, 1, 1. How many virtual nodes should each server have?

  3. Analysis: Compare the key redistribution when removing a node with 100 vs 1000 virtual nodes. What are the trade-offs?

  4. Implementation: Implement a simple consistent hash ring in Python with virtual nodes. Test that adding/removing nodes moves approximately 1/N of keys.

Key Takeaways:

  • Consistent hashing maps keys and servers onto a circular ring, minimizing key movement during scaling
  • Virtual nodes ensure uniform load distribution across heterogeneous servers
  • Only K/N keys move when a node is added (vs K with modulo hashing)
  • Used by DynamoDB, Cassandra, Memcached, CDNs, and distributed caches
  • Natural support for replication by assigning keys to consecutive servers on the ring

What to Learn Next

-> Data Partitioning Horizontal partitioning, range vs hash partitioning, and rebalancing.

-> Data Replication Leader-follower, multi-leader, and conflict resolution.

-> CAP Theorem Consistency models, availability, and partition tolerance.

-> Databases SQL vs NoSQL, indexing, replication, and sharding.

-> Load Balancing Distribution algorithms and L4 vs L7 load balancing.

-> Distributed Consensus Raft, Paxos, and leader election algorithms.

Premium Content

Consistent Hashing

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert System Design Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement