πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Data Partitioning

Data SystemsDistributed Data🟒 Free Lesson

Advertisement

Data Systems

Data Partitioning

When a single node cannot store or process all data, partitioning splits data across multiple nodes. The partitioning strategy determines how data is distributed and what queries are efficient.

  • Horizontal Partitioning β€” Splitting rows across nodes
  • Range Partitioning β€” Contiguous key ranges per partition
  • Hash Partitioning β€” Hash function determines partition

Partitioning is the key to scaling data storage and query throughput beyond a single machine.

Why Partition?

A single database has hard limits: disk space, CPU, memory, and I/O bandwidth. Partitioning (sharding) distributes data across multiple machines.

DfData Partitioning

Data partitioning (sharding) is the process of splitting a large dataset into smaller subsets (partitions) distributed across multiple machines. Each partition holds a subset of the total data, and queries are routed to the appropriate partition based on the partition key. This enables horizontal scaling of storage and read/write throughput.

Partitioning is often confused with replication. Partitioning splits data; replication copies it. Most production systems use both: data is partitioned for scalability and each partition is replicated for fault tolerance.

Horizontal vs Vertical Partitioning

StrategyDescriptionUse Case
VerticalSplit columns into different tables/databasesRarely used in modern systems
HorizontalSplit rows across multiple databasesStandard sharding approach

Vertical Partitioning

Splitting a table by columns into separate tables or databases.

Vertical Partitioning

A users table with columns (id, name, email, bio, profile_image) can be split:

  • users_core: (id, name, email) β€” frequently accessed
  • users_extended: (id, bio, profile_image) β€” less frequently accessed

This improves cache hit rates and reduces I/O for common queries.

Horizontal Partitioning

Splitting rows across multiple database instances.

Horizontal Partitioning

A users table with 100M rows, partitioned by user_id modulo 4:

  • Partition 0: user_id % 4 == 0
  • Partition 1: user_id % 4 == 1
  • Partition 2: user_id % 4 == 2
  • Partition 3: user_id % 4 == 3

Each partition holds ~25M rows, fitting on a single machine.

Range Partitioning

Each partition owns a contiguous range of keys.

DfRange Partitioning

Range partitioning assigns contiguous key ranges to partitions. Keys are sorted, and boundaries divide them into equal-sized ranges. This enables efficient range queries (e.g., "get all orders from January") but can create hotspots for sequential workloads.

Partition 1Keys: A - F~25M rowsPartition 2Keys: G - M~25M rowsPartition 3Keys: N - Z~25M rowsHotspot: sequential writesRange PartitioningRange queries are efficient | Sequential writes create hotspots

Range Partitioning Trade-offs

Range partitioning is vulnerable to hotspots with sequential keys. If keys are timestamps, the latest partition receives all writes. Mitigate by using a composite key: hash(user_id) + timestamp, distributing writes across partitions while maintaining temporal locality.

Hash Partitioning

A hash function determines which partition owns a key.

DfHash Partitioning

Hash partitioning applies a hash function to the partition key and assigns the key to partition hash(key) mod N. This distributes keys uniformly regardless of their values, eliminating hotspots from sequential key patterns. However, range queries require querying all partitions.

Hash Partition Assignment

partition(k)=hash(k)mod  Npartition(k) = hash(k) \mod N

Here,

  • kk=The partition key
  • hash(k)hash(k)=Hash function output
  • NN=Number of partitions

Consistent Hashing

When N changes, naive hash partitioning remaps all keys. Consistent hashing minimizes key movement.

Consistent hashing maps both keys and partitions onto a ring. A key is assigned to the next partition clockwise. Adding/removing a partition only affects keys between the new/removed partition and its neighbor. See the Consistent Hashing lesson for details.

Partition Key Selection

The partition key determines data distribution and query efficiency.

DfPartition Key Requirements

An ideal partition key:

  • High cardinality β€” Many distinct values for even distribution
  • Query-aligned β€” Most queries filter by the partition key
  • Stable β€” Doesn't change over the entity's lifetime

Partition Key Trade-offs

Partition KeyDistributionRange QueriesHotspot Risk
user_idEvenEfficient per userLow
timestampSequentialEfficient by timeHigh (latest)
regionSkewedInefficient cross-regionMedium
order_idEvenInefficient across usersLow

Rebalancing

When partitions become uneven or nodes are added/removed, data must be rebalanced.

DfRebalancing Strategies

  • Fixed partitioning: Pre-create many more partitions than nodes; assign multiple partitions per node. Moving a partition moves data.
  • Dynamic partitioning: Split hot partitions, merge cold ones. Automatically adapts to workload.
  • Hash-based: Rehash all keys when node count changes. Simple but expensive.
  • Consistent hashing: Minimal key movement on node changes.

Rebalancing Cost

cost=total_dataΓ—keys_movedtotal_keyscost = \frac{total\_data \times keys\_moved}{total\_keys}

Here,

  • costcost=Fraction of total data that must move
  • keysmovedkeys_moved=Number of keys that change partition

Cross-Partition Queries

Queries spanning multiple partitions are inherently more expensive.

Design your schema so most queries touch a single partition. Use denormalization, materialized views, or secondary indexes that are partition-aware. Cross-partition joins should be rare exceptions, not the norm.

Scatter-Gather

A query that touches all partitions must scatter (send to all) and gather (combine results).

Scatter-Gather Cost

A global search across 100 partitions:

  • Query sent to all 100 partitions in parallel
  • Each partition returns results
  • Results merged and sorted at coordinator
  • Latency = max(partition latency) + merge time

This is why cross-partition queries are expensive and should be minimized.

Partitioning in Practice

SystemPartitioning StrategyNotes
MySQL (Vitess)Hash or rangeConfigurable per table
MongoDBHash or rangeAuto-balancing
CassandraHash (consistent)Virtual nodes for balance
DynamoDBHash (consistent)Automatic partition management
CockroachDBRangeAutomatic split/merge

Practice Exercises

  1. Design: Design a partitioning strategy for a ride-sharing app with 100M rides/day. Queries: rides by user, rides by city, rides by time range. What partition key supports the most common query?

  2. Analysis: Compare range and hash partitioning for an e-commerce order system where 80% of queries are "my orders" (by user_id) and 20% are "all orders today" (by timestamp).

  3. Rebalancing: A system has 4 partitions. Adding a 5th node with consistent hashing moves ~20% of keys. With fixed partitioning and 16 partitions, how many partitions move?

  4. Hotspot Mitigation: A time-series database receives 1M writes/second, all with current timestamps. Design a partitioning strategy that avoids hotspots while maintaining time-range query efficiency.

Key Takeaways:

  • Horizontal partitioning distributes rows across multiple machines for scalability
  • Range partitioning enables efficient range queries but is vulnerable to hotspots
  • Hash partitioning distributes evenly but requires scatter-gather for range queries
  • Consistent hashing minimizes data movement during rebalancing
  • Cross-partition queries (scatter-gather) are expensive; design schemas to minimize them
  • Partition key selection is critical: high cardinality, query-aligned, and stable

What to Learn Next

-> Consistent Hashing Hash rings, virtual nodes, and minimal key redistribution.

-> Data Replication Leader-follower, multi-leader, and conflict resolution.

-> Databases SQL vs NoSQL, indexing, replication, and sharding.

-> Database Indexing B-trees, LSM trees, and query optimization.

-> CAP Theorem Consistency models, availability, and partition tolerance.

-> Scalability Fundamentals Vertical vs horizontal scaling, load balancing, and capacity planning.

⭐

Premium Content

Data Partitioning

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert System Design Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement