Data Systems

Data Partitioning

When a single node cannot store or process all data, partitioning splits data across multiple nodes. The partitioning strategy determines how data is distributed and what queries are efficient.

Horizontal Partitioning — Splitting rows across nodes
Range Partitioning — Contiguous key ranges per partition
Hash Partitioning — Hash function determines partition

Partitioning is the key to scaling data storage and query throughput beyond a single machine.

Why Partition?

A single database has hard limits: disk space, CPU, memory, and I/O bandwidth. Partitioning (sharding) distributes data across multiple machines.

DfData Partitioning

Data partitioning (sharding) is the process of splitting a large dataset into smaller subsets (partitions) distributed across multiple machines. Each partition holds a subset of the total data, and queries are routed to the appropriate partition based on the partition key. This enables horizontal scaling of storage and read/write throughput.

Partitioning is often confused with replication. Partitioning splits data; replication copies it. Most production systems use both: data is partitioned for scalability and each partition is replicated for fault tolerance.

Horizontal vs Vertical Partitioning

Strategy	Description	Use Case
Vertical	Split columns into different tables/databases	Rarely used in modern systems
Horizontal	Split rows across multiple databases	Standard sharding approach

Vertical Partitioning

Splitting a table by columns into separate tables or databases.

Vertical Partitioning

A users table with columns (id, name, email, bio, profile_image) can be split:

users_core: (id, name, email) — frequently accessed
users_extended: (id, bio, profile_image) — less frequently accessed

This improves cache hit rates and reduces I/O for common queries.

Horizontal Partitioning

Splitting rows across multiple database instances.

Horizontal Partitioning

A users table with 100M rows, partitioned by user_id modulo 4:

Partition 0: user_id % 4 == 0
Partition 1: user_id % 4 == 1
Partition 2: user_id % 4 == 2
Partition 3: user_id % 4 == 3

Each partition holds ~25M rows, fitting on a single machine.

Range Partitioning

Each partition owns a contiguous range of keys.

DfRange Partitioning

Range partitioning assigns contiguous key ranges to partitions. Keys are sorted, and boundaries divide them into equal-sized ranges. This enables efficient range queries (e.g., "get all orders from January") but can create hotspots for sequential workloads.

Range Partitioning Trade-offs

Range partitioning is vulnerable to hotspots with sequential keys. If keys are timestamps, the latest partition receives all writes. Mitigate by using a composite key: hash(user_id) + timestamp, distributing writes across partitions while maintaining temporal locality.

Hash Partitioning

A hash function determines which partition owns a key.

DfHash Partitioning

Hash partitioning applies a hash function to the partition key and assigns the key to partition hash(key) mod N. This distributes keys uniformly regardless of their values, eliminating hotspots from sequential key patterns. However, range queries require querying all partitions.

Hash Partition Assignment

partition(k) = hash(k) \mod N

Here,

$k$ =The partition key
$hash(k)$ =Hash function output
$N$ =Number of partitions

Consistent Hashing

When N changes, naive hash partitioning remaps all keys. Consistent hashing minimizes key movement.

Consistent hashing maps both keys and partitions onto a ring. A key is assigned to the next partition clockwise. Adding/removing a partition only affects keys between the new/removed partition and its neighbor. See the Consistent Hashing lesson for details.

Partition Key Selection

The partition key determines data distribution and query efficiency.

DfPartition Key Requirements

An ideal partition key:

High cardinality — Many distinct values for even distribution
Query-aligned — Most queries filter by the partition key
Stable — Doesn't change over the entity's lifetime

Partition Key Trade-offs

Partition Key	Distribution	Range Queries	Hotspot Risk
user_id	Even	Efficient per user	Low
timestamp	Sequential	Efficient by time	High (latest)
region	Skewed	Inefficient cross-region	Medium
order_id	Even	Inefficient across users	Low

Rebalancing

When partitions become uneven or nodes are added/removed, data must be rebalanced.

DfRebalancing Strategies

Fixed partitioning: Pre-create many more partitions than nodes; assign multiple partitions per node. Moving a partition moves data.
Dynamic partitioning: Split hot partitions, merge cold ones. Automatically adapts to workload.
Hash-based: Rehash all keys when node count changes. Simple but expensive.
Consistent hashing: Minimal key movement on node changes.

Rebalancing Cost

cost = \frac{total\_data \times keys\_moved}{total\_keys}

Here,

$cost$ =Fraction of total data that must move
$keys_moved$ =Number of keys that change partition

Cross-Partition Queries

Queries spanning multiple partitions are inherently more expensive.

Design your schema so most queries touch a single partition. Use denormalization, materialized views, or secondary indexes that are partition-aware. Cross-partition joins should be rare exceptions, not the norm.

Scatter-Gather

A query that touches all partitions must scatter (send to all) and gather (combine results).

Scatter-Gather Cost

A global search across 100 partitions:

Query sent to all 100 partitions in parallel
Each partition returns results
Results merged and sorted at coordinator
Latency = max(partition latency) + merge time

This is why cross-partition queries are expensive and should be minimized.

Partitioning in Practice

System	Partitioning Strategy	Notes
MySQL (Vitess)	Hash or range	Configurable per table
MongoDB	Hash or range	Auto-balancing
Cassandra	Hash (consistent)	Virtual nodes for balance
DynamoDB	Hash (consistent)	Automatic partition management
CockroachDB	Range	Automatic split/merge

Practice Exercises

Design: Design a partitioning strategy for a ride-sharing app with 100M rides/day. Queries: rides by user, rides by city, rides by time range. What partition key supports the most common query?
Analysis: Compare range and hash partitioning for an e-commerce order system where 80% of queries are "my orders" (by user_id) and 20% are "all orders today" (by timestamp).
Rebalancing: A system has 4 partitions. Adding a 5th node with consistent hashing moves ~20% of keys. With fixed partitioning and 16 partitions, how many partitions move?
Hotspot Mitigation: A time-series database receives 1M writes/second, all with current timestamps. Design a partitioning strategy that avoids hotspots while maintaining time-range query efficiency.

Key Takeaways:

Horizontal partitioning distributes rows across multiple machines for scalability
Range partitioning enables efficient range queries but is vulnerable to hotspots
Hash partitioning distributes evenly but requires scatter-gather for range queries
Consistent hashing minimizes data movement during rebalancing
Cross-partition queries (scatter-gather) are expensive; design schemas to minimize them
Partition key selection is critical: high cardinality, query-aligned, and stable

What to Learn Next

-> Consistent Hashing Hash rings, virtual nodes, and minimal key redistribution.

-> Data Replication Leader-follower, multi-leader, and conflict resolution.

-> Databases SQL vs NoSQL, indexing, replication, and sharding.

-> Database Indexing B-trees, LSM trees, and query optimization.

-> CAP Theorem Consistency models, availability, and partition tolerance.

-> Scalability Fundamentals Vertical vs horizontal scaling, load balancing, and capacity planning.

Data Partitioning

Data Partitioning

Why Partition?

DfData Partitioning

Horizontal vs Vertical Partitioning

Vertical Partitioning

Vertical Partitioning

Horizontal Partitioning

Horizontal Partitioning

Range Partitioning

DfRange Partitioning

Range Partitioning Trade-offs

Hash Partitioning

DfHash Partitioning

Hash Partition Assignment

Consistent Hashing

Partition Key Selection

DfPartition Key Requirements

Partition Key Trade-offs

Rebalancing

DfRebalancing Strategies

Rebalancing Cost

Cross-Partition Queries

Scatter-Gather

Scatter-Gather Cost

Partitioning in Practice

Practice Exercises

What to Learn Next

Premium Content

Need Expert System Design Help?