Partitioning & Indexing Strategies for Data Warehouses

Partitioning & Indexing: Optimizing Data Access Patterns

Partitioning divides tables into segments based on column values, while indexing creates metadata structures that enable fast lookups. Together, they determine query performance by minimizing data scanned.

Why Partitioning & Indexing Matter

Without partitioning, a query scanning a 1 TB table reads all 1 TB. With effective partitioning on a date column, the same query might scan only 3 GB (one day). This 300x reduction in I/O directly translates to faster queries and lower costs.

Architecture Overview

Partitioning Strategies

Range partitioning divides data into partitions based on a range of values in a column (e.g., date ranges). Each partition contains rows where the partition column falls within a specific range.

Hash partitioning distributes data across partitions using a hash function on the partition column. Rows are assigned to partitions based on hash(column) mod num_partitions, ensuring uniform distribution.

List partitioning assigns rows to partitions based on explicit value lists (e.g., region = 'US', 'EU', 'APAC'). Each partition contains rows matching one or more values in the list.

Partition Pruning Efficiency

Total Partitions: P
Partitions Scanned: p (after pruning)
Pruning Ratio: PR = (P - p) / P × 100%
I/O Reduction: Data_Scanned = Total_Size × (p / P)
Query Speedup: Approximately 1 / (p/P) for I/O-bound queries
Optimal Partition Count: 100 ≤ P ≤ 10,000 (too few = large partitions, too many = metadata overhead)
Max Partition Size: 1-4 GB (balance between pruning and file management)

-- RANGE PARTITIONING: By date (most common pattern)
CREATE TABLE fact_orders (
    order_id        BIGINT,
    customer_id     INT,
    order_date      DATE NOT NULL,
    net_amount      DECIMAL(14,2),
    status          VARCHAR(20)
)
PARTITION BY RANGE (order_date);

-- PostgreSQL syntax
CREATE TABLE fact_orders_pg (
    order_id        BIGINT,
    customer_id     INT,
    order_date      DATE NOT NULL,
    net_amount      DECIMAL(14,2)
) PARTITION BY RANGE (order_date);

CREATE TABLE fact_orders_2024_q1 PARTITION OF fact_orders_pg
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE TABLE fact_orders_2024_q2 PARTITION OF fact_orders_pg
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

-- HASH PARTITIONING: Even distribution for high-cardinality columns
CREATE TABLE fact_events (
    event_id        BIGINT,
    user_id         BIGINT NOT NULL,
    event_type      VARCHAR(50),
    event_timestamp TIMESTAMP
)
PARTITION BY HASH (user_id) PARTITIONS 16;

-- PostgreSQL hash partitioning
CREATE TABLE fact_events_pg (
    event_id        BIGINT,
    user_id         BIGINT NOT NULL,
    event_type      VARCHAR(50)
) PARTITION BY HASH (user_id);

CREATE TABLE fact_events_p0 PARTITION OF fact_events_pg
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
CREATE TABLE fact_events_p1 PARTITION OF fact_events_pg
    FOR VALUES WITH (MODULUS 16, REMAINDER 1);

-- LIST PARTITIONING: By categorical value
CREATE TABLE fact_sales (
    order_id        BIGINT,
    region          VARCHAR(20) NOT NULL,
    order_date      DATE,
    net_amount      DECIMAL(14,2)
)
PARTITION BY LIST (region);

CREATE TABLE fact_sales_north_america PARTITION OF fact_sales
    FOR VALUES IN ('US', 'CA', 'MX');
CREATE TABLE fact_sales_europe PARTITION OF fact_sales
    FOR VALUES IN ('UK', 'DE', 'FR', 'ES');
CREATE TABLE fact_sales_apac PARTITION OF fact_sales
    FOR VALUES IN ('JP', 'AU', 'IN', 'SG');

-- Sub-partitioning: Range + Hash
CREATE TABLE fact_clickstream (
    click_id        BIGINT,
    user_id         BIGINT NOT NULL,
    event_date      DATE NOT NULL,
    event_type      VARCHAR(50)
)
PARTITION BY RANGE (event_date)
SUBPARTITION BY HASH (user_id) SUBPARTITIONS 8;

-- Partition pruning: Query must include partition column
-- GOOD: Prunes to single partition
SELECT * FROM fact_orders WHERE order_date = '2025-01-15';

-- GOOD: Prunes to range of partitions
SELECT * FROM fact_orders WHERE order_date BETWEEN '2025-01-01' AND '2025-01-31';

-- BAD: Full table scan (no partition filter)
SELECT * FROM fact_orders WHERE customer_id = 12345;

-- Analyze partition distribution
SELECT
    partition_name,
    num_rows,
    ROUND(bytes / 1024 / 1024, 2) AS size_mb,
    last_analyzed
FROM user_tab_partitions
WHERE table_name = 'FACT_ORDERS'
ORDER BY partition_name;

Indexing Strategies

A B-Tree index is a balanced tree data structure that maintains sorted data and enables O(log n) lookup, insertion, and deletion. It is optimal for equality and range queries on columns with high cardinality.

A Bitmap index stores a bitmap for each distinct value in a column. Each bit in the bitmap represents a row, and is set to 1 if the row contains that value. Bitmap indexes are optimal for low-cardinality columns used in complex boolean queries.

A Bloom filter is a probabilistic data structure that tests whether an element is a member of a set. It may produce false positives but never false negatives. In data systems, Bloom filters skip files/partitions that definitely don't contain the queried value.

-- B-TREE INDEX: Equality and range queries
CREATE INDEX idx_orders_customer ON fact_orders (customer_id);
CREATE INDEX idx_orders_date ON fact_orders (order_date);
CREATE INDEX idx_orders_composite ON fact_orders (order_date, customer_id);

-- Partial index: Only index active orders
CREATE INDEX idx_orders_active ON fact_orders (customer_id)
    WHERE status = 'active';

-- Covering index: Include all columns needed by query
CREATE INDEX idx_orders_covering ON fact_orders (order_date, customer_id)
    INCLUDE (net_amount, status);

-- BITMAP INDEX: Low-cardinality columns
CREATE BITMAP INDEX idx_orders_status ON fact_orders (status);
CREATE BITMAP INDEX idx_orders_region ON fact_sales (region);

-- Bitmap index for complex boolean queries
-- SELECT * FROM fact_orders
-- WHERE status IN ('pending', 'processing') AND region = 'US'
-- -> Bitmap OR for status, Bitmap AND for region

-- BLOOM FILTER INDEX: Membership tests
-- Spark/Delta Lake syntax
CREATE BLOOMFILTER INDEX ON delta.`/delta/events`
    FOR COLUMNS(user_id, event_type);

-- Iceberg Bloom filter
ALTER TABLE iceberg.events
    SET PROPERTIES (
        'write.bloom-filter.enabled' = 'true',
        'write.bloom-filter.columns' = 'user_id',
        'write.bloom-filter.ndv' = '1000000'
    );

-- Z-ORDER INDEX: Multi-dimensional clustering
-- Delta Lake
OPTIMIZE delta.`/delta/events`
    ZORDER BY (user_id, event_type);

-- Iceberg
ALTER TABLE iceberg.events
    WRITE ORDERED BY event_type, user_id;

-- Analyze index effectiveness
EXPLAIN ANALYZE
SELECT * FROM fact_orders
WHERE order_date = '2025-01-15' AND customer_id = 12345;

-- Check index usage
SELECT
    indexname,
    indexdef,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'fact_orders'
ORDER BY idx_scan DESC;

Partition Pruning Deep Dive

Partition Pruning Math

Column Selectivity: S = distinct_values / total_rows
Pruning Effectiveness: E = partitions_pruned / total_partitions
Index Selectivity: I = matching_rows / total_rows (lower = more selective)
Combined Pruning: Effective_I/O = Total × (1-E) × I × (1-Bloom_Reduction)
Optimal Strategy: Partition by low-selectivity (date), Index by high-selectivity (ID)

-- Partition pruning analysis
SELECT
    partition_name,
    num_rows,
    ROUND(bytes/1024/1024, 2) AS size_mb,
    high_value,
    low_value
FROM user_tab_partitions
WHERE table_name = 'FACT_ORDERS'
ORDER BY partition_position;

-- Compare query plans with and without partition pruning
-- Without pruning (full scan)
EXPLAIN
SELECT COUNT(*) FROM fact_orders WHERE customer_id = 12345;

-- With pruning (partition + index)
EXPLAIN
SELECT COUNT(*) FROM fact_orders
WHERE order_date = '2025-01-15' AND customer_id = 12345;

-- Partition statistics
SELECT
    p.partition_name,
    p.num_rows,
    p.last_analyzed,
    s Blocks,
    s.Size_MB,
    s.Compression_Ratio
FROM user_tab_partitions p
JOIN (
    SELECT
        partition_name,
        ROUND(SUM(bytes)/1024/1024, 2) AS Size_MB,
        COUNT(*) AS Blocks,
        AVG(compression_ratio) AS Compression_Ratio
    FROM user_tab_part_statistics
    GROUP BY partition_name
) s ON p.partition_name = s.partition_name
ORDER BY p.partition_position;

Partition Pruning Deep Dive

Partition pruning eliminates entire partitions from query execution based on filter conditions on the partition column. It is the most impactful optimization for large tables, reducing data scanned by orders of magnitude.

Partition Pruning Math

Total Partitions: P = total_number_of_partitions
Partitions Scanned: p = partitions_passing_filter
Pruning Ratio: PR = (P - p) / P × 100%
I/O Reduction: Data_Scanned = Total_Data × (p / P)
Speedup Factor: SF ≈ 1 / (p / P) for I/O-bound queries
Example: 1000 partitions, filter matches 10 -> PR = 99% -> 100x speedup

-- Analyze partition pruning effectiveness
SELECT
    query_id,
    query_text,
    partitions_scanned,
    partitions_total,
    ROUND(partitions_scanned * 100.0 / partitions_total, 2) AS scan_pct,
    CASE
        WHEN partitions_scanned * 100.0 / partitions_total < 10 THEN 'EXCELLENT pruning'
        WHEN partitions_scanned * 100.0 / partitions_total < 50 THEN 'GOOD pruning'
        WHEN partitions_scanned * 100.0 / partitions_total < 90 THEN 'MODERATE pruning'
        ELSE 'POOR pruning - check partition filter'
    END AS pruning_assessment
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD(day, -7, CURRENT_DATE())
  AND partitions_total > 100
ORDER BY scan_pct DESC
LIMIT 20;

-- Common partition pruning anti-patterns
-- ANTI-PATTERN: Function on partition column prevents pruning
SELECT * FROM fact_orders
WHERE YEAR(order_date) = 2025;  -- Full scan!

-- CORRECT: Range filter enables pruning
SELECT * FROM fact_orders
WHERE order_date >= '2025-01-01' AND order_date < '2026-01-01';

-- ANTI-PATTERN: OR with non-partition predicates
SELECT * FROM fact_orders
WHERE order_date = '2025-01-15' OR customer_id = 123;  -- May not prune

-- CORRECT: Use AND with partition column
SELECT * FROM fact_orders
WHERE order_date = '2025-01-15' AND customer_id = 123;  -- Prunes to single partition

Key Concepts Summary

Strategy	Column Type	Pruning Method	Best For	Drawback
Range Partition	Date/Timestamp	Partition elimination	Time-series queries	Manual maintenance
Hash Partition	Any high-cardinality	Even distribution	JOIN distribution	No range pruning
List Partition	Categorical	Partition elimination	Regional/segment queries	Limited values
B-Tree Index	High-cardinality	Point/range lookup	Equality queries	Write overhead
Bitmap Index	Low-cardinality	Boolean operations	Complex filters	Large storage
Bloom Filter	Any	File/partition skip	Membership tests	False positives
Z-Order	Multi-column	Multi-dim clustering	Multi-column filters	Rebuild cost
Covering Index	N/A	Index-only scan	Specific query patterns	Storage overhead

Performance Metrics

Optimization	Query Speedup	Storage Cost	Write Overhead
Range Partition (date)	10-100x	+1-2%	Low
Hash Partition	2-5x	+1-2%	Low
B-Tree Index	10-1000x	+10-30%	Medium
Bitmap Index	5-50x	+20-50%	Medium
Bloom Filter	2-10x	+5-10%	Low
Z-Order	2-5x	+5-10%	High
Composite Index	10-100x	+15-40%	High

10 Best Practices

Partition by date first — time-range queries are the most common pattern in analytics
Keep partition count between 100-10,000 — too few = large scans, too many = metadata overhead
Use hash partitioning for high-cardinality JOIN columns to distribute data evenly
Create B-tree indexes on columns used in WHERE clauses with high selectivity
Use bitmap indexes for low-cardinality columns (< 1000 distinct values)
Implement Bloom filters on columns used in point lookups for large tables
Monitor partition pruning with EXPLAIN plans — ensure partition filters are applied
Avoid over-indexing — each index adds write overhead and storage cost
Use Z-Ordering for multi-column filter patterns in Delta Lake / Iceberg
Regularly analyze table statistics — update stats after large data loads

Range partitioning by date is the single most impactful optimization for analytics queries
Partition pruning can reduce data scanned by 90-99% for time-range queries
B-tree indexes excel at point lookups; bitmap indexes excel at boolean filters
Bloom filters provide probabilistic skip capabilities with minimal storage overhead
Combining partitioning + indexing + Z-Ordering achieves maximum query performance

Partitioning & Indexing Strategies for Data Warehouses

Partitioning & Indexing: Optimizing Data Access Patterns

Why Partitioning & Indexing Matter

Architecture Overview

Partitioning Strategies

Partition Pruning Efficiency

Indexing Strategies

Partition Pruning Deep Dive

Partition Pruning Math

Partition Pruning Deep Dive

Partition Pruning Math

Key Concepts Summary

Performance Metrics

10 Best Practices

See Also

Premium Content

Need Expert Data Engineering Help?