πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Partitioning & Indexing Strategies for Data Warehouses

Module 3: Data Warehouses & StoragePerformance Optimization🟒 Free Lesson

Advertisement

Partitioning & Indexing: Optimizing Data Access Patterns

Partitioning divides tables into segments based on column values, while indexing creates metadata structures that enable fast lookups. Together, they determine query performance by minimizing data scanned.

Why Partitioning & Indexing Matter

Without partitioning, a query scanning a 1 TB table reads all 1 TB. With effective partitioning on a date column, the same query might scan only 3 GB (one day). This 300x reduction in I/O directly translates to faster queries and lower costs.


Architecture Overview


Partitioning Strategies ComparisonRange PartitionRHSONColumn: order_dateSplit by time rangesBest for: Time-seriesPruning: Range eliminationP0: Jan-Mar | P1: Apr-Jun | P2: Jul-SepP3: Oct-Dec | P4: Next year...Hash Partitionhash(){'\u2192'}P0P1P2P3Even distribution across N bucketsColumn: user_idhash(col) mod N partitionsBest for: JOIN distributionPruning: None (uniform)hash(user_id) mod 16 = bucketUniform: P0=P1=P2=P3 sizeList PartitionUSEUAPACP0: US,CA,MXP1: UK,DE,FRP2: JP,AU,INColumn: regionExplicit value listsBest for: Regional dataPruning: Value eliminationWHERE region = 'US' {'\u2192'} scan P0 only

Partitioning Strategies

Range partitioning divides data into partitions based on a range of values in a column (e.g., date ranges). Each partition contains rows where the partition column falls within a specific range.

Hash partitioning distributes data across partitions using a hash function on the partition column. Rows are assigned to partitions based on hash(column) mod num_partitions, ensuring uniform distribution.

List partitioning assigns rows to partitions based on explicit value lists (e.g., region = 'US', 'EU', 'APAC'). Each partition contains rows matching one or more values in the list.

Partition Pruning Efficiency

  • Total Partitions: P
  • Partitions Scanned: p (after pruning)
  • Pruning Ratio: PR = (P - p) / P Γ— 100%
  • I/O Reduction: Data_Scanned = Total_Size Γ— (p / P)
  • Query Speedup: Approximately 1 / (p/P) for I/O-bound queries
  • Optimal Partition Count: 100 ≀ P ≀ 10,000 (too few = large partitions, too many = metadata overhead)
  • Max Partition Size: 1-4 GB (balance between pruning and file management)
-- RANGE PARTITIONING: By date (most common pattern)
CREATE TABLE fact_orders (
    order_id        BIGINT,
    customer_id     INT,
    order_date      DATE NOT NULL,
    net_amount      DECIMAL(14,2),
    status          VARCHAR(20)
)
PARTITION BY RANGE (order_date);

-- PostgreSQL syntax
CREATE TABLE fact_orders_pg (
    order_id        BIGINT,
    customer_id     INT,
    order_date      DATE NOT NULL,
    net_amount      DECIMAL(14,2)
) PARTITION BY RANGE (order_date);

CREATE TABLE fact_orders_2024_q1 PARTITION OF fact_orders_pg
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE TABLE fact_orders_2024_q2 PARTITION OF fact_orders_pg
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

-- HASH PARTITIONING: Even distribution for high-cardinality columns
CREATE TABLE fact_events (
    event_id        BIGINT,
    user_id         BIGINT NOT NULL,
    event_type      VARCHAR(50),
    event_timestamp TIMESTAMP
)
PARTITION BY HASH (user_id) PARTITIONS 16;

-- PostgreSQL hash partitioning
CREATE TABLE fact_events_pg (
    event_id        BIGINT,
    user_id         BIGINT NOT NULL,
    event_type      VARCHAR(50)
) PARTITION BY HASH (user_id);

CREATE TABLE fact_events_p0 PARTITION OF fact_events_pg
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
CREATE TABLE fact_events_p1 PARTITION OF fact_events_pg
    FOR VALUES WITH (MODULUS 16, REMAINDER 1);

-- LIST PARTITIONING: By categorical value
CREATE TABLE fact_sales (
    order_id        BIGINT,
    region          VARCHAR(20) NOT NULL,
    order_date      DATE,
    net_amount      DECIMAL(14,2)
)
PARTITION BY LIST (region);

CREATE TABLE fact_sales_north_america PARTITION OF fact_sales
    FOR VALUES IN ('US', 'CA', 'MX');
CREATE TABLE fact_sales_europe PARTITION OF fact_sales
    FOR VALUES IN ('UK', 'DE', 'FR', 'ES');
CREATE TABLE fact_sales_apac PARTITION OF fact_sales
    FOR VALUES IN ('JP', 'AU', 'IN', 'SG');

-- Sub-partitioning: Range + Hash
CREATE TABLE fact_clickstream (
    click_id        BIGINT,
    user_id         BIGINT NOT NULL,
    event_date      DATE NOT NULL,
    event_type      VARCHAR(50)
)
PARTITION BY RANGE (event_date)
SUBPARTITION BY HASH (user_id) SUBPARTITIONS 8;

-- Partition pruning: Query must include partition column
-- GOOD: Prunes to single partition
SELECT * FROM fact_orders WHERE order_date = '2025-01-15';

-- GOOD: Prunes to range of partitions
SELECT * FROM fact_orders WHERE order_date BETWEEN '2025-01-01' AND '2025-01-31';

-- BAD: Full table scan (no partition filter)
SELECT * FROM fact_orders WHERE customer_id = 12345;

-- Analyze partition distribution
SELECT
    partition_name,
    num_rows,
    ROUND(bytes / 1024 / 1024, 2) AS size_mb,
    last_analyzed
FROM user_tab_partitions
WHERE table_name = 'FACT_ORDERS'
ORDER BY partition_name;

Index Types ComparisonB-Tree IndexRoot Node{'< 50'}50-100{'> 100'}Leaf LLeaf L1Leaf L2Leaf L3O(log n) lookupBest for: Point + RangeWHERE id = 12345 {'\u2192'} O(log n)WHERE id BETWEEN 1 AND 100High cardinality columnsBitmap Indexactive1 0 1 1 0 1 0 1pending0 1 0 0 1 0 1 0shipped0 0 0 0 0 0 0 0cancelled0 0 0 0 0 0 0 0Boolean operations: AND, ORBest for: Low cardinalityactive AND region = 'US' {'\u2192'} ANDBitmap(active) AND Bitmap(US)Low cardinality: {'<'} 1000 valuesBloom FilterBit Array: [0,1,0,1,1,0,0,1]hash(user_id) {'\u2192'} set bitsQuery: check bits {'\u2192'} likely yes/noProbabilistic membership testFalse positives, no negativesSkip files that definitely missReduces I/O by 50-90%Point lookups on large tables

Indexing Strategies

A B-Tree index is a balanced tree data structure that maintains sorted data and enables O(log n) lookup, insertion, and deletion. It is optimal for equality and range queries on columns with high cardinality.

A Bitmap index stores a bitmap for each distinct value in a column. Each bit in the bitmap represents a row, and is set to 1 if the row contains that value. Bitmap indexes are optimal for low-cardinality columns used in complex boolean queries.

A Bloom filter is a probabilistic data structure that tests whether an element is a member of a set. It may produce false positives but never false negatives. In data systems, Bloom filters skip files/partitions that definitely don't contain the queried value.

-- B-TREE INDEX: Equality and range queries
CREATE INDEX idx_orders_customer ON fact_orders (customer_id);
CREATE INDEX idx_orders_date ON fact_orders (order_date);
CREATE INDEX idx_orders_composite ON fact_orders (order_date, customer_id);

-- Partial index: Only index active orders
CREATE INDEX idx_orders_active ON fact_orders (customer_id)
    WHERE status = 'active';

-- Covering index: Include all columns needed by query
CREATE INDEX idx_orders_covering ON fact_orders (order_date, customer_id)
    INCLUDE (net_amount, status);

-- BITMAP INDEX: Low-cardinality columns
CREATE BITMAP INDEX idx_orders_status ON fact_orders (status);
CREATE BITMAP INDEX idx_orders_region ON fact_sales (region);

-- Bitmap index for complex boolean queries
-- SELECT * FROM fact_orders
-- WHERE status IN ('pending', 'processing') AND region = 'US'
-- -> Bitmap OR for status, Bitmap AND for region

-- BLOOM FILTER INDEX: Membership tests
-- Spark/Delta Lake syntax
CREATE BLOOMFILTER INDEX ON delta.`/delta/events`
    FOR COLUMNS(user_id, event_type);

-- Iceberg Bloom filter
ALTER TABLE iceberg.events
    SET PROPERTIES (
        'write.bloom-filter.enabled' = 'true',
        'write.bloom-filter.columns' = 'user_id',
        'write.bloom-filter.ndv' = '1000000'
    );

-- Z-ORDER INDEX: Multi-dimensional clustering
-- Delta Lake
OPTIMIZE delta.`/delta/events`
    ZORDER BY (user_id, event_type);

-- Iceberg
ALTER TABLE iceberg.events
    WRITE ORDERED BY event_type, user_id;

-- Analyze index effectiveness
EXPLAIN ANALYZE
SELECT * FROM fact_orders
WHERE order_date = '2025-01-15' AND customer_id = 12345;

-- Check index usage
SELECT
    indexname,
    indexdef,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'fact_orders'
ORDER BY idx_scan DESC;

Partition Pruning Deep Dive

Partition Pruning Math

  • Column Selectivity: S = distinct_values / total_rows
  • Pruning Effectiveness: E = partitions_pruned / total_partitions
  • Index Selectivity: I = matching_rows / total_rows (lower = more selective)
  • Combined Pruning: Effective_I/O = Total Γ— (1-E) Γ— I Γ— (1-Bloom_Reduction)
  • Optimal Strategy: Partition by low-selectivity (date), Index by high-selectivity (ID)
-- Partition pruning analysis
SELECT
    partition_name,
    num_rows,
    ROUND(bytes/1024/1024, 2) AS size_mb,
    high_value,
    low_value
FROM user_tab_partitions
WHERE table_name = 'FACT_ORDERS'
ORDER BY partition_position;

-- Compare query plans with and without partition pruning
-- Without pruning (full scan)
EXPLAIN
SELECT COUNT(*) FROM fact_orders WHERE customer_id = 12345;

-- With pruning (partition + index)
EXPLAIN
SELECT COUNT(*) FROM fact_orders
WHERE order_date = '2025-01-15' AND customer_id = 12345;

-- Partition statistics
SELECT
    p.partition_name,
    p.num_rows,
    p.last_analyzed,
    s Blocks,
    s.Size_MB,
    s.Compression_Ratio
FROM user_tab_partitions p
JOIN (
    SELECT
        partition_name,
        ROUND(SUM(bytes)/1024/1024, 2) AS Size_MB,
        COUNT(*) AS Blocks,
        AVG(compression_ratio) AS Compression_Ratio
    FROM user_tab_part_statistics
    GROUP BY partition_name
) s ON p.partition_name = s.partition_name
ORDER BY p.partition_position;

Partition Pruning Deep Dive

Partition pruning eliminates entire partitions from query execution based on filter conditions on the partition column. It is the most impactful optimization for large tables, reducing data scanned by orders of magnitude.

Partition Pruning Math

  • Total Partitions: P = total_number_of_partitions
  • Partitions Scanned: p = partitions_passing_filter
  • Pruning Ratio: PR = (P - p) / P Γ— 100%
  • I/O Reduction: Data_Scanned = Total_Data Γ— (p / P)
  • Speedup Factor: SF β‰ˆ 1 / (p / P) for I/O-bound queries
  • Example: 1000 partitions, filter matches 10 -> PR = 99% -> 100x speedup
-- Analyze partition pruning effectiveness
SELECT
    query_id,
    query_text,
    partitions_scanned,
    partitions_total,
    ROUND(partitions_scanned * 100.0 / partitions_total, 2) AS scan_pct,
    CASE
        WHEN partitions_scanned * 100.0 / partitions_total < 10 THEN 'EXCELLENT pruning'
        WHEN partitions_scanned * 100.0 / partitions_total < 50 THEN 'GOOD pruning'
        WHEN partitions_scanned * 100.0 / partitions_total < 90 THEN 'MODERATE pruning'
        ELSE 'POOR pruning - check partition filter'
    END AS pruning_assessment
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD(day, -7, CURRENT_DATE())
  AND partitions_total > 100
ORDER BY scan_pct DESC
LIMIT 20;

-- Common partition pruning anti-patterns
-- ANTI-PATTERN: Function on partition column prevents pruning
SELECT * FROM fact_orders
WHERE YEAR(order_date) = 2025;  -- Full scan!

-- CORRECT: Range filter enables pruning
SELECT * FROM fact_orders
WHERE order_date >= '2025-01-01' AND order_date < '2026-01-01';

-- ANTI-PATTERN: OR with non-partition predicates
SELECT * FROM fact_orders
WHERE order_date = '2025-01-15' OR customer_id = 123;  -- May not prune

-- CORRECT: Use AND with partition column
SELECT * FROM fact_orders
WHERE order_date = '2025-01-15' AND customer_id = 123;  -- Prunes to single partition

Key Concepts Summary

StrategyColumn TypePruning MethodBest ForDrawback
Range PartitionDate/TimestampPartition eliminationTime-series queriesManual maintenance
Hash PartitionAny high-cardinalityEven distributionJOIN distributionNo range pruning
List PartitionCategoricalPartition eliminationRegional/segment queriesLimited values
B-Tree IndexHigh-cardinalityPoint/range lookupEquality queriesWrite overhead
Bitmap IndexLow-cardinalityBoolean operationsComplex filtersLarge storage
Bloom FilterAnyFile/partition skipMembership testsFalse positives
Z-OrderMulti-columnMulti-dim clusteringMulti-column filtersRebuild cost
Covering IndexN/AIndex-only scanSpecific query patternsStorage overhead

Performance Metrics

OptimizationQuery SpeedupStorage CostWrite Overhead
Range Partition (date)10-100x+1-2%Low
Hash Partition2-5x+1-2%Low
B-Tree Index10-1000x+10-30%Medium
Bitmap Index5-50x+20-50%Medium
Bloom Filter2-10x+5-10%Low
Z-Order2-5x+5-10%High
Composite Index10-100x+15-40%High

10 Best Practices

  1. Partition by date first β€” time-range queries are the most common pattern in analytics
  2. Keep partition count between 100-10,000 β€” too few = large scans, too many = metadata overhead
  3. Use hash partitioning for high-cardinality JOIN columns to distribute data evenly
  4. Create B-tree indexes on columns used in WHERE clauses with high selectivity
  5. Use bitmap indexes for low-cardinality columns (< 1000 distinct values)
  6. Implement Bloom filters on columns used in point lookups for large tables
  7. Monitor partition pruning with EXPLAIN plans β€” ensure partition filters are applied
  8. Avoid over-indexing β€” each index adds write overhead and storage cost
  9. Use Z-Ordering for multi-column filter patterns in Delta Lake / Iceberg
  10. Regularly analyze table statistics β€” update stats after large data loads

  • Range partitioning by date is the single most impactful optimization for analytics queries
  • Partition pruning can reduce data scanned by 90-99% for time-range queries
  • B-tree indexes excel at point lookups; bitmap indexes excel at boolean filters
  • Bloom filters provide probabilistic skip capabilities with minimal storage overhead
  • Combining partitioning + indexing + Z-Ordering achieves maximum query performance

See Also

⭐

Premium Content

Partitioning & Indexing Strategies for Data Warehouses

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement