🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Indexing: B-Tree, Hash, GiST, BRIN, Partial, Covering

Advanced SQLIndexing Strategies⭐ Premium

Advertisement

Interview Question: "When would you choose a BRIN index over a B-Tree? Explain covering indexes and index-only scans. How do partial indexes improve performance?" — Asked at Amazon, Pinterest, LinkedIn for Senior DBA roles

ℹ️

Difficulty: Advanced | Companies: Amazon, Pinterest, LinkedIn, Uber, Netflix | Time: 45-60 minutes

B-Tree Index Structure

B-Tree maintains sorted data with balanced tree properties:

Height=O(logBN)\text{Height} = O(\log_B N)

Where BB is branching factor and NN is number of keys.

-- Create table for indexing experiments
CREATE TABLE product_sales (
    sale_id SERIAL PRIMARY KEY,
    product_id INT,
    category VARCHAR(50),
    sale_date DATE,
    quantity INT,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(12,2),
    region VARCHAR(20),
    status VARCHAR(20)
);

-- Generate 10M rows
INSERT INTO product_sales (product_id, category, sale_date, quantity, unit_price, total_amount, region, status)
SELECT 
    (random() * 10000)::INT + 1,
    (ARRAY['Electronics', 'Clothing', 'Food', 'Books', 'Home'])[floor(random()*5+1)],
    CURRENT_DATE - (random() * 1000)::INT,
    (random() * 100)::INT + 1,
    (random() * 500)::DECIMAL(10,2),
    (random() * 50000)::DECIMAL(12,2),
    (ARRAY['North', 'South', 'East', 'West'])[floor(random()*4+1)],
    (ARRAY['completed', 'pending', 'refunded'])[floor(random()*3+1)]
FROM generate_series(1, 10000000);

B-Tree Index Operations

-- Standard B-Tree index
CREATE INDEX idx_product_sales_product ON product_sales(product_id);

-- Composite B-Tree index
CREATE INDEX idx_product_sales_composite ON product_sales(category, sale_date, total_amount);

-- Check index structure
SELECT 
    indexname,
    indexdef,
    pg_size_pretty(pg_relation_size(indexname::regclass)) AS index_size
FROM pg_indexes
WHERE tablename = 'product_sales';

-- Analyze index usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product_sales 
WHERE product_id = 1234;

-- Range scan performance
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product_sales 
WHERE category = 'Electronics' 
    AND sale_date BETWEEN '2024-01-01' AND '2024-12-31';

B-Tree Complexity:

OperationTime ComplexitySpace Complexity
SearchO(logn)O(\log n)O(n)O(n)
InsertO(logn)O(\log n)O(n)O(n)
DeleteO(logn)O(\log n)O(n)O(n)
Range ScanO(logn+k)O(\log n + k)O(k)O(k)

ℹ️

Branch Factor: PostgreSQL B-Tree has ~400 keys per page with 8KB pages. Height of 3 can store ~64M keys.

Hash Index

Hash indexes provide O(1)O(1) lookup for equality comparisons:

Hash(key)bucket\text{Hash}(key) \rightarrow \text{bucket}
-- Hash index for equality lookups
CREATE INDEX idx_product_sales_hash_region ON product_sales USING hash(region);

-- Compare B-Tree vs Hash for equality
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM product_sales WHERE region = 'North';

-- Hash index only works with equality operators
-- This will NOT use hash index:
EXPLAIN SELECT * FROM product_sales WHERE region > 'North';

Hash vs B-Tree Comparison:

FeatureHashB-Tree
EqualityO(1)O(1)O(logn)O(\log n)
RangeNot supportedO(logn+k)O(\log n + k)
OrderNot maintainedMaintained
MemoryHigherLower
WriteFasterSlower

GiST Index for Complex Types

GiST (Generalized Search Tree) supports complex data types:

-- GiST index for range types
CREATE TABLE reservations (
    reservation_id SERIAL PRIMARY KEY,
    room_id INT,
    during TSRANGE,
    guest_name VARCHAR(100)
);

-- Create GiST index on range column
CREATE INDEX idx_reservations_during ON reservations USING gist(during);

-- Query for overlapping ranges
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM reservations 
WHERE during @> '2024-06-15'::date;

-- Find overlapping reservations
EXPLAIN (ANALYZE, BUFFERS)
SELECT r1.*, r2.* 
FROM reservations r1
JOIN reservations r2 ON r1.during && r2.during
WHERE r1.reservation_id < r2.reservation_id;

GiST Operator Classes:

TypeOperatorsUse Case
Range@>, <@, &&, <<, >>Temporal queries
Geometric@>, <@, &&, ~, @Spatial queries
Full-text@@, @>Text search
Array@>, <@Array containment

BRIN Index for Large Tables

BRIN (Block Range Index) stores min/max per block range:

Size=O(Nblock_range_size)\text{Size} = O\left(\frac{N}{\text{block\_range\_size}}\right)
-- BRIN index for time-series data
CREATE TABLE sensor_data (
    sensor_id INT,
    reading_time TIMESTAMPTZ,
    value DECIMAL(10,4)
);

-- Generate time-series data (correlated with insertion order)
INSERT INTO sensor_data
SELECT 
    (random() * 100)::INT + 1,
    NOW() - (random() * 365 * 24 * 60 * 60)::INT * INTERVAL '1 second',
    (random() * 100)::DECIMAL(10,4)
FROM generate_series(1, 10000000);

-- B-Tree index (large)
CREATE INDEX idx_sensor_btreen ON sensor_data(reading_time);
-- Size: ~200MB

-- BRIN index (small)
CREATE INDEX idx_sensor_brin ON sensor_data USING brin(reading_time)
    WITH (pages_per_range = 32);
-- Size: ~100KB

-- Compare performance
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sensor_data 
WHERE reading_time BETWEEN '2024-06-01' AND '2024-06-30';

BRIN vs B-Tree Size Comparison:

RowsB-Tree SizeBRIN SizeRatio
1M21MB24KB875x
10M210MB240KB875x
100M2.1GB2.4MB875x

⚠️

BRIN Prerequisite: Data must be physically correlated with the indexed column. Random data makes BRIN ineffective.

Partial Index

Partial indexes reduce size by indexing only subset of rows:

Sizeselectivity×full index size\text{Size} \propto \text{selectivity} \times \text{full index size}
-- Partial index for active orders only
CREATE INDEX idx_orders_active ON orders(order_date, total_amount)
WHERE status = 'pending';

-- Partial index with expression
CREATE INDEX idx_orders_high_value ON orders(total_amount)
WHERE total_amount > 10000;

-- Compare query plans
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders 
WHERE status = 'pending' AND order_date > '2024-01-01';

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders 
WHERE status = 'shipped' AND order_date > '2024-01-01';

-- Check partial index size
SELECT 
    indexname,
    pg_size_pretty(pg_relation_size(indexname::regclass)) AS size
FROM pg_indexes
WHERE tablename = 'orders'
    AND indexname LIKE 'idx_orders_%';

Partial Index Selectivity:

Selectivity=rows matching WHEREtotal rows\text{Selectivity} = \frac{\text{rows matching WHERE}}{\text{total rows}}
Index Size Reduction=1Selectivity\text{Index Size Reduction} = 1 - \text{Selectivity}

Covering Index (INCLUDE)

Covering indexes include additional columns to enable index-only scans:

-- Covering index with INCLUDE
CREATE INDEX idx_orders_covering ON orders(customer_id, order_date)
    INCLUDE (total_amount, status);

-- Query can be satisfied entirely from index
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, order_date, total_amount, status
FROM orders 
WHERE customer_id = 1234
    AND order_date > '2024-01-01';

-- Compare with non-covering index
CREATE INDEX idx_orders_non_covering ON orders(customer_id, order_date);

EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, order_date, total_amount, status
FROM orders 
WHERE customer_id = 1234
    AND order_date > '2024-01-01';

Index-Only Scan Benefits:

MetricTable ScanIndex ScanIndex-Only Scan
I/OHighMediumLow
CPULowMediumMedium
MemoryLowMediumHigh

ℹ️

Covering Index Formula: For query QQ with columns CwhereC_{where} and CselectC_{select}, optimal covering index is:

Index(Cwhere) INCLUDE (CselectCwhere)\text{Index}(C_{where}) \text{ INCLUDE } (C_{select} \setminus C_{where})

Expression Index

-- Index on expression
CREATE INDEX idx_orders_upper_status ON orders(UPPER(status));

-- Index on function result
CREATE INDEX idx_orders_year ON orders(EXTRACT(YEAR FROM order_date));

-- Check if expression index is used
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders 
WHERE UPPER(status) = 'PENDING';

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders 
WHERE EXTRACT(YEAR FROM order_date) = 2024;

Multi-Column Index Ordering

The order of columns in composite index matters:

Index(A,B,C) supports:AA,BA,B,CBut not B or C alone\text{Index}(A, B, C) \text{ supports:} \\ \quad A \\ \quad A, B \\ \quad A, B, C \\ \quad \text{But not } B \text{ or } C \text{ alone}
-- Optimal column ordering for selective predicates
CREATE INDEX idx_product_category_date ON product_sales(category, sale_date);

-- This uses the index efficiently
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product_sales 
WHERE category = 'Electronics' AND sale_date > '2024-01-01';

-- This only uses category prefix
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product_sales 
WHERE category = 'Electronics';

-- This cannot use the index
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product_sales 
WHERE sale_date > '2024-01-01';

Index Maintenance

-- Monitor index bloat
SELECT 
    indexname,
    pg_size_pretty(pg_relation_size(indexname::regclass)) AS size,
    idx_scan AS times_used,
    pg_size_pretty(pg_relation_size(indexname::regclass) - 
        (SELECT pg_relation_size(indexname::regclass) * 0.3)) AS bloat_estimate
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexname::regclass) DESC;

-- Rebuild bloated indexes
REINDEX INDEX idx_product_sales_composite;

-- Check index fragmentation
SELECT 
    indexname,
    round(100 * pg_relation_size(indexname::regclass) / 
        nullif(pg_relation_size('product_sales'::regclass), 0), 2) AS index_ratio
FROM pg_indexes
WHERE tablename = 'product_sales';

Index Selection Algorithm

The optimizer evaluates indexes using:

Score(I)=CseqCidxCidx×write_overhead\text{Score}(I) = \frac{C_{seq} - C_{idx}}{C_{idx} \times \text{write\_overhead}}
Optimal Index=argmaxIcandidatesScore(I)\text{Optimal Index} = \arg\max_{I \in \text{candidates}} \text{Score}(I)

Decision Matrix:

Query PatternRecommended IndexAvoid
Equality on 1 columnB-Tree or HashGiST
Range on 1 columnB-TreeHash
Equality + RangeComposite B-TreeSeparate indexes
Full-text searchGiST with tsvectorB-Tree
Large time-seriesBRINB-Tree
Low-selectivity filterPartial indexFull index

⚠️

Index Anti-Patterns:

  1. Over-indexing: Too many indexes slow down writes
  2. Wrong column order in composite indexes
  3. Not using partial indexes for filtered queries
  4. Ignoring index bloat and maintenance

Advanced: Index-Only Scan with Visibility Map

-- Check visibility map status
SELECT 
    relname,
    pg_size_pretty(pg_relation_size(relname::regclass)) AS table_size,
    pg_size_pretty(pg_total_relation_size(relname::regclass)) AS total_size,
    CASE 
        WHEN pg_relation_size(relname::regclass) > 1073741824 THEN 'Consider BRIN'
        WHEN pg_relation_size(relname::regclass) > 104857600 THEN 'Monitor bloat'
        ELSE 'OK'
    END AS recommendation
FROM pg_class
WHERE relkind = 'r'
    AND relnamespace = 'public'::regnamespace
ORDER BY pg_relation_size(relname::regclass) DESC
LIMIT 10;

ℹ️

Pro Tip: Use pg_stat_user_indexes to find unused indexes. Drop them to improve write performance.

Advertisement