Indexing: B-Tree, Hash, GiST, BRIN, Partial, Covering

Interview Question: "When would you choose a BRIN index over a B-Tree? Explain covering indexes and index-only scans. How do partial indexes improve performance?" — Asked at Amazon, Pinterest, LinkedIn for Senior DBA roles

ℹ️

Difficulty: Advanced | Companies: Amazon, Pinterest, LinkedIn, Uber, Netflix | Time: 45-60 minutes

B-Tree Index Structure

B-Tree maintains sorted data with balanced tree properties:

\text{Height} = O(\log_B N)

Where $B$ is branching factor and $N$ is number of keys.

-- Create table for indexing experiments
CREATE TABLE product_sales (
    sale_id SERIAL PRIMARY KEY,
    product_id INT,
    category VARCHAR(50),
    sale_date DATE,
    quantity INT,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(12,2),
    region VARCHAR(20),
    status VARCHAR(20)
);

-- Generate 10M rows
INSERT INTO product_sales (product_id, category, sale_date, quantity, unit_price, total_amount, region, status)
SELECT 
    (random() * 10000)::INT + 1,
    (ARRAY['Electronics', 'Clothing', 'Food', 'Books', 'Home'])[floor(random()*5+1)],
    CURRENT_DATE - (random() * 1000)::INT,
    (random() * 100)::INT + 1,
    (random() * 500)::DECIMAL(10,2),
    (random() * 50000)::DECIMAL(12,2),
    (ARRAY['North', 'South', 'East', 'West'])[floor(random()*4+1)],
    (ARRAY['completed', 'pending', 'refunded'])[floor(random()*3+1)]
FROM generate_series(1, 10000000);

B-Tree Index Operations

-- Standard B-Tree index
CREATE INDEX idx_product_sales_product ON product_sales(product_id);

-- Composite B-Tree index
CREATE INDEX idx_product_sales_composite ON product_sales(category, sale_date, total_amount);

-- Check index structure
SELECT 
    indexname,
    indexdef,
    pg_size_pretty(pg_relation_size(indexname::regclass)) AS index_size
FROM pg_indexes
WHERE tablename = 'product_sales';

-- Analyze index usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product_sales 
WHERE product_id = 1234;

-- Range scan performance
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product_sales 
WHERE category = 'Electronics' 
    AND sale_date BETWEEN '2024-01-01' AND '2024-12-31';

B-Tree Complexity:

Operation	Time Complexity	Space Complexity
Search	$O(\log n)$	$O(n)$
Insert	$O(\log n)$	$O(n)$
Delete	$O(\log n)$	$O(n)$
Range Scan	$O(\log n + k)$	$O(k)$

ℹ️

Branch Factor: PostgreSQL B-Tree has ~400 keys per page with 8KB pages. Height of 3 can store ~64M keys.

Hash Index

Hash indexes provide $O(1)$ lookup for equality comparisons:

\text{Hash}(key) \rightarrow \text{bucket}

-- Hash index for equality lookups
CREATE INDEX idx_product_sales_hash_region ON product_sales USING hash(region);

-- Compare B-Tree vs Hash for equality
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM product_sales WHERE region = 'North';

-- Hash index only works with equality operators
-- This will NOT use hash index:
EXPLAIN SELECT * FROM product_sales WHERE region > 'North';

Hash vs B-Tree Comparison:

Feature	Hash	B-Tree
Equality	$O(1)$	$O(\log n)$
Range	Not supported	$O(\log n + k)$
Order	Not maintained	Maintained
Memory	Higher	Lower
Write	Faster	Slower

GiST Index for Complex Types

GiST (Generalized Search Tree) supports complex data types:

-- GiST index for range types
CREATE TABLE reservations (
    reservation_id SERIAL PRIMARY KEY,
    room_id INT,
    during TSRANGE,
    guest_name VARCHAR(100)
);

-- Create GiST index on range column
CREATE INDEX idx_reservations_during ON reservations USING gist(during);

-- Query for overlapping ranges
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM reservations 
WHERE during @> '2024-06-15'::date;

-- Find overlapping reservations
EXPLAIN (ANALYZE, BUFFERS)
SELECT r1.*, r2.* 
FROM reservations r1
JOIN reservations r2 ON r1.during && r2.during
WHERE r1.reservation_id < r2.reservation_id;

GiST Operator Classes:

Type	Operators	Use Case
Range	@>, <@, &&, <<, >>	Temporal queries
Geometric	@>, <@, &&, ~, @	Spatial queries
Full-text	@@, @>	Text search
Array	@>, <@	Array containment

BRIN Index for Large Tables

BRIN (Block Range Index) stores min/max per block range:

\text{Size} = O\left(\frac{N}{\text{block\_range\_size}}\right)

-- BRIN index for time-series data
CREATE TABLE sensor_data (
    sensor_id INT,
    reading_time TIMESTAMPTZ,
    value DECIMAL(10,4)
);

-- Generate time-series data (correlated with insertion order)
INSERT INTO sensor_data
SELECT 
    (random() * 100)::INT + 1,
    NOW() - (random() * 365 * 24 * 60 * 60)::INT * INTERVAL '1 second',
    (random() * 100)::DECIMAL(10,4)
FROM generate_series(1, 10000000);

-- B-Tree index (large)
CREATE INDEX idx_sensor_btreen ON sensor_data(reading_time);
-- Size: ~200MB

-- BRIN index (small)
CREATE INDEX idx_sensor_brin ON sensor_data USING brin(reading_time)
    WITH (pages_per_range = 32);
-- Size: ~100KB

-- Compare performance
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sensor_data 
WHERE reading_time BETWEEN '2024-06-01' AND '2024-06-30';

BRIN vs B-Tree Size Comparison:

Rows	B-Tree Size	BRIN Size	Ratio
1M	21MB	24KB	875x
10M	210MB	240KB	875x
100M	2.1GB	2.4MB	875x

⚠️

BRIN Prerequisite: Data must be physically correlated with the indexed column. Random data makes BRIN ineffective.

Partial Index

Partial indexes reduce size by indexing only subset of rows:

\text{Size} \propto \text{selectivity} \times \text{full index size}

-- Partial index for active orders only
CREATE INDEX idx_orders_active ON orders(order_date, total_amount)
WHERE status = 'pending';

-- Partial index with expression
CREATE INDEX idx_orders_high_value ON orders(total_amount)
WHERE total_amount > 10000;

-- Compare query plans
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders 
WHERE status = 'pending' AND order_date > '2024-01-01';

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders 
WHERE status = 'shipped' AND order_date > '2024-01-01';

-- Check partial index size
SELECT 
    indexname,
    pg_size_pretty(pg_relation_size(indexname::regclass)) AS size
FROM pg_indexes
WHERE tablename = 'orders'
    AND indexname LIKE 'idx_orders_%';

Partial Index Selectivity:

\text{Selectivity} = \frac{\text{rows matching WHERE}}{\text{total rows}}

\text{Index Size Reduction} = 1 - \text{Selectivity}

Covering Index (INCLUDE)

Covering indexes include additional columns to enable index-only scans:

-- Covering index with INCLUDE
CREATE INDEX idx_orders_covering ON orders(customer_id, order_date)
    INCLUDE (total_amount, status);

-- Query can be satisfied entirely from index
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, order_date, total_amount, status
FROM orders 
WHERE customer_id = 1234
    AND order_date > '2024-01-01';

-- Compare with non-covering index
CREATE INDEX idx_orders_non_covering ON orders(customer_id, order_date);

EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, order_date, total_amount, status
FROM orders 
WHERE customer_id = 1234
    AND order_date > '2024-01-01';

Index-Only Scan Benefits:

Metric	Table Scan	Index Scan	Index-Only Scan
I/O	High	Medium	Low
CPU	Low	Medium	Medium
Memory	Low	Medium	High

ℹ️

Covering Index Formula: For query $Q$ with columns $C_{where}$ and $C_{select}$ , optimal covering index is:

\text{Index}(C_{where}) \text{ INCLUDE } (C_{select} \setminus C_{where})

Expression Index

-- Index on expression
CREATE INDEX idx_orders_upper_status ON orders(UPPER(status));

-- Index on function result
CREATE INDEX idx_orders_year ON orders(EXTRACT(YEAR FROM order_date));

-- Check if expression index is used
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders 
WHERE UPPER(status) = 'PENDING';

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders 
WHERE EXTRACT(YEAR FROM order_date) = 2024;

Multi-Column Index Ordering

The order of columns in composite index matters:

\text{Index}(A, B, C) \text{ supports:} \\ \quad A \\ \quad A, B \\ \quad A, B, C \\ \quad \text{But not } B \text{ or } C \text{ alone}

-- Optimal column ordering for selective predicates
CREATE INDEX idx_product_category_date ON product_sales(category, sale_date);

-- This uses the index efficiently
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product_sales 
WHERE category = 'Electronics' AND sale_date > '2024-01-01';

-- This only uses category prefix
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product_sales 
WHERE category = 'Electronics';

-- This cannot use the index
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product_sales 
WHERE sale_date > '2024-01-01';

Index Maintenance

-- Monitor index bloat
SELECT 
    indexname,
    pg_size_pretty(pg_relation_size(indexname::regclass)) AS size,
    idx_scan AS times_used,
    pg_size_pretty(pg_relation_size(indexname::regclass) - 
        (SELECT pg_relation_size(indexname::regclass) * 0.3)) AS bloat_estimate
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexname::regclass) DESC;

-- Rebuild bloated indexes
REINDEX INDEX idx_product_sales_composite;

-- Check index fragmentation
SELECT 
    indexname,
    round(100 * pg_relation_size(indexname::regclass) / 
        nullif(pg_relation_size('product_sales'::regclass), 0), 2) AS index_ratio
FROM pg_indexes
WHERE tablename = 'product_sales';

Index Selection Algorithm

The optimizer evaluates indexes using:

\text{Score}(I) = \frac{C_{seq} - C_{idx}}{C_{idx} \times \text{write\_overhead}}

\text{Optimal Index} = \arg\max_{I \in \text{candidates}} \text{Score}(I)

Decision Matrix:

Query Pattern	Recommended Index	Avoid
Equality on 1 column	B-Tree or Hash	GiST
Range on 1 column	B-Tree	Hash
Equality + Range	Composite B-Tree	Separate indexes
Full-text search	GiST with tsvector	B-Tree
Large time-series	BRIN	B-Tree
Low-selectivity filter	Partial index	Full index

⚠️

Index Anti-Patterns:

Over-indexing: Too many indexes slow down writes
Wrong column order in composite indexes
Not using partial indexes for filtered queries
Ignoring index bloat and maintenance

Advanced: Index-Only Scan with Visibility Map

-- Check visibility map status
SELECT 
    relname,
    pg_size_pretty(pg_relation_size(relname::regclass)) AS table_size,
    pg_size_pretty(pg_total_relation_size(relname::regclass)) AS total_size,
    CASE 
        WHEN pg_relation_size(relname::regclass) > 1073741824 THEN 'Consider BRIN'
        WHEN pg_relation_size(relname::regclass) > 104857600 THEN 'Monitor bloat'
        ELSE 'OK'
    END AS recommendation
FROM pg_class
WHERE relkind = 'r'
    AND relnamespace = 'public'::regnamespace
ORDER BY pg_relation_size(relname::regclass) DESC
LIMIT 10;

ℹ️

Pro Tip: Use pg_stat_user_indexes to find unused indexes. Drop them to improve write performance.