Query Optimization: Cost Model, Statistics, Histograms

Interview Question: "Explain the difference between a cost-based optimizer and a rule-based optimizer. How does PostgreSQL estimate row counts? What are statistics and how do they affect query plans?" — Asked at Oracle, Microsoft, Snowflake for Database Engineer roles

ℹ️

Difficulty: Advanced | Companies: Oracle, Microsoft, Snowflake, Databricks, Amazon Redshift | Time: 60-75 minutes

Cost Model Fundamentals

The query optimizer estimates cost using:

\text{Total Cost} = \text{I/O Cost} + \text{CPU Cost} + \text{Network Cost}

For a sequential scan:

C_{seq} = \text{pages} \times \text{seq\_page\_cost}

For an index scan:

C_{idx} = \text{index\_pages} \times \text{random\_page\_cost} + \text{rows} \times \text{cpu\_tuple\_cost}

-- Create sample tables for optimization analysis
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INT,
    order_date DATE,
    total_amount DECIMAL(12,2),
    status VARCHAR(20),
    region VARCHAR(50)
);

CREATE TABLE order_items (
    item_id SERIAL PRIMARY KEY,
    order_id INT REFERENCES orders(order_id),
    product_id INT,
    quantity INT,
    unit_price DECIMAL(10,2)
);

CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(255),
    region VARCHAR(50),
    credit_limit DECIMAL(12,2)
);

-- Generate sample data (1M orders)
INSERT INTO orders (customer_id, order_date, total_amount, status, region)
SELECT 
    (random() * 10000)::INT + 1,
    CURRENT_DATE - (random() * 365)::INT,
    (random() * 10000)::DECIMAL(12,2),
    (ARRAY['pending', 'shipped', 'delivered', 'cancelled'])[floor(random()*4+1)],
    (ARRAY['North', 'South', 'East', 'West'])[floor(random()*4+1)]
FROM generate_series(1, 1000000);

-- Generate order items
INSERT INTO order_items (order_id, product_id, quantity, unit_price)
SELECT 
    order_id,
    (random() * 1000)::INT + 1,
    (random() * 10)::INT + 1,
    (random() * 500)::DECIMAL(10,2)
FROM orders, generate_series(1, (random() * 5)::INT + 1);

-- Generate customers
INSERT INTO customers (name, email, region, credit_limit)
SELECT 
    'Customer ' || i,
    'customer' || i || '@example.com',
    (ARRAY['North', 'South', 'East', 'West'])[floor(random()*4+1)],
    (random() * 50000)::DECIMAL(12,2)
FROM generate_series(1, 10000) i;

Understanding EXPLAIN Output

-- Basic EXPLAIN analysis
EXPLAIN SELECT * FROM orders WHERE region = 'North';

Output:

Architecture Diagram

Seq Scan on orders  (cost=0.00..18334.00 rows=250000 width=48)
  Filter: (region = 'North')
  Rows Removed by Filter: 750000

Cost Breakdown:

0.00 = startup cost (before first row)
18334.00 = total cost (all rows processed)
250000 = estimated rows
48 = estimated row width in bytes

-- Detailed EXPLAIN with ANALYZE
EXPLAIN ANALYZE 
SELECT o.order_id, o.total_amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.region = 'North'
    AND o.total_amount > 1000;

Output:

Architecture Diagram

Hash Join  (cost=2567.89..19456.78 rows=62500 width=64) (actual time=12.345..89.012 rows=62489 loops=1)
  Hash Cond: (o.customer_id = c.customer_id)
  ->  Seq Scan on orders o  (cost=0.00..15678.90 rows=250000 width=16) (actual time=0.012..45.678 rows=250000 loops=1)
        Filter: ((region = 'North') AND (total_amount > 1000))
        Rows Removed by Filter: 750000
  ->  Hash  (cost=1234.56..1234.56 rows=10000 width=48) (actual time=8.901..8.901 rows=10000 loops=1)
        Buckets: 16384  Batches: 1  Memory Usage: 1234kB
        ->  Seq Scan on customers c  (cost=0.00..1234.56 rows=10000 width=48) (actual time=0.005..4.567 rows=10000 loops=1)
Planning Time: 0.234 ms
Execution Time: 89.567 ms

ℹ️

Key Metrics: Compare estimated rows vs actual rows. Large discrepancies indicate stale statistics or poor estimates.

Statistics and Histograms

-- View table statistics
SELECT 
    attname,
    null_frac,
    avg_width,
    n_distinct,
    correlation,
    most_common_vals,
    most_common_freqs,
    histogram_bounds
FROM pg_stats
WHERE tablename = 'orders'
    AND attname = 'region';

-- Manually update statistics
ANALYZE orders;

-- Check statistics target
SELECT 
    attname,
    attstattarget,
    pg_stats.ext_stats
FROM pg_attribute
JOIN pg_stats ON attname = attname
WHERE attrelid = 'orders'::regclass;

Histogram Representation:

For a column with values $[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]$ :

\text{Histogram} = \{[1,3], [4,6], [7,8], [9,10]\}

Each bucket contains approximately $\frac{N}{\text{buckets}}$ values.

-- Create custom statistics for better estimates
CREATE STATISTICS orders_region_status (region, status) ON orders;

-- Check correlation statistics
SELECT 
    correlation,
    -- Physical correlation affects index scan cost
    CASE 
        WHEN correlation > 0.8 THEN 'High correlation - index scan efficient'
        WHEN correlation > 0.5 THEN 'Medium correlation'
        ELSE 'Low correlation - index scan less efficient'
    END AS analysis
FROM pg_stats
WHERE tablename = 'orders'
    AND attname = 'order_date';

Cost Calculation Formulas

Sequential Scan Cost:

C_{seq} = \text{pages} \times \text{seq\_page\_cost} + \text{rows} \times \text{cpu\_tuple\_cost}

Index Scan Cost:

C_{idx} = \text{leaf\_pages} \times \text{random\_page\_cost} + \text{rows} \times (\text{cpu\_operator\_cost} + \text{cpu\_tuple\_cost})

Hash Join Cost:

C_{hash} = 3 \times (C_{build} + C_{probe})

Merge Join Cost:

C_{merge} = C_{sort\_outer} + C_{sort\_inner} + \text{rows} \times \text{cpu\_operator\_cost}

-- Compare different join strategies
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.*, c.*
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > '2024-01-01';

-- Force specific join method
SET enable_hashjoin = off;
SET enable_mergejoin = off;
-- Only nested loop remains

-- Reset
RESET enable_hashjoin;
RESET enable_mergejoin;

Index Selection Algorithm

The optimizer selects indexes based on:

\text{Selectivity} = \frac{\text{matching rows}}{\text{total rows}}

\text{Index Benefit} = C_{seq} - C_{idx}

-- Create various indexes for comparison
CREATE INDEX idx_orders_region ON orders(region);
CREATE INDEX idx_orders_date ON orders(order_date);
CREATE INDEX idx_orders_composite ON orders(region, order_date, total_amount);
CREATE INDEX idx_orders_partial ON orders(total_amount) WHERE status = 'shipped';

-- Compare plans with different indexes
EXPLAIN SELECT * FROM orders WHERE region = 'North' AND order_date > '2024-06-01';

-- Drop indexes to see plan changes
DROP INDEX idx_orders_region;
EXPLAIN SELECT * FROM orders WHERE region = 'North' AND order_date > '2024-06-01';

Index Cost Comparison:

Index Type	Random I/O	Sequential I/O	Memory
Seq Scan	0	High	Low
B-Tree	Medium	Low	Medium
Hash	Low	Low	High
GiST	High	Low	Medium

⚠️

Index Overhead: Each index adds write overhead. The optimizer must balance read performance vs write cost. Monitor index usage with pg_stat_user_indexes.

Advanced Cost Estimation

Correlation Factor:

\rho = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}

Selectivity Estimation:

\text{sel}(a < x < b) = \frac{\text{histogram\_selectivity}(a, b)}{\text{total\_rows}}

-- Analyze selectivity of different predicates
EXPLAIN SELECT * FROM orders 
WHERE total_amount BETWEEN 1000 AND 5000;

EXPLAIN SELECT * FROM orders 
WHERE total_amount > 1000;

EXPLAIN SELECT * FROM orders 
WHERE status = 'shipped' AND region = 'North';

-- Check estimated vs actual for different ranges
EXPLAIN ANALYZE SELECT * FROM orders 
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';

EXPLAIN ANALYZE SELECT * FROM orders 
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';

Query Plan Cache Analysis

-- Check plan cache statistics
SELECT 
    query,
    calls,
    total_time,
    mean_time,
    rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

-- Analyze plan stability
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM orders WHERE region = 'North';

-- Compare with different parameter values
PREPARE test_query AS SELECT * FROM orders WHERE region = $1;
EXPLAIN ANALYZE EXECUTE test_query('North');
EXPLAIN ANALYZE EXECUTE test_query('South');

Performance Tuning Checklist

Statistics Freshness: ANALYZE after major data changes
Index Usage: Check pg_stat_user_indexes
Join Order: Verify optimal join sequence
Memory Settings: Adjust work_mem, shared_buffers
Parallel Queries: Enable parallel workers for large scans

ℹ️

Pro Tip: Use EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) with tools like pganalyze or explain.dalibo.com for visual plan analysis.

Mathematical Model for Plan Selection

The optimizer uses a dynamic programming approach:

C_{best}(S) = \min_{T \in \text{plans}(S)} C(T)

Where $S$ is the query and $T$ is a candidate plan.

For joins with $n$ tables:

\text{Complexity} = O(n!)

With pruning and heuristics:

\text{Complexity} = O(2^n)

⚠️

Cardinality Estimation Errors: The #1 cause of bad query plans. Monitor with pg_stat_statements and tune statistics targets for critical columns.