πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Topic: Query Optimization with EXPLAIN/ANALYZE for FAANG Interviews

SQL AdvancedQuery Optimization⭐ Premium

Advertisement

⚑ Query Optimization & EXPLAIN

Google & Uber Interview Deep Dive

🏒 Google🏒 Uber⚑ Difficulty: Hard⏱️ 45 min

πŸ“‹ Interview Question

β„ΉοΈπŸ”΄ Google/Uber Interview Question

"Given a slow-performing query on a large orders table (100M+ rows), walk through your optimization process using EXPLAIN ANALYZE. Identify the bottlenecks, suggest improvements, and verify the improvements. What metrics would you track?"

Companies: Google, Uber | Difficulty: Hard | Time: 45 minutes

πŸ“Š Setup: Large Dataset

-- Create a large orders table
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL,
    product_id INT NOT NULL,
    order_date TIMESTAMP NOT NULL,
    quantity INT DEFAULT 1,
    unit_price DECIMAL(10, 2),
    total_amount DECIMAL(12, 2),
    status VARCHAR(20) DEFAULT 'pending',
    shipping_address TEXT,
    payment_method VARCHAR(30)
);

-- Create indexes (we'll analyze which help)
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_order_date ON orders(order_date);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_composite ON orders(customer_id, order_date);

-- Insert 100M rows (using generate_series for speed)
INSERT INTO orders (customer_id, product_id, order_date, quantity, unit_price, total_amount, status)
SELECT
    (random() * 1000000)::INT + 1,
    (random() * 50000)::INT + 1,
    '2020-01-01'::timestamp + (random() * 1825 || ' days')::interval,
    (random() * 10)::INT + 1,
    (random() * 1000)::decimal(10,2),
    (random() * 10000)::decimal(12,2),
    CASE (random() * 4)::INT
        WHEN 0 THEN 'pending'
        WHEN 1 THEN 'shipped'
        WHEN 2 THEN 'delivered'
        WHEN 3 THEN 'cancelled'
        ELSE 'refunded'
    END
FROM generate_series(1, 10000000);

πŸ” Part 1: Reading EXPLAIN Output

Basic EXPLAIN

-- Simple EXPLAIN shows the query plan without executing
EXPLAIN
SELECT * FROM orders WHERE customer_id = 12345;

Output:

Architecture Diagram
Seq Scan on orders  (cost=0.00..183347.00 rows=10 width=124)
  Filter: (customer_id = 12345)

β„ΉοΈπŸ” Reading EXPLAIN Output

  • cost: Estimated startup cost..total cost (in arbitrary units)
  • rows: Estimated number of rows returned
  • width: Estimated average width of rows in bytes
  • Seq Scan: Full table scan (usually bad for large tables)
  • Index Scan: Using an index (usually good)

EXPLAIN with Index Usage

-- Same query with index
EXPLAIN
SELECT * FROM orders WHERE customer_id = 12345;

Output (with index):

Architecture Diagram
Index Scan using idx_orders_customer_id on orders  (cost=0.43..8.45 rows=10 width=124)
  Index Cond: (customer_id = 12345)

EXPLAIN ANALYZE (Actual Execution)

-- EXPLAIN ANALYZE actually runs the query
EXPLAIN ANALYZE
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_spent,
    AVG(total_amount) AS avg_order_value
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
HAVING COUNT(*) > 5
ORDER BY total_spent DESC
LIMIT 10;

Sample Output:

Architecture Diagram
Limit  (cost=2847532.43..2847532.45 rows=10 width=20) (actual time=45231.234..45231.256 rows=10 loops=1)
  ->  Sort  (cost=2847532.43..2847532.68 rows=100 width=20) (actual time=45231.230..45231.243 rows=10 loops=1)
        Sort Key: (sum(orders.total_amount)) DESC
        Sort Method: top-N heapsort  Memory: 25kB
        ->  HashAggregate  (cost=2847520.00..2847530.00 rows=1000 width=20) (actual time=45230.100..45230.890 rows=1234 loops=1)
              Group Key: customer_id
              Filter: (count(*) > 5)
              Batches: 1  Memory Usage: 241kB
              ->  Seq Scan on orders  (cost=0.00..2013470.00 rows=5000000 width=12) (actual time=0.012..18234.567 rows=5000000 loops=1)
                    Filter: (order_date >= '2024-01-01')
                    Rows Removed by Filter: 5000000
Planning Time: 0.085 ms
Execution Time: 45231.312 ms

πŸ“ˆ Part 2: Key Metrics to Analyze

Understanding Key Values

⚠️⚠️ Critical EXPLAIN ANALYZE Metrics

1. Rows Removed by Filter: High values mean many rows are scanned but not used β€” indicates missing indexes.

2. Actual Time vs Planning Time: High planning time may indicate complex queries; high execution time needs optimization.

3. Memory Usage: HashAggregate memory can indicate if work_mem needs increase.

4. Sort Method: "external merge" means disk-based sorting β€” increase work_mem or optimize ORDER BY.

-- Analyze specific bottleneck
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01'
AND o.status = 'shipped';

Enhanced Output with BUFFERS:

Architecture Diagram
Nested Loop  (cost=0.87..123456.78 rows=1000 width=50) (actual time=0.025..1234.567 rows=1000 loops=1)
  Buffers: shared hit=54321 read=1234
  ->  Index Scan using idx_orders_status_date on orders o  (cost=0.43..98765.43 rows=1000 width=20) (actual time=0.018..987.654 rows=1000 loops=1)
        Index Cond: ((status = 'shipped') AND (order_date >= '2024-01-01'))
        Buffers: shared hit=43210 read=1234
  ->  Index Scan using customers_pkey on customers c  (cost=0.29..8.31 rows=1 width=30) (actual time=0.002..0.003 rows=1 loops=1000)
        Index Cond: (customer_id = o.customer_id)
        Buffers: shared hit=3000
  ->  Index Scan using products_pkey on products p  (cost=0.29..16.61 rows=1 width=30) (actual time=0.002..0.003 rows=1 loops=1000)
        Index Cond: (product_id = o.product_id)
        Buffers: shared hit=2000
Planning Time: 0.125 ms
Execution Time: 1234.789 ms

πŸ”§ Part 3: Common Bottlenecks & Solutions

Bottleneck 1: Sequential Scans on Large Tables

-- BAD: Full table scan
EXPLAIN ANALYZE
SELECT * FROM orders WHERE total_amount > 5000;

-- Solution: Create appropriate index
CREATE INDEX idx_orders_amount ON orders(total_amount);

-- Verify improvement
EXPLAIN ANALYZE
SELECT * FROM orders WHERE total_amount > 5000;

Bottleneck 2: Nested Loop on Large Datasets

-- BAD: Nested loop with large intermediate result
EXPLAIN ANALYZE
SELECT *
FROM orders o
WHERE o.customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE order_date >= '2024-01-01'
);

-- Solution 1: Use JOIN instead
EXPLAIN ANALYZE
SELECT DISTINCT o.*
FROM orders o
JOIN orders o2 ON o.customer_id = o2.customer_id
WHERE o2.order_date >= '2024-01-01';

-- Solution 2: Use EXISTS
EXPLAIN ANALYZE
SELECT *
FROM orders o
WHERE EXISTS (
    SELECT 1
    FROM orders o2
    WHERE o2.customer_id = o.customer_id
    AND o2.order_date >= '2024-01-01'
);

Bottleneck 3: Sort Operations

-- BAD: Large sort without index
EXPLAIN ANALYZE
SELECT * FROM orders
ORDER BY order_date DESC
LIMIT 100;

-- Solution: Create covering index
CREATE INDEX idx_orders_date_desc ON orders(order_date DESC);

-- Or use INCLUDE for covering index (PostgreSQL 11+)
CREATE INDEX idx_orders_date_covering
ON orders(order_date DESC)
INCLUDE (customer_id, total_amount, status);

Bottleneck 4: Hash Aggregate Memory

-- BAD: Hash aggregate with high memory usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, COUNT(*), SUM(total_amount)
FROM orders
GROUP BY customer_id;

-- Solution: Increase work_mem for this session
SET work_mem = '256MB';

-- Or optimize with pre-aggregation
CREATE MATERIALIZED VIEW customer_order_summary AS
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id;

πŸ“Š Part 4: EXPLAIN Options

-- VERBOSE: Shows detailed output
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 123;

-- FORMAT options: TEXT, JSON, YAML, XML
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT * FROM orders WHERE customer_id = 123;

-- SUMMARY: Shows planning/execution statistics
EXPLAIN (ANALYZE, SUMMARY)
SELECT * FROM orders WHERE customer_id = 123;

-- COSTS: Toggle cost display
EXPLAIN (ANALYZE, COSTS FALSE)
SELECT * FROM orders WHERE customer_id = 123;

-- TIMING: Toggle timing information
EXPLAIN (ANALYZE, TIMING FALSE)
SELECT * FROM orders WHERE customer_id = 123;

-- WAL: Show WAL (Write-Ahead Log) information
EXPLAIN (ANALYZE, WAL)
INSERT INTO orders SELECT * FROM orders WHERE customer_id = 123;

🎯 Part 5: Query Rewriting Techniques

πŸ’‘πŸ’‘ Query Rewriting Tips

  1. Replace IN with JOIN or EXISTS
  2. Move complex subqueries to CTEs
  3. Use UNION ALL instead of UNION when duplicates don't matter
  4. Avoid SELECT * β€” specify only needed columns
  5. Push predicates as close to source as possible

Example: Rewriting Complex Queries

-- ORIGINAL: Slow query with IN clause
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE order_date >= '2024-01-01'
    AND status = 'shipped'
)
AND order_date < '2024-01-01';

-- REWRITTEN: Using CTE and JOIN
WITH recent_shipped AS (
    SELECT DISTINCT customer_id
    FROM orders
    WHERE order_date >= '2024-01-01'
    AND status = 'shipped'
)
EXPLAIN ANALYZE
SELECT o.*
FROM orders o
JOIN recent_shipped rs ON o.customer_id = rs.customer_id
WHERE o.order_date < '2024-01-01';

Example: Eliminating Subqueries

-- ORIGINAL: Correlated subquery
EXPLAIN ANALYZE
SELECT
    *,
    (SELECT COUNT(*) FROM orders o2 WHERE o2.customer_id = o1.customer_id) AS customer_order_count
FROM orders o1
WHERE order_date >= '2024-01-01';

-- REWRITTEN: Using window function
EXPLAIN ANALYZE
SELECT *,
    COUNT(*) OVER (PARTITION BY customer_id) AS customer_order_count
FROM orders
WHERE order_date >= '2024-01-01';

πŸ“ˆ Part 6: Statistics and ANALYZE

⚠️⚠️ Stale Statistics

PostgreSQL uses statistics to estimate costs. Stale statistics lead to poor plan choices. Run ANALYZE regularly on large tables.

-- Check table statistics
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Update statistics manually
ANALYZE orders;

-- Check index usage
SELECT
    schemaname,
    relname,
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'orders';

-- Check index bloat
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'orders';

πŸ—οΈ Part 7: Real-World Optimization Workflow

Step-by-Step Process

-- Step 1: Run EXPLAIN ANALYZE to establish baseline
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT ... ;

-- Step 2: Check for sequential scans
-- Look for: Seq Scan on large tables

-- Step 3: Check for nested loops with large row estimates
-- Look for: Nested Loop with high row estimates

-- Step 4: Check for sort operations
-- Look for: Sort with external merge or high memory

-- Step 5: Create indexes based on findings
CREATE INDEX idx_xxx ON table(column);

-- Step 6: Verify improvement
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

-- Step 7: Compare metrics
-- Before: actual time=X rows=Y
-- After: actual time=X' rows=Y'

Comparison Template

-- BEFORE optimization
EXPLAIN (ANALYZE, BUFFERS, SUMMARY)
SELECT
    DATE_TRUNC('month', order_date) AS month,
    customer_id,
    COUNT(*) AS order_count,
    SUM(total_amount) AS revenue
FROM orders
WHERE order_date >= '2023-01-01'
AND status != 'cancelled'
GROUP BY DATE_TRUNC('month', order_date), customer_id
HAVING SUM(total_amount) > 10000
ORDER BY revenue DESC;

-- Apply optimization...
CREATE INDEX idx_orders_status_date_amount
ON orders(status, order_date DESC)
INCLUDE (total_amount, customer_id);

-- AFTER optimization
EXPLAIN (ANALYZE, BUFFERS, SUMMARY)
SELECT ... ;  -- Same query

🎯 Quiz Section

πŸ† Best Practices for Interviews

πŸ’‘βœ… EXPLAIN Best Practices

1. Always Use ANALYZE for Actual Performance:

-- Planning-only EXPLAIN can be misleading
EXPLAIN SELECT ...;

-- Always verify with actual execution
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

2. Check Buffers for I/O Insights:

EXPLAIN (ANALYZE, BUFFERS)
-- High 'read' vs 'hit' indicates disk I/O
-- Consider increasing shared_buffers or optimizing queries

3. Use FORMAT for Programmatic Analysis:

EXPLAIN (ANALYZE, FORMAT JSON) SELECT ...;
-- Useful for automated monitoring tools

4. Compare Before and After:

  • Document baseline metrics
  • Create indexes
  • Re-run and compare
  • Measure improvement percentage

5. Watch for These Red Flags:

  • High row estimates vs actual (stale statistics)
  • Sort Method: external merge (insufficient work_mem)
  • Hash Batches > 1 (insufficient work_mem)
  • Nested Loop with high row estimates (missing index)

πŸ”— Common Follow-Up Questions

  1. "How do you identify if statistics are stale?" β€” Compare estimated vs actual rows in EXPLAIN output
  2. "When should you use a covering index?" β€” When query only needs columns in index (INCLUDE)
  3. "How do you handle queries that can't use indexes?" β€” Consider materialized views, pre-aggregation
  4. "What's the difference between index scan and index only scan?" β€” Index only scan reads from index without touching table

⚠️⚠️ Interview Tip

Always explain your thought process when reading EXPLAIN output. Interviewers want to see that you can:

  1. Identify the most expensive operation
  2. Understand why the planner chose that plan
  3. Suggest specific improvements
  4. Verify improvements with metrics

Advertisement