Snowflake Advanced · Interview Prep

Query Optimization & Clustering

Difficulty: Hard · Commonly asked at Google, Uber, Netflix

Interview Question

"A critical dashboard query that used to run in 5 seconds now takes 5 minutes. Walk me through your systematic approach to diagnose and fix the performance issue. How do clustering keys and micro-partition pruning help?"

ℹ️

Companies Asking This: Google (Senior Data Engineer), Uber (Staff Data Engineer), Netflix (Senior Data Engineer), Amazon (L6 Data Engineer)

Micro-Partition Pruning

Snowflake automatically divides data into micro-partitions (50-500MB compressed). Pruning eliminates micro-partitions that don't match query predicates, dramatically reducing data scanned.

How Pruning Works

-- Create a table with natural clustering
CREATE TABLE sales (
    sale_id NUMBER AUTOINCREMENT,
    sale_date DATE,
    region VARCHAR(20),
    product_category VARCHAR(50),
    amount NUMBER(12,2),
    customer_id VARCHAR(100)
);

-- Insert sample data
INSERT INTO sales (sale_date, region, product_category, amount, customer_id)
SELECT 
    DATEADD(day, -seq % 365, CURRENT_DATE()),
    CASE MOD(seq, 5) 
        WHEN 0 THEN 'US-EAST' WHEN 1 THEN 'US-WEST' 
        WHEN 2 THEN 'EU' WHEN 3 THEN 'APAC' ELSE 'LATAM'
    END,
    CASE MOD(seq, 8) 
        WHEN 0 THEN 'Electronics' WHEN 1 THEN 'Clothing'
        WHEN 2 THEN 'Books' WHEN 3 THEN 'Home'
        WHEN 4 THEN 'Sports' WHEN 5 THEN 'Toys'
        WHEN 6 THEN 'Automotive' ELSE 'Health'
    END,
    ROUND(RANDOM() * 1000, 2),
    'cust_' || MOD(seq, 10000)
FROM TABLE(GENERATOR(ROWCOUNT => 10000000)) seq;

-- Check pruning efficiency
SELECT 
    query_id,
    partitions_scanned,
    partitions_total,
    ROUND(partitions_scanned * 100.0 / NULLIF(partitions_total, 0), 2) AS scan_percentage,
    bytes_scanned / (1024*1024) AS mb_scanned,
    compilation_time_ms,
    execution_time_ms,
    total_elapsed_time_ms
FROM snowflake.account_usage.query_history
WHERE query_type = 'SELECT'
  AND query_text LIKE '%sales%'
  AND start_time >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time_ms DESC;

Clustering Keys

Clustering keys control how data is physically organized within micro-partitions, improving pruning for common query patterns.

Creating Clustering Keys

-- Add clustering key to existing table
ALTER TABLE sales CLUSTER BY (sale_date, region);

-- Create table with clustering key
CREATE TABLE sales_clustered (
    sale_id NUMBER AUTOINCREMENT,
    sale_date DATE,
    region VARCHAR(20),
    product_category VARCHAR(50),
    amount NUMBER(12,2),
    customer_id VARCHAR(100)
) CLUSTER BY (sale_date, region);

-- Check clustering depth and quality
SELECT 
    SYSTEM$CLUSTERING_DEPTH('sales_clustered') AS avg_depth,
    SYSTEM$CLUSTERING_INFORMATION('sales_clustered') AS clustering_info;

-- Detailed clustering analysis
SELECT 
    sale_date,
    region,
    COUNT(*) AS row_count,
    COUNT(DISTINCT product_category) AS categories,
    SUM(amount) AS total_amount
FROM sales_clustered
GROUP BY 1, 2
ORDER BY 1, 2;

-- Monitor clustering operations
SELECT 
    query_id,
    query_text,
    start_time,
    end_time,
    execution_status,
    bytes_spilled_to_local_storage,
    bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE query_text LIKE '%CLUSTER%'
  AND start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;

ℹ️

Key Insight: Clustering is most effective when query filters align with clustering keys. A query filtering on sale_date and region will benefit from clustering on (sale_date, region), but a query filtering only on product_category may not benefit at all.

Real-World Scenario: Google

Question: "How do you choose the right clustering key for a table that's queried both by date range and by region? These are often conflicting patterns."

Solution: Multi-Pattern Clustering Strategy

-- Analyze query patterns first
SELECT 
    query_text,
    COUNT(*) AS execution_count,
    AVG(total_elapsed_time_ms) AS avg_execution_time,
    MAX(total_elapsed_time_ms) AS max_execution_time
FROM snowflake.account_usage.query_history
WHERE query_type = 'SELECT'
  AND query_text LIKE '%sales%'
  AND start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY query_text
ORDER BY execution_count DESC;

-- Pattern 1: Date-first clustering (good for date range queries)
CREATE TABLE sales_by_date CLONE sales;
ALTER TABLE sales_by_date CLUSTER BY (sale_date, region);

-- Pattern 2: Region-first clustering (good for region queries)
CREATE TABLE sales_by_region CLONE sales;
ALTER TABLE sales_by_region CLUSTER BY (region, sale_date);

-- Pattern 3: Composite clustering with trade-offs
-- Test different clustering combinations
CREATE TABLE sales_optimized CLONE sales;

-- Test clustering quality
ALTER TABLE sales_optimized CLUSTER BY (sale_date, region);
SELECT SYSTEM$CLUSTERING_DEPTH('sales_optimized');

-- If depth is too high, try different order
ALTER TABLE sales_optimized CLUSTER BY (region, sale_date);
SELECT SYSTEM$CLUSTERING_DEPTH('sales_optimized');

-- Pattern 4: Use materialized views for different access patterns
CREATE MATERIALIZED VIEW sales_mv_by_date AS
SELECT 
    sale_date,
    region,
    product_category,
    COUNT(*) AS sale_count,
    SUM(amount) AS total_amount,
    AVG(amount) AS avg_amount
FROM sales
GROUP BY 1, 2, 3;

-- Create index-like structure for region queries
CREATE MATERIALIZED VIEW sales_mv_by_region AS
SELECT 
    region,
    product_category,
    DATE_TRUNC('month', sale_date) AS sale_month,
    COUNT(*) AS sale_count,
    SUM(amount) AS total_amount
FROM sales
GROUP BY 1, 2, 3;

Real-World Scenario: Uber

Question: "A query joins two large tables and spills to disk. How do you optimize this without changing the query logic?"

Query Performance Optimization

-- 1. Analyze the query with PROFILE
SELECT 
    s.sale_id,
    s.sale_date,
    s.amount,
    c.customer_name,
    c.customer_segment
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
WHERE s.sale_date >= '2024-01-01'
  AND s.region = 'US-EAST';

-- Get query profile
SELECT 
    query_id,
    query_text,
    compilation_time_ms,
    execution_time_ms,
    total_elapsed_time_ms,
    bytes_scanned,
    bytes_written,
    bytes_spilled_to_local_storage,
    bytes_spilled_to_remote_storage,
    partitions_scanned,
    partitions_total
FROM snowflake.account_usage.query_history
WHERE query_text LIKE '%JOIN%customers%sales%'
  AND start_time >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time_ms DESC
LIMIT 1;

-- 2. Check if clustering helps the join
-- Add clustering keys to both tables
ALTER TABLE sales CLUSTER BY (customer_id, sale_date);
ALTER TABLE customers CLUSTER BY (customer_id);

-- 3. Use query tags for monitoring
ALTER SESSION SET QUERY_TAG = '{"workload": "critical_dashboard", "priority": "high"}';

-- 4. Consider materialized view for frequent joins
CREATE MATERIALIZED VIEW sales_customer_mv AS
SELECT 
    s.sale_id,
    s.sale_date,
    s.amount,
    s.region,
    c.customer_name,
    c.customer_segment,
    c.customer_lifetime_value
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id;

-- 5. Monitor spill metrics
SELECT 
    query_id,
    bytes_spilled_to_local_storage / (1024*1024) AS local_spill_mb,
    bytes_spilled_to_remote_storage / (1024*1024) AS remote_spill_mb,
    CASE 
        WHEN bytes_spilled_to_remote_storage > 0 THEN 'CRITICAL: Remote spill'
        WHEN bytes_spilled_to_local_storage > 0 THEN 'WARNING: Local spill'
        ELSE 'GOOD: No spill'
    END AS spill_status
FROM snowflake.account_usage.query_history
WHERE query_id = 'your-query-id-here';

Materialized Views

-- Create materialized view for aggregated queries
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT 
    sale_date,
    region,
    product_category,
    COUNT(*) AS transaction_count,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_transaction,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM sales
GROUP BY 1, 2, 3;

-- Refresh happens automatically when base table changes
-- Check refresh status
SELECT 
    name,
    refresh_start_time,
    refresh_end_time,
    state,
    data_metric_query_frequency
FROM information_schema.materialized_view_refresh_history
WHERE name = 'DAILY_SALES_SUMMARY'
ORDER BY refresh_start_time DESC;

-- Compare performance: base table vs materialized view
-- Query against base table (slower)
SELECT 
    sale_date,
    region,
    SUM(amount)
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY 1, 2;

-- Query against materialized view (faster)
SELECT 
    sale_date,
    region,
    SUM(total_revenue)
FROM daily_sales_summary
WHERE sale_date >= '2024-01-01'
GROUP BY 1, 2;

Result Cache

-- Snowflake automatically caches query results
-- Same query within 24 hours returns cached result
SELECT * FROM sales WHERE sale_date = CURRENT_DATE();

-- Check if result was cached
SELECT 
    query_id,
    query_text,
    result_cache_hit,  -- 1 = cached, 0 = not cached
    compilation_time_ms,
    execution_time_ms,
    total_elapsed_time_ms
FROM snowflake.account_usage.query_history
WHERE query_text LIKE '%sale_date = CURRENT_DATE%'
  AND start_time >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;

-- Force re-execution (bypass cache)
SELECT * FROM sales WHERE sale_date = CURRENT_DATE()
    /* NO_RESULT_CACHE */;

-- Result cache is invalidated when:
-- 1. Underlying data changes
-- 2. Functions like CURRENT_DATE() are used (volatile)
-- 3. User role changes
-- 4. Schema changes

Performance Monitoring

-- 1. Identify slow queries
SELECT 
    query_id,
    query_text,
    user_name,
    warehouse_name,
    total_elapsed_time_ms / 1000 AS execution_seconds,
    bytes_scanned / (1024*1024*1024) AS gb_scanned,
    partitions_scanned,
    partitions_total,
    CASE 
        WHEN partitions_scanned = partitions_total THEN 'FULL SCAN'
        WHEN partitions_scanned > partitions_total * 0.5 THEN 'POOR PRUNING'
        ELSE 'GOOD PRUNING'
    END AS pruning_quality
FROM snowflake.account_usage.query_history
WHERE query_type = 'SELECT'
  AND start_time >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
  AND total_elapsed_time_ms > 10000  -- > 10 seconds
ORDER BY total_elapsed_time_ms DESC
LIMIT 20;

-- 2. Analyze warehouse utilization
SELECT 
    warehouse_name,
    AVG(queries_queued) AS avg_queue_depth,
    AVG(queries_executing) AS avg_concurrent_queries,
    SUM(credits_used) AS total_credits,
    AVG(query_execution_time_ms) / 1000 AS avg_query_seconds
FROM snowflake.account_usage.warehouse_load_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY total_credits DESC;

-- 3. Monitor clustering quality over time
SELECT 
    TABLE_NAME,
    SYSTEM$CLUSTERING_DEPTH(TABLE_NAME) AS avg_depth,
    SYSTEM$CLUSTERING_INFORMATION(TABLE_NAME) AS clustering_info
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'PUBLIC'
  AND CLUSTERING_KEY IS NOT NULL;

-- 4. Track query performance trends
SELECT 
    DATE(start_time) AS query_date,
    AVG(total_elapsed_time_ms) / 1000 AS avg_seconds,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_elapsed_time_ms) / 1000 AS median_seconds,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_elapsed_time_ms) / 1000 AS p95_seconds,
    COUNT(*) AS query_count
FROM snowflake.account_usage.query_history
WHERE query_type = 'SELECT'
  AND start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 1;

Best Practices

Issue	Solution
Full table scan	Add clustering key aligned with query filters
Disk spill	Increase warehouse size, reduce data volume with filters
Slow joins	Cluster both tables by join key
Repeated aggregations	Use materialized views
Cold cache	Use result cache, avoid volatile functions
High queue depth	Use multi-cluster warehouse

⚠️

Anti-Patterns:

Over-clustering — Too many clustering keys increase write overhead
Misaligned clustering — Keys that don't match query patterns waste resources
Ignoring spill — Remote spill kills performance; increase warehouse size
Not monitoring — Set up alerts for query time degradation

Topic: Query Optimization & Clustering