Snowflake Advanced Β· Interview Prep
Query Optimization & Clustering
Difficulty: Hard Β· Commonly asked at Google, Uber, Netflix
Interview Question
"A critical dashboard query that used to run in 5 seconds now takes 5 minutes. Walk me through your systematic approach to diagnose and fix the performance issue. How do clustering keys and micro-partition pruning help?"
βΉοΈ
Companies Asking This: Google (Senior Data Engineer), Uber (Staff Data Engineer), Netflix (Senior Data Engineer), Amazon (L6 Data Engineer)
Micro-Partition Pruning
Snowflake automatically divides data into micro-partitions (50-500MB compressed). Pruning eliminates micro-partitions that don't match query predicates, dramatically reducing data scanned.
How Pruning Works
-- Create a table with natural clustering
CREATE TABLE sales (
sale_id NUMBER AUTOINCREMENT,
sale_date DATE,
region VARCHAR(20),
product_category VARCHAR(50),
amount NUMBER(12,2),
customer_id VARCHAR(100)
);
-- Insert sample data
INSERT INTO sales (sale_date, region, product_category, amount, customer_id)
SELECT
DATEADD(day, -seq % 365, CURRENT_DATE()),
CASE MOD(seq, 5)
WHEN 0 THEN 'US-EAST' WHEN 1 THEN 'US-WEST'
WHEN 2 THEN 'EU' WHEN 3 THEN 'APAC' ELSE 'LATAM'
END,
CASE MOD(seq, 8)
WHEN 0 THEN 'Electronics' WHEN 1 THEN 'Clothing'
WHEN 2 THEN 'Books' WHEN 3 THEN 'Home'
WHEN 4 THEN 'Sports' WHEN 5 THEN 'Toys'
WHEN 6 THEN 'Automotive' ELSE 'Health'
END,
ROUND(RANDOM() * 1000, 2),
'cust_' || MOD(seq, 10000)
FROM TABLE(GENERATOR(ROWCOUNT => 10000000)) seq;
-- Check pruning efficiency
SELECT
query_id,
partitions_scanned,
partitions_total,
ROUND(partitions_scanned * 100.0 / NULLIF(partitions_total, 0), 2) AS scan_percentage,
bytes_scanned / (1024*1024) AS mb_scanned,
compilation_time_ms,
execution_time_ms,
total_elapsed_time_ms
FROM snowflake.account_usage.query_history
WHERE query_type = 'SELECT'
AND query_text LIKE '%sales%'
AND start_time >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time_ms DESC;
Clustering Keys
Clustering keys control how data is physically organized within micro-partitions, improving pruning for common query patterns.
Creating Clustering Keys
-- Add clustering key to existing table
ALTER TABLE sales CLUSTER BY (sale_date, region);
-- Create table with clustering key
CREATE TABLE sales_clustered (
sale_id NUMBER AUTOINCREMENT,
sale_date DATE,
region VARCHAR(20),
product_category VARCHAR(50),
amount NUMBER(12,2),
customer_id VARCHAR(100)
) CLUSTER BY (sale_date, region);
-- Check clustering depth and quality
SELECT
SYSTEM$CLUSTERING_DEPTH('sales_clustered') AS avg_depth,
SYSTEM$CLUSTERING_INFORMATION('sales_clustered') AS clustering_info;
-- Detailed clustering analysis
SELECT
sale_date,
region,
COUNT(*) AS row_count,
COUNT(DISTINCT product_category) AS categories,
SUM(amount) AS total_amount
FROM sales_clustered
GROUP BY 1, 2
ORDER BY 1, 2;
-- Monitor clustering operations
SELECT
query_id,
query_text,
start_time,
end_time,
execution_status,
bytes_spilled_to_local_storage,
bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE query_text LIKE '%CLUSTER%'
AND start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;
βΉοΈ
Key Insight: Clustering is most effective when query filters align with clustering keys. A query filtering on sale_date and region will benefit from clustering on (sale_date, region), but a query filtering only on product_category may not benefit at all.
Real-World Scenario: Google
Question: "How do you choose the right clustering key for a table that's queried both by date range and by region? These are often conflicting patterns."
Solution: Multi-Pattern Clustering Strategy
-- Analyze query patterns first
SELECT
query_text,
COUNT(*) AS execution_count,
AVG(total_elapsed_time_ms) AS avg_execution_time,
MAX(total_elapsed_time_ms) AS max_execution_time
FROM snowflake.account_usage.query_history
WHERE query_type = 'SELECT'
AND query_text LIKE '%sales%'
AND start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY query_text
ORDER BY execution_count DESC;
-- Pattern 1: Date-first clustering (good for date range queries)
CREATE TABLE sales_by_date CLONE sales;
ALTER TABLE sales_by_date CLUSTER BY (sale_date, region);
-- Pattern 2: Region-first clustering (good for region queries)
CREATE TABLE sales_by_region CLONE sales;
ALTER TABLE sales_by_region CLUSTER BY (region, sale_date);
-- Pattern 3: Composite clustering with trade-offs
-- Test different clustering combinations
CREATE TABLE sales_optimized CLONE sales;
-- Test clustering quality
ALTER TABLE sales_optimized CLUSTER BY (sale_date, region);
SELECT SYSTEM$CLUSTERING_DEPTH('sales_optimized');
-- If depth is too high, try different order
ALTER TABLE sales_optimized CLUSTER BY (region, sale_date);
SELECT SYSTEM$CLUSTERING_DEPTH('sales_optimized');
-- Pattern 4: Use materialized views for different access patterns
CREATE MATERIALIZED VIEW sales_mv_by_date AS
SELECT
sale_date,
region,
product_category,
COUNT(*) AS sale_count,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount
FROM sales
GROUP BY 1, 2, 3;
-- Create index-like structure for region queries
CREATE MATERIALIZED VIEW sales_mv_by_region AS
SELECT
region,
product_category,
DATE_TRUNC('month', sale_date) AS sale_month,
COUNT(*) AS sale_count,
SUM(amount) AS total_amount
FROM sales
GROUP BY 1, 2, 3;
Real-World Scenario: Uber
Question: "A query joins two large tables and spills to disk. How do you optimize this without changing the query logic?"
Query Performance Optimization
-- 1. Analyze the query with PROFILE
SELECT
s.sale_id,
s.sale_date,
s.amount,
c.customer_name,
c.customer_segment
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
WHERE s.sale_date >= '2024-01-01'
AND s.region = 'US-EAST';
-- Get query profile
SELECT
query_id,
query_text,
compilation_time_ms,
execution_time_ms,
total_elapsed_time_ms,
bytes_scanned,
bytes_written,
bytes_spilled_to_local_storage,
bytes_spilled_to_remote_storage,
partitions_scanned,
partitions_total
FROM snowflake.account_usage.query_history
WHERE query_text LIKE '%JOIN%customers%sales%'
AND start_time >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time_ms DESC
LIMIT 1;
-- 2. Check if clustering helps the join
-- Add clustering keys to both tables
ALTER TABLE sales CLUSTER BY (customer_id, sale_date);
ALTER TABLE customers CLUSTER BY (customer_id);
-- 3. Use query tags for monitoring
ALTER SESSION SET QUERY_TAG = '{"workload": "critical_dashboard", "priority": "high"}';
-- 4. Consider materialized view for frequent joins
CREATE MATERIALIZED VIEW sales_customer_mv AS
SELECT
s.sale_id,
s.sale_date,
s.amount,
s.region,
c.customer_name,
c.customer_segment,
c.customer_lifetime_value
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id;
-- 5. Monitor spill metrics
SELECT
query_id,
bytes_spilled_to_local_storage / (1024*1024) AS local_spill_mb,
bytes_spilled_to_remote_storage / (1024*1024) AS remote_spill_mb,
CASE
WHEN bytes_spilled_to_remote_storage > 0 THEN 'CRITICAL: Remote spill'
WHEN bytes_spilled_to_local_storage > 0 THEN 'WARNING: Local spill'
ELSE 'GOOD: No spill'
END AS spill_status
FROM snowflake.account_usage.query_history
WHERE query_id = 'your-query-id-here';
Materialized Views
-- Create materialized view for aggregated queries
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
sale_date,
region,
product_category,
COUNT(*) AS transaction_count,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_transaction,
COUNT(DISTINCT customer_id) AS unique_customers
FROM sales
GROUP BY 1, 2, 3;
-- Refresh happens automatically when base table changes
-- Check refresh status
SELECT
name,
refresh_start_time,
refresh_end_time,
state,
data_metric_query_frequency
FROM information_schema.materialized_view_refresh_history
WHERE name = 'DAILY_SALES_SUMMARY'
ORDER BY refresh_start_time DESC;
-- Compare performance: base table vs materialized view
-- Query against base table (slower)
SELECT
sale_date,
region,
SUM(amount)
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY 1, 2;
-- Query against materialized view (faster)
SELECT
sale_date,
region,
SUM(total_revenue)
FROM daily_sales_summary
WHERE sale_date >= '2024-01-01'
GROUP BY 1, 2;
Result Cache
-- Snowflake automatically caches query results
-- Same query within 24 hours returns cached result
SELECT * FROM sales WHERE sale_date = CURRENT_DATE();
-- Check if result was cached
SELECT
query_id,
query_text,
result_cache_hit, -- 1 = cached, 0 = not cached
compilation_time_ms,
execution_time_ms,
total_elapsed_time_ms
FROM snowflake.account_usage.query_history
WHERE query_text LIKE '%sale_date = CURRENT_DATE%'
AND start_time >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;
-- Force re-execution (bypass cache)
SELECT * FROM sales WHERE sale_date = CURRENT_DATE()
/* NO_RESULT_CACHE */;
-- Result cache is invalidated when:
-- 1. Underlying data changes
-- 2. Functions like CURRENT_DATE() are used (volatile)
-- 3. User role changes
-- 4. Schema changes
Performance Monitoring
-- 1. Identify slow queries
SELECT
query_id,
query_text,
user_name,
warehouse_name,
total_elapsed_time_ms / 1000 AS execution_seconds,
bytes_scanned / (1024*1024*1024) AS gb_scanned,
partitions_scanned,
partitions_total,
CASE
WHEN partitions_scanned = partitions_total THEN 'FULL SCAN'
WHEN partitions_scanned > partitions_total * 0.5 THEN 'POOR PRUNING'
ELSE 'GOOD PRUNING'
END AS pruning_quality
FROM snowflake.account_usage.query_history
WHERE query_type = 'SELECT'
AND start_time >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
AND total_elapsed_time_ms > 10000 -- > 10 seconds
ORDER BY total_elapsed_time_ms DESC
LIMIT 20;
-- 2. Analyze warehouse utilization
SELECT
warehouse_name,
AVG(queries_queued) AS avg_queue_depth,
AVG(queries_executing) AS avg_concurrent_queries,
SUM(credits_used) AS total_credits,
AVG(query_execution_time_ms) / 1000 AS avg_query_seconds
FROM snowflake.account_usage.warehouse_load_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY total_credits DESC;
-- 3. Monitor clustering quality over time
SELECT
TABLE_NAME,
SYSTEM$CLUSTERING_DEPTH(TABLE_NAME) AS avg_depth,
SYSTEM$CLUSTERING_INFORMATION(TABLE_NAME) AS clustering_info
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'PUBLIC'
AND CLUSTERING_KEY IS NOT NULL;
-- 4. Track query performance trends
SELECT
DATE(start_time) AS query_date,
AVG(total_elapsed_time_ms) / 1000 AS avg_seconds,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_elapsed_time_ms) / 1000 AS median_seconds,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_elapsed_time_ms) / 1000 AS p95_seconds,
COUNT(*) AS query_count
FROM snowflake.account_usage.query_history
WHERE query_type = 'SELECT'
AND start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 1;
Best Practices
| Issue | Solution |
|---|---|
| Full table scan | Add clustering key aligned with query filters |
| Disk spill | Increase warehouse size, reduce data volume with filters |
| Slow joins | Cluster both tables by join key |
| Repeated aggregations | Use materialized views |
| Cold cache | Use result cache, avoid volatile functions |
| High queue depth | Use multi-cluster warehouse |
β οΈ
Anti-Patterns:
- Over-clustering β Too many clustering keys increase write overhead
- Misaligned clustering β Keys that don't match query patterns waste resources
- Ignoring spill β Remote spill kills performance; increase warehouse size
- Not monitoring β Set up alerts for query time degradation