Performance Optimization Framework
Partitioning Strategies
-- Time-unit partitioning (recommended)
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
OPTIONS (
partition_expiration_days = 730,
require_partition_filter = true
);
-- Integer-range partitioning
CREATE TABLE `project.dataset.sales`
PARTITION BY RANGE_BUCKET(sale_id, GENERATE_ARRAY(0, 10000000, 100000));
-- Partition pruning in queries
SELECT * FROM `project.dataset.events`
WHERE event_date >= '2025-01-01' -- Partition pruning
AND event_type = 'purchase'; -- Cluster pruning
Clustering Optimization
-- Clustering for multi-column filtering
CREATE TABLE `project.dataset.sales`
PARTITION BY DATE(order_date)
CLUSTER BY customer_id, product_id, region;
-- Cluster pruning in queries
SELECT
customer_id,
SUM(amount) as total
FROM `project.dataset.sales`
WHERE order_date = '2025-01-15'
AND customer_id = 'user_123' -- Cluster pruning
AND region = 'US'; -- Cluster pruning
Query Optimization
-- Bad: Full table scan
SELECT * FROM `project.dataset.events`;
-- Good: Partition and cluster pruning
SELECT
event_type,
COUNT(*) as count
FROM `project.dataset.events`
WHERE event_date = '2025-01-15'
GROUP BY 1;
-- Use approximate aggregation for large datasets
SELECT
APPROX_COUNT_DISTINCT(user_id) as approx_users
FROM `project.dataset.events`
WHERE event_date = '2025-01-15';
-- Optimize JOINs with broadcast hints
SELECT /*+ BROADCAST(p) */
o.order_id,
p.product_name
FROM `project.dataset.orders` o
JOIN `project.dataset.small_products` p ON o.product_id = p.product_id;
Slot Monitoring
-- Monitor slot usage
SELECT
job_creation_time,
job_id,
total_bytes_processed,
slot_ms_used,
ROUND(slot_ms_used / 1000 / 60, 2) as slot_minutes
FROM `region-us-central1`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE DATE(job_creation_time) = CURRENT_DATE()
ORDER BY slot_ms_used DESC;
-- Check slot utilization
SELECT
TIMESTAMP_TRUNC(timestamp, MINUTE) as minute,
AVG(num_slots) as avg_slots,
MAX(num_slots) as max_slots
FROM `region-us-central1`.INFORMATION_SCHEMA.JOBS_TIMELINE
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY 1
ORDER BY 1;
βΉοΈ
Pro Tip: Use require_partition_filter = true on all partitioned tables to prevent expensive full-table scans. Set maximum_bytes_billed at the session level to cap query costs. Monitor slot utilization to right-size committed slot purchases.
Common Interview Questions
Q1: What is the difference between partitioning and clustering?
Answer: Partitioning divides tables into segments (max 4,000), best for date/time data. Clustering sorts data within partitions (up to 4 fields), best for high-cardinality filtering. Use both together for optimal performance.
Q2: How do you prevent expensive queries?
Answer: 1) Use require_partition_filter, 2) Set maximum_bytes_billed, 3) Use query dry-run for cost estimation, 4) Avoid SELECT *, 5) Use partition and cluster pruning, 6) Monitor slot usage.
Q3: When should you use materialized views?
Answer: Materialized views are best for frequently-run aggregations on large tables. They're automatically maintained by BigQuery and provide faster query performance. Use them for dashboard queries and common aggregations.
Q4: How do you optimize JOINs in BigQuery?
Answer: 1) Use broadcast JOINs for small tables (<1GB), 2) Join on partitioned/clustering columns, 3) Filter before JOINing, 4) Use APPROX_COUNT_DISTINCT for large cardinality, 5) Consider denormalization for complex joins.
Q5: What is the benefit of BI Engine?
Answer: BI Engine caches hot data in-memory for sub-second dashboard queries. It automatically manages cache invalidation and works with Looker and Connected Sheets. Reserve capacity based on your hot data size.