BigQuery: Serverless Analytical Processing at Scale
Google BigQuery is a fully managed, serverless data warehouse that separates compute from storage and uses Google's Dremel query execution engine.
Why BigQuery Matters
Key Benefits:
- No infrastructure management β no clusters, no nodes, no capacity planning
- Columnar storage (Capacitor) β efficient analytical queries
- Distributed execution (Dremel) β petabyte-scale analysis
- Automatic optimization β simplest path to petabyte-scale analytics
Key Insight: BigQuery pioneered the serverless model for analytical workloads.
Architecture Overview
BigQuery Architecture
Serverless vs. Provisioned
A serverless data warehouse abstracts all infrastructure management from the user. Compute resources are allocated per query and billed per usage, with no minimum commitment. BigQuery pioneered this model for analytical workloads.
BigQuery Cost Model
- Storage: 0.01/GB/month (long-term >90 days)
- On-Demand Query: $6.25/TB of data scanned (up to 1 PB/day)
- Flat-Rate Slots: 2,900/month)
- Editions: Standard (basic), Enterprise (advanced), Enterprise Plus (premium)
- BI Engine: $0.01/GB of in-memory cache per hour
- Cost per Query = Data_Scanned_TB Γ $6.25 (on-demand) OR Slots_Used Γ Time Γ Rate (flat-rate)
-- Create dataset with regional storage
CREATE SCHEMA IF NOT EXISTS analytics
OPTIONS (
location = 'US',
default_table_expiration_days = NULL,
default_partition_expiration_days = 365,
storage_billing_model = 'PHYSICAL' -- Or LOGICAL
);
-- Create partitioned and clustered table
CREATE TABLE analytics.fact_orders (
order_id INT64 NOT NULL,
customer_id INT64,
product_id INT64,
order_date DATE NOT NULL,
order_timestamp TIMESTAMP,
quantity INT64,
unit_price NUMERIC(12,2),
discount_pct NUMERIC(5,4),
net_amount NUMERIC(14,2),
order_status STRING,
region STRING
)
PARTITION BY DATE(order_date)
CLUSTER BY (customer_id, order_status, region)
OPTIONS (
description = 'Fact table for customer orders',
require_partition_filter = TRUE,
partition_expiration_days = 730
);
-- Query with partition filter (required)
SELECT
EXTRACT(YEAR FROM order_date) AS order_year,
EXTRACT(MONTH FROM order_date) AS order_month,
region,
COUNT(*) AS order_count,
SUM(net_amount) AS total_revenue,
AVG(net_amount) AS avg_order_value
FROM analytics.fact_orders
WHERE order_date BETWEEN '2024-01-01' AND '2025-12-31'
GROUP BY 1, 2, 3
ORDER BY total_revenue DESC;
-- DDL for nested/repeated fields (STRUCTs and ARRAYs)
CREATE TABLE analytics.customer_events (
customer_id INT64,
customer_name STRING,
events ARRAY<STRUCT<
event_type STRING,
event_time TIMESTAMP,
event_data JSON,
session_id STRING
>>,
address STRUCT<
street STRING,
city STRING,
state STRING,
zip STRING
>
);
-- Query nested data
SELECT
customer_id,
event.event_type,
event.event_time,
event.event_data['page'] AS page_viewed
FROM analytics.customer_events,
UNNEST(events) AS event
WHERE event.event_type = 'page_view'
AND event.event_time >= '2025-01-01';
Partitioning and Clustering
Partitioning divides a table into segments based on a column value (DATE, TIMESTAMP, INT64 range, or ingestion-time). Partitions are independently pruned during queries, reducing data scanned and cost.
Clustering sorts data within each partition by up to 4 columns. Clustered columns are sorted within each partition block, enabling further data pruning beyond partitioning.
Partition-Cluster Pruning
- Data Scanned = Ξ£ (partition_size Γ cluster_pruning_ratio)
- Cluster Pruning Ratio = 1 / (number_of_distinct_cluster_values_in_partition)
- Optimal Pattern: Partition by date, cluster by high-cardinality filter columns
- Cost Savings = (1 - scanned_after_optimization / scanned_before) Γ $6.25/TB
- Example: 1 TB partition, cluster pruning reduces to 50 GB -> savings = $5.94/query
-- Ingestion-time partitioned table
CREATE TABLE analytics.events_ingest_time (
event_id STRING,
payload JSON
)
PARTITION BY _PARTITIONTIME
OPTIONS (
require_partition_filter = TRUE
);
-- Integer-range partitioned table
CREATE TABLE analytics.hourly_metrics (
metric_hour INT64 NOT NULL,
metric_name STRING,
metric_value FLOAT64
)
PARTITION BY RANGE_BUCKET(metric_hour, GENERATE_ARRAY(2024010100, 2026010100, 1));
-- Materialized view with partitioning
CREATE MATERIALIZED VIEW analytics.mv_daily_revenue
PARTITION BY DATE(order_date)
CLUSTER BY (region)
AS
SELECT
DATE(order_date) AS order_date,
region,
SUM(net_amount) AS total_revenue,
COUNT(*) AS order_count
FROM analytics.fact_orders
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2;
-- Refresh materialized view
ALTER MATERIALIZED VIEW analytics.mv_daily_revenue
SET OPTIONS (
refresh_interval_minutes = 60,
enable_refresh = TRUE
);
Slots and Capacity Management
A slot is BigQuery's unit of computational capacity. Each slot can process approximately 1 GB/s of data. Queries consume slots dynamically, and BigQuery automatically manages slot allocation across concurrent queries.
Slot Utilization
- On-Demand Mode: Up to 2,000 slots per query, shared across project
- Autoscaling Mode: 500-10,000 slots, scales with demand
- Capacity Formula: Required_Slots = (Data_Scanned_GB / Time_Target_sec) / 1.0
- Slot Hour Cost = Slots_Used Γ Hours Γ $0.04 (for committed slots)
- Break-Even: On-Demand is cheaper below ~360 TB/month scanned
-- Monitor slot usage in real-time
SELECT
TIMESTAMP_TRUNC(timestamp, MINUTE) AS time_bucket,
AVG(slot_count) AS avg_slots,
MAX(slot_count) AS max_slots,
SUM(estimated_total_bytes_processed) / 1024 / 1024 / 1024 AS data_scanned_gb,
COUNT(*) AS query_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY 1
ORDER BY 1 DESC;
-- Check slot utilization
SELECT
job_id,
query,
TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS duration_ms,
total_slot_ms / TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS avg_slots,
total_bytes_processed / 1024 / 1024 / 1024 AS data_gb
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
AND statement_type = 'SELECT'
ORDER BY total_slot_ms DESC
LIMIT 20;
-- Flat-rate slot commitment
-- Purchase via Console: BigQuery > Administration > Capacity Management
-- Example: Commit 500 slots = ~$1,450/month
BI Engine
BI Engine is BigQuery's in-memory acceleration layer that caches frequently accessed data in RAM. It provides sub-100ms query latency for dashboards and interactive reports by serving results directly from memory without touching disk storage.
-- Reserve BI Engine capacity
-- Via Console: BigQuery > Administration > BI Engine > Add Reservation
-- Allocate up to 1 TB of in-memory cache
-- Create table optimized for BI Engine
CREATE TABLE analytics.bi_daily_metrics (
metric_date DATE,
region STRING,
product_line STRING,
metric_name STRING,
metric_value FLOAT64
);
-- BI Engine automatically accelerates queries with:
-- - Equality filters on partitioned columns
-- - Aggregations on small/medium dimension tables
-- - Dashboard queries with fixed query patterns
-- Check BI Engine reservation
SELECT
reservation_name,
location,
state,
slots_allocated,
slots_min,
slots_max
FROM `region-us`.INFORMATION_SCHEMA.BI_CAPACITY_COMMITMENTS;
Key Concepts Summary
| Concept | Description | Benefit | Cost Impact |
|---|---|---|---|
| Capacitor | Columnar storage format | 10:1 compression ratio | Reduces storage cost |
| Dremel | Distributed query engine | MPP query execution | Per-query billing |
| Partitioning | Table segmentation by column | Partition pruning | 70-95% less data scanned |
| Clustering | In-partition sorting | Column-level pruning | Additional 20-60% savings |
| Slots | Compute units | Elastic MPP capacity | On-demand or flat-rate |
| BI Engine | In-memory acceleration | <100ms dashboard queries | $0.01/GB/hour |
| Materialized View | Pre-computed aggregations | Faster queries, lower cost | Automatic refresh |
| BigQuery ML | In-database ML models | No data movement | Per-model training cost |
| Streaming Buffer | Real-time ingestion | <1 second latency | $0.01/GB for streaming |
| Authorized Views | Cross-dataset access control | Security without copies | No additional cost |
| Reservation | Committed slot capacity | Predictable pricing | Monthly commitment |
Performance Metrics
| Query Type | Data Volume | On-Demand Cost | Slot Mode | Latency |
|---|---|---|---|---|
| Ad-hoc dashboard | 10 GB | $0.06 | On-Demand | 2-5 sec |
| Partitioned query | 100 GB | $0.63 | On-Demand | 5-15 sec |
| Large ETL job | 1 TB | $6.25 | On-Demand | 30-120 sec |
| BI Engine cached | 1 GB | $0.01 | BI Engine | <100 ms |
| Streaming insert | 1 KB | $0.01 | Streaming | <1 sec |
| ML prediction | 100 GB | $0.63 | On-Demand | 10-30 sec |
10 Best Practices
- Always use partition filters β enable
require_partition_filter = TRUEto prevent full table scans - Cluster by filter columns β cluster by columns used in WHERE/GROUP BY for 20-60% additional savings
- Use on-demand for <360 TB/month β flat-rate slots only make sense at scale
- Leverage materialized views for repeated aggregations β they auto-refresh and reduce cost
- Optimize file sizes β BigQuery performs best with files >1 MB, avoid tiny files
- Use STRUCT types for nested data instead of JOINs β reduces data movement
- Monitor slots via INFORMATION_SCHEMA β identify queries consuming excess slots
- Use dry runs (
--dry_runflag) to estimate cost before executing queries - Enable BI Engine for dashboard workloads with <1 TB hot data
- Set table expiration policies to automatically clean up temporary/staging data
- BigQuery's serverless architecture eliminates infrastructure management entirely
- Partitioning + clustering reduce data scanned by 90-99% with proper design
- On-demand billing is cost-effective below 360 TB/month; flat-rate above
- BI Engine provides in-memory acceleration for interactive dashboards
- STRUCT types and nested data reduce JOIN overhead and improve query performance
See Also
- Redshift Fundamentals β Amazon's MPP data warehouse comparison
- Snowflake Fundamentals β Snowflake's compute-storage separation architecture
- Data Lake Architecture β Designing scalable raw data storage
- Delta Lake & Iceberg β Open table formats for data lakes
- Partitioning & Indexing β Optimizing data access patterns
- Cost Optimization β Managing BigQuery and cloud data platform costs