Google BigQuery Fundamentals: Serverless Data Warehouse

BigQuery: Serverless Analytical Processing at Scale

Google BigQuery is a fully managed, serverless data warehouse that separates compute from storage and uses Google's Dremel query execution engine.

Why BigQuery Matters

Key Benefits:

No infrastructure management — no clusters, no nodes, no capacity planning
Columnar storage (Capacitor) — efficient analytical queries
Distributed execution (Dremel) — petabyte-scale analysis
Automatic optimization — simplest path to petabyte-scale analytics

Key Insight: BigQuery pioneered the serverless model for analytical workloads.

Architecture Overview

BigQuery Architecture

Serverless vs. Provisioned

A serverless data warehouse abstracts all infrastructure management from the user. Compute resources are allocated per query and billed per usage, with no minimum commitment. BigQuery pioneered this model for analytical workloads.

BigQuery Cost Model

Storage: $0.02/GB/month (active),$ 0.01/GB/month (long-term >90 days)
On-Demand Query: $6.25/TB of data scanned (up to 1 PB/day)
Flat-Rate Slots: $0.04/slot/hour (100-slot minimum, ~$ 2,900/month)
Editions: Standard (basic), Enterprise (advanced), Enterprise Plus (premium)
BI Engine: $0.01/GB of in-memory cache per hour
Cost per Query = Data_Scanned_TB × $6.25 (on-demand) OR Slots_Used × Time × Rate (flat-rate)

-- Create dataset with regional storage
CREATE SCHEMA IF NOT EXISTS analytics
    OPTIONS (
        location = 'US',
        default_table_expiration_days = NULL,
        default_partition_expiration_days = 365,
        storage_billing_model = 'PHYSICAL'  -- Or LOGICAL
    );

-- Create partitioned and clustered table
CREATE TABLE analytics.fact_orders (
    order_id        INT64 NOT NULL,
    customer_id     INT64,
    product_id      INT64,
    order_date      DATE NOT NULL,
    order_timestamp TIMESTAMP,
    quantity        INT64,
    unit_price      NUMERIC(12,2),
    discount_pct    NUMERIC(5,4),
    net_amount      NUMERIC(14,2),
    order_status    STRING,
    region          STRING
)
PARTITION BY DATE(order_date)
CLUSTER BY (customer_id, order_status, region)
OPTIONS (
    description = 'Fact table for customer orders',
    require_partition_filter = TRUE,
    partition_expiration_days = 730
);

-- Query with partition filter (required)
SELECT
    EXTRACT(YEAR FROM order_date)  AS order_year,
    EXTRACT(MONTH FROM order_date) AS order_month,
    region,
    COUNT(*)                       AS order_count,
    SUM(net_amount)                AS total_revenue,
    AVG(net_amount)                AS avg_order_value
FROM analytics.fact_orders
WHERE order_date BETWEEN '2024-01-01' AND '2025-12-31'
GROUP BY 1, 2, 3
ORDER BY total_revenue DESC;

-- DDL for nested/repeated fields (STRUCTs and ARRAYs)
CREATE TABLE analytics.customer_events (
    customer_id     INT64,
    customer_name   STRING,
    events          ARRAY<STRUCT<
        event_type    STRING,
        event_time    TIMESTAMP,
        event_data    JSON,
        session_id    STRING
    >>,
    address         STRUCT<
        street        STRING,
        city          STRING,
        state         STRING,
        zip           STRING
    >
);

-- Query nested data
SELECT
    customer_id,
    event.event_type,
    event.event_time,
    event.event_data['page'] AS page_viewed
FROM analytics.customer_events,
UNNEST(events) AS event
WHERE event.event_type = 'page_view'
  AND event.event_time >= '2025-01-01';

Partitioning and Clustering

Partitioning divides a table into segments based on a column value (DATE, TIMESTAMP, INT64 range, or ingestion-time). Partitions are independently pruned during queries, reducing data scanned and cost.

Clustering sorts data within each partition by up to 4 columns. Clustered columns are sorted within each partition block, enabling further data pruning beyond partitioning.

Partition-Cluster Pruning

Data Scanned = Σ (partition_size × cluster_pruning_ratio)
Cluster Pruning Ratio = 1 / (number_of_distinct_cluster_values_in_partition)
Optimal Pattern: Partition by date, cluster by high-cardinality filter columns
Cost Savings = (1 - scanned_after_optimization / scanned_before) × $6.25/TB
Example: 1 TB partition, cluster pruning reduces to 50 GB -> savings = $5.94/query

-- Ingestion-time partitioned table
CREATE TABLE analytics.events_ingest_time (
    event_id    STRING,
    payload     JSON
)
PARTITION BY _PARTITIONTIME
OPTIONS (
    require_partition_filter = TRUE
);

-- Integer-range partitioned table
CREATE TABLE analytics.hourly_metrics (
    metric_hour  INT64 NOT NULL,
    metric_name  STRING,
    metric_value FLOAT64
)
PARTITION BY RANGE_BUCKET(metric_hour, GENERATE_ARRAY(2024010100, 2026010100, 1));

-- Materialized view with partitioning
CREATE MATERIALIZED VIEW analytics.mv_daily_revenue
    PARTITION BY DATE(order_date)
    CLUSTER BY (region)
AS
SELECT
    DATE(order_date)  AS order_date,
    region,
    SUM(net_amount)   AS total_revenue,
    COUNT(*)          AS order_count
FROM analytics.fact_orders
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2;

-- Refresh materialized view
ALTER MATERIALIZED VIEW analytics.mv_daily_revenue
    SET OPTIONS (
        refresh_interval_minutes = 60,
        enable_refresh = TRUE
    );

Slots and Capacity Management

A slot is BigQuery's unit of computational capacity. Each slot can process approximately 1 GB/s of data. Queries consume slots dynamically, and BigQuery automatically manages slot allocation across concurrent queries.

Slot Utilization

On-Demand Mode: Up to 2,000 slots per query, shared across project
Autoscaling Mode: 500-10,000 slots, scales with demand
Capacity Formula: Required_Slots = (Data_Scanned_GB / Time_Target_sec) / 1.0
Slot Hour Cost = Slots_Used × Hours × $0.04 (for committed slots)
Break-Even: On-Demand is cheaper below ~360 TB/month scanned

-- Monitor slot usage in real-time
SELECT
    TIMESTAMP_TRUNC(timestamp, MINUTE) AS time_bucket,
    AVG(slot_count)                    AS avg_slots,
    MAX(slot_count)                    AS max_slots,
    SUM(estimated_total_bytes_processed) / 1024 / 1024 / 1024 AS data_scanned_gb,
    COUNT(*)                           AS query_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY 1
ORDER BY 1 DESC;

-- Check slot utilization
SELECT
    job_id,
    query,
    TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS duration_ms,
    total_slot_ms / TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS avg_slots,
    total_bytes_processed / 1024 / 1024 / 1024 AS data_gb
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND statement_type = 'SELECT'
ORDER BY total_slot_ms DESC
LIMIT 20;

-- Flat-rate slot commitment
-- Purchase via Console: BigQuery > Administration > Capacity Management
-- Example: Commit 500 slots = ~$1,450/month

BI Engine

BI Engine is BigQuery's in-memory acceleration layer that caches frequently accessed data in RAM. It provides sub-100ms query latency for dashboards and interactive reports by serving results directly from memory without touching disk storage.

-- Reserve BI Engine capacity
-- Via Console: BigQuery > Administration > BI Engine > Add Reservation
-- Allocate up to 1 TB of in-memory cache

-- Create table optimized for BI Engine
CREATE TABLE analytics.bi_daily_metrics (
    metric_date  DATE,
    region       STRING,
    product_line STRING,
    metric_name  STRING,
    metric_value FLOAT64
);

-- BI Engine automatically accelerates queries with:
-- - Equality filters on partitioned columns
-- - Aggregations on small/medium dimension tables
-- - Dashboard queries with fixed query patterns

-- Check BI Engine reservation
SELECT
    reservation_name,
    location,
    state,
    slots_allocated,
    slots_min,
    slots_max
FROM `region-us`.INFORMATION_SCHEMA.BI_CAPACITY_COMMITMENTS;

Key Concepts Summary

Concept	Description	Benefit	Cost Impact
Capacitor	Columnar storage format	10:1 compression ratio	Reduces storage cost
Dremel	Distributed query engine	MPP query execution	Per-query billing
Partitioning	Table segmentation by column	Partition pruning	70-95% less data scanned
Clustering	In-partition sorting	Column-level pruning	Additional 20-60% savings
Slots	Compute units	Elastic MPP capacity	On-demand or flat-rate
BI Engine	In-memory acceleration	<100ms dashboard queries	$0.01/GB/hour
Materialized View	Pre-computed aggregations	Faster queries, lower cost	Automatic refresh
BigQuery ML	In-database ML models	No data movement	Per-model training cost
Streaming Buffer	Real-time ingestion	<1 second latency	$0.01/GB for streaming
Authorized Views	Cross-dataset access control	Security without copies	No additional cost
Reservation	Committed slot capacity	Predictable pricing	Monthly commitment

Performance Metrics

Query Type	Data Volume	On-Demand Cost	Slot Mode	Latency
Ad-hoc dashboard	10 GB	$0.06	On-Demand	2-5 sec
Partitioned query	100 GB	$0.63	On-Demand	5-15 sec
Large ETL job	1 TB	$6.25	On-Demand	30-120 sec
BI Engine cached	1 GB	$0.01	BI Engine	<100 ms
Streaming insert	1 KB	$0.01	Streaming	<1 sec
ML prediction	100 GB	$0.63	On-Demand	10-30 sec

10 Best Practices

Always use partition filters — enable require_partition_filter = TRUE to prevent full table scans
Cluster by filter columns — cluster by columns used in WHERE/GROUP BY for 20-60% additional savings
Use on-demand for <360 TB/month — flat-rate slots only make sense at scale
Leverage materialized views for repeated aggregations — they auto-refresh and reduce cost
Optimize file sizes — BigQuery performs best with files >1 MB, avoid tiny files
Use STRUCT types for nested data instead of JOINs — reduces data movement
Monitor slots via INFORMATION_SCHEMA — identify queries consuming excess slots
Use dry runs (--dry_run flag) to estimate cost before executing queries
Enable BI Engine for dashboard workloads with <1 TB hot data
Set table expiration policies to automatically clean up temporary/staging data

BigQuery's serverless architecture eliminates infrastructure management entirely
Partitioning + clustering reduce data scanned by 90-99% with proper design
On-demand billing is cost-effective below 360 TB/month; flat-rate above
BI Engine provides in-memory acceleration for interactive dashboards
STRUCT types and nested data reduce JOIN overhead and improve query performance