πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Google BigQuery Fundamentals: Serverless Data Warehouse

Module 3: Data Warehouses & StorageCloud Data Platforms🟒 Free Lesson

Advertisement

BigQuery: Serverless Analytical Processing at Scale

Google BigQuery is a fully managed, serverless data warehouse that separates compute from storage and uses Google's Dremel query execution engine.

Why BigQuery Matters


Key Benefits:

  • No infrastructure management β€” no clusters, no nodes, no capacity planning
  • Columnar storage (Capacitor) β€” efficient analytical queries
  • Distributed execution (Dremel) β€” petabyte-scale analysis
  • Automatic optimization β€” simplest path to petabyte-scale analytics

Key Insight: BigQuery pioneered the serverless model for analytical workloads.


Architecture Overview

BigQuery Architecture

BigQuery Serverless ArchitectureSQL InterfaceConsole, CLI, APIQuery PlannerParse, OptimizeDremel EngineLeaf & Root SlotsCapacitorColumnar StorageColossusGCSPartitioningBy: DATE, TIMESTAMP, INTEGERGranularity: DAY, HOUR, MONTHBenefit: partition pruning (scan 1 partition only)ClusteringUp to 4 columns per tableAuto-sort within partitionsBenefit: range filter optimization

Serverless vs. Provisioned

A serverless data warehouse abstracts all infrastructure management from the user. Compute resources are allocated per query and billed per usage, with no minimum commitment. BigQuery pioneered this model for analytical workloads.

BigQuery Cost Model

  • Storage: 0.02/GB/month(active),0.02/GB/month (active),0.01/GB/month (long-term >90 days)
  • On-Demand Query: $6.25/TB of data scanned (up to 1 PB/day)
  • Flat-Rate Slots: 0.04/slot/hour(100βˆ’slotminimum,Β 0.04/slot/hour (100-slot minimum, ~2,900/month)
  • Editions: Standard (basic), Enterprise (advanced), Enterprise Plus (premium)
  • BI Engine: $0.01/GB of in-memory cache per hour
  • Cost per Query = Data_Scanned_TB Γ— $6.25 (on-demand) OR Slots_Used Γ— Time Γ— Rate (flat-rate)
-- Create dataset with regional storage
CREATE SCHEMA IF NOT EXISTS analytics
    OPTIONS (
        location = 'US',
        default_table_expiration_days = NULL,
        default_partition_expiration_days = 365,
        storage_billing_model = 'PHYSICAL'  -- Or LOGICAL
    );

-- Create partitioned and clustered table
CREATE TABLE analytics.fact_orders (
    order_id        INT64 NOT NULL,
    customer_id     INT64,
    product_id      INT64,
    order_date      DATE NOT NULL,
    order_timestamp TIMESTAMP,
    quantity        INT64,
    unit_price      NUMERIC(12,2),
    discount_pct    NUMERIC(5,4),
    net_amount      NUMERIC(14,2),
    order_status    STRING,
    region          STRING
)
PARTITION BY DATE(order_date)
CLUSTER BY (customer_id, order_status, region)
OPTIONS (
    description = 'Fact table for customer orders',
    require_partition_filter = TRUE,
    partition_expiration_days = 730
);

-- Query with partition filter (required)
SELECT
    EXTRACT(YEAR FROM order_date)  AS order_year,
    EXTRACT(MONTH FROM order_date) AS order_month,
    region,
    COUNT(*)                       AS order_count,
    SUM(net_amount)                AS total_revenue,
    AVG(net_amount)                AS avg_order_value
FROM analytics.fact_orders
WHERE order_date BETWEEN '2024-01-01' AND '2025-12-31'
GROUP BY 1, 2, 3
ORDER BY total_revenue DESC;

-- DDL for nested/repeated fields (STRUCTs and ARRAYs)
CREATE TABLE analytics.customer_events (
    customer_id     INT64,
    customer_name   STRING,
    events          ARRAY<STRUCT<
        event_type    STRING,
        event_time    TIMESTAMP,
        event_data    JSON,
        session_id    STRING
    >>,
    address         STRUCT<
        street        STRING,
        city          STRING,
        state         STRING,
        zip           STRING
    >
);

-- Query nested data
SELECT
    customer_id,
    event.event_type,
    event.event_time,
    event.event_data['page'] AS page_viewed
FROM analytics.customer_events,
UNNEST(events) AS event
WHERE event.event_type = 'page_view'
  AND event.event_time >= '2025-01-01';

Partitioning and Clustering

Partitioning divides a table into segments based on a column value (DATE, TIMESTAMP, INT64 range, or ingestion-time). Partitions are independently pruned during queries, reducing data scanned and cost.

Clustering sorts data within each partition by up to 4 columns. Clustered columns are sorted within each partition block, enabling further data pruning beyond partitioning.

Partition-Cluster Pruning

  • Data Scanned = Ξ£ (partition_size Γ— cluster_pruning_ratio)
  • Cluster Pruning Ratio = 1 / (number_of_distinct_cluster_values_in_partition)
  • Optimal Pattern: Partition by date, cluster by high-cardinality filter columns
  • Cost Savings = (1 - scanned_after_optimization / scanned_before) Γ— $6.25/TB
  • Example: 1 TB partition, cluster pruning reduces to 50 GB -> savings = $5.94/query
-- Ingestion-time partitioned table
CREATE TABLE analytics.events_ingest_time (
    event_id    STRING,
    payload     JSON
)
PARTITION BY _PARTITIONTIME
OPTIONS (
    require_partition_filter = TRUE
);

-- Integer-range partitioned table
CREATE TABLE analytics.hourly_metrics (
    metric_hour  INT64 NOT NULL,
    metric_name  STRING,
    metric_value FLOAT64
)
PARTITION BY RANGE_BUCKET(metric_hour, GENERATE_ARRAY(2024010100, 2026010100, 1));

-- Materialized view with partitioning
CREATE MATERIALIZED VIEW analytics.mv_daily_revenue
    PARTITION BY DATE(order_date)
    CLUSTER BY (region)
AS
SELECT
    DATE(order_date)  AS order_date,
    region,
    SUM(net_amount)   AS total_revenue,
    COUNT(*)          AS order_count
FROM analytics.fact_orders
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2;

-- Refresh materialized view
ALTER MATERIALIZED VIEW analytics.mv_daily_revenue
    SET OPTIONS (
        refresh_interval_minutes = 60,
        enable_refresh = TRUE
    );

Slots and Capacity Management

A slot is BigQuery's unit of computational capacity. Each slot can process approximately 1 GB/s of data. Queries consume slots dynamically, and BigQuery automatically manages slot allocation across concurrent queries.

Slot Utilization

  • On-Demand Mode: Up to 2,000 slots per query, shared across project
  • Autoscaling Mode: 500-10,000 slots, scales with demand
  • Capacity Formula: Required_Slots = (Data_Scanned_GB / Time_Target_sec) / 1.0
  • Slot Hour Cost = Slots_Used Γ— Hours Γ— $0.04 (for committed slots)
  • Break-Even: On-Demand is cheaper below ~360 TB/month scanned
-- Monitor slot usage in real-time
SELECT
    TIMESTAMP_TRUNC(timestamp, MINUTE) AS time_bucket,
    AVG(slot_count)                    AS avg_slots,
    MAX(slot_count)                    AS max_slots,
    SUM(estimated_total_bytes_processed) / 1024 / 1024 / 1024 AS data_scanned_gb,
    COUNT(*)                           AS query_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY 1
ORDER BY 1 DESC;

-- Check slot utilization
SELECT
    job_id,
    query,
    TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS duration_ms,
    total_slot_ms / TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS avg_slots,
    total_bytes_processed / 1024 / 1024 / 1024 AS data_gb
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND statement_type = 'SELECT'
ORDER BY total_slot_ms DESC
LIMIT 20;

-- Flat-rate slot commitment
-- Purchase via Console: BigQuery > Administration > Capacity Management
-- Example: Commit 500 slots = ~$1,450/month

BI Engine

BI Engine is BigQuery's in-memory acceleration layer that caches frequently accessed data in RAM. It provides sub-100ms query latency for dashboards and interactive reports by serving results directly from memory without touching disk storage.

-- Reserve BI Engine capacity
-- Via Console: BigQuery > Administration > BI Engine > Add Reservation
-- Allocate up to 1 TB of in-memory cache

-- Create table optimized for BI Engine
CREATE TABLE analytics.bi_daily_metrics (
    metric_date  DATE,
    region       STRING,
    product_line STRING,
    metric_name  STRING,
    metric_value FLOAT64
);

-- BI Engine automatically accelerates queries with:
-- - Equality filters on partitioned columns
-- - Aggregations on small/medium dimension tables
-- - Dashboard queries with fixed query patterns

-- Check BI Engine reservation
SELECT
    reservation_name,
    location,
    state,
    slots_allocated,
    slots_min,
    slots_max
FROM `region-us`.INFORMATION_SCHEMA.BI_CAPACITY_COMMITMENTS;

Key Concepts Summary

ConceptDescriptionBenefitCost Impact
CapacitorColumnar storage format10:1 compression ratioReduces storage cost
DremelDistributed query engineMPP query executionPer-query billing
PartitioningTable segmentation by columnPartition pruning70-95% less data scanned
ClusteringIn-partition sortingColumn-level pruningAdditional 20-60% savings
SlotsCompute unitsElastic MPP capacityOn-demand or flat-rate
BI EngineIn-memory acceleration<100ms dashboard queries$0.01/GB/hour
Materialized ViewPre-computed aggregationsFaster queries, lower costAutomatic refresh
BigQuery MLIn-database ML modelsNo data movementPer-model training cost
Streaming BufferReal-time ingestion<1 second latency$0.01/GB for streaming
Authorized ViewsCross-dataset access controlSecurity without copiesNo additional cost
ReservationCommitted slot capacityPredictable pricingMonthly commitment

Performance Metrics

Query TypeData VolumeOn-Demand CostSlot ModeLatency
Ad-hoc dashboard10 GB$0.06On-Demand2-5 sec
Partitioned query100 GB$0.63On-Demand5-15 sec
Large ETL job1 TB$6.25On-Demand30-120 sec
BI Engine cached1 GB$0.01BI Engine<100 ms
Streaming insert1 KB$0.01Streaming<1 sec
ML prediction100 GB$0.63On-Demand10-30 sec

10 Best Practices

  1. Always use partition filters β€” enable require_partition_filter = TRUE to prevent full table scans
  2. Cluster by filter columns β€” cluster by columns used in WHERE/GROUP BY for 20-60% additional savings
  3. Use on-demand for <360 TB/month β€” flat-rate slots only make sense at scale
  4. Leverage materialized views for repeated aggregations β€” they auto-refresh and reduce cost
  5. Optimize file sizes β€” BigQuery performs best with files >1 MB, avoid tiny files
  6. Use STRUCT types for nested data instead of JOINs β€” reduces data movement
  7. Monitor slots via INFORMATION_SCHEMA β€” identify queries consuming excess slots
  8. Use dry runs (--dry_run flag) to estimate cost before executing queries
  9. Enable BI Engine for dashboard workloads with <1 TB hot data
  10. Set table expiration policies to automatically clean up temporary/staging data

  • BigQuery's serverless architecture eliminates infrastructure management entirely
  • Partitioning + clustering reduce data scanned by 90-99% with proper design
  • On-demand billing is cost-effective below 360 TB/month; flat-rate above
  • BI Engine provides in-memory acceleration for interactive dashboards
  • STRUCT types and nested data reduce JOIN overhead and improve query performance

See Also

⭐

Premium Content

Google BigQuery Fundamentals: Serverless Data Warehouse

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement