BigQuery Deep Dive

Master BigQuery architecture, slot management, BI Engine, Omni multi-cloud, and advanced data warehouse patterns.

25 min readAdvanced

BigQuery Architecture

BigQuery is a serverless, highly scalable, multi-tenant enterprise data warehouse. It separates storage and compute, allowing independent scaling and cost optimization.

Architecture Overview

📊 BigQuery Architecture for Data Engineering

Interview Tip: BigQuery separates storage and compute. Queries are charged by slots (compute) + bytes scanned. Always partition and cluster tables to reduce costs.

Slot Management

Slots are BigQuery's compute units. Each slot provides a portion of CPU, memory, and I/O for query execution.

Slot Types

📊 BigQuery Architecture for Data Engineering

Interview Tip: BigQuery separates storage and compute. Queries are charged by slots (compute) + bytes scanned. Always partition and cluster tables to reduce costs.

# Create a slot commitment
from google.cloud import bigquery_reservation_v1

client = bigquery_reservation_v1.ReservationServiceClient()

# Create a commitment
commitment = bigquery_reservation_v1.CapacityCommitment(
    plan=bigquery_reservation_v1.CapacityCommitment.CommitmentPlan.ONE_YEAR,
    slot_count=100
)

request = bigquery_reservation_v1.CreateCapacityCommitmentRequest(
    parent="projects/my-project/locations/us-central1",
    commitment=commitment
)

result = client.create_capacity_commitment(request=request)
print(f"Created commitment: {result.name} with {result.slot_count} slots")

ℹ️

Pro Tip: Use a hybrid approach: 100 committed slots for baseline load + autoscale for peak demand. This provides cost savings for steady-state while handling burst capacity. Monitor slot utilization with INFORMATION_SCHEMA slots views.

Partitioning and Clustering

Partitioning Strategies

-- Time-unit partitioning (recommended for time-series data)
CREATE TABLE `project.dataset.events_partitioned`
(
  event_id STRING,
  event_type STRING,
  user_id STRING,
  event_timestamp TIMESTAMP,
  payload JSON
)
PARTITION BY DATE(event_timestamp)
OPTIONS (
  partition_expiration_days = 730,  -- 2 years retention
  partition_description = 'Events partitioned by date'
);

-- Ingestion-time partitioning
CREATE TABLE `project.dataset.events_ingestion`
(
  data STRING
)
PARTITION BY _PARTITIONDATE
OPTIONS (
  partition_expiration_days = 365
);

-- Integer-range partitioning
CREATE TABLE `project.dataset.sales_int_partitioned`
(
  sale_id INT64,
  amount FLOAT64,
  region STRING
)
PARTITION BY
  RANGE_BUCKET(sale_id, GENERATE_ARRAY(0, 10000000, 100000));

Clustering

-- Clustering (sub-partitioning within partitions)
CREATE TABLE `project.dataset.events_clustered`
(
  event_id STRING,
  event_type STRING,
  user_id STRING,
  event_timestamp TIMESTAMP,
  payload JSON
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY event_type, user_id
OPTIONS (
  clustering_fields = 'event_type,user_id'
);

-- Query optimization with partition pruning and clustering
SELECT *
FROM `project.dataset.events_clustered`
WHERE event_date = '2025-01-15'  -- Partition pruning
  AND event_type = 'purchase'    -- Cluster pruning
  AND user_id = 'user_123';     -- Cluster pruning

GCP Pricing Models for Data Engineering

💳

On-Demand

Pay per use, no commitment

Dev/Test

📋

Committed (1yr)

Up to 37%

1-year commitment

Steady production

📝

Committed (3yr)

Up to 55%

3-year commitment

Long-term infra

⚡

Preemptible/Spot

Up to 91%

Short-lived VMs

Batch processing

💰

Sustained Use

Up to 30%

Auto discounts for long use

Always-on

🔥

Serverless

N/A

Pay per query/invocation

Event-driven

BI Engine

BI Engine provides sub-second query response times for dashboards by caching hot data in-memory.

# Reserve BI Engine capacity
from google.cloud import bigquery_v1

client = bigquery_v1.Client()

# Create BI Engine reservation
reservation = bigquery_v1.Reservation(
    name="projects/my-project/locations/us-central1/reservations/bi-engine",
    slot_capacity=1000,  # 1,000 slots for BI Engine
    ignore_idle_slots=False,
    autoscale=bigquery_v1.Reservation.Autoscale(
        max_slots=2000  # Auto-scale up to 2,000 slots
    )
)

-- Create BI Engine model for caching
CREATE OR REPLACE MODEL `project.dataset.bi_sales_model`
OPTIONS (
  model_type = 'BIGQUERY_BI_ENGINE',
  location = 'us-central1'
) AS
SELECT
  product_category,
  region,
  SUM(revenue) as total_revenue,
  COUNT(*) as transaction_count
FROM `project.dataset.sales`
GROUP BY 1, 2;

-- Query using BI Engine (auto-cached)
SELECT *
FROM `project.dataset.bi_sales_model`
WHERE region = 'North America';

BigQuery Omni (Multi-Cloud)

BigQuery Omni allows querying data across GCS, Amazon S3, and Azure Blob Storage.

-- Create external table for Amazon S3
CREATE EXTERNAL TABLE `project.dataset.aws_sales`
WITH CONNECTION `us-central1.aws-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://my-aws-bucket/sales/**/*.parquet'],
  max_staleness = INTERVAL 1 HOUR
);

-- Query across GCS and S3 in single query
SELECT
  'GCP' as source,
  SUM(revenue) as total_revenue
FROM `project.dataset.gcp_sales`
UNION ALL
SELECT
  'AWS' as source,
  SUM(revenue) as total_revenue
FROM `project.dataset.aws_sales`;

Streaming into BigQuery

from google.cloud import bigquery
from google.cloud.bigquery import StreamingInsertRow
import json

def stream_to_bigquery(dataset_id, table_id, rows):
    """Stream rows to BigQuery in real-time."""
    client = bigquery.Client()
    table_ref = client.dataset(dataset_id).table(table_id)

    errors = client.insert_rows_json(
        table_ref,
        rows,
        row_ids=[None] * len(rows)  # Auto-generate insert IDs
    )

    if errors:
        print(f"Streaming errors: {errors}")
        return False

    print(f"Successfully streamed {len(rows)} rows")
    return True

# Example usage
events = [
    {
        "event_id": "evt_001",
        "event_type": "purchase",
        "user_id": "user_123",
        "event_timestamp": "2025-01-15T10:30:00Z",
        "amount": 99.99
    },
    {
        "event_id": "evt_002",
        "event_type": "view",
        "user_id": "user_456",
        "event_timestamp": "2025-01-15T10:31:00Z",
        "amount": 0.0
    }
]

stream_to_bigquery("analytics", "real_time_events", events)

⚠️

Warning: BigQuery streaming inserts have a 1 GB/s per table limit and cost $0.01/200MB (after 2GB free). For high-volume streaming (>1GB/s), use Dataflow to BigQuery or the Storage Write API instead of streaming inserts.

⚠️ Cost Alert

Always monitor your BigQuery costs using INFORMATION_SCHEMA. Set up budget alerts at 50%, 80%, and 100% thresholds.

Cost Optimization

# Cost analysis for BigQuery
cost_analysis = {
    "on_demand": {
        "query_cost_per_tb": 5.00,
        "storage_per_gb_month": 0.02,
        "streaming_per_200mb": 0.01,
        "best_for": "Ad-hoc queries, <100 queries/day"
    },
    "flat_rate_100_slots": {
        "monthly_cost": 2000,  # 1yr CUD
        "cost_per_slot_hour": 0.04,
        "best_for": "Steady workloads, >100 queries/day"
    },
    "autoscale_100_slots": {
        "minimum_cost": 292,  # 100 slots * 730 hours * $0.04
        "cost_per_slot_hour": 0.04,
        "best_for": "Variable workloads, batch processing"
    }
}

# Cost optimization strategies
optimizations = {
    "partitioning": "Reduces data scanned by 50-90%",
    "clustering": "Reduces data scanned by 10-50%",
    "materialized_views": "Pre-computed results, automatic refresh",
    "cached_results": "Repeated queries use cache (24-hour TTL)",
    "slot_management": "Use committed slots for predictable load",
    "query_optimization": "Avoid SELECT *, use WHERE on partitioned columns"
}

💬

Common Interview Questions

Q1: Explain BigQuery's separation of storage and compute.

Answer: BigQuery stores data in Colossus (Google's distributed file system) and processes queries using Dremel (distributed query engine). Storage and compute scale independently — you can store petabytes while using minimal compute, or run complex queries on small datasets. This separation enables cost optimization: pay for storage separately from compute, and scale each independently.

Q2: When would you use BI Engine vs. materialized views?

Answer: BI Engine is best for interactive dashboards requiring sub-second response times on frequently accessed data. It caches hot data in-memory and auto-refreshes. Materialized views are better for pre-computed aggregations that don't change frequently. BI Engine provides better performance for ad-hoc queries, while materialized views reduce query costs for common patterns.

Q3: How do you optimize BigQuery costs for a data warehouse?

Answer: 1) Partition tables by date to reduce data scanned, 2) Cluster by frequently filtered columns, 3) Use materialized views for common aggregations, 4) Leverage cached results for repeated queries, 5) Use committed slots for predictable workloads, 6) Avoid SELECT * and use WHERE clauses on partitioned columns, 7) Use BigQuery Omni for multi-cloud queries.

Q4: What is the difference between streaming inserts and Storage Write API?

Answer: Streaming inserts provide at-least-once delivery with 1GB/s per table limit and $0.01/200MB cost. Storage Write API provides exactly-once delivery, higher throughput, and lower cost. For new implementations, use Storage Write API for streaming ingestion.

Q5: How does BigQuery handle schema evolution?

Answer: BigQuery supports additive schema changes (adding new columns) without downtime. Use ALTER TABLE ADD COLUMN for manual evolution, or enable schema auto-detection for JSON/Avro files. Schema changes don't require rewriting existing data — new columns are added to the end of the table.

BigQuery: Architecture, Slots, BI Engine & Multi-Cloud