BigQuery Architecture
BigQuery is a serverless, highly scalable, multi-tenant enterprise data warehouse. It separates storage and compute, allowing independent scaling and cost optimization.
Architecture Overview
Slot Management
Slots are BigQuery's compute units. Each slot provides a portion of CPU, memory, and I/O for query execution.
Slot Types
# Create a slot commitment
from google.cloud import bigquery_reservation_v1
client = bigquery_reservation_v1.ReservationServiceClient()
# Create a commitment
commitment = bigquery_reservation_v1.CapacityCommitment(
plan=bigquery_reservation_v1.CapacityCommitment.CommitmentPlan.ONE_YEAR,
slot_count=100
)
request = bigquery_reservation_v1.CreateCapacityCommitmentRequest(
parent="projects/my-project/locations/us-central1",
commitment=commitment
)
result = client.create_capacity_commitment(request=request)
print(f"Created commitment: {result.name} with {result.slot_count} slots")
βΉοΈ
Pro Tip: Use a hybrid approach: 100 committed slots for baseline load + autoscale for peak demand. This provides cost savings for steady-state while handling burst capacity. Monitor slot utilization with INFORMATION_SCHEMA slots views.
Partitioning and Clustering
Partitioning Strategies
-- Time-unit partitioning (recommended for time-series data)
CREATE TABLE `project.dataset.events_partitioned`
(
event_id STRING,
event_type STRING,
user_id STRING,
event_timestamp TIMESTAMP,
payload JSON
)
PARTITION BY DATE(event_timestamp)
OPTIONS (
partition_expiration_days = 730, -- 2 years retention
partition_description = 'Events partitioned by date'
);
-- Ingestion-time partitioning
CREATE TABLE `project.dataset.events_ingestion`
(
data STRING
)
PARTITION BY _PARTITIONDATE
OPTIONS (
partition_expiration_days = 365
);
-- Integer-range partitioning
CREATE TABLE `project.dataset.sales_int_partitioned`
(
sale_id INT64,
amount FLOAT64,
region STRING
)
PARTITION BY
RANGE_BUCKET(sale_id, GENERATE_ARRAY(0, 10000000, 100000));
Clustering
-- Clustering (sub-partitioning within partitions)
CREATE TABLE `project.dataset.events_clustered`
(
event_id STRING,
event_type STRING,
user_id STRING,
event_timestamp TIMESTAMP,
payload JSON
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY event_type, user_id
OPTIONS (
clustering_fields = 'event_type,user_id'
);
-- Query optimization with partition pruning and clustering
SELECT *
FROM `project.dataset.events_clustered`
WHERE event_date = '2025-01-15' -- Partition pruning
AND event_type = 'purchase' -- Cluster pruning
AND user_id = 'user_123'; -- Cluster pruning
BI Engine
BI Engine provides sub-second query response times for dashboards by caching hot data in-memory.
# Reserve BI Engine capacity
from google.cloud import bigquery_v1
client = bigquery_v1.Client()
# Create BI Engine reservation
reservation = bigquery_v1.Reservation(
name="projects/my-project/locations/us-central1/reservations/bi-engine",
slot_capacity=1000, # 1,000 slots for BI Engine
ignore_idle_slots=False,
autoscale=bigquery_v1.Reservation.Autoscale(
max_slots=2000 # Auto-scale up to 2,000 slots
)
)
-- Create BI Engine model for caching
CREATE OR REPLACE MODEL `project.dataset.bi_sales_model`
OPTIONS (
model_type = 'BIGQUERY_BI_ENGINE',
location = 'us-central1'
) AS
SELECT
product_category,
region,
SUM(revenue) as total_revenue,
COUNT(*) as transaction_count
FROM `project.dataset.sales`
GROUP BY 1, 2;
-- Query using BI Engine (auto-cached)
SELECT *
FROM `project.dataset.bi_sales_model`
WHERE region = 'North America';
BigQuery Omni (Multi-Cloud)
BigQuery Omni allows querying data across GCS, Amazon S3, and Azure Blob Storage.
-- Create external table for Amazon S3
CREATE EXTERNAL TABLE `project.dataset.aws_sales`
WITH CONNECTION `us-central1.aws-connection`
OPTIONS (
format = 'PARQUET',
uris = ['s3://my-aws-bucket/sales/**/*.parquet'],
max_staleness = INTERVAL 1 HOUR
);
-- Query across GCS and S3 in single query
SELECT
'GCP' as source,
SUM(revenue) as total_revenue
FROM `project.dataset.gcp_sales`
UNION ALL
SELECT
'AWS' as source,
SUM(revenue) as total_revenue
FROM `project.dataset.aws_sales`;
Streaming into BigQuery
from google.cloud import bigquery
from google.cloud.bigquery import StreamingInsertRow
import json
def stream_to_bigquery(dataset_id, table_id, rows):
"""Stream rows to BigQuery in real-time."""
client = bigquery.Client()
table_ref = client.dataset(dataset_id).table(table_id)
errors = client.insert_rows_json(
table_ref,
rows,
row_ids=[None] * len(rows) # Auto-generate insert IDs
)
if errors:
print(f"Streaming errors: {errors}")
return False
print(f"Successfully streamed {len(rows)} rows")
return True
# Example usage
events = [
{
"event_id": "evt_001",
"event_type": "purchase",
"user_id": "user_123",
"event_timestamp": "2025-01-15T10:30:00Z",
"amount": 99.99
},
{
"event_id": "evt_002",
"event_type": "view",
"user_id": "user_456",
"event_timestamp": "2025-01-15T10:31:00Z",
"amount": 0.0
}
]
stream_to_bigquery("analytics", "real_time_events", events)
β οΈ
Warning: BigQuery streaming inserts have a 1 GB/s per table limit and cost $0.01/200MB (after 2GB free). For high-volume streaming (>1GB/s), use Dataflow to BigQuery or the Storage Write API instead of streaming inserts.
Always monitor your BigQuery costs using INFORMATION_SCHEMA. Set up budget alerts at 50%, 80%, and 100% thresholds.
Cost Optimization
# Cost analysis for BigQuery
cost_analysis = {
"on_demand": {
"query_cost_per_tb": 5.00,
"storage_per_gb_month": 0.02,
"streaming_per_200mb": 0.01,
"best_for": "Ad-hoc queries, <100 queries/day"
},
"flat_rate_100_slots": {
"monthly_cost": 2000, # 1yr CUD
"cost_per_slot_hour": 0.04,
"best_for": "Steady workloads, >100 queries/day"
},
"autoscale_100_slots": {
"minimum_cost": 292, # 100 slots * 730 hours * $0.04
"cost_per_slot_hour": 0.04,
"best_for": "Variable workloads, batch processing"
}
}
# Cost optimization strategies
optimizations = {
"partitioning": "Reduces data scanned by 50-90%",
"clustering": "Reduces data scanned by 10-50%",
"materialized_views": "Pre-computed results, automatic refresh",
"cached_results": "Repeated queries use cache (24-hour TTL)",
"slot_management": "Use committed slots for predictable load",
"query_optimization": "Avoid SELECT *, use WHERE on partitioned columns"
}
Common Interview Questions
Q1: Explain BigQuery's separation of storage and compute.
Answer: BigQuery stores data in Colossus (Google's distributed file system) and processes queries using Dremel (distributed query engine). Storage and compute scale independently β you can store petabytes while using minimal compute, or run complex queries on small datasets. This separation enables cost optimization: pay for storage separately from compute, and scale each independently.
Q2: When would you use BI Engine vs. materialized views?
Answer: BI Engine is best for interactive dashboards requiring sub-second response times on frequently accessed data. It caches hot data in-memory and auto-refreshes. Materialized views are better for pre-computed aggregations that don't change frequently. BI Engine provides better performance for ad-hoc queries, while materialized views reduce query costs for common patterns.
Q3: How do you optimize BigQuery costs for a data warehouse?
Answer: 1) Partition tables by date to reduce data scanned, 2) Cluster by frequently filtered columns, 3) Use materialized views for common aggregations, 4) Leverage cached results for repeated queries, 5) Use committed slots for predictable workloads, 6) Avoid SELECT * and use WHERE clauses on partitioned columns, 7) Use BigQuery Omni for multi-cloud queries.
Q4: What is the difference between streaming inserts and Storage Write API?
Answer: Streaming inserts provide at-least-once delivery with 1GB/s per table limit and $0.01/200MB cost. Storage Write API provides exactly-once delivery, higher throughput, and lower cost. For new implementations, use Storage Write API for streaming ingestion.
Q5: How does BigQuery handle schema evolution?
Answer: BigQuery supports additive schema changes (adding new columns) without downtime. Use ALTER TABLE ADD COLUMN for manual evolution, or enable schema auto-detection for JSON/Avro files. Schema changes don't require rewriting existing data β new columns are added to the end of the table.