Data Warehouse: BigQuery Deep Dive

Master BigQuery data warehouse including schema design, partitioning, clustering, materialized views, and performance optimization.

22 min readAdvanced

BigQuery Data Warehouse Architecture

🏗️ GCP Data Engineering Reference Architecture

Interview Tip: GCP's data engineering stack is serverless-first. Dataflow (Apache Beam) handles both streaming and batch. BigQuery is the flagship analytics service.

Schema Design Patterns

Star Schema

-- Fact table: Sales transactions
CREATE TABLE `project.warehouse.fact_sales`
(
  sale_key INT64 OPTIONS(description="Surrogate key"),
  order_id STRING OPTIONS(description="Natural key"),
  customer_key INT64 OPTIONS(description="FK to dim_customer"),
  product_key INT64 OPTIONS(description="FK to dim_product"),
  date_key INT64 OPTIONS(description="FK to dim_date"),
  store_key INT64 OPTIONS(description="FK to dim_store"),
  quantity INT64,
  unit_price FLOAT64,
  discount_amount FLOAT64,
  tax_amount FLOAT64,
  total_amount FLOAT64,
  sale_timestamp TIMESTAMP,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
PARTITION BY DATE(sale_timestamp)
CLUSTER BY customer_key, product_key
OPTIONS(
  description="Sales fact table partitioned by date",
  require_partition_filter = true
);

-- Dimension table: Customers
CREATE TABLE `project.warehouse.dim_customer`
(
  customer_key INT64 OPTIONS(description="Surrogate key"),
  customer_id STRING OPTIONS(description="Natural key"),
  customer_name STRING,
  email STRING,
  segment STRING,
  region STRING,
  country STRING,
  effective_date DATE,
  expiry_date DATE,
  is_current BOOLEAN DEFAULT TRUE
)
OPTIONS(
  description="SCD Type 2 customer dimension"
);

Partitioning and Clustering

-- Partition by date (recommended for time-series)
CREATE TABLE `project.warehouse.events_partitioned`
(
  event_id STRING,
  event_type STRING,
  user_id STRING,
  event_timestamp TIMESTAMP,
  payload JSON
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY event_type, user_id
OPTIONS (
  partition_expiration_days = 730,
  require_partition_filter = true
);

-- Clustering for high-cardinality columns
CREATE TABLE `project.warehouse.sales_clustered`
(
  order_id STRING,
  customer_id STRING,
  product_id STRING,
  amount FLOAT64,
  order_date TIMESTAMP
)
PARTITION BY DATE(order_date)
CLUSTER BY customer_id, product_id;

Performance Optimization

-- Materialized view for common aggregations
CREATE MATERIALIZED VIEW `project.warehouse.mv_daily_sales`
PARTITION BY DATE(day)
CLUSTER BY product_category
AS
SELECT
  DATE(order_date) as day,
  product_category,
  COUNT(*) as order_count,
  SUM(amount) as total_revenue,
  AVG(amount) as avg_order_value
FROM `project.warehouse.fact_sales`
JOIN `project.warehouse.dim_product` USING (product_key)
GROUP BY 1, 2;

-- BI Engine reservation
ALTER TABLE `project.warehouse.fact_sales`
SET OPTIONS (
  -- Reserve 10GB for BI Engine
);

✨

Best Practice: Always partition large tables by date. Cluster by frequently filtered/joined columns. Use materialized views for dashboard aggregations. Enable BI Engine for sub-second dashboard queries. Use require_partition_filter to prevent full-table scans.

💬

Common Interview Questions

Q1: What is the difference between partitioning and clustering?

Answer: Partitioning divides tables into segments (max 4,000), best for date/time data. Clustering sorts data within partitions (up to 4 fields), best for high-cardinality filtering. Use partitioning for date-range queries, clustering for multi-column filtering.

Q2: When would you use a star schema vs. snowflake schema?

Answer: Star schema denormalizes dimensions for faster joins and simpler queries. Snowflake normalizes dimensions to reduce redundancy. Use star schema for analytics (BigQuery), snowflake for operational systems. BigQuery's columnar storage makes denormalization efficient.

Q3: How do you implement SCD Type 2 in BigQuery?

Answer: Use MERGE statement to insert new rows and update existing rows. Set effective_date, expiry_date, and is_current columns. Use row_hash for change detection. Implement with scheduled SQL jobs or Dataform.

Q4: What is the purpose of require_partition_filter?

Answer: It forces queries to include a partition filter, preventing expensive full-table scans. Essential for large partitioned tables to control costs. Can be overridden with --allow_large_results flag for special cases.

Q5: How do you optimize BigQuery costs for a data warehouse?

Answer: 1) Partition tables by date, 2) Cluster by filtered columns, 3) Use materialized views, 4) Enable BI Engine for dashboards, 5) Use committed slots for predictable workloads, 6) Archive old data to Coldline, 7) Avoid SELECT *.

Data Warehouse: BigQuery, Partitioning, Clustering & Optimization