ELT Architecture on GCP
Loading Data into BigQuery
External Tables (Schema on Read)
-- External table for Parquet files in GCS
CREATE OR REPLACE EXTERNAL TABLE `project.bronze.sales_external`
OPTIONS (
format = 'PARQUET',
uris = ['gs://my-data-lake/bronze/sales/**/*.parquet']
);
-- External table for JSON files
CREATE OR REPLACE EXTERNAL TABLE `project.bronze.events_external`
OPTIONS (
format = 'JSON',
uris = ['gs://my-data-lake/bronze/events/*.json.gz'],
max_bad_records = 100
);
-- Query external table directly
SELECT
product_category,
SUM(amount) as total_revenue
FROM `project.bronze.sales_external`
WHERE order_date >= '2025-01-01'
GROUP BY 1;
Load Jobs (Physical Copy)
from google.cloud import bigquery
client = bigquery.Client()
# Load CSV from GCS to BigQuery
job_config = bigquery.LoadJobConfig(
source_format=bigquery.SourceFormat.CSV,
skip_leading_rows=1,
autodetect=True,
write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
time_partitioning=bigquery.TimePartitioning(
type_=bigquery.TimePartitioningType.DAY,
field="order_date"
),
clustering_fields=["product_category", "region"]
)
load_job = client.load_table_from_uri(
'gs://my-data-lake/bronze/sales/*.csv',
'project.bronze.sales',
job_config=job_config
)
load_job.result()
print(f"Loaded {load_job.output_rows} rows")
Materialized Views
-- Materialized view with auto-refresh
CREATE MATERIALIZED VIEW `project.analytics.sales_summary_mv`
PARTITION BY DATE(order_date)
CLUSTER BY product_category
AS
SELECT
DATE(order_date) as order_date,
product_category,
COUNT(*) as order_count,
SUM(amount) as total_revenue,
AVG(amount) as avg_order_value,
COUNT(DISTINCT customer_id) as unique_customers
FROM `project.bronze.sales`
WHERE status = 'completed'
GROUP BY 1, 2;
-- Refresh materialized view
ALTER MATERIALIZED VIEW `project.analytics.sales_summary_mv`
REFRESH;
✨
Best Practice: Use external tables for flexible exploration, materialized views for performance-critical dashboards, and physical tables for complex transformations. BigQuery automatically rewrites queries to use materialized views when beneficial.
dbt Integration
-- dbt model: stg_sales.sql
WITH source AS (
SELECT * FROM {{ source('bronze', 'sales') }}
),
transformed AS (
SELECT
order_id,
customer_id,
product_category,
amount,
order_date,
CURRENT_TIMESTAMP() as loaded_at
FROM source
WHERE amount > 0
)
SELECT * FROM transformed
# dbt/models/schema.yml
version: 2
models:
- name: stg_sales
description: "Staged sales data"
columns:
- name: order_id
description: "Primary key"
tests:
- unique
- not_null
- name: amount
description: "Order amount"
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
Common Interview Questions
Q1: What is the difference between ETL and ELT?
Answer: ETL transforms data before loading into the target (Transform → Load). ELT loads raw data first, then transforms within the target (Load → Transform). ELT is preferred with BigQuery because it separates storage and compute, allowing cheap storage and on-demand transformation.
Q2: When would you use external tables vs. physical tables?
Answer: External tables for flexible exploration without data duplication, ad-hoc queries, and schema-on-read scenarios. Physical tables for performance-critical workloads, complex transformations, and when you need partitioning/clustering optimization. External tables have query costs but no storage costs.
Q3: How do materialized views improve performance?
Answer: Materialized views pre-compute and store query results, reducing query execution time. BigQuery automatically rewrites queries to use materialized views when beneficial. They're refreshed incrementally, minimizing compute costs. Ideal for dashboards and frequently-run aggregations.
Q4: How does dbt integrate with BigQuery?
Answer: dbt connects via the BigQuery adapter, using BigQuery's SQL dialect. Models are written in SQL with Jinja templating. dbt handles dependencies, materializations, and testing. Use dbt for version-controlled transformations with built-in documentation and lineage.
Q5: What are the cost implications of ELT on BigQuery?
Answer: 1) External tables: no storage cost, query cost only, 2) Physical tables: storage + query cost, 3) Materialized views: refresh cost + storage, but faster queries, 4) On-demand pricing: $5/TB scanned, 5) Flat-rate: committed slots for predictable costs.