ELT Pipeline Patterns on GCP

Master ELT patterns including GCS to BigQuery loading, external tables, materialized views, and dbt integration on GCP.

18 min readAdvanced

ELT Architecture on GCP

📊 BigQuery Architecture for Data Engineering

Interview Tip: BigQuery separates storage and compute. Queries are charged by slots (compute) + bytes scanned. Always partition and cluster tables to reduce costs.

Loading Data into BigQuery

External Tables (Schema on Read)

-- External table for Parquet files in GCS
CREATE OR REPLACE EXTERNAL TABLE `project.bronze.sales_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-data-lake/bronze/sales/**/*.parquet']
);

-- External table for JSON files
CREATE OR REPLACE EXTERNAL TABLE `project.bronze.events_external`
OPTIONS (
  format = 'JSON',
  uris = ['gs://my-data-lake/bronze/events/*.json.gz'],
  max_bad_records = 100
);

-- Query external table directly
SELECT
  product_category,
  SUM(amount) as total_revenue
FROM `project.bronze.sales_external`
WHERE order_date >= '2025-01-01'
GROUP BY 1;

Load Jobs (Physical Copy)

from google.cloud import bigquery

client = bigquery.Client()

# Load CSV from GCS to BigQuery
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="order_date"
    ),
    clustering_fields=["product_category", "region"]
)

load_job = client.load_table_from_uri(
    'gs://my-data-lake/bronze/sales/*.csv',
    'project.bronze.sales',
    job_config=job_config
)

load_job.result()
print(f"Loaded {load_job.output_rows} rows")

Materialized Views

-- Materialized view with auto-refresh
CREATE MATERIALIZED VIEW `project.analytics.sales_summary_mv`
PARTITION BY DATE(order_date)
CLUSTER BY product_category
AS
SELECT
  DATE(order_date) as order_date,
  product_category,
  COUNT(*) as order_count,
  SUM(amount) as total_revenue,
  AVG(amount) as avg_order_value,
  COUNT(DISTINCT customer_id) as unique_customers
FROM `project.bronze.sales`
WHERE status = 'completed'
GROUP BY 1, 2;

-- Refresh materialized view
ALTER MATERIALIZED VIEW `project.analytics.sales_summary_mv` 
REFRESH;

✨

Best Practice: Use external tables for flexible exploration, materialized views for performance-critical dashboards, and physical tables for complex transformations. BigQuery automatically rewrites queries to use materialized views when beneficial.

dbt Integration

-- dbt model: stg_sales.sql
WITH source AS (
    SELECT * FROM {{ source('bronze', 'sales') }}
),

transformed AS (
    SELECT
        order_id,
        customer_id,
        product_category,
        amount,
        order_date,
        CURRENT_TIMESTAMP() as loaded_at
    FROM source
    WHERE amount > 0
)

SELECT * FROM transformed

# dbt/models/schema.yml
version: 2
models:
  - name: stg_sales
    description: "Staged sales data"
    columns:
      - name: order_id
        description: "Primary key"
        tests:
          - unique
          - not_null
      - name: amount
        description: "Order amount"
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0

💬

Common Interview Questions

Q1: What is the difference between ETL and ELT?

Answer: ETL transforms data before loading into the target (Transform → Load). ELT loads raw data first, then transforms within the target (Load → Transform). ELT is preferred with BigQuery because it separates storage and compute, allowing cheap storage and on-demand transformation.

Q2: When would you use external tables vs. physical tables?

Answer: External tables for flexible exploration without data duplication, ad-hoc queries, and schema-on-read scenarios. Physical tables for performance-critical workloads, complex transformations, and when you need partitioning/clustering optimization. External tables have query costs but no storage costs.

Q3: How do materialized views improve performance?

Answer: Materialized views pre-compute and store query results, reducing query execution time. BigQuery automatically rewrites queries to use materialized views when beneficial. They're refreshed incrementally, minimizing compute costs. Ideal for dashboards and frequently-run aggregations.

Q4: How does dbt integrate with BigQuery?

Answer: dbt connects via the BigQuery adapter, using BigQuery's SQL dialect. Models are written in SQL with Jinja templating. dbt handles dependencies, materializations, and testing. Use dbt for version-controlled transformations with built-in documentation and lineage.

Q5: What are the cost implications of ELT on BigQuery?

Answer: 1) External tables: no storage cost, query cost only, 2) Physical tables: storage + query cost, 3) Materialized views: refresh cost + storage, but faster queries, 4) On-demand pricing: $5/TB scanned, 5) Flat-rate: committed slots for predictable costs.

ELT Pipeline Patterns: GCS → BigQuery, External Tables & Materialized Views