🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

ELT Pipeline Patterns: GCS → BigQuery, External Tables & Materialized Views

GCP Data EngineeringELT Pipelines⭐ Premium

Advertisement

ELT Pipeline Patterns on GCP

Master ELT patterns including GCS to BigQuery loading, external tables, materialized views, and dbt integration on GCP.

18 min readAdvanced

ELT Architecture on GCP

📊 BigQuery Architecture for Data Engineering
COLUMNAR STORAGE (Capacitor)Column 1Int64Column 2StringColumn 3Float64Column 4TimestampColumn 5JSONColumn N...QUERY ENGINE (Dremel)Tree ArchitectureDistributed executionSlot-basedAuto-scaling computeColumn pruningRead only needed columnsPredicate pushdownFilter earlyKEY FEATURESBI EngineIn-memory analyticsStreaming BufferReal-time insertsPartitioningTime-unit / IntegerClusteringAuto-sort columnsSLOT USAGEStandardShared slotsEnterpriseReserved slotsFlex SlotsPay per useAutoscaleDynamic allocation
Interview Tip: BigQuery separates storage and compute. Queries are charged by slots (compute) + bytes scanned. Always partition and cluster tables to reduce costs.

Loading Data into BigQuery

External Tables (Schema on Read)

-- External table for Parquet files in GCS
CREATE OR REPLACE EXTERNAL TABLE `project.bronze.sales_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-data-lake/bronze/sales/**/*.parquet']
);

-- External table for JSON files
CREATE OR REPLACE EXTERNAL TABLE `project.bronze.events_external`
OPTIONS (
  format = 'JSON',
  uris = ['gs://my-data-lake/bronze/events/*.json.gz'],
  max_bad_records = 100
);

-- Query external table directly
SELECT
  product_category,
  SUM(amount) as total_revenue
FROM `project.bronze.sales_external`
WHERE order_date >= '2025-01-01'
GROUP BY 1;

Load Jobs (Physical Copy)

from google.cloud import bigquery

client = bigquery.Client()

# Load CSV from GCS to BigQuery
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="order_date"
    ),
    clustering_fields=["product_category", "region"]
)

load_job = client.load_table_from_uri(
    'gs://my-data-lake/bronze/sales/*.csv',
    'project.bronze.sales',
    job_config=job_config
)

load_job.result()
print(f"Loaded {load_job.output_rows} rows")

Materialized Views

-- Materialized view with auto-refresh
CREATE MATERIALIZED VIEW `project.analytics.sales_summary_mv`
PARTITION BY DATE(order_date)
CLUSTER BY product_category
AS
SELECT
  DATE(order_date) as order_date,
  product_category,
  COUNT(*) as order_count,
  SUM(amount) as total_revenue,
  AVG(amount) as avg_order_value,
  COUNT(DISTINCT customer_id) as unique_customers
FROM `project.bronze.sales`
WHERE status = 'completed'
GROUP BY 1, 2;

-- Refresh materialized view
ALTER MATERIALIZED VIEW `project.analytics.sales_summary_mv` 
REFRESH;

Best Practice: Use external tables for flexible exploration, materialized views for performance-critical dashboards, and physical tables for complex transformations. BigQuery automatically rewrites queries to use materialized views when beneficial.

dbt Integration

-- dbt model: stg_sales.sql
WITH source AS (
    SELECT * FROM {{ source('bronze', 'sales') }}
),

transformed AS (
    SELECT
        order_id,
        customer_id,
        product_category,
        amount,
        order_date,
        CURRENT_TIMESTAMP() as loaded_at
    FROM source
    WHERE amount > 0
)

SELECT * FROM transformed
# dbt/models/schema.yml
version: 2
models:
  - name: stg_sales
    description: "Staged sales data"
    columns:
      - name: order_id
        description: "Primary key"
        tests:
          - unique
          - not_null
      - name: amount
        description: "Order amount"
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
💬

Common Interview Questions

Q1: What is the difference between ETL and ELT?

Answer: ETL transforms data before loading into the target (Transform → Load). ELT loads raw data first, then transforms within the target (Load → Transform). ELT is preferred with BigQuery because it separates storage and compute, allowing cheap storage and on-demand transformation.

Q2: When would you use external tables vs. physical tables?

Answer: External tables for flexible exploration without data duplication, ad-hoc queries, and schema-on-read scenarios. Physical tables for performance-critical workloads, complex transformations, and when you need partitioning/clustering optimization. External tables have query costs but no storage costs.

Q3: How do materialized views improve performance?

Answer: Materialized views pre-compute and store query results, reducing query execution time. BigQuery automatically rewrites queries to use materialized views when beneficial. They're refreshed incrementally, minimizing compute costs. Ideal for dashboards and frequently-run aggregations.

Q4: How does dbt integrate with BigQuery?

Answer: dbt connects via the BigQuery adapter, using BigQuery's SQL dialect. Models are written in SQL with Jinja templating. dbt handles dependencies, materializations, and testing. Use dbt for version-controlled transformations with built-in documentation and lineage.

Q5: What are the cost implications of ELT on BigQuery?

Answer: 1) External tables: no storage cost, query cost only, 2) Physical tables: storage + query cost, 3) Materialized views: refresh cost + storage, but faster queries, 4) On-demand pricing: $5/TB scanned, 5) Flat-rate: committed slots for predictable costs.

Advertisement