dbt with BigQuery

BigQuery Architecture

Partitioning Strategy

Clustering Strategy

Detailed Explanation

What is BigQuery + dbt?

BigQuery is a serverless, highly scalable data warehouse that integrates seamlessly with dbt for data transformation.

Core Features

Feature	Description
Native Integration	Direct SQL execution
Partitioning	Time and integer-based partitioning
Clustering	Column-based data organization
Materialized Views	Pre-computed aggregations
Slot Management	Compute resource allocation

Partitioning Strategies

BigQuery supports several partitioning strategies:

Day-partitioned — Partition by date column
Hour-partitioned — Partition by timestamp (hour granularity)
Integer-range — Partition by integer column
Ingestion-time — Partition by data insertion time

Clustering

Clustering organizes data within partitions for better performance:

Up to 4 clustering columns
Automatic re-clustering
Improved query performance
Reduced costs

Cost Optimization

Pricing Factor	Rate
Storage	$0.02/GB/month
Query	$5/TB scanned
Streaming	$0.01/200MB
Slots	Variable pricing

Key Takeaway: Partitioning and clustering together can reduce query costs by up to 99% and improve performance by 95%.

Code Examples

BigQuery Profile Configuration

# profiles.yml
my_profile:
  target: dev
  
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: dbt_dev
      threads: 8
      timeout_seconds: 300
      location: US
      priority: interactive
      retries: 1
      maximum_bytes_billed: 1000000000  # 1GB limit
    
    prod:
      type: bigquery
      method: service-account
      project: my-gcp-project
      dataset: analytics_prod
      threads: 16
      timeout_seconds: 600
      location: US
      priority: batch
      retries: 3
      maximum_bytes_billed: 100000000000  # 100GB limit

Partitioned Model

-- models/marts/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        partition_by={
            "field": "order_date",
            "data_type": "date",
            "granularity": "day"
        },
        cluster_by=['customer_id', 'order_status', 'product_category'],
        incremental_strategy='merge'
    )
}}

with orders as (
    select * from {{ ref('stg_orders') }}
),

final as (
    select
        order_id,
        customer_id,
        order_date,
        order_status,
        product_category,
        amount,
        current_timestamp() as updated_at
    from orders
)

select * from final

{% if is_incremental() %}
where order_date >= date_sub(
    (select max(order_date) from {{ this }}),
    interval 7 day
)
{% endif %}

Materialized View

-- models/marts/mv_daily_revenue.sql
{{
    config(
        materialized='materialized_view',
        partition_by={
            "field": "order_date",
            "data_type": "date"
        },
        enable_refresh=true,
        refresh_interval_minutes=60
    )
}}

with daily_revenue as (
    select
        order_date,
        product_category,
        sum(amount) as total_revenue,
        count(*) as order_count
    from {{ ref('fct_orders') }}
    group by 1, 2
)

select * from daily_revenue

External Table

-- models/staging/stg_external_data.sql
{{
    config(
        materialized='view'
    )
}}

with external_data as (
    select * from {{ source('external', 'gcs_data') }}
)

select * from external_data

BigQuery-Specific Macros

-- macros/bigquery/safe_divide.sql
{% macro safe_divide(numerator, denominator) %}
    safe_divide({{ numerator }}, {{ denominator }})
{% endmacro %}

-- macros/bigquery/array_agg.sql
{% macro array_agg(column_name) %}
    array_agg({{ column_name }} ignore nulls)
{% endmacro %}

-- macros/bigquery/struct.sql
{% macro make_struct(columns) %}
    struct({{ columns | join(', ') }})
{% endmacro %}

Incremental with BigQuery Optimizations

-- models/marts/fct_events.sql
{{
    config(
        materialized='incremental',
        unique_key='event_id',
        partition_by={
            "field": "event_timestamp",
            "data_type": "timestamp",
            "granularity": "hour"
        },
        cluster_by=['user_id', 'event_type'],
        incremental_strategy='merge',
        post_hook=[
            "analyze table {{ this }} compute statistics for all columns"
        ]
    )
}}

with events as (
    select * from {{ ref('stg_events') }}
),

final as (
    select
        event_id,
        user_id,
        event_type,
        event_timestamp,
        event_properties,
        current_timestamp() as updated_at
    from events
)

select * from final

{% if is_incremental() %}
where event_timestamp >= timestamp_sub(
    (select max(event_timestamp) from {{ this }}),
    interval 3 day
)
{% endif %}

Performance Metrics

Metric	Without Optimization	With Optimization	Improvement
Query Time	10s	0.5s	95% faster
Cost	$5.00 \|$ 0.01	99.8% cheaper
Storage	100GB	100GB	Same
Scan	100M rows	274K rows	99.7% less

Best Practices

Always partition large tables by date
Cluster by query patterns - filter/join columns
Use incremental models - minimize data scanned
Partition expiration - manage storage costs
Use materialized views - for frequent aggregations
Monitor slot usage - optimize compute resources
Use dry runs - estimate query costs
Leverage caching - reduce repeated queries

dbt with BigQuery

dbt with BigQuery

BigQuery Architecture

Partitioning Strategy

Clustering Strategy

Detailed Explanation

What is BigQuery + dbt?

Core Features

Partitioning Strategies

Clustering

Cost Optimization

Code Examples

BigQuery Profile Configuration

Partitioned Model

Materialized View

External Table

BigQuery-Specific Macros

Incremental with BigQuery Optimizations

Performance Metrics

Best Practices

See Also

Premium Content

Need Expert dbt Help?