dbt with BigQuery
BigQuery Architecture
Partitioning Strategy
Clustering Strategy
Detailed Explanation
What is BigQuery + dbt?
BigQuery is a serverless, highly scalable data warehouse that integrates seamlessly with dbt for data transformation.
Core Features
| Feature | Description |
|---|---|
| Native Integration | Direct SQL execution |
| Partitioning | Time and integer-based partitioning |
| Clustering | Column-based data organization |
| Materialized Views | Pre-computed aggregations |
| Slot Management | Compute resource allocation |
Partitioning Strategies
BigQuery supports several partitioning strategies:
- Day-partitioned β Partition by date column
- Hour-partitioned β Partition by timestamp (hour granularity)
- Integer-range β Partition by integer column
- Ingestion-time β Partition by data insertion time
Clustering
Clustering organizes data within partitions for better performance:
- Up to 4 clustering columns
- Automatic re-clustering
- Improved query performance
- Reduced costs
Cost Optimization
| Pricing Factor | Rate |
|---|---|
| Storage | $0.02/GB/month |
| Query | $5/TB scanned |
| Streaming | $0.01/200MB |
| Slots | Variable pricing |
Key Takeaway: Partitioning and clustering together can reduce query costs by up to 99% and improve performance by 95%.
Code Examples
BigQuery Profile Configuration
# profiles.yml
my_profile:
target: dev
outputs:
dev:
type: bigquery
method: oauth
project: my-gcp-project
dataset: dbt_dev
threads: 8
timeout_seconds: 300
location: US
priority: interactive
retries: 1
maximum_bytes_billed: 1000000000 # 1GB limit
prod:
type: bigquery
method: service-account
project: my-gcp-project
dataset: analytics_prod
threads: 16
timeout_seconds: 600
location: US
priority: batch
retries: 3
maximum_bytes_billed: 100000000000 # 100GB limit
Partitioned Model
-- models/marts/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
partition_by={
"field": "order_date",
"data_type": "date",
"granularity": "day"
},
cluster_by=['customer_id', 'order_status', 'product_category'],
incremental_strategy='merge'
)
}}
with orders as (
select * from {{ ref('stg_orders') }}
),
final as (
select
order_id,
customer_id,
order_date,
order_status,
product_category,
amount,
current_timestamp() as updated_at
from orders
)
select * from final
{% if is_incremental() %}
where order_date >= date_sub(
(select max(order_date) from {{ this }}),
interval 7 day
)
{% endif %}
Materialized View
-- models/marts/mv_daily_revenue.sql
{{
config(
materialized='materialized_view',
partition_by={
"field": "order_date",
"data_type": "date"
},
enable_refresh=true,
refresh_interval_minutes=60
)
}}
with daily_revenue as (
select
order_date,
product_category,
sum(amount) as total_revenue,
count(*) as order_count
from {{ ref('fct_orders') }}
group by 1, 2
)
select * from daily_revenue
External Table
-- models/staging/stg_external_data.sql
{{
config(
materialized='view'
)
}}
with external_data as (
select * from {{ source('external', 'gcs_data') }}
)
select * from external_data
BigQuery-Specific Macros
-- macros/bigquery/safe_divide.sql
{% macro safe_divide(numerator, denominator) %}
safe_divide({{ numerator }}, {{ denominator }})
{% endmacro %}
-- macros/bigquery/array_agg.sql
{% macro array_agg(column_name) %}
array_agg({{ column_name }} ignore nulls)
{% endmacro %}
-- macros/bigquery/struct.sql
{% macro make_struct(columns) %}
struct({{ columns | join(', ') }})
{% endmacro %}
Incremental with BigQuery Optimizations
-- models/marts/fct_events.sql
{{
config(
materialized='incremental',
unique_key='event_id',
partition_by={
"field": "event_timestamp",
"data_type": "timestamp",
"granularity": "hour"
},
cluster_by=['user_id', 'event_type'],
incremental_strategy='merge',
post_hook=[
"analyze table {{ this }} compute statistics for all columns"
]
)
}}
with events as (
select * from {{ ref('stg_events') }}
),
final as (
select
event_id,
user_id,
event_type,
event_timestamp,
event_properties,
current_timestamp() as updated_at
from events
)
select * from final
{% if is_incremental() %}
where event_timestamp >= timestamp_sub(
(select max(event_timestamp) from {{ this }}),
interval 3 day
)
{% endif %}
Performance Metrics
| Metric | Without Optimization | With Optimization | Improvement |
|---|---|---|---|
| Query Time | 10s | 0.5s | 95% faster |
| Cost | 0.01 | 99.8% cheaper | |
| Storage | 100GB | 100GB | Same |
| Scan | 100M rows | 274K rows | 99.7% less |
Best Practices
- Always partition large tables by date
- Cluster by query patterns - filter/join columns
- Use incremental models - minimize data scanned
- Partition expiration - manage storage costs
- Use materialized views - for frequent aggregations
- Monitor slot usage - optimize compute resources
- Use dry runs - estimate query costs
- Leverage caching - reduce repeated queries
See Also
- Performance Tuning β General optimization strategies
- dbt Snowflake β Snowflake warehouse optimization
- dbt Redshift β Redshift distribution strategies
- dbt Best Practices β Project structure and patterns