Incremental Models
Incremental Architecture
Merge Strategy Flow
Partition Strategy
Detailed Explanation
Incremental models are one of dbt's most powerful features for optimizing performance and reducing costs. They allow you to process only new or changed data rather than rebuilding entire tables.
How do Incremental Models work?
- First Run: The model executes as a full table build
- Subsequent Runs: Only new/changed data is processed
- Merge Logic: New data is merged into existing table
- Partition Pruning: Queries are optimized with partition filters
What are the Incremental Strategies?
| Strategy | Use Case | SQL Pattern |
|---|---|---|
| Merge | Upsert operations | merge into target using source... |
| Append | Add new rows only | insert into target select... |
| Delete+Insert | Replace partitions | delete from target... insert into target... |
Merge Strategy: The most common strategy for upsert operations:
merge into target using source on target.id = source.id
when matched then update set ...
when not matched then insert ...
Append Strategy: Simple append of new rows without updates:
insert into target select * from source
where source.id not in (select id from target)
Delete+Insert Strategy: Replace specific partitions or segments:
delete from target where date = '2024-01-01'
insert into target select * from source where date = '2024-01-01'
How does Partition Management work?
Partitioning is crucial for incremental performance:
- Date-based partitioning: Most common, partition by date
- Timestamp-based partitioning: Higher granularity for large datasets
- Integer-based partitioning: For numeric keys
- Clustered partitioning: Combined partitioning and clustering
What is Watermark Detection?
dbt uses watermark detection to identify new data:
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
What are the Performance Considerations?
- Partition pruning: Always filter on partition columns
- Cluster alignment: Cluster by frequently filtered columns
- Batch processing: Process data in appropriate batch sizes
- Resource allocation: Use appropriate warehouse sizes
Key Takeaway: Incremental models optimize performance by processing only new or changed data, with strategies like merge, append, and delete+insert to handle different use cases.
Code Examples
Basic Incremental Model
-- models/marts/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge'
)
}}
with orders as (
select * from {{ ref('stg_orders') }}
),
final as (
select
order_id,
customer_id,
order_date,
status,
amount,
current_timestamp() as updated_at
from orders
)
select * from final
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
Partitioned Incremental Model (BigQuery)
-- models/marts/fct_events.sql
{{
config(
materialized='incremental',
unique_key='event_id',
partition_by={
"field": "event_date",
"data_type": "date",
"granularity": "day"
},
cluster_by=['user_id', 'event_type'],
incremental_strategy='merge'
)
}}
with events as (
select * from {{ ref('stg_events') }}
),
final as (
select
event_id,
user_id,
event_type,
event_timestamp,
cast(event_timestamp as date) as event_date,
event_properties,
current_timestamp() as updated_at
from events
)
select * from final
{% if is_incremental() %}
where event_date >= date_sub(
(select max(event_date) from {{ this }}),
interval 3 day
)
{% endif %}
Append-Only Incremental Model
-- models/marts/fct_events_append.sql
{{
config(
materialized='incremental',
incremental_strategy='append',
unique_key='event_id'
)
}}
with events as (
select
event_id,
user_id,
event_type,
event_timestamp,
event_properties
from {{ ref('stg_events') }}
)
select * from events
{% if is_incremental() %}
where event_id not in (select event_id from {{ this }})
{% endif %}
Delete+Insert Strategy
-- models/marts/fct_daily_metrics.sql
{{
config(
materialized='incremental',
incremental_strategy='delete+insert',
unique_key=['metric_date', 'metric_name']
)
}}
with daily_metrics as (
select
cast(metric_timestamp as date) as metric_date,
metric_name,
metric_value,
current_timestamp() as updated_at
from {{ ref('stg_metrics') }}
)
select * from daily_metrics
{% if is_incremental() %}
where metric_date >= date_sub(
(select max(metric_date) from {{ this }}),
interval 7 day
)
{% endif %}
Advanced Incremental with Macro
-- models/marts/fct_orders_advanced.sql
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
partition_by={
"field": "order_date",
"data_type": "date",
"granularity": "day"
},
cluster_by=['customer_id', 'order_status']
)
}}
{% macro get_incremental_filter() %}
{% if is_incremental() %}
where order_date >= date_sub(
(select max(order_date) from {{ this }}),
interval {{ var('incremental_lookback_days', 7) }} day
)
{% endif %}
{% endmacro %}
with orders as (
select * from {{ ref('stg_orders') }}
),
customers as (
select * from {{ ref('dim_customers') }}
),
final as (
select
orders.order_id,
orders.customer_id,
customers.customer_name,
customers.segment,
orders.order_date,
orders.status as order_status,
orders.amount,
current_timestamp() as updated_at
from orders
left join customers on orders.customer_id = customers.customer_id
{{ get_incremental_filter() }}
)
select * from final
Performance Metrics
| Metric | Full Refresh | Incremental | Improvement |
|---|---|---|---|
| Processing Time | 100% | 10-30% | 70-90% |
| Cost | 100% | 20-40% | 60-80% |
| Data Scanned | 100% | 5-20% | 80-95% |
| Warehouse Size | Large | Medium | 50% reduction |
| Parallelism | Low | High | 2-3x improvement |
Best Practices
- Always use unique_key for merge strategies
- Partition large tables for efficient pruning
- Cluster by frequently filtered columns
- Use appropriate incremental lookback for late-arriving data
- Test incremental logic with both full and incremental runs
- Monitor partition counts to avoid excessive partitions
- Use delete+insert for simple date-based replacements
- Consider resource costs when choosing strategies
See Also
- Materializations in dbt β Table, view, incremental, and ephemeral strategies
- Models, Seeds, and Snapshots β Core dbt components for data transformation
- dbt Core Architecture β Manifest, DAG, and compilation pipeline
- PySpark Performance Optimization β Partitioning and optimization in Spark
- Snowflake Query Performance β Clustering, partitioning, and warehouse optimization