Models, Seeds, and Snapshots
Component Architecture
Data Flow Diagram
Snapshot Lifecycle
Record 1001
Active β Inactive
2024-01-01 to 9999-12-31
Record 1002
Active
2024-02-01 to 9999-12-31
Record 1003
Pending β Active
2024-03-01 to 9999-12-31
Detailed Explanation
dbt provides several core components for data transformation: models, seeds, snapshots, and sources. Understanding when and how to use each component is crucial for building robust data pipelines.
What are Models?
Models are the primary building blocks of dbt projectsβSQL files (or Python scripts) that define transformations on your data.
- Definition: SQL or Python files that define data transformations
- Templating: Support Jinja templating for dynamic SQL generation
- Organization: Organized into a DAG (Directed Acyclic Graph) based on dependencies
- Materialization: Each model can have different materialization strategies (table, view, incremental, ephemeral)
- Testing: Support testing and documentation
What are Seeds?
Seeds are CSV files that dbt can load into your data warehouse for static or configuration data.
| Use Case | Example |
|---|---|
| Static lookup tables | Country codes, product categories |
| Configuration data | Queryable settings and parameters |
| Small datasets | Data that complements main sources |
- Storage: CSV files stored in the
seeds/directory - Loading: Loaded as tables in your warehouse
- Testing: Support dbt tests (uniqueness, not_null, relationships)
- Referencing: Can be referenced in models using
{{ ref('seed_name') }}
What are Snapshots?
Snapshots capture the historical state of data over time, implementing Slowly Changing Dimension (SCD) Type 2 logic.
- Change Tracking: Track changes in source data over time
- Timestamps: Maintain valid_from and valid_to timestamps
- Current Row: Support current row identification
- Detection Methods: Use either timestamp-based or check-based change detection
- Storage: Stored in the
snapshots/directory
What are Sources?
Sources are declarations that tell dbt about your existing tables in the warehouse, providing metadata about external data.
- Declaration: Declared in YAML files under
sources: - Freshness: Enable source freshness checks
- Documentation: Provide documentation for upstream tables
- Testing: Support testing at the source level
- Lineage: Track data lineage from source to mart
Key Takeaway: Each component serves a specific purposeβmodels transform data, seeds load static data, snapshots track history, and sources document external data.
Code Examples
Model Materialization Patterns
-- models/staging/stg_orders.sql
{{
config(
materialized='view',
schema='staging'
)
}}
with source as (
select * from {{ source('raw', 'orders') }}
),
renamed as (
select
id as order_id,
customer_id,
status,
created_at as order_date,
updated_at as modified_date,
_fivetran_synced as synced_at
from source
)
select * from renamed
Incremental Model with Merge Strategy
-- models/marts/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
partition_by={
"field": "order_date",
"data_type": "timestamp"
},
cluster_by=['customer_id', 'order_status']
)
}}
with orders as (
select * from {{ ref('stg_orders') }}
),
customers as (
select * from {{ ref('dim_customers') }}
),
final as (
select
orders.order_id,
orders.customer_id,
customers.customer_name,
orders.order_date,
orders.status as order_status,
orders.amount,
current_timestamp() as dbt_updated_at
from orders
left join customers on orders.customer_id = customers.customer_id
)
select * from final
{% if is_incremental() %}
where dbt_updated_at > (select max(dbt_updated_at) from {{ this }})
{% endif %}
Snapshot Definition
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
invalidate_hard_deletes=True
)
}}
select * from {{ source('raw', 'customers') }}
{% endsnapshot %}
Seed Configuration
# seeds/country_codes.yml
version: 2
seeds:
- name: country_codes
description: "ISO country codes for geo analysis"
config:
column_types:
country_code: varchar(2)
country_name: varchar(100)
continent: varchar(50)
region: varchar(50)
columns:
- name: country_code
description: "ISO 3166-1 alpha-2 country code"
data_tests:
- unique
- not_null
- name: country_name
description: "Full country name"
data_tests:
- not_null
Source Freshness Configuration
# models/staging/_sources.yml
version: 2
sources:
- name: shopify
database: raw
schema: shopify
loader: fivetran
loaded_at_field: _fivetran_synced
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: orders
description: "Shopify orders"
data_tests:
- unique:
column_name: id
- not_null:
column_name: id
- name: customers
description: "Shopify customers"
columns:
- name: id
data_tests:
- unique
- not_null
- name: email
data_tests:
- not_null
Performance Metrics
| Component | Load Time | Storage | Query Performance | Use Case |
|---|---|---|---|---|
| Table | Medium | High | Excellent | Full refresh |
| View | Fast | Low | Good | Simple transforms |
| Incremental | Variable | Medium | Excellent | Large datasets |
| Ephemeral | N/A | None | Varies | Reusable CTEs |
| Seed | Fast | Low | Good | Static data |
| Snapshot | Medium | High | Good | Historical tracking |
Best Practices
- Use staging models for 1:1 mappings with source tables
- Keep models small and focused on single responsibilities
- Use ephemeral models for intermediate transformations that don't need persistence
- Implement incremental models for large fact tables
- Use snapshots for SCD Type 2 requirements
- Load seeds for static lookup data
- Always add tests to models and sources
- Document everything with descriptions in YAML files
See Also
- dbt Core Architecture β Project structure and compilation pipeline
- The ref() Function β Model reference resolution and dependency management
- Incremental Models β Merge strategies and performance optimization
- dbt Testing Framework β Schema tests, data tests, and custom validations
- Snowflake Data Loading β Loading data into Snowflake warehouses