Models, Seeds, and Snapshots

Component Architecture

Data Flow Diagram

Snapshot Lifecycle

📋

Record 1001

Active → Inactive

2024-01-01 to 9999-12-31

✅

Record 1002

Active

2024-02-01 to 9999-12-31

⏳

Record 1003

Pending → Active

2024-03-01 to 9999-12-31

Detailed Explanation

dbt provides several core components for data transformation: models, seeds, snapshots, and sources. Understanding when and how to use each component is crucial for building robust data pipelines.

What are Models?

Models are the primary building blocks of dbt projects—SQL files (or Python scripts) that define transformations on your data.

Definition: SQL or Python files that define data transformations
Templating: Support Jinja templating for dynamic SQL generation
Organization: Organized into a DAG (Directed Acyclic Graph) based on dependencies
Materialization: Each model can have different materialization strategies (table, view, incremental, ephemeral)
Testing: Support testing and documentation

What are Seeds?

Seeds are CSV files that dbt can load into your data warehouse for static or configuration data.

Use Case	Example
Static lookup tables	Country codes, product categories
Configuration data	Queryable settings and parameters
Small datasets	Data that complements main sources

Storage: CSV files stored in the seeds/ directory
Loading: Loaded as tables in your warehouse
Testing: Support dbt tests (uniqueness, not_null, relationships)
Referencing: Can be referenced in models using {{ ref('seed_name') }}

What are Snapshots?

Snapshots capture the historical state of data over time, implementing Slowly Changing Dimension (SCD) Type 2 logic.

Change Tracking: Track changes in source data over time
Timestamps: Maintain valid_from and valid_to timestamps
Current Row: Support current row identification
Detection Methods: Use either timestamp-based or check-based change detection
Storage: Stored in the snapshots/ directory

What are Sources?

Sources are declarations that tell dbt about your existing tables in the warehouse, providing metadata about external data.

Declaration: Declared in YAML files under sources:
Freshness: Enable source freshness checks
Documentation: Provide documentation for upstream tables
Testing: Support testing at the source level
Lineage: Track data lineage from source to mart

Key Takeaway: Each component serves a specific purpose—models transform data, seeds load static data, snapshots track history, and sources document external data.

Code Examples

Model Materialization Patterns

-- models/staging/stg_orders.sql
{{
    config(
        materialized='view',
        schema='staging'
    )
}}

with source as (
    select * from {{ source('raw', 'orders') }}
),

renamed as (
    select
        id as order_id,
        customer_id,
        status,
        created_at as order_date,
        updated_at as modified_date,
        _fivetran_synced as synced_at
    from source
)

select * from renamed

Incremental Model with Merge Strategy

-- models/marts/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge',
        partition_by={
            "field": "order_date",
            "data_type": "timestamp"
        },
        cluster_by=['customer_id', 'order_status']
    )
}}

with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('dim_customers') }}
),

final as (
    select
        orders.order_id,
        orders.customer_id,
        customers.customer_name,
        orders.order_date,
        orders.status as order_status,
        orders.amount,
        current_timestamp() as dbt_updated_at
    from orders
    left join customers on orders.customer_id = customers.customer_id
)

select * from final

{% if is_incremental() %}
where dbt_updated_at > (select max(dbt_updated_at) from {{ this }})
{% endif %}

Snapshot Definition

-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
        invalidate_hard_deletes=True
    )
}}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}

Seed Configuration

# seeds/country_codes.yml
version: 2

seeds:
  - name: country_codes
    description: "ISO country codes for geo analysis"
    config:
      column_types:
        country_code: varchar(2)
        country_name: varchar(100)
        continent: varchar(50)
        region: varchar(50)
    columns:
      - name: country_code
        description: "ISO 3166-1 alpha-2 country code"
        data_tests:
          - unique
          - not_null
      - name: country_name
        description: "Full country name"
        data_tests:
          - not_null

Source Freshness Configuration

# models/staging/_sources.yml
version: 2

sources:
  - name: shopify
    database: raw
    schema: shopify
    loader: fivetran
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
        description: "Shopify orders"
        data_tests:
          - unique:
              column_name: id
          - not_null:
              column_name: id
      - name: customers
        description: "Shopify customers"
        columns:
          - name: id
            data_tests:
              - unique
              - not_null
          - name: email
            data_tests:
              - not_null

Performance Metrics

Component	Load Time	Storage	Query Performance	Use Case
Table	Medium	High	Excellent	Full refresh
View	Fast	Low	Good	Simple transforms
Incremental	Variable	Medium	Excellent	Large datasets
Ephemeral	N/A	None	Varies	Reusable CTEs
Seed	Fast	Low	Good	Static data
Snapshot	Medium	High	Good	Historical tracking

Best Practices

Use staging models for 1:1 mappings with source tables
Keep models small and focused on single responsibilities
Use ephemeral models for intermediate transformations that don't need persistence
Implement incremental models for large fact tables
Use snapshots for SCD Type 2 requirements
Load seeds for static lookup data
Always add tests to models and sources
Document everything with descriptions in YAML files

Models, Seeds, and Snapshots

Models, Seeds, and Snapshots

Component Architecture

Data Flow Diagram

Snapshot Lifecycle

Detailed Explanation

What are Models?

What are Seeds?

What are Snapshots?

What are Sources?

Code Examples

Model Materialization Patterns

Incremental Model with Merge Strategy

Snapshot Definition

Seed Configuration

Source Freshness Configuration

Performance Metrics

Best Practices

See Also

Premium Content

Need Expert dbt Help?