πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Models, Seeds, and Snapshots

🟒 Free Lesson

Advertisement

Models, Seeds, and Snapshots

Component Architecture

Data Flow Diagram

Snapshot Lifecycle

πŸ“‹

Record 1001

Active β†’ Inactive

2024-01-01 to 9999-12-31

βœ…

Record 1002

Active

2024-02-01 to 9999-12-31

⏳

Record 1003

Pending β†’ Active

2024-03-01 to 9999-12-31

Detailed Explanation

dbt provides several core components for data transformation: models, seeds, snapshots, and sources. Understanding when and how to use each component is crucial for building robust data pipelines.


What are Models?

Models are the primary building blocks of dbt projectsβ€”SQL files (or Python scripts) that define transformations on your data.

  • Definition: SQL or Python files that define data transformations
  • Templating: Support Jinja templating for dynamic SQL generation
  • Organization: Organized into a DAG (Directed Acyclic Graph) based on dependencies
  • Materialization: Each model can have different materialization strategies (table, view, incremental, ephemeral)
  • Testing: Support testing and documentation

What are Seeds?

Seeds are CSV files that dbt can load into your data warehouse for static or configuration data.

Use CaseExample
Static lookup tablesCountry codes, product categories
Configuration dataQueryable settings and parameters
Small datasetsData that complements main sources
  • Storage: CSV files stored in the seeds/ directory
  • Loading: Loaded as tables in your warehouse
  • Testing: Support dbt tests (uniqueness, not_null, relationships)
  • Referencing: Can be referenced in models using {{ ref('seed_name') }}

What are Snapshots?

Snapshots capture the historical state of data over time, implementing Slowly Changing Dimension (SCD) Type 2 logic.

  • Change Tracking: Track changes in source data over time
  • Timestamps: Maintain valid_from and valid_to timestamps
  • Current Row: Support current row identification
  • Detection Methods: Use either timestamp-based or check-based change detection
  • Storage: Stored in the snapshots/ directory

What are Sources?

Sources are declarations that tell dbt about your existing tables in the warehouse, providing metadata about external data.

  • Declaration: Declared in YAML files under sources:
  • Freshness: Enable source freshness checks
  • Documentation: Provide documentation for upstream tables
  • Testing: Support testing at the source level
  • Lineage: Track data lineage from source to mart

Key Takeaway: Each component serves a specific purposeβ€”models transform data, seeds load static data, snapshots track history, and sources document external data.

Code Examples

Model Materialization Patterns

-- models/staging/stg_orders.sql
{{
    config(
        materialized='view',
        schema='staging'
    )
}}

with source as (
    select * from {{ source('raw', 'orders') }}
),

renamed as (
    select
        id as order_id,
        customer_id,
        status,
        created_at as order_date,
        updated_at as modified_date,
        _fivetran_synced as synced_at
    from source
)

select * from renamed

Incremental Model with Merge Strategy

-- models/marts/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge',
        partition_by={
            "field": "order_date",
            "data_type": "timestamp"
        },
        cluster_by=['customer_id', 'order_status']
    )
}}

with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('dim_customers') }}
),

final as (
    select
        orders.order_id,
        orders.customer_id,
        customers.customer_name,
        orders.order_date,
        orders.status as order_status,
        orders.amount,
        current_timestamp() as dbt_updated_at
    from orders
    left join customers on orders.customer_id = customers.customer_id
)

select * from final

{% if is_incremental() %}
where dbt_updated_at > (select max(dbt_updated_at) from {{ this }})
{% endif %}

Snapshot Definition

-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
        invalidate_hard_deletes=True
    )
}}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}

Seed Configuration

# seeds/country_codes.yml
version: 2

seeds:
  - name: country_codes
    description: "ISO country codes for geo analysis"
    config:
      column_types:
        country_code: varchar(2)
        country_name: varchar(100)
        continent: varchar(50)
        region: varchar(50)
    columns:
      - name: country_code
        description: "ISO 3166-1 alpha-2 country code"
        data_tests:
          - unique
          - not_null
      - name: country_name
        description: "Full country name"
        data_tests:
          - not_null

Source Freshness Configuration

# models/staging/_sources.yml
version: 2

sources:
  - name: shopify
    database: raw
    schema: shopify
    loader: fivetran
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
        description: "Shopify orders"
        data_tests:
          - unique:
              column_name: id
          - not_null:
              column_name: id
      - name: customers
        description: "Shopify customers"
        columns:
          - name: id
            data_tests:
              - unique
              - not_null
          - name: email
            data_tests:
              - not_null

Performance Metrics

ComponentLoad TimeStorageQuery PerformanceUse Case
TableMediumHighExcellentFull refresh
ViewFastLowGoodSimple transforms
IncrementalVariableMediumExcellentLarge datasets
EphemeralN/ANoneVariesReusable CTEs
SeedFastLowGoodStatic data
SnapshotMediumHighGoodHistorical tracking

Best Practices

  1. Use staging models for 1:1 mappings with source tables
  2. Keep models small and focused on single responsibilities
  3. Use ephemeral models for intermediate transformations that don't need persistence
  4. Implement incremental models for large fact tables
  5. Use snapshots for SCD Type 2 requirements
  6. Load seeds for static lookup data
  7. Always add tests to models and sources
  8. Document everything with descriptions in YAML files

See Also

⭐

Premium Content

Models, Seeds, and Snapshots

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert dbt Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement