πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

dbt Fundamentals: SQL-First Analytics Engineering

Module 3: Data Warehouses & StorageData Transformation🟒 Free Lesson

Advertisement

dbt: The SQL Transformation Layer

dbt (data build tool) transforms raw data into analytics-ready datasets using SQL SELECT statements.

Why dbt Matters


Traditional ETL:

  • Relies on procedural code (Python scripts, stored procedures)
  • Complex transformation logic
  • Hard to test and version

dbt's Approach:

  • Declarative SQL models β€” each model is a SELECT statement
  • Software engineering best practices applied to analytics code
  • Version control, testing, documentation, modularity

Key Benefits:

  1. Dependency resolution β€” dbt handles execution ordering
  2. Incremental processing β€” only process new/changed data
  3. Testing β€” built-in data quality tests
  4. Documentation β€” auto-generated docs from SQL and YAML

Key Insight: dbt shifts to declarative SQL models where each model is a SELECT statement and the tool handles dependency resolution, execution ordering, and incremental processing.


Architecture Overview

dbt Project Flow

dbt Project Execution Flowdbt Projectmodels/, ymldbt_project.ymlParse & BuildJinja -> DAGGenerate SQLExecute SQLdbt rundbt testWarehouseMaterializedviews/tablesDocsdbt docsgen. docsdbt DAG (Directed Acyclic Graph)stg_ordersstg_paymentsfct_ordersint_order_paymentsrpt_revenuedbt resolves dependencies, builds DAG, executes models in correct order

dbt Project Structure

Architecture Diagram
my_dbt_project/
+-- dbt_project.yml          # Project configuration
+-- packages.yml              # Package dependencies
+-- profiles.yml              # Connection profiles
+-- models/
|   +-- staging/
|   |   +-- _staging__sources.yml
|   |   +-- _staging__models.yml
|   |   +-- stg_orders.sql
|   |   +-- stg_customers.sql
|   |   +-- stg_products.sql
|   +-- intermediate/
|   |   +-- int_order_payments.sql
|   |   +-- int_customer_lifetime.sql
|   +-- marts/
|       +-- _marts__models.yml
|       +-- dim_customers.sql
|       +-- fact_orders.sql
|       +-- agg_monthly_revenue.sql
+-- tests/
|   +-- generic/
|       +-- test_positive_value.sql
+-- macros/
|   +-- generate_schema_name.sql
+-- seeds/
|   +-- country_codes.csv
+-- snapshots/
    +-- scd_customers.sql

Models

A dbt model is a single SQL SELECT statement that produces a result set. Models are the building blocks of dbt β€” they define transformations, create tables/views, and form the DAG (Directed Acyclic Graph) that dbt executes in dependency order.

# dbt_project.yml
name: 'analytics'
version: '1.0.0'
config-version: 2

profile: 'snowflake_prod'

model-paths: ["models"]
test-paths: ["tests"]
analysis-paths: ["analyses"]
macro-paths: ["macros"]
snapshot-paths: ["schemas"]

clean-targets:
  - "target"
  - "dbt_packages"

models:
  analytics:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
      +schema: analytics

vars:
  start_date: '2024-01-01'
  load_timestamp: "{{ run_started_at }}"
-- models/staging/stg_orders.sql
-- Source: raw data from source system
-- Materialized as view (always fresh)

WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

renamed AS (
    SELECT
        id                              AS order_id,
        customer_id,
        status                          AS order_status,
        CAST(order_date AS DATE)        AS order_date,
        CAST(shipped_date AS DATE)      AS shipped_date,
        CAST(delivered_date AS DATE)    AS delivered_date,
        amount                          AS gross_amount,
        tax_amount,
        discount_amount,
        amount - COALESCE(discount_amount, 0) AS net_amount,
        created_at,
        updated_at
    FROM source
    WHERE id IS NOT NULL
)

SELECT * FROM renamed
-- models/marts/fact_orders.sql
-- Materialized as table (refreshed on schedule)

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
    SELECT * FROM {{ ref('dim_customers') }}
),

daily_metrics AS (
    SELECT
        order_date,
        customer_id,
        COUNT(*)                    AS order_count,
        SUM(net_amount)             AS total_revenue,
        AVG(net_amount)             AS avg_order_value,
        COUNTIF(order_status = 'completed') AS completed_orders,
        MIN(order_date)             AS first_order_date,
        MAX(order_date)             AS last_order_date
    FROM orders
    WHERE order_date >= '{{ var("start_date") }}'
    GROUP BY 1, 2
)

SELECT
    {{ dbt_utils.generate_surrogate_key(['order_date', 'customer_id']) }} AS order_key,
    d.*,
    c.customer_segment,
    c.customer_lifetime_value,
    CURRENT_TIMESTAMP()           AS refreshed_at
FROM daily_metrics d
LEFT JOIN customers c
    ON d.customer_id = c.customer_id

Sources and References

Sources declare existing database tables that dbt models consume. They enable testing freshness, schema changes, and provide lineage tracking from raw to transformed data.

# models/staging/_staging__sources.yml
version: 2

sources:
  - name: raw
    description: "Raw data loaded from source systems"
    database: raw_data
    schema: public
    loader: "fivetran"
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
        description: "Customer orders from Shopify"
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 12, period: hour}
        columns:
          - name: id
            description: "Primary key"
            tests:
              - unique
              - not_null
          - name: customer_id
            description: "Foreign key to customers table"
            tests:
              - not_null
              - relationships:
                  to: ref('dim_customers')
                  field: customer_id

      - name: customers
        description: "Customer records from Shopify"
        columns:
          - name: id
            tests:
              - unique
              - not_null
# models/marts/_marts__models.yml
version: 2

models:
  - name: fact_orders
    description: "Fact table containing all customer orders"
    config:
      tags: ["finance", "daily"]
    columns:
      - name: order_key
        description: "Surrogate key"
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 1
      - name: total_revenue
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: false

  - name: dim_customers
    description: "Customer dimension with SCD Type 2"
    columns:
      - name: customer_key
        tests:
          - unique
          - not_null
      - name: customer_segment
        tests:
          - accepted_values:
              values: ['enterprise', 'mid_market', 'smb', 'freemium']

Jinja Templating

Jinja2 is a templating language that dbt uses to generate dynamic SQL. It enables parameterized models, control flow (loops, conditionals), macros, and reusable logic.

-- Jinja macro for common patterns
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name=none) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name }}
    {%- endif -%}
{%- endmacro %}

-- macros/positive_value_test.sql
{% test positive_value(model, column_name) %}
SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0
{% endtest %}

-- macros/union_tables.sql
{% macro union_tables(table_list) %}
    {%- for table in table_list %}
        SELECT * FROM {{ table }}
        {% if not loop.last %} UNION ALL {% endif %}
    {%- endfor %}
{% endmacro %}

-- Usage in model
{% set tables = ['stg_orders', 'stg_returns', 'stg_refunds'] %}
{{ union_tables(tables) }}

-- Jinja for conditional logic
SELECT
    order_id,
    customer_id,
    net_amount,
    {% if var('include_calculations', false) %}
        net_amount * 0.1 AS estimated_tax,
        net_amount * 0.9 AS estimated_net,
    {% endif %}
    CURRENT_TIMESTAMP() AS loaded_at
FROM {{ ref('stg_orders') }}

-- Jinja loop for dynamic columns
WITH metrics AS (
    SELECT
        customer_id,
        {% set metric_columns = ['orders', 'returns', 'refunds'] %}
        {% for col in metric_columns %}
            COUNTIF(type = '{{ col }}') AS {{ col }}_count,
            SUM(CASE WHEN type = '{{ col }}' THEN amount ELSE 0 END) AS {{ col }}_total,
        {% endfor %}
        COUNT(*) AS total_transactions
    FROM {{ ref('stg_transactions') }}
    GROUP BY 1
)

SELECT * FROM metrics

Incremental Models

An incremental model in dbt processes only new or changed records rather than rebuilding the entire table each run. It uses a strategy (append, merge, or delete+insert) to efficiently update the target table.

-- models/marts/fact_orders_incremental.sql
{{
    config(
        materialized='incremental',
        unique_key='order_key',
        incremental_strategy='merge',
        on_schema_change='sync_all_columns',
        partition_by={'field': 'order_date', 'data_type': 'date'},
        cluster_by=['customer_id', 'order_status']
    )
}}

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}

    {% if is_incremental() %}
        WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
    {% endif %}
),

customers AS (
    SELECT * FROM {{ ref('dim_customers') }}
)

SELECT
    {{ dbt_utils.generate_surrogate_key(['o.order_id']) }} AS order_key,
    o.order_id,
    o.customer_id,
    o.order_date,
    o.order_status,
    o.net_amount,
    o.created_at,
    o.updated_at,
    c.customer_segment,
    CURRENT_TIMESTAMP() AS dbt_loaded_at
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id

Testing

dbt tests are assertions defined in YAML that validate data quality. They execute as SQL queries that return failing rows β€” a test passes when the query returns zero rows.

# tests/test_order_amount_positive.yml
version: 2

models:
  - name: fact_orders
    columns:
      - name: net_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: false
          - dbt_utils.standard_deviation:
              column_name: net_amount
              threshold: 3  # Flag if >3 standard deviations
-- Custom test: test_order_status_transition.sql
-- Validates that order status transitions are valid
SELECT
    o1.order_id,
    o1.order_status AS from_status,
    o2.order_status AS to_status,
    o2.updated_at AS transition_time
FROM {{ ref('fct_orders') }} o1
JOIN {{ ref('fct_orders') }} o2
    ON o1.order_id = o2.order_id
    AND o2.updated_at = (
        SELECT MIN(updated_at)
        FROM {{ ref('fct_orders') }}
        WHERE order_id = o1.order_id
          AND updated_at > o1.updated_at
    )
WHERE NOT (
    (o1.order_status = 'pending' AND o2.order_status IN ('processing', 'cancelled'))
    OR (o1.order_status = 'processing' AND o2.order_status IN ('shipped', 'cancelled'))
    OR (o1.order_status = 'shipped' AND o2.order_status IN ('delivered', 'returned'))
    OR (o1.order_status = 'delivered' AND o2.order_status IN ('returned', 'refunded'))
)

Key Concepts Summary

ConceptDescriptionMaterializationUse Case
ModelSQL SELECT statementview/table/incrementalAll transformations
SourceExisting raw tableN/AInput declarations
TestData quality assertionN/AQuality validation
MacroReusable SQL functionN/ACommon patterns
SnapshotSCD Type 2 trackingTableHistorical dimensions
SeedStatic CSV dataTableReference data
PackageReusable dbt codeN/ACommunity patterns
DAGDependency graphN/AExecution ordering
RefModel referenceN/ACross-model dependencies
VariableConfiguration parameterN/AEnvironment-specific values
On-run HookPre/post executionN/ACustom logic
DocumentationColumn/table docsN/AData governance

Performance Metrics

MetricImpactMeasurementOptimization
Model Build TimePipeline speedSeconds per modelReduce columns, filter early
Test ExecutionQuality assuranceSeconds per testOptimize test SQL
DAG DepthParallelizationNumber of layersFlatten dependencies
Incremental BenefitCost savings% data skippedProper unique_key design
Documentation CoverageDiscoverability% models documentedEnforce via tests
Schema ChangesStabilityFrequencyon_schema_change config

10 Best Practices

  1. Follow the staging -> intermediate -> marts pattern β€” keep raw transformations clean and reusable
  2. Use ephemeral materialization for intermediate models to avoid unnecessary table creation
  3. Always define sources β€” never query raw tables directly with hardcoded names
  4. Write tests for every model β€” at minimum: unique + not_null on primary keys
  5. Use is_incremental() macro β€” build only new data, not full rebuilds
  6. Tag models by domain β€” enable selective builds with dbt run --select tag:finance
  7. Use dbt_utils package β€” avoid reinventing common patterns (surrogate keys, date spine, etc.)
  8. Document everything β€” column descriptions appear in dbt docs and data catalogs
  9. Use on_schema_change carefully β€” sync_all_columns prevents breaking changes
  10. Version control dbt projects β€” use PRs, code review, and CI/CD for model changes

  • dbt transforms raw data into analytics-ready datasets using declarative SQL
  • The staging -> intermediate -> marts pattern creates maintainable data pipelines
  • Jinja templating enables parameterized, reusable SQL components
  • Incremental models reduce compute cost by processing only new/changed data
  • Testing and documentation are built-in β€” enforce quality at the code level

See Also

⭐

Premium Content

dbt Fundamentals: SQL-First Analytics Engineering

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement