dbt Fundamentals: SQL-First Analytics Engineering

dbt: The SQL Transformation Layer

dbt (data build tool) transforms raw data into analytics-ready datasets using SQL SELECT statements.

Why dbt Matters

Traditional ETL:

Relies on procedural code (Python scripts, stored procedures)
Complex transformation logic
Hard to test and version

dbt's Approach:

Declarative SQL models — each model is a SELECT statement
Software engineering best practices applied to analytics code
Version control, testing, documentation, modularity

Key Benefits:

Dependency resolution — dbt handles execution ordering
Incremental processing — only process new/changed data
Testing — built-in data quality tests
Documentation — auto-generated docs from SQL and YAML

Key Insight: dbt shifts to declarative SQL models where each model is a SELECT statement and the tool handles dependency resolution, execution ordering, and incremental processing.

Architecture Overview

dbt Project Flow

dbt Project Structure

Architecture Diagram

my_dbt_project/
+-- dbt_project.yml          # Project configuration
+-- packages.yml              # Package dependencies
+-- profiles.yml              # Connection profiles
+-- models/
|   +-- staging/
|   |   +-- _staging__sources.yml
|   |   +-- _staging__models.yml
|   |   +-- stg_orders.sql
|   |   +-- stg_customers.sql
|   |   +-- stg_products.sql
|   +-- intermediate/
|   |   +-- int_order_payments.sql
|   |   +-- int_customer_lifetime.sql
|   +-- marts/
|       +-- _marts__models.yml
|       +-- dim_customers.sql
|       +-- fact_orders.sql
|       +-- agg_monthly_revenue.sql
+-- tests/
|   +-- generic/
|       +-- test_positive_value.sql
+-- macros/
|   +-- generate_schema_name.sql
+-- seeds/
|   +-- country_codes.csv
+-- snapshots/
    +-- scd_customers.sql

Models

A dbt model is a single SQL SELECT statement that produces a result set. Models are the building blocks of dbt — they define transformations, create tables/views, and form the DAG (Directed Acyclic Graph) that dbt executes in dependency order.

# dbt_project.yml
name: 'analytics'
version: '1.0.0'
config-version: 2

profile: 'snowflake_prod'

model-paths: ["models"]
test-paths: ["tests"]
analysis-paths: ["analyses"]
macro-paths: ["macros"]
snapshot-paths: ["schemas"]

clean-targets:
  - "target"
  - "dbt_packages"

models:
  analytics:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
      +schema: analytics

vars:
  start_date: '2024-01-01'
  load_timestamp: "{{ run_started_at }}"

-- models/staging/stg_orders.sql
-- Source: raw data from source system
-- Materialized as view (always fresh)

WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

renamed AS (
    SELECT
        id                              AS order_id,
        customer_id,
        status                          AS order_status,
        CAST(order_date AS DATE)        AS order_date,
        CAST(shipped_date AS DATE)      AS shipped_date,
        CAST(delivered_date AS DATE)    AS delivered_date,
        amount                          AS gross_amount,
        tax_amount,
        discount_amount,
        amount - COALESCE(discount_amount, 0) AS net_amount,
        created_at,
        updated_at
    FROM source
    WHERE id IS NOT NULL
)

SELECT * FROM renamed

-- models/marts/fact_orders.sql
-- Materialized as table (refreshed on schedule)

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
    SELECT * FROM {{ ref('dim_customers') }}
),

daily_metrics AS (
    SELECT
        order_date,
        customer_id,
        COUNT(*)                    AS order_count,
        SUM(net_amount)             AS total_revenue,
        AVG(net_amount)             AS avg_order_value,
        COUNTIF(order_status = 'completed') AS completed_orders,
        MIN(order_date)             AS first_order_date,
        MAX(order_date)             AS last_order_date
    FROM orders
    WHERE order_date >= '{{ var("start_date") }}'
    GROUP BY 1, 2
)

SELECT
    {{ dbt_utils.generate_surrogate_key(['order_date', 'customer_id']) }} AS order_key,
    d.*,
    c.customer_segment,
    c.customer_lifetime_value,
    CURRENT_TIMESTAMP()           AS refreshed_at
FROM daily_metrics d
LEFT JOIN customers c
    ON d.customer_id = c.customer_id

Sources and References

Sources declare existing database tables that dbt models consume. They enable testing freshness, schema changes, and provide lineage tracking from raw to transformed data.

# models/staging/_staging__sources.yml
version: 2

sources:
  - name: raw
    description: "Raw data loaded from source systems"
    database: raw_data
    schema: public
    loader: "fivetran"
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
        description: "Customer orders from Shopify"
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 12, period: hour}
        columns:
          - name: id
            description: "Primary key"
            tests:
              - unique
              - not_null
          - name: customer_id
            description: "Foreign key to customers table"
            tests:
              - not_null
              - relationships:
                  to: ref('dim_customers')
                  field: customer_id

      - name: customers
        description: "Customer records from Shopify"
        columns:
          - name: id
            tests:
              - unique
              - not_null

# models/marts/_marts__models.yml
version: 2

models:
  - name: fact_orders
    description: "Fact table containing all customer orders"
    config:
      tags: ["finance", "daily"]
    columns:
      - name: order_key
        description: "Surrogate key"
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 1
      - name: total_revenue
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: false

  - name: dim_customers
    description: "Customer dimension with SCD Type 2"
    columns:
      - name: customer_key
        tests:
          - unique
          - not_null
      - name: customer_segment
        tests:
          - accepted_values:
              values: ['enterprise', 'mid_market', 'smb', 'freemium']

Jinja Templating

Jinja2 is a templating language that dbt uses to generate dynamic SQL. It enables parameterized models, control flow (loops, conditionals), macros, and reusable logic.

-- Jinja macro for common patterns
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name=none) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name }}
    {%- endif -%}
{%- endmacro %}

-- macros/positive_value_test.sql
{% test positive_value(model, column_name) %}
SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0
{% endtest %}

-- macros/union_tables.sql
{% macro union_tables(table_list) %}
    {%- for table in table_list %}
        SELECT * FROM {{ table }}
        {% if not loop.last %} UNION ALL {% endif %}
    {%- endfor %}
{% endmacro %}

-- Usage in model
{% set tables = ['stg_orders', 'stg_returns', 'stg_refunds'] %}
{{ union_tables(tables) }}

-- Jinja for conditional logic
SELECT
    order_id,
    customer_id,
    net_amount,
    {% if var('include_calculations', false) %}
        net_amount * 0.1 AS estimated_tax,
        net_amount * 0.9 AS estimated_net,
    {% endif %}
    CURRENT_TIMESTAMP() AS loaded_at
FROM {{ ref('stg_orders') }}

-- Jinja loop for dynamic columns
WITH metrics AS (
    SELECT
        customer_id,
        {% set metric_columns = ['orders', 'returns', 'refunds'] %}
        {% for col in metric_columns %}
            COUNTIF(type = '{{ col }}') AS {{ col }}_count,
            SUM(CASE WHEN type = '{{ col }}' THEN amount ELSE 0 END) AS {{ col }}_total,
        {% endfor %}
        COUNT(*) AS total_transactions
    FROM {{ ref('stg_transactions') }}
    GROUP BY 1
)

SELECT * FROM metrics

Incremental Models

An incremental model in dbt processes only new or changed records rather than rebuilding the entire table each run. It uses a strategy (append, merge, or delete+insert) to efficiently update the target table.

-- models/marts/fact_orders_incremental.sql
{{
    config(
        materialized='incremental',
        unique_key='order_key',
        incremental_strategy='merge',
        on_schema_change='sync_all_columns',
        partition_by={'field': 'order_date', 'data_type': 'date'},
        cluster_by=['customer_id', 'order_status']
    )
}}

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}

    {% if is_incremental() %}
        WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
    {% endif %}
),

customers AS (
    SELECT * FROM {{ ref('dim_customers') }}
)

SELECT
    {{ dbt_utils.generate_surrogate_key(['o.order_id']) }} AS order_key,
    o.order_id,
    o.customer_id,
    o.order_date,
    o.order_status,
    o.net_amount,
    o.created_at,
    o.updated_at,
    c.customer_segment,
    CURRENT_TIMESTAMP() AS dbt_loaded_at
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id

Testing

dbt tests are assertions defined in YAML that validate data quality. They execute as SQL queries that return failing rows — a test passes when the query returns zero rows.

# tests/test_order_amount_positive.yml
version: 2

models:
  - name: fact_orders
    columns:
      - name: net_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: false
          - dbt_utils.standard_deviation:
              column_name: net_amount
              threshold: 3  # Flag if >3 standard deviations

-- Custom test: test_order_status_transition.sql
-- Validates that order status transitions are valid
SELECT
    o1.order_id,
    o1.order_status AS from_status,
    o2.order_status AS to_status,
    o2.updated_at AS transition_time
FROM {{ ref('fct_orders') }} o1
JOIN {{ ref('fct_orders') }} o2
    ON o1.order_id = o2.order_id
    AND o2.updated_at = (
        SELECT MIN(updated_at)
        FROM {{ ref('fct_orders') }}
        WHERE order_id = o1.order_id
          AND updated_at > o1.updated_at
    )
WHERE NOT (
    (o1.order_status = 'pending' AND o2.order_status IN ('processing', 'cancelled'))
    OR (o1.order_status = 'processing' AND o2.order_status IN ('shipped', 'cancelled'))
    OR (o1.order_status = 'shipped' AND o2.order_status IN ('delivered', 'returned'))
    OR (o1.order_status = 'delivered' AND o2.order_status IN ('returned', 'refunded'))
)

Key Concepts Summary

Concept	Description	Materialization	Use Case
Model	SQL SELECT statement	view/table/incremental	All transformations
Source	Existing raw table	N/A	Input declarations
Test	Data quality assertion	N/A	Quality validation
Macro	Reusable SQL function	N/A	Common patterns
Snapshot	SCD Type 2 tracking	Table	Historical dimensions
Seed	Static CSV data	Table	Reference data
Package	Reusable dbt code	N/A	Community patterns
DAG	Dependency graph	N/A	Execution ordering
Ref	Model reference	N/A	Cross-model dependencies
Variable	Configuration parameter	N/A	Environment-specific values
On-run Hook	Pre/post execution	N/A	Custom logic
Documentation	Column/table docs	N/A	Data governance

Performance Metrics

Metric	Impact	Measurement	Optimization
Model Build Time	Pipeline speed	Seconds per model	Reduce columns, filter early
Test Execution	Quality assurance	Seconds per test	Optimize test SQL
DAG Depth	Parallelization	Number of layers	Flatten dependencies
Incremental Benefit	Cost savings	% data skipped	Proper unique_key design
Documentation Coverage	Discoverability	% models documented	Enforce via tests
Schema Changes	Stability	Frequency	on_schema_change config

10 Best Practices

Follow the staging -> intermediate -> marts pattern — keep raw transformations clean and reusable
Use ephemeral materialization for intermediate models to avoid unnecessary table creation
Always define sources — never query raw tables directly with hardcoded names
Write tests for every model — at minimum: unique + not_null on primary keys
Use is_incremental() macro — build only new data, not full rebuilds
Tag models by domain — enable selective builds with dbt run --select tag:finance
Use dbt_utils package — avoid reinventing common patterns (surrogate keys, date spine, etc.)
Document everything — column descriptions appear in dbt docs and data catalogs
Use on_schema_change carefully — sync_all_columns prevents breaking changes
Version control dbt projects — use PRs, code review, and CI/CD for model changes

dbt transforms raw data into analytics-ready datasets using declarative SQL
The staging -> intermediate -> marts pattern creates maintainable data pipelines
Jinja templating enables parameterized, reusable SQL components
Incremental models reduce compute cost by processing only new/changed data
Testing and documentation are built-in — enforce quality at the code level