dbt: The SQL Transformation Layer
dbt (data build tool) transforms raw data into analytics-ready datasets using SQL SELECT statements.
Why dbt Matters
Traditional ETL:
- Relies on procedural code (Python scripts, stored procedures)
- Complex transformation logic
- Hard to test and version
dbt's Approach:
- Declarative SQL models β each model is a SELECT statement
- Software engineering best practices applied to analytics code
- Version control, testing, documentation, modularity
Key Benefits:
- Dependency resolution β dbt handles execution ordering
- Incremental processing β only process new/changed data
- Testing β built-in data quality tests
- Documentation β auto-generated docs from SQL and YAML
Key Insight: dbt shifts to declarative SQL models where each model is a SELECT statement and the tool handles dependency resolution, execution ordering, and incremental processing.
Architecture Overview
dbt Project Flow
dbt Project Structure
my_dbt_project/
+-- dbt_project.yml # Project configuration
+-- packages.yml # Package dependencies
+-- profiles.yml # Connection profiles
+-- models/
| +-- staging/
| | +-- _staging__sources.yml
| | +-- _staging__models.yml
| | +-- stg_orders.sql
| | +-- stg_customers.sql
| | +-- stg_products.sql
| +-- intermediate/
| | +-- int_order_payments.sql
| | +-- int_customer_lifetime.sql
| +-- marts/
| +-- _marts__models.yml
| +-- dim_customers.sql
| +-- fact_orders.sql
| +-- agg_monthly_revenue.sql
+-- tests/
| +-- generic/
| +-- test_positive_value.sql
+-- macros/
| +-- generate_schema_name.sql
+-- seeds/
| +-- country_codes.csv
+-- snapshots/
+-- scd_customers.sql
Models
A dbt model is a single SQL SELECT statement that produces a result set. Models are the building blocks of dbt β they define transformations, create tables/views, and form the DAG (Directed Acyclic Graph) that dbt executes in dependency order.
# dbt_project.yml
name: 'analytics'
version: '1.0.0'
config-version: 2
profile: 'snowflake_prod'
model-paths: ["models"]
test-paths: ["tests"]
analysis-paths: ["analyses"]
macro-paths: ["macros"]
snapshot-paths: ["schemas"]
clean-targets:
- "target"
- "dbt_packages"
models:
analytics:
staging:
+materialized: view
+schema: staging
intermediate:
+materialized: ephemeral
marts:
+materialized: table
+schema: analytics
vars:
start_date: '2024-01-01'
load_timestamp: "{{ run_started_at }}"
-- models/staging/stg_orders.sql
-- Source: raw data from source system
-- Materialized as view (always fresh)
WITH source AS (
SELECT * FROM {{ source('raw', 'orders') }}
),
renamed AS (
SELECT
id AS order_id,
customer_id,
status AS order_status,
CAST(order_date AS DATE) AS order_date,
CAST(shipped_date AS DATE) AS shipped_date,
CAST(delivered_date AS DATE) AS delivered_date,
amount AS gross_amount,
tax_amount,
discount_amount,
amount - COALESCE(discount_amount, 0) AS net_amount,
created_at,
updated_at
FROM source
WHERE id IS NOT NULL
)
SELECT * FROM renamed
-- models/marts/fact_orders.sql
-- Materialized as table (refreshed on schedule)
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
),
customers AS (
SELECT * FROM {{ ref('dim_customers') }}
),
daily_metrics AS (
SELECT
order_date,
customer_id,
COUNT(*) AS order_count,
SUM(net_amount) AS total_revenue,
AVG(net_amount) AS avg_order_value,
COUNTIF(order_status = 'completed') AS completed_orders,
MIN(order_date) AS first_order_date,
MAX(order_date) AS last_order_date
FROM orders
WHERE order_date >= '{{ var("start_date") }}'
GROUP BY 1, 2
)
SELECT
{{ dbt_utils.generate_surrogate_key(['order_date', 'customer_id']) }} AS order_key,
d.*,
c.customer_segment,
c.customer_lifetime_value,
CURRENT_TIMESTAMP() AS refreshed_at
FROM daily_metrics d
LEFT JOIN customers c
ON d.customer_id = c.customer_id
Sources and References
Sources declare existing database tables that dbt models consume. They enable testing freshness, schema changes, and provide lineage tracking from raw to transformed data.
# models/staging/_staging__sources.yml
version: 2
sources:
- name: raw
description: "Raw data loaded from source systems"
database: raw_data
schema: public
loader: "fivetran"
loaded_at_field: _fivetran_synced
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: orders
description: "Customer orders from Shopify"
loaded_at_field: updated_at
freshness:
warn_after: {count: 12, period: hour}
columns:
- name: id
description: "Primary key"
tests:
- unique
- not_null
- name: customer_id
description: "Foreign key to customers table"
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: customers
description: "Customer records from Shopify"
columns:
- name: id
tests:
- unique
- not_null
# models/marts/_marts__models.yml
version: 2
models:
- name: fact_orders
description: "Fact table containing all customer orders"
config:
tags: ["finance", "daily"]
columns:
- name: order_key
description: "Surrogate key"
tests:
- unique
- not_null
- name: customer_id
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 1
- name: total_revenue
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
inclusive: false
- name: dim_customers
description: "Customer dimension with SCD Type 2"
columns:
- name: customer_key
tests:
- unique
- not_null
- name: customer_segment
tests:
- accepted_values:
values: ['enterprise', 'mid_market', 'smb', 'freemium']
Jinja Templating
Jinja2 is a templating language that dbt uses to generate dynamic SQL. It enables parameterized models, control flow (loops, conditionals), macros, and reusable logic.
-- Jinja macro for common patterns
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name=none) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ default_schema }}_{{ custom_schema_name }}
{%- endif -%}
{%- endmacro %}
-- macros/positive_value_test.sql
{% test positive_value(model, column_name) %}
SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0
{% endtest %}
-- macros/union_tables.sql
{% macro union_tables(table_list) %}
{%- for table in table_list %}
SELECT * FROM {{ table }}
{% if not loop.last %} UNION ALL {% endif %}
{%- endfor %}
{% endmacro %}
-- Usage in model
{% set tables = ['stg_orders', 'stg_returns', 'stg_refunds'] %}
{{ union_tables(tables) }}
-- Jinja for conditional logic
SELECT
order_id,
customer_id,
net_amount,
{% if var('include_calculations', false) %}
net_amount * 0.1 AS estimated_tax,
net_amount * 0.9 AS estimated_net,
{% endif %}
CURRENT_TIMESTAMP() AS loaded_at
FROM {{ ref('stg_orders') }}
-- Jinja loop for dynamic columns
WITH metrics AS (
SELECT
customer_id,
{% set metric_columns = ['orders', 'returns', 'refunds'] %}
{% for col in metric_columns %}
COUNTIF(type = '{{ col }}') AS {{ col }}_count,
SUM(CASE WHEN type = '{{ col }}' THEN amount ELSE 0 END) AS {{ col }}_total,
{% endfor %}
COUNT(*) AS total_transactions
FROM {{ ref('stg_transactions') }}
GROUP BY 1
)
SELECT * FROM metrics
Incremental Models
An incremental model in dbt processes only new or changed records rather than rebuilding the entire table each run. It uses a strategy (append, merge, or delete+insert) to efficiently update the target table.
-- models/marts/fact_orders_incremental.sql
{{
config(
materialized='incremental',
unique_key='order_key',
incremental_strategy='merge',
on_schema_change='sync_all_columns',
partition_by={'field': 'order_date', 'data_type': 'date'},
cluster_by=['customer_id', 'order_status']
)
}}
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
),
customers AS (
SELECT * FROM {{ ref('dim_customers') }}
)
SELECT
{{ dbt_utils.generate_surrogate_key(['o.order_id']) }} AS order_key,
o.order_id,
o.customer_id,
o.order_date,
o.order_status,
o.net_amount,
o.created_at,
o.updated_at,
c.customer_segment,
CURRENT_TIMESTAMP() AS dbt_loaded_at
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
Testing
dbt tests are assertions defined in YAML that validate data quality. They execute as SQL queries that return failing rows β a test passes when the query returns zero rows.
# tests/test_order_amount_positive.yml
version: 2
models:
- name: fact_orders
columns:
- name: net_amount
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
inclusive: false
- dbt_utils.standard_deviation:
column_name: net_amount
threshold: 3 # Flag if >3 standard deviations
-- Custom test: test_order_status_transition.sql
-- Validates that order status transitions are valid
SELECT
o1.order_id,
o1.order_status AS from_status,
o2.order_status AS to_status,
o2.updated_at AS transition_time
FROM {{ ref('fct_orders') }} o1
JOIN {{ ref('fct_orders') }} o2
ON o1.order_id = o2.order_id
AND o2.updated_at = (
SELECT MIN(updated_at)
FROM {{ ref('fct_orders') }}
WHERE order_id = o1.order_id
AND updated_at > o1.updated_at
)
WHERE NOT (
(o1.order_status = 'pending' AND o2.order_status IN ('processing', 'cancelled'))
OR (o1.order_status = 'processing' AND o2.order_status IN ('shipped', 'cancelled'))
OR (o1.order_status = 'shipped' AND o2.order_status IN ('delivered', 'returned'))
OR (o1.order_status = 'delivered' AND o2.order_status IN ('returned', 'refunded'))
)
Key Concepts Summary
| Concept | Description | Materialization | Use Case |
|---|---|---|---|
| Model | SQL SELECT statement | view/table/incremental | All transformations |
| Source | Existing raw table | N/A | Input declarations |
| Test | Data quality assertion | N/A | Quality validation |
| Macro | Reusable SQL function | N/A | Common patterns |
| Snapshot | SCD Type 2 tracking | Table | Historical dimensions |
| Seed | Static CSV data | Table | Reference data |
| Package | Reusable dbt code | N/A | Community patterns |
| DAG | Dependency graph | N/A | Execution ordering |
| Ref | Model reference | N/A | Cross-model dependencies |
| Variable | Configuration parameter | N/A | Environment-specific values |
| On-run Hook | Pre/post execution | N/A | Custom logic |
| Documentation | Column/table docs | N/A | Data governance |
Performance Metrics
| Metric | Impact | Measurement | Optimization |
|---|---|---|---|
| Model Build Time | Pipeline speed | Seconds per model | Reduce columns, filter early |
| Test Execution | Quality assurance | Seconds per test | Optimize test SQL |
| DAG Depth | Parallelization | Number of layers | Flatten dependencies |
| Incremental Benefit | Cost savings | % data skipped | Proper unique_key design |
| Documentation Coverage | Discoverability | % models documented | Enforce via tests |
| Schema Changes | Stability | Frequency | on_schema_change config |
10 Best Practices
- Follow the staging -> intermediate -> marts pattern β keep raw transformations clean and reusable
- Use
ephemeralmaterialization for intermediate models to avoid unnecessary table creation - Always define sources β never query raw tables directly with hardcoded names
- Write tests for every model β at minimum: unique + not_null on primary keys
- Use
is_incremental()macro β build only new data, not full rebuilds - Tag models by domain β enable selective builds with
dbt run --select tag:finance - Use dbt_utils package β avoid reinventing common patterns (surrogate keys, date spine, etc.)
- Document everything β column descriptions appear in dbt docs and data catalogs
- Use
on_schema_changecarefully βsync_all_columnsprevents breaking changes - Version control dbt projects β use PRs, code review, and CI/CD for model changes
- dbt transforms raw data into analytics-ready datasets using declarative SQL
- The staging -> intermediate -> marts pattern creates maintainable data pipelines
- Jinja templating enables parameterized, reusable SQL components
- Incremental models reduce compute cost by processing only new/changed data
- Testing and documentation are built-in β enforce quality at the code level
See Also
- dbt Advanced β Custom macros, packages, CI/CD, and multi-project architectures
- Data Warehouse Concepts β Star schema, snowflake schema, and SCD patterns
- Snowflake Fundamentals β Cloud warehouse for dbt transformations
- Data Contracts β Formal producer-consumer data agreements
- CI/CD for Data Pipelines β Automated dbt testing and deployment
- Project 3: Warehouse Build β End-to-end dbt + Snowflake project