dbt Core Architecture
Architecture Overview
Compilation Pipeline
The dbt compilation pipeline processes your project through these stages:
- PARSE β Reads YAML/SQL/Graph definitions
- RESOLVE β Resolves refs(), sources(), and packages
- COMPILE β Jinja rendering and SQL validation
- EXECUTE β Sends SQL to target warehouse (BQ/SF/Redshift)
- TEST β Runs schema, data, and custom tests
- ARTIFACTS β Generates manifest.json, run_results.json, catalog.json
Project Structure
Architecture Diagram
+-----------------------------------------------------------------------------+
| DBT PROJECT DIRECTORY STRUCTURE |
+-----------------------------------------------------------------------------+
| |
| my_dbt_project/ |
| +-- dbt_project.yml β-- Project configuration |
| +-- packages.yml β-- Package dependencies |
| +-- profiles.yml β-- Connection profiles |
| +-- models/ |
| | +-- staging/ β-- Staging models (1:1 with sources) |
| | | +-- _sources.yml β-- Source definitions |
| | | +-- stg_customers.sql |
| | | +-- stg_orders.sql |
| | +-- intermediate/ β-- Business logic transformations |
| | | +-- int_orders.sql |
| | +-- marts/ β-- Final analytical models |
| | +-- finance/ |
| | | +-- fct_orders.sql |
| | +-- customers/ |
| | +-- dim_customers.sql |
| +-- seeds/ β-- CSV files loaded as tables |
| +-- snapshots/ β-- Slowly changing dimensions |
| +-- tests/ β-- Custom data tests |
| +-- macros/ β-- Reusable Jinja macros |
| +-- analysis/ β-- Ad-hoc analysis queries |
| +-- target/ β-- Compiled output (gitignored) |
| |
+-----------------------------------------------------------------------------+
Detailed Explanation
dbt (data build tool) is a SQL-based transformation tool that enables data teams to build clean, tested, documented data models using SQL and Python, applying software engineering principles to analytics code.
What is the Compilation Process?
- Parsing: dbt reads all YAML configuration files and SQL model files to build a project manifest
- Manifest: The single source of truth containing metadata about models, sources, tests, macros, and their relationships
- DAG Construction: dbt builds a directed acyclic graph representing all dependencies between models
- Execution Order: Models are processed in topological order to ensure upstream dependencies are materialized first
- Resolution:
ref()function calls are resolved to actual database object references (e.g.,database.schema.model_namefor Snowflake orproject.dataset.tablefor BigQuery)
How does the Execution Engine work?
- Connection Management: Handles connection pooling, query submission, error handling, and result collection
- Materialization Strategies: Supports multiple execution patterns:
| Strategy | Behavior |
|---|---|
| Table | Drops and recreates the target table on each run |
| View | Creates or replaces a view definition |
| Incremental | Only processes new or changed records |
| Ephemeral | Injects CTEs into downstream models |
What are the benefits of Graph-Based Processing?
- Parallel Execution: Independent models can be executed concurrently
- Selective Materialization: Only changed models and their dependents are rebuilt
- Lineage Tracking: Complete audit trail of data transformations
- Impact Analysis: Understand downstream effects of schema changes
Key Takeaway: dbt's architecture transforms SQL and YAML configurations into executable queries through a sophisticated pipeline that optimizes execution order and resource usage.
Key Concepts
| Concept | Description | Purpose |
|---|---|---|
| Manifest | JSON representation of the project | Central metadata store |
| DAG | Directed Acyclic Graph of dependencies | Execution ordering |
| Materialization | How models are persisted | Storage strategy |
| Ref Function | Model reference resolution | Dependency management |
| Source Function | Source table references | Data lineage |
| Test | Data quality assertions | Validation |
| Macro | Reusable Jinja code blocks | Code reuse |
| Snapshot | Historical state tracking | SCD Type 2 |
Code Examples
dbt_project.yml Configuration
# dbt_project.yml
name: 'my_analytics_project'
version: '1.0.0'
config-version: 2
profile: 'analytics'
model-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
docs-paths: ["docs"]
clean-targets:
- "target"
- "dbt_packages"
- "dbt_modules"
models:
my_analytics_project:
staging:
+materialized: view
+schema: staging
intermediate:
+materialized: ephemeral
marts:
+materialized: incremental
+schema: analytics
vars:
start_date: '2020-01-01'
enable_debug: false
Model Compilation Example
-- models/marts/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
partition_by={
"field": "order_date",
"data_type": "date",
"granularity": "day"
},
cluster_by=['customer_id', 'product_category']
)
}}
with orders as (
select * from {{ ref('stg_orders') }}
),
customers as (
select * from {{ ref('dim_customers') }}
),
order_items as (
select * from {{ ref('stg_order_items') }}
),
final as (
select
orders.order_id,
orders.order_date,
orders.status,
customers.customer_id,
customers.customer_name,
customers.segment,
sum(order_items.quantity * order_items.unit_price) as order_total,
count(order_items.item_id) as item_count,
{{ dbt_utils.current_timestamp() }} as updated_at
from orders
left join customers on orders.customer_id = customers.customer_id
left join order_items on orders.order_id = order_items.order_id
group by 1, 2, 3, 4, 5, 6
)
select * from final
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
Source Definition
# models/staging/_sources.yml
version: 2
sources:
- name: raw
database: raw_data
schema: public
loader: fivetran
loaded_at_field: _fivetran_synced
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 12, period: hour}
tables:
- name: orders
description: "Raw orders from Shopify"
columns:
- name: id
description: "Primary key"
data_tests:
- unique
- not_null
- name: customer_id
description: "Foreign key to customers"
data_tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id
- name: status
description: "Order status"
data_tests:
- accepted_values:
values: ['pending', 'shipped', 'delivered', 'cancelled']
Performance Metrics
| Metric | Description | Typical Value |
|---|---|---|
| Parse Time | Time to read manifest | 2-5 seconds |
| Compilation | Jinja rendering time | 1-3 seconds |
| Execution | SQL execution time | Variable |
| Test Time | Data test execution | 1-10 seconds |
| Documentation | Docs generation | 5-15 seconds |
Best Practices
- Layered Architecture: Organize models into staging, intermediate, and mart layers
- Naming Conventions: Use
stg_prefix for staging,fct_for facts,dim_for dimensions - Single Source of Truth: Define all sources in
_sources.ymlfiles - Incremental Models: Use incremental materialization for large tables
- Testing: Add tests to all critical models and columns
- Documentation: Document every model and column with descriptions
- Version Control: Use Git for all dbt code with proper branching strategies
- CI/CD: Implement automated testing and deployment pipelines
See Also
- Models, Seeds, and Snapshots β Core dbt components for data transformation
- The ref() Function β Model reference resolution and dependency management
- Materializations in dbt β Table, view, incremental, and ephemeral strategies
- Snowflake Architecture β Snowflake cloud data platform fundamentals
- Data Engineering Fundamentals β Modern data stack overview