dbt Core Architecture

Architecture Overview

Compilation Pipeline

The dbt compilation pipeline processes your project through these stages:

PARSE — Reads YAML/SQL/Graph definitions
RESOLVE — Resolves refs(), sources(), and packages
COMPILE — Jinja rendering and SQL validation
EXECUTE — Sends SQL to target warehouse (BQ/SF/Redshift)
TEST — Runs schema, data, and custom tests
ARTIFACTS — Generates manifest.json, run_results.json, catalog.json

Project Structure

Architecture Diagram

+-----------------------------------------------------------------------------+
|                     DBT PROJECT DIRECTORY STRUCTURE                         |
+-----------------------------------------------------------------------------+
|                                                                             |
|  my_dbt_project/                                                            |
|  +-- dbt_project.yml          ◄-- Project configuration                     |
|  +-- packages.yml             ◄-- Package dependencies                      |
|  +-- profiles.yml             ◄-- Connection profiles                       |
|  +-- models/                                                                |
|  |   +-- staging/             ◄-- Staging models (1:1 with sources)         |
|  |   |   +-- _sources.yml     ◄-- Source definitions                        |
|  |   |   +-- stg_customers.sql                                              |
|  |   |   +-- stg_orders.sql                                                 |
|  |   +-- intermediate/        ◄-- Business logic transformations            |
|  |   |   +-- int_orders.sql                                                 |
|  |   +-- marts/               ◄-- Final analytical models                   |
|  |       +-- finance/                                                       |
|  |       |   +-- fct_orders.sql                                             |
|  |       +-- customers/                                                     |
|  |           +-- dim_customers.sql                                          |
|  +-- seeds/                   ◄-- CSV files loaded as tables                |
|  +-- snapshots/               ◄-- Slowly changing dimensions                |
|  +-- tests/                   ◄-- Custom data tests                         |
|  +-- macros/                  ◄-- Reusable Jinja macros                     |
|  +-- analysis/                ◄-- Ad-hoc analysis queries                   |
|  +-- target/                  ◄-- Compiled output (gitignored)              |
|                                                                             |
+-----------------------------------------------------------------------------+

Detailed Explanation

dbt (data build tool) is a SQL-based transformation tool that enables data teams to build clean, tested, documented data models using SQL and Python, applying software engineering principles to analytics code.

What is the Compilation Process?

Parsing: dbt reads all YAML configuration files and SQL model files to build a project manifest
Manifest: The single source of truth containing metadata about models, sources, tests, macros, and their relationships
DAG Construction: dbt builds a directed acyclic graph representing all dependencies between models
Execution Order: Models are processed in topological order to ensure upstream dependencies are materialized first
Resolution: ref() function calls are resolved to actual database object references (e.g., database.schema.model_name for Snowflake or project.dataset.table for BigQuery)

How does the Execution Engine work?

Connection Management: Handles connection pooling, query submission, error handling, and result collection
Materialization Strategies: Supports multiple execution patterns:

Strategy	Behavior
Table	Drops and recreates the target table on each run
View	Creates or replaces a view definition
Incremental	Only processes new or changed records
Ephemeral	Injects CTEs into downstream models

What are the benefits of Graph-Based Processing?

Parallel Execution: Independent models can be executed concurrently
Selective Materialization: Only changed models and their dependents are rebuilt
Lineage Tracking: Complete audit trail of data transformations
Impact Analysis: Understand downstream effects of schema changes

Key Takeaway: dbt's architecture transforms SQL and YAML configurations into executable queries through a sophisticated pipeline that optimizes execution order and resource usage.

Key Concepts

Concept	Description	Purpose
Manifest	JSON representation of the project	Central metadata store
DAG	Directed Acyclic Graph of dependencies	Execution ordering
Materialization	How models are persisted	Storage strategy
Ref Function	Model reference resolution	Dependency management
Source Function	Source table references	Data lineage
Test	Data quality assertions	Validation
Macro	Reusable Jinja code blocks	Code reuse
Snapshot	Historical state tracking	SCD Type 2

Code Examples

dbt_project.yml Configuration

# dbt_project.yml
name: 'my_analytics_project'
version: '1.0.0'
config-version: 2

profile: 'analytics'

model-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
docs-paths: ["docs"]

clean-targets:
  - "target"
  - "dbt_packages"
  - "dbt_modules"

models:
  my_analytics_project:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: incremental
      +schema: analytics

vars:
  start_date: '2020-01-01'
  enable_debug: false

Model Compilation Example

-- models/marts/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        partition_by={
            "field": "order_date",
            "data_type": "date",
            "granularity": "day"
        },
        cluster_by=['customer_id', 'product_category']
    )
}}

with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('dim_customers') }}
),

order_items as (
    select * from {{ ref('stg_order_items') }}
),

final as (
    select
        orders.order_id,
        orders.order_date,
        orders.status,
        customers.customer_id,
        customers.customer_name,
        customers.segment,
        sum(order_items.quantity * order_items.unit_price) as order_total,
        count(order_items.item_id) as item_count,
        {{ dbt_utils.current_timestamp() }} as updated_at
    from orders
    left join customers on orders.customer_id = customers.customer_id
    left join order_items on orders.order_id = order_items.order_id
    group by 1, 2, 3, 4, 5, 6
)

select * from final

{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

Source Definition

# models/staging/_sources.yml
version: 2

sources:
  - name: raw
    database: raw_data
    schema: public
    loader: fivetran
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 12, period: hour}
    tables:
      - name: orders
        description: "Raw orders from Shopify"
        columns:
          - name: id
            description: "Primary key"
            data_tests:
              - unique
              - not_null
          - name: customer_id
            description: "Foreign key to customers"
            data_tests:
              - not_null
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id
          - name: status
            description: "Order status"
            data_tests:
              - accepted_values:
                  values: ['pending', 'shipped', 'delivered', 'cancelled']

Performance Metrics

Metric	Description	Typical Value
Parse Time	Time to read manifest	2-5 seconds
Compilation	Jinja rendering time	1-3 seconds
Execution	SQL execution time	Variable
Test Time	Data test execution	1-10 seconds
Documentation	Docs generation	5-15 seconds

Best Practices

Layered Architecture: Organize models into staging, intermediate, and mart layers
Naming Conventions: Use stg_ prefix for staging, fct_ for facts, dim_ for dimensions
Single Source of Truth: Define all sources in _sources.yml files
Incremental Models: Use incremental materialization for large tables
Testing: Add tests to all critical models and columns
Documentation: Document every model and column with descriptions
Version Control: Use Git for all dbt code with proper branching strategies
CI/CD: Implement automated testing and deployment pipelines

dbt Core Architecture

dbt Core Architecture

Architecture Overview

Compilation Pipeline

Project Structure

Detailed Explanation

What is the Compilation Process?

How does the Execution Engine work?

What are the benefits of Graph-Based Processing?

Key Concepts

Code Examples

dbt_project.yml Configuration

Model Compilation Example

Source Definition

Performance Metrics

Best Practices

See Also

Premium Content

Need Expert dbt Help?