πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

dbt Core Architecture

🟒 Free Lesson

Advertisement

dbt Core Architecture

Architecture Overview

Compilation Pipeline

The dbt compilation pipeline processes your project through these stages:

  1. PARSE β€” Reads YAML/SQL/Graph definitions
  2. RESOLVE β€” Resolves refs(), sources(), and packages
  3. COMPILE β€” Jinja rendering and SQL validation
  4. EXECUTE β€” Sends SQL to target warehouse (BQ/SF/Redshift)
  5. TEST β€” Runs schema, data, and custom tests
  6. ARTIFACTS β€” Generates manifest.json, run_results.json, catalog.json

Project Structure

Architecture Diagram
+-----------------------------------------------------------------------------+
|                     DBT PROJECT DIRECTORY STRUCTURE                         |
+-----------------------------------------------------------------------------+
|                                                                             |
|  my_dbt_project/                                                            |
|  +-- dbt_project.yml          β—„-- Project configuration                     |
|  +-- packages.yml             β—„-- Package dependencies                      |
|  +-- profiles.yml             β—„-- Connection profiles                       |
|  +-- models/                                                                |
|  |   +-- staging/             β—„-- Staging models (1:1 with sources)         |
|  |   |   +-- _sources.yml     β—„-- Source definitions                        |
|  |   |   +-- stg_customers.sql                                              |
|  |   |   +-- stg_orders.sql                                                 |
|  |   +-- intermediate/        β—„-- Business logic transformations            |
|  |   |   +-- int_orders.sql                                                 |
|  |   +-- marts/               β—„-- Final analytical models                   |
|  |       +-- finance/                                                       |
|  |       |   +-- fct_orders.sql                                             |
|  |       +-- customers/                                                     |
|  |           +-- dim_customers.sql                                          |
|  +-- seeds/                   β—„-- CSV files loaded as tables                |
|  +-- snapshots/               β—„-- Slowly changing dimensions                |
|  +-- tests/                   β—„-- Custom data tests                         |
|  +-- macros/                  β—„-- Reusable Jinja macros                     |
|  +-- analysis/                β—„-- Ad-hoc analysis queries                   |
|  +-- target/                  β—„-- Compiled output (gitignored)              |
|                                                                             |
+-----------------------------------------------------------------------------+

Detailed Explanation

dbt (data build tool) is a SQL-based transformation tool that enables data teams to build clean, tested, documented data models using SQL and Python, applying software engineering principles to analytics code.


What is the Compilation Process?

  • Parsing: dbt reads all YAML configuration files and SQL model files to build a project manifest
  • Manifest: The single source of truth containing metadata about models, sources, tests, macros, and their relationships
  • DAG Construction: dbt builds a directed acyclic graph representing all dependencies between models
  • Execution Order: Models are processed in topological order to ensure upstream dependencies are materialized first
  • Resolution: ref() function calls are resolved to actual database object references (e.g., database.schema.model_name for Snowflake or project.dataset.table for BigQuery)

How does the Execution Engine work?

  • Connection Management: Handles connection pooling, query submission, error handling, and result collection
  • Materialization Strategies: Supports multiple execution patterns:
StrategyBehavior
TableDrops and recreates the target table on each run
ViewCreates or replaces a view definition
IncrementalOnly processes new or changed records
EphemeralInjects CTEs into downstream models

What are the benefits of Graph-Based Processing?

  • Parallel Execution: Independent models can be executed concurrently
  • Selective Materialization: Only changed models and their dependents are rebuilt
  • Lineage Tracking: Complete audit trail of data transformations
  • Impact Analysis: Understand downstream effects of schema changes

Key Takeaway: dbt's architecture transforms SQL and YAML configurations into executable queries through a sophisticated pipeline that optimizes execution order and resource usage.

Key Concepts

ConceptDescriptionPurpose
ManifestJSON representation of the projectCentral metadata store
DAGDirected Acyclic Graph of dependenciesExecution ordering
MaterializationHow models are persistedStorage strategy
Ref FunctionModel reference resolutionDependency management
Source FunctionSource table referencesData lineage
TestData quality assertionsValidation
MacroReusable Jinja code blocksCode reuse
SnapshotHistorical state trackingSCD Type 2

Code Examples

dbt_project.yml Configuration

# dbt_project.yml
name: 'my_analytics_project'
version: '1.0.0'
config-version: 2

profile: 'analytics'

model-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
docs-paths: ["docs"]

clean-targets:
  - "target"
  - "dbt_packages"
  - "dbt_modules"

models:
  my_analytics_project:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: incremental
      +schema: analytics

vars:
  start_date: '2020-01-01'
  enable_debug: false

Model Compilation Example

-- models/marts/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        partition_by={
            "field": "order_date",
            "data_type": "date",
            "granularity": "day"
        },
        cluster_by=['customer_id', 'product_category']
    )
}}

with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('dim_customers') }}
),

order_items as (
    select * from {{ ref('stg_order_items') }}
),

final as (
    select
        orders.order_id,
        orders.order_date,
        orders.status,
        customers.customer_id,
        customers.customer_name,
        customers.segment,
        sum(order_items.quantity * order_items.unit_price) as order_total,
        count(order_items.item_id) as item_count,
        {{ dbt_utils.current_timestamp() }} as updated_at
    from orders
    left join customers on orders.customer_id = customers.customer_id
    left join order_items on orders.order_id = order_items.order_id
    group by 1, 2, 3, 4, 5, 6
)

select * from final

{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

Source Definition

# models/staging/_sources.yml
version: 2

sources:
  - name: raw
    database: raw_data
    schema: public
    loader: fivetran
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 12, period: hour}
    tables:
      - name: orders
        description: "Raw orders from Shopify"
        columns:
          - name: id
            description: "Primary key"
            data_tests:
              - unique
              - not_null
          - name: customer_id
            description: "Foreign key to customers"
            data_tests:
              - not_null
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id
          - name: status
            description: "Order status"
            data_tests:
              - accepted_values:
                  values: ['pending', 'shipped', 'delivered', 'cancelled']

Performance Metrics

MetricDescriptionTypical Value
Parse TimeTime to read manifest2-5 seconds
CompilationJinja rendering time1-3 seconds
ExecutionSQL execution timeVariable
Test TimeData test execution1-10 seconds
DocumentationDocs generation5-15 seconds

Best Practices

  1. Layered Architecture: Organize models into staging, intermediate, and mart layers
  2. Naming Conventions: Use stg_ prefix for staging, fct_ for facts, dim_ for dimensions
  3. Single Source of Truth: Define all sources in _sources.yml files
  4. Incremental Models: Use incremental materialization for large tables
  5. Testing: Add tests to all critical models and columns
  6. Documentation: Document every model and column with descriptions
  7. Version Control: Use Git for all dbt code with proper branching strategies
  8. CI/CD: Implement automated testing and deployment pipelines

See Also

⭐

Premium Content

dbt Core Architecture

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert dbt Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement