dbt Advanced: Custom Macros, Packages, CI/CD & Production Patterns

Advanced dbt: Production-Grade Analytics Engineering

Beyond basic models and tests, production dbt requires custom macros for complex logic, package development for reusable patterns, CI/CD integration for safe deployments, and performance optimization for large-scale transformations.

Why Advanced dbt Matters

Key Benefits:

Governance patterns — for hundreds of models
Automated CI testing — catch issues before production
Selective model builds — efficient development workflows
Cross-team collaboration — enterprise-scale analytics engineering

Key Insight: Advanced dbt techniques enable enterprise-scale analytics engineering.

Architecture Overview

dbt CI/CD Pipeline Flow

Custom Macros

A dbt macro is a reusable Jinja+SQL function that can be called from models, tests, or other macros. Macros encapsulate complex logic, reduce code duplication, and enable dynamic SQL generation.

-- macros/generate_surrogate_key.sql
{% macro generate_surrogate_key(field_list) %}
    {%- set fields = [] -%}
    {%- for field in field_list -%}
        {%- do fields.append(
            "COALESCE(CAST(" ~ field ~ " AS " ~ dbt_utils.type_string() ~ "), '')"
        ) -%}
    {%- endfor -%}

    {{ dbt_utils.generate_surrogate_key(fields) }}
{% endmacro %}

-- macros/merge_table.sql
{% macro merge_table(target, source, unique_key, update_columns=none) %}
    {%- set update_cols = update_columns or dbt_utils.get_filtered_columns(
        from=source, except=[unique_key]
    ) -%}

    MERGE INTO {{ target }} AS target
    USING ({{ source }}) AS source
    ON target.{{ unique_key }} = source.{{ unique_key }}
    WHEN MATCHED THEN
        UPDATE SET
            {% for col in update_cols %}
                target.{{ col }} = source.{{ col }}
                {% if not loop.last %},{% endif %}
            {% endfor %}
    WHEN NOT MATCHED THEN
        INSERT (
            {{ unique_key }},
            {{ update_cols | join(', ') }}
        )
        VALUES (
            source.{{ unique_key }},
            {{ update_cols | join(', ') }}
        )
{% endmacro %}

-- Usage in model
{{
    merge_table(
        target=ref('dim_customers'),
        source=ref('stg_customers'),
        unique_key='customer_id'
    )
}}

-- macros/apply_data_quality_checks.sql
{% macro apply_data_quality_checks(model_name, checks) %}
    {%- for check in checks %}
        SELECT
            '{{ model_name }}' AS model_name,
            '{{ check.column }}' AS column_name,
            '{{ check.type }}' AS check_type,
            COUNT(*) AS failing_rows
        FROM {{ model_name }}
        WHERE {% if check.type == 'not_null' %}
            {{ check.column }} IS NULL
        {% elif check.type == 'unique' %}
            {{ check.column }} IN (
                SELECT {{ check.column }}
                FROM {{ model_name }}
                GROUP BY {{ check.column }}
                HAVING COUNT(*) > 1
            )
        {% elif check.type == 'positive' %}
            {{ check.column }} <= 0
        {% endif %}
        HAVING COUNT(*) > 0
        {% if not loop.last %}UNION ALL{% endif %}
    {%- endfor %}
{% endmacro %}

-- macros/partition_management.sql
{% macro drop_partitions_before(table_name, date_column, days_to_keep=7) %}
    {% call statement('drop_partitions') %}
        {% if target.type == 'bigquery' %}
            DECLARE partition_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL {{ days_to_keep }} DAY);
            EXECUTE IMMEDIATE FORMAT(
                'ALTER TABLE %s DROP IF EXISTS PARTITIONS WHERE %s < "%s"',
                '{{ table_name }}', '{{ date_column }}', partition_date
            );
        {% elif target.type == 'snowflake' %}
            -- Snowflake handles via table expiration policies
            SELECT 1;
        {% endif %}
    {% endcall %}
{% endmacro %}

Package Development

A dbt package is a reusable collection of models, macros, tests, and configurations shared across projects via the dbt Hub or Git repositories. Packages promote standardization and reduce duplication across teams.

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]

  - package: dbt-labs/codegen
    version: [">=0.4.0"]

  - package: calogica/dbt_expectations
    version: [">=0.8.0"]

  - package: elementary-data/elementary
    version: [">=0.10.0"]

  # Internal package from Git
  - git: "https://github.com/myorg/dbt-utils.git"
    revision: "v1.2.0"

  # Local package
  - local: "../shared-dbt-utils"

# dbt_project.yml for a package
name: 'my_org_utils'
version: '1.0.0'
config-version: 2

models:
  my_org_utils:
    +materialized: ephemeral
    +tags: ["internal"]

macros:
  +docs:
    description: "Internal utility macros for my organization"

State-Based Selection

State-based selection (--select state:modified+) enables dbt to build only models that changed since the last production run, along with their downstream dependents. This dramatically reduces build times and costs in large projects.

# GitHub Actions CI/CD Pipeline
# .github/workflows/dbt-ci.yml
name: dbt CI Pipeline

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging]

    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install dbt-core dbt-snowflake dbt-utils dbt-expectations

      - name: Configure dbt profile
        run: |
          mkdir -p ~/.dbt
          cat << EOF > ~/.dbt/profiles.yml
          my_project:
            target: ${{ matrix.environment }}
            outputs:
              dev:
                type: snowflake
                account: ${{ secrets.SNOWFLAKE_ACCOUNT }}
                user: ${{ secrets.SNOWFLAKE_USER }}
                password: ${{ secrets.SNOWFLAKE_PASSWORD }}
                role: TRANSFORMER
                database: ANALYTICS_DEV
                warehouse: TRANSFORM_WH
                schema: dbt_${{ github.actor }}
              staging:
                type: snowflake
                account: ${{ secrets.SNOWFLAKE_ACCOUNT }}
                user: ${{ secrets.SNOWFLAKE_USER }}
                password: ${{ secrets.SNOWFLAKE_PASSWORD }}
                role: TRANSFORMER
                database: ANALYTICS_STAGING
                warehouse: TRANSFORM_WH
                schema: public
          EOF

      - name: dbt deps
        run: dbt deps

      - name: dbt build (PR)
        if: github.event_name == 'pull_request'
        run: |
          dbt build --select state:modified+ \
            --defer \
            --state ./artifacts \
            --target dev \
            --full-refresh

      - name: dbt build (Main)
        if: github.event_name == 'push'
        run: |
          dbt build --full-refresh --target staging

      - name: dbt docs generate
        run: dbt docs generate --target staging

      - name: Upload artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: dbt-artifacts
          path: |
            target/manifest.json
            target/run_results.json
            target/catalog.json

Performance Optimization

dbt Build Optimization

Model Build Time = Σ (Execution_Time_i × Parallelism_Factor_i)
Parallelism Factor = 1 / (DAG_Degree_of_Concurrency)
Incremental Benefit = (Full_Rebuild_Cost - Incremental_Cost) / Full_Rebuild_Cost × 100%
Materialization Impact: view (no storage) < ephemeral (no table) < incremental (partial) < table (full)

-- Performance optimization: use ephemeral for intermediate models
-- models/intermediate/int_order_enriched.sql
{{
    config(
        materialized='ephemeral',
        tags=['performance']
    )
}}

WITH base AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

enriched AS (
    SELECT
        o.*,
        c.customer_segment,
        c.customer_tier,
        p.product_category,
        p.product_margin
    FROM base o
    LEFT JOIN {{ ref('dim_customers') }} c ON o.customer_id = c.customer_id
    LEFT JOIN {{ ref('dim_products') }} p ON o.product_id = p.product_id
)

SELECT * FROM enriched

-- Performance: use pre-hook for table optimization
{{
    config(
        materialized='table',
        pre_hook=[
            "ALTER TABLE {{ this }} SET AUTO_REFRESH = TRUE",
            "ALTER TABLE {{ this }} CLUSTER BY (order_date, customer_id)"
        ],
        post_hook=[
            "GRANT SELECT ON {{ this }} TO ROLE analytics_reader"
        ]
    )
}}

SELECT * FROM {{ ref('int_order_enriched') }}

-- Performance: use dbt_utils for common patterns
SELECT
    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('2024-01-01' as date)",
        end_date="current_date"
    ) }} AS date_day,
    {{ dbt_utils.generate_surrogate_key(['date_day']) }} AS date_key

Multi-Project Architecture

dbt Mesh is a pattern for sharing models, metrics, and semantic definitions across multiple dbt projects. It enables domain teams to own their data while providing governed interfaces for cross-domain consumption.

# Project A: Order Domain (owner: orders-team)
# dbt_project.yml
name: 'order_domain'
version: '1.0.0'

models:
  order_domain:
    +materialized: table
    +public: true  # Expose via Mesh
    +schema: order_analytics

# Export model for consumption by other projects
# models/marts/fact_orders_export.sql
{{
    config(
        materialized='incremental',
        unique_key='order_key',
        post_hook=[
            "GRANT SELECT ON {{ this }} TO ROLE cross_domain_reader"
        ]
    )
}}

SELECT * FROM {{ ref('fact_orders') }}

# Project B: Finance Domain (consumer: finance-team)
# dbt_project.yml
name: 'finance_domain'
version: '1.0.0'

models:
  finance_domain:
    +materialized: table
    +schema: finance_analytics

# Consume from Order Domain via ref
# models/marts/finance_revenue.sql
WITH orders AS (
    -- Reference from different project
    SELECT * FROM {{ ref('order_domain', 'fact_orders_export') }}
),

revenue AS (
    SELECT
        DATE_TRUNC('month', order_date) AS revenue_month,
        SUM(net_amount) AS total_revenue
    FROM orders
    GROUP BY 1
)

SELECT * FROM revenue

Key Concepts Summary

Concept	Description	Benefit	When to Use
Custom Macro	Reusable Jinja+SQL function	Code reuse, consistency	Complex patterns
Package	Shared project component	Standardization	Cross-team governance
State Selection	Build only changed models	80-95% faster builds	CI/CD, dev workflows
Defer	Use production artifacts	Fresh data in dev	Feature development
Multi-Project	Domain-oriented ownership	Scalable governance	Large organizations
Semantic Layer	Centralized metric definitions	Consistent metrics	Enterprise BI
On-run Hook	Pre/post execution logic	Automation	Cleanup, notifications
Custom Materialization	Override default behavior	Advanced patterns	Complex pipelines
Jinja Exposure	Generate exposure docs	Data discovery	Stakeholder communication
Metric Layer	dbt Semantic Layer	Single metric source	Organization-wide

Performance Metrics

Optimization	Build Time Reduction	Cost Savings	Complexity
Incremental models	60-90%	High	Medium
Ephemeral materialization	20-40%	Medium	Low
State-based selection	80-95%	Very High	Medium
Parallel model execution	30-60%	Medium	Low
Pre-aggregated tables	50-80%	High	High
Materialized views	40-70%	Medium	Low
Schema optimization	10-30%	Low	Low
Test optimization	20-40%	Low	Low
DAG optimization	10-20%	Low	Medium
Caching (dbt Server)	50-80%	High	High

10 Best Practices

Use ephemeral for intermediate models — avoid creating unnecessary tables in the warehouse
Implement state-based selection in CI — only build and test changed models and their dependents
Use --defer in development — reference production data without rebuilding upstream models
Version control package dependencies — pin to specific versions, not branches
Use on_schema_change='fail' by default — force explicit handling of schema changes
Implement dbt Mesh for multi-team projects — domain ownership with governed interfaces
Use dbt_utils and dbt_expectations — leverage battle-tested patterns instead of custom code
Tag models by priority and domain — enable selective builds and monitoring
Use pre_hook and post_hook for DDL operations — grants, clustering, vacuum
Monitor dbt run performance — track model build times and optimize slow models

Custom macros encapsulate complex Jinja+SQL logic for reusable patterns
State-based selection reduces CI/CD build times by 80-95%
Multi-project architecture (dbt Mesh) enables domain-oriented data governance
Ephemeral materialization and incremental strategies are the primary performance levers
CI/CD integration with GitHub Actions enables safe, automated deployments