πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

dbt Advanced: Custom Macros, Packages, CI/CD & Production Patterns

Module 3: Data Warehouses & StorageData Transformation - Advanced🟒 Free Lesson

Advertisement

Advanced dbt: Production-Grade Analytics Engineering

Beyond basic models and tests, production dbt requires custom macros for complex logic, package development for reusable patterns, CI/CD integration for safe deployments, and performance optimization for large-scale transformations.

Why Advanced dbt Matters


Key Benefits:

  • Governance patterns β€” for hundreds of models
  • Automated CI testing β€” catch issues before production
  • Selective model builds β€” efficient development workflows
  • Cross-team collaboration β€” enterprise-scale analytics engineering

Key Insight: Advanced dbt techniques enable enterprise-scale analytics engineering.


Architecture Overview

dbt CI/CD Pipeline Flow

dbt CI/CD PipelineDev BranchFeature PRCI Checkdbt build --selectTest Modelsdbt testReview PRApprove & MergeDeploydbt build proddbt Mesh (Cross-Project References)Project A (Staging)Project B (Marts)Semantic LayerMetrics StoreFederated models via project:// URIs | Shared semantic layer | Consistent metrics across teams

Custom Macros

A dbt macro is a reusable Jinja+SQL function that can be called from models, tests, or other macros. Macros encapsulate complex logic, reduce code duplication, and enable dynamic SQL generation.

-- macros/generate_surrogate_key.sql
{% macro generate_surrogate_key(field_list) %}
    {%- set fields = [] -%}
    {%- for field in field_list -%}
        {%- do fields.append(
            "COALESCE(CAST(" ~ field ~ " AS " ~ dbt_utils.type_string() ~ "), '')"
        ) -%}
    {%- endfor -%}

    {{ dbt_utils.generate_surrogate_key(fields) }}
{% endmacro %}

-- macros/merge_table.sql
{% macro merge_table(target, source, unique_key, update_columns=none) %}
    {%- set update_cols = update_columns or dbt_utils.get_filtered_columns(
        from=source, except=[unique_key]
    ) -%}

    MERGE INTO {{ target }} AS target
    USING ({{ source }}) AS source
    ON target.{{ unique_key }} = source.{{ unique_key }}
    WHEN MATCHED THEN
        UPDATE SET
            {% for col in update_cols %}
                target.{{ col }} = source.{{ col }}
                {% if not loop.last %},{% endif %}
            {% endfor %}
    WHEN NOT MATCHED THEN
        INSERT (
            {{ unique_key }},
            {{ update_cols | join(', ') }}
        )
        VALUES (
            source.{{ unique_key }},
            {{ update_cols | join(', ') }}
        )
{% endmacro %}

-- Usage in model
{{
    merge_table(
        target=ref('dim_customers'),
        source=ref('stg_customers'),
        unique_key='customer_id'
    )
}}

-- macros/apply_data_quality_checks.sql
{% macro apply_data_quality_checks(model_name, checks) %}
    {%- for check in checks %}
        SELECT
            '{{ model_name }}' AS model_name,
            '{{ check.column }}' AS column_name,
            '{{ check.type }}' AS check_type,
            COUNT(*) AS failing_rows
        FROM {{ model_name }}
        WHERE {% if check.type == 'not_null' %}
            {{ check.column }} IS NULL
        {% elif check.type == 'unique' %}
            {{ check.column }} IN (
                SELECT {{ check.column }}
                FROM {{ model_name }}
                GROUP BY {{ check.column }}
                HAVING COUNT(*) > 1
            )
        {% elif check.type == 'positive' %}
            {{ check.column }} <= 0
        {% endif %}
        HAVING COUNT(*) > 0
        {% if not loop.last %}UNION ALL{% endif %}
    {%- endfor %}
{% endmacro %}

-- macros/partition_management.sql
{% macro drop_partitions_before(table_name, date_column, days_to_keep=7) %}
    {% call statement('drop_partitions') %}
        {% if target.type == 'bigquery' %}
            DECLARE partition_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL {{ days_to_keep }} DAY);
            EXECUTE IMMEDIATE FORMAT(
                'ALTER TABLE %s DROP IF EXISTS PARTITIONS WHERE %s < "%s"',
                '{{ table_name }}', '{{ date_column }}', partition_date
            );
        {% elif target.type == 'snowflake' %}
            -- Snowflake handles via table expiration policies
            SELECT 1;
        {% endif %}
    {% endcall %}
{% endmacro %}

Package Development

A dbt package is a reusable collection of models, macros, tests, and configurations shared across projects via the dbt Hub or Git repositories. Packages promote standardization and reduce duplication across teams.

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]

  - package: dbt-labs/codegen
    version: [">=0.4.0"]

  - package: calogica/dbt_expectations
    version: [">=0.8.0"]

  - package: elementary-data/elementary
    version: [">=0.10.0"]

  # Internal package from Git
  - git: "https://github.com/myorg/dbt-utils.git"
    revision: "v1.2.0"

  # Local package
  - local: "../shared-dbt-utils"
# dbt_project.yml for a package
name: 'my_org_utils'
version: '1.0.0'
config-version: 2

models:
  my_org_utils:
    +materialized: ephemeral
    +tags: ["internal"]

macros:
  +docs:
    description: "Internal utility macros for my organization"

State-Based Selection

State-based selection (--select state:modified+) enables dbt to build only models that changed since the last production run, along with their downstream dependents. This dramatically reduces build times and costs in large projects.

# GitHub Actions CI/CD Pipeline
# .github/workflows/dbt-ci.yml
name: dbt CI Pipeline

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging]

    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install dbt-core dbt-snowflake dbt-utils dbt-expectations

      - name: Configure dbt profile
        run: |
          mkdir -p ~/.dbt
          cat << EOF > ~/.dbt/profiles.yml
          my_project:
            target: ${{ matrix.environment }}
            outputs:
              dev:
                type: snowflake
                account: ${{ secrets.SNOWFLAKE_ACCOUNT }}
                user: ${{ secrets.SNOWFLAKE_USER }}
                password: ${{ secrets.SNOWFLAKE_PASSWORD }}
                role: TRANSFORMER
                database: ANALYTICS_DEV
                warehouse: TRANSFORM_WH
                schema: dbt_${{ github.actor }}
              staging:
                type: snowflake
                account: ${{ secrets.SNOWFLAKE_ACCOUNT }}
                user: ${{ secrets.SNOWFLAKE_USER }}
                password: ${{ secrets.SNOWFLAKE_PASSWORD }}
                role: TRANSFORMER
                database: ANALYTICS_STAGING
                warehouse: TRANSFORM_WH
                schema: public
          EOF

      - name: dbt deps
        run: dbt deps

      - name: dbt build (PR)
        if: github.event_name == 'pull_request'
        run: |
          dbt build --select state:modified+ \
            --defer \
            --state ./artifacts \
            --target dev \
            --full-refresh

      - name: dbt build (Main)
        if: github.event_name == 'push'
        run: |
          dbt build --full-refresh --target staging

      - name: dbt docs generate
        run: dbt docs generate --target staging

      - name: Upload artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: dbt-artifacts
          path: |
            target/manifest.json
            target/run_results.json
            target/catalog.json

Performance Optimization

dbt Build Optimization

  • Model Build Time = Ξ£ (Execution_Time_i Γ— Parallelism_Factor_i)
  • Parallelism Factor = 1 / (DAG_Degree_of_Concurrency)
  • Incremental Benefit = (Full_Rebuild_Cost - Incremental_Cost) / Full_Rebuild_Cost Γ— 100%
  • Materialization Impact: view (no storage) < ephemeral (no table) < incremental (partial) < table (full)
-- Performance optimization: use ephemeral for intermediate models
-- models/intermediate/int_order_enriched.sql
{{
    config(
        materialized='ephemeral',
        tags=['performance']
    )
}}

WITH base AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

enriched AS (
    SELECT
        o.*,
        c.customer_segment,
        c.customer_tier,
        p.product_category,
        p.product_margin
    FROM base o
    LEFT JOIN {{ ref('dim_customers') }} c ON o.customer_id = c.customer_id
    LEFT JOIN {{ ref('dim_products') }} p ON o.product_id = p.product_id
)

SELECT * FROM enriched

-- Performance: use pre-hook for table optimization
{{
    config(
        materialized='table',
        pre_hook=[
            "ALTER TABLE {{ this }} SET AUTO_REFRESH = TRUE",
            "ALTER TABLE {{ this }} CLUSTER BY (order_date, customer_id)"
        ],
        post_hook=[
            "GRANT SELECT ON {{ this }} TO ROLE analytics_reader"
        ]
    )
}}

SELECT * FROM {{ ref('int_order_enriched') }}

-- Performance: use dbt_utils for common patterns
SELECT
    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('2024-01-01' as date)",
        end_date="current_date"
    ) }} AS date_day,
    {{ dbt_utils.generate_surrogate_key(['date_day']) }} AS date_key

Multi-Project Architecture

dbt Mesh is a pattern for sharing models, metrics, and semantic definitions across multiple dbt projects. It enables domain teams to own their data while providing governed interfaces for cross-domain consumption.

# Project A: Order Domain (owner: orders-team)
# dbt_project.yml
name: 'order_domain'
version: '1.0.0'

models:
  order_domain:
    +materialized: table
    +public: true  # Expose via Mesh
    +schema: order_analytics

# Export model for consumption by other projects
# models/marts/fact_orders_export.sql
{{
    config(
        materialized='incremental',
        unique_key='order_key',
        post_hook=[
            "GRANT SELECT ON {{ this }} TO ROLE cross_domain_reader"
        ]
    )
}}

SELECT * FROM {{ ref('fact_orders') }}
# Project B: Finance Domain (consumer: finance-team)
# dbt_project.yml
name: 'finance_domain'
version: '1.0.0'

models:
  finance_domain:
    +materialized: table
    +schema: finance_analytics

# Consume from Order Domain via ref
# models/marts/finance_revenue.sql
WITH orders AS (
    -- Reference from different project
    SELECT * FROM {{ ref('order_domain', 'fact_orders_export') }}
),

revenue AS (
    SELECT
        DATE_TRUNC('month', order_date) AS revenue_month,
        SUM(net_amount) AS total_revenue
    FROM orders
    GROUP BY 1
)

SELECT * FROM revenue

Key Concepts Summary

ConceptDescriptionBenefitWhen to Use
Custom MacroReusable Jinja+SQL functionCode reuse, consistencyComplex patterns
PackageShared project componentStandardizationCross-team governance
State SelectionBuild only changed models80-95% faster buildsCI/CD, dev workflows
DeferUse production artifactsFresh data in devFeature development
Multi-ProjectDomain-oriented ownershipScalable governanceLarge organizations
Semantic LayerCentralized metric definitionsConsistent metricsEnterprise BI
On-run HookPre/post execution logicAutomationCleanup, notifications
Custom MaterializationOverride default behaviorAdvanced patternsComplex pipelines
Jinja ExposureGenerate exposure docsData discoveryStakeholder communication
Metric Layerdbt Semantic LayerSingle metric sourceOrganization-wide

Performance Metrics

OptimizationBuild Time ReductionCost SavingsComplexity
Incremental models60-90%HighMedium
Ephemeral materialization20-40%MediumLow
State-based selection80-95%Very HighMedium
Parallel model execution30-60%MediumLow
Pre-aggregated tables50-80%HighHigh
Materialized views40-70%MediumLow
Schema optimization10-30%LowLow
Test optimization20-40%LowLow
DAG optimization10-20%LowMedium
Caching (dbt Server)50-80%HighHigh

10 Best Practices

  1. Use ephemeral for intermediate models β€” avoid creating unnecessary tables in the warehouse
  2. Implement state-based selection in CI β€” only build and test changed models and their dependents
  3. Use --defer in development β€” reference production data without rebuilding upstream models
  4. Version control package dependencies β€” pin to specific versions, not branches
  5. Use on_schema_change='fail' by default β€” force explicit handling of schema changes
  6. Implement dbt Mesh for multi-team projects β€” domain ownership with governed interfaces
  7. Use dbt_utils and dbt_expectations β€” leverage battle-tested patterns instead of custom code
  8. Tag models by priority and domain β€” enable selective builds and monitoring
  9. Use pre_hook and post_hook for DDL operations β€” grants, clustering, vacuum
  10. Monitor dbt run performance β€” track model build times and optimize slow models

  • Custom macros encapsulate complex Jinja+SQL logic for reusable patterns
  • State-based selection reduces CI/CD build times by 80-95%
  • Multi-project architecture (dbt Mesh) enables domain-oriented data governance
  • Ephemeral materialization and incremental strategies are the primary performance levers
  • CI/CD integration with GitHub Actions enables safe, automated deployments

See Also

⭐

Premium Content

dbt Advanced: Custom Macros, Packages, CI/CD & Production Patterns

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement