πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

dbt Data Vault Modeling

🟒 Free Lesson

Advertisement

dbt Data Vault Modeling

Data Vault Architecture

Data Vault Pipeline

Formal Definitions

DfHub

A hub stores the unique business keys that represent core business entities (customers, products, orders). Each hub contains: a hash key (MD5/SHA256 of business key columns), the business key value(s), a load date timestamp, and a record source identifier. Hubs are inserted only when a new business key is encountered. Formally, a hub H = {HK, BK, LD, RS} where HK is the hash key, BK is the business key, LD is load date, and RS is record source.

DfLink

A link represents a many-to-many relationship between business keys from one or more hubs. Links contain: a hash key (from the combination of related hub hash keys), hash keys referencing each hub, a load date, and a record source. Links capture associations like customer-to-order or order-to-product. Formally, a link L = {HK, HK_1, HK_2, ..., HK_n, LD, RS}.

DfSatellite

A satellite stores descriptive attributes and context for a hub or link, tracked over time. Each satellite contains: a parent hash key (from the hub or link), a hash difference (detecting changes), all descriptive columns, a load date, and a record source. Satellites are inserted whenever attributes change. Formally, a satellite S = {PK, HD, Attr_1, ..., Attr_n, LD, RS}.

DfHash Key

A hash key is a deterministic, fixed-length identifier generated by hashing the business key columns using MD5 or SHA256. For hubs, HK = HASH(BK_1, BK_2, ..., BK_n). For links, HK = HASH(HK_hub_1, HK_hub_2, ..., HK_hub_n). Hash keys enable efficient joins, detect duplicates, and decouple the vault from surrogate key generation.

Detailed Explanation


What is Data Vault 2.0?

Data Vault 2.0 is an agile, auditable data modeling methodology designed for enterprise data warehousing. It separates business keys (hubs), relationships (links), and descriptive attributes (satellites) into distinct entity types.


Why Data Vault with dbt?

PrincipleHow dbt Enables It
Append-only loadingINSERT-only via incremental models
Parallel loadingHash keys eliminate dependencies
Full auditabilityRecord source tracking in every row
Source changesSatellites absorb attribute changes gracefully

Key Takeaway: Data Vault separates keys, relationships, and attributes β€” enabling parallel loading and full auditability. dbt's incremental materialization aligns perfectly with the append-only philosophy.

Data Vault 2.0 uses hash keys instead of sequence-based surrogate keys. This enables parallel loading across hubs, links, and satellites without coordination. Hash keys are deterministic: the same business key always produces the same hash, regardless of load order.

Use dbt's dbt_utils.generate_surrogate_key or custom macros for hash key generation. SHA256 is recommended over MD5 for collision resistance. Store hash keys as strings (VARCHAR) rather than binary to simplify debugging and downstream joins.

Code Examples

Hub Model

-- models/vault/hub_customer.sql
{{
    config(
        materialized='incremental',
        unique_key='hub_customer_hashkey',
        incremental_strategy='merge',
        tags=['vault', 'hub']
    )
}}

{%- set source_models = ['stg_crm_customers', 'stg_erp_customers'] -%}

with source_union as (
    {%- for source in source_models %}
    select
        customer_id as business_key,
        'source_{{ source }}' as record_source,
        _loaded_at as load_date
    from {{ ref(source) }}
    {%- if not loop.last %}
    union all
    {%- endif %}
    {%- endfor %}
),

deduplicated as (
    select
        {{ dbt_utils.generate_surrogate_key(['business_key']) }} as hub_customer_hashkey,
        business_key,
        min(load_date) as load_date,
        min(record_source) as record_source
    from source_union
    group by business_key
)

select
    hub_customer_hashkey,
    business_key,
    load_date,
    record_source
from deduplicated

{% if is_incremental() %}
where hub_customer_hashkey not in (
    select hub_customer_hashkey from {{ this }}
)
{% endif %}

Link Model

-- models/vault/link_order_customer.sql
{{
    config(
        materialized='incremental',
        unique_key='link_order_customer_hashkey',
        incremental_strategy='merge',
        tags=['vault', 'link']
    )
}}

with source_data as (
    select
        order_id as order_business_key,
        customer_id as customer_business_key,
        'stg_orders' as record_source,
        _loaded_at as load_date
    from {{ ref('stg_orders') }}
),

hash_keys as (
    select
        {{ dbt_utils.generate_surrogate_key(['order_business_key']) }} as hub_order_hashkey,
        {{ dbt_utils.generate_surrogate_key(['customer_business_key']) }} as hub_customer_hashkey,
        {{ dbt_utils.generate_surrogate_key(['order_business_key', 'customer_business_key']) }} as link_order_customer_hashkey,
        record_source,
        load_date
    from source_data
)

select distinct
    link_order_customer_hashkey,
    hub_order_hashkey,
    hub_customer_hashkey,
    load_date,
    record_source
from hash_keys

{% if is_incremental() %}
where link_order_customer_hashkey not in (
    select link_order_customer_hashkey from {{ this }}
)
{% endif %}

Satellite Model

-- models/vault/sat_order_details.sql
{{
    config(
        materialized='incremental',
        unique_key=['hub_order_hashkey', 'load_date'],
        incremental_strategy='merge',
        tags=['vault', 'satellite']
    )
}}

with source_data as (
    select
        order_id as business_key,
        status,
        amount,
        currency,
        order_date,
        _loaded_at as load_date,
        'stg_orders' as record_source
    from {{ ref('stg_orders') }}
),

hash_keys as (
    select
        {{ dbt_utils.generate_surrogate_key(['business_key']) }} as hub_order_hashkey,
        status,
        amount,
        currency,
        order_date,
        load_date,
        record_source,
        {{ dbt_utils.generate_surrogate_key(['status', 'amount', 'currency', 'order_date']) }} as hashdiff
    from source_data
),

{% if is_incremental() %}
previous_hashes as (
    select hub_order_hashkey, hashdiff
    from {{ this }}
),

new_records as (
    select
        h.hub_order_hashkey,
        h.status,
        h.amount,
        h.currency,
        h.order_date,
        h.load_date,
        h.record_source,
        h.hashdiff
    from hash_keys h
    left join previous_hashes p on h.hub_order_hashkey = p.hub_order_hashkey
    where p.hub_order_hashkey is null
       or h.hashdiff != p.hashdiff
)
select * from new_records
{% else %}
select * from hash_keys
{% endif %}

Hash Key Macro

-- macros/generate_hashkey.sql
{% macro generate_hashkey(columns, algorithm='sha256') %}
    {% if target.type == 'bigquery' %}
        {{ algorithm }}(concat({% for col in columns %}coalesce(cast({{ col }} as string), '^^')
        {%- if not loop.last %}, '||'{% endif %}{% endfor %}))
    {% elif target.type == 'snowflake' %}
        hash({% for col in columns %}coalesce(cast({{ col }} as varchar), '^^')
        {%- if not loop.last %}, '||'{% endif %}{% endfor %}, '{{ algorithm }}')
    {% else %}
        md5({% for col in columns %}coalesce(cast({{ col }} as varchar), '^^')
        {%- if not loop.last %}, '||'{% endif %}{% endfor %})
    {% endif %}
{% endmacro %}

Point-in-Time Table

-- models/vault/pit_order.sql
{{
    config(
        materialized='incremental',
        unique_key='hub_order_hashkey',
        tags=['vault', 'pit']
    )
}}

with dates as (
    select * from {{ ref('dim_date') }}
    where date_value >= '2024-01-01'
),

satellites as (
    select
        hub_order_hashkey,
        load_date,
        status,
        amount
    from {{ ref('sat_order_details') }}
),

pit_records as (
    select
        d.date_value as pit_date,
        s.hub_order_hashkey,
        s.status,
        s.amount,
        s.load_date as sat_load_date
    from dates d
    cross join (
        select distinct hub_order_hashkey from satellites
    ) s
    left join satellites s
        on s.hub_order_hashkey = s.hub_order_hashkey
        and s.load_date = (
            select max(load_date)
            from satellites
            where hub_order_hashkey = s.hub_order_hashkey
            and load_date <= d.date_value
        )
)

select * from pit_records

Comparison: Data Vault vs Dimensional

AspectData VaultDimensional
Design GoalAuditability, agilityQuery performance
StructureHubs, Links, SatellitesFacts, Dimensions
LoadingAppend-only (INSERT)Upsert (MERGE)
HistoryFull history in satellitesSCD Type 1 or 2
KeysHash keys (MD5/SHA256)Surrogate keys
PerformanceRequires views for queriesPre-joined star schema
ScalabilityExcellent parallel loadingGood with partitioning
ComplexityHigher modeling overheadLower modeling overhead

Performance Metrics

MetricDescriptionTarget
Hub Load TimeTime to insert new keys< 5 minutes
Satellite Change DetectionHashdiff comparison speed< 10 minutes
View MaterializationBusiness view creation< 15 minutes
Hash Collision RateDuplicate hash keys< 1 in 10^15
Audit CompletenessRecord source tracking100%

Best Practices

  1. Hash key consistency - Use the same algorithm and null handling across all models
  2. Record source tracking - Always capture the originating system
  3. Load date accuracy - Use source system timestamps, not dbt run timestamps
  4. Incremental strategy - Use merge for hubs/links, append for satellites
  5. Hashdiff optimization - Only include changed attributes in hashdiff
  6. Business vault - Apply business rules in a separate business vault layer
  7. PIT tables - Create point-in-time tables for efficient querying
  8. Documentation - Document business keys and relationships explicitly

See Also

⭐

Premium Content

dbt Data Vault Modeling

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert dbt Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement