
dbt Documentation and Lineage



[Diagrams not shown: Documentation Architecture, Lineage Graph Structure, Documentation Generation Flow]

Detailed Explanation

dbt documentation provides a comprehensive way to document your data transformation pipeline, track data lineage, and generate interactive documentation sites.


What are the Documentation Components?

Model Documentation

Models can be documented with descriptions, tags, and metadata:

  • Description: What the model does and its business context
  • Columns: Detailed descriptions for each column
  • Tests: Data quality tests attached to models
  • Tags: Organizational tags for filtering and grouping

Source Documentation

Sources provide metadata about external data:

  • Freshness: Monitoring data freshness
  • Descriptions: What each source table contains
  • Columns: Column-level documentation
  • Loader: Information about the data loading process

Exposure Documentation

Exposures define how data is used downstream:

  • Dashboards: BI tool dashboards
  • Applications: Data applications
  • Exports: Data exports to external systems

How does Lineage Tracking work?

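dbt builds its lineage graph (the DAG) automatically from the ref() and source() calls in your models. That graph supports several kinds of analysis:

  1. Model-level lineage: See dependencies between models, sources, and exposures
  2. Column-level lineage: Track how columns flow through transformations. ref() and source() alone do not capture this; it needs additional tooling, such as the column-level lineage feature in dbt Cloud
  3. Impact analysis: Understand downstream effects of changes
  4. Root cause analysis: Trace issues back to their source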
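
To make the idea concrete, here is a minimal Python sketch of how ref() and source() calls can be turned into model-level dependency edges. It uses simplified regular expressions as an assumption for illustration; dbt itself renders the Jinja rather than pattern-matching, and the model SQL below is hypothetical.

```python
import re

# Simplified patterns (an assumption): dbt renders Jinja instead of using regexes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")
SOURCE_PATTERN = re.compile(
    r"\{\{\s*source\(\s*['\"]([^'\"]+)['\"]\s*,\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}"
)


def find_parents(sql: str) -> list[str]:
    """Return the upstream nodes referenced by one model's SQL."""
    parents = [f"model.{name}" for name in REF_PATTERN.findall(sql)]
    parents += [f"source.{src}.{tbl}" for src, tbl in SOURCE_PATTERN.findall(sql)]
    return sorted(set(parents))


# Hypothetical model SQL, keyed by model name.
models = {
    "stg_orders": "select * from {{ source('shopify', 'orders') }}",
    "fct_orders": (
        "select * from {{ ref('stg_orders') }} "
        "join {{ ref('dim_customers') }} using (customer_id)"
    ),
}

# Each model maps to the list of nodes it depends on.
lineage = {name: find_parents(sql) for name, sql in models.items()}
```

A model that hard-codes a table name instead of calling ref() or source() would produce no edge here, which is why such models drop out of the lineage graph.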

How is Documentation generated?

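Running dbt docs generate writes the files behind a static documentation site to the target/ directory; dbt docs serve then hosts that site locally:

  1. Catalog (catalog.json): Schema information queried from the warehouse for all models and sources
  2. Manifest (manifest.json): Complete project metadata, including descriptions, tests, and dependencies
  3. Lineage Graph: Interactive visualization of the DAG
  4. Search: Full-text search across documentation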
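
As an illustration of how the two artifacts complement each other, the sketch below joins column descriptions from a manifest with column types from a catalog. The dictionaries are tiny hand-written stand-ins for target/manifest.json and target/catalog.json, and the project name shop is an assumption.

```python
import json
from pathlib import Path


def load_artifact(target_dir: str, name: str) -> dict:
    """Read one JSON artifact (e.g. manifest.json) from the target directory."""
    return json.loads((Path(target_dir) / name).read_text())


def describe_columns(manifest: dict, catalog: dict, unique_id: str) -> dict:
    """Combine column descriptions (manifest) with warehouse types (catalog)."""
    documented = manifest["nodes"][unique_id].get("columns", {})
    physical = catalog["nodes"].get(unique_id, {}).get("columns", {})
    return {
        name: {
            "type": info.get("type"),
            "description": documented.get(name, {}).get("description", ""),
        }
        for name, info in physical.items()
    }


# Hand-written stand-ins for the real files (hypothetical project "shop").
manifest = {
    "nodes": {
        "model.shop.fct_orders": {
            "columns": {"order_id": {"description": "Unique identifier for each order"}}
        }
    }
}
catalog = {
    "nodes": {
        "model.shop.fct_orders": {
            "columns": {"order_id": {"type": "INTEGER"}, "order_date": {"type": "DATE"}}
        }
    }
}

columns = describe_columns(manifest, catalog, "model.shop.fct_orders")
```

Against a real project you would call load_artifact("target", "manifest.json") and load_artifact("target", "catalog.json") instead. Some warehouses report column names in a different case than the YAML, so a real comparison may need to normalize case first.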

What metadata does dbt manage?

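  • Schema information: Column names, types, descriptions (manifest.json and catalog.json)
  • Test results: Pass/fail status for all tests in the most recent invocation (run_results.json)
  • Run history: Execution times and status per node. dbt Core keeps only the latest invocation in run_results.json, so a real history means archiving these files or using an orchestration platform
  • Freshness: Data freshness metrics produced by dbt source freshness (sources.json)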
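
The test results and execution times above can be read from the run_results.json artifact. The following sketch summarizes one invocation; the sample dictionary is a hand-written stand-in with hypothetical node names, and only the results, unique_id, status, and execution_time fields are used.

```python
from collections import Counter


def summarize_run(run_results: dict) -> dict:
    """Summarize statuses and timings from one run_results.json payload."""
    results = run_results["results"]
    slowest = max(results, key=lambda r: r.get("execution_time", 0.0), default=None)
    return {
        "status_counts": dict(Counter(r["status"] for r in results)),
        "total_seconds": round(sum(r.get("execution_time", 0.0) for r in results), 2),
        "slowest": slowest["unique_id"] if slowest else None,
    }


# Hand-written stand-in for target/run_results.json (hypothetical nodes).
run_results = {
    "results": [
        {"unique_id": "model.shop.fct_orders", "status": "success", "execution_time": 12.5},
        {"unique_id": "test.shop.unique_fct_orders_order_id", "status": "pass", "execution_time": 0.8},
        {"unique_id": "test.shop.not_null_fct_orders_customer_id", "status": "fail", "execution_time": 1.2},
    ]
}

summary = summarize_run(run_results)
```

Storing one such summary per invocation is a simple way to accumulate a run history outside dbt.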

Key Takeaway: dbt documentation gives you one place to describe models, track lineage, and generate an interactive documentation site, which keeps data transformations transparent and maintainable.

Code Examples

Model Documentation YAML

# models/marts/fct_orders.yml
version: 2

models:
  - name: fct_orders
    description: >
      Fact table containing all order transactions. This is the central
      fact table for the order analytics domain. It contains one row
      per order with aggregated metrics and dimension attributes.
    
    config:
      tags: ['finance', 'core']
      meta:
        owner: data-engineering
        team: analytics
        cost_center: marketing
    
    columns:
      - name: order_id
        description: "Unique identifier for each order"
        data_tests:
          - unique
          - not_null
        meta:
          system: shopify
          pii: false
      
      - name: customer_id
        description: "Foreign key to dim_customers"
        data_tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
        meta:
          pii: false
      
      - name: order_date
        description: "Date when the order was placed"
        data_tests:
          - not_null
        meta:
          format: YYYY-MM-DD

Source Documentation

# models/staging/_sources.yml
version: 2

sources:
  - name: shopify
    description: "Raw data from Shopify e-commerce platform"
    database: raw
    schema: shopify
    loader: fivetran
    loaded_at_field: _fivetran_synced
    
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    
    meta:
      owner: data-engineering
      team: ecommerce
      cost_center: platform
    
    tables:
      - name: orders
        description: "All orders from Shopify"
        meta:
          incremental: true
          partition_column: created_at
        
        columns:
          - name: id
            description: "Order ID from Shopify"
            data_tests:
              - unique
              - not_null
          
          - name: customer_id
            description: "Customer ID from Shopify"
            data_tests:
              - not_null
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id
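
The freshness block above sets two thresholds on the age of the newest loaded row. The Python sketch below shows that threshold logic under simple assumptions; it is not dbt's implementation, and the function name freshness_status is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Period names as used in the freshness config.
PERIODS = {
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
}


def freshness_status(loaded_at, now, warn_after, error_after) -> str:
    """Classify a source as pass, warn, or error from the age of its newest row."""
    age = now - loaded_at
    if age > error_after["count"] * PERIODS[error_after["period"]]:
        return "error"
    if age > warn_after["count"] * PERIODS[warn_after["period"]]:
        return "warn"
    return "pass"


# Thresholds copied from the source definition above.
warn = {"count": 6, "period": "hour"}
error = {"count": 24, "period": "hour"}

now = datetime(2024, 1, 2, tzinfo=timezone.utc)
status = freshness_status(now - timedelta(hours=7), now, warn, error)
```

With these thresholds, data last synced 7 hours ago is past the 6-hour warning limit but within the 24-hour error limit.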

Exposure Documentation

# exposures/order_dashboard.yml
version: 2

exposures:
  - name: order_analytics_dashboard
    type: dashboard
    description: "Main dashboard for order analytics"
    url: https://looker.example.com/dashboards/order_analytics
    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')
      - ref('fct_order_items')
    
    meta:
      owner: analytics-team
      team: business-intelligence
      refresh_frequency: hourly
    
    owner:
      name: Data Engineering
      email: data-eng@company.com
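
Once exposures are defined, they appear in manifest.json with their dependencies, which makes a simple impact check possible. The sketch below lists the exposures that depend directly on a given model; the manifest is a hand-written stand-in, and the project name shop and the customer_export exposure are hypothetical.

```python
def exposures_using(manifest: dict, model_id: str) -> list[str]:
    """Return names of exposures that depend directly on the given model."""
    return sorted(
        exposure["name"]
        for exposure in manifest.get("exposures", {}).values()
        if model_id in exposure.get("depends_on", {}).get("nodes", [])
    )


# Hand-written stand-in for the exposures section of manifest.json.
manifest = {
    "exposures": {
        "exposure.shop.order_analytics_dashboard": {
            "name": "order_analytics_dashboard",
            "depends_on": {"nodes": ["model.shop.fct_orders", "model.shop.dim_customers"]},
        },
        "exposure.shop.customer_export": {
            "name": "customer_export",
            "depends_on": {"nodes": ["model.shop.dim_customers"]},
        },
    }
}

affected = exposures_using(manifest, "model.shop.dim_customers")
```

This only checks direct dependencies; an exposure that reads a model further downstream would need a graph traversal to be found.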

Custom Metadata Tags

# models/marts/dim_customers.yml
version: 2

models:
  - name: dim_customers
    description: "Customer dimension table"
    
    config:
      tags: ['customer', 'dimension', 'pii']
      meta:
        data_classification: confidential
        retention_days: 730
        backup_policy: daily
        compliance:
          - gdpr
          - ccpa
    
    columns:
      - name: customer_id
        description: "Unique customer identifier"
        meta:
          pii: false
          system: primary_key
      
      - name: email
        description: "Customer email address"
        meta:
          pii: true
          encryption: aes-256
          masking_policy: hash
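
Custom metadata like the pii flag becomes useful when something reads it. As a rough sketch, the function below scans a manifest for model columns marked pii: true. The manifest is a hand-written stand-in, the project name shop is an assumption, and the exact location of column meta may differ between dbt versions.

```python
def pii_columns(manifest: dict) -> list[tuple[str, str]]:
    """Return (model, column) pairs whose column meta sets pii to true."""
    found = []
    for node in manifest["nodes"].values():
        if node.get("resource_type") != "model":
            continue
        for column in node.get("columns", {}).values():
            if column.get("meta", {}).get("pii") is True:
                found.append((node["name"], column["name"]))
    return sorted(found)


# Hand-written stand-in mirroring the dim_customers YAML above.
manifest = {
    "nodes": {
        "model.shop.dim_customers": {
            "resource_type": "model",
            "name": "dim_customers",
            "columns": {
                "customer_id": {"name": "customer_id", "meta": {"pii": False}},
                "email": {"name": "email", "meta": {"pii": True}},
            },
        }
    }
}

flagged = pii_columns(manifest)
```

A governance job could run a check like this in CI and fail when a flagged column lacks a masking policy.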

Lineage Query

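-- Query to find all downstream dependencies of fct_orders.
-- Assumes a dbt_model_dependencies model that you build yourself
-- (for example, from manifest.json); dbt does not ship one.
-- Most warehouses need the RECURSIVE keyword for a self-referencing CTE.

with recursive model_dependencies as (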
    select
        source_model,
        target_model
    from {{ ref('dbt_model_dependencies') }}
),

recursive_lineage as (
    select
        target_model as model_name,
        1 as level,
        cast(target_model as varchar(1000)) as path
    from model_dependencies
    where source_model = 'fct_orders'
    
    union all
    
    select
        md.target_model,
        rl.level + 1,
        cast(rl.path || ' -> ' || md.target_model as varchar(1000))
    from model_dependencies md
    inner join recursive_lineage rl on md.source_model = rl.model_name
    where rl.level < 10
)

select distinct
    model_name,
    level,
    path
from recursive_lineage
order by level, model_name
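
The same downstream walk can be done without a warehouse table by reading the child_map section of manifest.json. The sketch below is a breadth-first traversal under that assumption; the child_map shown is hand-written and its node names are hypothetical.

```python
from collections import deque


def downstream(child_map: dict, start: str, max_depth: int = 10) -> dict:
    """Breadth-first walk returning {node: shortest distance from start}."""
    levels = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # same depth cap as the SQL version
        for child in child_map.get(node, []):
            if child not in levels:
                levels[child] = depth + 1
                queue.append((child, depth + 1))
    return levels


# Hand-written stand-in for manifest["child_map"].
child_map = {
    "model.shop.fct_orders": [
        "model.shop.rpt_daily_revenue",
        "exposure.shop.order_analytics_dashboard",
    ],
    "model.shop.rpt_daily_revenue": ["exposure.shop.finance_report"],
}

levels = downstream(child_map, "model.shop.fct_orders")
```

In a real manifest, child_map also lists the tests attached to each node, so you may want to filter by the unique_id prefix.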

Performance Metrics

Metric                  Description                Typical Value
Doc Generation Time     Time to build docs site    10-30 seconds
Search Index Size       Size of search index       1-5 MB
Lineage Graph Size      Nodes in lineage graph     100-1000+
Documentation Coverage  % of models documented     80-100%
Test Coverage           % of columns tested        70-90%
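
Documentation coverage can be measured rather than estimated. One possible definition, sketched below, is the share of models in manifest.json with a non-empty description; the manifest is a hand-written stand-in and stg_customers is a hypothetical undocumented model.

```python
def documentation_coverage(manifest: dict) -> float:
    """Percentage of models that have a non-empty description."""
    models = [
        node for node in manifest["nodes"].values()
        if node.get("resource_type") == "model"
    ]
    if not models:
        return 0.0
    documented = sum(1 for node in models if node.get("description", "").strip())
    return round(100 * documented / len(models), 1)


# Hand-written stand-in: two documented models, one undocumented, one test node.
manifest = {
    "nodes": {
        "model.shop.fct_orders": {"resource_type": "model", "description": "Fact table"},
        "model.shop.dim_customers": {"resource_type": "model", "description": "Customer dimension table"},
        "model.shop.stg_customers": {"resource_type": "model", "description": ""},
        "test.shop.unique_fct_orders_order_id": {"resource_type": "test", "description": ""},
    }
}

coverage = documentation_coverage(manifest)
```

The same pattern extends to column-level coverage by counting documented columns instead of models.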

Best Practices

  1. Document everything - Add descriptions to all models and columns
  2. Use consistent naming - Follow a naming convention for tags and metadata
  3. Track lineage - Use ref() and source() for complete lineage
  4. Monitor freshness - Configure source freshness checks
  5. Tag appropriately - Use tags for filtering and organization
  6. Define exposures - Document how data is consumed downstream
  7. Review regularly - Keep documentation up to date
  8. Use metadata - Add custom metadata for governance and compliance
