dbt Documentation and Lineage

[Diagrams not reproduced: Documentation Architecture, Lineage Graph Structure, Documentation Generation Flow]

Detailed Explanation

dbt lets you document your data transformation pipeline alongside the code that defines it, track data lineage, and generate an interactive documentation site.

What are the Documentation Components?

Model Documentation

Models can be documented with descriptions, tags, and metadata:

Description: What the model does and its business context
Columns: Detailed descriptions for each column
Tests: Data quality tests attached to models
Tags: Organizational tags for filtering and grouping

Source Documentation

Sources provide metadata about external data:

Freshness: Monitoring data freshness
Descriptions: What each source table contains
Columns: Column-level documentation
Loader: Information about the data loading process

Exposure Documentation

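Exposures define how data is used downstream. Each exposure declares a type (dashboard, notebook, analysis, ml, or application), for example:

Dashboards: BI tool dashboards (type: dashboard)
Applications: Data applications (type: application)
Exports: Data exports to external systems (there is no dedicated export type, so choose the closest one, such as application)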

How does Lineage Tracking work?

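dbt builds its dependency graph (DAG) from the ref() and source() calls in your models. That graph supports:

Model-level lineage: See dependencies between sources, models, and exposures
Column-level lineage: Track how columns flow through transformations (not part of the graph that the dbt Core docs site renders; this requires dbt Cloud or a third-party tool)
Impact analysis: Understand downstream effects of changes
Root cause analysis: Trace issues back to their source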
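
As a rough illustration of impact analysis, the sketch below walks the child_map section of dbt's manifest.json, which maps each node's unique ID to the nodes that depend on it directly. The sample map and the project name shop are made up for illustration; in a real project you would load target/manifest.json instead.

```python
from collections import deque

# Hypothetical excerpt shaped like manifest.json["child_map"]:
# each key is a node's unique_id, each value lists its direct children.
child_map = {
    "source.shop.shopify.orders": ["model.shop.stg_orders"],
    "model.shop.stg_orders": ["model.shop.fct_orders"],
    "model.shop.fct_orders": [
        "model.shop.rpt_daily_revenue",
        "exposure.shop.order_analytics_dashboard",
    ],
    "model.shop.rpt_daily_revenue": [],
    "exposure.shop.order_analytics_dashboard": [],
}


def downstream(child_map, start):
    """Return {unique_id: depth} for every node reachable from `start`."""
    depths = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for child in child_map.get(node, []):
            if child not in depths:  # already reached by a path of equal or shorter length
                depths[child] = depth + 1
                queue.append((child, depth + 1))
    return depths


impacted = downstream(child_map, "model.shop.stg_orders")
for unique_id, depth in sorted(impacted.items(), key=lambda kv: (kv[1], kv[0])):
    print(depth, unique_id)
```

For root cause analysis, the same walk over the manifest's parent_map moves upstream instead of downstream.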

How is Documentation generated?

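Running dbt docs generate writes the files for a static documentation site to the target/ directory, and dbt docs serve hosts that site locally:

Catalog (catalog.json): Column names, types, and table statistics read from the warehouse
Manifest (manifest.json): Complete project metadata, including descriptions, tests, and dependencies
Lineage Graph: Interactive visualization of the DAG
Search: Full-text search across documentation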
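
To make the split between the two artifacts concrete, here is a minimal sketch that pairs column types from catalog.json with column descriptions from manifest.json. The two dictionaries are hand-written stand-ins that keep only the keys used here, and the unique ID and values are illustrative. Column names are compared case-insensitively because some warehouses report them in upper case.

```python
# Hand-written stand-ins for the two artifacts; only the keys used below are kept.
manifest = {
    "nodes": {
        "model.shop.fct_orders": {
            "description": "Fact table containing all order transactions.",
            "columns": {
                "order_id": {"description": "Unique identifier for each order"},
                "order_date": {"description": "Date when the order was placed"},
            },
        }
    }
}
catalog = {
    "nodes": {
        "model.shop.fct_orders": {
            "columns": {
                "ORDER_ID": {"type": "NUMBER", "index": 1},
                "CUSTOMER_ID": {"type": "NUMBER", "index": 2},
                "ORDER_DATE": {"type": "DATE", "index": 3},
            }
        }
    }
}


def describe_columns(manifest, catalog, unique_id):
    """Pair each warehouse column (catalog) with its YAML description (manifest)."""
    documented = {
        name.lower(): col.get("description", "")
        for name, col in manifest["nodes"][unique_id]["columns"].items()
    }
    warehouse_cols = catalog["nodes"][unique_id]["columns"]
    rows = []
    # Keep the warehouse's column order by sorting on the catalog index.
    for name, col in sorted(warehouse_cols.items(), key=lambda kv: kv[1]["index"]):
        rows.append((name.lower(), col["type"], documented.get(name.lower(), "")))
    return rows


rows = describe_columns(manifest, catalog, "model.shop.fct_orders")
for name, col_type, description in rows:
    print(f"{name:<12} {col_type:<8} {description or '(undocumented)'}")
```

In this sample, customer_id exists in the warehouse but has no YAML entry, so it shows up as undocumented.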

What metadata does dbt manage?

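Schema information: Column names, types, descriptions (catalog.json and manifest.json)
Test results: Pass/fail status for the tests in the most recent invocation (run_results.json)
Run history: Execution times and status for the most recent invocation only; dbt Core overwrites run_results.json on each run, so keeping a history means archiving the artifact yourself or using a hosted service
Freshness: Data freshness metrics, written to sources.json by dbt source freshness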
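
As an example of reading this metadata, the sketch below summarizes the results section of a run_results.json file by status and lists the slowest nodes. The results list is a trimmed, invented sample that keeps only the unique_id, status, and execution_time fields; real test IDs also carry a hash suffix.

```python
from collections import Counter

# Trimmed, invented sample shaped like run_results.json["results"].
results = [
    {"unique_id": "model.shop.stg_orders", "status": "success", "execution_time": 1.8},
    {"unique_id": "model.shop.fct_orders", "status": "success", "execution_time": 12.4},
    {"unique_id": "test.shop.not_null_fct_orders_order_id", "status": "pass", "execution_time": 0.6},
    {"unique_id": "test.shop.unique_fct_orders_order_id", "status": "fail", "execution_time": 0.9},
]


def summarize(results, top_n=2):
    """Count results per status and return the `top_n` slowest nodes."""
    status_counts = Counter(r["status"] for r in results)
    slowest = sorted(results, key=lambda r: r["execution_time"], reverse=True)[:top_n]
    return status_counts, [(r["unique_id"], r["execution_time"]) for r in slowest]


status_counts, slowest = summarize(results)
print(dict(status_counts))
print(slowest)
```

Archiving one such summary per run is one simple way to build the run history that dbt Core does not keep on its own.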

Key Takeaway: dbt keeps documentation, lineage, and an interactive docs site in one place, generated from the same project files as the transformations, which makes those transformations transparent and maintainable.

Code Examples

Model Documentation YAML

# models/marts/fct_orders.yml
version: 2

models:
  - name: fct_orders
    description: >
      Fact table containing all order transactions. This is the central
      fact table for the order analytics domain. It contains one row
      per order with aggregated metrics and dimension attributes.
    
    config:
      tags: ['finance', 'core']
      meta:
        owner: data-engineering
        team: analytics
        cost_center: marketing
    
    columns:
      - name: order_id
        description: "Unique identifier for each order"
        data_tests:
          - unique
          - not_null
        meta:
          system: shopify
          pii: false
      
      - name: customer_id
        description: "Foreign key to dim_customers"
        data_tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
        meta:
          pii: false
      
      - name: order_date
        description: "Date when the order was placed"
        data_tests:
          - not_null
        meta:
          format: YYYY-MM-DD

Source Documentation

# models/staging/_sources.yml
version: 2

sources:
  - name: shopify
    description: "Raw data from Shopify e-commerce platform"
    database: raw
    schema: shopify
    loader: fivetran
    loaded_at_field: _fivetran_synced
    
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    
    meta:
      owner: data-engineering
      team: ecommerce
      cost_center: platform
    
    tables:
      - name: orders
        description: "All orders from Shopify"
        meta:
          incremental: true
          partition_column: created_at
        
        columns:
          - name: id
            description: "Order ID from Shopify"
            data_tests:
              - unique
              - not_null
          
          - name: customer_id
            description: "Customer ID from Shopify"
            data_tests:
              - not_null
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id

Exposure Documentation

# models/exposures/order_dashboard.yml
# (exposure YAML must sit under a path dbt scans, such as models/)
version: 2

exposures:
  - name: order_analytics_dashboard
    type: dashboard
    description: "Main dashboard for order analytics"
    url: https://looker.example.com/dashboards/order_analytics
    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')
      - ref('fct_order_items')
    
    meta:
      owner: analytics-team
      team: business-intelligence
      refresh_frequency: hourly
    
    owner:
      name: Data Engineering
      email: data-eng@company.com

Custom Metadata Tags

# models/marts/dim_customers.yml
version: 2

models:
  - name: dim_customers
    description: "Customer dimension table"
    
    config:
      tags: ['customer', 'dimension', 'pii']
      meta:
        data_classification: confidential
        retention_days: 730
        backup_policy: daily
        compliance:
          - gdpr
          - ccpa
    
    columns:
      - name: customer_id
        description: "Unique customer identifier"
        meta:
          pii: false
          system: primary_key
      
      - name: email
        description: "Customer email address"
        meta:
          pii: true
          encryption: aes-256
          masking_policy: hash

Lineage Query

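-- Query to find all downstream dependencies of fct_orders.
-- Assumes a model named dbt_model_dependencies with one row per edge
-- (source_model, target_model). dbt does not create this table for you;
-- it has to be built from manifest.json or by a metadata package.
-- Postgres, BigQuery, and Redshift require the RECURSIVE keyword;
-- Snowflake treats it as optional and SQL Server does not accept it.

with recursive model_dependencies as (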
    select
        source_model,
        target_model
    from {{ ref('dbt_model_dependencies') }}
),

recursive_lineage as (
    select
        target_model as model_name,
        1 as level,
        cast(target_model as varchar(1000)) as path
    from model_dependencies
    where source_model = 'fct_orders'
    
    union all
    
    select
        md.target_model,
        rl.level + 1,
        cast(rl.path || ' -> ' || md.target_model as varchar(1000))
    from model_dependencies md
    inner join recursive_lineage rl on md.source_model = rl.model_name
    where rl.level < 10
)

select distinct
    model_name,
    level,
    path
from recursive_lineage
order by level, model_name

Performance Metrics

Metric	Description	Typical Value
Doc Generation Time	Time to build docs site	10-30 seconds
Search Index Size	Size of search index	1-5 MB
Lineage Graph Size	Nodes in lineage graph	100-1000+
Documentation Coverage	% of models documented	80-100%
Test Coverage	% of columns tested	70-90%
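
The Documentation Coverage figure in the table can be estimated from manifest.json. The sketch below is one possible definition, not an official dbt metric: the share of models with a non-empty description, and the share of declared columns with one. The nodes dictionary is an invented excerpt shaped like the manifest's nodes section.

```python
# Invented excerpt shaped like manifest.json["nodes"]; only the keys used are kept.
nodes = {
    "model.shop.fct_orders": {
        "resource_type": "model",
        "description": "Fact table containing all order transactions.",
        "columns": {
            "order_id": {"description": "Unique identifier for each order"},
            "customer_id": {"description": ""},
        },
    },
    "model.shop.stg_orders": {
        "resource_type": "model",
        "description": "",
        "columns": {},
    },
    "test.shop.not_null_fct_orders_order_id": {
        "resource_type": "test",
        "description": "",
        "columns": {},
    },
}


def documentation_coverage(nodes):
    """Return (model_pct, column_pct) for models and their declared columns."""
    models = [n for n in nodes.values() if n["resource_type"] == "model"]
    columns = [c for m in models for c in m["columns"].values()]
    documented_models = sum(1 for m in models if m["description"].strip())
    documented_columns = sum(1 for c in columns if c["description"].strip())
    model_pct = 100 * documented_models / len(models) if models else 0.0
    column_pct = 100 * documented_columns / len(columns) if columns else 0.0
    return model_pct, column_pct


model_pct, column_pct = documentation_coverage(nodes)
print(f"models documented: {model_pct:.0f}%, columns documented: {column_pct:.0f}%")
```

The manifest lists only columns that appear in YAML, so the column figure overstates coverage unless it is compared against the full column list in catalog.json.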

Best Practices

Document everything - Add descriptions to all models and columns
Use consistent naming - Follow a naming convention for tags and metadata
Track lineage - Use ref() and source() instead of hard-coded table names so every dependency appears in the graph
Monitor freshness - Configure source freshness checks
Tag appropriately - Use tags for filtering and organization
Define exposures - Document how data is consumed downstream
Review regularly - Keep documentation up to date
Use metadata - Add custom metadata for governance and compliance
