dbt Documentation and Lineage
[Diagrams not reproduced: Documentation Architecture; Lineage Graph Structure; Documentation Generation Flow]
Detailed Explanation
dbt's documentation features let you describe models, sources, and exposures in YAML, derive data lineage from the project's dependency graph, and generate an interactive documentation site from that metadata.
What are the Documentation Components?
Model Documentation
Models can be documented with descriptions, tags, and metadata:
- Description: What the model does and its business context
- Columns: Detailed descriptions for each column
- Tests: Data quality tests attached to models
- Tags: Organizational tags for filtering and grouping
Source Documentation
Sources provide metadata about external data:
- Freshness: Monitoring data freshness
- Descriptions: What each source table contains
- Columns: Column-level documentation
- Loader: Information about the data loading process
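The freshness check can be read as a comparison between the age of the most recently loaded row and two thresholds. The sketch below is an illustrative re-implementation of that idea in Python, not dbt's own code; `freshness_status` and its parameters are hypothetical names chosen for this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_status(
    max_loaded_at: datetime,
    now: datetime,
    warn_after: Optional[timedelta] = None,
    error_after: Optional[timedelta] = None,
) -> str:
    """Classify a source table as 'pass', 'warn', or 'error' by data age."""
    age = now - max_loaded_at
    # Check the error threshold first so it takes precedence over warn.
    if error_after is not None and age > error_after:
        return "error"
    if warn_after is not None and age > warn_after:
        return "warn"
    return "pass"


# Sample values (assumed): the newest row was loaded 8 hours before "now".
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
loaded = datetime(2024, 1, 2, 4, 0, tzinfo=timezone.utc)
status = freshness_status(loaded, now, timedelta(hours=6), timedelta(hours=24))
print(status)  # warn
```

With a 6-hour warn window and a 24-hour error window, data that is 8 hours old lands in the warn band.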
Exposure Documentation
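Exposures declare how dbt models are consumed downstream, so consumers appear in the lineage graph. The exposure `type` field accepts `dashboard`, `notebook`, `analysis`, `ml`, or `application`:
- Dashboards: BI tool dashboards
- Applications: Data applications
- Exports: Data exports to external systems (there is no dedicated export type; these are usually declared as `application`)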
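Because exposures are recorded in `manifest.json` with the nodes they depend on, it is possible to ask which consumers read a given model. The following is a minimal sketch over a hand-written manifest fragment; the project name `shop` and the `churn_export` exposure are assumptions made for illustration, and only direct dependencies are considered.

```python
from typing import Dict, List


def exposures_using(manifest: Dict, node_id: str) -> List[str]:
    """Return names of exposures that depend directly on the given node."""
    return sorted(
        exposure["name"]
        for exposure in manifest.get("exposures", {}).values()
        if node_id in exposure.get("depends_on", {}).get("nodes", [])
    )


# Hand-written fragment shaped like the "exposures" section of manifest.json.
manifest = {
    "exposures": {
        "exposure.shop.order_analytics_dashboard": {
            "name": "order_analytics_dashboard",
            "type": "dashboard",
            "depends_on": {
                "nodes": ["model.shop.fct_orders", "model.shop.dim_customers"]
            },
        },
        "exposure.shop.churn_export": {
            "name": "churn_export",
            "type": "application",
            "depends_on": {"nodes": ["model.shop.dim_customers"]},
        },
    }
}

print(exposures_using(manifest, "model.shop.dim_customers"))
# ['churn_export', 'order_analytics_dashboard']
```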
How does Lineage Tracking work?
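dbt builds its dependency graph (DAG) automatically from the ref() and source() calls in your models. That graph supports:
- Model-level lineage: See dependencies between models, sources, and exposures
- Column-level lineage: Track how columns flow through transformations (this is not derived from ref() and source() alone; it requires additional tooling, such as dbt Explorer in dbt Cloud)
- Impact analysis: Understand downstream effects of changes
- Root cause analysis: Trace issues back to their source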
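Impact analysis amounts to walking the dependency graph downstream from a changed node. `manifest.json` exposes this graph as `child_map` (and its inverse, `parent_map`). The sketch below runs a breadth-first traversal over a small hand-written `child_map`; the project name `shop` and the node identifiers are assumptions for illustration.

```python
from collections import deque
from typing import Dict, List


def downstream(child_map: Dict[str, List[str]], start: str) -> List[str]:
    """Breadth-first walk returning every node reachable from start."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in child_map.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hand-written fragment shaped like the "child_map" section of manifest.json.
child_map = {
    "source.shop.shopify.orders": ["model.shop.stg_orders"],
    "model.shop.stg_orders": ["model.shop.fct_orders"],
    "model.shop.fct_orders": ["exposure.shop.order_analytics_dashboard"],
    "exposure.shop.order_analytics_dashboard": [],
}

impacted = downstream(child_map, "model.shop.stg_orders")
print(impacted)
# ['model.shop.fct_orders', 'exposure.shop.order_analytics_dashboard']
```

Running the same function over `parent_map` instead walks upstream, which is the direction needed for root cause analysis.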
How is Documentation generated?
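Running dbt docs generate writes the documentation artifacts and a static site (index.html) to the project's target/ directory; dbt docs serve then hosts that site locally. The site is built from:
- Catalog: catalog.json, schema information (column names and types) read from the warehouse
- Manifest: manifest.json, complete project metadata including descriptions, tags, and dependencies
- Lineage Graph: Interactive visualization of the DAG
- Search: Full-text search across documentation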
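The docs site shows warehouse column types next to YAML descriptions, which means the two artifacts have to be joined per node. A rough sketch of that join is below, using hand-written fragments shaped like a node entry in `manifest.json` and `catalog.json`; `describe_columns` is a hypothetical helper, and the case-insensitive match is an assumption to cope with warehouses that upper-case identifiers.

```python
from typing import Dict, List


def describe_columns(manifest_node: Dict, catalog_node: Dict) -> List[Dict[str, str]]:
    """Pair each warehouse column (catalog) with its YAML description (manifest)."""
    # Warehouses may report column names in a different case than the YAML.
    documented = {
        name.lower(): col.get("description", "")
        for name, col in manifest_node.get("columns", {}).items()
    }
    rows = []
    for name, col in catalog_node.get("columns", {}).items():
        rows.append(
            {
                "name": name,
                "type": col.get("type", ""),
                "description": documented.get(name.lower(), ""),
            }
        )
    return rows


manifest_node = {
    "columns": {"order_id": {"description": "Unique identifier for each order"}}
}
catalog_node = {
    "columns": {
        "ORDER_ID": {"type": "NUMBER", "index": 1},
        "ORDER_DATE": {"type": "DATE", "index": 2},
    }
}

rows = describe_columns(manifest_node, catalog_node)
for row in rows:
    print(row)
```

Columns that exist in the warehouse but have no YAML entry come back with an empty description, which makes documentation gaps easy to spot.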
What metadata does dbt manage?
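- Schema information: Column names, types, descriptions
- Test results: Pass/fail status for the tests in the most recent invocation (run_results.json)
- Run history: Execution times and status per invocation; dbt Core overwrites run_results.json on each run, so keeping a history means archiving the artifact
- Freshness: Data freshness metrics written by dbt source freshness (sources.json)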
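Test status and execution times for an invocation are recorded in `run_results.json` under a `results` list. The sketch below summarizes such a payload; the sample entries are hand-written, and the project name `shop` is an assumption.

```python
from collections import Counter
from typing import Dict, Optional


def summarize_results(run_results: Dict) -> Dict[str, int]:
    """Count results by status for one dbt invocation."""
    return dict(Counter(r["status"] for r in run_results.get("results", [])))


def slowest_node(run_results: Dict) -> Optional[str]:
    """Return the unique_id with the largest execution_time, if any."""
    results = run_results.get("results", [])
    if not results:
        return None
    return max(results, key=lambda r: r.get("execution_time", 0.0))["unique_id"]


# Hand-written fragment shaped like run_results.json.
run_results = {
    "results": [
        {"unique_id": "model.shop.fct_orders", "status": "success", "execution_time": 4.2},
        {"unique_id": "test.shop.unique_fct_orders_order_id", "status": "pass", "execution_time": 0.8},
        {"unique_id": "test.shop.not_null_fct_orders_customer_id", "status": "fail", "execution_time": 0.6},
    ]
}

summary = summarize_results(run_results)
print(summary)  # {'success': 1, 'pass': 1, 'fail': 1}
print(slowest_node(run_results))  # model.shop.fct_orders
```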
Key Takeaway: dbt keeps descriptions, lineage, and the generated documentation site in one place alongside the code, making data transformations transparent and maintainable.
Code Examples
Model Documentation YAML
# models/marts/fct_orders.yml
version: 2
models:
  - name: fct_orders
    description: >
      Fact table containing all order transactions. This is the central
      fact table for the order analytics domain. It contains one row
      per order with aggregated metrics and dimension attributes.
    config:
      tags: ['finance', 'core']
      meta:
        owner: data-engineering
        team: analytics
        cost_center: marketing
    columns:
      - name: order_id
        description: "Unique identifier for each order"
        data_tests:
          - unique
          - not_null
        meta:
          system: shopify
          pii: false
      - name: customer_id
        description: "Foreign key to dim_customers"
        data_tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
        meta:
          pii: false
      - name: order_date
        description: "Date when the order was placed"
        data_tests:
          - not_null
        meta:
          format: YYYY-MM-DD
Source Documentation
# models/staging/_sources.yml
version: 2
sources:
  - name: shopify
    description: "Raw data from Shopify e-commerce platform"
    database: raw
    schema: shopify
    loader: fivetran
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    meta:
      owner: data-engineering
      team: ecommerce
      cost_center: platform
    tables:
      - name: orders
        description: "All orders from Shopify"
        meta:
          incremental: true
          partition_column: created_at
        columns:
          - name: id
            description: "Order ID from Shopify"
            data_tests:
              - unique
              - not_null
          - name: customer_id
            description: "Customer ID from Shopify"
            data_tests:
              - not_null
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id
Exposure Documentation
# models/exposures/order_dashboard.yml
# (exposure files are read from the project's model paths, models/ by default)
version: 2
exposures:
  - name: order_analytics_dashboard
    type: dashboard
    description: "Main dashboard for order analytics"
    url: https://looker.example.com/dashboards/order_analytics
    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')
      - ref('fct_order_items')
    meta:
      owner: analytics-team
      team: business-intelligence
      refresh_frequency: hourly
    owner:
      name: Data Engineering
      email: data-eng@company.com
Custom Metadata Tags
# models/marts/dim_customers.yml
version: 2
models:
  - name: dim_customers
    description: "Customer dimension table"
    config:
      tags: ['customer', 'dimension', 'pii']
      meta:
        data_classification: confidential
        retention_days: 730
        backup_policy: daily
        compliance:
          - gdpr
          - ccpa
    columns:
      - name: customer_id
        description: "Unique customer identifier"
        meta:
          pii: false
          system: primary_key
      - name: email
        description: "Customer email address"
        meta:
          pii: true
          encryption: aes-256
          masking_policy: hash
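Custom `meta` keys are carried into `manifest.json`, so governance checks can be scripted against them. As a sketch, the function below lists columns whose `meta` sets `pii: true`; the manifest fragment is hand-written, the project name `shop` is an assumption, and `pii` is this document's own convention rather than a key dbt interprets.

```python
from typing import Dict, List, Tuple


def pii_columns(manifest: Dict) -> List[Tuple[str, str]]:
    """List (model name, column name) pairs whose column meta sets pii to true."""
    found = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        for col_name, col in node.get("columns", {}).items():
            if col.get("meta", {}).get("pii") is True:
                found.append((node["name"], col_name))
    return sorted(found)


# Hand-written fragment shaped like the "nodes" section of manifest.json.
manifest = {
    "nodes": {
        "model.shop.dim_customers": {
            "resource_type": "model",
            "name": "dim_customers",
            "columns": {
                "customer_id": {"meta": {"pii": False, "system": "primary_key"}},
                "email": {"meta": {"pii": True, "masking_policy": "hash"}},
            },
        }
    }
}

print(pii_columns(manifest))  # [('dim_customers', 'email')]
```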
Lineage Query
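-- Find all downstream dependencies of fct_orders.
-- Assumes a project-defined model named dbt_model_dependencies with one row per
-- (source_model, target_model) edge; dbt does not create this table itself.
-- The "recursive" keyword is required by some warehouses (e.g. PostgreSQL);
-- adjust it for your adapter.
with recursive model_dependencies as (
    select
        source_model,
        target_model
    from {{ ref('dbt_model_dependencies') }}
),
recursive_lineage as (
    -- Anchor: direct children of fct_orders
    select
        target_model as model_name,
        1 as level,
        cast(target_model as varchar(1000)) as path
    from model_dependencies
    where source_model = 'fct_orders'
    union all
    -- Recursive step: children of models already found, capped at 10 levels
    select
        md.target_model,
        rl.level + 1,
        cast(rl.path || ' -> ' || md.target_model as varchar(1000))
    from model_dependencies md
    inner join recursive_lineage rl on md.source_model = rl.model_name
    where rl.level < 10
)
select distinct
    model_name,
    level,
    path
from recursive_lineage
order by level, model_name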
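The query reads from `dbt_model_dependencies`, which is not something dbt creates. One possible way to populate it, sketched here under that assumption, is to flatten the `child_map` section of `manifest.json` into (source_model, target_model) rows and load them into the warehouse, for example as a seed. The `rpt_daily_revenue` model and the `shop` project name are hypothetical.

```python
from typing import Dict, List, Tuple


def model_edges(child_map: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Flatten child_map into (source_model, target_model) rows, models only."""
    edges = []
    for parent, children in child_map.items():
        if not parent.startswith("model."):
            continue
        for child in children:
            if child.startswith("model."):
                # unique_id has the form model.<project>.<name>; keep the name part.
                edges.append((parent.split(".", 2)[2], child.split(".", 2)[2]))
    return sorted(edges)


# Hand-written fragment shaped like the "child_map" section of manifest.json.
child_map = {
    "source.shop.shopify.orders": ["model.shop.stg_orders"],
    "model.shop.stg_orders": ["model.shop.fct_orders"],
    "model.shop.fct_orders": [
        "model.shop.rpt_daily_revenue",
        "exposure.shop.order_analytics_dashboard",
        "test.shop.unique_fct_orders_order_id",
    ],
}

edges = model_edges(child_map)
print(edges)
# [('fct_orders', 'rpt_daily_revenue'), ('stg_orders', 'fct_orders')]
```

Sources, tests, and exposures are filtered out here so the table matches the model-to-model shape the query expects; keeping them would only require relaxing the prefix checks.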
Performance Metrics
| Metric | Description | Illustrative Range (varies by project) |
|---|---|---|
| Doc Generation Time | Time to build docs site | 10-30 seconds |
| Search Index Size | Size of search index | 1-5 MB |
| Lineage Graph Size | Nodes in lineage graph | 100-1000+ |
| Documentation Coverage | % of models documented | 80-100% |
| Test Coverage | % of columns tested | 70-90% |
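Documentation coverage can be measured directly from `manifest.json` rather than estimated. The sketch below counts models with a non-empty description; the manifest fragment is hand-written, the project name `shop` is an assumption, and treating "documented" as "has a model-level description" is one possible definition among several.

```python
from typing import Dict


def documentation_coverage(manifest: Dict) -> float:
    """Percentage of models that have a non-empty description."""
    models = [
        node
        for node in manifest.get("nodes", {}).values()
        if node.get("resource_type") == "model"
    ]
    if not models:
        return 0.0
    documented = sum(1 for m in models if (m.get("description") or "").strip())
    return 100.0 * documented / len(models)


# Hand-written fragment: 2 of the 4 models carry a description.
manifest = {
    "nodes": {
        "model.shop.fct_orders": {"resource_type": "model", "description": "Fact table containing all order transactions."},
        "model.shop.dim_customers": {"resource_type": "model", "description": "Customer dimension table"},
        "model.shop.stg_orders": {"resource_type": "model", "description": ""},
        "model.shop.stg_customers": {"resource_type": "model", "description": ""},
        "test.shop.unique_fct_orders_order_id": {"resource_type": "test", "description": ""},
    }
}

coverage = documentation_coverage(manifest)
print(f"{coverage:.1f}%")  # 50.0%
```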
Best Practices
- Document everything - Add descriptions to all models and columns
- Use consistent naming - Follow a naming convention for tags and metadata
- Track lineage - Use ref() and source() instead of hard-coded table names so every dependency appears in the graph
- Monitor freshness - Configure source freshness checks
- Tag appropriately - Use tags for filtering and organization
- Define exposures - Document how data is consumed downstream
- Review regularly - Keep documentation up to date
- Use metadata - Add custom metadata for governance and compliance
See Also
- dbt Testing Framework - Schema tests, data tests, and custom validations
- dbt Core Architecture - Manifest, DAG, and compilation pipeline
- Exposures and Semantic Layer - Metrics, dimensions, and downstream consumption
- Data Engineering Fundamentals - Modern data stack overview
- Snowflake Data Governance - Access control and data governance in Snowflake