CI/CD for Data: Safe, Automated Deployments
CI/CD for data pipelines automates testing, validation, and deployment of data transformations, infrastructure, and models.
Why CI/CD for Data Matters
Problems Without CI/CD:
- Manual testing and deployment
- Production failures
- Data quality incidents
- Extended downtime
CI/CD Benefits:
- Automated testing β catch issues in minutes instead of hours
- Validation before production β prevent data quality incidents
- Rollback procedures β quick recovery from failures
- Monitoring β post-deploy validation
Key Insight: Automated CI/CD catches issues in minutes instead of hours.
Architecture Overview
GitHub Actions CI/CD Pipeline
# .github/workflows/data-pipeline-ci.yml
name: Data Pipeline CI
on:
pull_request:
branches: [main]
paths:
- 'models/**'
- 'tests/**'
- 'macros/**'
- 'dbt_project.yml'
env:
DBT_PROFILES_DIR: ./
DBT_TARGET: ci
jobs:
lint-and-validate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install dbt-core dbt-snowflake sqlfluff
- name: Lint SQL files
run: |
sqlfluff lint models/ --dialect snowflake
- name: Validate dbt project
run: |
dbt deps
dbt parse
dbt compile
test-changed-models:
needs: lint-and-validate
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # Full history for state selection
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install dbt-core dbt-snowflake dbt-utils dbt-expectations
- name: Configure dbt profile
run: |
mkdir -p ~/.dbt
echo "${{ secrets.DBT_PROFILE }}" > ~/.dbt/profiles.yml
- name: dbt deps
run: dbt deps
- name: Download main branch artifacts
uses: dawidd6/action-download-artifact@v3
with:
name: dbt-manifest
branch: main
path: ./artifacts
continue-on-error: true
- name: Run tests on changed models (PR)
if: github.event_name == 'pull_request'
run: |
dbt build --select state:modified+ \
--defer --state ./artifacts \
--target ci \
--fail-fast
- name: Generate test report
if: always()
run: |
dbt build --target ci --store-failures
cat target/run_results.json | jq '.results[] | select(.status == "error")'
- name: Comment on PR with test results
if: always()
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const results = JSON.parse(fs.readFileSync('target/run_results.json'));
const errors = results.results.filter(r => r.status === 'error');
const passed = results.results.filter(r => r.status === 'success');
let body = `## dbt Test Results\n\n`;
body += `β
Passed: ${passed.length}\n`;
body += `β Failed: ${errors.length}\n\n`;
if (errors.length > 0) {
body += `### Failed Tests\n`;
errors.forEach(e => {
body += `- ${e.unique_id}: ${e.message}\n`;
});
}
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
infrastructure-plan:
needs: lint-and-validate
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
with:
terraform_version: "1.6.0"
- name: Terraform Init
run: terraform init
working-directory: ./terraform
- name: Terraform Validate
run: terraform validate
working-directory: ./terraform
- name: Terraform Plan
run: terraform plan -out=tfplan
working-directory: ./terraform
env:
TF_VAR_snowflake_account: ${{ secrets.SNOWFLAKE_ACCOUNT }}
TF_VAR_snowflake_username: ${{ secrets.SNOWFLAKE_USER }}
TF_VAR_snowflake_password: ${{ secrets.SNOWFLAKE_PASSWORD }}
- name: Comment PR with Terraform Plan
uses: actions/github-script@v7
with:
script: |
const { execSync } = require('child_process');
const plan = execSync('cd terraform && terraform show -no-color tfplan').toString();
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: `## Terraform Plan\n\`\`\`\n${plan.substring(0, 60000)}\n\`\`\``
});
CD Pipeline: Production Deployment
# .github/workflows/data-pipeline-cd.yml
name: Data Pipeline CD
on:
push:
branches: [main]
paths:
- 'models/**'
- 'tests/**'
- 'terraform/**'
jobs:
deploy-staging:
runs-on: ubuntu-latest
environment: staging
steps:
- uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install dbt-core dbt-snowflake
- name: Configure dbt profile
run: |
mkdir -p ~/.dbt
echo "${{ secrets.DBT_STAGING_PROFILE }}" > ~/.dbt/profiles.yml
- name: dbt deps
run: dbt deps
- name: Deploy to staging
run: |
dbt build --target staging --full-refresh
dbt test --target staging
- name: Run staging validation
run: |
dbt build --target staging --select tag:critical
- name: Notify staging deployment
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "Staging deployment complete: ${{ github.sha }}"
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
deploy-production:
needs: deploy-staging
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install dbt-core dbt-snowflake
- name: Configure dbt profile
run: |
mkdir -p ~/.dbt
echo "${{ secrets.DBT_PROD_PROFILE }}" > ~/.dbt/profiles.yml
- name: dbt deps
run: dbt deps
- name: Create pre-deployment snapshot
run: |
dbt run-operation create_pre_deploy_snapshot \
--args '{target_schema: "pre_deploy_snapshots"}'
- name: Deploy to production
run: |
dbt build --target production
- name: Run production tests
run: |
dbt test --target production
- name: Post-deployment validation
run: |
dbt build --target production --select tag:critical
- name: Update documentation
run: |
dbt docs generate --target production
- name: Notify production deployment
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "Production deployment complete: ${{ github.sha }}"
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
rollback:
needs: deploy-production
if: failure()
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install dbt-core dbt-snowflake
- name: Configure dbt profile
run: |
mkdir -p ~/.dbt
echo "${{ secrets.DBT_PROD_PROFILE }}" > ~/.dbt/profiles.yml
- name: Execute rollback
run: |
dbt run-operation execute_rollback \
--args '{snapshot_name: "pre_deploy_snapshots"}'
- name: Verify rollback
run: |
dbt build --target production --select tag:critical
- name: Notify rollback
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "ROLLBACK executed for production: ${{ github.sha }}"
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
Rollback Procedures
Rollback reverts a data pipeline deployment to the previous known-good state. Data rollbacks are more complex than code rollbacks because they involve restoring data state, not just code.
-- dbt Macro: Pre-deployment snapshot for rollback
{% macro create_pre_deploy_snapshot(target_schema) %}
{% set snapshot_tables = dbt_utils.get_filtered_columns(
from=ref('fact_orders'),
except=['_dbt_valid_from', '_dbt_valid_to', '_dbt_scd_id']
) %}
CREATE SCHEMA IF NOT EXISTS {{ target_schema }};
CREATE TABLE {{ target_schema }}.fact_orders_snapshot AS
SELECT * FROM {{ ref('fact_orders') }};
CREATE TABLE {{ target_schema }}.dim_customers_snapshot AS
SELECT * FROM {{ ref('dim_customers') }};
-- Record deployment metadata
INSERT INTO {{ target_schema }}.deployment_log
VALUES (
CURRENT_TIMESTAMP(),
'{{ target_schema }}',
'{{ var("git_sha", "unknown") }}',
'pre_deployment'
);
{% endmacro %}
-- dbt Macro: Execute rollback
{% macro execute_rollback(snapshot_name) %}
{% set target_database = target.database %}
{% set snapshot_schema = snapshot_name %}
-- Restore fact_orders from snapshot
CREATE OR REPLACE TABLE {{ target_database }}.marts.fact_orders AS
SELECT * FROM {{ target_database }}.{{ snapshot_schema }}.fact_orders_snapshot;
-- Restore dim_customers from snapshot
CREATE OR REPLACE TABLE {{ target_database }}.marts.dim_customers AS
SELECT * FROM {{ target_database }}.{{ snapshot_schema }}.dim_customers_snapshot;
-- Log rollback
INSERT INTO {{ target_database }}.{{ snapshot_schema }}.deployment_log
VALUES (
CURRENT_TIMESTAMP(),
'rollback',
'{{ var("git_sha", "unknown") }}',
'post_rollback'
);
{% endmacro %}
Key Concepts Summary
| Component | Description | Tool | Stage |
|---|---|---|---|
| Linting | SQL/code style checking | sqlfluff, flake8 | CI |
| Unit Testing | Model logic testing | dbt tests, pytest | CI |
| Integration Testing | End-to-end pipeline testing | dbt build | CI |
| Infrastructure Validation | IaC plan and validate | terraform plan | CI |
| Staging Deployment | Deploy to non-production | dbt build --target staging | CD |
| Production Deployment | Deploy to production | dbt build --target production | CD |
| Rollback | Revert to previous state | Snapshot restore | CD |
| Monitoring | Post-deployment monitoring | Custom alerts | Ops |
| Documentation | Auto-generate docs | dbt docs generate | CD |
| Notification | Deploy status alerts | Slack, Email | CD |
Performance Metrics
| Metric | Manual Deployment | CI/CD | Improvement |
|---|---|---|---|
| Deployment Frequency | Weekly | Daily-Hourly | 5-20x |
| Lead Time | Days | Hours | 5-10x |
| Failure Rate | 30-50% | 5-10% | -80% |
| Mean Time to Recovery | Hours | Minutes | 10-20x |
| Change Failure Rate | High | Low | -70% |
| Rollback Time | Hours | Minutes | 10-20x |
| Test Coverage | 20-40% | 80-100% | +50-80% |
| Deployment Confidence | Low | High | +50-80% |
10 Best Practices
- Test every PR β run dbt build and tests before merging to main
- Use state-based selection in CI β only test changed models and dependents
- Deploy to staging first β always validate in staging before production
- Implement automatic rollback β if production tests fail, execute rollback immediately
- Use pre-deployment snapshots β enable data rollback without backup restoration
- Tag critical models β prioritize testing and monitoring for business-critical data
- Notify on deployment status β Slack/email alerts for success and failure
- Version control infrastructure β Terraform changes go through the same CI/CD pipeline
- Implement canary deployments β deploy to a subset of tables first
- Monitor post-deployment β alert on data quality anomalies after production deploys
- CI/CD automates testing, validation, and deployment of data pipelines
- State-based selection in CI reduces test time by 80-95%
- Pre-deployment snapshots enable data rollback without backup restoration
- Staging -> production deployment with automatic rollback protects production data
- CI/CD increases deployment frequency while reducing failure rates
See Also
- dbt Advanced β State-based selection and defer in CI
- Infrastructure as Code β Terraform plan/apply in CI
- Data Contracts β Contract validation in CI/CD
- Data Governance & Catalog β Post-deployment catalog updates
- Performance Optimization β CI performance benchmarking
- Project 3: Warehouse Build β Complete CI/CD pipeline example