Version Control with Git for Data Engineers

Why Git Matters for Data Engineers

Version control is fundamental for managing data pipeline code, configuration files, SQL transformations, and infrastructure-as-code. Git enables collaboration, rollback, and audit trails.

Figure: Git branching strategy with the branches main, develop, feature/add-pipeline, and hotfix/fix-bug.

+-------------------------------------------------------------+
|                  GIT USE CASES IN DATA ENG                  |
+-------------------------------------------------------------+
|  Pipeline Code        |  Python scripts, DAGs               |
|  SQL Transformations  |  dbt models, views, functions       |
|  Infrastructure       |  Terraform, CloudFormation          |
|  Configuration        |  YAML, JSON, env files              |
|  Documentation        |  README, runbooks, data catalogs    |
|  Tests                |  Unit tests, integration tests      |
|  CI/CD                |  GitHub Actions, GitLab CI          |
+-------------------------------------------------------------+

Theory: How Git Works Internally

Git stores data as a directed acyclic graph (DAG) of commits. Each commit points to its parent(s) and contains a snapshot of all tracked files.

  • Blob: Stores file content (compressed).
  • Tree: Directory structure mapping file names to blobs.
  • Commit: Points to a tree, author, message, and parent commit(s).
  • HEAD: A pointer to the current branch (which points to a commit).
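
As a minimal sketch of how a blob gets its ID: Git hashes a short header (`blob <size>` followed by a NUL byte) plus the raw file bytes. The function name `git_blob_sha1` is illustrative, and the sketch assumes a repository using the default SHA-1 object format rather than SHA-256.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a header ("blob <size>\0") followed by the raw bytes.
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Should match the output of: printf 'hello\n' | git hash-object --stdin
    print(git_blob_sha1(b"hello\n"))
```

Because the ID depends only on the content, two files with identical bytes share one blob, which is part of why snapshots stay cheap.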

Git Object Model

Git Basics

Figure: Git workflow from clone to merge. git clone (get repository) → git branch (create feature) → git checkout (switch branch) → git commit (save changes) → git push (upload changes) → git merge (combine branches). Full workflow: clone → branch → checkout → code → commit → push → PR → merge.

Essential Commands

# Initialize repository
git init

# Clone repository
git clone https://github.com/org/data-pipeline.git

# Check status
git status

# Add files to staging
git add file.py                    # Add specific file
git add .                          # Add all changes
git add -p                         # Interactive staging (patch mode)

# Commit
git commit -m "feat: add order processing pipeline"
git commit --amend                 # Amend last commit

# View history
git log --oneline -10              # Last 10 commits
git log --graph --oneline          # Visual branch history
git log --stat                     # Show changed files

# Diff
git diff                           # Unstaged changes
git diff --staged                  # Staged changes
git diff main..feature-branch      # Compare branches

# Remote operations
git remote add origin https://github.com/org/repo.git
git push origin main
git pull origin main
git fetch origin

Undoing Changes

# Discard unstaged changes
git checkout -- file.py
git restore file.py                # Git 2.23+

# Unstage a file
git reset HEAD file.py
git restore --staged file.py       # Git 2.23+

# Amend last commit (before push)
git commit --amend -m "new message"

# Revert a commit (creates new commit)
git revert abc123

# Reset to a specific commit (rewrites branch history; avoid on pushed commits)
git reset --hard abc123            # DANGEROUS: discards later commits and uncommitted changes
git reset --soft HEAD~1            # Undo last commit, keep its changes staged

Undo Operations Reference

| Command                     | Effect                       | Safety                   |
|-----------------------------|------------------------------|--------------------------|
| git restore <file>          | Discard working dir changes  | Safe                     |
| git restore --staged <file> | Unstage a file               | Safe                     |
| git commit --amend          | Modify last commit           | Safe (before push)       |
| git revert <sha>            | Create undo commit           | Safe (preserves history) |
| git reset --soft HEAD~1     | Undo commit, keep changes    | Safe                     |
| git reset --mixed HEAD~1    | Undo commit, unstage changes | Safe                     |
| git reset --hard HEAD~1     | Discard everything           | DANGEROUS                |

Branching Strategies

Git Flow

Git Flow Branching Strategy:

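| Branch    | Purpose               | Lifetime  |
|-----------|-----------------------|-----------|
| main      | Production-ready code | Permanent |
| develop   | Integration branch    | Permanent |
| feature/* | New features          | Temporary |
| release/* | Release preparation   | Temporary |
| hotfix/*  | Critical fixes        | Temporary |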

Git Flow Workflow:

  1. Create develop from main
  2. Create feature/* branches from develop
  3. Merge features back to develop
  4. Create release/* from develop when ready
  5. Merge release to main and tag
  6. Create hotfix/* from main for critical fixes

# Create feature branch from develop
git checkout -b feature/order-pipeline develop

# Work on feature
git add .
git commit -m "feat: implement order extraction"

# Push to remote
git push -u origin feature/order-pipeline

# Create pull request targeting develop (GitHub CLI)
gh pr create --base develop --title "Add order pipeline" --body "Implements..."

# Merge to develop
git checkout develop
git merge feature/order-pipeline
git push origin develop

# Delete branch
git branch -d feature/order-pipeline
git push origin --delete feature/order-pipeline

Trunk-Based Development

Trunk-Based Development Workflow:

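| Step | Action                    | Description                  |
|------|---------------------------|------------------------------|
| 1    | Create short-lived branch | Branch from main for feature |
| 2    | Commit frequently         | Small, atomic commits        |
| 3    | Open PR quickly           | Keep PRs small and focused   |
| 4    | Review and merge          | Fast review cycles           |
| 5    | Delete branch             | Clean up after merge         |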

Trunk-Based Best Practices:

  • Keep branches short-lived (hours to days)
  • Use feature flags for incomplete features
  • Automate testing and deployment
  • Merge to main at least daily
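
One way to read the feature-flag practice above, as a minimal sketch: incomplete logic is merged to main but stays inactive until a flag is switched on. The `FEATURE_` environment-variable prefix, the function names, and the segment rule are all hypothetical choices for illustration.

```python
import os


def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a boolean feature flag from an environment variable (hypothetical naming)."""
    raw = os.environ.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def transform_orders(rows: list[dict]) -> list[dict]:
    """Apply the stable transform; the new segment logic runs only when flagged on."""
    result = [dict(row) for row in rows]  # copy so callers' rows are not mutated
    if flag_enabled("customer_segment"):
        for row in result:
            # Unfinished feature: merged to main but inactive by default.
            row["customer_segment"] = "high" if row["amount"] >= 100 else "standard"
    return result
```

With the flag unset, the function behaves exactly as before the merge, so the branch can be integrated daily without exposing half-finished behavior.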

Branching Strategy Comparison

| Factor            | Git Flow                         | Trunk-Based                       |
|-------------------|----------------------------------|-----------------------------------|
| Complexity        | High (5 branch types)            | Low (main + short-lived branches) |
| Release cadence   | Scheduled releases               | Continuous deployment             |
| Best for          | Projects with versioned releases | CI/CD-heavy environments          |
| Merge conflicts   | More (long-lived branches)       | Fewer (short-lived branches)      |
| Data pipeline fit | Good for batch pipelines         | Good for streaming/real-time      |

Commit Message Conventions

Conventional Commits

<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

Types

| Type     | Description        | Example                                      |
|----------|--------------------|----------------------------------------------|
| feat     | New feature        | feat(pipeline): add order extraction         |
| fix      | Bug fix            | fix: correct date parsing in transform       |
| docs     | Documentation      | docs: update README with setup instructions  |
| style    | Formatting         | style: apply black formatting                |
| refactor | Code restructuring | refactor: extract common utilities           |
| test     | Adding tests       | test: add unit tests for transform           |
| chore    | Maintenance        | chore: update dependencies                   |
| perf     | Performance        | perf: optimize bulk insert                   |
| ci       | CI/CD              | ci: add GitHub Actions workflow              |
| build    | Build system       | build: add Dockerfile                        |

Examples

# Simple commit
git commit -m "feat: add daily sales aggregation pipeline"

# With scope
git commit -m "feat(orders): implement CDC ingestion from PostgreSQL"

# With body
git commit -m "fix(transform): handle null values in amount column

Previously, null amounts would cause the aggregation to fail.
This fix adds a coalesce to default null amounts to 0.

Closes #123"

# Breaking change
git commit -m "feat!: change order schema to include customer segment

BREAKING CHANGE: order table now requires customer_segment column"
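
The header format can also be checked mechanically. The following is a rough sketch, not an official validator: the type list is taken from the Types table in this lesson, while the regular expression and the name `parse_header` are assumptions made for illustration.

```python
import re
from typing import Optional

# Types from the table in this lesson; the pattern is an illustrative sketch.
TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore", "perf", "ci", "build")
HEADER = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"   # optional scope in parentheses
    r"(?P<breaking>!)?"               # optional breaking-change marker
    r": (?P<description>\S.*)$"       # colon, space, non-empty description
)


def parse_header(message: str) -> Optional[dict]:
    """Parse the first line of a commit message; return None if it does not match."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    match = HEADER.match(first_line)
    return match.groupdict() if match else None
```

A check like this could run in a commit-msg hook or a CI step so that malformed headers are rejected before they reach main.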

Pull Request Best Practices

PR Template

## Description
Brief description of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] Manual testing completed

## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings

Review Checklist for Data Engineers

+-------------------------------------------------------------+
|              CODE REVIEW CHECKLIST                          |
+-------------------------------------------------------------+
|  Code follows project style guidelines                      |
|  SQL queries are optimized and use appropriate indexes      |
|  Error handling is comprehensive                            |
|  Logging is appropriate and informative                     |
|  Data quality checks are included                           |
|  Tests cover edge cases                                     |
|  Documentation is updated                                   |
|  No hardcoded credentials or secrets                        |
|  Backward compatibility maintained                          |
|  Performance implications considered                        |
|  Rollback strategy documented                               |
+-------------------------------------------------------------+

CI/CD Integration

GitHub Actions Workflow

# .github/workflows/data-pipeline.yml
name: Data Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install black flake8 mypy
      
      - name: Run black
        run: black --check .
      
      - name: Run flake8
        run: flake8 .
      
      - name: Run mypy
        run: mypy .

  test:
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov

      - name: Run tests
        run: pytest tests/ -v --cov=src --cov-report=xml
      
      - name: Upload coverage
        uses: codecov/codecov-action@v3

  deploy:
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      
      - name: Deploy to S3
        run: |
          aws s3 sync src/ s3://my-bucket/pipeline/
      
      - name: Notify deployment
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-type: application/json' \
            -d '{"text": "Pipeline deployed to production"}'

CI/CD Pipeline Flow for Data Engineering

Git Hooks

Pre-commit Hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key
  
  - repo: https://github.com/psf/black
    rev: 24.1.0
    hooks:
      - id: black
  
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0
    hooks:
      - id: flake8
  
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.8.0
    hooks:
      - id: mypy

Managing Pipeline Code

Project Structure

data-pipeline/
+-- .github/
|   +-- workflows/
|       +-- ci.yml
|       +-- deploy.yml
+-- src/
|   +-- extractors/
|   |   +-- __init__.py
|   |   +-- postgres.py
|   |   +-- api.py
|   +-- transformers/
|   |   +-- __init__.py
|   |   +-- orders.py
|   |   +-- customers.py
|   +-- loaders/
|   |   +-- __init__.py
|   |   +-- warehouse.py
|   +-- utils/
|       +-- __init__.py
|       +-- config.py
+-- tests/
|   +-- unit/
|   +-- integration/
+-- sql/
|   +-- migrations/
|   +-- models/
+-- configs/
|   +-- dev.yaml
|   +-- staging.yaml
|   +-- prod.yaml
+-- docker/
|   +-- Dockerfile
+-- docs/
|   +-- architecture.md
+-- README.md
+-- requirements.txt
+-- pyproject.toml
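
To illustrate how code in `src/utils/config.py` might pick one of the files under `configs/`, here is a small sketch. The function name `config_path` and the idea of passing the environment name explicitly are assumptions; the lesson does not specify how the configuration is loaded.

```python
from pathlib import Path

# Environment names mirror the configs/ directory in the tree above.
ENVIRONMENTS = ("dev", "staging", "prod")


def config_path(env: str, root: Path = Path("configs")) -> Path:
    """Return the YAML config path for an environment, rejecting unknown names."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment {env!r}; expected one of {ENVIRONMENTS}")
    return root / f"{env}.yaml"
```

Failing fast on an unknown name keeps a typo such as "prd" from silently falling back to another environment's settings.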

dbt Project Structure

dbt_project/
+-- models/
|   +-- staging/
|   |   +-- stg_orders.sql
|   |   +-- stg_customers.sql
|   |   +-- stg_products.sql
|   +-- intermediate/
|   |   +-- int_orders_enriched.sql
|   |   +-- int_customer_metrics.sql
|   +-- marts/
|   |   +-- fct_orders.sql
|   |   +-- dim_customers.sql
|   |   +-- dim_products.sql
|   +-- sources/
|       +-- source.yml
+-- tests/
|   +-- assert_positive_amount.sql
|   +-- test_unique_orders.sql
+-- macros/
|   +-- generate_schema_name.sql
+-- snapshots/
|   +-- scd_customers.sql
+-- seeds/
|   +-- country_codes.csv
+-- dbt_project.yml
+-- profiles.yml

Monorepo vs Polyrepo

Comparison

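| Factor                | Monorepo               | Polyrepo                     |
|-----------------------|------------------------|------------------------------|
| Dependency management | Simple                 | Complex                      |
| Atomic changes        | Yes                    | No                           |
| Access control        | Coarse                 | Fine-grained                 |
| CI/CD complexity      | Higher                 | Lower                        |
| Code discoverability  | Easy                   | Harder                       |
| Team scaling          | Better for large teams | Better for distributed teams |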

Decision Framework

Best Practices for Data Engineering Repos

| Practice                  | Rationale                                                     |
|---------------------------|---------------------------------------------------------------|
| Separate config from code | Different env configs (dev/staging/prod) without code changes |
| Version SQL migrations    | Database schema changes tracked in Git                        |
| Use .gitignore for data   | Never commit large CSV/Parquet files                          |
| Tag releases              | git tag v1.2.0 for rollback and auditing                      |
| Protect main branch       | Require PR reviews and CI checks before merge                 |
| Automate with CI/CD       | Lint, test, and deploy on every merge                         |
| Document architecture     | README with diagrams and data flow                            |
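
As a complement to .gitignore, the rule about data files can be enforced with a small check. This is a sketch under stated assumptions: the suffix list, the `seeds` exception (for small reference files such as dbt seeds), and the name `blocked_data_files` are hypothetical policy choices, not part of any tool.

```python
from pathlib import PurePosixPath

# Hypothetical policy: extensions treated as data rather than code.
DATA_SUFFIXES = {".csv", ".parquet"}
# Small reference files that are intentionally versioned (e.g. dbt seeds).
ALLOWED_DIRS = ("seeds",)


def blocked_data_files(paths: list[str]) -> list[str]:
    """Return the paths that look like data files outside the allowed directories."""
    blocked = []
    for raw in paths:
        path = PurePosixPath(raw)
        if path.suffix.lower() not in DATA_SUFFIXES:
            continue  # not a data file by extension
        if any(part in ALLOWED_DIRS for part in path.parts[:-1]):
            continue  # inside an explicitly allowed directory
        blocked.append(raw)
    return blocked
```

In a local hook, the input list could come from `git diff --cached --name-only`, with the commit rejected whenever the returned list is non-empty.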

Summary Takeaways

  1. Git is essential: master the core commands (add, commit, push, pull, branch, merge, rebase).
  2. Use conventional commits: clear, consistent commit messages (feat:, fix:, docs:) improve changelogs and traceability.
  3. Branch strategically: Git Flow for scheduled releases, trunk-based for continuous deployment.
  4. PRs enable review: use templates and checklists for quality; require at least one reviewer.
  5. CI/CD automates quality: lint, test, and deploy automatically on merge to main.
  6. Git hooks prevent issues: catch formatting errors, secrets, and type issues before they reach the repository.
  7. Structure matters: organize pipeline code with clear separation of extractors, transformers, and loaders.
  8. Choose repo strategy wisely: monorepo for tight coupling and shared code, polyrepo for independent team ownership.

Practice Exercises

  1. Git workflow: Set up a Git repository with Git Flow branching strategy. Create a feature branch, make changes, create a PR, and merge.

  2. CI/CD pipeline: Create a GitHub Actions workflow that runs linting, tests, and deploys on merge to main.

  3. Pre-commit hooks: Set up pre-commit hooks that run black, flake8, and check for secrets.

  4. Code review: Review a teammate's PR using the data engineering checklist. Document your findings.

  5. Repository structure: Design a project structure for a data pipeline that handles 3 different data sources and loads to a data warehouse.
