Why Git Matters for Data Engineers
Version control is fundamental for managing data pipeline code, configuration files, SQL transformations, and infrastructure-as-code. Git enables collaboration, rollback, and audit trails.
+-------------------------------------------------------------+
|                  GIT USE CASES IN DATA ENG                  |
+-------------------------------------------------------------+
| Pipeline Code       | Python scripts, DAGs                  |
| SQL Transformations | dbt models, views, functions          |
| Infrastructure      | Terraform, CloudFormation             |
| Configuration       | YAML, JSON, env files                 |
| Documentation       | README, runbooks, data catalogs       |
| Tests               | Unit tests, integration tests         |
| CI/CD               | GitHub Actions, GitLab CI             |
+-------------------------------------------------------------+
Theory: How Git Works Internally
Git stores data as a directed acyclic graph (DAG) of commits. Each commit points to its parent(s) and contains a snapshot of all tracked files.
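- Blob: Stores file content (compressed); the file name is not part of the blob.
- Tree: A directory listing that maps names to blobs (files) and to other trees (subdirectories).
- Commit: Points to a tree and to its parent commit(s), and records the author, committer, and message.
- HEAD: A pointer to the current branch (which points to a commit), or directly to a commit when HEAD is detached.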
Git Object Model
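To make the object model concrete, here is a minimal Python sketch of how a blob's ID is derived from its content. It assumes a repository using Git's default SHA-1 object format; the function names `git_blob_id` and `git_loose_object` are illustrative and not part of Git.

```python
import hashlib
import zlib


def _blob_payload(content: bytes) -> bytes:
    """Prefix the content with the header Git uses for blobs: 'blob <size>\\0'."""
    header = f"blob {len(content)}\0".encode("ascii")
    return header + content


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID for a blob holding `content`."""
    return hashlib.sha1(_blob_payload(content)).hexdigest()


def git_loose_object(content: bytes) -> bytes:
    """Return the zlib-compressed bytes of the blob, as stored for a loose object."""
    return zlib.compress(_blob_payload(content))
```

Because the ID depends only on the content, two files with identical bytes share one blob, whatever their names. For a file on disk, `git hash-object <file>` should report the same ID as `git_blob_id` applied to that file's bytes.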
Git Basics
Essential Commands
# Initialize repository
git init
# Clone repository
git clone https://github.com/org/data-pipeline.git
# Check status
git status
# Add files to staging
git add file.py # Add specific file
git add . # Add all changes
git add -p # Interactive staging (patch mode)
# Commit
git commit -m "feat: add order processing pipeline"
git commit --amend # Amend last commit
# View history
git log --oneline -10 # Last 10 commits
git log --graph --oneline # Visual branch history
git log --stat # Show changed files
# Diff
git diff # Unstaged changes
git diff --staged # Staged changes
git diff main..feature-branch # Compare branches
# Remote operations
git remote add origin https://github.com/org/repo.git
git push origin main
git pull origin main
git fetch origin
Undoing Changes
# Discard unstaged changes
git checkout -- file.py
git restore file.py # Git 2.23+
# Unstage a file
git reset HEAD file.py
git restore --staged file.py # Git 2.23+
# Amend last commit (before push)
git commit --amend -m "new message"
# Revert a commit (creates new commit)
git revert abc123
# Reset to specific commit (DANGEROUS)
git reset --hard abc123 # Discard all changes
git reset --soft HEAD~1 # Keep changes, undo commit
Undo Operations Reference
| Command | Effect | Safety |
|---|---|---|
| `git restore <file>` | Discard working directory changes | Destructive (uncommitted edits are lost) |
| `git restore --staged <file>` | Unstage a file | Safe |
| `git commit --amend` | Modify last commit | Safe (before push) |
| `git revert <sha>` | Create undo commit | Safe (preserves history) |
| `git reset --soft HEAD~1` | Undo commit, keep changes staged | Safe (before push) |
| `git reset --mixed HEAD~1` | Undo commit, unstage changes | Safe (before push) |
| `git reset --hard HEAD~1` | Undo commit and discard all changes | DANGEROUS |
Branching Strategies
Git Flow
Git Flow Branching Strategy:
| Branch | Purpose | Lifetime |
|---|---|---|
| `main` | Production-ready code | Permanent |
| `develop` | Integration branch | Permanent |
| `feature/*` | New features | Temporary |
| `release/*` | Release preparation | Temporary |
| `hotfix/*` | Critical fixes | Temporary |
Git Flow Workflow:
- Create `develop` from `main`
- Create `feature/*` branches from `develop`
- Merge features back to `develop`
- Create `release/*` from `develop` when ready
- Merge release to `main` and tag
- Create `hotfix/*` from `main` for critical fixes
# Create feature branch from develop
git checkout -b feature/order-pipeline develop
# Work on feature
git add .
git commit -m "feat: implement order extraction"
# Push to remote
git push -u origin feature/order-pipeline
# Create pull request against develop (GitHub CLI)
gh pr create --base develop --title "Add order pipeline" --body "Implements..."
# Merge to develop locally (alternative to merging the pull request)
git checkout develop
git merge feature/order-pipeline
git push origin develop
# Delete branch
git branch -d feature/order-pipeline
git push origin --delete feature/order-pipeline
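The branch table above implies a naming rule that can be checked mechanically, for example in a CI step. The following Python sketch maps a branch name to the branch it should be created from under Git Flow; the mapping follows the workflow list above, while the function name `base_branch` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Branch patterns and the branch each is created from, per the Git Flow workflow.
GIT_FLOW_BASES = {
    "feature/*": "develop",
    "release/*": "develop",
    "hotfix/*": "main",
}


def base_branch(branch: str) -> str:
    """Return the branch that `branch` should start from under Git Flow."""
    for pattern, base in GIT_FLOW_BASES.items():
        if fnmatchcase(branch, pattern):
            return base
    raise ValueError(f"{branch!r} does not follow the Git Flow naming scheme")
```

For instance, `base_branch("feature/order-pipeline")` returns `"develop"`, while a name such as `"my-branch"` raises `ValueError`.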
Trunk-Based Development
Trunk-Based Development Workflow:
| Step | Action | Description |
|---|---|---|
| 1 | Create short-lived branch | Branch from main for feature |
| 2 | Commit frequently | Small, atomic commits |
| 3 | Open PR quickly | Keep PRs small and focused |
| 4 | Review and merge | Fast review cycles |
| 5 | Delete branch | Clean up after merge |
Trunk-Based Best Practices:
- Keep branches short-lived (hours to days)
- Use feature flags for incomplete features
- Automate testing and deployment
- Merge to main at least daily
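A feature flag lets unfinished work merge to main without changing production behavior. Below is a minimal Python sketch, assuming flags are read from environment variables named `FEATURE_<NAME>`; that naming scheme, and the functions `flag_enabled` and `transform_orders`, are assumptions made for illustration.

```python
import os
from typing import Dict, Iterable, List, Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


def flag_enabled(name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if FEATURE_<NAME> is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get(f"FEATURE_{name.upper()}", "").strip().lower() in TRUTHY


def transform_orders(
    rows: Iterable[Dict[str, object]],
    env: Optional[Mapping[str, str]] = None,
) -> List[Dict[str, object]]:
    """Copy the rows; add the in-progress column only when the flag is on."""
    out = [dict(row) for row in rows]
    if flag_enabled("customer_segment", env):
        for row in out:
            # Placeholder value while the real segmentation logic is unfinished.
            row["customer_segment"] = "unknown"
    return out
```

With the flag unset the output matches the input, so the branch can merge early; setting `FEATURE_CUSTOMER_SEGMENT=true` in a development environment exercises the new path.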
Branching Strategy Comparison
| Factor | Git Flow | Trunk-Based |
|---|---|---|
| Complexity | High (5 branch types) | Low (main + short-lived branches) |
| Release cadence | Scheduled releases | Continuous deployment |
| Best for | Projects with versioned releases | CI/CD-heavy environments |
| Merge conflicts | More (long-lived branches) | Fewer (short-lived branches) |
| Data pipeline fit | Good for batch pipelines | Good for streaming/real-time |
Commit Message Conventions
Conventional Commits
<type>[optional scope]: <description>

[optional body]

[optional footer(s)]
Types
| Type | Description | Example |
|---|---|---|
| feat | New feature | feat(pipeline): add order extraction |
| fix | Bug fix | fix: correct date parsing in transform |
| docs | Documentation | docs: update README with setup instructions |
| style | Formatting | style: apply black formatting |
| refactor | Code restructuring | refactor: extract common utilities |
| test | Adding tests | test: add unit tests for transform |
| chore | Maintenance | chore: update dependencies |
| perf | Performance | perf: optimize bulk insert |
| ci | CI/CD | ci: add GitHub Actions workflow |
| build | Build system | build: add Dockerfile |
Examples
# Simple commit
git commit -m "feat: add daily sales aggregation pipeline"
# With scope
git commit -m "feat(orders): implement CDC ingestion from PostgreSQL"
# With body
git commit -m "fix(transform): handle null values in amount column

Previously, null amounts would cause the aggregation to fail.
This fix adds a coalesce to default null amounts to 0.

Closes #123"
# Breaking change
git commit -m "feat!: change order schema to include customer segment

BREAKING CHANGE: order table now requires customer_segment column"
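The header format can be checked automatically, for example from a commit-msg hook. This is a minimal Python sketch rather than a full implementation of the Conventional Commits specification: the regular expression covers only the `<type>[optional scope]: <description>` header and the types listed in the table above, and `parse_header` is a hypothetical helper name.

```python
import re
from typing import NamedTuple, Optional

ALLOWED_TYPES = {
    "feat", "fix", "docs", "style", "refactor",
    "test", "chore", "perf", "ci", "build",
}

# type, optional (scope), optional "!", then ": " and a non-empty description.
HEADER_RE = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^()\s]+)\))?(?P<breaking>!)?: (?P<description>\S.*)$"
)


class CommitHeader(NamedTuple):
    type: str
    scope: Optional[str]
    breaking: bool
    description: str


def parse_header(message: str) -> CommitHeader:
    """Parse the first line of a commit message, raising ValueError if invalid."""
    lines = message.splitlines()
    match = HEADER_RE.match(lines[0] if lines else "")
    if match is None or match["type"] not in ALLOWED_TYPES:
        raise ValueError("commit header does not follow the conventional format")
    # A breaking change is marked by "!" in the header or by a footer line.
    breaking = match["breaking"] is not None or any(
        line.startswith("BREAKING CHANGE:") for line in lines[1:]
    )
    return CommitHeader(match["type"], match["scope"], breaking, match["description"])
```

Applied to the examples above, `parse_header("feat(orders): implement CDC ingestion from PostgreSQL")` yields type `feat` with scope `orders`, and a message such as `"update stuff"` is rejected.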
Pull Request Best Practices
PR Template
## Description
Brief description of changes
## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update
## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] Manual testing completed
## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings
Review Checklist for Data Engineers
+-------------------------------------------------------------+
|                    CODE REVIEW CHECKLIST                    |
+-------------------------------------------------------------+
| Code follows project style guidelines                       |
| SQL queries are optimized and use appropriate indexes       |
| Error handling is comprehensive                             |
| Logging is appropriate and informative                      |
| Data quality checks are included                            |
| Tests cover edge cases                                      |
| Documentation is updated                                    |
| No hardcoded credentials or secrets                         |
| Backward compatibility maintained                           |
| Performance implications considered                         |
| Rollback strategy documented                                |
+-------------------------------------------------------------+
CI/CD Integration
GitHub Actions Workflow
# .github/workflows/data-pipeline.yml
name: Data Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install black flake8 mypy
      - name: Run black
        run: black --check .
      - name: Run flake8
        run: flake8 .
      - name: Run mypy
        run: mypy .

  test:
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      # Each job starts on a fresh runner, so dependencies are installed again here
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov
      - name: Run tests
        run: pytest tests/ -v --cov=src --cov-report=xml
      - name: Upload coverage
        uses: codecov/codecov-action@v3

  deploy:
    runs-on: ubuntu-latest
    needs: test
    # True only for pushes to main; pull request runs use a refs/pull/... ref
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Deploy to S3
        run: |
          aws s3 sync src/ s3://my-bucket/pipeline/
      - name: Notify deployment
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-Type: application/json' \
            -d '{"text": "Pipeline deployed to production"}'
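The test job in the workflow runs `pytest tests/`, which collects functions whose names start with `test_`. As a sketch of the kind of unit test that job might pick up, the Python below checks the null-handling behavior described in the earlier commit example; `fill_null_amounts` is a hypothetical transform helper, defined inline here so that the example is self-contained.

```python
from typing import Iterable, List, Optional


def fill_null_amounts(
    amounts: Iterable[Optional[float]], default: float = 0.0
) -> List[float]:
    """Replace missing (None) amounts with a default before aggregating."""
    return [default if amount is None else amount for amount in amounts]


def test_null_amounts_default_to_zero():
    assert fill_null_amounts([10.0, None, 2.5]) == [10.0, 0.0, 2.5]


def test_rows_without_nulls_are_unchanged():
    assert fill_null_amounts([1.0, 2.0]) == [1.0, 2.0]


def test_aggregation_no_longer_fails_on_nulls():
    # sum() would raise TypeError on a None value without the fill step.
    assert sum(fill_null_amounts([5.0, None])) == 5.0
```

In a real repository the helper would live under `src/` and the tests under `tests/unit/`, so a failing assertion blocks the deploy job through its `needs: test` dependency.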
CI/CD Pipeline Flow for Data Engineering
Git Hooks
Pre-commit Hooks
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key
  - repo: https://github.com/psf/black
    rev: 24.1.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0
    hooks:
      - id: flake8
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.8.0
    hooks:
      - id: mypy
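Hooks such as `detect-private-key` work by scanning staged files for suspicious patterns. The Python sketch below shows the general idea and is not the implementation used by any of the hooks above; the two patterns (AWS access key ID prefixes and PEM private key headers) and the function names are illustrative assumptions, and a real scanner would cover many more cases.

```python
import re
from typing import Iterable, List, Tuple

# AWS access key IDs: a known 4-letter prefix followed by 16 uppercase letters or digits.
AWS_KEY_RE = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")
# PEM headers such as "-----BEGIN RSA PRIVATE KEY-----".
PRIVATE_KEY_RE = re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----")


def find_secrets(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, kind) for every line that looks like it holds a secret."""
    findings = []
    for number, line in enumerate(lines, start=1):
        if AWS_KEY_RE.search(line):
            findings.append((number, "AWS access key ID"))
        if PRIVATE_KEY_RE.search(line):
            findings.append((number, "private key header"))
    return findings


def scan_files(paths: List[str]) -> int:
    """Print findings for each file; return 1 if anything was found, else 0."""
    exit_code = 0
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, kind in find_secrets(handle):
                print(f"{path}:{number}: possible {kind}")
                exit_code = 1
    return exit_code
```

A hook script would pass the file names it receives on the command line to `scan_files` and exit with the returned code; a non-zero exit status is what blocks the commit.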
Managing Pipeline Code
Project Structure
data-pipeline/
+-- .github/
|   +-- workflows/
|       +-- ci.yml
|       +-- deploy.yml
+-- src/
|   +-- extractors/
|   |   +-- __init__.py
|   |   +-- postgres.py
|   |   +-- api.py
|   +-- transformers/
|   |   +-- __init__.py
|   |   +-- orders.py
|   |   +-- customers.py
|   +-- loaders/
|   |   +-- __init__.py
|   |   +-- warehouse.py
|   +-- utils/
|       +-- __init__.py
|       +-- config.py
+-- tests/
|   +-- unit/
|   +-- integration/
+-- sql/
|   +-- migrations/
|   +-- models/
+-- configs/
|   +-- dev.yaml
|   +-- staging.yaml
|   +-- prod.yaml
+-- docker/
|   +-- Dockerfile
+-- docs/
|   +-- architecture.md
+-- README.md
+-- requirements.txt
+-- pyproject.toml
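The split into extractors, transformers, and loaders pays off when each stage is a plain function that can be tested and swapped independently. The following Python sketch illustrates that separation with in-memory stand-ins; every function here is hypothetical and only mirrors the roles of the modules in the tree above.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def extract_orders() -> List[Row]:
    """Stand-in for an extractor module: return raw order rows."""
    return [
        {"order_id": 1, "amount": 20.0},
        {"order_id": 2, "amount": -5.0},
    ]


def transform_orders(rows: Iterable[Row]) -> List[Row]:
    """Stand-in for a transformer module: keep rows with a positive amount."""
    return [row for row in rows if float(row["amount"]) > 0]


def make_loader(target: List[Row]) -> Callable[[Iterable[Row]], int]:
    """Stand-in for a loader module: append rows to an in-memory 'warehouse'."""
    def load(rows: Iterable[Row]) -> int:
        batch = list(rows)
        target.extend(batch)
        return len(batch)
    return load


def run_pipeline(
    extract: Callable[[], Iterable[Row]],
    transform: Callable[[Iterable[Row]], Iterable[Row]],
    load: Callable[[Iterable[Row]], int],
) -> int:
    """Wire the three stages together and return the number of rows loaded."""
    return load(transform(extract()))
```

Because `run_pipeline` depends only on the callables it receives, a unit test can pass in a fake extractor or loader without touching a database.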
dbt Project Structure
dbt_project/
+-- models/
|   +-- staging/
|   |   +-- stg_orders.sql
|   |   +-- stg_customers.sql
|   |   +-- stg_products.sql
|   +-- intermediate/
|   |   +-- int_orders_enriched.sql
|   |   +-- int_customer_metrics.sql
|   +-- marts/
|   |   +-- fct_orders.sql
|   |   +-- dim_customers.sql
|   |   +-- dim_products.sql
|   +-- sources/
|       +-- source.yml
+-- tests/
|   +-- assert_positive_amount.sql
|   +-- test_unique_orders.sql
+-- macros/
|   +-- generate_schema_name.sql
+-- snapshots/
|   +-- scd_customers.sql
+-- seeds/
|   +-- country_codes.csv
+-- dbt_project.yml
+-- profiles.yml
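The layout above follows a naming pattern: `stg_` for staging models, `int_` for intermediate models, and `fct_` or `dim_` for marts. A convention like this can be enforced in CI. The Python sketch below infers the rule from the file names shown here rather than from any dbt requirement, and `misnamed_models` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Expected file name prefixes per model layer, inferred from the tree above.
LAYER_PREFIXES = {
    "staging": ("stg_",),
    "intermediate": ("int_",),
    "marts": ("fct_", "dim_"),
}


def misnamed_models(paths: Iterable[str]) -> List[str]:
    """Return the model paths whose file name lacks the prefix for its layer."""
    problems = []
    for raw in paths:
        path = PurePosixPath(raw)
        if path.suffix != ".sql" or "models" not in path.parts:
            continue
        # The layer is the directory directly under models/.
        layer = path.parts[path.parts.index("models") + 1]
        prefixes = LAYER_PREFIXES.get(layer)
        if prefixes and not path.name.startswith(prefixes):
            problems.append(raw)
    return problems
```

Running it over the changed files in a pull request and failing the build when the returned list is non-empty keeps the layers consistent as the project grows.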
Monorepo vs Polyrepo
Comparison
| Factor | Monorepo | Polyrepo |
|---|---|---|
| Dependency management | Simple | Complex |
| Atomic changes | Yes | No |
| Access control | Coarse | Fine-grained |
| CI/CD complexity | Higher | Lower |
| Code discoverability | Easy | Harder |
| Team scaling | Better for large teams | Better for distributed teams |
Decision Framework
Best Practices for Data Engineering Repos
| Practice | Rationale |
|---|---|
| Separate config from code | Different env configs (dev/staging/prod) without code changes |
| Version SQL migrations | Database schema changes tracked in Git |
| Use `.gitignore` for data | Never commit large CSV/Parquet files |
| Tag releases | git tag v1.2.0 for rollback and auditing |
| Protect main branch | Require PR reviews and CI checks before merge |
| Automate with CI/CD | Lint, test, and deploy on every merge |
| Document architecture | README with diagrams and data flow |
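Separating config from code usually means the code decides only which file to read, never what the values are. Here is a minimal Python sketch, assuming the environment is selected through a `PIPELINE_ENV` variable and that the files follow the `configs/<env>.yaml` layout shown earlier; the variable name and the `config_path` helper are assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

VALID_ENVS = ("dev", "staging", "prod")


def config_path(
    env: Optional[Mapping[str, str]] = None,
    root: Path = Path("configs"),
) -> Path:
    """Return the config file for the active environment, defaulting to dev."""
    env = os.environ if env is None else env
    name = env.get("PIPELINE_ENV", "dev")
    if name not in VALID_ENVS:
        raise ValueError(
            f"unknown environment {name!r}; expected one of {VALID_ENVS}"
        )
    return root / f"{name}.yaml"
```

The same commit can then run in every environment: deployment sets `PIPELINE_ENV=prod`, and a typo in the variable fails fast instead of silently falling back to another environment's settings.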
Summary and Takeaways
- Git is essential: master the core commands (`add`, `commit`, `push`, `pull`, `branch`, `merge`, `rebase`).
- Use conventional commits: clear, consistent commit messages (`feat:`, `fix:`, `docs:`) improve changelogs and traceability.
- Branch strategically: Git Flow for scheduled releases, trunk-based for continuous deployment.
- PRs enable review: use templates and checklists for quality; require at least one reviewer.
- CI/CD automates quality: lint, test, and deploy automatically on merge to main.
- Git hooks prevent issues: catch formatting errors, secrets, and type issues before they reach the repository.
- Structure matters: organize pipeline code with clear separation of extractors, transformers, and loaders.
- Choose repo strategy wisely: monorepo for tight coupling and shared code, polyrepo for independent team ownership.
See Also
- What is Data Engineering: Introduction to data engineering
- Python for Data Engineers: Python libraries and patterns
- Docker for Data Engineers: Containerizing data pipelines
- Command Line & Shell Scripting: Bash fundamentals
- Cloud Platforms Overview: AWS, GCP, and Azure comparison
Practice Exercises
- Git workflow: Set up a Git repository with Git Flow branching strategy. Create a feature branch, make changes, create a PR, and merge.
- CI/CD pipeline: Create a GitHub Actions workflow that runs linting, tests, and deploys on merge to main.
- Pre-commit hooks: Set up pre-commit hooks that run black, flake8, and check for secrets.
- Code review: Review a teammate's PR using the data engineering checklist. Document your findings.
- Repository structure: Design a project structure for a data pipeline that handles 3 different data sources and loads to a data warehouse.