Docker for Data Engineers

Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Docker packages a pipeline together with its dependencies, so it runs the same way on your laptop, in CI/CD, and in production.

Overview

Why Docker Matters for Data Engineering

Problem	Docker Solution
"Works on my machine"	Container image is identical everywhere
Dependency conflicts	Each container has its own dependencies
Non-reproducible builds	Dockerfile is version-controlled, so every build follows the same recorded steps
Complex environment setup	`docker-compose up` starts everything
Scaling pipelines	Containers are lightweight and disposable
Testing with databases	Spin up PostgreSQL/MySQL containers for integration tests

Docker Fundamentals

Image vs Container

Concept	Analogy	Description
Image	Class / Blueprint	Read-only template with code + dependencies
Container	Object / Instance	Running process created from an image
Dockerfile	Recipe	Instructions to build an image
Registry	Library	Where images are stored (Docker Hub, ECR, GCR)

Essential Commands

# Build image
docker build -t my-etl-pipeline:latest .

# Run container
docker run -d --name etl-run \
    -e DATABASE_URL="postgresql://..." \
    -v /data:/data \
    my-etl-pipeline:latest

# List containers
docker ps                          # Running
docker ps -a                       # All (including stopped)

# Logs
docker logs etl-run
docker logs -f etl-run             # Follow (tail)

# Execute command in running container
docker exec -it etl-run /bin/bash

# Stop and remove
docker stop etl-run
docker rm etl-run

# Cleanup (removes stopped containers, unused networks, and ALL images not used by a container)
docker system prune -a
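
When a scheduler or test harness launches containers from Python, it helps to assemble the `docker run` arguments as a list rather than a shell string. The sketch below mirrors the `docker run` example above; `build_docker_run` is a hypothetical helper name, not part of any Docker SDK.

```python
import shlex


def build_docker_run(image, name, env=None, volumes=None, detach=True):
    """Return the argv list for a `docker run` call (hypothetical helper)."""
    argv = ["docker", "run"]
    if detach:
        argv.append("-d")
    argv += ["--name", name]
    # Each environment variable becomes one `-e KEY=value` pair.
    for key, value in (env or {}).items():
        argv += ["-e", f"{key}={value}"]
    # Each volume becomes one `-v host_path:container_path` pair.
    for host_path, container_path in (volumes or {}).items():
        argv += ["-v", f"{host_path}:{container_path}"]
    argv.append(image)
    return argv


if __name__ == "__main__":
    argv = build_docker_run(
        "my-etl-pipeline:latest",
        "etl-run",
        env={"DATABASE_URL": "postgresql://..."},
        volumes={"/data": "/data"},
    )
    # shlex.join quotes each argument for display; to execute the command,
    # pass the list itself to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Passing a list to `subprocess.run` avoids shell quoting problems when a value such as a connection string contains spaces or special characters.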

Writing Dockerfiles for Data Pipelines

Dockerfile Anatomy

# Use slim base to reduce image size
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies (if needed)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first (layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ ./src/
COPY configs/ ./configs/

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV LOG_LEVEL=INFO

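# Health check (assumes the pipeline serves an HTTP health endpoint on port 8080)
# urlopen raises on a connection failure or a 4xx/5xx response, so the command exits non-zero
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=3)" || exit 1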

# Run the pipeline
ENTRYPOINT ["python", "-m", "src.pipeline"]
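
The `HEALTHCHECK` instruction above probes `http://localhost:8080/health`, so the pipeline process has to answer on that port. One way to do that with only the standard library is sketched below; `HealthHandler` and `start_health_server` are illustrative names, and a real pipeline would probably report more than a fixed "ok".

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200 and a small JSON body."""

    def do_GET(self):
        if self.path == "/health":
            body = b'{"status": "ok"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Suppress per-request logging so probes do not flood the pipeline logs.
        pass


def start_health_server(port=8080):
    """Serve /health from a daemon thread and return the server object."""
    server = ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_health_server()` at the top of the pipeline's entry point keeps the endpoint alive for as long as the process runs, because the daemon thread stops when the main thread exits.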

Layer Caching Optimization

Optimization	Impact
Copy `requirements.txt` before code	Dependencies layer is cached unless requirements change
Use `.dockerignore`	Excludes `__pycache__`, `.git`, `node_modules`
Combine `RUN` commands	Fewer layers, and files deleted within the same `RUN` never reach the image
Use slim base images	`python:3.11-slim` vs `python:3.11` (roughly 150MB vs 800MB)
Multi-stage builds	Build in one stage, run in another (minimal final image)
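
To see why instruction order matters, the toy model below treats each layer's cache key as a hash of the previous key, the instruction text, and the content it copies. This is a simplification for illustration, not Docker's actual cache algorithm; `layer_keys`, `reused_layers`, and the placeholder file contents are all assumptions.

```python
import hashlib


def layer_keys(steps):
    """Return one cache key per step; each key depends on all earlier steps.

    `steps` is a list of (instruction, content) pairs, where `content` stands
    in for the bytes of any files the instruction copies into the image.
    """
    keys = []
    parent = ""
    for instruction, content in steps:
        raw = f"{parent}|{instruction}|{content}".encode()
        parent = hashlib.sha256(raw).hexdigest()
        keys.append(parent)
    return keys


def reused_layers(old_steps, new_steps):
    """Count the leading layers whose keys match, i.e. layers served from cache."""
    count = 0
    for old, new in zip(layer_keys(old_steps), layer_keys(new_steps)):
        if old != new:
            break
        count += 1
    return count


def requirements_first(code):
    # Dependencies are copied and installed before the application code.
    return [
        ("FROM python:3.11-slim", ""),
        ("COPY requirements.txt .", "deps-v1"),
        ("RUN pip install -r requirements.txt", ""),
        ("COPY src/ ./src/", code),
    ]


def everything_first(code):
    # Code and requirements are copied together before installing.
    return [
        ("FROM python:3.11-slim", ""),
        ("COPY . .", "deps-v1 " + code),
        ("RUN pip install -r requirements.txt", ""),
    ]


if __name__ == "__main__":
    # A code-only change keeps the install layer cached in the first ordering...
    print(reused_layers(requirements_first("code-v1"), requirements_first("code-v2")))  # 3
    # ...but invalidates it in the second, so pip install runs again.
    print(reused_layers(everything_first("code-v1"), everything_first("code-v2")))  # 1
```

In the first ordering only the final `COPY` is rebuilt after a code edit; in the second, everything after `FROM` is rebuilt, including the slow dependency install.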

Multi-Stage Build

# Stage 1: Build dependencies
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Stage 2: Production image
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY src/ ./src/
ENV PATH=/root/.local/bin:$PATH
ENTRYPOINT ["python", "-m", "src.pipeline"]

Docker Compose for Data Pipelines

Complete Data Pipeline Stack

# docker-compose.yml
version: '3.8'  # Obsolete in Compose V2 (ignored with a warning); safe to omit

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: analytics
      POSTGRES_USER: pipeline
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U pipeline"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  etl-pipeline:
    build: .
    environment:
      DATABASE_URL: postgresql://pipeline:${DB_PASSWORD}@postgres:5432/analytics
      REDIS_URL: redis://redis:6379
      LOG_LEVEL: INFO
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    volumes:
      - ./data:/data
      - ./logs:/app/logs

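  airflow:
    image: apache/airflow:2.8.0
    # Development only: initializes the metadata DB and runs all Airflow components
    command: standalone
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      # The `airflow` database must already exist; POSTGRES_DB above only creates `analytics`
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql://pipeline:${DB_PASSWORD}@postgres:5432/airflow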
    ports:
      - "8080:8080"
    volumes:
      - ./dags:/opt/airflow/dags
    depends_on:
      postgres:
        condition: service_healthy

volumes:
  pgdata:

# Start entire stack
docker-compose up -d

# View logs
docker-compose logs -f etl-pipeline

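# Run one-off command (--entrypoint is needed because the image's ENTRYPOINT already runs src.pipeline)
docker-compose run --rm --entrypoint python etl-pipeline -m src.backfill --start-date 2024-01-01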

# Stop everything
docker-compose down
docker-compose down -v  # Remove volumes too
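
Inside the `etl-pipeline` container, the connection details arrive through the `DATABASE_URL`, `REDIS_URL`, and `LOG_LEVEL` environment variables set in the Compose file. A minimal sketch of reading them with the standard library follows; `parse_database_url` and `load_settings` are hypothetical helper names, and the defaults are assumptions chosen to match the stack above.

```python
import os
from urllib.parse import urlsplit


def parse_database_url(url):
    """Split a postgresql:// URL into its connection parameters."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # fall back to PostgreSQL's default port
        "user": parts.username,
        "password": parts.password,
        "dbname": parts.path.lstrip("/"),
    }


def load_settings(environ=os.environ):
    """Read pipeline settings from environment variables, failing fast if
    the required DATABASE_URL is missing."""
    try:
        database_url = environ["DATABASE_URL"]
    except KeyError:
        raise RuntimeError("DATABASE_URL is not set") from None
    return {
        "database": parse_database_url(database_url),
        "redis_url": environ.get("REDIS_URL", "redis://redis:6379"),
        "log_level": environ.get("LOG_LEVEL", "INFO"),
    }
```

Failing at startup with a clear message is usually easier to debug than a connection error deep inside the pipeline. Note that a password containing URL-reserved characters would need percent-encoding in the URL and `urllib.parse.unquote` after parsing, which this sketch does not handle.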

Networking

Container Communication

Network Mode	Description	Use Case
bridge (default)	Containers on a private network; on user-defined bridges (including the one Compose creates) they reach each other by name	Most data pipeline services
host	Container shares host network stack	Performance-critical, single-container
none	No networking	Isolated computation
overlay	Multi-host networking	Docker Swarm, multi-node clusters
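
On the bridge network that Compose creates, a service is reachable under its service name, which is why the connection strings earlier use `postgres:5432` and `redis:6379`. When a health check is not available, a pipeline can poll the port itself before connecting. The sketch below is one way to do that; `wait_for_port` is a hypothetical helper, and an open port only shows that something is listening, not that the database is ready for queries.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds,
    or False if `timeout` seconds pass without one."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection resolves the name and opens a TCP connection;
            # the `with` block closes the socket again immediately.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Inside the Compose network the hostname is the service name.
    if not wait_for_port("postgres", 5432, timeout=5.0):
        print("postgres:5432 did not become reachable")
```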

Volumes and Data Persistence

# Named volumes (Docker-managed)
docker volume create pgdata
docker run -e POSTGRES_PASSWORD="$DB_PASSWORD" -v pgdata:/var/lib/postgresql/data postgres:15

# Bind mounts (host directory)
docker run -v /host/data:/container/data my-etl-tool

# Read-only mounts
docker run -v /host/configs:/configs:ro my-etl-tool

# Backup a volume (stop the database container first so the copied files are consistent)
docker run --rm -v pgdata:/data -v /backup:/backup alpine \
    tar czf /backup/pgdata-backup.tar.gz -C /data .
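
The same archive step can be written in Python when the backup runs inside a pipeline container with the volume mounted. This is a sketch under that assumption; `backup_directory` is a hypothetical helper that mirrors `tar czf ... -C /data .` by storing paths relative to the source directory.

```python
import tarfile
from pathlib import Path


def backup_directory(source_dir, archive_path):
    """Write a gzip-compressed tar of source_dir, with member names
    relative to source_dir, and return the archive path."""
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        # rglob walks the whole tree; recursive=False avoids adding the
        # contents of a directory twice.
        for path in sorted(source.rglob("*")):
            archive.add(
                path,
                arcname=path.relative_to(source).as_posix(),
                recursive=False,
            )
    return archive_path


if __name__ == "__main__":
    # Paths match the mount points used in the docker run example above.
    backup_directory("/data", "/backup/pgdata-backup.tar.gz")
```

For a live PostgreSQL instance, a file-level copy can capture a half-written state, so a logical dump with `pg_dump` is often the safer choice for backups taken while the database is running.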

Volume Best Practices

Scenario	Recommendation
Database data	Named volume (durable, Docker-managed)
Development code	Bind mount (edit locally, changes reflect in container)
Configuration files	Bind mount or `COPY` in Dockerfile
Temporary data	tmpfs mount (in-memory, ephemeral)
Backups	Named volume + periodic tar export

Best Practices for Data Engineers

Practice	Rationale
Always use `.dockerignore`	Prevents copying `.git`, `__pycache__`, large files
Use specific image tags	`postgres:15` not `postgres:latest` for reproducibility
Health checks	Let dependent services wait until their dependencies are actually ready
Environment variables	Never hardcode credentials; use env vars or secrets
Resource limits	Cap memory and CPU so one container cannot starve the host: `--memory=2g --cpus=2`
Logging to stdout	Docker collects stdout; use structured logging
Non-root user	Run as non-root for security in production
Clean up regularly	`docker system prune` to reclaim disk space
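
Two of the practices above, logging to stdout and configuring through environment variables, combine naturally in the pipeline's logging setup. The sketch below emits one JSON object per log line and reads the level from `LOG_LEVEL`; `JsonFormatter` and `configure_logging` are illustrative names, and the field set is an assumption rather than a standard.

```python
import json
import logging
import os
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(stream=None):
    """Send all logs to stdout (or `stream`) at the level named by LOG_LEVEL."""
    handler = logging.StreamHandler(sys.stdout if stream is None else stream)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]  # replace any handlers configured earlier
    root.setLevel(os.environ.get("LOG_LEVEL", "INFO").upper())
    return root


if __name__ == "__main__":
    configure_logging()
    logging.getLogger("etl").info("loaded %d rows", 42)
```

Because the output goes to stdout, `docker logs` and `docker-compose logs` pick it up without any file mounts, and a log aggregator can parse each line as JSON.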

Summary Takeaways

Docker eliminates "works on my machine" — containers package code with all dependencies for identical behavior everywhere.
Layer caching speeds up builds — copy requirements.txt before application code so dependency layers are cached.
Docker Compose orchestrates multi-service stacks — start databases, caches, and pipelines with a single command.
Named volumes ensure data persistence — don't store database data in containers; use volumes instead.
Health checks prevent race conditions — use depends_on: condition: service_healthy to wait for databases.
Slim base images reduce attack surface — use python:3.11-slim and remove unnecessary packages.
Environment variables keep secrets out of images — never embed credentials in Dockerfiles or images.
Multi-stage builds minimize image size — build dependencies in one stage, copy only what's needed to the final image.

Practice Exercises

Dockerize a pipeline: Take an existing Python ETL script and create a Dockerfile that packages it with all dependencies.
Compose stack: Create a Docker Compose file with PostgreSQL, Redis, and a Python application that connects to both.
Multi-stage build: Write a multi-stage Dockerfile that builds a Python application and produces a minimal production image under 200MB.
Volume backup: Write a shell script that backs up a PostgreSQL Docker volume and restores it to a new container.
CI/CD integration: Modify a GitHub Actions workflow to build a Docker image, push to a registry, and deploy to a server.
