Docker for Data Engineers
Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Docker packages a pipeline together with its dependencies into an image, so the same artifact runs on your laptop, in CI/CD, and in production.
Overview
Why Docker Matters for Data Engineering
| Problem | Docker Solution |
|---|---|
| "Works on my machine" | Container image is identical everywhere |
| Dependency conflicts | Each container has its own dependencies |
| Builds that are hard to reproduce | Dockerfile is version-controlled alongside the code |
| Complex environment setup | docker-compose up starts everything |
| Scaling pipelines | Containers are lightweight and disposable |
| Testing with databases | Spin up PostgreSQL/MySQL containers for integration tests |
Docker Fundamentals
Image vs Container
| Concept | Analogy | Description |
|---|---|---|
| Image | Class / Blueprint | Read-only template with code + dependencies |
| Container | Object / Instance | Running process created from an image |
| Dockerfile | Recipe | Instructions to build an image |
| Registry | Library | Where images are stored (Docker Hub, ECR, GCR) |
Essential Commands
# Build image
docker build -t my-etl-pipeline:latest .
# Run container
docker run -d --name etl-run \
  -e DATABASE_URL="postgresql://..." \
  -v /data:/data \
  my-etl-pipeline:latest
# List containers
docker ps # Running
docker ps -a # All (including stopped)
# Logs
docker logs etl-run
docker logs -f etl-run # Follow (tail)
# Execute command in running container
docker exec -it etl-run /bin/bash
# Stop and remove
docker stop etl-run
docker rm etl-run
# Cleanup (removes stopped containers, unused networks, and ALL images not used by a container)
docker system prune -a
Writing Dockerfiles for Data Pipelines
Dockerfile Anatomy
# Use slim base to reduce image size
FROM python:3.11-slim
# Set working directory
WORKDIR /app
# Install system dependencies (if needed)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*
# Copy requirements first (layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY src/ ./src/
COPY configs/ ./configs/
# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV LOG_LEVEL=INFO
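# Health check (assumes the pipeline serves /health on port 8080 and that requests is listed in requirements.txt)
# raise_for_status() makes the probe fail on 4xx/5xx responses, not only on connection errors
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import requests; requests.get('http://localhost:8080/health', timeout=3).raise_for_status()" || exit 1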
# Run the pipeline
ENTRYPOINT ["python", "-m", "src.pipeline"]
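The HEALTHCHECK instruction above probes http://localhost:8080/health, but a batch pipeline does not serve HTTP on its own. A minimal sketch of such an endpoint, using only the standard library, might look like the following. The names HealthHandler and start_health_server are hypothetical; the port and path are taken from the Dockerfile.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200 and every other path with 404."""

    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Silence per-request access logs so probes do not flood stdout.
        pass


def start_health_server(port=8080):
    """Start the endpoint in a daemon thread and return the server object."""
    server = ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

One way to use it is to call start_health_server() at the top of the pipeline's entry module before the main work begins. This endpoint only reports that the process is alive; a stricter check could also verify the database connection.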
Layer Caching Optimization
| Optimization | Impact |
|---|---|
| Copy requirements.txt before code | Dependencies layer is cached unless requirements change |
| Use .dockerignore | Keeps __pycache__, .git, node_modules out of the build context |
| Combine RUN commands | Fewer layers; cleanup in the same RUN keeps deleted files out of the image |
| Use slim base images | python:3.11-slim is roughly 150MB vs roughly 800MB for python:3.11 |
| Multi-stage builds | Build in one stage, run in another (minimal final image) |
Multi-Stage Build
# Stage 1: Build dependencies
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt
# Stage 2: Production image
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY src/ ./src/
ENV PATH=/root/.local/bin:$PATH
ENTRYPOINT ["python", "-m", "src.pipeline"]
Docker Compose for Data Pipelines
Complete Data Pipeline Stack
# docker-compose.yml
version: '3.8'
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: analytics
      POSTGRES_USER: pipeline
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      # -d names the database; otherwise pg_isready defaults to one named after the user
      test: ["CMD-SHELL", "pg_isready -U pipeline -d analytics"]
      interval: 10s
      timeout: 5s
      retries: 5
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  etl-pipeline:
    build: .
    environment:
      DATABASE_URL: postgresql://pipeline:${DB_PASSWORD}@postgres:5432/analytics
      REDIS_URL: redis://redis:6379
      LOG_LEVEL: INFO
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    volumes:
      - ./data:/data
      - ./logs:/app/logs
  airflow:
    image: apache/airflow:2.8.0
    # Assumption: run all Airflow components in one container for local development
    command: standalone
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      # Assumes an "airflow" database already exists; POSTGRES_DB above only creates "analytics"
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql://pipeline:${DB_PASSWORD}@postgres:5432/airflow
    ports:
      - "8080:8080"
    volumes:
      - ./dags:/opt/airflow/dags
    depends_on:
      postgres:
        condition: service_healthy
volumes:
  pgdata:
# Start entire stack
docker-compose up -d
# View logs
docker-compose logs -f etl-pipeline
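# Run one-off command (override the image's ENTRYPOINT; otherwise the arguments are appended to "python -m src.pipeline")
docker-compose run --rm --entrypoint python etl-pipeline -m src.backfill --start-date 2024-01-01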
# Stop everything
docker-compose down
docker-compose down -v # Remove volumes too
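Inside the etl-pipeline container, the values under environment: arrive as ordinary environment variables. A small sketch of how the pipeline might read them is shown here, assuming the variable names from the Compose file above. The helper names parse_service_url and load_settings are hypothetical.

```python
import os
from urllib.parse import urlsplit


def parse_service_url(url):
    """Split a connection URL into the parts a client library needs."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,  # the Compose service name, e.g. "postgres"
        "port": parts.port,
        "user": parts.username,
        "database": parts.path.lstrip("/") or None,
    }


def load_settings(environ=os.environ):
    """Read the variables set for the etl-pipeline service."""
    return {
        "database": parse_service_url(environ["DATABASE_URL"]),
        "redis": parse_service_url(environ["REDIS_URL"]),
        "log_level": environ.get("LOG_LEVEL", "INFO"),
    }
```

The host part of each URL is the service name, not localhost, because Compose resolves service names on the project network. A password containing characters such as @ or / would need percent-encoding before being placed in the URL.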
Networking
Container Communication
| Network Mode | Description | Use Case |
|---|---|---|
| bridge (default) | Containers on private network | Most data pipeline services |
| host | Container shares host network stack | Performance-critical, single-container |
| none | No networking | Isolated computation |
| overlay | Multi-host networking | Docker Swarm, multi-node clusters |
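On the network that Docker Compose creates for a project, a container can reach another service by its service name. The sketch below checks that reachability from Python by retrying a TCP connection until it succeeds or a timeout expires. The function name wait_for_port is hypothetical, and an open port only shows that something is listening, not that the service is ready for queries.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Name resolution failures and refused connections both raise OSError.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # "redis" is the service name from a Compose file; adjust for your own stack.
    print("redis reachable:", wait_for_port("redis", 6379, timeout=5.0))
```

A check like this is a fallback for services that define no health check; where a health check exists, depends_on with condition: service_healthy is the simpler option.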
Volumes and Data Persistence
# Named volumes (Docker-managed)
docker volume create pgdata
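# The postgres image will not start without POSTGRES_PASSWORD (taken here from $DB_PASSWORD)
docker run -d -e POSTGRES_PASSWORD="$DB_PASSWORD" -v pgdata:/var/lib/postgresql/data postgres:15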
# Bind mounts (host directory)
docker run -v /host/data:/container/data my-etl-tool
# Read-only mounts
docker run -v /host/configs:/configs:ro my-etl-tool
# Backup a volume (stop the database container first so the files are consistent)
docker run --rm -v pgdata:/data -v /backup:/backup alpine \
  tar czf /backup/pgdata-backup.tar.gz -C /data .
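The backup command above archives the contents of the volume relative to its mount point. For readers who prefer to script this step, here is a rough Python equivalent of the archive and restore operations on a directory. The function names backup_dir and restore_dir are hypothetical, and the sketch assumes it runs somewhere the volume is mounted as a normal directory.

```python
import tarfile
from pathlib import Path


def backup_dir(source_dir, archive_path):
    """Write the contents of source_dir to a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname="." stores paths relative to source_dir, like `tar -C /data .`
        tar.add(source_dir, arcname=".")


def restore_dir(archive_path, target_dir):
    """Extract an archive created by backup_dir into target_dir."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only extract archives you created yourself; tar files can contain unsafe paths.
        tar.extractall(target_dir)
```

As with the shell version, a file-level copy of a running database may be inconsistent. For PostgreSQL, a logical dump with pg_dump is usually a safer backup of live data.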
Volume Best Practices
| Scenario | Recommendation |
|---|---|
| Database data | Named volume (durable, Docker-managed) |
| Development code | Bind mount (edit locally, changes reflect in container) |
| Configuration files | Bind mount or COPY in Dockerfile |
| Temporary data | tmpfs mount (in-memory, ephemeral) |
| Backups | Named volume + periodic tar export |
Best Practices for Data Engineers
| Practice | Rationale |
|---|---|
| Always use .dockerignore | Prevents copying .git, __pycache__, large files |
| Use specific image tags | postgres:15 not postgres:latest for reproducibility |
| Health checks | Ensure dependent services are ready before starting |
| Environment variables | Never hardcode credentials; use env vars or secrets |
| Resource limits | Cap usage so one container cannot starve the host: --memory=2g --cpus=2 |
| Logging to stdout | Docker collects stdout; use structured logging |
| Non-root user | Run as non-root for security in production |
| Clean up regularly | docker system prune to reclaim disk space |
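The "Logging to stdout" practice can be sketched with the standard logging module. The example below emits one JSON object per line so that docker logs and log collectors can parse it. The names JsonFormatter and configure_logging are hypothetical, and LOG_LEVEL is the environment variable used elsewhere in this guide.

```python
import json
import logging
import os
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(stream=sys.stdout, level=None):
    """Send all logs to the given stream (stdout by default) as JSON lines."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    # Fall back to the LOG_LEVEL environment variable, then to INFO.
    root.setLevel((level or os.environ.get("LOG_LEVEL", "INFO")).upper())
    return root


if __name__ == "__main__":
    configure_logging()
    logging.getLogger("etl").info("loaded %d rows", 42)
```

Writing to stdout rather than to a file inside the container keeps logs available through docker logs and avoids filling the container's writable layer.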
Key Takeaways
- Docker eliminates "works on my machine": containers package code with all dependencies for identical behavior everywhere.
- Layer caching speeds up builds: copy requirements.txt before application code so dependency layers are cached.
- Docker Compose orchestrates multi-service stacks: start databases, caches, and pipelines with a single command.
- Named volumes ensure data persistence: don't store database data in containers; use volumes instead.
- Health checks prevent race conditions: use depends_on with condition: service_healthy to wait for databases.
- Slim base images reduce attack surface: use python:3.11-slim and remove unnecessary packages.
- Environment variables keep secrets out of images: never embed credentials in Dockerfiles or images.
- Multi-stage builds minimize image size: build dependencies in one stage, copy only what's needed to the final image.
See Also
- What is Data Engineering: Introduction to data engineering
- Python for Data Engineers: Python libraries and patterns
- Command Line & Shell Scripting: Bash fundamentals
- Cloud Platforms Overview: AWS, GCP, and Azure comparison
- Linux and Networking: Linux essentials
Practice Exercises
- Dockerize a pipeline: Take an existing Python ETL script and create a Dockerfile that packages it with all dependencies.
- Compose stack: Create a Docker Compose file with PostgreSQL, Redis, and a Python application that connects to both.
- Multi-stage build: Write a multi-stage Dockerfile that builds a Python application and produces a minimal production image under 200MB.
- Volume backup: Write a shell script that backs up a PostgreSQL Docker volume and restores it to a new container.
- CI/CD integration: Modify a GitHub Actions workflow to build a Docker image, push to a registry, and deploy to a server.