Docker for Data Engineers

Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Docker ensures your pipelines run identically on your laptop, in CI/CD, and in production.

[Figure: Docker Architecture Overview. A Dockerfile (FROM python:3.11-slim, COPY requirements.txt, RUN pip install, COPY src/) is built into read-only images. The Docker Engine and container runtime on the Docker host run those images as containers. Images are pushed to a registry (Docker Hub / ECR) for storage, for example etl-pipeline:v1.0, spark-app:latest, dbt-runner:2.0.]

Overview

Why Docker Matters for Data Engineering

| Problem | Docker Solution |
| --- | --- |
| "Works on my machine" | Container image is identical everywhere |
| Dependency conflicts | Each container has its own dependencies |
| Non-reproducible builds | Dockerfile is version-controlled |
| Complex environment setup | docker-compose up starts everything |
| Scaling pipelines | Containers are lightweight and disposable |
| Testing with databases | Spin up PostgreSQL/MySQL containers for integration tests |

Docker Fundamentals

Image vs Container

| Concept | Analogy | Description |
| --- | --- | --- |
| Image | Class / Blueprint | Read-only template with code + dependencies |
| Container | Object / Instance | Running process created from an image |
| Dockerfile | Recipe | Instructions to build an image |
| Registry | Library | Where images are stored (Docker Hub, ECR, GCR) |

Essential Commands

# Build image
docker build -t my-etl-pipeline:latest .

# Run container
docker run -d --name etl-run \
    -e DATABASE_URL="postgresql://..." \
    -v /data:/data \
    my-etl-pipeline:latest

# List containers
docker ps                          # Running
docker ps -a                       # All (including stopped)

# Logs
docker logs etl-run
docker logs -f etl-run             # Follow (tail)

# Execute command in running container
docker exec -it etl-run /bin/bash

# Stop and remove
docker stop etl-run
docker rm etl-run

# Cleanup (removes stopped containers, unused networks, and ALL unused images)
docker system prune -a

Writing Dockerfiles for Data Pipelines

Dockerfile Anatomy

# Use slim base to reduce image size
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies (if needed)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first (layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ ./src/
COPY configs/ ./configs/

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV LOG_LEVEL=INFO

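# Health check (assumes the pipeline serves HTTP on port 8080;
# urlopen raises on connection errors and on 4xx/5xx responses)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=3)" || exit 1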

# Run the pipeline
ENTRYPOINT ["python", "-m", "src.pipeline"]
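
The HEALTHCHECK in the Dockerfile above probes http://localhost:8080/health, but the lesson does not show what answers that request. One possible shape, as a minimal sketch using only the Python standard library: the names HealthHandler and start_health_server are hypothetical, and a real pipeline would likely report something more meaningful than a fixed "ok".

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200 and every other path with 404."""

    def do_GET(self):
        if self.path == "/health":
            body = b'{"status": "ok"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Keep probe requests out of the pipeline's own log output.
        pass


def start_health_server(port=8080):
    """Start the health endpoint in a daemon thread and return the server."""
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling start_health_server() at the top of the pipeline's entry module would keep the endpoint alive for as long as the main process runs.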

Layer Caching Optimization

| Optimization | Impact |
| --- | --- |
| Copy requirements.txt before code | Dependencies layer is cached unless requirements change |
| Use .dockerignore | Keeps __pycache__, .git, and node_modules out of the build context |
| Combine RUN commands | Fewer layers; cleanup done in the same RUN keeps deleted files out of the image |
| Use slim base images | python:3.11-slim vs python:3.11 (roughly 150MB vs 800MB) |
| Multi-stage builds | Build in one stage, run in another (minimal final image) |
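
Why instruction order matters for caching can be illustrated with a toy model. The sketch below is a deliberate simplification, not Docker's actual cache implementation: it assumes each layer's cache key depends on the previous layer's key, the instruction text, and (for COPY) a hash of the copied content. The function names and the placeholder content hashes ("r1", "s1", "s2") are hypothetical.

```python
import hashlib


def layer_keys(instructions, file_hashes):
    """Return one cache key per instruction in a simplified layer-cache model.

    Each key depends on the previous key, the instruction text, and (for COPY)
    the hash of the copied content, so a change invalidates every later layer.
    """
    keys = []
    parent = ""
    for instruction in instructions:
        material = parent + instruction
        if instruction.startswith("COPY "):
            source = instruction.split()[1]
            material += file_hashes[source]
        parent = hashlib.sha256(material.encode()).hexdigest()
        keys.append(parent)
    return keys


def reused_layers(before, after):
    """Count leading layers whose cache keys are unchanged between two builds."""
    count = 0
    for old, new in zip(before, after):
        if old != new:
            break
        count += 1
    return count


deps_first = [
    "COPY requirements.txt .",
    "RUN pip install --no-cache-dir -r requirements.txt",
    "COPY src/ ./src/",
]
code_first = [
    "COPY src/ ./src/",
    "COPY requirements.txt .",
    "RUN pip install --no-cache-dir -r requirements.txt",
]

original = {"requirements.txt": "r1", "src/": "s1"}
code_edit = {"requirements.txt": "r1", "src/": "s2"}  # only application code changed

# Dependencies copied first: the COPY and pip install layers are reused.
print(reused_layers(layer_keys(deps_first, original), layer_keys(deps_first, code_edit)))  # 2
# Code copied first: the very first layer changes, so nothing is reused.
print(reused_layers(layer_keys(code_first, original), layer_keys(code_first, code_edit)))  # 0
```

In this model, a code-only edit leaves the dependency layers untouched when requirements.txt is copied first, which is the behavior the first row of the table relies on.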

Multi-Stage Build

# Stage 1: Build dependencies
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Stage 2: Production image
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY src/ ./src/
ENV PATH=/root/.local/bin:$PATH
ENTRYPOINT ["python", "-m", "src.pipeline"]

Docker Compose for Data Pipelines

Complete Data Pipeline Stack

# docker-compose.yml
version: '3.8'

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: analytics
      POSTGRES_USER: pipeline
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U pipeline"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  etl-pipeline:
    build: .
    environment:
      DATABASE_URL: postgresql://pipeline:${DB_PASSWORD}@postgres:5432/analytics
      REDIS_URL: redis://redis:6379
      LOG_LEVEL: INFO
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    volumes:
      - ./data:/data
      - ./logs:/app/logs

  airflow:
    image: apache/airflow:2.8.0
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      # POSTGRES_DB above only creates "analytics"; the "airflow" database must be created separately
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql://pipeline:${DB_PASSWORD}@postgres:5432/airflow
    ports:
      - "8080:8080"
    volumes:
      - ./dags:/opt/airflow/dags
    depends_on:
      postgres:
        condition: service_healthy

volumes:
  pgdata:

# Start entire stack
docker-compose up -d

# View logs
docker-compose logs -f etl-pipeline

# Run one-off command (the image's ENTRYPOINT is python -m src.pipeline, so override it)
docker-compose run --rm --entrypoint python etl-pipeline -m src.backfill --start-date 2024-01-01

# Stop everything
docker-compose down
docker-compose down -v  # Remove volumes too
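
Inside the etl-pipeline container, the application still has to read the DATABASE_URL, REDIS_URL, and LOG_LEVEL values that the Compose file injects. A minimal sketch of that step follows; the function name load_settings and the shape of the returned dictionary are assumptions for illustration, not part of the lesson's pipeline code.

```python
import os
from urllib.parse import urlsplit


def load_settings(env=os.environ):
    """Read connection settings from environment variables set by Compose."""
    db = urlsplit(env["DATABASE_URL"])
    cache = urlsplit(env["REDIS_URL"])
    return {
        "db_host": db.hostname,
        "db_port": db.port or 5432,
        "db_name": db.path.lstrip("/"),
        "db_user": db.username,
        "db_password": db.password,
        "redis_host": cache.hostname,
        "redis_port": cache.port or 6379,
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }
```

With the Compose file above, db_host resolves to the service name postgres rather than localhost, because each service is reachable by its name on the Compose network. Passwords containing characters such as @ or / would need percent-encoding in the URL.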

Networking

Container Communication

| Network Mode | Description | Use Case |
| --- | --- | --- |
| bridge (default) | Containers on private network | Most data pipeline services |
| host | Container shares host network stack | Performance-critical, single-container |
| none | No networking | Isolated computation |
| overlay | Multi-host networking | Docker Swarm, multi-node clusters |
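
On a Compose bridge network, one container reaches another by service name and container port, for example postgres:5432. A pipeline can check that a peer is reachable before starting work. The sketch below is one way to do that with the standard library; wait_for_port and its defaults are hypothetical, and a successful TCP connect only shows that the port is open, not that the database is ready to accept queries.

```python
import socket
import time


def wait_for_port(host, port, attempts=10, delay=1.0, timeout=2.0):
    """Return True once a TCP connection succeeds, False after all attempts fail."""
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Connection refused, DNS failure, or timeout: retry after a pause.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Inside the etl-pipeline container this could be called as wait_for_port("postgres", 5432) before opening a database connection.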

Volumes and Data Persistence

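# Named volumes (Docker-managed)
docker volume create pgdata
# The postgres image requires POSTGRES_PASSWORD when initializing an empty data directory
docker run -d -e POSTGRES_PASSWORD="$DB_PASSWORD" \
    -v pgdata:/var/lib/postgresql/data postgres:15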

# Bind mounts (host directory)
docker run -v /host/data:/container/data my-etl-tool

# Read-only mounts
docker run -v /host/configs:/configs:ro my-etl-tool

# Backup a volume (stop the database container first for a consistent copy)
docker run --rm -v pgdata:/data -v /backup:/backup alpine \
    tar czf /backup/pgdata-backup.tar.gz -C /data .

Volume Best Practices

| Scenario | Recommendation |
| --- | --- |
| Database data | Named volume (durable, Docker-managed) |
| Development code | Bind mount (edit locally, changes reflect in container) |
| Configuration files | Bind mount or COPY in Dockerfile |
| Temporary data | tmpfs mount (in-memory, ephemeral) |
| Backups | Named volume + periodic tar export |
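
The periodic tar export in the last row can also be done from a Python job that has the volume mounted. The following is a minimal sketch using the standard tarfile module; backup_directory and restore_directory are hypothetical helper names, and the paths in the usage note are illustrative.

```python
import tarfile
from pathlib import Path


def backup_directory(source_dir, archive_path):
    """Write the contents of source_dir into a gzip-compressed tar archive."""
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname="." stores paths relative to source_dir, like `tar -C /data .`
        archive.add(source, arcname=".")
    return archive_path


def restore_directory(archive_path, target_dir):
    """Extract an archive produced by backup_directory into target_dir."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Only extract archives you created yourself; tar files can contain unsafe paths.
        archive.extractall(target_dir)
```

For example, backup_directory("/data", "/backup/pgdata-backup.tar.gz") would mirror the tar command shown earlier, assuming the volume is mounted at /data and the archive is written outside the directory being backed up.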

Best Practices for Data Engineers

| Practice | Rationale |
| --- | --- |
| Always use .dockerignore | Prevents copying .git, __pycache__, large files |
| Use specific image tags | postgres:15 not postgres:latest for reproducibility |
| Health checks | Ensure dependent services are ready before starting |
| Environment variables | Never hardcode credentials; use env vars or secrets |
| Resource limits | Prevent OOM: --memory=2g --cpus=2 |
| Logging to stdout | Docker collects stdout; use structured logging |
| Non-root user | Run as non-root for security in production |
| Clean up regularly | docker system prune to reclaim disk space |
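
The "Logging to stdout" row can be made concrete with the standard logging module. This is a minimal sketch under the assumption that one JSON object per line is an acceptable structured format; JsonFormatter, configure_logging, and the chosen field names are hypothetical.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def configure_logging(level="INFO", stream=sys.stdout):
    """Send JSON log lines to stdout so `docker logs` can collect them."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
    return root
```

A pipeline could call configure_logging(os.environ.get("LOG_LEVEL", "INFO")) at startup, which ties the log level to the LOG_LEVEL variable set in the Dockerfile and Compose file.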

Summary Takeaways

  1. Docker eliminates "works on my machine": containers package code with all dependencies for identical behavior everywhere.
  2. Layer caching speeds up builds: copy requirements.txt before application code so dependency layers are cached.
  3. Docker Compose orchestrates multi-service stacks: start databases, caches, and pipelines with a single command.
  4. Named volumes ensure data persistence: don't store database data in containers; use volumes instead.
  5. Health checks prevent race conditions: use depends_on: condition: service_healthy to wait for databases.
  6. Slim base images reduce attack surface: use python:3.11-slim and remove unnecessary packages.
  7. Environment variables keep secrets out of images: never embed credentials in Dockerfiles or images.
  8. Multi-stage builds minimize image size: build dependencies in one stage, copy only what's needed to the final image.

Practice Exercises

  1. Dockerize a pipeline: Take an existing Python ETL script and create a Dockerfile that packages it with all dependencies.

  2. Compose stack: Create a Docker Compose file with PostgreSQL, Redis, and a Python application that connects to both.

  3. Multi-stage build: Write a multi-stage Dockerfile that builds a Python application and produces a minimal production image under 200MB.

  4. Volume backup: Write a shell script that backs up a PostgreSQL Docker volume and restores it to a new container.

  5. CI/CD integration: Modify a GitHub Actions workflow to build a Docker image, push to a registry, and deploy to a server.

⭐
