📦 Batch Processing Pipelines

Master S3-Glue-Redshift batch processing architecture, ETL patterns, and cost optimization.

Batch Processing Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│              BATCH PROCESSING PIPELINE: S3 → GLUE → REDSHIFT                 │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  INGESTION                                                          │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐          │    │
│  │  │  SFTP    │  │  API     │  │  JDBC    │  │  Kinesis │          │    │
│  │  │  Files   │  │  Pull    │  │  Extract │  │  Buffer  │          │    │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘          │    │
│  └───────┼──────────────┼──────────────┼──────────────┼───────────────┘    │
│          ▼              ▼              ▼              ▼                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  RAW ZONE (S3)                                                      │    │
│  │  s3://data-lake-raw/landing/{date}/{source}/                        │    │
│  │  Format: CSV, JSON, XML (as-is)                                     │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│                                ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  GLUE CRAWLER (Schema Discovery)                                    │    │
│  │  • Run daily or on new data arrival                                 │    │
│  │  • Update Glue Data Catalog                                         │    │
│  │  • Detect schema changes                                            │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│                                ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  GLUE ETL JOB (Transformation)                                      │    │
│  │  • Clean and validate data                                          │    │
│  │  • Convert formats (CSV → Parquet)                                  │    │
│  │  • Apply business rules                                             │    │
│  │  • Enrich with reference data                                       │    │
│  │  • Partition by date keys                                           │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│                                ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  PROCESSED ZONE (S3)                                                │    │
│  │  s3://data-lake-processed/silver/{table}/{year}/{month}/{day}/     │    │
│  │  Format: Parquet, Snappy compressed                                 │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│              ┌─────────────────┼─────────────────┐                         │
│              ▼                 ▼                 ▼                         │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐            │
│  │  REDSHIFT COPY  │  │  SPECTRUM       │  │  ATHENA         │            │
│  │  (Load Data)    │  │  (Query S3)     │  │  (Ad-hoc)       │            │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘            │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  ORCHESTRATION: Step Functions                                       │    │
│  │                                                                     │    │
│  │  Start → Crawl → Validate → Transform → Load → Notify              │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
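The orchestration chain at the bottom of the diagram can be sketched as an Amazon States Language definition built in Python. This is a minimal sketch, not a deployable state machine: the step names come from the diagram, but every Resource ARN is a hypothetical placeholder (assumed Lambda functions in an assumed region), and retries, error handling, and task parameters are left out.

```python
import json

# Step names taken from the diagram's orchestration row.
STEPS = ["Crawl", "Validate", "Transform", "Load", "Notify"]


def placeholder_arn(step: str) -> str:
    """Hypothetical Lambda ARN for a step; replace with a real integration."""
    return f"arn:aws:lambda:us-east-1:123456789012:function:batch-{step.lower()}"


def linear_state_machine(steps: list[str]) -> dict:
    """Build a definition that runs the given steps strictly in order."""
    states = {}
    for i, step in enumerate(steps):
        state = {"Type": "Task", "Resource": placeholder_arn(step)}
        if i + 1 < len(steps):
            state["Next"] = steps[i + 1]  # hand off to the following step
        else:
            state["End"] = True  # last step terminates the execution
        states[step] = state
    return {"Comment": "Batch pipeline", "StartAt": steps[0], "States": states}


definition = linear_state_machine(STEPS)
print(json.dumps(definition, indent=2))
```

In a real pipeline the Crawl, Transform, and Load steps would more likely call Glue and Redshift through Step Functions service integrations than through Lambda; the linear Next/End wiring stays the same.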

Glue ETL Job Example

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'INPUT_PATH', 'OUTPUT_PATH'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from raw zone
raw_df = spark.read.csv(args['INPUT_PATH'], header=True, inferSchema=True)

# Clean and transform
cleaned_df = raw_df \
    .dropDuplicates() \
    .filter(F.col("amount").isNotNull()) \
    .withColumn("processed_date", F.current_date()) \
    .withColumn("year", F.year("transaction_date")) \
    .withColumn("month", F.month("transaction_date")) \
    .withColumn("day", F.dayofmonth("transaction_date"))

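# Write to processed zone as Parquet.
# With dynamic mode, "overwrite" replaces only the partitions present in
# cleaned_df; the default (static) mode would first delete everything
# already stored under OUTPUT_PATH.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
cleaned_df.write \
    .mode("overwrite") \
    .partitionBy("year", "month", "day") \
    .option("compression", "snappy") \
    .parquet(args['OUTPUT_PATH'])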

job.commit()
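To make the output layout of the job concrete, the sketch below reproduces in plain Python the Hive-style directory prefix that `partitionBy("year", "month", "day")` produces for a given transaction date. The helper name `partition_prefix` and the example base path are illustrative assumptions. Because `F.year`, `F.month`, and `F.dayofmonth` return integers, the partition values are not zero-padded.

```python
from datetime import date


def partition_prefix(base_path: str, transaction_date: date) -> str:
    """Return the key=value prefix Spark would write this date's rows under."""
    return (
        f"{base_path.rstrip('/')}/"
        f"year={transaction_date.year}/"
        f"month={transaction_date.month}/"
        f"day={transaction_date.day}/"
    )


# Hypothetical table path in the processed zone.
prefix = partition_prefix("s3://data-lake-processed/silver/sales", date(2024, 1, 5))
print(prefix)
```

If zero-padded values such as `month=01` are preferred, the partition columns would have to be built as formatted strings in the job instead of integers.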

Redshift Loading

-- COPY from processed zone
COPY dim_customers
FROM 's3://data-lake-processed/silver/customers/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
FORMAT AS PARQUET;

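-- COPY a single month's partition.
-- The Glue job above writes unpadded integer partition values
-- (month=1, not month=01). The partition columns exist only in the
-- S3 path, not inside the Parquet files, so the columns of fact_sales
-- must match the file columns without year/month/day.
COPY fact_sales
FROM 's3://data-lake-processed/silver/sales/year=2024/month=1/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
FORMAT AS PARQUET;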

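Pro Tip: Use a manifest file with Redshift COPY to list exactly which S3 objects to load, such as the files in newly written partitions. This avoids reloading the entire dataset. For Parquet, each manifest entry must also include a meta block with the file's content_length.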
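A manifest is a small JSON document listing the objects COPY should read. The sketch below builds one for Parquet files, where each entry needs a `meta.content_length` giving the object size in bytes. The function name, file names, and sizes are hypothetical; in practice the sizes would come from an S3 listing.

```python
import json


def build_parquet_manifest(files: list[tuple[str, int]]) -> str:
    """Render a COPY manifest from (s3_url, size_in_bytes) pairs."""
    entries = [
        {"url": url, "mandatory": True, "meta": {"content_length": size}}
        for url, size in files
    ]
    return json.dumps({"entries": entries}, indent=2)


# Hypothetical files and sizes for one day's partition.
base = "s3://data-lake-processed/silver/sales/year=2024/month=1/day=5"
new_files = [
    (f"{base}/part-00000.snappy.parquet", 134217728),
    (f"{base}/part-00001.snappy.parquet", 120586240),
]

manifest_json = build_parquet_manifest(new_files)
print(manifest_json)
```

After uploading the manifest to S3, the COPY command points its FROM clause at the manifest object and adds the MANIFEST keyword alongside FORMAT AS PARQUET.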

Interview Q&A

Q1: What are the key stages of a batch ETL pipeline?

Answer: Ingestion → Raw Storage → Schema Discovery → Transformation → Processed Storage → Loading → Validation

Q2: How do you handle late-arriving data in batch pipelines?

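Answer: Common options are to reprocess and overwrite only the affected partitions, to start a Step Functions execution when late files land (for example, from an S3 event routed through EventBridge), or to run a separate job dedicated to late data.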
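The partition-overwrite approach starts by working out which date partitions a batch actually touches. The sketch below is one way to do that in plain Python, assuming each record carries an event date and that a batch for `run_date` is expected to contain only that date; anything earlier is treated as late, and future-dated records are ignored here. The function name is hypothetical.

```python
from datetime import date


def partitions_to_rewrite(event_dates: list[date], run_date: date) -> dict:
    """Split a batch's distinct event dates into on-time and late partitions."""
    distinct = sorted(set(event_dates))
    return {
        "on_time": [d for d in distinct if d == run_date],
        "late": [d for d in distinct if d < run_date],
    }


# Hypothetical batch: mostly Jan 5 events plus stragglers from Jan 3 and 4.
batch = [
    date(2024, 1, 5),
    date(2024, 1, 5),
    date(2024, 1, 3),
    date(2024, 1, 4),
    date(2024, 1, 3),
]
plan = partitions_to_rewrite(batch, run_date=date(2024, 1, 5))
print(plan["late"])
```

Each date in `plan["late"]` identifies a partition that would be rewritten, or handed to a separate late-data job, instead of reprocessing the whole table.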

Q3: What is the optimal file size for Parquet in S3?

Answer: Roughly 128 MB to 1 GB per file. Many small files add S3 request and file-open overhead to every read; very large files leave fewer units of work to spread across parallel readers.
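A quick way to apply this guideline is to estimate how many output files a write should produce. The sketch below assumes a 256 MiB target, which is an arbitrary choice inside the recommended range, and a known or estimated output size; the function name is hypothetical.

```python
import math

MIB = 1024 ** 2


def target_file_count(total_bytes: int, target_file_bytes: int = 256 * MIB) -> int:
    """Number of output files that keeps each file near the target size."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


num_files = target_file_count(10 * 1024 * MIB)  # about 10 GiB of output
print(num_files)
```

The result could be passed to a repartition step before writing. With a partitioned write, the estimate applies per partition directory, and the compressed output size usually has to be measured from an earlier run.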

Summary

Architecture: S3 → Glue Crawler → Glue ETL → S3 → Redshift
File Format: Parquet with Snappy compression
Partitioning: By date (year/month/day)
Orchestration: Step Functions for pipeline management
Monitoring: CloudWatch metrics and logs, Step Functions execution history, Glue job bookmarks for tracking processed input