Batch Processing Architecture
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β BATCH PROCESSING PIPELINE: S3 β GLUE β REDSHIFT β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β INGESTION β β
β β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β β
β β β SFTP β β API β β JDBC β β Kinesis β β β
β β β Files β β Pull β β Extract β β Buffer β β β
β β ββββββ¬ββββββ ββββββ¬ββββββ ββββββ¬ββββββ ββββββ¬ββββββ β β
β βββββββββΌβββββββββββββββΌβββββββββββββββΌβββββββββββββββΌββββββββββββββββ β
β βΌ βΌ βΌ βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β RAW ZONE (S3) β β
β β s3://data-lake-raw/landing/{date}/{source}/ β β
β β Format: CSV, JSON, XML (as-is) β β
β βββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β GLUE CRAWLER (Schema Discovery) β β
β β β’ Run daily or on new data arrival β β
β β β’ Update Glue Data Catalog β β
β β β’ Detect schema changes β β
β βββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β GLUE ETL JOB (Transformation) β β
β β β’ Clean and validate data β β
β β β’ Convert formats (CSV β Parquet) β β
β β β’ Apply business rules β β
β β β’ Enrich with reference data β β
β β β’ Partition by date keys β β
β βββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β PROCESSED ZONE (S3) β β
β β s3://data-lake-processed/silver/{table}/{year}/{month}/{day}/ β β
β β Format: Parquet, Snappy compressed β β
β βββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββββββΌββββββββββββββββββ β
β βΌ βΌ βΌ β
β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β
β β REDSHIFT COPY β β SPECTRUM β β ATHENA β β
β β (Load Data) β β (Query S3) β β (Ad-hoc) β β
β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ORCHESTRATION: Step Functions β β
β β β β
β β Start β Crawl β Validate β Transform β Load β Notify β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Glue ETL Job Example
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'INPUT_PATH', 'OUTPUT_PATH'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read from raw zone
raw_df = spark.read.csv(args['INPUT_PATH'], header=True, inferSchema=True)
# Clean and transform
cleaned_df = raw_df \
.dropDuplicates() \
.filter(F.col("amount").isNotNull()) \
.withColumn("processed_date", F.current_date()) \
.withColumn("year", F.year("transaction_date")) \
.withColumn("month", F.month("transaction_date")) \
.withColumn("day", F.dayofmonth("transaction_date"))
# Write to processed zone as Parquet
cleaned_df.write \
.mode("overwrite") \
.partitionBy("year", "month", "day") \
.option("compression", "snappy") \
.parquet(args['OUTPUT_PATH'])
job.commit()
Redshift Loading
-- COPY from processed zone
COPY dim_customers
FROM 's3://data-lake-processed/silver/customers/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
FORMAT AS PARQUET;
-- COPY with date partitioning
COPY fact_sales
FROM 's3://data-lake-processed/silver/sales/year=2024/month=01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
FORMAT AS PARQUET;
βΉοΈ
Pro Tip: Use manifest files with Redshift COPY to load specific partitions. This avoids reloading the entire dataset.
Interview Q&A
Q1: What are the key stages of a batch ETL pipeline?
Answer: Ingestion β Raw Storage β Schema Discovery β Transformation β Processed Storage β Loading β Validation
Q2: How do you handle late-arriving data in batch pipelines?
Answer: Use partition overwriting, late-arrival triggers in Step Functions, or separate late-data processing jobs.
Q3: What is the optimal file size for Parquet in S3?
Answer: 128MB - 1GB per file. Smaller files cause excessive metadata; larger files reduce parallelism.
Summary
- Architecture: S3 β Glue Crawler β Glue ETL β S3 β Redshift
- File Format: Parquet with Snappy compression
- Partitioning: By date (year/month/day)
- Orchestration: Step Functions for pipeline management
- Monitoring: CloudWatch, Glue bookmarks, Step Functions history