Glue Studio
Deep dive into Glue Studio visual ETL, job bookmarks, and data profiling.
Module: AWS Data Engineering • Topic 40 of 65 • Premium Content
Glue Studio Architecture
Architecture Diagram
┌──────────────────────────────────────────────────────────────────────┐
│  GLUE STUDIO ARCHITECTURE                                            │
│                                                                      │
│  VISUAL EDITOR → Generated PySpark → Glue Runtime → S3 Output        │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │ VISUAL EDITOR                                                  │  │
│  │ Sources → Transforms → Joins → Filters → Sinks                 │  │
│  │                                                                │  │
│  │ Features:                                                      │  │
│  │ • Drag-and-drop ETL design                                     │  │
│  │ • Auto-generated PySpark code                                  │  │
│  │ • Real-time preview                                            │  │
│  │ • Job versioning                                               │  │
│  │ • Data quality checks                                          │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  JOB BOOKMARKS:                                                      │
│  • Track processed data state                                        │
│  • Enable incremental processing                                     │
│  • State persisted by the Glue service                               │
│  • Skip previously processed files                                   │
│                                                                      │
│  DATA PROFILING:                                                     │
│  • Column statistics (min, max, mean, nulls)                         │
│  • Data quality scores                                               │
│  • Anomaly detection                                                 │
│  • Schema validation                                                 │
└──────────────────────────────────────────────────────────────────────┘
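To make the "visual graph becomes a sequential script" idea concrete, here is a minimal sketch that orders the nodes of a job graph so each step runs after its inputs. This is only an illustration of the concept, not how Glue Studio is implemented; the node names and the `script_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each node maps to the list of nodes it reads from.
visual_job = {
    "source_sales": [],
    "filter_positive": ["source_sales"],
    "apply_mapping": ["filter_positive"],
    "sink_s3": ["apply_mapping"],
}


def script_order(graph):
    """Return node names in an order where every node follows its inputs."""
    return list(TopologicalSorter(graph).static_order())


order = script_order(visual_job)
print(order)
```

A generated script follows the same principle: sources are read first, each transform consumes the frame produced upstream, and sinks come last.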
Visual ETL Example
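# Glue Studio generates a PySpark script like this from the visual design
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping, Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Initialize the job; job.init() and job.commit() are required for job bookmarks
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: S3 Parquet (transformation_ctx keys the bookmark state for this source)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://data-lake-raw/sales/"]},
    format="parquet",
    transformation_ctx="source",
)

# Transform: Filter (keep rows with a positive, non-null amount)
filtered = Filter.apply(
    frame=source,
    f=lambda row: row["amount"] is not None and row["amount"] > 0,
)

# Transform: Apply Mapping (source name, source type, target name, target type)
mapped = ApplyMapping.apply(
    frame=filtered,
    mappings=[
        ("sale_id", "long", "sale_id", "long"),
        ("amount", "double", "amount", "double"),
        ("sale_date", "string", "sale_date", "date"),
    ],
)

# Sink: S3 Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://data-lake-processed/sales/"},
    format="parquet",
)

# Commit so the bookmark state advances after a successful run
job.commit()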
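The filter and mapping logic can be checked locally without a Glue runtime. The sketch below mirrors the two transforms on plain Python dictionaries; the helper names (`is_positive_sale`, `map_sale`, `transform`) and the sample rows are hypothetical, and it assumes rows with a missing amount should be dropped and that `sale_date` arrives as an ISO `YYYY-MM-DD` string.

```python
from datetime import date


def is_positive_sale(row):
    """Mirror of the Filter step: keep rows with a positive, non-null amount."""
    amount = row.get("amount")
    return amount is not None and amount > 0


def map_sale(row):
    """Mirror of the Apply Mapping step: cast each field to its target type."""
    return {
        "sale_id": int(row["sale_id"]),
        "amount": float(row["amount"]),
        "sale_date": date.fromisoformat(row["sale_date"]),
    }


def transform(rows):
    """Apply the filter, then the mapping, in the same order as the job."""
    return [map_sale(r) for r in rows if is_positive_sale(r)]


sample = [
    {"sale_id": 1, "amount": 19.99, "sale_date": "2024-01-15"},
    {"sale_id": 2, "amount": -5.0, "sale_date": "2024-01-16"},
    {"sale_id": 3, "amount": None, "sale_date": "2024-01-17"},
]
result = transform(sample)
print(result)
```

Keeping the row-level rules in small functions like these makes them easy to unit test before they are wired into a Glue job.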
Interview Q&A
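Q1: What is the advantage of Glue Studio over the script editor?
Answer: Glue Studio lets you design the job as a visual graph, generates the PySpark script for you, and offers a data preview at each step, which makes jobs quicker to build and debug. The generated code can be exported and customized.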
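Q2: How do job bookmarks work?
Answer: Glue persists bookmark state for each job as part of the service and tracks what each source has already read, keyed by the source's transformation_ctx. For S3 sources this is based on object modification timestamps, so the next run reads only new or modified objects. Bookmarks take effect only when the job runs with --job-bookmark-option job-bookmark-enable and the script calls job.init() and job.commit().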
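The incremental behavior can be modeled in a few lines. This is a deliberately simplified toy model, not Glue's actual bookmark logic: it assumes each file has a single integer modification timestamp and that the bookmark is just the highest timestamp processed so far. The `select_new_files` helper and the file names are hypothetical.

```python
def select_new_files(files, bookmark):
    """Pick files newer than the bookmark and compute the next bookmark.

    files: list of (key, modified_ts) tuples
    bookmark: highest modified_ts already processed, or None on the first run
    """
    if bookmark is None:
        pending = list(files)
    else:
        pending = [f for f in files if f[1] > bookmark]
    # If nothing is pending, the bookmark stays where it was.
    new_bookmark = max((ts for _, ts in pending), default=bookmark)
    return pending, new_bookmark


listing = [("a.parquet", 100), ("b.parquet", 200)]

# First run: no bookmark yet, so everything is processed.
run1, bookmark = select_new_files(listing, None)

# A new file arrives; the second run picks up only that file.
listing.append(("c.parquet", 300))
run2, bookmark = select_new_files(listing, bookmark)

# Third run with no new data: nothing to do, bookmark unchanged.
run3, bookmark = select_new_files(listing, bookmark)
print(run1, run2, run3, bookmark)
```

In a real job the equivalent of "saving the new bookmark" happens at job.commit(), which is why a run that fails before committing reprocesses the same data next time.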
Q3: What is data profiling in Glue?
Answer: Automated analysis of data quality: column statistics, data types, nulls, distributions, and anomalies.
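As a rough illustration of what column-level profiling computes, the sketch below derives the count, null count, min, max, and mean for one numeric column. It is a plain-Python stand-in rather than a Glue API; the `profile_column` helper and the sample values are hypothetical.

```python
from statistics import mean


def profile_column(values):
    """Compute basic profile statistics for a numeric column with possible nulls."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": mean(present) if present else None,
    }


stats = profile_column([10.0, None, 30.0, 20.0])
print(stats)
```

Statistics like these are the raw inputs for quality rules, for example flagging a column whose null share exceeds an agreed threshold.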
Summary
- Visual Editor: Drag-and-drop ETL design with auto-generated code
- Job Bookmarks: Service-managed state for incremental processing
- Data Profiling: Automated quality analysis and statistics
- Monitoring: CloudWatch metrics and logs
- Versioning: Track changes and rollback capability