🎨 Glue Studio

Deep dive into Glue Studio visual ETL, job bookmarks, and data profiling.


Glue Studio Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    GLUE STUDIO ARCHITECTURE                                   │
│                                                                             │
│  VISUAL EDITOR → Generated PySpark → Glue Runtime → S3 Output              │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  VISUAL EDITOR                                                      │    │
│  │  Sources → Transforms → Joins → Filters → Sinks                    │    │
│  │                                                                     │    │
│  │  Features:                                                          │    │
│  │  • Drag-and-drop ETL design                                         │    │
│  │  • Auto-generated PySpark code                                      │    │
│  │  • Real-time preview                                                │    │
│  │  • Job versioning                                                   │    │
│  │  • Data quality checks                                              │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  JOB BOOKMARKS:                                                             │
│  • Track processed data state                                               │
│  • Enable incremental processing                                            │
│  • State persisted by the Glue service                                      │
│  • Skip previously processed files                                          │
│                                                                             │
│  DATA PROFILING:                                                            │
│  • Column statistics (min, max, mean, nulls)                                │
│  • Data quality scores                                                      │
│  • Anomaly detection                                                        │
│  • Schema validation                                                        │
└─────────────────────────────────────────────────────────────────────────────┘
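
The JOB BOOKMARKS box above can be pictured with a small local simulation. This is only a sketch of the idea, not how Glue implements it: `S3Object`, `BookmarkState`, `select_new_objects`, and `commit` are hypothetical names, and the single "high-water mark" timestamp is an assumed simplification of the state Glue keeps between runs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class S3Object:
    # Hypothetical stand-in for an S3 listing entry.
    key: str
    last_modified: datetime


@dataclass
class BookmarkState:
    # Hypothetical stand-in for the state kept between job runs.
    high_water_mark: datetime = datetime.min.replace(tzinfo=timezone.utc)


def select_new_objects(objects, state):
    """Return objects modified after the bookmark, oldest first."""
    fresh = [o for o in objects if o.last_modified > state.high_water_mark]
    return sorted(fresh, key=lambda o: o.last_modified)


def commit(state, processed):
    """Advance the bookmark only after the run has succeeded."""
    if processed:
        state.high_water_mark = max(o.last_modified for o in processed)
```

Calling `select_new_objects` twice without a `commit` in between returns the same objects both times, which mirrors why a failed run does not lose data: the bookmark only moves forward once the run commits.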

Visual ETL Example

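# Glue Studio generates a PySpark script like this from the visual design
import sys

from awsglue.transforms import *
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# JOB_NAME is supplied by the Glue runtime
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# job.init / job.commit are required for job bookmarks to take effect
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: S3 Parquet (transformation_ctx lets bookmarks track this source)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://data-lake-raw/sales/"]},
    format="parquet",
    transformation_ctx="source"
)

# Transform: Filter (keep rows with a non-null, positive amount;
# comparing None > 0 would raise a TypeError in Python 3)
filtered = Filter.apply(
    frame=source,
    f=lambda x: x["amount"] is not None and x["amount"] > 0
)

# Transform: Apply Mapping
# Each tuple is (source field, source type, target field, target type)
mapped = ApplyMapping.apply(
    frame=filtered,
    mappings=[
        ("sale_id", "long", "sale_id", "long"),
        ("amount", "double", "amount", "double"),
        ("sale_date", "string", "sale_date", "date")
    ]
)

# Sink: S3 Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://data-lake-processed/sales/"},
    format="parquet"
)

# Commit so bookmark state advances after a successful run
job.commit()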
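
The Filter and ApplyMapping steps can be sanity-checked locally, without a Glue runtime, by mirroring them on plain Python dicts. The sketch below is an approximation under stated assumptions: the helper names are hypothetical, `sale_date` strings are assumed to be ISO formatted (YYYY-MM-DD), and a missing or null `amount` is treated as a non-match.

```python
from datetime import date


def keep_positive_amount(record):
    """Mirror the Filter step (amount > 0); null amounts are dropped (assumption)."""
    amount = record.get("amount")
    return amount is not None and amount > 0


def apply_mapping(record):
    """Mirror the ApplyMapping step: keep three fields and cast sale_date to a date."""
    return {
        "sale_id": int(record["sale_id"]),
        "amount": float(record["amount"]),
        # Assumes ISO dates; other formats would need explicit parsing.
        "sale_date": date.fromisoformat(record["sale_date"]),
    }


def run_local(records):
    """Filter first, then map, in the same order as the job."""
    return [apply_mapping(r) for r in records if keep_positive_amount(r)]
```

Fields not listed in the mapping are dropped from the output, which is also what ApplyMapping does with unmapped fields.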

Interview Q&A

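Q1: What are the advantages of the Glue Studio visual editor over the script editor?

Answer: You design the job as a visual graph of sources, transforms, and sinks, and Glue Studio generates the PySpark script for you. Data preview at each node makes it easier to see where a pipeline goes wrong. The generated code can be exported and customized; note that once a visual job is converted to a script, it can no longer be edited visually.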

Q2: How do job bookmarks work?

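Answer: With bookmarks enabled, the Glue service persists state from each successful job run. For S3 sources this is based on object last-modified timestamps; for JDBC sources, on bookmark key columns. On the next run, only new or changed data is processed. For the state to be tracked, the script must call job.init() and job.commit() and set a transformation_ctx on each source.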
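
As an illustration, bookmarks can be switched on for a single run through the `--job-bookmark-option` job argument. The sketch below wraps the boto3 Glue client's `start_job_run` call; the function name `start_incremental_run` and the job name in the usage comment are hypothetical.

```python
def start_incremental_run(glue_client, job_name):
    """Start a Glue job run with job bookmarks enabled for this run."""
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )
    return response["JobRunId"]


# Usage (needs AWS credentials; "sales-etl" is a hypothetical job name):
#   import boto3
#   run_id = start_incremental_run(boto3.client("glue"), "sales-etl")
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub. The other accepted values for the argument are `job-bookmark-disable` and `job-bookmark-pause`.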

Q3: What is data profiling in Glue?

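Answer: Automated analysis of a dataset's quality: column statistics (min, max, mean, null counts), data types, value distributions, and anomalies. In Glue this is surfaced through AWS Glue Data Quality, where rules are written in DQDL (Data Quality Definition Language).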
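
To make the column statistics concrete, here is a minimal pure-Python sketch of the kind of numbers a profiler reports for a numeric column. It is not Glue's implementation; `profile_column` and `profile_rows` are hypothetical helpers that run on plain dicts.

```python
from statistics import mean


def profile_column(values):
    """Compute count, null count, min, max, and mean for one numeric column."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": mean(present) if present else None,
    }


def profile_rows(rows, columns):
    """Profile the named columns; a missing key counts as a null."""
    return {c: profile_column([r.get(c) for r in rows]) for c in columns}
```

A sudden jump in the null count or a shift in the mean between runs is the sort of signal that anomaly detection builds on.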

Summary

Visual Editor: Drag-and-drop ETL design with auto-generated code
Job Bookmarks: Service-managed state for incremental processing
Data Profiling: Automated quality analysis and statistics
Monitoring: CloudWatch metrics and logs
Versioning: Track changes and rollback capability
