AWS Glue Architecture
Requests flow from the control plane, through the Data Catalog, to the ETL runtime.

Control plane (serverless):
- Glue Console
- Glue API/SDK
- Glue Studio
- Glue CLI

Data Catalog (metadata store):
- Databases
  - Tables
    - Columns
    - Partitions
    - Indexes
    - Table properties
- Crawlers
  - Discover schema from data stores
  - Create/update table definitions
  - Discover partitions
  - Run on a schedule or on an event trigger

Data plane (ETL runtime):
- Glue ETL jobs
  - PySpark scripts
  - Scala scripts
  - Visual ETL (Glue Studio)
  - Auto-scaling workers
  - Job bookmarks (incremental processing)
- Worker types (hourly price at $0.44 per DPU-hour)
  - G.025X: 0.25 DPU, 2 vCPU, 4 GB RAM, $0.11/hr (streaming jobs only)
  - G.1X: 1 DPU, 4 vCPU, 16 GB RAM, $0.44/hr
  - G.2X: 2 DPU, 8 vCPU, 32 GB RAM, $0.88/hr
  - G.4X: 4 DPU, 16 vCPU, 64 GB RAM, $1.76/hr
  - G.8X: 8 DPU, 32 vCPU, 128 GB RAM, $3.52/hr
  - Z.2X: 2 M-DPU, 8 vCPU, 64 GB RAM, $0.88/hr (Ray jobs only; unrelated to Flex execution)
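Because each worker type is priced as a multiple of one DPU-hour rate, the cost of a run can be estimated from worker count and runtime. The sketch below assumes the $0.44 per DPU-hour rate quoted in this section and a weight of 1 DPU for G.1X; `DPU_PER_WORKER` and `estimate_job_cost` are illustrative names, not part of any AWS SDK.

```python
# Illustrative estimate only: the rate and DPU weights are assumptions
# taken from the worker list in this section, not values fetched from AWS.
DPU_HOUR_RATE_USD = 0.44

DPU_PER_WORKER = {
    "G.025X": 0.25,
    "G.1X": 1.0,
    "G.2X": 2.0,
    "G.4X": 4.0,
    "G.8X": 8.0,
}


def estimate_job_cost(worker_type: str, workers: int, hours: float,
                      rate: float = DPU_HOUR_RATE_USD) -> float:
    """Return the estimated cost in USD of a single job run."""
    if worker_type not in DPU_PER_WORKER:
        raise ValueError(f"unknown worker type: {worker_type}")
    dpu_hours = DPU_PER_WORKER[worker_type] * workers * hours
    return round(dpu_hours * rate, 2)


if __name__ == "__main__":
    # 10 G.2X workers for 90 minutes is 30 DPU-hours
    print(estimate_job_cost("G.2X", 10, 1.5))
```

The estimate ignores billing minimums and auto scaling, so treat it as an upper-level planning figure rather than a bill.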
Glue Crawler Configuration
Crawler for S3 Data Lake
import boto3

glue = boto3.client('glue')

# Create crawler configuration
crawler_config = {
    'Name': 's3-data-lake-crawler',
    'Role': 'arn:aws:iam::123456789012:role/GlueServiceRole',
    'DatabaseName': 'data_lake_db',
    'Description': 'Crawl S3 data lake for schema discovery',
    'Targets': {
        'S3Targets': [
            {
                'Path': 's3://data-lake-raw/landing/',
                # Exclude patterns are globs evaluated relative to Path
                'Exclusions': [
                    '_metadata.json',
                    '_SUCCESS',
                    '.*'
                ]
            },
            {
                'Path': 's3://data-lake-processed/silver/',
                'Exclusions': [
                    '_delta_log/*',
                    '_SUCCESS'
                ]
            }
        ]
    },
    'SchemaChangePolicy': {
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    },
    'RecrawlPolicy': {
        'RecrawlBehavior': 'CRAWL_EVERYTHING'
    },
    'LineageConfiguration': {
        'CrawlerLineageSettings': 'ENABLE'
    },
    # 'State' is reported by get_crawler; create_crawler does not accept it
    'Schedule': 'cron(0 2 * * ? *)',  # Daily at 2 AM UTC
    'Configuration': '{"Version":1.0,"Grouping":{"TableGroupingPolicy":"CombineCompatibleSchemas"}}'
}

# Create the crawler. The response carries no crawler details,
# so report the name from the request.
glue.create_crawler(**crawler_config)
print(f"Crawler created: {crawler_config['Name']}")
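A crawler created this way does nothing until its schedule fires or it is started explicitly. The sketch below shows one way to start a crawler on demand and wait for it to finish, using the Glue `start_crawler` and `get_crawler` calls. `run_crawler_and_wait` is a hypothetical helper, and the polling interval and timeout are arbitrary assumptions; the client is passed in so the function can be exercised with a stub.

```python
import time


def run_crawler_and_wait(glue_client, name: str, poll_seconds: float = 30.0,
                         timeout_seconds: float = 3600.0) -> str:
    """Start a crawler, wait until it is READY again, and return the last crawl status."""
    glue_client.start_crawler(Name=name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue_client.get_crawler(Name=name)["Crawler"]
        # State moves RUNNING -> STOPPING -> READY over the life of a crawl
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {name} did not finish within {timeout_seconds}s")
```

With real credentials this would be called as `run_crawler_and_wait(boto3.client('glue'), 's3-data-lake-crawler')`.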
Crawler for Relational Database
# Crawler for RDS/Redshift
db_crawler_config = {
    'Name': 'rds-source-crawler',
    'Role': 'arn:aws:iam::123456789012:role/GlueServiceRole',
    'DatabaseName': 'rds_catalog_db',
    'Targets': {
        'JdbcTargets': [
            {
                'ConnectionName': 'rds-mysql-connection',
                'Path': 'production_db/customers',
                'Exclusions': ['temp_*', 'backup_*']
            },
            {
                'ConnectionName': 'rds-mysql-connection',
                'Path': 'production_db/transactions',
                'Exclusions': []
            }
        ]
    },
    'SchemaChangePolicy': {
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'DEPRECATE_IN_DATABASE'
    },
    'Configuration': '{"Version":1.0,"Grouping":{"TableGroupingPolicy":"CombineCompatibleSchemas"}}'
}

response = glue.create_crawler(**db_crawler_config)
Pro Tip: Use Exclusions in crawlers to skip unnecessary files. This reduces crawler runtime and catalog clutter. Common exclusions: _SUCCESS, _metadata.json, .git, *.tmp.
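Exclude patterns are globs matched against the part of the key that follows the crawler's include path, which makes it easy to write a pattern that only matches at the top level. The sketch below is a rough local approximation for checking patterns before a crawl, assuming only `*` (within one folder level), `**` (across levels), and `?` are used; it is not the crawler's own matcher, and `glob_to_regex` and `is_excluded` are hypothetical helpers.

```python
import re
from typing import Iterable


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified exclude glob into an anchored regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # crosses folder boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # stays inside one folder level
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")     # exactly one non-separator character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def is_excluded(relative_key: str, patterns: Iterable[str]) -> bool:
    """Return True if the key (relative to the include path) matches any pattern."""
    return any(glob_to_regex(p).match(relative_key) for p in patterns)
```

Under these assumptions, `_SUCCESS` excludes only a marker file directly under the include path, while `**/_SUCCESS` excludes marker files in nested partition folders.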
Glue ETL Job Development
PySpark ETL Script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from Data Catalog
source_df = glueContext.create_dynamic_frame.from_catalog(
    database="data_lake_db",
    table_name="raw_transactions",
    transformation_ctx="source_df"
)

# Convert to Spark DataFrame for complex transformations
spark_df = source_df.toDF()

# Apply transformations
transformed_df = (
    spark_df
    .filter(F.col("amount") > 0)
    .withColumn("processed_date", F.current_date())
    .withColumn("year", F.year(F.col("transaction_date")))
    .withColumn("month", F.month(F.col("transaction_date")))
    .withColumn("day", F.dayofmonth(F.col("transaction_date")))
    .withColumn("amount_category",
                F.when(F.col("amount") < 100, "small")
                .when(F.col("amount") < 1000, "medium")
                .otherwise("large"))
    .dropDuplicates(["transaction_id"])
    .fillna({"status": "unknown", "category": "uncategorized"})
)

# Convert back to DynamicFrame
dynamic_frame = DynamicFrame.fromDF(transformed_df, glueContext, "transformed_df")

# Write to the processed zone with partitioning and register the table in the
# Data Catalog. Catalog updates are configured on a sink from getSink(), not
# through connection_options on write_dynamic_frame.from_options.
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://data-lake-processed/silver/transactions/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year", "month", "day"],
    transformation_ctx="write_output"
)
sink.setFormat("glueparquet")  # Parquet output; Snappy is the default compression
sink.setCatalogInfo(catalogDatabase="processed_db", catalogTableName="silver_transactions")
sink.writeFrame(dynamic_frame)

# Commit job
job.commit()
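The `amount_category` rule in the script is plain threshold logic, so its boundaries can be checked without a Spark session. This is a minimal sketch, assuming the same thresholds (100 and 1000) and that non-positive amounts are dropped by the preceding filter; `categorize_amount` is a hypothetical helper that mirrors the `F.when` chain and is not code the job itself runs.

```python
from typing import Optional


def categorize_amount(amount: Optional[float]) -> Optional[str]:
    """Mirror the job's filter and F.when chain for a single value."""
    if amount is None or amount <= 0:
        return None  # row would be removed by .filter(F.col("amount") > 0)
    if amount < 100:
        return "small"
    if amount < 1000:
        return "medium"
    return "large"
```

Checking the edges (99.99, 100, 1000) locally is a cheap way to confirm that the boundaries fall where the business rule expects before paying for a job run.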
Glue Job with Job Bookmarks
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Bookmarks are switched on per job with the job parameter
# --job-bookmark-option job-bookmark-enable. The script supplies the
# transformation_ctx values that bookmark state is keyed on.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read with a transformation_ctx so the bookmark can track this source
source = glueContext.create_dynamic_frame.from_catalog(
    database="data_lake_db",
    table_name="raw_events",
    transformation_ctx="source",
    additional_options={
        "groupFiles": "inPartition",
        "groupSize": "1048576"  # 1 MB
    }
)

# Process only new records (bookmark tracks what's been processed)
processed = source.map(
    lambda x: x,  # identity placeholder for the real transformation
    transformation_ctx="processed"
)

# Write with bookmark
glueContext.write_dynamic_frame.from_options(
    frame=processed,
    connection_type="s3",
    connection_options={
        "path": "s3://data-lake-processed/silver/events/",
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet",
    transformation_ctx="write_output"
)

# Persist bookmark state; without this the next run reprocesses the same data
job.commit()
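Bookmarks take effect only when the run is started with the bookmark option enabled. The sketch below passes that option through the Glue `start_job_run` call; `start_incremental_run` is a hypothetical helper, the worker settings are placeholder assumptions, and the client is passed in so the function can be exercised with a stub.

```python
def start_incremental_run(glue_client, job_name: str, workers: int = 10,
                          worker_type: str = "G.1X") -> str:
    """Start a job run with bookmarks enabled and return its JobRunId."""
    response = glue_client.start_job_run(
        JobName=job_name,
        # Job parameter that turns bookmark tracking on for this run
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
        WorkerType=worker_type,
        NumberOfWorkers=workers,
    )
    return response["JobRunId"]
```

The same argument can instead be set once as a default on the job definition, in which case individual runs do not need to pass it.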
Glue Studio Visual ETL
The visual editor represents a job as a graph of source, transform, and sink nodes. Two example pipelines:

- Source (S3) → Transform (Filter) → Transform (Join) → Sink (S3)
- Source (RDS) → Transform (Aggregate) → Transform (Rename) → Sink (Redshift)

Generated code (PySpark):

# Source: S3
source0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/data/"]},
    format="parquet"
)

# Transform: Filter
filter1 = Filter.apply(frame=source0, f=lambda x: x["amount"] > 0)

# Transform: Join (source1 is a second source frame defined elsewhere in the script)
join1 = Join.apply(frame1=filter1, frame2=source1, keys1=["id"], keys2=["id"])

# Sink: S3
glueContext.write_dynamic_frame.from_options(
    frame=join1,
    connection_type="s3",
    connection_options={"path": "s3://output/processed/"},
    format="parquet"
)
Glue Studio Benefits: Visual editor generates PySpark code automatically. Great for rapid prototyping and team collaboration. Code can be exported and customized.
Glue Data Catalog Deep Dive
Catalog Structure
- AWS Account
  - Data Catalog (one per account per Region)
    - Database: data_lake_db
      - Table: raw_transactions
        - Columns:
          - transaction_id (string)
          - customer_id (string)
          - amount (double)
          - status (string)
          - created_at (timestamp)
        - Partition keys: year, month, day
        - Partitions:
          - year=2024/month=01/day=15
          - year=2024/month=01/day=16
          - year=2024/month=01/day=17
        - Table properties:
          - classification: parquet
          - compressionType: snappy
          - parquetOutputFormat: org.apache.hadoop...
      - Table: silver_customers
        - ... (similar structure)
    - Database: analytics_db
      - ...
Querying Data Catalog
# Query catalog metadata
import boto3

glue = boto3.client('glue')

# List databases
databases = glue.get_databases()
for db in databases['DatabaseList']:
    print(f"Database: {db['Name']}")

# List tables in database
tables = glue.get_tables(DatabaseName='data_lake_db')
for table in tables['TableList']:
    print(f"Table: {table['Name']}")
    print(f"  Location: {table.get('StorageDescriptor', {}).get('Location')}")
    print(f"  Format: {table.get('StorageDescriptor', {}).get('SerdeInfo', {}).get('SerializationLibrary')}")

# Get table schema
table = glue.get_table(DatabaseName='data_lake_db', Name='raw_transactions')
columns = table['Table']['StorageDescriptor']['Columns']
for col in columns:
    print(f"  {col['Name']}: {col['Type']}")

# Get partitions
partitions = glue.get_partitions(DatabaseName='data_lake_db', TableName='raw_transactions')
for partition in partitions['Partitions']:
    print(f"Partition: {partition['Values']}")
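The listing calls above return one page of results at a time, so a database with many tables or partitions needs the `NextToken` value followed across pages. A minimal sketch of that loop for `get_tables` follows; `iter_tables` is a hypothetical helper, and the client is passed in so the function can be exercised with a stub.

```python
from typing import Any, Dict, Iterator


def iter_tables(glue_client, database: str) -> Iterator[Dict[str, Any]]:
    """Yield every table in a database, following NextToken across pages."""
    kwargs: Dict[str, Any] = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        yield from page.get("TableList", [])
        token = page.get("NextToken")
        if not token:
            break  # last page reached
        kwargs["NextToken"] = token
```

boto3 also provides paginators (`glue.get_paginator('get_tables')`) that wrap the same loop; the explicit version is shown only to make the mechanism visible.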
Glue Best Practices
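Pro Tip: Use Glue Flex execution for non-urgent jobs. Flex runs on spare capacity at a lower DPU-hour rate (about 34% below standard at list prices of $0.29 versus $0.44 per DPU-hour), in exchange for variable start times and possibly longer runtimes. It suits batch ETL jobs without tight deadlines.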
Performance Optimization
| Optimization | Technique | Benefit |
|---|---|---|
| Partitioning | Use date-based partition keys | Reduce data scanned |
| File Format | Use Parquet/ORC | Columnar, compressed |
| Compression | Use Snappy/Zstd | Faster I/O |
| Worker Count | Right-size based on data | Optimize cost/speed |
| Shuffle | Repartition before joins | Reduce shuffle spill |
| Caching | Cache frequently accessed data | Reduce redundant reads |
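Partitioning only reduces the data scanned if the read actually filters on the partition columns. One way to do that in a Glue job is the `push_down_predicate` argument of `create_dynamic_frame.from_catalog`. The sketch below builds such a predicate, assuming `year`/`month`/`day` partition columns stored as zero-padded strings; `partition_predicate` is a hypothetical helper.

```python
from datetime import date


def partition_predicate(day: date) -> str:
    """Build a push-down predicate selecting a single year/month/day partition."""
    return (
        f"year == '{day.year:04d}' and "
        f"month == '{day.month:02d}' and "
        f"day == '{day.day:02d}'"
    )


# Intended use inside a Glue job (not executed here):
# glueContext.create_dynamic_frame.from_catalog(
#     database="data_lake_db",
#     table_name="raw_transactions",
#     push_down_predicate=partition_predicate(date(2024, 1, 15)),
# )
```

If the partition columns are typed as integers in the catalog, the quoting and zero-padding would need to change accordingly.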
Cost Optimization
| Cost Factor | Optimization |
|---|---|
| Worker Type | Use the smallest worker that fits: G.1X for small batch jobs, G.025X for low-volume streaming jobs |
| Flex Execution | Use for non-urgent jobs |
| Auto Scaling | Enable for variable workloads |
| Job Bookmarks | Process only new data |
| Data Partitioning | Reduce data scanned |
Interview Questions & Answers
Q1: What is the difference between AWS Glue and AWS Glue Studio?
Answer:
- AWS Glue: Service providing ETL capabilities, Data Catalog, crawlers
- AWS Glue Studio: Visual editor for building ETL jobs (part of Glue)
Glue Studio generates PySpark code automatically from visual workflows. It's ideal for teams with mixed technical skills.
Q2: How do job bookmarks work in AWS Glue?
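Answer: Job bookmarks track which data a job has already processed. They work by:
- Recording state for each source, keyed by job name and transformation_ctx (object modification timestamps for S3, bookmark key columns for JDBC)
- Processing only new or changed data on the next run
- Persisting that state inside the Glue service, not in a table you manage; it can be rewound with the ResetJobBookmark API
Bookmarks must be enabled on the job (--job-bookmark-option job-bookmark-enable), and the script must call job.commit() to save the updated state.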
Q3: When should you use Glue vs. EMR for data processing?
Answer:
- Glue: Serverless with no cluster to manage; suits scheduled batch ETL that runs for minutes to a few hours and needs little Spark tuning
- EMR: Long-running clusters, very large datasets, custom configurations
Glue is better for scheduled ETL. EMR is better for complex Spark workloads or when you need full cluster control.
Q4: How do you handle schema evolution in Glue?
Answer:
- SchemaChangePolicy: Configure UpdateBehavior and DeleteBehavior
- Crawlers: Recrawl to detect schema changes
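- Job-side catalog updates: Set enableUpdateCatalog and updateBehavior on the sink so new columns and partitions are registered when the job writes (job bookmarks track processed data, not schema)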
- DynamicFrame: Schema-flexible data format
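Whichever mechanism is used, it helps to see what actually changed between two versions of a table definition. The sketch below compares two column lists in the shape returned under `StorageDescriptor.Columns` by `get_table` (dicts with `Name` and `Type`); `diff_columns` is a hypothetical helper, and how the old and new lists are obtained is left to the caller.

```python
from typing import Dict, List


def diff_columns(old: List[dict], new: List[dict]) -> Dict[str, List[str]]:
    """Report columns that were added, removed, or changed type."""
    old_types = {c["Name"]: c["Type"] for c in old}
    new_types = {c["Name"]: c["Type"] for c in new}
    return {
        "added": sorted(new_types.keys() - old_types.keys()),
        "removed": sorted(old_types.keys() - new_types.keys()),
        "retyped": sorted(
            name for name in old_types.keys() & new_types.keys()
            if old_types[name] != new_types[name]
        ),
    }
```

Added columns are usually safe for downstream readers, while removed or retyped columns are the cases worth alerting on before a crawler or job overwrites the table definition.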
Q5: What is the maximum number of concurrent Glue jobs?
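Answer: There is no single fixed number. Concurrency is capped by Service Quotas (concurrent job runs per account and per job, and total DPUs), which vary by account and Region and can be raised with a quota increase request, so check the Service Quotas console rather than relying on a figure such as 1,000. Each job also has its own MaxConcurrentRuns setting, which defaults to 1. For large-scale deployments:
- Use job queues and scheduling
- Implement job dependencies via Step Functions
- Raise MaxConcurrentRuns on jobs that need parallel runs; using separate IAM roles does not add concurrency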
Cost Considerations
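| Component | Cost (US East list price) | Notes |
|---|---|---|
| Data Catalog storage | $1.00 per 100,000 objects per month | First million objects free |
| Data Catalog requests | $1.00 per million requests per month | First million requests free |
| Crawlers | $0.44 per DPU-hour | Billed per second, 10-minute minimum per run |
| ETL Jobs | $0.44 per DPU-hour | G.1X = 1 DPU; billed per second, 1-minute minimum on Glue 2.0 and later |
| Glue Studio | Included with Glue | No additional cost |
| Flex Execution | $0.29 per DPU-hour | About 34% below the standard rate; start time and runtime can vary |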
| Data Processing | No separate per-GB Glue fee | Covered by DPU-hours; S3 storage, requests, and data transfer are billed by those services |
Cost Warning: Glue costs can accumulate quickly with large datasets. Monitor DPU hours, optimize worker count, and use Flex Execution for non-urgent jobs.
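To judge whether Flex is worth the scheduling uncertainty, compare the two rates over a month's DPU-hours. The sketch below assumes list rates of $0.44 (standard) and $0.29 (Flex) per DPU-hour; both figures are assumptions to confirm against the current AWS Glue pricing page for your Region, and `flex_savings` is a hypothetical helper.

```python
def flex_savings(dpu_hours: float, standard_rate: float = 0.44,
                 flex_rate: float = 0.29) -> dict:
    """Compare standard and Flex cost for a given number of DPU-hours."""
    if dpu_hours <= 0:
        raise ValueError("dpu_hours must be positive")
    standard = dpu_hours * standard_rate
    flex = dpu_hours * flex_rate
    return {
        "standard_usd": round(standard, 2),
        "flex_usd": round(flex, 2),
        "savings_pct": round(100 * (standard - flex) / standard, 1),
    }
```

The percentage depends only on the two rates, so the decision usually comes down to whether the job can tolerate delayed starts rather than to job size.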
Summary
AWS Glue is AWS's serverless data integration service. Key takeaways:
- Crawlers: Automate schema discovery and cataloging
- Data Catalog: Central metadata store for all data assets
- ETL Jobs: PySpark/Scala scripts for data transformation
- Glue Studio: Visual editor for rapid job development
- Job Bookmarks: Enable incremental processing
- Flex Execution: Cost savings for non-urgent workloads
- Best Practices: Partition data, use Parquet, right-size workers