🔧 AWS Glue for Data Engineering

Master AWS Glue crawlers, ETL jobs, Data Catalog, Glue Studio, job bookmarks, and serverless data integration.


AWS Glue Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    AWS GLUE ARCHITECTURE                                      │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                    CONTROL PLANE (Serverless)                        │    │
│  │                                                                     │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌───────────┐ │    │
│  │  │   Glue      │  │   Glue      │  │   Glue      │  │   Glue    │ │    │
│  │  │  Console    │  │  API/SDK    │  │  Studio     │  │  CLI      │ │    │
│  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └─────┬─────┘ │    │
│  │         │                │                │               │        │    │
│  └─────────┼────────────────┼────────────────┼───────────────┼────────┘    │
│            │                │                │               │              │
│            ▼                ▼                ▼               ▼              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                    DATA CATALOG (Metadata Store)                     │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Databases                                                    │  │    │
│  │  │  └── Tables                                                   │  │    │
│  │  │      └── Columns                                              │  │    │
│  │  │      └── Partitions                                           │  │    │
│  │  │      └── Indexes                                              │  │    │
│  │  │      └── Table Properties                                     │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Crawlers                                                    │  │    │
│  │  │  • Discover schema from data stores                           │  │    │
│  │  │  • Create/update table definitions                            │  │    │
│  │  │  • Partition discovery                                        │  │    │
│  │  │  • Scheduled or event-triggered                               │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                │                                           │
│                                ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                    DATA PLANE (ETL Runtime)                          │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Glue ETL Jobs                                                │  │    │
│  │  │  • PySpark scripts                                            │  │    │
│  │  │  • Scala scripts                                              │  │    │
│  │  │  • Visual ETL (Glue Studio)                                   │  │    │
│  │  │  • Auto-scaling workers                                       │  │    │
│  │  │  • Job bookmarks (incremental processing)                     │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Worker Types                                                 │  │    │
│  │  │  • G.1X: 4 vCPU, 16 GB RAM, 1 DPU, $0.44/hr                  │  │    │
│  │  │  • G.2X: 8 vCPU, 32 GB RAM, 2 DPU, $0.88/hr                  │  │    │
│  │  │  • G.025X: 2 vCPU, 4 GB RAM, 0.25 DPU, $0.11/hr              │  │    │
│  │  │  • G.4X: 16 vCPU, 64 GB RAM, 4 DPU, $1.76/hr                 │  │    │
│  │  │  • G.8X: 32 vCPU, 128 GB RAM, 8 DPU, $3.52/hr                │  │    │
│  │  │  • Z.2X: 8 vCPU, 64 GB RAM, 2 M-DPU (Ray jobs only)          │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
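The worker list maps directly onto billing, because Glue charges per DPU-hour. The following is a minimal sketch of that arithmetic, assuming the DPU ratings G.025X = 0.25, G.1X = 1, G.2X = 2, G.4X = 4, G.8X = 8 and the $0.44 per DPU-hour rate used in this tutorial (rates vary by Region). The names DPU_PER_WORKER and estimate_job_cost are illustrative and not part of any AWS SDK, and billing minimums are ignored.

```python
# DPU rating per Spark worker type (assumed from the published worker specs).
DPU_PER_WORKER = {"G.025X": 0.25, "G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}


def estimate_job_cost(worker_type, num_workers, runtime_minutes, rate_per_dpu_hour=0.44):
    """Return the estimated charge in USD for one job run (no billing minimum applied)."""
    dpus = DPU_PER_WORKER[worker_type] * num_workers
    dpu_hours = dpus * runtime_minutes / 60
    return round(dpu_hours * rate_per_dpu_hour, 2)


# Ten G.1X workers for 30 minutes: 10 DPU * 0.5 h * $0.44
print(estimate_job_cost("G.1X", 10, 30))  # prints 2.2
```

Doubling the worker size doubles the DPU count, so a larger worker type only saves money when it shortens the runtime by more than half.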

Glue Crawler Configuration

Crawler for S3 Data Lake

import boto3

glue = boto3.client('glue')

# Create crawler configuration
crawler_config = {
    'Name': 's3-data-lake-crawler',
    'Role': 'arn:aws:iam::123456789012:role/GlueServiceRole',
    'DatabaseName': 'data_lake_db',
    'Description': 'Crawl S3 data lake for schema discovery',
    'Targets': {
        'S3Targets': [
            {
                'Path': 's3://data-lake-raw/landing/',
                'Exclusions': [
                    '_metadata.json',
                    '_SUCCESS',
                    '.*'
                ]
            },
            {
                'Path': 's3://data-lake-processed/silver/',
                'Exclusions': [
                    '_delta_log/*',
                    '_SUCCESS'
                ]
            }
        ]
    },
    'SchemaChangePolicy': {
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    },
    'RecrawlPolicy': {
        'RecrawlBehavior': 'CRAWL_EVERYTHING'
    },
    'LineageConfiguration': {
        'CrawlerLineageSettings': 'ENABLE'
    },
    'Schedule': 'cron(0 2 * * ? *)',  # Daily at 2 AM UTC
    'Configuration': '{"Version":1.0,"Grouping":{"TableGroupingPolicy":"CombineCompatibleSchemas"}}'
}

# Create the crawler (create_crawler returns an empty response on success)
glue.create_crawler(**crawler_config)
print(f"Crawler created: {crawler_config['Name']}")
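Creating a crawler does not run it. As a rough sketch of the next step, the helper below starts a crawler and polls until it returns to the READY state. It relies on the start_crawler and get_crawler calls of the boto3 Glue client; the function name run_crawler_and_wait, the polling interval, and the timeout are assumptions made for illustration. The client is passed in so the logic can be exercised without AWS access.

```python
import time


def run_crawler_and_wait(glue, name, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Start a crawler and block until it is READY again.

    `glue` is a boto3 Glue client (or any object exposing start_crawler and
    get_crawler with the same response shapes). Returns the status of the
    last crawl, for example "SUCCEEDED" or "FAILED".
    """
    glue.start_crawler(Name=name)
    waited = 0
    while waited < timeout_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        crawler = glue.get_crawler(Name=name)["Crawler"]
        if crawler["State"] == "READY":  # other states: RUNNING, STOPPING
            return crawler.get("LastCrawl", {}).get("Status")
    raise TimeoutError(f"Crawler {name} did not finish within {timeout_seconds}s")
```

With real credentials the call would look like run_crawler_and_wait(boto3.client('glue'), 's3-data-lake-crawler').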

Crawler for Relational Database

# Crawler for RDS/Redshift
db_crawler_config = {
    'Name': 'rds-source-crawler',
    'Role': 'arn:aws:iam::123456789012:role/GlueServiceRole',
    'DatabaseName': 'rds_catalog_db',
    'Targets': {
        'JdbcTargets': [
            {
                'ConnectionName': 'rds-mysql-connection',
                'Path': 'production_db/customers',
                'Exclusions': ['temp_*', 'backup_*']
            },
            {
                'ConnectionName': 'rds-mysql-connection',
                'Path': 'production_db/transactions',
                'Exclusions': []
            }
        ]
    },
    'SchemaChangePolicy': {
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'DEPRECATE_IN_DATABASE'
    },
    'Configuration': '{"Version":1.0,"Grouping":{"TableGroupingPolicy":"CombineCompatibleSchemas"}}'
}

response = glue.create_crawler(**db_crawler_config)
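Both crawler definitions pass Configuration as a hand-written JSON string, which is easy to break with a stray quote. One option, sketched here, is to build the string from a Python dict with json.dumps. The helper name crawler_configuration is hypothetical; the keys are the same ones used in the configs above.

```python
import json


def crawler_configuration(table_grouping_policy="CombineCompatibleSchemas"):
    """Serialize the crawler Configuration field from a Python dict."""
    return json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": table_grouping_policy},
    })


print(crawler_configuration())
```

The returned string can be assigned to the 'Configuration' key of either crawler config.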

ℹ️

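Pro Tip: Use Exclusions in crawlers to skip files that carry no table data. Exclusions are glob patterns evaluated relative to the include path. Skipping marker and temporary files shortens crawler runtime and keeps stray tables out of the catalog. Common exclusions: _SUCCESS, _metadata.json, .git, *.tmp.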

Glue ETL Job Development

PySpark ETL Script

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from Data Catalog
source_df = glueContext.create_dynamic_frame.from_catalog(
    database="data_lake_db",
    table_name="raw_transactions",
    transformation_ctx="source_df"
)

# Convert to Spark DataFrame for complex transformations
spark_df = source_df.toDF()

# Apply transformations
transformed_df = spark_df \
    .filter(F.col("amount") > 0) \
    .withColumn("processed_date", F.current_date()) \
    .withColumn("year", F.year(F.col("transaction_date"))) \
    .withColumn("month", F.month(F.col("transaction_date"))) \
    .withColumn("day", F.dayofmonth(F.col("transaction_date"))) \
    .withColumn("amount_category", 
        F.when(F.col("amount") < 100, "small")
         .when(F.col("amount") < 1000, "medium")
         .otherwise("large")) \
    .dropDuplicates(["transaction_id"]) \
    .fillna({"status": "unknown", "category": "uncategorized"})

# Convert back to DynamicFrame
dynamic_frame = DynamicFrame.fromDF(transformed_df, glueContext, "transformed_df")

# Write to the processed zone with partitioning and update the Data Catalog
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://data-lake-processed/silver/transactions/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year", "month", "day"],
    transformation_ctx="write_output"
)
sink.setFormat("glueparquet", compression="snappy")
sink.setCatalogInfo(
    catalogDatabase="processed_db",
    catalogTableName="silver_transactions"
)
sink.writeFrame(dynamic_frame)

# Commit job
job.commit()
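The column logic in the script above can be checked without a Spark cluster. Below is a plain-Python mirror of two pieces, the amount_category rule and the partition layout that year/month/day produce, offered as a sketch for unit tests rather than as part of the job itself. The function names are illustrative.

```python
from datetime import date


def amount_category(amount):
    """Mirror of the F.when chain: below 100 is small, below 1000 is medium, else large."""
    if amount < 100:
        return "small"
    if amount < 1000:
        return "medium"
    return "large"


def partition_prefix(transaction_date):
    """Hive-style prefix produced by partitionKeys year/month/day.

    F.year, F.month and F.dayofmonth return integers, so the values are not
    zero-padded.
    """
    d = transaction_date
    return f"year={d.year}/month={d.month}/day={d.day}/"


print(amount_category(99.99), amount_category(100), amount_category(1000))  # small medium large
print(partition_prefix(date(2024, 1, 15)))  # year=2024/month=1/day=15/
```

If zero-padded partition values are needed, the columns would have to be formatted as strings (for example with F.date_format) before writing.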

Glue Job with Job Bookmarks

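import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Bookmarks are enabled on the job, not in the script: set the job parameter
#   --job-bookmark-option job-bookmark-enable
# The script then needs job.init(), a transformation_ctx on every source and
# sink, and job.commit() at the end.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from the catalog; the bookmark state is keyed by transformation_ctx
source = glueContext.create_dynamic_frame.from_catalog(
    database="data_lake_db",
    table_name="raw_events",
    transformation_ctx="source",
    additional_options={
        "groupFiles": "inPartition",
        "groupSize": "1048576"  # 1 MB, in bytes
    }
)

# With bookmarks enabled, `source` holds only data not seen by earlier runs.
# The identity map is a placeholder for real per-record logic.
processed = source.map(
    lambda x: x,
    transformation_ctx="processed"
)

# Write the new data
glueContext.write_dynamic_frame.from_options(
    frame=processed,
    connection_type="s3",
    connection_options={
        "path": "s3://data-lake-processed/silver/events/",
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet",
    transformation_ctx="write_output"
)

# Commit so the bookmark advances; without it the next run reprocesses the same data
job.commit()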
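The script alone does not switch bookmarks on; that happens through the job parameter --job-bookmark-option. A small sketch of building the arguments for a run follows. The helper name job_run_arguments is made up for illustration, while the three option values are the ones Glue accepts.

```python
VALID_BOOKMARK_OPTIONS = {
    "job-bookmark-enable",   # track state and process only new data
    "job-bookmark-disable",  # always process the full dataset
    "job-bookmark-pause",    # process new data without advancing the bookmark
}


def job_run_arguments(bookmark_option="job-bookmark-enable", extra=None):
    """Build the Arguments mapping for a job run with the bookmark option set."""
    if bookmark_option not in VALID_BOOKMARK_OPTIONS:
        raise ValueError(f"Unknown bookmark option: {bookmark_option}")
    arguments = {"--job-bookmark-option": bookmark_option}
    arguments.update(extra or {})
    return arguments


run_args = job_run_arguments(extra={"--enable-metrics": "true"})
print(run_args)
```

The resulting dict can be passed as Arguments to start_job_run, or as DefaultArguments when the job is created, so that every run inherits the setting.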

Glue Studio Visual ETL

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    GLUE STUDIO VISUAL ETL                                      │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  VISUAL EDITOR                                                      │    │
│  │                                                                     │    │
│  │  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    │    │
│  │  │  Source   │───►│ Transform│───►│ Transform│───►│   Sink   │    │    │
│  │  │  (S3)    │    │ (Filter) │    │ (Join)   │    │ (S3)     │    │    │
│  │  └──────────┘    └──────────┘    └──────────┘    └──────────┘    │    │
│  │       │               │               │               │           │    │
│  │       ▼               ▼               ▼               ▼           │    │
│  │  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    │    │
│  │  │  Source   │───►│ Transform│───►│ Transform│───►│   Sink   │    │    │
│  │  │  (RDS)   │    │  (Agg.)  │    │ (Rename) │    │(Redshift)│    │    │
│  │  └──────────┘    └──────────┘    └──────────┘    └──────────┘    │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  GENERATED CODE (PySpark)                                            │    │
│  │                                                                     │    │
│  │  # Source: S3                                                       │    │
│  │  source0 = glueContext.create_dynamic_frame.from_options(           │    │
│  │      connection_type="s3",                                          │    │
│  │      connection_options={"paths": ["s3://bucket/data/"]},           │    │
│  │      format="parquet"                                               │    │
│  │  )                                                                  │    │
│  │                                                                     │    │
│  │  # Transform: Filter                                                │    │
│  │  filter1 = Filter.apply(frame=source0, f=lambda x: x["amount"] > 0)│    │
│  │                                                                     │    │
│  │  # Transform: Join                                                  │    │
│  │  join1 = Join.apply(filter1, source1, keys1=["id"], keys2=["id"])   │    │
│  │                                                                     │    │
│  │  # Sink: S3                                                         │    │
│  │  glueContext.write_dynamic_frame.from_options(                      │    │
│  │      frame=join1,                                                   │    │
│  │      connection_type="s3",                                          │    │
│  │      connection_options={"path": "s3://output/processed/"},         │    │
│  │      format="parquet"                                               │    │
│  │  )                                                                  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

ℹ️

Glue Studio Benefits: Visual editor generates PySpark code automatically. Great for rapid prototyping and team collaboration. Code can be exported and customized.

Glue Data Catalog Deep Dive

Catalog Structure

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    GLUE DATA CATALOG STRUCTURE                               │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  AWS Account                                                       │    │
│  │  └── Data Catalog (one per account per Region)                     │    │
│  │                                                                     │    │
│  │      ┌─────────────────────────────────────────────────────────┐   │    │
│  │      │  Database: data_lake_db                                  │   │    │
│  │      │                                                         │   │    │
│  │      │  ┌───────────────────────────────────────────────────┐  │   │    │
│  │      │  │  Table: raw_transactions                           │  │   │    │
│  │      │  │                                                   │  │   │    │
│  │      │  │  Columns:                                        │  │   │    │
│  │      │  │  ├── transaction_id (string)                     │  │   │    │
│  │      │  │  ├── customer_id (string)                        │  │   │    │
│  │      │  │  ├── amount (double)                             │  │   │    │
│  │      │  │  ├── status (string)                             │  │   │    │
│  │      │  │  └── created_at (timestamp)                      │  │   │    │
│  │      │  │                                                   │  │   │    │
│  │      │  │  Partitions:                                     │  │   │    │
│  │      │  │  ├── year=2024/month=01/day=15                   │  │   │    │
│  │      │  │  ├── year=2024/month=01/day=16                   │  │   │    │
│  │      │  │  └── year=2024/month=01/day=17                   │  │   │    │
│  │      │  │                                                   │  │   │    │
│  │      │  │  Table Properties:                               │  │   │    │
│  │      │  │  ├── classification: parquet                      │  │   │    │
│  │      │  │  ├── compressionType: snappy                      │  │   │    │
│  │      │  │  └── parquetOutputFormat: org.apache.hadoop...    │  │   │    │
│  │      │  └───────────────────────────────────────────────────┘  │   │    │
│  │      │                                                         │   │    │
│  │      │  ┌───────────────────────────────────────────────────┐  │   │    │
│  │      │  │  Table: silver_customers                           │  │   │    │
│  │      │  │  ... (similar structure)                          │  │   │    │
│  │      │  └───────────────────────────────────────────────────┘  │   │    │
│  │      └─────────────────────────────────────────────────────────┘   │    │
│  │                                                                     │    │
│  │      ┌─────────────────────────────────────────────────────────┐   │    │
│  │      │  Database: analytics_db                                  │   │    │
│  │      │  ...                                                     │   │    │
│  │      └─────────────────────────────────────────────────────────┘   │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

Querying Data Catalog

# Query catalog metadata
import boto3

glue = boto3.client('glue')

# List databases (paginated, so large catalogs are not truncated)
for page in glue.get_paginator('get_databases').paginate():
    for db in page['DatabaseList']:
        print(f"Database: {db['Name']}")

# List tables in a database
for page in glue.get_paginator('get_tables').paginate(DatabaseName='data_lake_db'):
    for table in page['TableList']:
        storage = table.get('StorageDescriptor', {})
        print(f"Table: {table['Name']}")
        print(f"  Location: {storage.get('Location')}")
        print(f"  Format: {storage.get('SerdeInfo', {}).get('SerializationLibrary')}")

# Get table schema (partition key columns are listed separately under PartitionKeys)
table = glue.get_table(DatabaseName='data_lake_db', Name='raw_transactions')['Table']
for col in table['StorageDescriptor']['Columns'] + table.get('PartitionKeys', []):
    print(f"  {col['Name']}: {col['Type']}")

# Get partitions
paginator = glue.get_paginator('get_partitions')
for page in paginator.paginate(DatabaseName='data_lake_db', TableName='raw_transactions'):
    for partition in page['Partitions']:
        print(f"Partition: {partition['Values']}")
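A partition's Values list carries no key names; those live in the table's PartitionKeys. The sketch below pairs the two, using hand-built dicts shaped like the get_table and get_partitions responses so it runs without AWS access. The helper names are illustrative.

```python
def partition_spec(table, partition):
    """Pair a table's partition key names with one partition's values."""
    keys = [key["Name"] for key in table.get("PartitionKeys", [])]
    return dict(zip(keys, partition["Values"]))


def partition_path(table, partition):
    """Render the Hive-style path fragment for one partition."""
    return "/".join(f"{k}={v}" for k, v in partition_spec(table, partition).items())


# Stand-ins shaped like get_table(...)['Table'] and one entry of
# get_partitions(...)['Partitions'].
table = {
    "Name": "raw_transactions",
    "PartitionKeys": [
        {"Name": "year", "Type": "string"},
        {"Name": "month", "Type": "string"},
        {"Name": "day", "Type": "string"},
    ],
}
partition = {"Values": ["2024", "01", "15"]}

print(partition_spec(table, partition))
print(partition_path(table, partition))  # year=2024/month=01/day=15
```

The rendered path is only a convenience for display; the authoritative location of a partition is the Location field in its own StorageDescriptor.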

Glue Best Practices

ℹ️

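Pro Tip: Use Glue Flex Execution for non-urgent jobs. It runs on spare capacity at a lower rate ($0.29 instead of $0.44 per DPU-hour in us-east-1, about 34% less) in exchange for variable start times and runtimes. Great for batch ETL jobs without a tight deadline.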

Performance Optimization

Optimization	Technique	Benefit
Partitioning	Use date-based partition keys	Reduce data scanned
File Format	Use Parquet/ORC	Columnar, compressed
Compression	Use Snappy/Zstd	Faster I/O
Worker Count	Right-size based on data	Optimize cost/speed
Shuffle	Repartition before joins	Reduce shuffle spill
Caching	Cache frequently accessed data	Reduce redundant reads
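The partitioning row pays off only when reads actually prune partitions. One way to do that in Glue is the push_down_predicate argument of create_dynamic_frame.from_catalog, which takes a Spark SQL expression over the partition columns. The sketch below builds such an expression for a single day. It assumes string partition columns named year, month and day with zero-padded values, as in the catalog diagram in this tutorial; the helper name is illustrative.

```python
from datetime import date


def partition_predicate(day):
    """Build a predicate over year/month/day partition columns for one date."""
    return (
        f"year == '{day.year}' and month == '{day.month:02d}' "
        f"and day == '{day.day:02d}'"
    )


print(partition_predicate(date(2024, 1, 15)))
# year == '2024' and month == '01' and day == '15'
```

Passing the result as push_down_predicate means only the matching partitions are listed and read, instead of filtering after the whole table has been loaded.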

Cost Optimization

Cost Factor	Optimization
Worker Type	Use G.025X for low-volume streaming jobs
Flex Execution	Use for non-urgent jobs
Auto Scaling	Enable for variable workloads
Job Bookmarks	Process only new data
Data Partitioning	Reduce data scanned

Interview Questions & Answers

Q1: What is the difference between AWS Glue and AWS Glue Studio?

Answer:

AWS Glue: Service providing ETL capabilities, Data Catalog, crawlers
AWS Glue Studio: Visual editor for building ETL jobs (part of Glue)

Glue Studio generates PySpark code automatically from visual workflows. It's ideal for teams with mixed technical skills.

Q2: How do job bookmarks work in AWS Glue?

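Answer: Job bookmarks track which data a job has already processed, so the next run picks up only what is new. They work by:

Recording the state of processed data per source (for S3, object last-modified timestamps; for JDBC, bookmark key values)
On the next run, processing only new or changed data
Persisting that state inside the Glue service, keyed by job and by each transformation_ctx

Enable bookmarks with the job parameter --job-bookmark-option job-bookmark-enable, and call job.commit() at the end of the script to update the bookmark state.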
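To make the idea concrete, here is a deliberately simplified model of timestamp-based incremental selection. It is not Glue's actual implementation, which tracks more state than a single timestamp; the object keys and the function name select_new_objects are made up for illustration.

```python
from datetime import datetime, timezone


def select_new_objects(objects, bookmark=None):
    """Return the objects modified after the bookmark, plus the advanced bookmark."""
    new = [obj for obj in objects if bookmark is None or obj["LastModified"] > bookmark]
    next_bookmark = max((obj["LastModified"] for obj in new), default=bookmark)
    return new, next_bookmark


def ts(day, hour):
    """Shorthand for a UTC timestamp in January 2024."""
    return datetime(2024, 1, day, hour, tzinfo=timezone.utc)


listing = [
    {"Key": "landing/events/part-0001.json", "LastModified": ts(15, 1)},
    {"Key": "landing/events/part-0002.json", "LastModified": ts(15, 2)},
]

# First run: no bookmark yet, so everything is new.
first_run, bookmark = select_new_objects(listing)

# A new file lands, then the job runs again with the saved bookmark.
listing.append({"Key": "landing/events/part-0003.json", "LastModified": ts(16, 1)})
second_run, bookmark = select_new_objects(listing, bookmark)

print([obj["Key"] for obj in second_run])  # ['landing/events/part-0003.json']
```

The step that stores the returned bookmark corresponds to job.commit(): if it is skipped, the next run starts from the old bookmark and sees the same files again.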

Q3: When should you use Glue vs. EMR for data processing?

Answer:

Glue: Serverless and billed per DPU-hour, with no cluster to manage; a good fit for scheduled batch ETL and catalog-driven pipelines
EMR: Clusters you size and tune yourself; a good fit for very large or long-running workloads and custom configurations

Glue is better for scheduled ETL. EMR is better for complex Spark workloads or when you need full cluster control.

Q4: How do you handle schema evolution in Glue?

Answer:

SchemaChangePolicy: Configure UpdateBehavior and DeleteBehavior
Crawlers: Recrawl to detect schema changes
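Catalog updates from jobs: Set enableUpdateCatalog and updateBehavior on the sink so new columns and partitions reach the catalog without a recrawl
DynamicFrame: Schema-flexible data format; use resolveChoice when a column arrives with mixed types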
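Before letting a crawler or job update a table, it can help to see exactly what changed. The sketch below compares two column lists in the shape the Data Catalog returns them (dicts with Name and Type) and reports added, removed and retyped columns. The function name diff_columns and the sample columns are illustrative.

```python
def diff_columns(old_columns, new_columns):
    """Compare two catalog column lists and summarize the schema change."""
    old = {c["Name"]: c["Type"] for c in old_columns}
    new = {c["Name"]: c["Type"] for c in new_columns}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


before = [
    {"Name": "transaction_id", "Type": "string"},
    {"Name": "amount", "Type": "double"},
    {"Name": "status", "Type": "string"},
]
after = [
    {"Name": "transaction_id", "Type": "string"},
    {"Name": "amount", "Type": "string"},
    {"Name": "category", "Type": "string"},
]

print(diff_columns(before, after))
# {'added': ['category'], 'removed': ['status'], 'retyped': ['amount']}
```

Added columns are usually safe to accept, while removed or retyped columns are the ones worth reviewing before downstream queries break.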

Q5: What is the maximum number of concurrent Glue jobs?

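Answer: Concurrency is governed by service quotas rather than one fixed number. There is a limit on concurrent job runs per account and another per job; both vary by account and Region and can be raised through Service Quotas, so check the Service Quotas console for current values. Each job also has its own MaxConcurrentRuns setting, which defaults to 1. For large-scale deployments:

Use job queues and scheduling
Implement job dependencies via Step Functions
Raise MaxConcurrentRuns on jobs that must run in parallel with themselves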
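Whatever orchestrator enforces the dependencies, the underlying question is which jobs may run at the same time. A minimal sketch using Python's standard graphlib module (Python 3.9+) groups jobs into waves that can run in parallel. The job names are hypothetical, and this only illustrates the ordering, not a Step Functions definition.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each key maps to the set of jobs it depends on.
dependencies = {
    "load_silver_transactions": {"crawl_raw"},
    "load_silver_customers": {"crawl_raw"},
    "build_gold_report": {"load_silver_transactions", "load_silver_customers"},
}


def execution_waves(deps):
    """Group jobs into waves; jobs within one wave have no mutual dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(execution_waves(dependencies))
# [['crawl_raw'], ['load_silver_customers', 'load_silver_transactions'], ['build_gold_report']]
```

The size of the largest wave is the peak concurrency the pipeline needs, which is the figure to compare against the account's concurrent-run quota.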

Cost Considerations

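Component	Cost	Notes
Data Catalog	$1.00 per 100,000 objects stored per month	First million objects free; requests are $1.00 per million after the first million each month
Crawlers	$0.44 per DPU-hour	Billed per second, 10-minute minimum per run
ETL Jobs	$0.44 per DPU-hour	G.1X = 1 DPU; billed per second, 1-minute minimum on Glue 2.0 and later
Glue Studio	Included with Glue	No additional cost
Flex Execution	$0.29 per DPU-hour	About 34% below the standard rate; start time and runtime can vary
Other AWS services	Billed separately	S3 storage and requests, data transfer, CloudWatch logs

Prices are us-east-1 list prices and vary by Region.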

⚠️

Cost Warning: Glue costs can accumulate quickly with large datasets. Monitor DPU hours, optimize worker count, and use Flex Execution for non-urgent jobs.
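To put numbers on the warning, here is a back-of-the-envelope sketch comparing standard and Flex execution and estimating Data Catalog storage. It assumes us-east-1 list prices of $0.44 (standard) and $0.29 (Flex) per DPU-hour, and catalog storage at $1.00 per 100,000 objects beyond the first million, prorated linearly for simplicity. The function names are illustrative.

```python
def flex_comparison(dpu_hours, standard_rate=0.44, flex_rate=0.29):
    """Return (standard cost, Flex cost, percent saved) for a monthly DPU-hour total."""
    standard = round(dpu_hours * standard_rate, 2)
    flex = round(dpu_hours * flex_rate, 2)
    percent_saved = round(100 * (1 - flex_rate / standard_rate), 1)
    return standard, flex, percent_saved


def catalog_storage_cost(objects, free_objects=1_000_000, price_per_100k=1.00):
    """Monthly storage charge for catalog objects (tables, partitions, and so on)."""
    billable = max(0, objects - free_objects)
    return round(billable / 100_000 * price_per_100k, 2)


print(flex_comparison(100))             # (44.0, 29.0, 34.1)
print(catalog_storage_cost(1_500_000))  # 5.0
```

Heavily partitioned tables are what usually push a catalog past the free tier, since every partition counts as an object.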

Summary

AWS Glue is the cornerstone of serverless data integration. Key takeaways:

Crawlers: Automate schema discovery and cataloging
Data Catalog: Central metadata store for all data assets
ETL Jobs: PySpark/Scala scripts for data transformation
Glue Studio: Visual editor for rapid job development
Job Bookmarks: Enable incremental processing
Flex Execution: Cost savings for non-urgent workloads
Best Practices: Partition data, use Parquet, right-size workers
