Glue Data Catalog
Deep dive into Glue Data Catalog partitions, indexes, and Lake Formation.
Module: AWS Data Engineering • Topic 39 of 65
Catalog Structure
Architecture Diagram
GLUE DATA CATALOG DEEP DIVE

Account → Region → Catalog → Database → Table → Columns/Partitions

Partition Management:
• Partition by date (year/month/day) for time-series data
• Partition pruning reduces data scanned in queries
• Batch create/delete partitions for efficiency

Statistics & Indexes:
• Table statistics for query optimization
• Column statistics for data profiling
• Stored in catalog for Athena/Spectrum
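To make the year/month/day layout concrete, here is a minimal sketch that derives the partition values and the matching S3 prefix for a single day. The helper name `partition_for` and the `table_root` argument are illustrative assumptions, not part of any AWS API; the bucket path mirrors the one used in this lesson.

```python
from datetime import date


def partition_for(day: date, table_root: str) -> dict:
    """Return Hive-style partition values and the matching S3 prefix for one day."""
    # Zero-pad month and day so the values match the S3 layout and sort correctly.
    values = [f"{day.year:04d}", f"{day.month:02d}", f"{day.day:02d}"]
    location = (
        f"{table_root.rstrip('/')}/"
        f"year={values[0]}/month={values[1]}/day={values[2]}/"
    )
    return {"Values": values, "Location": location}


example = partition_for(date(2024, 1, 15), "s3://data/events")
print(example["Values"])    # ['2024', '01', '15']
print(example["Location"])  # s3://data/events/year=2024/month=01/day=15/
```

The order of `Values` has to follow the order of the partition keys declared on the table, which is assumed here to be year, month, day.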
Partition Management
import boto3

glue = boto3.client('glue')

# Batch create partitions
glue.batch_create_partition(
    DatabaseName='analytics_db',
    TableName='events',
    PartitionInputList=[
        {
            # Values follow the order of the table's partition keys (year, month, day)
            'Values': ['2024', '01', '15'],
            'StorageDescriptor': {
                'Columns': [{'Name': 'event_id', 'Type': 'string'}],
                'Location': 's3://data/events/year=2024/month=01/day=15/',
                'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
                'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
                'SerdeInfo': {'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'}
            }
        }
    ]
)

# Get partitions with expression.
# Assumes the partition keys are typed as string, so the values are quoted.
# Results are paginated, so use a paginator instead of a single call.
paginator = glue.get_paginator('get_partitions')
partitions = [
    partition
    for page in paginator.paginate(
        DatabaseName='analytics_db',
        TableName='events',
        Expression="year='2024' AND month='01'"
    )
    for partition in page['Partitions']
]
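BatchCreatePartition accepts at most 100 partitions per request, and it reports per-partition failures in the response instead of raising. The sketch below shows one way to handle both. The helper names `chunked` and `create_partitions` are hypothetical, and `glue` is assumed to be a boto3 Glue client passed in by the caller.

```python
from typing import Iterator

MAX_BATCH = 100  # BatchCreatePartition accepts at most 100 partitions per request


def chunked(items: list, size: int = MAX_BATCH) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def create_partitions(glue, database: str, table: str, partition_inputs: list) -> list:
    """Register partitions in batches and return any per-partition errors."""
    errors = []
    for batch in chunked(partition_inputs):
        response = glue.batch_create_partition(
            DatabaseName=database,
            TableName=table,
            PartitionInputList=batch,
        )
        # The request can succeed while individual partitions fail
        # (for example, because they already exist), so collect "Errors".
        errors.extend(response.get("Errors", []))
    return errors


# Usage (requires AWS credentials and boto3):
#   import boto3
#   failed = create_partitions(boto3.client("glue"), "analytics_db", "events", inputs)
```

Returning the errors rather than raising lets the caller decide whether an already-existing partition counts as a failure.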
Interview Q&A
Q1: How does partition pruning work?
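Answer: Each partition is registered in the catalog with its key values and S3 location. When a query filters on a partition key (for example WHERE year = '2024'), Athena or Redshift Spectrum reads only the locations of the matching partitions and skips the rest, which reduces both the data scanned and the query cost.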
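The effect of pruning can be illustrated with a toy model in plain Python. This is only a simulation of the idea, not how Athena or Spectrum are implemented; the `Partition` class, the `prune` function, and the byte sizes are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    year: str
    month: str
    day: str
    size_bytes: int


def prune(partitions, **filters):
    """Keep only partitions whose key values match every filter."""
    return [
        p for p in partitions
        if all(getattr(p, key) == value for key, value in filters.items())
    ]


partitions = [
    Partition("2023", "12", "31", 500),
    Partition("2024", "01", "14", 700),
    Partition("2024", "01", "15", 800),
]

selected = prune(partitions, year="2024")
scanned = sum(p.size_bytes for p in selected)
total = sum(p.size_bytes for p in partitions)
print(f"scanned {scanned} of {total} bytes")  # scanned 1500 of 2000 bytes
```

The filter is evaluated against partition metadata only, so the 2023 data is excluded before any file is read.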
Q2: What statistics does the catalog store?
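Answer: Table-level figures such as row count and table size, plus per-column statistics such as min/max values, null counts, and number of distinct values. Athena and Redshift Spectrum can use them for query optimization.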
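As a rough sketch of how these statistics can be read back, the function below calls `get_column_statistics_for_table` and extracts the null count per column. It assumes statistics have already been generated for the table, that `glue` is a boto3 Glue client supplied by the caller, and that the type-specific payload key follows the pattern `<Type>ColumnStatisticsData` (for example `LongColumnStatisticsData`). The function name `null_counts` is hypothetical.

```python
def null_counts(glue, database: str, table: str, columns: list) -> dict:
    """Return {column_name: NumberOfNulls} from the catalog's column statistics."""
    response = glue.get_column_statistics_for_table(
        DatabaseName=database,
        TableName=table,
        ColumnNames=columns,
    )
    counts = {}
    for stat in response.get("ColumnStatisticsList", []):
        data = stat["StatisticsData"]
        # "Type" is e.g. "LONG" or "STRING"; the details sit under a matching
        # key such as "LongColumnStatisticsData" (assumed naming pattern).
        details = data.get(f"{data['Type'].capitalize()}ColumnStatisticsData", {})
        counts[stat["ColumnName"]] = details.get("NumberOfNulls")
    return counts


# Usage (requires AWS credentials and boto3):
#   import boto3
#   print(null_counts(boto3.client("glue"), "analytics_db", "events", ["event_id"]))
```

Columns without stored statistics are simply absent from the result.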
Q3: How many partitions can a table have?
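Answer: The default Glue quota is 10 million partitions per table (20 million per account). Well before that limit, a very large number of partitions slows down query planning and catalog calls, so choose a partition granularity that matches how the data is queried.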
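A quick back-of-the-envelope calculation helps when choosing partition granularity. The sketch below counts partitions for an inclusive date range; the function name `estimate_partitions` and the `per_day` parameter are illustrative assumptions.

```python
from datetime import date


def estimate_partitions(start: date, end: date, per_day: int = 1) -> int:
    """Count partitions for an inclusive date range.

    per_day is the number of partitions created for each day, for example
    24 for an extra hour key, or the number of distinct values of another key.
    """
    days = (end - start).days + 1
    return days * per_day


daily = estimate_partitions(date(2020, 1, 1), date(2024, 12, 31))
hourly = estimate_partitions(date(2020, 1, 1), date(2024, 12, 31), per_day=24)
print(daily, hourly)  # 1827 43848
```

Adding a high-cardinality key multiplies the count: with 1,000 distinct values per day, the same five years would already produce 1,827,000 partitions.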
Summary
- Partitions: Organize data for efficient querying; default quota of 10M per table (20M per account)
- Statistics: Row counts, column stats for query optimization
- Crawlers: Automatic schema and partition discovery
- Lake Formation: Fine-grained permissions on catalog objects
- Performance: Partition pruning reduces data scanned