Data Catalog on AWS
Master Glue Data Catalog, Lake Formation permissions, and metadata management.
Module: AWS Data Engineering • Topic 25 of 65 • Premium Content
Data Catalog Architecture
┌───────────────────────────────────────────────────────────────────────┐
│                    GLUE DATA CATALOG ARCHITECTURE                     │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │  DATA CATALOG (1 per account per region)                        │  │
│  │                                                                 │  │
│  │  ┌───────────────────────────────────────────────────────────┐  │  │
│  │  │ Database: data_lake_db                                    │  │  │
│  │  │ ├── Table: raw_sales (Schema discovered by Crawler)       │  │  │
│  │  │ │   ├── Columns: sale_id, customer_id, amount, sale_date  │  │  │
│  │  │ │   ├── Partitions: year, month, day                      │  │  │
│  │  │ │   └── Location: s3://data-lake/raw/sales/               │  │  │
│  │  │ ├── Table: silver_customers                               │  │  │
│  │  │ │   └── ...                                               │  │  │
│  │  │ └── Table: gold_daily_metrics                             │  │  │
│  │  │     └── ...                                               │  │  │
│  │  └───────────────────────────────────────────────────────────┘  │  │
│  │                                                                 │  │
│  │  ┌───────────────────────────────────────────────────────────┐  │  │
│  │  │ Database: analytics_db                                    │  │  │
│  │  │ ├── Table: customer_360                                   │  │  │
│  │  │ └── Table: product_performance                            │  │  │
│  │  └───────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────┬────────────────────────────────┘  │
│                                   │                                   │
│           ┌───────────────────────┼───────────────────────┐           │
│           ▼                       ▼                       ▼           │
│  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐  │
│  │     Athena      │     │    Redshift     │     │    EMR/Glue     │  │
│  │    (Queries)    │     │    Spectrum     │     │    ETL Jobs     │  │
│  └─────────────────┘     └─────────────────┘     └─────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
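The hierarchy in the diagram (catalog, then databases, then tables) can be walked through the Glue API. The following is a minimal sketch, assuming a boto3 Glue client with read access to the catalog; the helper name `list_catalog_tables` is hypothetical and not part of boto3.

```python
def list_catalog_tables(glue):
    """Return {database_name: [table_name, ...]} for the client's catalog."""
    catalog = {}
    # get_databases and get_tables are paginated, so iterate with paginators
    for db_page in glue.get_paginator('get_databases').paginate():
        for db in db_page['DatabaseList']:
            tables = []
            table_pages = glue.get_paginator('get_tables').paginate(
                DatabaseName=db['Name']
            )
            for table_page in table_pages:
                tables.extend(t['Name'] for t in table_page['TableList'])
            catalog[db['Name']] = tables
    return catalog


# Usage (requires AWS credentials):
#   import boto3
#   for database, tables in list_catalog_tables(boto3.client('glue')).items():
#       print(database, tables)
```

Because the client is created for one account and one region, the result covers exactly one Data Catalog.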
Catalog Operations
import boto3

glue = boto3.client('glue')

# Create database
glue.create_database(
    DatabaseInput={
        'Name': 'analytics_db',
        'Description': 'Analytics database for business intelligence',
        'LocationUri': 's3://data-lake/analytics/',
        'Parameters': {
            'classification': 'managed',
            'created_by': 'data-engineering'
        }
    }
)

# Create table
glue.create_table(
    DatabaseName='analytics_db',
    TableInput={
        'Name': 'daily_sales_summary',
        'Description': 'Daily aggregated sales metrics',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'sale_date', 'Type': 'date', 'Comment': 'Transaction date'},
                {'Name': 'total_revenue', 'Type': 'decimal(12,2)', 'Comment': 'Total revenue'},
                {'Name': 'order_count', 'Type': 'int', 'Comment': 'Number of orders'},
                {'Name': 'avg_order_value', 'Type': 'decimal(10,2)', 'Comment': 'Average order value'}
            ],
            'Location': 's3://data-lake/gold/daily-sales/',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            },
            'Compressed': True,
            'NumberOfBuckets': -1
        },
        'PartitionKeys': [
            {'Name': 'year', 'Type': 'int'},
            {'Name': 'month', 'Type': 'int'}
        ],
        'TableType': 'EXTERNAL_TABLE',
        'Parameters': {
            'classification': 'parquet',
            'compressionType': 'snappy',
            'parquetOutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.ParquetOutputFormat'
        }
    }
)

# Get table metadata
table = glue.get_table(
    DatabaseName='analytics_db',
    Name='daily_sales_summary'
)
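# Update the table definition.
# get_table returns read-only fields (for example DatabaseName, CreateTime)
# that TableInput rejects, so copy only keys that update_table accepts.
TABLE_INPUT_KEYS = {
    'Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
    'PartitionKeys', 'TableType', 'Parameters'
}
table_input = {k: v for k, v in table['Table'].items() if k in TABLE_INPUT_KEYS}
# Modify table_input here (for example, add a column) before sending it back.
glue.update_table(
    DatabaseName='analytics_db',
    TableInput=table_input
)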
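Most of the create_table call above is Parquet boilerplate that repeats for every table. A small builder can keep it in one place. This is a sketch only: `parquet_table_input` is a hypothetical helper, and the format and SerDe class names are the ones already used in the example above.

```python
PARQUET_INPUT = 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
PARQUET_OUTPUT = 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
PARQUET_SERDE = 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'


def parquet_table_input(name, location, columns, partition_keys=()):
    """Build a TableInput dict for an external Parquet table.

    columns and partition_keys are sequences of (name, type) pairs.
    """
    def as_columns(pairs):
        return [{'Name': col_name, 'Type': col_type} for col_name, col_type in pairs]

    return {
        'Name': name,
        'TableType': 'EXTERNAL_TABLE',
        'StorageDescriptor': {
            'Columns': as_columns(columns),
            'Location': location,
            'InputFormat': PARQUET_INPUT,
            'OutputFormat': PARQUET_OUTPUT,
            'SerdeInfo': {'SerializationLibrary': PARQUET_SERDE},
        },
        'PartitionKeys': as_columns(partition_keys),
        'Parameters': {'classification': 'parquet'},
    }


table_input = parquet_table_input(
    name='daily_sales_summary',
    location='s3://data-lake/gold/daily-sales/',
    columns=[('sale_date', 'date'), ('total_revenue', 'decimal(12,2)')],
    partition_keys=[('year', 'int'), ('month', 'int')],
)
# With credentials in place:
#   glue.create_table(DatabaseName='analytics_db', TableInput=table_input)
```

Keeping partition keys out of the regular column list matters: a column that appears in both places makes the table definition invalid for Hive-compatible engines.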
Partition Management
# Create partitions.
# Each partition carries its own storage descriptor. Reuse the table's
# descriptor so columns, formats, and SerDe match, and override the location.
table_sd = table['Table']['StorageDescriptor']
glue.batch_create_partition(
    DatabaseName='analytics_db',
    TableName='daily_sales_summary',
    PartitionInputList=[
        {
            'Values': ['2024', '1'],  # one value per partition key: year, month
            'StorageDescriptor': {
                **table_sd,
                'Location': 's3://data-lake/gold/daily-sales/year=2024/month=1/'
            }
        }
    ]
)
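# Get partitions.
# Expression filters on partition key values; large result sets are
# paginated, so follow NextToken or use a paginator.
partitions = glue.get_partitions(
    DatabaseName='analytics_db',
    TableName='daily_sales_summary',
    Expression='year=2024 AND month=1'
)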
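Registering many partitions follows the same pattern as the call above: each partition needs its own storage descriptor and location, and the inputs are sent in batches. The sketch below assumes Hive-style paths (year=2024/month=1/) and a limit of 100 partitions per batch_create_partition call; `partition_input`, `chunked`, and `register_partitions` are hypothetical helper names.

```python
import copy


def partition_input(table, values):
    """Build a PartitionInput from a table dict (get_table()['Table']) and key values."""
    keys = [key['Name'] for key in table['PartitionKeys']]
    if len(keys) != len(values):
        raise ValueError('expected one value per partition key')
    # Copy so the table's own descriptor is left untouched
    sd = copy.deepcopy(table['StorageDescriptor'])
    # Hive-style layout: <table location>/<key>=<value>/...
    suffix = '/'.join(f'{key}={value}' for key, value in zip(keys, values))
    sd['Location'] = sd['Location'].rstrip('/') + '/' + suffix + '/'
    return {'Values': [str(value) for value in values], 'StorageDescriptor': sd}


def chunked(items, size=100):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def register_partitions(glue, database, table, value_tuples):
    """Create the given partitions and return any per-partition errors Glue reports."""
    inputs = [partition_input(table, values) for values in value_tuples]
    errors = []
    for batch in chunked(inputs):
        response = glue.batch_create_partition(
            DatabaseName=database,
            TableName=table['Name'],
            PartitionInputList=batch,
        )
        errors.extend(response.get('Errors', []))
    return errors
```

A typical call would be register_partitions(glue, 'analytics_db', table['Table'], [(2024, m) for m in range(1, 13)]), where table is the get_table response from the previous section.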
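Pro Tip: Filter on partition columns in your queries (for example, WHERE year = 2024 AND month = 1) so the engine reads only the matching partitions. Athena bills by the amount of data scanned, so pruning partitions lowers both query time and cost.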
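As an illustration of the tip, the sketch below builds an Athena query whose WHERE clause filters on the partition keys of daily_sales_summary. The helper name `partition_filtered_query` and the results bucket in the usage comment are hypothetical; integer partition values are assumed so that no quoting is needed.

```python
def partition_filtered_query(table, filters, columns=('*',)):
    """Return a SELECT that restricts the scan to the given integer partition values."""
    for key, value in filters.items():
        if not isinstance(value, int):
            raise TypeError(f'partition value for {key!r} must be an int')
    where = ' AND '.join(f'{key} = {value}' for key, value in filters.items())
    return f"SELECT {', '.join(columns)} FROM {table} WHERE {where}"


query = partition_filtered_query(
    'daily_sales_summary',
    {'year': 2024, 'month': 1},
    columns=('sale_date', 'total_revenue'),
)
print(query)
# To run it (requires AWS credentials and an S3 results location you own):
#   boto3.client('athena').start_query_execution(
#       QueryString=query,
#       QueryExecutionContext={'Database': 'analytics_db'},
#       ResultConfiguration={'OutputLocation': 's3://your-results-bucket/athena/'},
#   )
```

Without the year and month predicates, the same query would read every partition under s3://data-lake/gold/daily-sales/.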
Interview Q&A
Q1: What is the Glue Data Catalog?
Answer: A centralized metadata repository that stores table definitions, schemas, and partition information. It is compatible with the Apache Hive Metastore and is used by Athena, Redshift Spectrum, EMR, and Glue ETL jobs.
Q2: How do crawlers update the catalog?
Answer: Crawlers connect to data stores, infer schemas, create or update table definitions, and detect partition changes. They can run on demand, on a schedule, or in response to events.
Q3: What is the difference between a database and a table in the catalog?
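Answer: A database is a logical grouping of tables, similar to a schema in a relational database. A table is a metadata definition (columns, types, partition keys, and storage location) that points at the data; the data itself stays in S3 or another data store.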
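To make the crawler answer in Q2 concrete, here is a sketch of a scheduled crawler definition. The crawler name, IAM role name, and schedule are placeholders chosen for illustration; the S3 path and database name come from the architecture diagram above.

```python
def crawler_config(name, role, database, s3_path, schedule=None):
    """Build keyword arguments for glue.create_crawler()."""
    config = {
        'Name': name,
        'Role': role,                # IAM role the crawler assumes
        'DatabaseName': database,    # catalog database that receives the tables
        'Targets': {'S3Targets': [{'Path': s3_path}]},
        'SchemaChangePolicy': {
            'UpdateBehavior': 'UPDATE_IN_DATABASE',  # apply detected schema changes
            'DeleteBehavior': 'LOG',                 # only log objects that disappeared
        },
    }
    if schedule:
        config['Schedule'] = schedule  # omit for an on-demand crawler
    return config


config = crawler_config(
    name='raw-sales-crawler',        # hypothetical crawler name
    role='GlueCrawlerRole',          # hypothetical IAM role
    database='data_lake_db',
    s3_path='s3://data-lake/raw/sales/',
    schedule='cron(0 2 * * ? *)',    # daily at 02:00 UTC
)
# With credentials in place:
#   glue.create_crawler(**config)
#   glue.start_crawler(Name=config['Name'])   # run once immediately
```

Setting DeleteBehavior to LOG is the conservative choice here: tables whose underlying data disappears are reported rather than removed from the catalog.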
Summary
- Glue Catalog: Central metadata store for all data assets
- Crawlers: Automate schema discovery and catalog updates
- Partitions: Organize data for efficient querying
- Lake Integration: Used by Athena, Spectrum, EMR, Glue ETL
- Permissions: Integrated with Lake Formation for fine-grained access