📚 Data Catalog on AWS

Master Glue Data Catalog, Lake Formation permissions, and metadata management.

Data Catalog Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    GLUE DATA CATALOG ARCHITECTURE                            │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  DATA CATALOG (1 per account, 1 per region)                         │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Database: data_lake_db                                        │  │    │
│  │  │  ├── Table: raw_sales (Schema discovered by Crawler)          │  │    │
│  │  │  │   ├── Columns: sale_id, customer_id, amount, sale_date     │  │    │
│  │  │  │   ├── Partitions: year, month, day                         │  │    │
│  │  │  │   └── Location: s3://data-lake/raw/sales/                  │  │    │
│  │  │  ├── Table: silver_customers                                  │  │    │
│  │  │  │   └── ...                                                  │  │    │
│  │  │  └── Table: gold_daily_metrics                                │  │    │
│  │  │       └── ...                                                  │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Database: analytics_db                                        │  │    │
│  │  │  ├── Table: customer_360                                      │  │    │
│  │  │  └── Table: product_performance                              │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│              ┌─────────────────┼─────────────────┐                         │
│              ▼                 ▼                 ▼                         │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐            │
│  │  Athena         │  │  Redshift       │  │  EMR/Glue       │            │
│  │  (Queries)      │  │  Spectrum       │  │  ETL Jobs       │            │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘            │
└─────────────────────────────────────────────────────────────────────────────┘
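
The hierarchy in the diagram (catalog, then databases, then tables with their columns, partition keys, and location) can be walked through the Glue API. The following is a minimal sketch, assuming a boto3 Glue client is passed in; `describe_catalog` is a hypothetical helper name, not part of boto3.

```python
def describe_catalog(glue):
    """Return {database: {table: details}} for the catalog the client points at.

    `glue` is assumed to be a boto3 Glue client, e.g. boto3.client('glue').
    """
    summary = {}
    # Paginators follow NextToken automatically for large catalogs.
    for db_page in glue.get_paginator("get_databases").paginate():
        for db in db_page["DatabaseList"]:
            tables = {}
            table_pages = glue.get_paginator("get_tables").paginate(
                DatabaseName=db["Name"]
            )
            for page in table_pages:
                for t in page["TableList"]:
                    sd = t.get("StorageDescriptor", {})
                    tables[t["Name"]] = {
                        "columns": [c["Name"] for c in sd.get("Columns", [])],
                        "partition_keys": [
                            k["Name"] for k in t.get("PartitionKeys", [])
                        ],
                        "location": sd.get("Location"),
                    }
            summary[db["Name"]] = tables
    return summary
```

With credentials configured, calling `describe_catalog(boto3.client('glue'))` should return one entry per database in the current account and region, mirroring the tree shown in the diagram.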

Catalog Operations

import boto3

glue = boto3.client('glue')

# Create database
glue.create_database(
    DatabaseInput={
        'Name': 'analytics_db',
        'Description': 'Analytics database for business intelligence',
        'LocationUri': 's3://data-lake/analytics/',
        'Parameters': {
            'classification': 'managed',
            'created_by': 'data-engineering'
        }
    }
)

# Create table
glue.create_table(
    DatabaseName='analytics_db',
    TableInput={
        'Name': 'daily_sales_summary',
        'Description': 'Daily aggregated sales metrics',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'sale_date', 'Type': 'date', 'Comment': 'Transaction date'},
                {'Name': 'total_revenue', 'Type': 'decimal(12,2)', 'Comment': 'Total revenue'},
                {'Name': 'order_count', 'Type': 'int', 'Comment': 'Number of orders'},
                {'Name': 'avg_order_value', 'Type': 'decimal(10,2)', 'Comment': 'Average order value'}
            ],
            'Location': 's3://data-lake/gold/daily-sales/',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            },
            'Compressed': True,
            'NumberOfBuckets': -1
        },
        'PartitionKeys': [
            {'Name': 'year', 'Type': 'int'},
            {'Name': 'month', 'Type': 'int'}
        ],
        'TableType': 'EXTERNAL_TABLE',
        'Parameters': {
            'classification': 'parquet',
            'compressionType': 'snappy',
            'parquetOutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.ParquetOutputFormat'
        }
    }
)

# Get table metadata
table = glue.get_table(
    DatabaseName='analytics_db',
    Name='daily_sales_summary'
)

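# Update table metadata. get_table returns read-only fields (for example
# DatabaseName, CreateTime, UpdateTime) that TableInput rejects, so copy
# only writable keys before calling update_table. Column statistics are
# written through a separate API (update_column_statistics_for_table).
writable_keys = {
    'Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
    'PartitionKeys', 'TableType', 'Parameters'
}
table_input = {k: v for k, v in table['Table'].items() if k in writable_keys}
table_input['Description'] = 'Daily aggregated sales metrics (updated)'  # example change
glue.update_table(
    DatabaseName='analytics_db',
    TableInput=table_input
)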
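
Re-running the create_database call above fails once the database exists. A minimal sketch of an idempotent variant follows, assuming a boto3 Glue client is passed in; `ensure_database` is a hypothetical helper name introduced here for illustration.

```python
def ensure_database(glue, name, description=None, location_uri=None):
    """Create the database if it is missing. Return True if it was created.

    `glue` is assumed to be a boto3 Glue client.
    """
    database_input = {"Name": name}
    if description:
        database_input["Description"] = description
    if location_uri:
        database_input["LocationUri"] = location_uri
    try:
        glue.create_database(DatabaseInput=database_input)
        return True
    except glue.exceptions.AlreadyExistsException:
        # The database is already registered; treat this as success.
        return False
```

For example, `ensure_database(glue, 'analytics_db', location_uri='s3://data-lake/analytics/')` can be called at the start of every pipeline run without checking for the database first.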

Partition Management

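# Create partitions. The [...] and '...' values below are placeholders;
# each partition normally reuses the table's columns, formats, and SerDe.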
glue.batch_create_partition(
    DatabaseName='analytics_db',
    TableName='daily_sales_summary',
    PartitionInputList=[
        {
            'Values': ['2024', '1'],
            'StorageDescriptor': {
                'Columns': [...],
                'Location': 's3://data-lake/gold/daily-sales/year=2024/month=1/',
                'InputFormat': '...',
                'OutputFormat': '...',
                'SerdeInfo': {'SerializationLibrary': '...'}
            }
        }
    ]
)

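# Get partitions matching a filter expression (results are paginated;
# follow NextToken in the response for tables with many partitions)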
partitions = glue.get_partitions(
    DatabaseName='analytics_db',
    TableName='daily_sales_summary',
    Expression='year=2024 AND month=1'
)
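
Rather than typing the storage descriptor by hand for each partition, it can be copied from the table definition. The sketch below assumes a Hive-style `key=value` folder layout (as in the location above) and sends partitions in batches of at most 100, the documented limit per batch_create_partition call. `build_partition_inputs` and `register_partitions` are hypothetical helper names.

```python
import copy


def build_partition_inputs(table, partition_values):
    """Build PartitionInput dicts that reuse the table's StorageDescriptor.

    `table` is the 'Table' dict returned by glue.get_table.
    `partition_values` is an iterable of tuples such as (2024, 1).
    Assumes partitions live under <table location>/key=value/... folders.
    """
    keys = [k["Name"] for k in table["PartitionKeys"]]
    base_location = table["StorageDescriptor"]["Location"].rstrip("/")
    inputs = []
    for values in partition_values:
        sd = copy.deepcopy(table["StorageDescriptor"])
        suffix = "/".join(f"{k}={v}" for k, v in zip(keys, values))
        sd["Location"] = f"{base_location}/{suffix}/"
        # Partition values are always passed to Glue as strings.
        inputs.append({"Values": [str(v) for v in values], "StorageDescriptor": sd})
    return inputs


def register_partitions(glue, database, table_name, partition_values, batch_size=100):
    """Register partitions in batches and return any per-partition errors."""
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    inputs = build_partition_inputs(table, partition_values)
    errors = []
    for start in range(0, len(inputs), batch_size):
        response = glue.batch_create_partition(
            DatabaseName=database,
            TableName=table_name,
            PartitionInputList=inputs[start:start + batch_size],
        )
        errors.extend(response.get("Errors", []))
    return errors
```

The returned list is worth checking: a batch call can succeed overall while individual partitions (for example, ones that already exist) are reported under `Errors`.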

Pro Tip: Filter on partition columns (for example, year and month) in your queries so the engine reads only the matching partitions. Athena bills by the amount of data scanned, so partition pruning cuts both query time and cost.
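
As a rough illustration of the tip, the sketch below submits an Athena query whose WHERE clause filters on the partition columns of daily_sales_summary. It assumes a boto3 Athena client is passed in; the function names and the results location are hypothetical and must be replaced with values from your own account.

```python
def build_pruned_query(year, month):
    """Build a query that filters on the partition columns year and month."""
    # int() guards against non-numeric input being placed into the SQL text.
    return (
        "SELECT sale_date, total_revenue, order_count "
        "FROM daily_sales_summary "
        f"WHERE year = {int(year)} AND month = {int(month)}"
    )


def run_pruned_query(athena, database, output_location, year, month):
    """Submit the query and return its execution ID.

    `athena` is assumed to be a boto3 Athena client; `output_location`
    is an S3 URI you own where Athena writes query results.
    """
    response = athena.start_query_execution(
        QueryString=build_pruned_query(year, month),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Because year and month are partition keys, Athena only needs to read the objects under the matching year/month prefix instead of the whole table.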

Interview Q&A

Q1: What is the Glue Data Catalog?

Answer: A centralized metadata repository that stores table definitions, schemas, and partition information. It is compatible with the Apache Hive Metastore and is used by Athena, Redshift Spectrum, EMR, and Glue ETL jobs.

Q2: How do crawlers update the catalog?

Answer: Crawlers connect to a data store, classify the data to infer its schema, create or update table definitions in the catalog, and register new partitions. They can run on demand, on a schedule, or in response to events.
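
A crawler for the raw sales data could be defined roughly as follows. This is a sketch under assumptions: a boto3 Glue client is passed in, the IAM role ARN is supplied by the caller, and `create_crawler_for_path` is a hypothetical helper name.

```python
def create_crawler_for_path(glue, name, role_arn, database, s3_path,
                            schedule=None, start_now=False):
    """Create an S3 crawler that writes tables into `database`.

    `schedule` is an optional cron expression such as 'cron(0 2 * * ? *)'.
    `role_arn` must be an IAM role that Glue can assume and that can read
    `s3_path`; the value is account-specific.
    """
    params = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "SchemaChangePolicy": {
            # Apply detected schema changes to the existing table definition.
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            # Only log when source objects disappear; keep the table.
            "DeleteBehavior": "LOG",
        },
    }
    if schedule:
        params["Schedule"] = schedule
    glue.create_crawler(**params)
    if start_now:
        # On-demand run, independent of any schedule.
        glue.start_crawler(Name=name)
    return params
```

Passing a schedule covers the recurring case, while `start_now=True` covers a one-off run after new data lands.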

Q3: What is the difference between a database and a table in the catalog?

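Answer: A database is a logical grouping of tables (like a schema in a relational database). A table is a metadata definition: its columns, types, partition keys, format, and the location of the underlying data. The data itself stays in the source store, such as S3.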

Summary

Glue Catalog: Central metadata store for all data assets
Crawlers: Automate schema discovery and catalog updates
Partitions: Organize data for efficient querying
Integration: Used by Athena, Redshift Spectrum, EMR, and Glue ETL
Permissions: Integrated with Lake Formation for fine-grained access
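
To make the last point concrete, the sketch below grants column-level SELECT on a catalog table through Lake Formation. It assumes a boto3 Lake Formation client (`boto3.client('lakeformation')`) is passed in and that the table's location is already registered with Lake Formation; `grant_column_select` is a hypothetical helper name and the principal ARN is account-specific.

```python
def grant_column_select(lakeformation, principal_arn, database, table, columns):
    """Grant SELECT on specific columns of a catalog table to one principal.

    `principal_arn` is an IAM user or role ARN supplied by the caller.
    """
    request = {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }
    lakeformation.grant_permissions(**request)
    return request
```

For example, granting only sale_date and total_revenue on daily_sales_summary would let an analyst role query revenue without seeing the remaining columns.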