Data Lake Architecture
Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│                       AWS DATA LAKE ARCHITECTURE                        │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                           DATA SOURCES                            │  │
│  │    ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐      │  │
│  │    │   RDS/   │   │   SaaS   │   │   IoT    │   │   APIs   │      │  │
│  │    │  Aurora  │   │   Apps   │   │ Devices  │   │          │      │  │
│  │    └────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘      │  │
│  └─────────┼──────────────┼──────────────┼──────────────┼────────────┘  │
│            ▼              ▼              ▼              ▼               │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                             INGESTION                             │  │
│  │    ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐      │  │
│  │    │   DMS    │   │   Glue   │   │ Kinesis  │   │   Data   │      │  │
│  │    │  (CDC)   │   │ (Batch)  │   │ (Stream) │   │ Pipeline │      │  │
│  │    └────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘      │  │
│  └─────────┼──────────────┼──────────────┼──────────────┼────────────┘  │
│            ▼              ▼              ▼              ▼               │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │               S3 DATA LAKE (Medallion Architecture)               │  │
│  │                                                                   │  │
│  │  ┌─────────────────────────────────────────────────────────────┐  │  │
│  │  │  BRONZE (Raw)                                               │  │  │
│  │  │  s3://data-lake/bronze/{source}/{date}/                     │  │  │
│  │  │  • As-is data from sources                                  │  │  │
│  │  │  • No transformations                                       │  │  │
│  │  │  • Retention: 90 days → Glacier                             │  │  │
│  │  └─────────────────────────────────────────────────────────────┘  │  │
│  │                                                                   │  │
│  │  ┌─────────────────────────────────────────────────────────────┐  │  │
│  │  │  SILVER (Cleaned)                                           │  │  │
│  │  │  s3://data-lake/silver/{table}/{date}/                      │  │  │
│  │  │  • Deduplicated, validated                                  │  │  │
│  │  │  • Schema enforced                                          │  │  │
│  │  │  • Format: Parquet                                          │  │  │
│  │  │  • Retention: 1 year → IA                                   │  │  │
│  │  └─────────────────────────────────────────────────────────────┘  │  │
│  │                                                                   │  │
│  │  ┌─────────────────────────────────────────────────────────────┐  │  │
│  │  │  GOLD (Business-ready)                                      │  │  │
│  │  │  s3://data-lake/gold/{domain}/{metric}/                     │  │  │
│  │  │  • Aggregated, enriched                                     │  │  │
│  │  │  • Business logic applied                                   │  │  │
│  │  │  • Ready for analytics                                      │  │  │
│  │  └─────────────────────────────────────────────────────────────┘  │  │
│  └─────────────────────────────────┬─────────────────────────────────┘  │
│                                    │                                    │
│  ┌─────────────────────────────────▼─────────────────────────────────┐  │
│  │                       GOVERNANCE & CATALOG                        │  │
│  │     ┌──────────────┐    ┌──────────────┐    ┌──────────────┐      │  │
│  │     │     Lake     │    │  Glue Data   │    │  AWS Config  │      │  │
│  │     │  Formation   │    │   Catalog    │    │    Rules     │      │  │
│  │     └──────────────┘    └──────────────┘    └──────────────┘      │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                             ANALYTICS                             │  │
│  │    ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐      │  │
│  │    │  Athena  │   │ Redshift │   │QuickSight│   │EMR/Spark │      │  │
│  │    │ (Ad-hoc) │   │   (WH)   │   │   (BI)   │   │   (ML)   │      │  │
│  │    └──────────┘   └──────────┘   └──────────┘   └──────────┘      │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
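The retention notes in the Bronze and Silver boxes can be expressed as S3 lifecycle rules. The following is a minimal sketch, not a definitive configuration. It assumes the bucket is named data-lake (taken from the paths in the diagram), that "IA" means the S3 Standard-IA storage class, and that "1 year" means 365 days. The function names and rule IDs are hypothetical.

```python
def build_lifecycle_rules():
    """Return an S3 lifecycle configuration matching the diagram's retention notes."""
    return {
        "Rules": [
            {
                # Bronze: move raw objects to Glacier after 90 days
                "ID": "bronze-to-glacier-after-90-days",
                "Filter": {"Prefix": "bronze/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            },
            {
                # Silver: move cleaned objects to Standard-IA after 1 year (assumed 365 days)
                "ID": "silver-to-ia-after-1-year",
                "Filter": {"Prefix": "silver/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 365, "StorageClass": "STANDARD_IA"}],
            },
        ]
    }


def apply_lifecycle_rules(bucket="data-lake"):
    """Apply the rules to the bucket; needs boto3 and AWS credentials."""
    import boto3  # imported here so the rule builder works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_lifecycle_rules(),
    )
```

Keeping the rule builder separate from the API call lets the configuration be reviewed or unit-tested without touching AWS.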
Lake Formation Permissions
import boto3

lakeformation = boto3.client('lakeformation')

# The principal identifier is the bare IAM ARN (no 'IAM_ARN:' prefix)
ANALYST_ROLE = {'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/AnalystRole'}

# Grant database permissions
lakeformation.grant_permissions(
    Principal=ANALYST_ROLE,
    Resource={
        'Database': {'Name': 'data_lake_db'}
    },
    Permissions=['CREATE_TABLE', 'ALTER', 'DROP']
)

# Grant table permissions
lakeformation.grant_permissions(
    Principal=ANALYST_ROLE,
    Resource={
        'Table': {
            'DatabaseName': 'data_lake_db',
            'Name': 'sales'
        }
    },
    Permissions=['SELECT', 'INSERT', 'DELETE']
)

# Grant column-level permissions.
# Lake Formation expects either ColumnNames (allow-list) or ColumnWildcard
# (everything except ExcludedColumnNames), not both. The allow-list below
# leaves sensitive columns such as ssn and credit_card inaccessible.
lakeformation.grant_permissions(
    Principal=ANALYST_ROLE,
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'data_lake_db',
            'Name': 'customers',
            'ColumnNames': ['customer_id', 'name', 'email']
        }
    },
    Permissions=['SELECT']
)
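A column-level grant can also be written as an exclusion list: SELECT on every column except the sensitive ones. The sketch below builds the request for that variant using ColumnWildcard on its own. The helper name column_exclusion_grant is hypothetical, and the role, database, table, and column names are the sample values used above.

```python
def column_exclusion_grant(role_arn, database, table, excluded_columns):
    """Build grant_permissions kwargs: SELECT on every column except the excluded ones."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Wildcard = all columns, minus the ones listed here
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


request = column_exclusion_grant(
    "arn:aws:iam::123456789012:role/AnalystRole",
    "data_lake_db",
    "customers",
    ["ssn", "credit_card"],
)
# With boto3 and credentials available, the request could be sent as:
# boto3.client("lakeformation").grant_permissions(**request)
```

An exclusion list keeps working when new non-sensitive columns are added to the table, while an allow-list fails closed: new columns stay hidden until they are granted explicitly. Which behavior is preferable depends on how the table is expected to evolve.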
Interview Q&A
Q1: What is the Medallion Architecture?
Answer: Bronze (raw) → Silver (cleaned) → Gold (business-ready). Each layer adds quality and value while maintaining data lineage.
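The three layers can be illustrated with a small in-memory sketch. It is a toy model in plain Python, not a Glue or Spark job. The record fields (order_id, region, amount) and the function names are hypothetical, chosen only to show deduplication and validation in Silver and aggregation in Gold.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Clean raw rows: drop invalid records, enforce types, deduplicate by order_id."""
    seen, silver = set(), []
    for row in bronze_rows:
        try:
            order_id = int(row["order_id"])
            amount = float(row["amount"])
        except (KeyError, ValueError, TypeError):
            continue  # invalid record: skipped here, could go to a quarantine area
        if order_id in seen:
            continue  # duplicate of a row already kept
        seen.add(order_id)
        silver.append(
            {"order_id": order_id, "region": row.get("region", "unknown"), "amount": amount}
        )
    return silver


def to_gold(silver_rows):
    """Aggregate cleaned rows into a business metric: revenue per region."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["region"]] += row["amount"]
    return dict(revenue)


bronze = [
    {"order_id": "1", "region": "eu", "amount": "10.5"},
    {"order_id": "1", "region": "eu", "amount": "10.5"},  # duplicate
    {"order_id": "2", "region": "us", "amount": "oops"},  # invalid amount
    {"order_id": "3", "region": "us", "amount": "4.5"},
]
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'eu': 10.5, 'us': 4.5}
```

Bronze is never modified, so Silver and Gold can always be rebuilt from it if the cleaning rules or business logic change.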
Q2: How does Lake Formation differ from IAM policies?
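Answer: Lake Formation grants fine-grained permissions (database, table, column, and row level) on Data Catalog objects through a grant/revoke model. IAM policies control access to AWS APIs and resources, such as S3 buckets and Glue actions. The two work together: a principal needs IAM permission to call the service and a Lake Formation grant on the data it queries.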
Q3: What is the key benefit of a data lake over a data warehouse?
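Answer: A data lake stores raw data of any structure at low cost and applies the schema on read, so new sources can land without upfront modeling. A data warehouse stores structured data with the schema enforced on write and is optimized for fast queries.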
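The difference between schema on read and schema on write can be sketched in a few lines. This is an illustrative toy, not how Athena or Redshift work internally. The event fields and the SCHEMA mapping are hypothetical.

```python
import json

RAW_EVENTS = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": "7", "device": "mobile"}',  # extra field, count as string
]

SCHEMA = {"user": str, "clicks": int}  # hypothetical warehouse table schema


def write_with_schema(line):
    """Schema on write: validate before storing, reject rows that do not match."""
    record = json.loads(line)
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    return record


def read_with_schema(line):
    """Schema on read: the raw line is stored untouched and interpreted at query time."""
    record = json.loads(line)
    return {"user": str(record["user"]), "clicks": int(record["clicks"])}


# Lake: every raw event is kept, the schema is applied when reading
lake_view = [read_with_schema(line) for line in RAW_EVENTS]

# Warehouse: events that do not fit the schema are rejected at load time
warehouse_rows, rejected = [], []
for line in RAW_EVENTS:
    try:
        warehouse_rows.append(write_with_schema(line))
    except ValueError:
        rejected.append(line)
```

Here the lake keeps both events and coerces them when read, while the warehouse loads only the first one. The trade-off is that the lake defers data-quality problems to query time.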
Summary
- Architecture: Bronze → Silver → Gold layers
- Storage: S3 with lifecycle policies
- Governance: Lake Formation for fine-grained permissions
- Catalog: Glue Data Catalog for metadata management
- Formats: Parquet for processed data, raw format for Bronze
- Analytics: Athena, Redshift, QuickSight, and EMR/Spark over the same S3 data