Advanced Data Architecture Patterns on AWS
Advanced Data Architecture Patterns
Module 64 β Multi-Account Strategies, Hybrid Architectures, and Complex Patterns
Multi-Account Data Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AWS MULTI-ACCOUNT DATA STRATEGY β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β MANAGEMENT ACCOUNT β β
β β β’ AWS Organizations β’ Service Control Policies β β
β β β’ Consolidated Billing β’ Cross-Account Roles β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β ββββββββββββββββββββββΌβββββββββββββββββββββ β
β β β β β
β βΌ βΌ βΌ β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β SHARED β β DATA β β ANALYTICSβ β
β β SERVICES β β LAKE β β ACCOUNT β β
β β ACCOUNT β β ACCOUNT β β β β
β β β β β β β β
β β β’ IAM β β β’ S3 β β β’ Redshiftβ β
β β β’ KMS β β β’ Glue β β β’ QuickSightβ β
β β β’ Transit β β β’ Lake β β β’ Athena β β
β β GW β β Formationβ β β’ EMR β β
β β β’ Network β β β’ MSK β β β’ SageMakerβ β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β β β β
β β β β β
β βΌ βΌ βΌ β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β PROD β β DEV/TEST β β SECURITY β β
β β DATA β β ACCOUNT β β ACCOUNT β β
β β ACCOUNT β β β β β β
β β β β β β β β
β β β’ RDS β β β’ Copies β β β’ GuardDutyβ β
β β β’ DynamoDBβ β β’ Sandboxedβ β β’ Security Hubβ β
β β β’ ElastiCacheβ β data β β β’ Config β β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Cross-Account Data Access Patterns
# Cross-account role assumption for data access
import boto3
import json
class CrossAccountDataAccess:
def __init__(self, management_account_id, target_account_id):
self.management_account_id = management_account_id
self.target_account_id = target_account_id
def assume_role(self, role_name, session_name='DataAccess'):
"""Assume cross-account role"""
sts = boto3.client('sts')
response = sts.assume_role(
RoleArn=f'arn:aws:iam::{self.target_account_id}:role/{role_name}',
RoleSessionName=session_name,
DurationSeconds=3600,
Tags=[
{
'Key': 'Purpose',
'Value': 'DataEngineeringAccess'
}
]
)
credentials = response['Credentials']
return boto3.Session(
aws_access_key_id=credentials['AccessKeyId'],
aws_secret_access_key=credentials['SecretAccessKey'],
aws_session_token=credentials['SessionToken']
)
def create_cross_account_s3_access(self):
"""Create cross-account S3 access policy"""
policy = {
"Version": "2012-10-17",
"Statement": [
{
"Sid": "CrossAccountS3Access",
"Effect": "Allow",
"Principal": {
"AWS": f"arn:aws:iam::{self.management_account_id}:root"
},
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::data-lake-bucket",
"arn:aws:s3:::data-lake-bucket/*"
]
}
]
}
return policy
# Usage
cross_account = CrossAccountDataAccess('111111111111', '222222222222')
session = cross_account.assume_role('DataEngineeringRole')
s3 = session.client('s3')
Hybrid Cloud Data Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β HYBRID CLOUD DATA ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ON-PREMISES AWS CLOUD β
β ββββββββββββββββββββββββ ββββββββββββββββββββββββ β
β β Data Center β β VPC β β
β β β β β β
β β ββββββββββββββββ β Direct β ββββββββββββββββ β β
β β β Oracle DB βββββΌββConnectβββΌββΆβ Aurora β β β
β β β 50TB β β β β PostgreSQL β β β
β β ββββββββββββββββ β β ββββββββββββββββ β β
β β β β β β
β β ββββββββββββββββ β VPN β ββββββββββββββββ β β
β β β Data βββββΌβββββββββββΌββΆβ S3 Data Lake β β β
β β β Warehouse β β β β β β β
β β ββββββββββββββββ β β ββββββββββββββββ β β
β β β β β β
β β ββββββββββββββββ β β ββββββββββββββββ β β
β β β File Server βββββΌβββββββββββΌββΆβ Snowball Edgeβ β β
β β β 100TB β β β β β β β
β β ββββββββββββββββ β β ββββββββββββββββ β β
β β β β β β
β ββββββββββββββββββββββββ ββββββββββββββββββββββββ β
β β
β SYNC METHODS: β
β βββββββββββββ β
β β’ AWS DataSync: Managed file transfer β
β β’ AWS Direct Connect: Dedicated network connection β
β β’ AWS Storage Gateway: Hybrid storage β
β β’ AWS Snow Family: Offline large-scale transfer β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
DataSync Configuration
# AWS DataSync for hybrid data transfer
import boto3
def create_datasync_task():
"""Create DataSync task for hybrid sync"""
datasync = boto3.client('datasync')
# Create source location (on-premises NFS)
source_response = datasync.create_location_nfs(
ServerHostname='onprem-server.example.com',
Subdirectory='/data/exports',
OnPremConfig={
AgentArns=['arn:aws:datasync:us-east-1:111111111111:agent/agent-id']
},
MountOptions={
'Version': 'NFS3'
}
)
# Create target location (S3)
target_response = datasync.create_location_s3(
S3BucketArn='arn:aws:s3:::data-lake-sync',
S3Config={
'BucketAccessRoleArn': 'arn:aws:iam::111111111111:role/DataSyncRole'
},
Subdirectory='/sync-data'
)
# Create DataSync task
task_response = datasync.create_task(
SourceLocationArn=source_response['LocationArn'],
DestinationLocationArn=target_response['LocationArn'],
Name='OnPremToS3Sync',
Options={
'VerifyMode': 'ONLY_FILES_TRANSFERRED',
'OverwriteMode': 'ALWAYS',
'PreserveDeletedFiles': 'PRESERVE',
'PreserveDevices': 'NONE',
'PosixPermissions': 'NONE',
'TaskQueuingingMode': 'ENABLED',
'TransferMode': 'ALL',
'LogLevel': 'TRANSFER'
},
Schedule={
'ScheduleExpression': 'cron(0 2 * * ? *)' # Daily at 2 AM
}
)
return task_response['TaskArn']
# Monitor DataSync
def monitor_datasync(task_arn):
"""Monitor DataSync task execution"""
datasync = boto3.client('datasync')
response = datasync.list_task_executions(
TaskArn=task_arn,
MaxResults=5
)
for execution in response['TaskExecutions']:
print(f"Execution: {execution['TaskExecutionArn']}")
print(f"Status: {execution['Status']}")
print(f"Bytes transferred: {execution['BytesTransferred']}")
print(f"Files transferred: {execution['FilesTransferred']}")
print(f"Duration: {execution['Duration']}")
Complex ETL Patterns
Lambda-Based Serverless ETL
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SERVERLESS ETL WITH LAMBDA & STEP FUNCTIONS β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β S3 Upload Lambda Step Function S3 Processed β
β ββββββββββββ ββββββββββββ ββββββββββββββββ ββββββββββββ β
β βRaw Data βββΆβTrigger βββΆβOrchestrate βββΆβClean Dataβ β
β βLanding β βFunction β β β β β β
β ββββββββββββ ββββββββββββ β ββββββββββββ ββββββββββββ β
β β βValidate ββ β
β ββββββββββββ ββββββββββββ β β Data ββ ββββββββββββ β
β βSchema βββΆβEnrichmentβ β ββββββ¬ββββββ βRedshift β β
β βRegistry β βFunction β β β β βLoading β β
β ββββββββββββ ββββββββββββ β ββββββΌββββββ ββββββββββββ β
β β βTransform ββ β
β ββββββββββββ ββββββββββββ β β Data ββ ββββββββββββ β
β βData βββΆβQuality β β ββββββ¬ββββββ βAnalytics β β
β βCatalog β βCheck β β β β βDashboard β β
β ββββββββββββ ββββββββββββ β ββββββΌββββββ ββββββββββββ β
β β βLoad to ββ β
β β β Target ββ β
β β ββββββββββββ β
β ββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Step Functions Orchestration
# Step Functions state machine for complex ETL
import json
step_function_definition = {
"Comment": "Complex ETL Pipeline",
"StartAt": "ValidateInput",
"States": {
"ValidateInput": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:111111111111:validate-input",
"Next": "CheckValidation",
"Retry": [
{
"ErrorEquals": ["States.TaskFailed"],
"IntervalSeconds": 30,
"MaxAttempts": 3,
"BackoffRate": 2
}
],
"Catch": [
{
"ErrorEquals": ["States.ALL"],
"Next": "HandleError"
}
]
},
"CheckValidation": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.isValid",
"BooleanEquals": True,
"Next": "EnrichData"
}
],
"Default": "NotifyInvalidData"
},
"EnrichData": {
"Type": "Parallel",
"Next": "TransformData",
"Branches": [
{
"StartAt": "EnrichGeolocation",
"States": {
"EnrichGeolocation": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:111111111111:enrich-geo",
"End": True
}
}
},
{
"StartAt": "EnrichDemographics",
"States": {
"EnrichDemographics": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:111111111111:enrich-demo",
"End": True
}
}
}
]
},
"TransformData": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName": "transform-data-job",
"Arguments": {
"--input_path.$": "$.outputPath",
"--output_path.$": "$.transformedPath"
}
},
"Next": "QualityCheck"
},
"QualityCheck": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:111111111111:quality-check",
"Next": "LoadToRedshift"
},
"LoadToRedshift": {
"Type": "Task",
"Resource": "arn:aws:states:::redshift-data:executeSql.waitForTaskToken",
"Parameters": {
"ClusterIdentifier": "analytics-cluster",
"Database": "analytics",
"Sql": "COPY schema.table FROM 's3://bucket/data' IAM_ROLE 'arn:aws:iam::role/RedshiftRole' FORMAT AS PARQUET"
},
"End": True
},
"HandleError": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:111111111111:handle-error",
"End": True
},
"NotifyInvalidData": {
"Type": "Task",
"Resource": "arn:aws:sns:us-east-1:111111111111:alerts",
"Parameters": {
"TopicArn": "arn:aws:sns:us-east-1:111111111111:data-alerts",
"Message": "Invalid data detected in pipeline"
},
"End": True
}
}
}
# Save definition
with open('etl-state-machine.json', 'w') as f:
json.dump(step_function_definition, f, indent=2)
Data Mesh on AWS
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DATA MESH IMPLEMENTATION ON AWS β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β PLATFORM TEAM (Central) β β
β β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β β
β β β AWS Lake β β AWS Glue β β AWS IAM β β β
β β β Formation β β DataBrew β β Identity β β β
β β β β β β β Center β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β β
β β β β
β β Self-Service Data Platform APIs β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β ββββββββββββββββββββββΌβββββββββββββββββββββ β
β β β β β
β βΌ βΌ βΌ β
β ββββββββββββββββββββββ ββββββββββββββββββββββ βββββββββββββββββ β
β β SALES DOMAIN β β MARKETING DOMAIN β β FINANCE DOMAINβ β
β β β β β β β β
β β ββββββββββββββββ β β ββββββββββββββββ β β βββββββββββββ β β
β β β Data Product β β β β Data Product β β β βData Productβ β β
β β β - Customer β β β β - Campaign β β β β- Revenue β β β
β β β - Orders β β β β - Leads β β β β- Forecast β β β
β β ββββββββββββββββ β β ββββββββββββββββ β β βββββββββββββ β β
β β β β β β β β
β β ββββββββββββββββ β β ββββββββββββββββ β β βββββββββββββ β β
β β β S3 Bucket β β β β S3 Bucket β β β βS3 Bucket β β β
β β β (sales-data) β β β β (mktg-data) β β β β(fin-data)β β β
β β ββββββββββββββββ β β ββββββββββββββββ β β βββββββββββββ β β
β β β β β β β β
β β Domain Team Owns β β Domain Team Owns β βDomain Team Ownsβ β
β ββββββββββββββββββββββ ββββββββββββββββββββββ βββββββββββββββββ β
β β β β β
β ββββββββββββββββββββββΌβββββββββββββββββββββ β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β DATA CONSUMERS (Cross-Domain) β β
β β β β
β β Analytics ML Engineers Data Scientists BI Teams β β
β β βββββββββββ βββββββββββ βββββββββββ ββββββββββ β β
β β βAthena β βSageMakerβ βNotebooksβ βQuickSightβ β β
β β βββββββββββ βββββββββββ βββββββββββ ββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Lake Formation for Data Mesh
# Lake Formation permissions for data mesh domains
import boto3
def setup_lake_formation_permissions():
"""Setup Lake Formation permissions for data mesh"""
lakeformation = boto3.client('lakeformation')
# Grant database permissions to domain teams
response = lakeformation.grant_permissions(
Principal={
'DataLakePrincipalIdentifier': 'arn:aws:iam::111111111111:role/SalesDomainRole'
},
Resource={
'Database': {
'Name': 'sales_domain_db'
}
},
Permissions=['CREATE_TABLE', 'ALTER', 'DROP'],
GrantOption=True
)
# Grant table permissions
response = lakeformation.grant_permissions(
Principal={
'DataLakePrincipalIdentifier': 'arn:aws:iam::111111111111:role/AnalyticsRole'
},
Resource={
'Table': {
'DatabaseName': 'sales_domain_db',
'Name': 'customers'
}
},
Permissions=['SELECT', 'DESCRIBE'],
Conditions=[
{
'Expression': {
"StringEquals": {
"lakeformation:ColumnNames": ["customer_id", "email", "phone"]
}
},
'Name': 'ColumnFilter'
}
]
)
return response
# Data sharing between domains
def share_data_between_domains():
"""Share data products between domains"""
glue = boto3.client('glue')
# Create cross-account table version
response = glue.create_table(
DatabaseName='sales_domain_db',
TableInput={
'Name': 'customers_shared',
'TargetTable': {
'DatabaseName': 'marketing_domain_db',
'Name': 'customer_segments',
'CatalogId': '222222222222'
}
}
)
return response
Complex Streaming Architectures
Real-Time Analytics Pipeline
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β REAL-TIME ANALYTICS PIPELINE ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β DATA SOURCES β
β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
β β IoT β βClickstreamβ βLogs β βDatabase β β
β β Sensors β β β β β βCDC β β
β ββββββ¬ββββββ ββββββ¬ββββββ ββββββ¬ββββββ ββββββ¬ββββββ β
β β β β β β
β βΌ βΌ βΌ βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β KINESIS DATA STREAMS β β
β β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β β
β β βStream 1 β βStream 2 β βStream 3 β βStream 4 β β β
β β βIoT Data β βClicks β βApp Logs β βCDC Data β β β
β β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β ββββββββββββββββββββββΌβββββββββββββββββββββ β
β β β β β
β βΌ βΌ βΌ β
β ββββββββββββββββββββ ββββββββββββββββββββ ββββββββββββββββββββ β
β β KINESIS DATA β β KINESIS ANALYTICSβ β KINESIS FIREHOSE β β
β β ANALYTICS β β (SQL Application)β β β β
β β (Flink) β β β β β β
β β β β β’ Window Aggs β β β’ S3 (Parquet) β β
β β β’ State Mgmt β β β’ Joins β β β’ Redshift β β
β β β’ Complex Events β β β’ Filtering β β β’ Elasticsearch β β
β β β’ ML Inference β β β’ Patterns β β β’ OpenSearch β β
β ββββββββββ¬ββββββββββ ββββββββββ¬ββββββββββ ββββββββββ¬ββββββββββ β
β β β β β
β βΌ βΌ βΌ β
β ββββββββββββββββββββ ββββββββββββββββββββ ββββββββββββββββββββ β
β β DynamoDB β β Aurora β β S3 Data Lake β β
β β (Real-time β β (Query Store) β β (Historical β β
β β Features) β β β β Analytics) β β
β ββββββββββββββββββββ ββββββββββββββββββββ ββββββββββββββββββββ β
β β
β MONITORING: CloudWatch Metrics + Custom Dashboard β
β LATENCY: End-to-end < 1 second β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Kinesis Analytics Flink Application
// Flink SQL for real-time analytics
CREATE TABLE iot_stream (
device_id VARCHAR,
temperature DOUBLE,
humidity DOUBLE,
timestamp TIMESTAMP(3),
WATERMARK FOR timestamp AS timestamp - INTERVAL '5' SECOND
) WITH (
'connector' = 'kinesis',
'stream' = 'iot-data-stream',
'aws.region' = 'us-east-1',
'format' = 'json'
);
CREATE TABLE alert_sink (
device_id VARCHAR,
avg_temp DOUBLE,
alert_type VARCHAR,
window_start TIMESTAMP(3),
window_end TIMESTAMP(3)
) WITH (
'connector' = 'kinesis',
'stream' = 'alert-stream',
'aws.region' = 'us-east-1',
'format' = 'json'
);
-- Windowed aggregation with alerting
INSERT INTO alert_sink
SELECT
device_id,
AVG(temperature) as avg_temp,
CASE
WHEN AVG(temperature) > 50 THEN 'HIGH_TEMP'
WHEN AVG(temperature) < 10 THEN 'LOW_TEMP'
ELSE 'NORMAL'
END as alert_type,
TUMBLE_START(timestamp, INTERVAL '1' MINUTE) as window_start,
TUMBLE_END(timestamp, INTERVAL '1' MINUTE) as window_end
FROM iot_stream
GROUP BY
device_id,
TUMBLE(timestamp, INTERVAL '1' MINUTE)
HAVING AVG(temperature) > 50 OR AVG(temperature) < 10;
Interview Questions & Answers
Q1: How do you design a multi-account data architecture for a large enterprise?
Answer: Use AWS Organizations with separate accounts for shared services, data lake, analytics, production, dev/test, and security. Implement SCPs for governance, cross-account roles for access, and centralized logging. Use Lake Formation for fine-grained access control.
Q2: Explain the trade-offs between centralized and decentralized data architectures.
Answer: Centralized: easier governance, single source of truth, but can become bottleneck. Decentralized (Data Mesh): domain ownership, scalability, but complex governance. Choose based on organization size, data diversity, and team structure.
Q3: How do you handle data consistency in a hybrid cloud environment?
Answer: Use AWS DataSync for scheduled transfers, implement conflict resolution, use eventual consistency where acceptable, and strong consistency with DynamoDB Global Tables for critical data. Monitor sync status with CloudWatch.
Q4: Describe a serverless ETL architecture for variable workloads.
Answer: Use S3 event notifications to trigger Lambda, Step Functions for orchestration, Glue for data processing, and Kinesis for streaming. Benefits: auto-scaling, pay-per-use, no infrastructure management.
Q5: How do you implement data quality checks in a streaming pipeline?
Answer: Use Kinesis Analytics for real-time validation, Lambda for custom checks, CloudWatch for monitoring. Implement dead-letter queues for failed records, and alerts for quality violations.
Q6: Explain the Data Mesh pattern and how to implement it on AWS.
Answer: Data Mesh is decentralized data ownership. On AWS: use Lake Formation for permissions, Glue for catalog, S3 for storage per domain, and cross-account sharing. Platform team provides self-service tools.
Q7: How do you optimize costs in a complex data architecture?
Answer: Use S3 Intelligent-Tiering, Redshift Reserved Instances, Lambda for variable workloads, spot instances for batch, and auto-scaling. Monitor with Cost Explorer and implement tagging.
Q8: Describe a disaster recovery strategy for a data lake.
Answer: Use S3 Cross-Region Replication, Aurora Global Tables, Redshift snapshots to S3, and automated backups with AWS Backup. Test recovery regularly and implement RTO/RPO monitoring.
Q9: How do you handle schema evolution in a data lake?
Answer: Use Glue Schema Registry, implement versioning in S3, use Avro/Parquet with schema evolution, and Lambda for schema validation. Maintain backward compatibility with consumer contracts.
Q10: Explain the Lambda architecture vs Kappa architecture for data processing.
Answer: Lambda: batch + speed layers for complex processing. Kappa: single streaming layer for simplicity. On AWS: Lambda architecture uses EMR + Kinesis, Kappa uses Kinesis Analytics.
Q11: How do you implement real-time feature engineering for ML pipelines?
Answer: Use Kinesis Data Analytics for streaming features, SageMaker Feature Store for offline/online features, and Step Functions for orchestration. Implement feature versioning and monitoring.
Q12: Describe a multi-region data replication strategy.
Answer: Use DynamoDB Global Tables, Aurora Global Database, S3 Cross-Region Replication, and Route 53 for routing. Implement conflict resolution and monitor replication lag.
Q13: How do you handle data governance in a Data Mesh architecture?
Answer: Centralized governance policies with domain-specific implementation. Use Lake Formation for permissions, Glue Data Catalog for metadata, and AWS Config for compliance.
Q14: Explain the concept of data contracts and how to implement them.
Answer: Data contracts define data format, quality, and SLAs. Implement using Glue Schema Registry, API Gateway for data services, and automated validation in pipelines.
Q15: How do you optimize query performance in a data lake?
Answer: Use partitioning, bucketing, columnar formats (Parquet/ORC), caching with Athenaη»ζηΌε, and Redshift Spectrum for federated queries. Implement data compaction.
Q16: Describe a CI/CD pipeline for data engineering.
Answer: Use CodeCommit for version control, CodeBuild for testing, CodePipeline for deployment, and CloudFormation for infrastructure. Implement data pipeline testing with synthetic data.
Q17: How do you handle sensitive data in a data lake?
Answer: Use KMS for encryption, Lake Formation for column-level permissions, Macie for discovery, and DLP policies. Implement data masking and tokenization.
Q18: Explain the concept of data observability.
Answer: Data observability includes monitoring data quality, freshness, volume, and schema changes. Use CloudWatch, Glue DataBrew, and custom metrics with alerts.
Q19: How do you design for data scalability in a growing organization?
Answer: Use horizontal scaling with S3, serverless with Lambda, containerized processing with ECS/EKS, and distributed processing with EMR/Spark. Implement auto-scaling policies.
Q20: Describe a hybrid analytics architecture.
Answer: Use Direct Connect for hybrid connectivity, DataSync for data transfer, Athena for cloud analytics, and on-premises tools for local processing. Implement unified monitoring.
Q21: How do you implement data versioning in a data lake?
Answer: Use S3 versioning, Glue table versions, and implement CDC patterns. Use Apache Iceberg for table format with time travel queries.
Q22: Explain the trade-offs between different data storage formats.
Answer: Parquet: columnar, efficient for analytics. Avro: row-based, good for serialization. ORC: optimized for Hive. JSON: human-readable but verbose. Choose based on use case.
Q23: How do you handle data lineage in a complex pipeline?
Answer: Use Glue Data Catalog lineage, implement custom tracking with DynamoDB, and tools like AWS Lake Formation. Automate lineage capture with pipeline metadata.
Q24: Describe a real-time dashboard architecture.
Answer: Use Kinesis for ingestion, Lambda for processing, DynamoDB for features, ElastiCache for caching, and QuickSight for visualization. Implement WebSocket for real-time updates.
Q25: How do you design for data portability across clouds?
Answer: Use open formats (Parquet, ORC), standard APIs (S3-compatible), containerized processing (Docker), and infrastructure as code (Terraform). Avoid vendor lock-in.
Key Takeaways
β οΈ
Advanced data architectures require careful planning for scalability, security, and cost. Always design for failure and implement comprehensive monitoring.
- Multi-account strategy: Separate concerns with dedicated AWS accounts
- Hybrid connectivity: Use DataSync and Direct Connect for hybrid sync
- Serverless patterns: Lambda and Step Functions for variable workloads
- Data Mesh: Domain-oriented decentralized data ownership
- Real-time processing: Kinesis and Flink for streaming analytics
- Cost optimization: Auto-scaling and right-sizing resources
- Security: Encryption, fine-grained access, and compliance
- Monitoring: Comprehensive observability across all layers