Advanced Data Architecture Patterns on AWS

Advanced Data Architecture Patterns

Module 64 — Multi-Account Strategies, Hybrid Architectures, and Complex Patterns

Multi-Account Data Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                 AWS MULTI-ACCOUNT DATA STRATEGY                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    MANAGEMENT ACCOUNT                        │  │
│  │  • AWS Organizations  • Service Control Policies            │  │
│  │  • Consolidated Billing  • Cross-Account Roles              │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                     │
│         ┌────────────────────┼────────────────────┐               │
│         │                    │                    │               │
│         ▼                    ▼                    ▼               │
│  ┌──────────┐        ┌──────────┐        ┌──────────┐            │
│  │  SHARED   │        │ DATA     │        │ ANALYTICS│            │
│  │  SERVICES │        │ LAKE     │        │ ACCOUNT  │            │
│  │  ACCOUNT  │        │ ACCOUNT  │        │          │            │
│  │           │        │          │        │          │            │
│  │ • IAM     │        │ • S3     │        │ • Redshift│           │
│  │ • KMS     │        │ • Glue   │        │ • QuickSight│          │
│  │ • Transit │        │ • Lake   │        │ • Athena  │           │
│  │   GW      │        │   Formation│       │ • EMR     │           │
│  │ • Network │        │ • MSK    │        │ • SageMaker│           │
│  └──────────┘        └──────────┘        └──────────┘            │
│         │                    │                    │               │
│         │                    │                    │               │
│         ▼                    ▼                    ▼               │
│  ┌──────────┐        ┌──────────┐        ┌──────────┐            │
│  │ PROD     │        │ DEV/TEST │        │ SECURITY │            │
│  │ DATA     │        │ ACCOUNT  │        │ ACCOUNT  │            │
│  │ ACCOUNT  │        │          │        │          │            │
│  │          │        │          │        │          │            │
│  │ • RDS    │        │ • Copies │        │ • GuardDuty│           │
│  │ • DynamoDB│       │ • Sandboxed│      │ • Security Hub│        │
│  │ • ElastiCache│    │   data    │       │ • Config  │            │
│  └──────────┘        └──────────┘        └──────────┘            │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Cross-Account Data Access Patterns

# Cross-account role assumption for data access
import boto3
import json

class CrossAccountDataAccess:
    def __init__(self, management_account_id, target_account_id):
        self.management_account_id = management_account_id
        self.target_account_id = target_account_id
    
    def assume_role(self, role_name, session_name='DataAccess'):
        """Assume cross-account role"""
        sts = boto3.client('sts')
        
        response = sts.assume_role(
            RoleArn=f'arn:aws:iam::{self.target_account_id}:role/{role_name}',
            RoleSessionName=session_name,
            DurationSeconds=3600,
            Tags=[
                {
                    'Key': 'Purpose',
                    'Value': 'DataEngineeringAccess'
                }
            ]
        )
        
        credentials = response['Credentials']
        return boto3.Session(
            aws_access_key_id=credentials['AccessKeyId'],
            aws_secret_access_key=credentials['SecretAccessKey'],
            aws_session_token=credentials['SessionToken']
        )
    
    def create_cross_account_s3_access(self):
        """Create cross-account S3 access policy"""
        policy = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Sid": "CrossAccountS3Access",
                    "Effect": "Allow",
                    "Principal": {
                        "AWS": f"arn:aws:iam::{self.management_account_id}:root"
                    },
                    "Action": [
                        "s3:GetObject",
                        "s3:PutObject",
                        "s3:ListBucket",
                        "s3:DeleteObject"
                    ],
                    "Resource": [
                        "arn:aws:s3:::data-lake-bucket",
                        "arn:aws:s3:::data-lake-bucket/*"
                    ]
                }
            ]
        }
        
        return policy

# Usage
cross_account = CrossAccountDataAccess('111111111111', '222222222222')
session = cross_account.assume_role('DataEngineeringRole')
s3 = session.client('s3')

Hybrid Cloud Data Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                   HYBRID CLOUD DATA ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ON-PREMISES                         AWS CLOUD                     │
│  ┌──────────────────────┐           ┌──────────────────────┐      │
│  │  Data Center         │           │  VPC                 │      │
│  │                      │           │                      │      │
│  │  ┌──────────────┐   │  Direct   │  ┌──────────────┐   │      │
│  │  │ Oracle DB    │◄──┼──Connect──┼─▶│ Aurora       │   │      │
│  │  │ 50TB         │   │           │  │ PostgreSQL   │   │      │
│  │  └──────────────┘   │           │  └──────────────┘   │      │
│  │                      │           │                      │      │
│  │  ┌──────────────┐   │  VPN      │  ┌──────────────┐   │      │
│  │  │ Data         │◄──┼──────────┼─▶│ S3 Data Lake │   │      │
│  │  │ Warehouse    │   │           │  │              │   │      │
│  │  └──────────────┘   │           │  └──────────────┘   │      │
│  │                      │           │                      │      │
│  │  ┌──────────────┐   │           │  ┌──────────────┐   │      │
│  │  │ File Server  │◄──┼──────────┼─▶│ Snowball Edge│   │      │
│  │  │ 100TB        │   │           │  │              │   │      │
│  │  └──────────────┘   │           │  └──────────────┘   │      │
│  │                      │           │                      │      │
│  └──────────────────────┘           └──────────────────────┘      │
│                                                                     │
│  SYNC METHODS:                                                     │
│  ─────────────                                                     │
│  • AWS DataSync: Managed file transfer                             │
│  • AWS Direct Connect: Dedicated network connection                │
│  • AWS Storage Gateway: Hybrid storage                             │
│  • AWS Snow Family: Offline large-scale transfer                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

DataSync Configuration

# AWS DataSync for hybrid data transfer
import boto3

def create_datasync_task():
    """Create DataSync task for hybrid sync"""
    
    datasync = boto3.client('datasync')
    
    # Create source location (on-premises NFS)
    source_response = datasync.create_location_nfs(
        ServerHostname='onprem-server.example.com',
        Subdirectory='/data/exports',
        OnPremConfig={
            AgentArns=['arn:aws:datasync:us-east-1:111111111111:agent/agent-id']
        },
        MountOptions={
            'Version': 'NFS3'
        }
    )
    
    # Create target location (S3)
    target_response = datasync.create_location_s3(
        S3BucketArn='arn:aws:s3:::data-lake-sync',
        S3Config={
            'BucketAccessRoleArn': 'arn:aws:iam::111111111111:role/DataSyncRole'
        },
        Subdirectory='/sync-data'
    )
    
    # Create DataSync task
    task_response = datasync.create_task(
        SourceLocationArn=source_response['LocationArn'],
        DestinationLocationArn=target_response['LocationArn'],
        Name='OnPremToS3Sync',
        Options={
            'VerifyMode': 'ONLY_FILES_TRANSFERRED',
            'OverwriteMode': 'ALWAYS',
            'PreserveDeletedFiles': 'PRESERVE',
            'PreserveDevices': 'NONE',
            'PosixPermissions': 'NONE',
            'TaskQueuingingMode': 'ENABLED',
            'TransferMode': 'ALL',
            'LogLevel': 'TRANSFER'
        },
        Schedule={
            'ScheduleExpression': 'cron(0 2 * * ? *)'  # Daily at 2 AM
        }
    )
    
    return task_response['TaskArn']

# Monitor DataSync
def monitor_datasync(task_arn):
    """Monitor DataSync task execution"""
    datasync = boto3.client('datasync')
    
    response = datasync.list_task_executions(
        TaskArn=task_arn,
        MaxResults=5
    )
    
    for execution in response['TaskExecutions']:
        print(f"Execution: {execution['TaskExecutionArn']}")
        print(f"Status: {execution['Status']}")
        print(f"Bytes transferred: {execution['BytesTransferred']}")
        print(f"Files transferred: {execution['FilesTransferred']}")
        print(f"Duration: {execution['Duration']}")

Complex ETL Patterns

Lambda-Based Serverless ETL

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│              SERVERLESS ETL WITH LAMBDA & STEP FUNCTIONS            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  S3 Upload     Lambda       Step Function      S3 Processed       │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐  ┌──────────┐      │
│  │Raw Data  │─▶│Trigger   │─▶│Orchestrate   │─▶│Clean Data│      │
│  │Landing   │  │Function  │  │              │  │          │      │
│  └──────────┘  └──────────┘  │  ┌─────────┐│  └──────────┘      │
│                              │  │Validate ││                      │
│  ┌──────────┐  ┌──────────┐  │  │  Data   ││  ┌──────────┐      │
│  │Schema    │─▶│Enrichment│  │  └────┬────┘│  │Redshift  │      │
│  │Registry  │  │Function  │  │       │     │  │Loading   │      │
│  └──────────┘  └──────────┘  │  ┌────▼────┐│  └──────────┘      │
│                              │  │Transform ││                      │
│  ┌──────────┐  ┌──────────┐  │  │  Data   ││  ┌──────────┐      │
│  │Data      │─▶│Quality   │  │  └────┬────┘│  │Analytics │      │
│  │Catalog   │  │Check     │  │       │     │  │Dashboard │      │
│  └──────────┘  └──────────┘  │  ┌────▼────┐│  └──────────┘      │
│                              │  │Load to  ││                      │
│                              │  │  Target ││                      │
│                              │  └─────────┘│                      │
│                              └──────────────┘                      │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Step Functions Orchestration

# Step Functions state machine for complex ETL
import json

step_function_definition = {
    "Comment": "Complex ETL Pipeline",
    "StartAt": "ValidateInput",
    "States": {
        "ValidateInput": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:validate-input",
            "Next": "CheckValidation",
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 3,
                    "BackoffRate": 2
                }
            ],
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "Next": "HandleError"
                }
            ]
        },
        "CheckValidation": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.isValid",
                    "BooleanEquals": True,
                    "Next": "EnrichData"
                }
            ],
            "Default": "NotifyInvalidData"
        },
        "EnrichData": {
            "Type": "Parallel",
            "Next": "TransformData",
            "Branches": [
                {
                    "StartAt": "EnrichGeolocation",
                    "States": {
                        "EnrichGeolocation": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:111111111111:enrich-geo",
                            "End": True
                        }
                    }
                },
                {
                    "StartAt": "EnrichDemographics",
                    "States": {
                        "EnrichDemographics": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:111111111111:enrich-demo",
                            "End": True
                        }
                    }
                }
            ]
        },
        "TransformData": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "transform-data-job",
                "Arguments": {
                    "--input_path.$": "$.outputPath",
                    "--output_path.$": "$.transformedPath"
                }
            },
            "Next": "QualityCheck"
        },
        "QualityCheck": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:quality-check",
            "Next": "LoadToRedshift"
        },
        "LoadToRedshift": {
            "Type": "Task",
            "Resource": "arn:aws:states:::redshift-data:executeSql.waitForTaskToken",
            "Parameters": {
                "ClusterIdentifier": "analytics-cluster",
                "Database": "analytics",
                "Sql": "COPY schema.table FROM 's3://bucket/data' IAM_ROLE 'arn:aws:iam::role/RedshiftRole' FORMAT AS PARQUET"
            },
            "End": True
        },
        "HandleError": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:handle-error",
            "End": True
        },
        "NotifyInvalidData": {
            "Type": "Task",
            "Resource": "arn:aws:sns:us-east-1:111111111111:alerts",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:111111111111:data-alerts",
                "Message": "Invalid data detected in pipeline"
            },
            "End": True
        }
    }
}

# Save definition
with open('etl-state-machine.json', 'w') as f:
    json.dump(step_function_definition, f, indent=2)

Data Mesh on AWS

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                    DATA MESH IMPLEMENTATION ON AWS                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                  PLATFORM TEAM (Central)                    │   │
│  │                                                             │   │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │   │
│  │  │ AWS Lake     │  │ AWS Glue     │  │ AWS IAM      │     │   │
│  │  │ Formation    │  │ DataBrew     │  │ Identity     │     │   │
│  │  │              │  │              │  │ Center       │     │   │
│  │  └──────────────┘  └──────────────┘  └──────────────┘     │   │
│  │                                                             │   │
│  │  Self-Service Data Platform APIs                            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                     │
│         ┌────────────────────┼────────────────────┐               │
│         │                    │                    │               │
│         ▼                    ▼                    ▼               │
│  ┌────────────────────┐ ┌────────────────────┐ ┌───────────────┐ │
│  │  SALES DOMAIN       │ │  MARKETING DOMAIN  │ │ FINANCE DOMAIN│ │
│  │                     │ │                    │ │               │ │
│  │  ┌──────────────┐  │ │  ┌──────────────┐  │ │ ┌───────────┐ │ │
│  │  │ Data Product │  │ │  │ Data Product │  │ │ │Data Product│ │ │
│  │  │ - Customer   │  │ │  │ - Campaign   │  │ │ │- Revenue  │ │ │
│  │  │ - Orders     │  │ │  │ - Leads      │  │ │ │- Forecast │ │ │
│  │  └──────────────┘  │ │  └──────────────┘  │ │ └───────────┘ │ │
│  │                     │ │                    │ │               │ │
│  │  ┌──────────────┐  │ │  ┌──────────────┐  │ │ ┌───────────┐ │ │
│  │  │ S3 Bucket    │  │ │  │ S3 Bucket    │  │ │ │S3 Bucket │ │ │
│  │  │ (sales-data) │  │ │  │ (mktg-data)  │  │ │ │(fin-data)│ │ │
│  │  └──────────────┘  │ │  └──────────────┘  │ │ └───────────┘ │ │
│  │                     │ │                    │ │               │ │
│  │  Domain Team Owns   │ │  Domain Team Owns  │ │Domain Team Owns│ │
│  └────────────────────┘ └────────────────────┘ └───────────────┘ │
│         │                    │                    │               │
│         └────────────────────┼────────────────────┘               │
│                              ▼                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │           DATA CONSUMERS (Cross-Domain)                     │   │
│  │                                                             │   │
│  │  Analytics    ML Engineers    Data Scientists    BI Teams  │   │
│  │  ┌─────────┐  ┌─────────┐     ┌─────────┐      ┌────────┐ │   │
│  │  │Athena   │  │SageMaker│     │Notebooks│      │QuickSight│ │   │
│  │  └─────────┘  └─────────┘     └─────────┘      └────────┘ │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Lake Formation for Data Mesh

# Lake Formation permissions for data mesh domains
import boto3

def setup_lake_formation_permissions():
    """Setup Lake Formation permissions for data mesh"""
    
    lakeformation = boto3.client('lakeformation')
    
    # Grant database permissions to domain teams
    response = lakeformation.grant_permissions(
        Principal={
            'DataLakePrincipalIdentifier': 'arn:aws:iam::111111111111:role/SalesDomainRole'
        },
        Resource={
            'Database': {
                'Name': 'sales_domain_db'
            }
        },
        Permissions=['CREATE_TABLE', 'ALTER', 'DROP'],
        GrantOption=True
    )
    
    # Grant table permissions
    response = lakeformation.grant_permissions(
        Principal={
            'DataLakePrincipalIdentifier': 'arn:aws:iam::111111111111:role/AnalyticsRole'
        },
        Resource={
            'Table': {
                'DatabaseName': 'sales_domain_db',
                'Name': 'customers'
            }
        },
        Permissions=['SELECT', 'DESCRIBE'],
        Conditions=[
            {
                'Expression': {
                    "StringEquals": {
                        "lakeformation:ColumnNames": ["customer_id", "email", "phone"]
                    }
                },
                'Name': 'ColumnFilter'
            }
        ]
    )
    
    return response

# Data sharing between domains
def share_data_between_domains():
    """Share data products between domains"""
    
    glue = boto3.client('glue')
    
    # Create cross-account table version
    response = glue.create_table(
        DatabaseName='sales_domain_db',
        TableInput={
            'Name': 'customers_shared',
            'TargetTable': {
                'DatabaseName': 'marketing_domain_db',
                'Name': 'customer_segments',
                'CatalogId': '222222222222'
            }
        }
    )
    
    return response

Complex Streaming Architectures

Real-Time Analytics Pipeline

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│           REAL-TIME ANALYTICS PIPELINE ARCHITECTURE                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  DATA SOURCES                                                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ IoT      │  │Clickstream│ │Logs      │  │Database  │         │
│  │ Sensors  │  │          │  │          │  │CDC       │         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
│       │              │              │              │                │
│       ▼              ▼              ▼              ▼                │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                    KINESIS DATA STREAMS                     │  │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐      │  │
│  │  │Stream 1 │  │Stream 2 │  │Stream 3 │  │Stream 4 │      │  │
│  │  │IoT Data │  │Clicks   │  │App Logs │  │CDC Data │      │  │
│  │  └─────────┘  └─────────┘  └─────────┘  └─────────┘      │  │
│  └─────────────────────────────────────────────────────────────┘  │
│                              │                                     │
│         ┌────────────────────┼────────────────────┐               │
│         │                    │                    │               │
│         ▼                    ▼                    ▼               │
│  ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐  │
│  │ KINESIS DATA     │ │ KINESIS ANALYTICS│ │ KINESIS FIREHOSE │  │
│  │ ANALYTICS        │ │ (SQL Application)│ │                  │  │
│  │ (Flink)          │ │                  │ │                  │  │
│  │                  │ │ • Window Aggs    │ │ • S3 (Parquet)   │  │
│  │ • State Mgmt     │ │ • Joins          │ │ • Redshift       │  │
│  │ • Complex Events │ │ • Filtering      │ │ • Elasticsearch  │  │
│  │ • ML Inference   │ │ • Patterns       │ │ • OpenSearch     │  │
│  └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘  │
│           │                    │                    │               │
│           ▼                    ▼                    ▼               │
│  ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐  │
│  │ DynamoDB         │ │ Aurora           │ │ S3 Data Lake     │  │
│  │ (Real-time       │ │ (Query Store)    │ │ (Historical      │  │
│  │  Features)       │ │                  │ │  Analytics)      │  │
│  └──────────────────┘ └──────────────────┘ └──────────────────┘  │
│                                                                     │
│  MONITORING: CloudWatch Metrics + Custom Dashboard                │
│  LATENCY: End-to-end < 1 second                                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Kinesis Analytics Flink Application

// Flink SQL for real-time analytics
CREATE TABLE iot_stream (
    device_id VARCHAR,
    temperature DOUBLE,
    humidity DOUBLE,
    timestamp TIMESTAMP(3),
    WATERMARK FOR timestamp AS timestamp - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kinesis',
    'stream' = 'iot-data-stream',
    'aws.region' = 'us-east-1',
    'format' = 'json'
);

CREATE TABLE alert_sink (
    device_id VARCHAR,
    avg_temp DOUBLE,
    alert_type VARCHAR,
    window_start TIMESTAMP(3),
    window_end TIMESTAMP(3)
) WITH (
    'connector' = 'kinesis',
    'stream' = 'alert-stream',
    'aws.region' = 'us-east-1',
    'format' = 'json'
);

-- Windowed aggregation with alerting
INSERT INTO alert_sink
SELECT 
    device_id,
    AVG(temperature) as avg_temp,
    CASE 
        WHEN AVG(temperature) > 50 THEN 'HIGH_TEMP'
        WHEN AVG(temperature) < 10 THEN 'LOW_TEMP'
        ELSE 'NORMAL'
    END as alert_type,
    TUMBLE_START(timestamp, INTERVAL '1' MINUTE) as window_start,
    TUMBLE_END(timestamp, INTERVAL '1' MINUTE) as window_end
FROM iot_stream
GROUP BY 
    device_id,
    TUMBLE(timestamp, INTERVAL '1' MINUTE)
HAVING AVG(temperature) > 50 OR AVG(temperature) < 10;

Interview Questions & Answers

Q1: How do you design a multi-account data architecture for a large enterprise?

Answer: Use AWS Organizations with separate accounts for shared services, data lake, analytics, production, dev/test, and security. Implement SCPs for governance, cross-account roles for access, and centralized logging. Use Lake Formation for fine-grained access control.

Q2: Explain the trade-offs between centralized and decentralized data architectures.

Answer: Centralized: easier governance, single source of truth, but can become bottleneck. Decentralized (Data Mesh): domain ownership, scalability, but complex governance. Choose based on organization size, data diversity, and team structure.

Q3: How do you handle data consistency in a hybrid cloud environment?

Answer: Use AWS DataSync for scheduled transfers, implement conflict resolution, use eventual consistency where acceptable, and strong consistency with DynamoDB Global Tables for critical data. Monitor sync status with CloudWatch.

Q4: Describe a serverless ETL architecture for variable workloads.

Answer: Use S3 event notifications to trigger Lambda, Step Functions for orchestration, Glue for data processing, and Kinesis for streaming. Benefits: auto-scaling, pay-per-use, no infrastructure management.

Q5: How do you implement data quality checks in a streaming pipeline?

Answer: Use Kinesis Analytics for real-time validation, Lambda for custom checks, CloudWatch for monitoring. Implement dead-letter queues for failed records, and alerts for quality violations.

Q6: Explain the Data Mesh pattern and how to implement it on AWS.

Answer: Data Mesh is decentralized data ownership. On AWS: use Lake Formation for permissions, Glue for catalog, S3 for storage per domain, and cross-account sharing. Platform team provides self-service tools.

Q7: How do you optimize costs in a complex data architecture?

Answer: Use S3 Intelligent-Tiering, Redshift Reserved Instances, Lambda for variable workloads, spot instances for batch, and auto-scaling. Monitor with Cost Explorer and implement tagging.

Q8: Describe a disaster recovery strategy for a data lake.

Answer: Use S3 Cross-Region Replication, Aurora Global Tables, Redshift snapshots to S3, and automated backups with AWS Backup. Test recovery regularly and implement RTO/RPO monitoring.

Q9: How do you handle schema evolution in a data lake?

Answer: Use Glue Schema Registry, implement versioning in S3, use Avro/Parquet with schema evolution, and Lambda for schema validation. Maintain backward compatibility with consumer contracts.

Q10: Explain the Lambda architecture vs Kappa architecture for data processing.

Answer: Lambda: batch + speed layers for complex processing. Kappa: single streaming layer for simplicity. On AWS: Lambda architecture uses EMR + Kinesis, Kappa uses Kinesis Analytics.

Q11: How do you implement real-time feature engineering for ML pipelines?

Answer: Use Kinesis Data Analytics for streaming features, SageMaker Feature Store for offline/online features, and Step Functions for orchestration. Implement feature versioning and monitoring.

Q12: Describe a multi-region data replication strategy.

Answer: Use DynamoDB Global Tables, Aurora Global Database, S3 Cross-Region Replication, and Route 53 for routing. Implement conflict resolution and monitor replication lag.

Q13: How do you handle data governance in a Data Mesh architecture?

Answer: Centralized governance policies with domain-specific implementation. Use Lake Formation for permissions, Glue Data Catalog for metadata, and AWS Config for compliance.

Q14: Explain the concept of data contracts and how to implement them.

Answer: Data contracts define data format, quality, and SLAs. Implement using Glue Schema Registry, API Gateway for data services, and automated validation in pipelines.

Q15: How do you optimize query performance in a data lake?

Answer: Use partitioning, bucketing, columnar formats (Parquet/ORC), caching with Athena结果缓存, and Redshift Spectrum for federated queries. Implement data compaction.

Q16: Describe a CI/CD pipeline for data engineering.

Answer: Use CodeCommit for version control, CodeBuild for testing, CodePipeline for deployment, and CloudFormation for infrastructure. Implement data pipeline testing with synthetic data.

Q17: How do you handle sensitive data in a data lake?

Answer: Use KMS for encryption, Lake Formation for column-level permissions, Macie for discovery, and DLP policies. Implement data masking and tokenization.

Q18: Explain the concept of data observability.

Answer: Data observability includes monitoring data quality, freshness, volume, and schema changes. Use CloudWatch, Glue DataBrew, and custom metrics with alerts.

Q19: How do you design for data scalability in a growing organization?

Answer: Use horizontal scaling with S3, serverless with Lambda, containerized processing with ECS/EKS, and distributed processing with EMR/Spark. Implement auto-scaling policies.

Q20: Describe a hybrid analytics architecture.

Answer: Use Direct Connect for hybrid connectivity, DataSync for data transfer, Athena for cloud analytics, and on-premises tools for local processing. Implement unified monitoring.

Q21: How do you implement data versioning in a data lake?

Answer: Use S3 versioning, Glue table versions, and implement CDC patterns. Use Apache Iceberg for table format with time travel queries.

Q22: Explain the trade-offs between different data storage formats.

Answer: Parquet: columnar, efficient for analytics. Avro: row-based, good for serialization. ORC: optimized for Hive. JSON: human-readable but verbose. Choose based on use case.

Q23: How do you handle data lineage in a complex pipeline?

Answer: Use Glue Data Catalog lineage, implement custom tracking with DynamoDB, and tools like AWS Lake Formation. Automate lineage capture with pipeline metadata.

Q24: Describe a real-time dashboard architecture.

Answer: Use Kinesis for ingestion, Lambda for processing, DynamoDB for features, ElastiCache for caching, and QuickSight for visualization. Implement WebSocket for real-time updates.

Q25: How do you design for data portability across clouds?

Answer: Use open formats (Parquet, ORC), standard APIs (S3-compatible), containerized processing (Docker), and infrastructure as code (Terraform). Avoid vendor lock-in.

Key Takeaways

⚠️

Advanced data architectures require careful planning for scalability, security, and cost. Always design for failure and implement comprehensive monitoring.

Multi-account strategy: Separate concerns with dedicated AWS accounts
Hybrid connectivity: Use DataSync and Direct Connect for hybrid sync
Serverless patterns: Lambda and Step Functions for variable workloads
Data Mesh: Domain-oriented decentralized data ownership
Real-time processing: Kinesis and Flink for streaming analytics
Cost optimization: Auto-scaling and right-sizing resources
Security: Encryption, fine-grained access, and compliance
Monitoring: Comprehensive observability across all layers