πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Advanced Data Architecture Patterns on AWS

Interview Q&AAdvanced Architecture Patterns⭐ Premium

Advertisement

Advanced Data Architecture Patterns on AWS

Advanced Data Architecture Patterns

Module 64 β€” Multi-Account Strategies, Hybrid Architectures, and Complex Patterns

Multi-Account Data Architecture

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 AWS MULTI-ACCOUNT DATA STRATEGY                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚                    MANAGEMENT ACCOUNT                        β”‚  β”‚
β”‚  β”‚  β€’ AWS Organizations  β€’ Service Control Policies            β”‚  β”‚
β”‚  β”‚  β€’ Consolidated Billing  β€’ Cross-Account Roles              β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                              β”‚                                     β”‚
β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚
β”‚         β”‚                    β”‚                    β”‚               β”‚
β”‚         β–Ό                    β–Ό                    β–Ό               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚  SHARED   β”‚        β”‚ DATA     β”‚        β”‚ ANALYTICSβ”‚            β”‚
β”‚  β”‚  SERVICES β”‚        β”‚ LAKE     β”‚        β”‚ ACCOUNT  β”‚            β”‚
β”‚  β”‚  ACCOUNT  β”‚        β”‚ ACCOUNT  β”‚        β”‚          β”‚            β”‚
β”‚  β”‚           β”‚        β”‚          β”‚        β”‚          β”‚            β”‚
β”‚  β”‚ β€’ IAM     β”‚        β”‚ β€’ S3     β”‚        β”‚ β€’ Redshiftβ”‚           β”‚
β”‚  β”‚ β€’ KMS     β”‚        β”‚ β€’ Glue   β”‚        β”‚ β€’ QuickSightβ”‚          β”‚
β”‚  β”‚ β€’ Transit β”‚        β”‚ β€’ Lake   β”‚        β”‚ β€’ Athena  β”‚           β”‚
β”‚  β”‚   GW      β”‚        β”‚   Formationβ”‚       β”‚ β€’ EMR     β”‚           β”‚
β”‚  β”‚ β€’ Network β”‚        β”‚ β€’ MSK    β”‚        β”‚ β€’ SageMakerβ”‚           β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚         β”‚                    β”‚                    β”‚               β”‚
β”‚         β”‚                    β”‚                    β”‚               β”‚
β”‚         β–Ό                    β–Ό                    β–Ό               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚ PROD     β”‚        β”‚ DEV/TEST β”‚        β”‚ SECURITY β”‚            β”‚
β”‚  β”‚ DATA     β”‚        β”‚ ACCOUNT  β”‚        β”‚ ACCOUNT  β”‚            β”‚
β”‚  β”‚ ACCOUNT  β”‚        β”‚          β”‚        β”‚          β”‚            β”‚
β”‚  β”‚          β”‚        β”‚          β”‚        β”‚          β”‚            β”‚
β”‚  β”‚ β€’ RDS    β”‚        β”‚ β€’ Copies β”‚        β”‚ β€’ GuardDutyβ”‚           β”‚
β”‚  β”‚ β€’ DynamoDBβ”‚       β”‚ β€’ Sandboxedβ”‚      β”‚ β€’ Security Hubβ”‚        β”‚
β”‚  β”‚ β€’ ElastiCacheβ”‚    β”‚   data    β”‚       β”‚ β€’ Config  β”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚                                                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Cross-Account Data Access Patterns

# Cross-account role assumption for data access
import boto3
import json

class CrossAccountDataAccess:
    def __init__(self, management_account_id, target_account_id):
        self.management_account_id = management_account_id
        self.target_account_id = target_account_id
    
    def assume_role(self, role_name, session_name='DataAccess'):
        """Assume cross-account role"""
        sts = boto3.client('sts')
        
        response = sts.assume_role(
            RoleArn=f'arn:aws:iam::{self.target_account_id}:role/{role_name}',
            RoleSessionName=session_name,
            DurationSeconds=3600,
            Tags=[
                {
                    'Key': 'Purpose',
                    'Value': 'DataEngineeringAccess'
                }
            ]
        )
        
        credentials = response['Credentials']
        return boto3.Session(
            aws_access_key_id=credentials['AccessKeyId'],
            aws_secret_access_key=credentials['SecretAccessKey'],
            aws_session_token=credentials['SessionToken']
        )
    
    def create_cross_account_s3_access(self):
        """Create cross-account S3 access policy"""
        policy = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Sid": "CrossAccountS3Access",
                    "Effect": "Allow",
                    "Principal": {
                        "AWS": f"arn:aws:iam::{self.management_account_id}:root"
                    },
                    "Action": [
                        "s3:GetObject",
                        "s3:PutObject",
                        "s3:ListBucket",
                        "s3:DeleteObject"
                    ],
                    "Resource": [
                        "arn:aws:s3:::data-lake-bucket",
                        "arn:aws:s3:::data-lake-bucket/*"
                    ]
                }
            ]
        }
        
        return policy

# Usage
cross_account = CrossAccountDataAccess('111111111111', '222222222222')
session = cross_account.assume_role('DataEngineeringRole')
s3 = session.client('s3')

Hybrid Cloud Data Architecture

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   HYBRID CLOUD DATA ARCHITECTURE                   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  ON-PREMISES                         AWS CLOUD                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚  Data Center         β”‚           β”‚  VPC                 β”‚      β”‚
β”‚  β”‚                      β”‚           β”‚                      β”‚      β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚  Direct   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚      β”‚
β”‚  β”‚  β”‚ Oracle DB    │◄──┼──Connect──┼─▢│ Aurora       β”‚   β”‚      β”‚
β”‚  β”‚  β”‚ 50TB         β”‚   β”‚           β”‚  β”‚ PostgreSQL   β”‚   β”‚      β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚           β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚      β”‚
β”‚  β”‚                      β”‚           β”‚                      β”‚      β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚  VPN      β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚      β”‚
β”‚  β”‚  β”‚ Data         │◄──┼──────────┼─▢│ S3 Data Lake β”‚   β”‚      β”‚
β”‚  β”‚  β”‚ Warehouse    β”‚   β”‚           β”‚  β”‚              β”‚   β”‚      β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚           β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚      β”‚
β”‚  β”‚                      β”‚           β”‚                      β”‚      β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚           β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚      β”‚
β”‚  β”‚  β”‚ File Server  │◄──┼──────────┼─▢│ Snowball Edgeβ”‚   β”‚      β”‚
β”‚  β”‚  β”‚ 100TB        β”‚   β”‚           β”‚  β”‚              β”‚   β”‚      β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚           β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚      β”‚
β”‚  β”‚                      β”‚           β”‚                      β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                                                                     β”‚
β”‚  SYNC METHODS:                                                     β”‚
β”‚  ─────────────                                                     β”‚
β”‚  β€’ AWS DataSync: Managed file transfer                             β”‚
β”‚  β€’ AWS Direct Connect: Dedicated network connection                β”‚
β”‚  β€’ AWS Storage Gateway: Hybrid storage                             β”‚
β”‚  β€’ AWS Snow Family: Offline large-scale transfer                   β”‚
β”‚                                                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

DataSync Configuration

# AWS DataSync for hybrid data transfer
import boto3

def create_datasync_task():
    """Create DataSync task for hybrid sync"""
    
    datasync = boto3.client('datasync')
    
    # Create source location (on-premises NFS)
    source_response = datasync.create_location_nfs(
        ServerHostname='onprem-server.example.com',
        Subdirectory='/data/exports',
        OnPremConfig={
            AgentArns=['arn:aws:datasync:us-east-1:111111111111:agent/agent-id']
        },
        MountOptions={
            'Version': 'NFS3'
        }
    )
    
    # Create target location (S3)
    target_response = datasync.create_location_s3(
        S3BucketArn='arn:aws:s3:::data-lake-sync',
        S3Config={
            'BucketAccessRoleArn': 'arn:aws:iam::111111111111:role/DataSyncRole'
        },
        Subdirectory='/sync-data'
    )
    
    # Create DataSync task
    task_response = datasync.create_task(
        SourceLocationArn=source_response['LocationArn'],
        DestinationLocationArn=target_response['LocationArn'],
        Name='OnPremToS3Sync',
        Options={
            'VerifyMode': 'ONLY_FILES_TRANSFERRED',
            'OverwriteMode': 'ALWAYS',
            'PreserveDeletedFiles': 'PRESERVE',
            'PreserveDevices': 'NONE',
            'PosixPermissions': 'NONE',
            'TaskQueuingingMode': 'ENABLED',
            'TransferMode': 'ALL',
            'LogLevel': 'TRANSFER'
        },
        Schedule={
            'ScheduleExpression': 'cron(0 2 * * ? *)'  # Daily at 2 AM
        }
    )
    
    return task_response['TaskArn']

# Monitor DataSync
def monitor_datasync(task_arn):
    """Monitor DataSync task execution"""
    datasync = boto3.client('datasync')
    
    response = datasync.list_task_executions(
        TaskArn=task_arn,
        MaxResults=5
    )
    
    for execution in response['TaskExecutions']:
        print(f"Execution: {execution['TaskExecutionArn']}")
        print(f"Status: {execution['Status']}")
        print(f"Bytes transferred: {execution['BytesTransferred']}")
        print(f"Files transferred: {execution['FilesTransferred']}")
        print(f"Duration: {execution['Duration']}")

Complex ETL Patterns

Lambda-Based Serverless ETL

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              SERVERLESS ETL WITH LAMBDA & STEP FUNCTIONS            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  S3 Upload     Lambda       Step Function      S3 Processed       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚Raw Data  │─▢│Trigger   │─▢│Orchestrate   │─▢│Clean Dataβ”‚      β”‚
β”‚  β”‚Landing   β”‚  β”‚Function  β”‚  β”‚              β”‚  β”‚          β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                              β”‚  β”‚Validate β”‚β”‚                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚  Data   β”‚β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚Schema    │─▢│Enrichmentβ”‚  β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜β”‚  β”‚Redshift  β”‚      β”‚
β”‚  β”‚Registry  β”‚  β”‚Function  β”‚  β”‚       β”‚     β”‚  β”‚Loading   β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                              β”‚  β”‚Transform β”‚β”‚                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚  Data   β”‚β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚Data      │─▢│Quality   β”‚  β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜β”‚  β”‚Analytics β”‚      β”‚
β”‚  β”‚Catalog   β”‚  β”‚Check     β”‚  β”‚       β”‚     β”‚  β”‚Dashboard β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                              β”‚  β”‚Load to  β”‚β”‚                      β”‚
β”‚                              β”‚  β”‚  Target β”‚β”‚                      β”‚
β”‚                              β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚                      β”‚
β”‚                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                      β”‚
β”‚                                                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step Functions Orchestration

# Step Functions state machine for complex ETL
import json

step_function_definition = {
    "Comment": "Complex ETL Pipeline",
    "StartAt": "ValidateInput",
    "States": {
        "ValidateInput": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:validate-input",
            "Next": "CheckValidation",
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 3,
                    "BackoffRate": 2
                }
            ],
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "Next": "HandleError"
                }
            ]
        },
        "CheckValidation": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.isValid",
                    "BooleanEquals": True,
                    "Next": "EnrichData"
                }
            ],
            "Default": "NotifyInvalidData"
        },
        "EnrichData": {
            "Type": "Parallel",
            "Next": "TransformData",
            "Branches": [
                {
                    "StartAt": "EnrichGeolocation",
                    "States": {
                        "EnrichGeolocation": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:111111111111:enrich-geo",
                            "End": True
                        }
                    }
                },
                {
                    "StartAt": "EnrichDemographics",
                    "States": {
                        "EnrichDemographics": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:111111111111:enrich-demo",
                            "End": True
                        }
                    }
                }
            ]
        },
        "TransformData": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "transform-data-job",
                "Arguments": {
                    "--input_path.$": "$.outputPath",
                    "--output_path.$": "$.transformedPath"
                }
            },
            "Next": "QualityCheck"
        },
        "QualityCheck": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:quality-check",
            "Next": "LoadToRedshift"
        },
        "LoadToRedshift": {
            "Type": "Task",
            "Resource": "arn:aws:states:::redshift-data:executeSql.waitForTaskToken",
            "Parameters": {
                "ClusterIdentifier": "analytics-cluster",
                "Database": "analytics",
                "Sql": "COPY schema.table FROM 's3://bucket/data' IAM_ROLE 'arn:aws:iam::role/RedshiftRole' FORMAT AS PARQUET"
            },
            "End": True
        },
        "HandleError": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:handle-error",
            "End": True
        },
        "NotifyInvalidData": {
            "Type": "Task",
            "Resource": "arn:aws:sns:us-east-1:111111111111:alerts",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:111111111111:data-alerts",
                "Message": "Invalid data detected in pipeline"
            },
            "End": True
        }
    }
}

# Save definition
with open('etl-state-machine.json', 'w') as f:
    json.dump(step_function_definition, f, indent=2)

Data Mesh on AWS

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    DATA MESH IMPLEMENTATION ON AWS                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                  PLATFORM TEAM (Central)                    β”‚   β”‚
β”‚  β”‚                                                             β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚   β”‚
β”‚  β”‚  β”‚ AWS Lake     β”‚  β”‚ AWS Glue     β”‚  β”‚ AWS IAM      β”‚     β”‚   β”‚
β”‚  β”‚  β”‚ Formation    β”‚  β”‚ DataBrew     β”‚  β”‚ Identity     β”‚     β”‚   β”‚
β”‚  β”‚  β”‚              β”‚  β”‚              β”‚  β”‚ Center       β”‚     β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚   β”‚
β”‚  β”‚                                                             β”‚   β”‚
β”‚  β”‚  Self-Service Data Platform APIs                            β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                              β”‚                                     β”‚
β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚
β”‚         β”‚                    β”‚                    β”‚               β”‚
β”‚         β–Ό                    β–Ό                    β–Ό               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚  SALES DOMAIN       β”‚ β”‚  MARKETING DOMAIN  β”‚ β”‚ FINANCE DOMAINβ”‚ β”‚
β”‚  β”‚                     β”‚ β”‚                    β”‚ β”‚               β”‚ β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚ β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚
β”‚  β”‚  β”‚ Data Product β”‚  β”‚ β”‚  β”‚ Data Product β”‚  β”‚ β”‚ β”‚Data Productβ”‚ β”‚ β”‚
β”‚  β”‚  β”‚ - Customer   β”‚  β”‚ β”‚  β”‚ - Campaign   β”‚  β”‚ β”‚ β”‚- Revenue  β”‚ β”‚ β”‚
β”‚  β”‚  β”‚ - Orders     β”‚  β”‚ β”‚  β”‚ - Leads      β”‚  β”‚ β”‚ β”‚- Forecast β”‚ β”‚ β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
β”‚  β”‚                     β”‚ β”‚                    β”‚ β”‚               β”‚ β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚ β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚
β”‚  β”‚  β”‚ S3 Bucket    β”‚  β”‚ β”‚  β”‚ S3 Bucket    β”‚  β”‚ β”‚ β”‚S3 Bucket β”‚ β”‚ β”‚
β”‚  β”‚  β”‚ (sales-data) β”‚  β”‚ β”‚  β”‚ (mktg-data)  β”‚  β”‚ β”‚ β”‚(fin-data)β”‚ β”‚ β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
β”‚  β”‚                     β”‚ β”‚                    β”‚ β”‚               β”‚ β”‚
β”‚  β”‚  Domain Team Owns   β”‚ β”‚  Domain Team Owns  β”‚ β”‚Domain Team Ownsβ”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚         β”‚                    β”‚                    β”‚               β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
β”‚                              β–Ό                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚           DATA CONSUMERS (Cross-Domain)                     β”‚   β”‚
β”‚  β”‚                                                             β”‚   β”‚
β”‚  β”‚  Analytics    ML Engineers    Data Scientists    BI Teams  β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”‚   β”‚
β”‚  β”‚  β”‚Athena   β”‚  β”‚SageMakerβ”‚     β”‚Notebooksβ”‚      β”‚QuickSightβ”‚ β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Lake Formation for Data Mesh

# Lake Formation permissions for data mesh domains
import boto3

def setup_lake_formation_permissions():
    """Setup Lake Formation permissions for data mesh"""
    
    lakeformation = boto3.client('lakeformation')
    
    # Grant database permissions to domain teams
    response = lakeformation.grant_permissions(
        Principal={
            'DataLakePrincipalIdentifier': 'arn:aws:iam::111111111111:role/SalesDomainRole'
        },
        Resource={
            'Database': {
                'Name': 'sales_domain_db'
            }
        },
        Permissions=['CREATE_TABLE', 'ALTER', 'DROP'],
        GrantOption=True
    )
    
    # Grant table permissions
    response = lakeformation.grant_permissions(
        Principal={
            'DataLakePrincipalIdentifier': 'arn:aws:iam::111111111111:role/AnalyticsRole'
        },
        Resource={
            'Table': {
                'DatabaseName': 'sales_domain_db',
                'Name': 'customers'
            }
        },
        Permissions=['SELECT', 'DESCRIBE'],
        Conditions=[
            {
                'Expression': {
                    "StringEquals": {
                        "lakeformation:ColumnNames": ["customer_id", "email", "phone"]
                    }
                },
                'Name': 'ColumnFilter'
            }
        ]
    )
    
    return response

# Data sharing between domains
def share_data_between_domains():
    """Share data products between domains"""
    
    glue = boto3.client('glue')
    
    # Create cross-account table version
    response = glue.create_table(
        DatabaseName='sales_domain_db',
        TableInput={
            'Name': 'customers_shared',
            'TargetTable': {
                'DatabaseName': 'marketing_domain_db',
                'Name': 'customer_segments',
                'CatalogId': '222222222222'
            }
        }
    )
    
    return response

Complex Streaming Architectures

Real-Time Analytics Pipeline

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           REAL-TIME ANALYTICS PIPELINE ARCHITECTURE                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  DATA SOURCES                                                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚
β”‚  β”‚ IoT      β”‚  β”‚Clickstreamβ”‚ β”‚Logs      β”‚  β”‚Database  β”‚         β”‚
β”‚  β”‚ Sensors  β”‚  β”‚          β”‚  β”‚          β”‚  β”‚CDC       β”‚         β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜         β”‚
β”‚       β”‚              β”‚              β”‚              β”‚                β”‚
β”‚       β–Ό              β–Ό              β–Ό              β–Ό                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚                    KINESIS DATA STREAMS                     β”‚  β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚  β”‚
β”‚  β”‚  β”‚Stream 1 β”‚  β”‚Stream 2 β”‚  β”‚Stream 3 β”‚  β”‚Stream 4 β”‚      β”‚  β”‚
β”‚  β”‚  β”‚IoT Data β”‚  β”‚Clicks   β”‚  β”‚App Logs β”‚  β”‚CDC Data β”‚      β”‚  β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                              β”‚                                     β”‚
β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚
β”‚         β”‚                    β”‚                    β”‚               β”‚
β”‚         β–Ό                    β–Ό                    β–Ό               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ KINESIS DATA     β”‚ β”‚ KINESIS ANALYTICSβ”‚ β”‚ KINESIS FIREHOSE β”‚  β”‚
β”‚  β”‚ ANALYTICS        β”‚ β”‚ (SQL Application)β”‚ β”‚                  β”‚  β”‚
β”‚  β”‚ (Flink)          β”‚ β”‚                  β”‚ β”‚                  β”‚  β”‚
β”‚  β”‚                  β”‚ β”‚ β€’ Window Aggs    β”‚ β”‚ β€’ S3 (Parquet)   β”‚  β”‚
β”‚  β”‚ β€’ State Mgmt     β”‚ β”‚ β€’ Joins          β”‚ β”‚ β€’ Redshift       β”‚  β”‚
β”‚  β”‚ β€’ Complex Events β”‚ β”‚ β€’ Filtering      β”‚ β”‚ β€’ Elasticsearch  β”‚  β”‚
β”‚  β”‚ β€’ ML Inference   β”‚ β”‚ β€’ Patterns       β”‚ β”‚ β€’ OpenSearch     β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚           β”‚                    β”‚                    β”‚               β”‚
β”‚           β–Ό                    β–Ό                    β–Ό               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ DynamoDB         β”‚ β”‚ Aurora           β”‚ β”‚ S3 Data Lake     β”‚  β”‚
β”‚  β”‚ (Real-time       β”‚ β”‚ (Query Store)    β”‚ β”‚ (Historical      β”‚  β”‚
β”‚  β”‚  Features)       β”‚ β”‚                  β”‚ β”‚  Analytics)      β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                                                     β”‚
β”‚  MONITORING: CloudWatch Metrics + Custom Dashboard                β”‚
β”‚  LATENCY: End-to-end < 1 second                                   β”‚
β”‚                                                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Kinesis Analytics Flink Application

// Flink SQL for real-time analytics
CREATE TABLE iot_stream (
    device_id VARCHAR,
    temperature DOUBLE,
    humidity DOUBLE,
    timestamp TIMESTAMP(3),
    WATERMARK FOR timestamp AS timestamp - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kinesis',
    'stream' = 'iot-data-stream',
    'aws.region' = 'us-east-1',
    'format' = 'json'
);

CREATE TABLE alert_sink (
    device_id VARCHAR,
    avg_temp DOUBLE,
    alert_type VARCHAR,
    window_start TIMESTAMP(3),
    window_end TIMESTAMP(3)
) WITH (
    'connector' = 'kinesis',
    'stream' = 'alert-stream',
    'aws.region' = 'us-east-1',
    'format' = 'json'
);

-- Windowed aggregation with alerting
INSERT INTO alert_sink
SELECT 
    device_id,
    AVG(temperature) as avg_temp,
    CASE 
        WHEN AVG(temperature) > 50 THEN 'HIGH_TEMP'
        WHEN AVG(temperature) < 10 THEN 'LOW_TEMP'
        ELSE 'NORMAL'
    END as alert_type,
    TUMBLE_START(timestamp, INTERVAL '1' MINUTE) as window_start,
    TUMBLE_END(timestamp, INTERVAL '1' MINUTE) as window_end
FROM iot_stream
GROUP BY 
    device_id,
    TUMBLE(timestamp, INTERVAL '1' MINUTE)
HAVING AVG(temperature) > 50 OR AVG(temperature) < 10;

Interview Questions & Answers

Q1: How do you design a multi-account data architecture for a large enterprise?

Answer: Use AWS Organizations with separate accounts for shared services, data lake, analytics, production, dev/test, and security. Implement SCPs for governance, cross-account roles for access, and centralized logging. Use Lake Formation for fine-grained access control.

Q2: Explain the trade-offs between centralized and decentralized data architectures.

Answer: Centralized: easier governance, single source of truth, but can become bottleneck. Decentralized (Data Mesh): domain ownership, scalability, but complex governance. Choose based on organization size, data diversity, and team structure.

Q3: How do you handle data consistency in a hybrid cloud environment?

Answer: Use AWS DataSync for scheduled transfers, implement conflict resolution, use eventual consistency where acceptable, and strong consistency with DynamoDB Global Tables for critical data. Monitor sync status with CloudWatch.

Q4: Describe a serverless ETL architecture for variable workloads.

Answer: Use S3 event notifications to trigger Lambda, Step Functions for orchestration, Glue for data processing, and Kinesis for streaming. Benefits: auto-scaling, pay-per-use, no infrastructure management.

Q5: How do you implement data quality checks in a streaming pipeline?

Answer: Use Kinesis Analytics for real-time validation, Lambda for custom checks, CloudWatch for monitoring. Implement dead-letter queues for failed records, and alerts for quality violations.

Q6: Explain the Data Mesh pattern and how to implement it on AWS.

Answer: Data Mesh is decentralized data ownership. On AWS: use Lake Formation for permissions, Glue for catalog, S3 for storage per domain, and cross-account sharing. Platform team provides self-service tools.

Q7: How do you optimize costs in a complex data architecture?

Answer: Use S3 Intelligent-Tiering, Redshift Reserved Instances, Lambda for variable workloads, spot instances for batch, and auto-scaling. Monitor with Cost Explorer and implement tagging.

Q8: Describe a disaster recovery strategy for a data lake.

Answer: Use S3 Cross-Region Replication, Aurora Global Tables, Redshift snapshots to S3, and automated backups with AWS Backup. Test recovery regularly and implement RTO/RPO monitoring.

Q9: How do you handle schema evolution in a data lake?

Answer: Use Glue Schema Registry, implement versioning in S3, use Avro/Parquet with schema evolution, and Lambda for schema validation. Maintain backward compatibility with consumer contracts.

Q10: Explain the Lambda architecture vs Kappa architecture for data processing.

Answer: Lambda: batch + speed layers for complex processing. Kappa: single streaming layer for simplicity. On AWS: Lambda architecture uses EMR + Kinesis, Kappa uses Kinesis Analytics.

Q11: How do you implement real-time feature engineering for ML pipelines?

Answer: Use Kinesis Data Analytics for streaming features, SageMaker Feature Store for offline/online features, and Step Functions for orchestration. Implement feature versioning and monitoring.

Q12: Describe a multi-region data replication strategy.

Answer: Use DynamoDB Global Tables, Aurora Global Database, S3 Cross-Region Replication, and Route 53 for routing. Implement conflict resolution and monitor replication lag.

Q13: How do you handle data governance in a Data Mesh architecture?

Answer: Centralized governance policies with domain-specific implementation. Use Lake Formation for permissions, Glue Data Catalog for metadata, and AWS Config for compliance.

Q14: Explain the concept of data contracts and how to implement them.

Answer: Data contracts define data format, quality, and SLAs. Implement using Glue Schema Registry, API Gateway for data services, and automated validation in pipelines.

Q15: How do you optimize query performance in a data lake?

Answer: Use partitioning, bucketing, columnar formats (Parquet/ORC), caching with Athenaη»“ζžœηΌ“ε­˜, and Redshift Spectrum for federated queries. Implement data compaction.

Q16: Describe a CI/CD pipeline for data engineering.

Answer: Use CodeCommit for version control, CodeBuild for testing, CodePipeline for deployment, and CloudFormation for infrastructure. Implement data pipeline testing with synthetic data.

Q17: How do you handle sensitive data in a data lake?

Answer: Use KMS for encryption, Lake Formation for column-level permissions, Macie for discovery, and DLP policies. Implement data masking and tokenization.

Q18: Explain the concept of data observability.

Answer: Data observability includes monitoring data quality, freshness, volume, and schema changes. Use CloudWatch, Glue DataBrew, and custom metrics with alerts.

Q19: How do you design for data scalability in a growing organization?

Answer: Use horizontal scaling with S3, serverless with Lambda, containerized processing with ECS/EKS, and distributed processing with EMR/Spark. Implement auto-scaling policies.

Q20: Describe a hybrid analytics architecture.

Answer: Use Direct Connect for hybrid connectivity, DataSync for data transfer, Athena for cloud analytics, and on-premises tools for local processing. Implement unified monitoring.

Q21: How do you implement data versioning in a data lake?

Answer: Use S3 versioning, Glue table versions, and implement CDC patterns. Use Apache Iceberg for table format with time travel queries.

Q22: Explain the trade-offs between different data storage formats.

Answer: Parquet: columnar, efficient for analytics. Avro: row-based, good for serialization. ORC: optimized for Hive. JSON: human-readable but verbose. Choose based on use case.

Q23: How do you handle data lineage in a complex pipeline?

Answer: Use Glue Data Catalog lineage, implement custom tracking with DynamoDB, and tools like AWS Lake Formation. Automate lineage capture with pipeline metadata.

Q24: Describe a real-time dashboard architecture.

Answer: Use Kinesis for ingestion, Lambda for processing, DynamoDB for features, ElastiCache for caching, and QuickSight for visualization. Implement WebSocket for real-time updates.

Q25: How do you design for data portability across clouds?

Answer: Use open formats (Parquet, ORC), standard APIs (S3-compatible), containerized processing (Docker), and infrastructure as code (Terraform). Avoid vendor lock-in.

Key Takeaways

⚠️

Advanced data architectures require careful planning for scalability, security, and cost. Always design for failure and implement comprehensive monitoring.

  1. Multi-account strategy: Separate concerns with dedicated AWS accounts
  2. Hybrid connectivity: Use DataSync and Direct Connect for hybrid sync
  3. Serverless patterns: Lambda and Step Functions for variable workloads
  4. Data Mesh: Domain-oriented decentralized data ownership
  5. Real-time processing: Kinesis and Flink for streaming analytics
  6. Cost optimization: Auto-scaling and right-sizing resources
  7. Security: Encryption, fine-grained access, and compliance
  8. Monitoring: Comprehensive observability across all layers

Advertisement