Multi-Region Active-Active Deployment Patterns
Difficulty: Principal/Staff Level | Companies: Netflix, Amazon, Google, Cloudflare, Akamai
Interview Question
"Design a multi-region active-active deployment for a global e-commerce platform serving 500M+ users. How do you handle data consistency, conflict resolution, and failover?"
ℹ️ Key Concepts
This question tests your understanding of global distributed systems, consistency models, and disaster recovery at planetary scale.
Complete Multi-Region Architecture
Global Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│               MULTI-REGION ACTIVE-ACTIVE ARCHITECTURE               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────────────── GLOBAL LAYER ─────────────────────────┐  │
│  │                                                               │  │
│  │    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐    │  │
│  │    │   Route53   │     │ CloudFront  │     │   Global    │    │  │
│  │    │    (DNS)    │     │    (CDN)    │     │ Accelerator │    │  │
│  │    └──────┬──────┘     └──────┬──────┘     └──────┬──────┘    │  │
│  │           │                   │                   │           │  │
│  └───────────┼───────────────────┼───────────────────┼───────────┘  │
│              │                   │                   │              │
│  ┌───────────▼───────────────────▼───────────────────▼───────────┐  │
│  │                        REGION ROUTING                         │  │
│  │                                                               │  │
│  │        ┌─────────────────┐         ┌─────────────────┐        │  │
│  │        │    US-EAST-1    │         │    EU-WEST-1    │        │  │
│  │        │    (Primary)    │         │   (Secondary)   │        │  │
│  │        └────────┬────────┘         └────────┬────────┘        │  │
│  │                 │                           │                 │  │
│  │        ┌────────▼────────┐         ┌────────▼────────┐        │  │
│  │        │   AP-SOUTH-1    │         │    AP-EAST-1    │        │  │
│  │        │   (Tertiary)    │         │  (Quaternary)   │        │  │
│  │        └─────────────────┘         └─────────────────┘        │  │
│  │                                                               │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  ┌────────────────────── DATA REPLICATION ───────────────────────┐  │
│  │    DynamoDB Global Tables    │   Aurora Global Database       │  │
│  │    Redis Global Datastore    │   S3 Cross-Region Replication  │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Mathematical Foundation: Consistency Models
Consistency-Availability Trade-off (CAP Theorem):
- Consistency (C): All nodes see the same data at the same time
- Availability (A): Every request gets a response
- Partition tolerance (P): System works despite network failures
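For multi-region: network partitions between regions are unavoidable, so P is not optional. During a partition you must sacrifice either C (stay writable in every region and reconcile later) or A (reject requests that cannot be kept consistent). Active-active designs usually choose availability and accept eventual consistency.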
Eventual Consistency Window:
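- Cross-region replication lag: mean L = 100ms (illustrative; real lag depends on the region pair and load)
- Consistency probability, assuming exponentially distributed lag: P(t) = 1 - e^(-t/L), the chance that a write is visible in a remote region t ms after it commits
- For t = 200ms: P = 1 - e^(-200/100) = 1 - e^(-2) ≈ 0.865 = 86.5%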
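The arithmetic above can be checked, and inverted, with a few lines of Python. This is a sketch under the same assumption as the formula (exponentially distributed replication lag with mean L); the function names are illustrative and not part of any AWS SDK.

```python
import math


def consistency_probability(t_ms: float, mean_lag_ms: float) -> float:
    """Chance a remote region has the write t_ms after commit (exponential lag model)."""
    if t_ms < 0 or mean_lag_ms <= 0:
        raise ValueError("t_ms must be >= 0 and mean_lag_ms must be > 0")
    return 1.0 - math.exp(-t_ms / mean_lag_ms)


def wait_for_probability(target: float, mean_lag_ms: float) -> float:
    """Invert the model: how long to wait so that P(t) reaches the target."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    return -mean_lag_ms * math.log(1.0 - target)


if __name__ == "__main__":
    # The worked example from the text: t = 200ms, L = 100ms
    print(f"P(200ms) = {consistency_probability(200, 100):.3f}")
    # How long until 99% of writes are visible remotely under this model
    print(f"t(99%)   = {wait_for_probability(0.99, 100):.0f} ms")
```

The inverse is the more useful number in an interview: under this model, reaching 99% visibility takes roughly 4.6 times the mean lag, which is why read-your-writes guarantees are usually handled by routing a user back to the region that took the write rather than by waiting.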
Conflict Resolution Math:
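- Last-write-wins (LWW): Simple, O(1) metadata per item, but silently discards one of two concurrent writes and depends on reasonably synchronized clocks
- Vector clocks: O(n) storage per item for n writers (regions); they detect concurrent writes but do not resolve them
- CRDTs: Merge without conflict because the merge is commutative, associative, and idempotent; merge cost depends on the type (O(1) for an LWW register, O(n) for a counter tracked across n replicas)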
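To make the difference concrete, here is a minimal sketch contrasting an LWW register with a grow-only counter (G-Counter), one of the simplest CRDTs. The class names and the tie-break on region name are assumptions made for illustration; neither is how any particular database implements it.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class LWWRegister:
    """Last-write-wins register: the highest (timestamp, region) stamp wins."""
    value: str = ""
    stamp: Tuple[int, str] = (0, "")  # region name breaks timestamp ties

    def set(self, value: str, timestamp_ms: int, region: str) -> None:
        if (timestamp_ms, region) > self.stamp:
            self.value, self.stamp = value, (timestamp_ms, region)

    def merge(self, other: "LWWRegister") -> None:
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp


@dataclass
class GCounter:
    """Grow-only counter CRDT: one slot per region, merged by per-slot maximum."""
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, region: str, amount: int = 1) -> None:
        self.counts[region] = self.counts.get(region, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())


if __name__ == "__main__":
    # Two regions write concurrently; LWW keeps only the later write.
    us, eu = LWWRegister(), LWWRegister()
    us.set("book", 1000, "us-east-1")
    eu.set("lamp", 1001, "eu-west-1")
    us.merge(eu)
    eu.merge(us)
    print(us.value, eu.value)  # both regions converge, but "book" is gone

    # The same pattern with a G-Counter keeps every increment.
    a, b = GCounter(), GCounter()
    a.increment("us-east-1", 3)
    b.increment("eu-west-1", 2)
    a.merge(b)
    b.merge(a)
    print(a.value, b.value)
```

Both structures converge, which is all "eventual consistency" promises. Only the counter converges without dropping an update, and that is the property to look for when choosing a strategy for carts, inventory, or like counts.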
Global Load Balancing with Route53
# Route53 health checks and failover
resource "aws_route53_health_check" "us_east_1" {
  # The endpoint is a load balancer DNS name, so use fqdn rather than ip_address
  fqdn              = var.us_east_1_endpoint
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

  tags = {
    Name = "us-east-1-health-check"
  }
}

resource "aws_route53_health_check" "eu_west_1" {
  fqdn              = var.eu_west_1_endpoint
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10
}

# Hosted zone IDs of the regional load balancers, required by alias records
data "aws_elb_hosted_zone_id" "regional" {
  for_each = toset(["us-east-1", "eu-west-1", "ap-south-1"])
  region   = each.key
}

# NOTE: Route53 rejects records that share a name and type but use different
# routing policies. Treat the latency, failover, and geolocation blocks below
# as alternatives, or give each policy its own record name.

# Latency-based routing
resource "aws_route53_record" "app_latency" {
  for_each = {
    "us-east-1"  = var.us_east_1_endpoint
    "eu-west-1"  = var.eu_west_1_endpoint
    "ap-south-1" = var.ap_south_1_endpoint
  }

  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = each.value
    zone_id                = data.aws_elb_hosted_zone_id.regional[each.key].id
    evaluate_target_health = true
  }

  latency_routing_policy {
    region = each.key
  }

  set_identifier = each.key
}

# Failover routing
resource "aws_route53_record" "app_failover_primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = var.us_east_1_endpoint
    zone_id                = data.aws_elb_hosted_zone_id.regional["us-east-1"].id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.us_east_1.id
}

resource "aws_route53_record" "app_failover_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = var.eu_west_1_endpoint
    zone_id                = data.aws_elb_hosted_zone_id.regional["eu-west-1"].id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier  = "secondary"
  health_check_id = aws_route53_health_check.eu_west_1.id
}

# Geolocation routing for compliance
resource "aws_route53_record" "app_geo" {
  for_each = {
    "us-east-1" = {
      continent = "NA"
      endpoint  = var.us_east_1_endpoint
    }
    "eu-west-1" = {
      continent = "EU"
      endpoint  = var.eu_west_1_endpoint
    }
    "ap-south-1" = {
      continent = "AS"
      endpoint  = var.ap_south_1_endpoint
    }
  }

  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = each.value.endpoint
    zone_id                = data.aws_elb_hosted_zone_id.regional[each.key].id
    evaluate_target_health = true
  }

  geolocation_routing_policy {
    continent = each.value.continent
  }

  set_identifier = each.key
}
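The health-check settings above imply a rough upper bound on how long clients keep hitting a failed region. The sketch below is back-of-the-envelope arithmetic, not an AWS guarantee: it assumes detection takes `failure_threshold` consecutive failed checks and that resolvers honour the record TTL. The 60-second TTL is an assumed value, and the function name is hypothetical.

```python
def worst_case_failover_seconds(
    failure_threshold: int,
    request_interval_s: int,
    dns_ttl_s: int,
) -> int:
    """Estimate the time from region failure until clients resolve a healthy region."""
    # Time for the health check to flip to unhealthy
    detection_s = failure_threshold * request_interval_s
    # Clients may hold the stale answer for up to one full TTL after that
    return detection_s + dns_ttl_s


if __name__ == "__main__":
    # failure_threshold = 3 and request_interval = 10 come from the config above;
    # the 60s TTL is an assumption.
    print(worst_case_failover_seconds(3, 10, 60), "seconds")
```

With these numbers the estimate is 30 seconds of detection plus up to 60 seconds of cached DNS. Stating that budget explicitly is a good way to motivate Global Accelerator or client-side retries for traffic that cannot tolerate it.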
DynamoDB Global Tables
# DynamoDB Global Tables configuration
import time
from typing import Any, Dict, List, Sequence

import boto3


class GlobalTableManager:
    """Manager for DynamoDB Global Tables (version 2019.11.21)"""

    def __init__(self, table_name: str, home_region: str = 'us-east-1'):
        self.table_name = table_name
        self.home_region = home_region
        self.dynamodb = boto3.client('dynamodb', region_name=home_region)

    def create_global_table(
        self, replica_regions: Sequence[str] = ('eu-west-1', 'ap-south-1')
    ):
        """Create the table in the home region, then add one replica per region"""
        # With on-demand billing, provisioned capacity and auto scaling settings
        # do not apply, so the table is created without them.
        self.dynamodb.create_table(
            TableName=self.table_name,
            # Assumed key schema: a single string partition key named PK
            AttributeDefinitions=[{'AttributeName': 'PK', 'AttributeType': 'S'}],
            KeySchema=[{'AttributeName': 'PK', 'KeyType': 'HASH'}],
            BillingMode='PAY_PER_REQUEST',
            # Replication relies on streams carrying new and old images
            StreamSpecification={
                'StreamEnabled': True,
                'StreamViewType': 'NEW_AND_OLD_IMAGES'
            },
            # KMS encryption with the AWS managed key (alias/aws/dynamodb)
            SSESpecification={'Enabled': True, 'SSEType': 'KMS'}
        )
        self.dynamodb.get_waiter('table_exists').wait(TableName=self.table_name)

        response = None
        for region in replica_regions:
            # Add replicas one at a time and wait for each to become active
            response = self.dynamodb.update_table(
                TableName=self.table_name,
                ReplicaUpdates=[{'Create': {'RegionName': region}}]
            )
            self._wait_for_replica(region)
        return response

    def _wait_for_replica(self, region: str, poll_seconds: int = 5):
        """Poll until the table and the given replica both report ACTIVE"""
        while True:
            table = self.dynamodb.describe_table(TableName=self.table_name)['Table']
            replicas = {
                r['RegionName']: r.get('ReplicaStatus')
                for r in table.get('Replicas', [])
            }
            if table['TableStatus'] == 'ACTIVE' and replicas.get(region) == 'ACTIVE':
                return
            time.sleep(poll_seconds)

    def put_item_global(self, item: Dict[str, Any], region: str = 'us-east-1'):
        """Put item with a version check against the local replica"""
        dynamodb = boto3.resource('dynamodb', region_name=region)
        table = dynamodb.Table(self.table_name)
        # Millisecond timestamp as the version; copy so the caller's dict is untouched
        item = {
            **item,
            'version': int(time.time() * 1000),
            'last_updated_region': region
        }
        # The condition only rejects stale writes within this region. Concurrent
        # writes in different regions are still reconciled by DynamoDB itself
        # using last-writer-wins.
        response = table.put_item(
            Item=item,
            ConditionExpression='attribute_not_exists(PK) OR #version < :version',
            # Alias the attribute name to avoid clashing with reserved words
            ExpressionAttributeNames={'#version': 'version'},
            ExpressionAttributeValues={':version': item['version']}
        )
        return response

    def get_item_global(self, item_key: Dict[str, Any]) -> Dict[str, Any]:
        """Get item with a strongly consistent read from the home region"""
        # ConsistentRead is strong only within one region; writes made in other
        # regions appear once they have replicated here.
        dynamodb = boto3.resource('dynamodb', region_name=self.home_region)
        table = dynamodb.Table(self.table_name)
        response = table.get_item(
            Key=item_key,
            ConsistentRead=True
        )
        return response.get('Item')

    def query_global(
        self, index_name: str, key_condition: Any, limit: int = 100
    ) -> List[Dict[str, Any]]:
        """Query a global secondary index, newest first"""
        # key_condition is a boto3 condition object, for example
        # boto3.dynamodb.conditions.Key('GSI1PK').eq('USER#123')
        dynamodb = boto3.resource('dynamodb', region_name=self.home_region)
        table = dynamodb.Table(self.table_name)
        response = table.query(
            IndexName=index_name,
            KeyConditionExpression=key_condition,
            Limit=limit,
            ScanIndexForward=False
        )
        return response.get('Items', [])
# Conflict resolution using vector clocks
class VectorClock:
    """Vector clock for conflict detection"""

    def __init__(self):
        self.clock: Dict[str, int] = {}

    def increment(self, node_id: str):
        """Increment clock for node"""
        self.clock[node_id] = self.clock.get(node_id, 0) + 1

    def merge(self, other: 'VectorClock'):
        """Merge two vector clocks by taking the per-node maximum"""
        for node_id, timestamp in other.clock.items():
            self.clock[node_id] = max(self.clock.get(node_id, 0), timestamp)

    def happens_before(self, other: 'VectorClock') -> bool:
        """Check if this clock strictly happens before another"""
        nodes = set(self.clock) | set(other.clock)
        # No entry may be ahead of the other clock (missing entries count as 0)
        none_ahead = all(
            self.clock.get(n, 0) <= other.clock.get(n, 0) for n in nodes
        )
        # At least one entry must be behind, so equal clocks do not qualify
        some_behind = any(
            self.clock.get(n, 0) < other.clock.get(n, 0) for n in nodes
        )
        return none_ahead and some_behind

    def concurrent_with(self, other: 'VectorClock') -> bool:
        """Check if clocks are concurrent (different, and neither precedes the other)"""
        nodes = set(self.clock) | set(other.clock)
        equal = all(
            self.clock.get(n, 0) == other.clock.get(n, 0) for n in nodes
        )
        return (
            not equal
            and not self.happens_before(other)
            and not other.happens_before(self)
        )

    def to_dict(self) -> Dict[str, int]:
        return self.clock.copy()

    @classmethod
    def from_dict(cls, clock_dict: Dict[str, int]) -> 'VectorClock':
        vc = cls()
        vc.clock = clock_dict.copy()
        return vc
⚠️ Conflict Resolution
Choose your conflict resolution strategy carefully. Last-write-wins is simple but can lose data. Vector clocks add complexity but preserve causality.
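As a rough illustration of what "preserve causality" buys you, the sketch below compares two vector clocks and keeps both versions when the writes are concurrent instead of silently dropping one. It is self-contained and works on plain dicts; `compare` and `resolve` are hypothetical helper names, and keeping sibling versions for the application to merge is one possible policy among several.

```python
from typing import Dict, List, Tuple

Clock = Dict[str, int]
Version = Tuple[str, Clock]  # (value, vector clock)


def compare(a: Clock, b: Clock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for clock a relative to b."""
    nodes = set(a) | set(b)
    a_behind = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    b_behind = any(b.get(n, 0) < a.get(n, 0) for n in nodes)
    if a_behind and b_behind:
        return "concurrent"
    if a_behind:
        return "before"
    if b_behind:
        return "after"
    return "equal"


def resolve(local: Version, incoming: Version) -> List[Version]:
    """Keep the causally newer version, or both siblings if the writes conflict."""
    order = compare(local[1], incoming[1])
    if order == "before":
        return [incoming]
    if order in ("after", "equal"):
        return [local]
    return [local, incoming]  # concurrent: surface both for an application-level merge


if __name__ == "__main__":
    # Both regions start from {"us-east-1": 1} and write without seeing each other.
    us_write = ("cart-with-book", {"us-east-1": 2})
    eu_write = ("cart-with-lamp", {"us-east-1": 1, "eu-west-1": 1})
    print(compare(us_write[1], eu_write[1]))
    print(resolve(us_write, eu_write))
```

Last-write-wins would have picked one cart here based on timestamps alone. The vector clock shows that neither write saw the other, which is exactly the case where dropping one loses data.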
Cross-Region Replication
# Cross-region data synchronization
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

import boto3


class CrossRegionReplicator:
    """Handles cross-region data replication"""

    def __init__(self, regions: List[str]):
        self.regions = regions

    def replicate_event(self, event: Dict[str, Any], source_region: str):
        """Replicate event to all regions except the source"""
        # Content hash as the event ID so consumers can deduplicate retries
        event_id = hashlib.md5(
            json.dumps(event, sort_keys=True).encode()
        ).hexdigest()
        for region in self.regions:
            if region != source_region:
                self._send_to_region(region, source_region, event, event_id)

    def _send_to_region(
        self, region: str, source_region: str, event: Dict[str, Any], event_id: str
    ):
        """Send event to specific region via Kinesis"""
        kinesis = boto3.client('kinesis', region_name=region)
        now = datetime.now(timezone.utc).isoformat()
        # Add metadata for replication
        event_with_metadata = {
            'original_event': event,
            'replication_metadata': {
                'event_id': event_id,
                # Region where the event originated, not the target region
                'source_region': source_region,
                'replication_timestamp': now,
                'replication_id': hashlib.md5(
                    f"{event_id}{region}{now}".encode()
                ).hexdigest()
            }
        }
        kinesis.put_record(
            StreamName='cross-region-replication-stream',
            Data=json.dumps(event_with_metadata).encode('utf-8'),
            PartitionKey=event_id
        )


class GlobalDataStore:
    """Global data store with eventual consistency"""

    # Manual replication like this is only needed for tables that are not
    # DynamoDB Global Tables, which replicate on their own.

    def __init__(self, primary_region: str, replica_regions: List[str]):
        self.primary_region = primary_region
        self.replica_regions = replica_regions

    def _table(self, table_name: str, region: str):
        """Return a Table handle bound to one region"""
        # Table() takes no region argument, so build a resource per region
        return boto3.resource('dynamodb', region_name=region).Table(table_name)

    def write(self, table_name: str, item: Dict[str, Any], region: Optional[str] = None):
        """Write to the given region (primary by default), then replicate"""
        if region is None:
            region = self.primary_region
        response = self._table(table_name, region).put_item(Item=item)
        # Trigger replication
        self._replicate(table_name, item, region)
        return response

    def read(self, table_name: str, key: Dict[str, Any], consistent: bool = False):
        """Read from nearest region"""
        # In production, determine nearest region based on latency
        region = self._get_nearest_region()
        response = self._table(table_name, region).get_item(
            Key=key,
            ConsistentRead=consistent
        )
        return response.get('Item')

    def _replicate(self, table_name: str, item: Dict[str, Any], source_region: str):
        """Replicate to other regions"""
        for region in self.replica_regions:
            if region != source_region:
                self._replicate_async(table_name, item, region)

    def _replicate_async(self, table_name: str, item: Dict[str, Any], region: str):
        """Replicate to one region (synchronous here despite the name)"""
        # In production, hand this to a queue or background worker
        self._table(table_name, region).put_item(Item=item)

    def _get_nearest_region(self) -> str:
        """Get nearest region based on latency"""
        # Simplified - in production, use latency measurements
        return self.primary_region
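The replicator above stamps every record with an `event_id` for deduplication, but the consuming side is not shown. The following is a minimal sketch of what that consumer could look like, assuming records arrive in the `original_event` / `replication_metadata` shape built above. `DedupingConsumer` is a hypothetical name, and the bounded in-memory cache stands in for a durable store (for example a table with a conditional write), which a real deployment would need.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict


class DedupingConsumer:
    """Applies each replicated event at most once, keyed by its event_id."""

    def __init__(self, apply: Callable[[Dict[str, Any]], None], max_remembered: int = 10_000):
        self._apply = apply
        self._max_remembered = max_remembered
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def handle(self, record: Dict[str, Any]) -> bool:
        """Process one record; return True if applied, False if it was a duplicate."""
        event_id = record["replication_metadata"]["event_id"]
        if event_id in self._seen:
            return False
        self._apply(record["original_event"])
        # Remember the ID only after a successful apply, so failures are retried
        self._seen[event_id] = None
        if len(self._seen) > self._max_remembered:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        return True


if __name__ == "__main__":
    applied = []
    consumer = DedupingConsumer(applied.append)
    record = {
        "original_event": {"order_id": "o-1", "status": "PAID"},
        "replication_metadata": {"event_id": "abc123", "source_region": "us-east-1"},
    }
    print(consumer.handle(record))  # first delivery
    print(consumer.handle(record))  # redelivery of the same record
    print(len(applied))
```

Because Kinesis delivers at least once, some form of idempotent apply like this is needed wherever replicated events have side effects.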
Failover Automation
# CloudWatch alarms for failover
resource "aws_cloudwatch_metric_alarm" "us_east_1_5xx" {
  alarm_name          = "us-east-1-5xx-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 100
  alarm_description   = "5xx errors in us-east-1"

  dimensions = {
    LoadBalancer = aws_lb.us_east_1.arn_suffix
  }

  alarm_actions = [aws_sns_topic.failover.arn]
  ok_actions    = [aws_sns_topic.failover_recovery.arn]
}

# Lambda function for automatic failover
resource "aws_lambda_function" "failover_handler" {
  filename      = "lambda/failover.zip"
  function_name = "failover-handler"
  role          = aws_iam_role.failover_lambda.arn
  handler       = "index.handler"
  runtime       = "python3.12"
  timeout       = 30
  memory_size   = 128

  environment {
    variables = {
      PRIMARY_REGION   = "us-east-1"
      SECONDARY_REGION = "eu-west-1"
      ROUTE53_ZONE_ID  = aws_route53_zone.main.zone_id
    }
  }
}

# SNS topics for failover and recovery notifications
resource "aws_sns_topic" "failover" {
  name = "failover-notifications"
}

resource "aws_sns_topic" "failover_recovery" {
  name = "failover-recovery-notifications"
}

# EventBridge rule for failover events: match state changes of the alarm
# above, which is the event shape the Lambda handler parses
resource "aws_cloudwatch_event_rule" "failover_event" {
  name        = "failover-event-rule"
  description = "Capture failover events"

  event_pattern = jsonencode({
    source      = ["aws.cloudwatch"]
    detail-type = ["CloudWatch Alarm State Change"]
    detail = {
      alarmName = [aws_cloudwatch_metric_alarm.us_east_1_5xx.alarm_name]
    }
  })
}

resource "aws_cloudwatch_event_target" "failover_lambda" {
  rule      = aws_cloudwatch_event_rule.failover_event.name
  target_id = "FailoverLambda"
  arn       = aws_lambda_function.failover_handler.arn
}

# Allow EventBridge to invoke the Lambda function
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.failover_handler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.failover_event.arn
}
# Failover automation handler
import os
from typing import Any, Dict

import boto3


class FailoverAutomation:
    """Automated failover management"""

    def __init__(self):
        self.route53 = boto3.client('route53')
        self.cloudfront = boto3.client('cloudfront')
        # Set in the Lambda environment; the fallback is a placeholder
        self.zone_id = os.environ.get('ROUTE53_ZONE_ID', 'Z1234567890')

    def handle_failover(self, event: Dict[str, Any]) -> Dict[str, Any]:
        """Handle a CloudWatch alarm state change event"""
        alarm_name = event['detail']['alarmName']
        alarm_state = event['detail']['state']['value']
        if alarm_state == 'ALARM':
            return self._trigger_failover(alarm_name)
        if alarm_state == 'OK':
            return self._trigger_recovery(alarm_name)
        # INSUFFICIENT_DATA: not enough signal to act on
        return {'action': 'none', 'status': 'ignored', 'alarm_state': alarm_state}

    def _trigger_failover(self, alarm_name: str) -> Dict[str, Any]:
        """Trigger failover to secondary region"""
        # Update Route53 records. Route53 expects one PRIMARY and one SECONDARY
        # record per name, so instead of relabelling the primary record this
        # repoints it at the secondary region.
        self.route53.change_resource_record_sets(
            HostedZoneId=self.zone_id,
            ChangeBatch={
                'Changes': [
                    {
                        'Action': 'UPSERT',
                        'ResourceRecordSet': {
                            'Name': 'app.example.com',
                            'Type': 'A',
                            'SetIdentifier': 'primary',
                            'Failover': 'PRIMARY',
                            'TTL': 60,
                            'ResourceRecords': [
                                {'Value': '203.0.113.10'}  # Secondary region IP
                            ]
                        }
                    }
                ]
            }
        )
        # Update CloudFront origin. update_distribution needs the complete
        # current config and its ETag, so fetch both first.
        distribution_id = 'E1234567890'  # placeholder distribution ID
        current = self.cloudfront.get_distribution_config(Id=distribution_id)
        config = current['DistributionConfig']
        # Assumes the first origin is the application origin
        config['Origins']['Items'][0]['DomainName'] = 'secondary.example.com'
        self.cloudfront.update_distribution(
            Id=distribution_id,
            IfMatch=current['ETag'],
            DistributionConfig=config
        )
        return {
            'action': 'failover',
            'status': 'completed',
            'new_primary': 'eu-west-1'
        }

    def _trigger_recovery(self, alarm_name: str) -> Dict[str, Any]:
        """Trigger recovery to primary region"""
        # Placeholder: similar logic to failover, but in reverse
        return {
            'action': 'recovery',
            'status': 'completed',
            'new_primary': 'us-east-1'
        }


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Lambda entry point, matching the "index.handler" setting"""
    return FailoverAutomation().handle_failover(event)
✅ Multi-Region Benefits
Active-active deployments provide high availability, low latency globally, and disaster recovery. The key is balancing consistency with availability.
Summary
| Component | Purpose | Configuration |
|---|---|---|
| Route53 | Global DNS | Latency/failover routing |
| DynamoDB Global Tables | Multi-region database | Replication, conflict resolution |
| CloudFront | Global CDN | Edge caching, origin failover |
| Cross-Region Replication | Data sync | Kinesis, async replication |
| Failover Automation | Recovery | CloudWatch, Lambda, EventBridge |