Multi-AZ Design Patterns
Difficulty: Senior Level | Companies: AWS, Google, Microsoft, Netflix, Uber
Why Multi-AZ Matters
Availability Zones (AZs) are isolated data centers within a cloud region. Each AZ has independent power, networking, and connectivity. Designing across multiple AZs ensures your application survives localized failures.
Note: A single AZ failure should never cause a full application outage. Multi-AZ is the baseline for production workloads.
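As a rough back-of-the-envelope illustration of why spreading across AZs helps, the sketch below computes the chance that at least one AZ is up. It assumes AZ failures are fully independent and that the application can run on any single surviving AZ. The 99.5% per-AZ figure is an illustrative assumption, not a published AWS number, and the function names are made up for this example.

```python
def combined_availability(per_az: float, az_count: int) -> float:
    """Probability that at least one of `az_count` independent AZs is up."""
    if not 0.0 <= per_az <= 1.0:
        raise ValueError("per_az must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # All AZs must fail at once for the application to be down.
    return 1.0 - (1.0 - per_az) ** az_count


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    for n in (1, 2, 3):
        a = combined_availability(0.995, n)  # 0.995 is an assumed per-AZ value
        print(f"{n} AZ(s): {a:.6%} available, "
              f"{downtime_hours_per_year(a):.3f} h/year expected downtime")
```

Real failures are correlated (bad deploys, regional control-plane issues), so treat the result as an upper bound rather than a prediction.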
Architecture Overview
                 ┌─────────────────────┐
                 │   Route 53 / DNS    │
                 └──────────┬──────────┘
                            │
                 ┌──────────┴──────────┐
                 │  Application Load   │
                 │      Balancer       │
                 └──────────┬──────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
┌───────┴───────┐   ┌───────┴───────┐   ┌───────┴───────┐
│     AZ-1a     │   │     AZ-1b     │   │     AZ-1c     │
│ ┌───────────┐ │   │ ┌───────────┐ │   │ ┌───────────┐ │
│ │ EC2 / ECS │ │   │ │    EC2    │ │   │ │ EC2 / ECS │ │
│ └───────────┘ │   │ └───────────┘ │   │ └───────────┘ │
│ ┌───────────┐ │   │ ┌───────────┐ │   │ ┌───────────┐ │
│ │RDS Primary│ │   │ │RDS Standby│ │   │ │ElastiCache│ │
│ └───────────┘ │   │ └───────────┘ │   │ └───────────┘ │
└───────────────┘   └───────────────┘   └───────────────┘
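With three AZs behind one load balancer, losing one AZ shifts its traffic onto the other two. The sketch below estimates how many instances each AZ needs so the survivors can still carry peak load. It assumes traffic is spread evenly and that `peak_instances` (a hypothetical input) is the fleet size needed at peak with no failures.

```python
import math


def instances_per_az(peak_instances: int, az_count: int,
                     az_failures_tolerated: int = 1) -> int:
    """Instances to run in each AZ so peak load still fits after AZ loss."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one AZ must survive")
    # The surviving AZs must absorb the full peak between them.
    return math.ceil(peak_instances / surviving)


if __name__ == "__main__":
    for azs in (2, 3):
        per_az = instances_per_az(peak_instances=12, az_count=azs)
        print(f"{azs} AZs: {per_az} per AZ, {per_az * azs} total")
```

For a peak of 12 instances this gives 12 per AZ (24 total) with two AZs and 6 per AZ (18 total) with three, which is one reason three AZs are often cheaper to run with N+1 headroom than two.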
Pattern 1: Active-Passive Database Failover
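Use RDS Multi-AZ for automatic failover. RDS keeps a synchronously replicated standby in a different AZ and, on failure, repoints the instance's DNS endpoint to it, typically within one to two minutes.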
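# RDS Multi-AZ configuration (Terraform)
resource "aws_db_instance" "primary" {
  identifier     = "app-database"
  instance_class = "db.r6g.xlarge"
  engine         = "postgres"
  engine_version = "15.4"

  multi_az             = true # Provisions a standby in a different AZ
  db_subnet_group_name = aws_db_subnet_group.main.name # Defined elsewhere; must span at least two AZs

  # Master credentials ("app_admin" is a placeholder); the password is
  # generated by RDS and stored in Secrets Manager
  username                    = "app_admin"
  manage_master_user_password = true

  # On RDS PostgreSQL, gp3 accepts custom IOPS and throughput only at
  # 400 GiB and above; smaller volumes are fixed at the gp3 baseline
  storage_type       = "gp3"
  allocated_storage  = 400
  iops               = 12000
  storage_throughput = 500

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  tags = {
    Environment = "production"
    MultiAZ     = "true"
  }
}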
Warning: Multi-AZ roughly doubles the RDS instance and storage cost, because the standby is a full second instance. Before enabling it everywhere, evaluate whether a read replica with manual promotion would already meet your RTO requirements.
Pattern 2: Cross-AZ Load Balancing
Distribute traffic evenly across AZs with health checks.
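// AWS ALB with cross-zone load balancing (CDK)
// Runs inside a Stack constructor, where `this` is the stack
import * as cdk from 'aws-cdk-lib';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// Placeholder: ARN of an existing ACM certificate, supplied by the caller
declare const certArn: string;

const vpc = ec2.Vpc.fromLookup(this, 'Vpc', { isDefault: false });
const alb = new elbv2.ApplicationLoadBalancer(this, 'ALB', {
  vpc,
  internetFacing: true,
  loadBalancerName: 'multi-az-alb',
});

// The ALB automatically distributes across the AZs of its subnets
// Cross-zone is enabled by default for ALB
const listener = alb.addListener('Listener', {
  port: 443,
  certificates: [elbv2.ListenerCertificate.fromArn(certArn)],
});

// One target group spans every AZ; a target group is not itself a target,
// so it is attached with addTargetGroups rather than passed as `targets`
const targetGroup = new elbv2.ApplicationTargetGroup(this, 'AppTargetGroup', {
  vpc,
  port: 8080,
  targetType: elbv2.TargetType.INSTANCE,
  targetGroupName: 'app-multi-az',
  healthCheck: {
    path: '/health',
    interval: cdk.Duration.seconds(10),
    healthyThresholdCount: 2,
    unhealthyThresholdCount: 3,
  },
});
listener.addTargetGroups('AppTarget', { targetGroups: [targetGroup] });

// Register compute that runs in all AZs, for example an Auto Scaling group:
// targetGroup.addTarget(asg);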
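The health check settings in this pattern (interval 10 s, healthy threshold 2, unhealthy threshold 3) decide how quickly a failed target stops receiving traffic. The sketch below models the consecutive-result counting in a simplified way; it is not the load balancer's actual implementation, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TargetHealth:
    """Simplified model: state flips after N consecutive contrary results."""
    healthy_threshold: int = 2
    unhealthy_threshold: int = 3
    healthy: bool = True
    _streak: int = field(default=0, init=False, repr=False)

    def record(self, check_passed: bool) -> bool:
        """Feed one health check result and return the resulting state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with current state
            return self.healthy
        self._streak += 1
        needed = (self.unhealthy_threshold if self.healthy
                  else self.healthy_threshold)
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


def detection_window_seconds(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate time to mark a target unhealthy, ignoring check timeouts."""
    return interval_s * unhealthy_threshold


if __name__ == "__main__":
    target = TargetHealth()
    for result in (False, False, False, True, True):
        print(result, "->", "healthy" if target.record(result) else "unhealthy")
    print("detection window:", detection_window_seconds(10, 3), "seconds")
```

With the values above, a dead target keeps receiving a share of traffic for roughly 30 seconds, which is the main argument for short intervals on latency-sensitive services.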
Pattern 3: Session Affinity with Distributed Cache
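Store sessions in an ElastiCache for Redis replication group with Multi-AZ automatic failover, so that any instance in any AZ can serve a request without sticky sessions. The example uses cluster mode disabled: one primary and two replicas.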
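# Redis replication group across multiple AZs
import boto3

elasticache = boto3.client('elasticache')

# Create a replication group whose nodes are spread across the AZs
# covered by the subnet group
elasticache.create_replication_group(
    ReplicationGroupId='multi-az-session-cache',
    ReplicationGroupDescription='Session cache across AZs',
    NumCacheClusters=3,  # 1 primary + 2 replicas, ideally one per AZ
    CacheNodeType='cache.r6g.xlarge',
    Engine='redis',
    EngineVersion='7.0',
    AutomaticFailoverEnabled=True,
    MultiAZEnabled=True,  # boolean flag; requires automatic failover
    CacheSubnetGroupName='multi-az-subnet-group',
    SecurityGroupIds=['sg-0123456789abcdef0'],
    Port=6379,
    AtRestEncryptionEnabled=True,
    TransitEncryptionEnabled=True,
)

# The endpoint exists only once the group is available, so wait for it
elasticache.get_waiter('replication_group_available').wait(
    ReplicationGroupId='multi-az-session-cache'
)
group = elasticache.describe_replication_groups(
    ReplicationGroupId='multi-az-session-cache'
)['ReplicationGroups'][0]
primary_endpoint = group['NodeGroups'][0]['PrimaryEndpoint']

# Application session configuration (keyword arguments for redis.Redis)
SESSION_CONFIG = {
    'host': primary_endpoint['Address'],  # follows the primary across failovers
    'port': primary_endpoint['Port'],
    'ssl': True,
    'socket_timeout': 5,
    'socket_connect_timeout': 5,
    'retry_on_timeout': True,
    'health_check_interval': 30,
}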
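To show what "session storage across AZs" means on the application side, here is a minimal session store sketch. It only needs a client with `setex`, `get`, and `delete` methods, which redis-py clients provide; `InMemoryBackend` is a stand-in written for this example so the code runs without a Redis server. The class names, key prefix, and 30-minute TTL are assumptions.

```python
import json
import secrets
import time
from typing import Any, Dict, Optional


class InMemoryBackend:
    """Stand-in for a Redis client exposing only setex/get/delete."""

    def __init__(self) -> None:
        self._data: Dict[str, tuple] = {}

    def setex(self, name: str, ttl_seconds: int, value: str) -> None:
        self._data[name] = (value, time.monotonic() + ttl_seconds)

    def get(self, name: str) -> Optional[str]:
        item = self._data.get(name)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() >= expires_at:
            del self._data[name]  # lazily expire, like a TTL'd key
            return None
        return value

    def delete(self, name: str) -> None:
        self._data.pop(name, None)


class SessionStore:
    """Keeps session state outside the app instance, keyed by session ID."""

    def __init__(self, backend: Any, ttl_seconds: int = 1800) -> None:
        self._backend = backend
        self._ttl = ttl_seconds

    @staticmethod
    def _key(session_id: str) -> str:
        return f"session:{session_id}"

    def create(self, data: Dict[str, Any]) -> str:
        session_id = secrets.token_urlsafe(32)
        self._backend.setex(self._key(session_id), self._ttl, json.dumps(data))
        return session_id

    def load(self, session_id: str) -> Optional[Dict[str, Any]]:
        raw = self._backend.get(self._key(session_id))
        return None if raw is None else json.loads(raw)

    def destroy(self, session_id: str) -> None:
        self._backend.delete(self._key(session_id))


if __name__ == "__main__":
    store = SessionStore(InMemoryBackend())
    sid = store.create({"user_id": 42})
    print(store.load(sid))
```

Because no instance holds the session in memory, a request that lands in a different AZ after a failover can still load the same session by ID.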
Pattern 4: S3 Cross-Region Replication
Replicate critical data across regions for disaster recovery.
# S3 Cross-Region Replication rule (CloudFormation)
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  SourceBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled  # Replication requires versioning on both buckets
      ReplicationConfiguration:
        Role: !GetAtt ReplicationRole.Arn  # IAM role defined elsewhere in the template
        Rules:
          - Id: CrossRegionReplication
            Status: Enabled
            Priority: 1  # Needed when the rule uses a Filter
            Filter:
              Prefix: critical-data/
            DeleteMarkerReplication:
              Status: Enabled
            Destination:
              Bucket: arn:aws:s3:::backup-bucket-us-west-2
              StorageClass: STANDARD_IA
              ReplicationTime:
                Status: Enabled
                Time:
                  Minutes: 15
              Metrics:
                Status: Enabled
                EventThreshold:
                  Minutes: 15
Note: S3 Replication Time Control (RTC) is designed to replicate 99.99% of new objects within 15 minutes. Enable it for critical data.
Pattern 5: Cross-AZ DNS Failover
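Use Route 53 health checks with failover routing: Route 53 answers with the PRIMARY record while its health check passes and switches to the SECONDARY record when it fails. The IP addresses and health check IDs in the example are placeholders. Route 53 health checkers run outside your VPC, so a private address such as 10.0.1.100 can only be monitored indirectly, for example through a health check tied to a CloudWatch alarm.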
{
  "Comment": "Multi-AZ failover configuration",
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "primary-az1",
        "Failover": "PRIMARY",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "10.0.1.100" }
        ],
        "HealthCheckId": "health-check-az1"
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "secondary-az2",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "10.0.2.100" }
        ],
        "HealthCheckId": "health-check-az2"
      }
    }
  ]
}
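The decision that failover routing makes for the two records above can be sketched in a few lines. This is a simplified model, not Route 53's implementation. The both-unhealthy branch, which falls back to the primary, reflects my understanding of the documented behaviour and should be treated as an assumption. The timing helper is a rough upper bound that assumes resolvers honour the TTL; the 30-second interval and threshold of 3 in the demo are assumed values.

```python
def resolve_failover(primary_ip: str, secondary_ip: str,
                     primary_healthy: bool, secondary_healthy: bool) -> str:
    """Pick the record a failover-routed name would answer with."""
    if primary_healthy:
        return primary_ip
    if secondary_healthy:
        return secondary_ip
    # Assumption: with nothing healthy, answer with the primary anyway.
    return primary_ip


def worst_case_failover_seconds(check_interval_s: int, failure_threshold: int,
                                ttl_s: int) -> int:
    """Detection time plus the time cached answers may linger in resolvers."""
    return check_interval_s * failure_threshold + ttl_s


if __name__ == "__main__":
    print(resolve_failover("10.0.1.100", "10.0.2.100", False, True))
    print(worst_case_failover_seconds(30, 3, 60), "seconds")
```

The second function shows why the record TTL matters: lowering it shortens failover but increases query volume.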
Pattern 6: Aurora Global Database
Aurora Global Database replicates a cluster to other regions, typically with less than one second of lag.
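# Global databases are created through the RDS API or CLI, not SQL.
# Assumes an existing Aurora PostgreSQL cluster "cluster-us-east-1" in the
# primary region us-east-1; the account ID 123456789012 is a placeholder.

# Create the global database from the existing primary cluster
aws rds create-global-cluster \
  --region us-east-1 \
  --global-cluster-identifier cluster-global \
  --source-db-cluster-identifier arn:aws:rds:us-east-1:123456789012:cluster:cluster-us-east-1

# Attach a secondary cluster in us-west-2 (engine and version must match the primary)
aws rds create-db-cluster \
  --region us-west-2 \
  --db-cluster-identifier cluster-us-west-2 \
  --engine aurora-postgresql \
  --global-cluster-identifier cluster-global

# Aurora replicates to the secondary automatically, typically with <1s lag
# For DR, promote the secondary; recovery typically takes <1 minute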
Warning: Aurora Global Database has a maximum of 5 secondary regions and 16 read replicas per region. Plan your architecture accordingly.
Best Practices Checklist
- Minimum 2 AZs for any production workload
- Health checks on all components with fast intervals (5-10 seconds)
- Connection pooling to handle failover gracefully (PgBouncer, HikariCP)
- Retry logic with exponential backoff in application code
- Chaos testing with tools like AWS Fault Injection Simulator
- RTO/RPO targets documented and validated quarterly
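The retry item in the checklist can be sketched as follows: exponential backoff with full jitter, so that many clients reconnecting after a failover do not retry in lockstep. The function name, defaults, and the choice of exceptions to retry are assumptions for illustration; real code would retry on whatever transient errors its database or cache driver raises.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    *,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_attempts: int = 5,
    base_delay_s: float = 0.2,
    max_delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call `operation`, retrying transient errors with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Cap doubles each attempt: base, 2*base, 4*base, ... up to max.
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(rng() * cap)  # full jitter: uniform in [0, cap)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    attempts = []

    def flaky_query() -> str:
        attempts.append(1)
        if len(attempts) < 3:
            raise ConnectionError("database failover in progress")
        return "ok"

    print(retry_with_backoff(flaky_query), "after", len(attempts), "attempts")
```

The `sleep` and `rng` parameters exist only so the delay schedule can be tested without waiting; production callers can leave them at their defaults.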
Follow-Up Questions
- How do you handle session state during an AZ failover without losing user experience?
- What is the trade-off between synchronous and asynchronous replication for multi-AZ databases?
- How would you design a multi-AZ architecture for a real-time gaming application with strict latency requirements?