Multi-AZ Design Patterns

Difficulty: Senior Level | Companies: AWS, Google, Microsoft, Netflix, Uber

Why Multi-AZ Matters

An Availability Zone (AZ) is one or more isolated data centers within a cloud region, with its own power, cooling, and network connectivity. Designing across multiple AZs lets your application survive the loss of a single zone.

ℹ️

A single AZ failure should never cause a full application outage. Multi-AZ is the baseline for production workloads.
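
As a rough illustration of why spreading across AZs helps, the sketch below computes the chance that at least one AZ is up. It assumes every AZ has the same availability and fails independently, which real outages only approximate. The function name and the 99.9% input are illustrative assumptions, not published AWS figures.

```python
def combined_availability(per_az_availability: float, az_count: int) -> float:
    """Probability that at least one of az_count AZs is up.

    Assumes independent failures and identical per-AZ availability.
    """
    if not 0.0 <= per_az_availability <= 1.0:
        raise ValueError("per_az_availability must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # All AZs must be down at once for the application to be down.
    return 1.0 - (1.0 - per_az_availability) ** az_count


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.999, n):.9f}")
```

Under these assumptions each extra AZ multiplies the outage probability by the per-AZ failure probability, which is why going from one AZ to two matters far more than going from two to three.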

Architecture Overview

Architecture Diagram
                    ┌─────────────────────────┐
                    │     Route 53 / DNS      │
                    └────────────┬────────────┘
                                 │
                    ┌────────────┴────────────┐
                    │    Application Load     │
                    │        Balancer         │
                    └────────────┬────────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              │                  │                  │
      ┌───────┴───────┐  ┌───────┴───────┐  ┌───────┴───────┐
      │     AZ-1a     │  │     AZ-1b     │  │     AZ-1c     │
      │ ┌───────────┐ │  │ ┌───────────┐ │  │ ┌───────────┐ │
      │ │ EC2 / ECS │ │  │ │    EC2    │ │  │ │ EC2 / ECS │ │
      │ └───────────┘ │  │ └───────────┘ │  │ └───────────┘ │
      │ ┌───────────┐ │  │ ┌───────────┐ │  │ ┌───────────┐ │
      │ │RDS Primary│ │  │ │RDS Standby│ │  │ │ElastiCache│ │
      │ └───────────┘ │  │ └───────────┘ │  │ └───────────┘ │
      └───────────────┘  └───────────────┘  └───────────────┘

Pattern 1: Active-Passive Database Failover

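Enable RDS Multi-AZ for automatic failover. RDS keeps a synchronously replicated standby in a different AZ and, on failure, repoints the instance endpoint to it, so the application reconnects to the same hostname.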

# RDS Multi-AZ Configuration (Terraform)
resource "aws_db_instance" "primary" {
  identifier           = "app-database"
  instance_class       = "db.r6g.xlarge"
  engine               = "postgres"
  engine_version       = "15.4"
  multi_az             = true # Enables a standby in a different AZ
  db_subnet_group_name = aws_db_subnet_group.main.name

  # Placeholder credentials: RDS generates the password and keeps it in Secrets Manager
  username                    = "app_admin"
  manage_master_user_password = true

  # For PostgreSQL, gp3 accepts explicit IOPS/throughput only from 400 GiB upward
  storage_type       = "gp3"
  allocated_storage  = 400
  iops               = 12000
  storage_throughput = 500

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  tags = {
    Environment = "production"
    MultiAZ     = "true"
  }
}

⚠️

Multi-AZ roughly doubles the RDS instance and storage cost, because you also pay for the standby. Before enabling it everywhere, check whether a cheaper setup, such as a read replica that you promote manually, would still meet your RTO.
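
One way to make the RTO comparison concrete is to add up the steps of a manual failover runbook and compare the total with the target. The sketch below does only that arithmetic. The step names and durations are hypothetical placeholders; substitute timings measured in your own failover drills.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FailoverStep:
    name: str
    seconds: float


def estimated_recovery_seconds(steps: Sequence[FailoverStep]) -> float:
    """Total time when the steps must run one after another."""
    return sum(step.seconds for step in steps)


def meets_rto(steps: Sequence[FailoverStep], rto_seconds: float) -> bool:
    """True when the summed step durations fit within the RTO."""
    return estimated_recovery_seconds(steps) <= rto_seconds


# Hypothetical manual runbook; every duration here is an assumption.
manual_failover = [
    FailoverStep("alarm fires", 120),
    FailoverStep("engineer responds", 600),
    FailoverStep("promote read replica", 300),
    FailoverStep("repoint application", 120),
]

if __name__ == "__main__":
    total = estimated_recovery_seconds(manual_failover)
    print(f"manual failover: {total:.0f}s, meets 5 min RTO: {meets_rto(manual_failover, 300)}")
```

If the manual total exceeds the RTO, the extra cost of automatic failover is easier to justify for that workload.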

Pattern 2: Cross-AZ Load Balancing

Run targets in every AZ behind one load balancer, and use health checks so traffic only reaches healthy targets.

// AWS ALB with cross-zone load balancing (CDK fragment; runs inside a Stack constructor)
import * as cdk from 'aws-cdk-lib';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

// Placeholders defined elsewhere in the stack: an ACM certificate ARN and
// an Auto Scaling group whose subnets span every AZ
declare const certArn: string;
declare const asg: autoscaling.AutoScalingGroup;

const vpc = ec2.Vpc.fromLookup(this, 'Vpc', { isDefault: false });

const alb = new elbv2.ApplicationLoadBalancer(this, 'ALB', {
  vpc,
  internetFacing: true,
  loadBalancerName: 'multi-az-alb',
});

// ALB automatically distributes across AZs
// Cross-zone is enabled by default for ALB
const listener = alb.addListener('Listener', {
  port: 443,
  certificates: [elbv2.ListenerCertificate.fromArn(certArn)],
});

// addTargets creates the target group itself; pass targets, not another target group
listener.addTargets('AppTarget', {
  port: 8080,
  healthCheck: {
    path: '/health',
    interval: cdk.Duration.seconds(10),
    healthyThresholdCount: 2,
    unhealthyThresholdCount: 3,
  },
  targets: [asg],
});
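
The health check settings above decide how quickly a failed target leaves rotation. As a back-of-the-envelope sketch, detection time is roughly the check interval multiplied by the unhealthy threshold. This ignores the per-check timeout and where in the interval the failure starts, so treat the result as an estimate; the function names are illustrative.

```python
def seconds_to_mark_unhealthy(interval_seconds: float, unhealthy_threshold: int) -> float:
    """Approximate time until a failing target is taken out of rotation."""
    return interval_seconds * unhealthy_threshold


def seconds_to_mark_healthy(interval_seconds: float, healthy_threshold: int) -> float:
    """Approximate time until a recovered target receives traffic again."""
    return interval_seconds * healthy_threshold


if __name__ == "__main__":
    # Values taken from the listener configuration: 10 s interval, thresholds 3 and 2.
    print("out of rotation after ~", seconds_to_mark_unhealthy(10, 3), "s")
    print("back in rotation after ~", seconds_to_mark_healthy(10, 2), "s")
```

Shorter intervals detect an AZ failure sooner but send more probe traffic and react more readily to brief blips.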

Pattern 3: Session Affinity with Distributed Cache

Use an ElastiCache Redis replication group (one primary plus replicas in other AZs, with automatic failover) to store sessions, so they survive the loss of an AZ.

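# Redis replication group (cluster mode disabled) across multiple AZs
import boto3

elasticache = boto3.client('elasticache')
GROUP_ID = 'multi-az-session-cache'

# Create one primary and two replicas, intended to land in three different AZs
elasticache.create_replication_group(
    ReplicationGroupId=GROUP_ID,
    ReplicationGroupDescription='Session cache across AZs',
    NumCacheClusters=3,  # One per AZ
    CacheNodeType='cache.r6g.xlarge',
    Engine='redis',
    EngineVersion='7.0',
    AutomaticFailoverEnabled=True,
    MultiAZEnabled=True,
    CacheSubnetGroupName='multi-az-subnet-group',
    SecurityGroupIds=['sg-0123456789abcdef0'],
    Port=6379,
    AtRestEncryptionEnabled=True,
    TransitEncryptionEnabled=True,
)

# The endpoint only exists once the group is available, so wait and then look it up
elasticache.get_waiter('replication_group_available').wait(ReplicationGroupId=GROUP_ID)
group = elasticache.describe_replication_groups(
    ReplicationGroupId=GROUP_ID
)['ReplicationGroups'][0]
primary = group['NodeGroups'][0]['PrimaryEndpoint']

# Application session configuration
SESSION_CONFIG = {
    'host': primary['Address'],  # Stays the same across failovers
    'port': primary['Port'],
    'ssl': True,
    'socket_timeout': 5,
    'socket_connect_timeout': 5,
    'retry_on_timeout': True,
    'health_check_interval': 30,
}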
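
To show how an application might use such a cache, here is a minimal session store sketch. It assumes a client object that offers `setex(name, time, value)`, `get(name)` and `delete(name)`, which matches the redis-py `Redis` client. The `SessionStore` and `InMemoryClient` names, the `session:` key prefix, and the 30 minute TTL are illustrative assumptions; the in-memory class exists only so the example runs without a Redis server.

```python
import json
import time
import uuid
from typing import Any, Dict, Optional, Tuple


class SessionStore:
    """Stores each session as JSON under 'session:<id>' with a time-to-live."""

    def __init__(self, client: Any, ttl_seconds: int = 1800) -> None:
        self._client = client
        self._ttl = ttl_seconds

    def create(self, data: Dict[str, Any]) -> str:
        session_id = uuid.uuid4().hex
        self._client.setex(f"session:{session_id}", self._ttl, json.dumps(data))
        return session_id

    def load(self, session_id: str) -> Optional[Dict[str, Any]]:
        raw = self._client.get(f"session:{session_id}")
        return None if raw is None else json.loads(raw)

    def destroy(self, session_id: str) -> None:
        self._client.delete(f"session:{session_id}")


class InMemoryClient:
    """Stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[str, float]] = {}

    def setex(self, name: str, ttl: int, value: str) -> None:
        self._items[name] = (value, time.monotonic() + ttl)

    def get(self, name: str) -> Optional[str]:
        item = self._items.get(name)
        if item is None or item[1] <= time.monotonic():
            self._items.pop(name, None)
            return None
        return item[0]

    def delete(self, name: str) -> None:
        self._items.pop(name, None)


if __name__ == "__main__":
    store = SessionStore(InMemoryClient())
    sid = store.create({"user_id": "u-123"})
    print(store.load(sid))
```

Because the session lives in the cache rather than in one application instance, a request can land on an instance in any AZ after a failover and still find the user's session.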

Pattern 4: S3 Cross-Region Replication

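Replicate critical data to a second region for disaster recovery. S3 Standard already stores objects across multiple AZs within a region, so this pattern protects against a regional failure rather than an AZ failure.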

# S3 Cross-Region Replication Rule
# ReplicationRole (an IAM role that S3 can assume) is defined elsewhere in the template
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  SourceBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled  # Required on both source and destination buckets
      ReplicationConfiguration:
        Role: !GetAtt ReplicationRole.Arn
        Rules:
          - Id: CrossRegionReplication
            Status: Enabled
            Priority: 1  # Required when the rule uses Filter
            Destination:
              Bucket: arn:aws:s3:::backup-bucket-us-west-2
              StorageClass: STANDARD_IA
              ReplicationTime:
                Status: Enabled
                Time:
                  Minutes: 15
              Metrics:
                Status: Enabled
                EventThreshold:
                  Minutes: 15
            Filter:
              Prefix: critical-data/
            DeleteMarkerReplication:
              Status: Enabled

ℹ️

S3 Replication Time Control (RTC) is designed to replicate 99.99% of new objects within 15 minutes and is backed by an SLA, which makes it a fit for critical data.

Pattern 5: Cross-AZ DNS Failover

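Use Route 53 failover routing with health checks: DNS answers with the primary record while its health check passes and switches to the secondary when it fails. The private addresses and health check IDs below are placeholders; Route 53 health checkers cannot reach private IPs directly.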

{
  "Comment": "Multi-AZ failover configuration",
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "primary-az1",
        "Failover": "PRIMARY",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "10.0.1.100" }
        ],
        "HealthCheckId": "health-check-az1"
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "secondary-az2",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "10.0.2.100" }
        ],
        "HealthCheckId": "health-check-az2"
      }
    }
  ]
}
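
The behaviour of this record pair can be sketched as a small decision function plus a rough estimate of how long clients may keep using the old answer. The selection logic reflects my understanding of Route 53 failover routing, including the fallback to the primary record when both health checks fail. The 30 second interval and failure threshold of 3 used in the example are assumed health check settings; the 60 second TTL comes from the records above. Function names are illustrative.

```python
def failover_answer(primary_healthy: bool, secondary_healthy: bool) -> str:
    """Which record a failover pair is expected to return."""
    if primary_healthy:
        return "PRIMARY"
    if secondary_healthy:
        return "SECONDARY"
    # With nothing healthy, the primary record is returned rather than no answer.
    return "PRIMARY"


def worst_case_switch_seconds(
    check_interval_seconds: float, failure_threshold: int, ttl_seconds: float
) -> float:
    """Rough upper bound: time to detect the failure plus one full TTL of caching."""
    return check_interval_seconds * failure_threshold + ttl_seconds


if __name__ == "__main__":
    print(failover_answer(primary_healthy=False, secondary_healthy=True))
    print(worst_case_switch_seconds(30, 3, 60), "seconds")
```

The estimate ignores resolvers that cache beyond the TTL, so DNS failover is usually treated as a slower, coarser layer than load balancer health checks.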

Pattern 6: Aurora Global Database

Aurora Global Database replicates from a primary region to secondary regions, typically with under one second of lag.

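# Aurora Global Database via boto3 (global clusters are created through the RDS API, not SQL)
import boto3

# Primary Region: us-east-1
# Turn an existing Aurora cluster into the primary of a new global database
# (the account ID in the ARN is a placeholder)
rds_east = boto3.client('rds', region_name='us-east-1')
rds_east.create_global_cluster(
    GlobalClusterIdentifier='cluster-global',
    SourceDBClusterIdentifier='arn:aws:rds:us-east-1:123456789012:cluster:cluster-us-east-1',
)

# Secondary Region: us-west-2 (engine must match the primary; aurora-postgresql is assumed)
rds_west = boto3.client('rds', region_name='us-west-2')
rds_west.create_db_cluster(
    DBClusterIdentifier='cluster-us-west-2',
    Engine='aurora-postgresql',
    GlobalClusterIdentifier='cluster-global',
)

# Aurora replicates to the secondary with typically <1s lag
# Promote the secondary for DR, typically with <1 minute downtime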

⚠️

Aurora Global Database has a maximum of 5 secondary regions and 16 read replicas per region. Plan your architecture accordingly.

Best Practices Checklist

  1. Minimum 2 AZs for any production workload
  2. Health checks on all components with fast intervals (5-10 seconds)
  3. Connection pooling to handle failover gracefully (PgBouncer, HikariCP)
  4. Retry logic with exponential backoff in application code
  5. Chaos testing with tools like AWS Fault Injection Simulator
  6. RTO/RPO targets documented and validated quarterly
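
Item 4 can be sketched as a small helper. This is one common variant, capped exponential backoff with full jitter, not the only way to do it. The function name, the default delays, and the choice of `ConnectionError` and `TimeoutError` as retryable errors are assumptions; match them to the exceptions your database or cache driver actually raises during a failover.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying listed errors with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller
            # The cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: a random wait in [0, cap] spreads out reconnect storms.
            sleep(random.uniform(0, cap))
    raise RuntimeError("max_attempts must be at least 1")
```

A call such as `retry_with_backoff(lambda: run_query(conn))`, where `run_query` and `conn` are hypothetical, would then ride out a short failover window instead of failing on the first dropped connection. The `sleep` parameter is injectable so tests can run without real delays.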

Follow-Up Questions

  1. How do you handle session state during an AZ failover without losing user experience?
  2. What is the trade-off between synchronous and asynchronous replication for multi-AZ databases?
  3. How would you design a multi-AZ architecture for a real-time gaming application with strict latency requirements?
