High Availability & Disaster Recovery: RPO/RTO, Multi-AZ
Difficulty: Senior Level | Companies: AWS, Google, Microsoft, Netflix, Amazon
Interview Question
"Design a high availability architecture with 99.99% uptime. How do you handle RPO/RTO, multi-AZ deployments, and disaster recovery?"
ℹ️ Key Concepts
This question tests your understanding of availability engineering, disaster recovery strategies, and fault tolerance patterns.
Complete HA/DR Architecture
Architecture Overview
             HIGH AVAILABILITY ARCHITECTURE

+----------------- REGION 1 (PRIMARY) ------------------+
|                                                       |
|   +---------------------+   +---------------------+   |
|   | AZ-1a               |   | AZ-1b               |   |
|   |  +---------------+  |   |  +---------------+  |   |
|   |  | App Server    |  |   |  | App Server    |  |   |
|   |  | + DB Primary  |  |   |  | + DB Standby  |  |   |
|   |  +---------------+  |   |  +---------------+  |   |
|   +---------------------+   +---------------------+   |
|                                                       |
+-------------------------------------------------------+
                            |
                Cross-Region Replication
                            |
                            v
+-------------------- REGION 2 (DR) --------------------+
|                                                       |
|   +---------------------+   +---------------------+   |
|   | AZ-2a               |   | AZ-2b               |   |
|   |  +---------------+  |   |  +---------------+  |   |
|   |  | App Server    |  |   |  | App Server    |  |   |
|   |  | + DB Read     |  |   |  | + DB Read     |  |   |
|   |  +---------------+  |   |  +---------------+  |   |
|   +---------------------+   +---------------------+   |
|                                                       |
+-------------------------------------------------------+
Mathematical Foundation: Availability Calculations
Availability Formula:
- Availability = (Total Time - Downtime) / Total Time
- For 99.99% availability: Downtime ≤ 52.6 minutes/year
- For 99.999% availability: Downtime ≤ 5.26 minutes/year
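The downtime budgets above follow directly from the availability formula. A minimal sketch, assuming a 365-day year; the function name `allowed_downtime_minutes` is illustrative and not part of any library:

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the downtime budget in minutes for an availability target.

    Rearranges Availability = (Total Time - Downtime) / Total Time
    into Downtime = Total Time * (1 - Availability).
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100.0)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes/year")
```

Passing `period_days=30` gives the monthly budget, which is often the figure an SLA is actually measured against.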
RPO/RTO Calculations:
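- RPO (Recovery Point Objective): Maximum acceptable data loss, expressed as a window of time before the failure (for example, the last 60 seconds of writes)
- RTO (Recovery Time Objective): Maximum acceptable downtime, measured from the failure until service is restored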
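One way to sanity-check RPO/RTO targets is to estimate them from design parameters. The sketch below assumes that worst-case data loss is roughly the replication lag, and that recovery time is roughly detection time plus database promotion time plus DNS TTL. That breakdown and every name in the code are assumptions for illustration, not a standard formula.

```python
from dataclasses import dataclass


@dataclass
class RecoveryDesign:
    """Design parameters, all in seconds (hypothetical model)."""
    replication_lag_s: float  # worst-case lag of the DR replica behind the primary
    detection_s: float        # time for health checks to declare a failure
    promotion_s: float        # time to promote the DR database
    dns_ttl_s: float          # time for clients to pick up the new endpoint

    def estimated_rpo_s(self) -> float:
        # Writes not yet replicated at failure time are lost
        return self.replication_lag_s

    def estimated_rto_s(self) -> float:
        # Recovery steps happen one after another, so their durations add up
        return self.detection_s + self.promotion_s + self.dns_ttl_s

    def meets(self, rpo_target_s: float, rto_target_s: float) -> bool:
        return (self.estimated_rpo_s() <= rpo_target_s
                and self.estimated_rto_s() <= rto_target_s)


if __name__ == "__main__":
    design = RecoveryDesign(replication_lag_s=5, detection_s=90,
                            promotion_s=120, dns_ttl_s=60)
    print("Estimated RPO (s):", design.estimated_rpo_s())
    print("Estimated RTO (s):", design.estimated_rto_s())
    print("Meets RPO 60 s / RTO 300 s:", design.meets(60, 300))
```

The estimate is only a starting point; measured values from a DR test should replace it.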
Multi-AZ Availability:
- Single AZ availability (example figure): A_az = 99.95%
- Multi-AZ availability (two AZs, assuming their failures are independent): A_multi = 1 - (1 - A_az)^2
- A_multi = 1 - (1 - 0.9995)^2 = 99.999975%
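The same formula generalizes to any number of redundant copies. A small sketch, assuming failures are independent (real AZs share some failure modes, so treat the result as an upper bound); `parallel_availability` is an illustrative name:

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant units where any one suffices.

    The system is down only when every copy is down at once, so the
    combined unavailability is (1 - single) ** copies.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - single) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} AZ(s): {parallel_availability(0.9995, n) * 100:.7f}%")
```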
MTBF and MTTR:
- Mean Time Between Failures (MTBF): Average time between failures
- Mean Time To Recovery (MTTR): Average time to recover from failure
- Availability = MTBF / (MTBF + MTTR)
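The MTBF/MTTR relation can also be inverted to find how fast recovery must be for a given target. A minimal sketch with illustrative function names; both inputs use the same time unit:

```python
def availability_from_mtbf(mtbf: float, mttr: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)


def max_mttr_for_target(mtbf: float, target: float) -> float:
    """Largest MTTR that still meets `target` availability.

    Solving target = MTBF / (MTBF + MTTR) for MTTR gives
    MTTR = MTBF * (1 - target) / target.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return mtbf * (1 - target) / target


if __name__ == "__main__":
    # One failure every 1000 hours on average, one hour to recover
    print(f"Availability: {availability_from_mtbf(1000, 1):.6f}")
    # Recovery time needed to reach 99.99% with the same MTBF
    print(f"Max MTTR (hours): {max_mttr_for_target(1000, 0.9999):.4f}")
```

Because MTBF is usually hard to change, shortening MTTR through automated failover is typically the more practical lever.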
Failure Domains:
- AZ failure probability: P_az = 0.001 (0.1%)
- Region failure probability: P_region = 0.0001 (0.01%)
- Multi-region availability (two regions, assuming independent failures): A_region = 1 - P_region^2 = 1 - (0.0001)^2 = 99.999999%
Database High Availability
# Aurora Multi-AZ
resource "aws_rds_cluster" "primary" {
  cluster_identifier           = "primary-cluster"
  engine                       = "aurora-postgresql"
  engine_version               = "14.6"
  database_name                = "appdb"
  master_username              = var.db_username
  master_password              = var.db_password
  skip_final_snapshot          = false
  final_snapshot_identifier    = "primary-cluster-final"
  backup_retention_period      = 35
  preferred_backup_window      = "03:00-04:00"
  preferred_maintenance_window = "sun:04:00-sun:05:00"
  storage_encrypted            = true
  kms_key_id                   = aws_kms_key.rds.arn
  vpc_security_group_ids       = [aws_security_group.rds.id]
  db_subnet_group_name         = aws_db_subnet_group.main.name

  # Failover inside the cluster needs no flag: Aurora promotes a replica
  # instance automatically when the writer fails.
  # The settings below only control upgrades and when changes are applied.
  allow_major_version_upgrade = true
  apply_immediately           = false
}

# Writer instance
resource "aws_rds_cluster_instance" "primary" {
  identifier                      = "primary-instance"
  cluster_identifier              = aws_rds_cluster.primary.id
  instance_class                  = "db.r6g.xlarge"
  engine                          = aws_rds_cluster.primary.engine
  engine_version                  = aws_rds_cluster.primary.engine_version
  publicly_accessible             = false
  monitoring_interval             = 60
  performance_insights_enabled    = true
  performance_insights_kms_key_id = aws_kms_key.rds.arn
}

# Second instance in the same cluster: the in-region failover target
resource "aws_rds_cluster_instance" "secondary" {
  identifier                      = "secondary-instance"
  cluster_identifier              = aws_rds_cluster.primary.id
  instance_class                  = "db.r6g.xlarge"
  engine                          = aws_rds_cluster.primary.engine
  engine_version                  = aws_rds_cluster.primary.engine_version
  publicly_accessible             = false
  monitoring_interval             = 60
  performance_insights_enabled    = true
  performance_insights_kms_key_id = aws_kms_key.rds.arn
}

# Cross-region secondary cluster (Aurora Global Database)
resource "aws_rds_cluster" "dr" {
  provider           = aws.dr_region
  cluster_identifier = "dr-cluster"
  engine             = "aurora-postgresql"
  engine_version     = "14.6"

  # database_name and the master credentials are inherited from the primary;
  # a secondary cluster in a global database must not set them.

  # Join the global database. aws_rds_global_cluster.main is assumed to be
  # defined elsewhere with the primary cluster as its writer.
  global_cluster_identifier = aws_rds_global_cluster.main.id

  # The primary is encrypted, so the secondary must be too:
  # set kms_key_id to a KMS key that lives in the DR region.
  storage_encrypted = true

  # Create the secondary only after the primary has a writer instance
  depends_on = [aws_rds_cluster_instance.primary]
}
# Database failover automation
import time
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

import boto3


@dataclass
class FailoverConfig:
    primary_cluster_id: str
    dr_cluster_id: str
    health_check_interval: int = 30  # seconds between health checks
    failover_threshold: int = 3      # consecutive failures before failing over
    dr_region: Optional[str] = None  # region of the DR cluster (None = default region)


class DatabaseFailoverManager:
    """Database failover automation"""

    def __init__(self, config: FailoverConfig):
        self.config = config
        # Promotion has to be issued in the DR cluster's region
        self.rds = boto3.client('rds', region_name=config.dr_region)
        self.cloudwatch = boto3.client('cloudwatch')
        self.failover_count = 0  # consecutive failed health checks

    def monitor_and_failover(self):
        """Monitor database and failover if needed"""
        while True:
            if self._check_database_health():
                self.failover_count = 0
            else:
                self.failover_count += 1
                if self.failover_count >= self.config.failover_threshold:
                    self._execute_failover()
                    break
            time.sleep(self.config.health_check_interval)

    def _check_database_health(self) -> bool:
        """Check database health"""
        try:
            now = datetime.now(timezone.utc)
            response = self.cloudwatch.get_metric_statistics(
                Namespace='AWS/RDS',
                MetricName='DatabaseConnections',
                Dimensions=[
                    {
                        'Name': 'DBClusterIdentifier',
                        'Value': self.config.primary_cluster_id
                    }
                ],
                # Look back 5 minutes: CloudWatch metrics can lag by a minute or more
                StartTime=now - timedelta(minutes=5),
                EndTime=now,
                Period=60,
                Statistics=['Average']
            )
            datapoints = response['Datapoints']
            if not datapoints:
                # A database that is down reports no metrics, so missing data
                # counts as a failed check rather than as healthy
                return False
            # Datapoints are not returned in time order; pick the latest one
            latest = max(datapoints, key=lambda d: d['Timestamp'])
            return latest['Average'] < 1000  # Example threshold: connection saturation
        except Exception as e:
            print(f"Health check failed: {e}")
            return False

    def _execute_failover(self):
        """Execute database failover"""
        try:
            # Promote DR cluster.
            # Note: if the DR cluster is a member of an Aurora Global Database,
            # detach it with remove_from_global_cluster (or use
            # failover_global_cluster) instead of this call.
            self.rds.promote_read_replica_db_cluster(
                DBClusterIdentifier=self.config.dr_cluster_id
            )
            # Wait for promotion
            self.rds.get_waiter('db_cluster_available').wait(
                DBClusterIdentifier=self.config.dr_cluster_id
            )
            # Update application endpoints
            self._update_application_endpoints()
            print(f"Failover completed: {self.config.dr_cluster_id}")
        except Exception as e:
            print(f"Failover failed: {e}")
            raise

    def _update_application_endpoints(self):
        """Update application endpoints to point to DR cluster"""
        # In production, update DNS or configuration
        pass
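The consecutive-failure counting used by `monitor_and_failover` can be separated from the AWS calls so it can be unit-tested without a live database. A minimal sketch; `ConsecutiveFailureDetector` is a hypothetical helper, not part of boto3:

```python
class ConsecutiveFailureDetector:
    """Signals failover only after `threshold` failed probes in a row.

    A single healthy probe resets the count, which keeps one transient
    error (a dropped packet, a slow metric) from triggering failover.
    """

    def __init__(self, threshold: int = 3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True when failover should start."""
        if healthy:
            self.failures = 0
        else:
            self.failures += 1
        return self.failures >= self.threshold


if __name__ == "__main__":
    detector = ConsecutiveFailureDetector(threshold=3)
    probes = [False, False, True, False, False, False]
    for i, healthy in enumerate(probes, start=1):
        if detector.record(healthy):
            print(f"Failover triggered on probe {i}")
            break
```

With a 30-second interval and a threshold of 3, this design accepts roughly 90 seconds of detection delay in exchange for fewer false failovers.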
Load Balancer Configuration
# Application Load Balancer
resource "aws_lb" "main" {
  name                       = "main-alb"
  internal                   = false
  load_balancer_type         = "application"
  security_groups            = [aws_security_group.alb.id]
  subnets                    = aws_subnet.public[*].id
  enable_deletion_protection = true

  access_logs {
    bucket  = aws_s3_bucket.alb_logs.id
    prefix  = "alb-logs"
    enabled = true
  }

  tags = {
    Environment = var.environment
  }
}

# Target group with health checks
resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    path                = "/health"
    port                = "traffic-port"
    matcher             = "200"
  }

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400
    enabled         = true
  }

  deregistration_delay = 300
}

# Listener with SSL
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate.main.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

# Route53 health check
resource "aws_route53_health_check" "alb" {
  fqdn              = aws_lb.main.dns_name
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10
}
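⚠️ Load Balancer Best Practices
Deploy the ALB across subnets in at least two AZs, and use Route53 health checks to fail over automatically to a second ALB in the DR region. Configure connection draining (deregistration delay) and slow start so that targets leave and join rotation smoothly during deployments.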
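The health check settings in the configuration above also set a floor on RTO, because a failure cannot be handled before it is detected. As a rough approximation (ignoring per-check timeouts and propagation delays), detection takes about the check interval multiplied by the failure threshold. The function name is illustrative:

```python
def detection_time_s(interval_s: int, failure_threshold: int) -> int:
    """Approximate seconds until a target is marked unhealthy.

    Assumes the failure starts just after a successful check, so
    `failure_threshold` full intervals pass before it is declared.
    """
    if interval_s <= 0 or failure_threshold <= 0:
        raise ValueError("interval_s and failure_threshold must be positive")
    return interval_s * failure_threshold


if __name__ == "__main__":
    # Values taken from the target group and Route53 health check above
    print("ALB target group:", detection_time_s(30, 3), "seconds")
    print("Route53 health check:", detection_time_s(10, 3), "seconds")
```

If detection alone consumes a large share of the RTO budget, a shorter interval or a lower threshold is the first thing to revisit.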
Disaster Recovery Testing
# DR testing automation
import socket
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

import boto3

RTO_TARGET_SECONDS = 300  # 5 minutes
RPO_TARGET_SECONDS = 60   # 1 minute


@dataclass
class DRTestResult:
    test_name: str
    status: str
    rto_actual: float
    rpo_actual: float
    rto_target: float
    rpo_target: float
    issues: List[str]


class DRTestManager:
    """Disaster recovery testing manager"""

    def __init__(self, dr_region: Optional[str] = None):
        self.rds = boto3.client('rds')
        # The DR cluster lives in another region and needs its own client
        self.rds_dr = boto3.client('rds', region_name=dr_region)
        self.route53 = boto3.client('route53')
        self.s3 = boto3.client('s3')

    def run_full_dr_test(self) -> DRTestResult:
        """Run full DR test"""
        start_time = datetime.now(timezone.utc)
        # Test 1: Database failover
        db_result = self._test_database_failover()
        # Test 2: DNS failover
        dns_result = self._test_dns_failover()
        # Test 3: Data integrity
        data_result = self._test_data_integrity()
        end_time = datetime.now(timezone.utc)
        # Includes the fixed waits in the tests, so this is an upper bound
        rto_actual = (end_time - start_time).total_seconds()
        # Calculate RPO
        rpo_actual = self._calculate_rpo()

        # Record failed checks and missed targets
        issues = []
        if not db_result:
            issues.append("Database failover test failed")
        if not dns_result:
            issues.append("DNS failover test failed")
        if not data_result:
            issues.append("Data integrity test failed")
        if rto_actual > RTO_TARGET_SECONDS:
            issues.append(f"RTO {rto_actual:.0f}s exceeds target {RTO_TARGET_SECONDS}s")
        if rpo_actual > RPO_TARGET_SECONDS:
            issues.append(f"RPO {rpo_actual:.0f}s exceeds target {RPO_TARGET_SECONDS}s")

        return DRTestResult(
            test_name="Full DR Test",
            status="passed" if not issues else "failed",
            rto_actual=rto_actual,
            rpo_actual=rpo_actual,
            rto_target=RTO_TARGET_SECONDS,
            rpo_target=RPO_TARGET_SECONDS,
            issues=issues
        )

    def _test_database_failover(self) -> bool:
        """Test database failover"""
        try:
            # Simulate primary failure by forcing a failover of the writer.
            # (stop_db_instance does not accept SkipFinalSnapshot and cannot
            # stop a single Aurora instance.)
            self.rds.failover_db_cluster(
                DBClusterIdentifier='primary-cluster'
            )
            # Wait for automatic failover
            time.sleep(60)
            # Verify DR cluster is active
            response = self.rds_dr.describe_db_clusters(
                DBClusterIdentifier='dr-cluster'
            )
            cluster = response['DBClusters'][0]
            return cluster['Status'] == 'available'
        except Exception as e:
            print(f"Database failover test failed: {e}")
            return False

    def _test_dns_failover(self) -> bool:
        """Test DNS failover"""
        try:
            # Update DNS to point to DR
            self.route53.change_resource_record_sets(
                HostedZoneId='Z1234567890',
                ChangeBatch={
                    'Changes': [
                        {
                            'Action': 'UPSERT',
                            'ResourceRecordSet': {
                                'Name': 'app.example.com',
                                'Type': 'A',
                                # Identifier matches the SECONDARY failover role
                                'SetIdentifier': 'secondary',
                                'Failover': 'SECONDARY',
                                'TTL': 60,
                                'ResourceRecords': [
                                    {'Value': '203.0.113.10'}
                                ]
                            }
                        }
                    ]
                }
            )
            # Wait for DNS propagation
            time.sleep(120)
            # Verify DNS resolves to DR
            ip = socket.gethostbyname('app.example.com')
            return ip == '203.0.113.10'
        except Exception as e:
            print(f"DNS failover test failed: {e}")
            return False

    def _test_data_integrity(self) -> bool:
        """Test data integrity"""
        try:
            # Compare data between primary and DR
            # In production, use database comparison tools
            return True
        except Exception as e:
            print(f"Data integrity test failed: {e}")
            return False

    def _calculate_rpo(self) -> float:
        """Calculate actual RPO"""
        # In production, compare timestamps of last replicated data
        return 0.0
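The `_calculate_rpo` stub notes that a real implementation would compare timestamps of the last replicated data. A minimal sketch of that comparison, assuming the two timestamps have already been obtained (for example from a heartbeat row written on the primary and read on the replica; that mechanism and the function name are assumptions):

```python
from datetime import datetime, timezone


def measured_rpo_seconds(last_primary_commit: datetime,
                         last_replicated_commit: datetime) -> float:
    """Seconds of writes that would be lost if the primary failed now.

    This is how far the replica trails the primary. Clock skew can make
    the replica appear ahead, so the result is clamped at zero.
    """
    lag = (last_primary_commit - last_replicated_commit).total_seconds()
    return max(lag, 0.0)


if __name__ == "__main__":
    primary = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    replica = datetime(2024, 1, 1, 11, 59, 48, tzinfo=timezone.utc)
    print("Measured RPO (s):", measured_rpo_seconds(primary, replica))
```

Both timestamps should be timezone-aware and come from the same clock source where possible; otherwise the measurement includes clock drift.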
Chaos Engineering
# Chaos engineering for resilience testing
import random
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

import boto3


@dataclass
class ChaosExperiment:
    name: str
    description: str
    target: str
    action: Callable
    rollback: Callable
    duration: int  # seconds


class ChaosEngineer:
    """Chaos engineering automation"""

    def __init__(self):
        self.ec2 = boto3.client('ec2')
        self.rds = boto3.client('rds')
        self.experiments: List[ChaosExperiment] = []

    def add_experiment(self, experiment: ChaosExperiment):
        """Add chaos experiment"""
        self.experiments.append(experiment)

    def run_experiment(self, experiment: ChaosExperiment) -> Dict[str, Any]:
        """Run chaos experiment"""
        start_time = datetime.now(timezone.utc)
        try:
            # Execute chaos action
            experiment.action()
            # Monitor for duration
            time.sleep(experiment.duration)
            # Check if system is healthy
            healthy = self._check_system_health()
        except Exception as e:
            return {
                'experiment': experiment.name,
                'status': 'error',
                'error': str(e)
            }
        finally:
            # Roll back exactly once, whether the experiment passed or errored
            experiment.rollback()
        end_time = datetime.now(timezone.utc)
        duration = (end_time - start_time).total_seconds()
        return {
            'experiment': experiment.name,
            'status': 'passed' if healthy else 'failed',
            'duration': duration,
            'healthy': healthy
        }

    def _check_system_health(self) -> bool:
        """Check if system is healthy after chaos"""
        # In production, check health endpoints
        return True

    def terminate_random_instance(self):
        """Terminate random EC2 instance"""
        # Limit the blast radius to instances that opted in. The tag name
        # 'chaos-enabled' is an assumed convention, not an AWS default.
        instances = self.ec2.describe_instances(
            Filters=[
                {'Name': 'instance-state-name', 'Values': ['running']},
                {'Name': 'tag:chaos-enabled', 'Values': ['true']}
            ]
        )
        all_instances = []
        for reservation in instances['Reservations']:
            for instance in reservation['Instances']:
                all_instances.append(instance['InstanceId'])
        if all_instances:
            target = random.choice(all_instances)
            self.ec2.terminate_instances(InstanceIds=[target])
            print(f"Terminated instance: {target}")

    def restore_instances(self):
        """Restore terminated instances"""
        # In production, use Auto Scaling or launch templates
        pass
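Target selection for a chaos experiment is safer when it is a pure function that can be tested before anything is terminated. The sketch below picks only from instances that opted in and refuses to act if too few instances would remain. The `chaos_enabled` key, the `min_survivors` guard, and the function name are assumptions for illustration:

```python
import random
from typing import Dict, List, Optional


def pick_chaos_target(instances: List[Dict[str, str]],
                      min_survivors: int = 1,
                      rng: Optional[random.Random] = None) -> Optional[str]:
    """Return one opted-in instance ID, or None if no safe target exists.

    Each instance is a dict with 'InstanceId' and an optional
    'chaos_enabled' flag ("true" to opt in).
    """
    rng = rng or random.Random()
    # Terminating one instance must leave at least `min_survivors` running
    if len(instances) - 1 < min_survivors:
        return None
    eligible = [i["InstanceId"] for i in instances
                if i.get("chaos_enabled") == "true"]
    if not eligible:
        return None
    return rng.choice(eligible)


if __name__ == "__main__":
    fleet = [
        {"InstanceId": "i-aaa", "chaos_enabled": "true"},
        {"InstanceId": "i-bbb", "chaos_enabled": "true"},
        {"InstanceId": "i-ccc"},  # not opted in, never selected
    ]
    print("Target:", pick_chaos_target(fleet, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes an experiment reproducible, which helps when comparing runs.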
✅ HA/DR Benefits
High availability and disaster recovery ensure business continuity. Regular testing is essential to validate RPO/RTO targets and identify weaknesses.
Summary
| Strategy | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Backup/Restore | Hours | Hours | Low | Low |
| Pilot Light | Minutes | Minutes | Medium | Medium |
| Warm Standby | Seconds | Minutes | High | High |
| Multi-Site Active-Active | Near zero | Near zero | Very High | Very High |
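The table can be read as a lookup: choose the cheapest strategy whose RPO and RTO classes are at least as tight as the requirement. A minimal sketch using only the table's qualitative classes; the class names are lowercased, with `near-zero` standing for the tightest class (the active-active row). The function name and the assumption that rows are ordered by cost are illustrative:

```python
# Recovery classes ordered from loosest to tightest
LEVELS = ["hours", "minutes", "seconds", "near-zero"]

# (strategy, RPO class, RTO class), listed from cheapest to most expensive
STRATEGIES = [
    ("Backup/Restore", "hours", "hours"),
    ("Pilot Light", "minutes", "minutes"),
    ("Warm Standby", "seconds", "minutes"),
    ("Multi-Site Active-Active", "near-zero", "near-zero"),
]


def cheapest_strategy(required_rpo: str, required_rto: str) -> str:
    """Return the first (cheapest) strategy that meets both requirements."""
    need_rpo = LEVELS.index(required_rpo)  # raises ValueError on unknown class
    need_rto = LEVELS.index(required_rto)
    for name, rpo, rto in STRATEGIES:
        if LEVELS.index(rpo) >= need_rpo and LEVELS.index(rto) >= need_rto:
            return name
    raise ValueError("no strategy meets the requirement")


if __name__ == "__main__":
    print(cheapest_strategy("minutes", "minutes"))
    print(cheapest_strategy("seconds", "seconds"))
```

In practice the classes would be replaced by measured RPO/RTO values from DR tests, since the labels in the table are only orders of magnitude.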