High Availability & Disaster Recovery: RPO/RTO, Multi-AZ

Difficulty: Senior Level | Companies: AWS, Google, Microsoft, Netflix, Amazon

Interview Question

"Design a high availability architecture with 99.99% uptime. How do you handle RPO/RTO, multi-AZ deployments, and disaster recovery?"

ℹ️Key Concepts

This question tests your understanding of availability engineering, disaster recovery strategies, and fault tolerance patterns.

Complete HA/DR Architecture

Architecture Overview

Architecture Diagram

┌───────────────────────────────────────────────────────────┐
│              HIGH AVAILABILITY ARCHITECTURE               │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  ┌───────────────── REGION 1 (PRIMARY) ────────────────┐  │
│  │                                                     │  │
│  │  ┌─────────────────┐  ┌─────────────────┐           │  │
│  │  │      AZ-1a      │  │      AZ-1b      │           │  │
│  │  │                 │  │                 │           │  │
│  │  │ ┌─────────────┐ │  │ ┌─────────────┐ │           │  │
│  │  │ │ App Server  │ │  │ │ App Server  │ │           │  │
│  │  │ │ + DB Primary│ │  │ │ + DB Standby│ │           │  │
│  │  │ └─────────────┘ │  │ └─────────────┘ │           │  │
│  │  │                 │  │                 │           │  │
│  │  └─────────────────┘  └─────────────────┘           │  │
│  │                                                     │  │
│  └─────────────────────────────────────────────────────┘  │
│                             │                             │
│                  Cross-Region Replication                 │
│                             │                             │
│  ┌─────────────────── REGION 2 (DR) ───────────────────┐  │
│  │                                                     │  │
│  │  ┌─────────────────┐  ┌─────────────────┐           │  │
│  │  │      AZ-2a      │  │      AZ-2b      │           │  │
│  │  │                 │  │                 │           │  │
│  │  │ ┌─────────────┐ │  │ ┌─────────────┐ │           │  │
│  │  │ │ App Server  │ │  │ │ App Server  │ │           │  │
│  │  │ │ + DB Read   │ │  │ │ + DB Read   │ │           │  │
│  │  │ └─────────────┘ │  │ └─────────────┘ │           │  │
│  │  │                 │  │                 │           │  │
│  │  └─────────────────┘  └─────────────────┘           │  │
│  │                                                     │  │
│  └─────────────────────────────────────────────────────┘  │
│                                                           │
└───────────────────────────────────────────────────────────┘

Mathematical Foundation: Availability Calculations

Availability Formula:

Availability = (Total Time - Downtime) / Total Time
For 99.99% availability: Downtime ≤ 52.6 minutes/year
For 99.999% availability: Downtime ≤ 5.26 minutes/year
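
The downtime figures above follow directly from the formula. A minimal sketch for checking them, assuming a 365-day year; the function name downtime_budget_minutes is illustrative and not part of any library.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(availability: float) -> float:
    """Return the allowed downtime per year, in minutes.

    `availability` is a fraction, e.g. 0.9999 for "four nines".
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the yearly downtime budget for a few common targets
    for target in (0.999, 0.9999, 0.99999):
        print(f"{target:.3%} -> {downtime_budget_minutes(target):.2f} min/year")
```

Each additional nine cuts the budget by a factor of ten, which is why the jump from 99.99% to 99.999% usually requires automated rather than manual recovery.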

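RPO/RTO Definitions:

RPO (Recovery Point Objective): Maximum acceptable data loss, expressed as the time between the last recoverable data point and the failure
RTO (Recovery Time Objective): Maximum acceptable downtime, expressed as the time from the failure until service is restored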
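
Both objectives can be measured on the same incident timeline: RPO looks backward from the failure, RTO looks forward from it. The sketch below shows one way to compute the achieved values; the function names and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated_write: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: writes made after the last replicated one are lost."""
    return max(failure_time - last_replicated_write, timedelta(0))


def achieved_rto(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime window: from the failure until service is restored."""
    return service_restored - failure_time


if __name__ == "__main__":
    # Hypothetical incident timeline
    failure = datetime(2024, 1, 1, 12, 0, 0)
    last_write = failure - timedelta(seconds=20)
    restored = failure + timedelta(minutes=4)

    print("RPO achieved:", achieved_rpo(last_write, failure))
    print("RTO achieved:", achieved_rto(failure, restored))
```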

Multi-AZ Availability:

Single AZ availability: A_az = 99.95%
Multi-AZ availability (two AZs, assuming independent failures): A_multi = 1 - (1 - A_az)^2
A_multi = 1 - (1 - 0.9995)^2 = 99.999975%
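
The same formula generalizes to any number of redundant AZs. A minimal sketch, assuming failures are independent (real AZs only approximate this, so treat the result as an upper bound); parallel_availability is an illustrative name.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments, any one of which can serve.

    The system is down only when every copy is down at the same time,
    which under independence has probability (1 - single) ** copies.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Compare one, two, and three AZs at 99.95% each
    for n in (1, 2, 3):
        print(f"{n} AZ(s): {parallel_availability(0.9995, n):.8%}")
```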

MTBF and MTTR:

Mean Time Between Failures (MTBF): Average time between failures
Mean Time To Recovery (MTTR): Average time to recover from failure
Availability = MTBF / (MTBF + MTTR)
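
The MTBF/MTTR relationship can also be inverted to find how fast recovery must be for a given target. A small sketch under that formula; both function names are illustrative and the inputs are in hours.

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_for_target(mtbf_hours: float, target: float) -> float:
    """Largest MTTR that still meets `target`.

    Solving A = MTBF / (MTBF + MTTR) for MTTR gives MTBF * (1 - A) / A.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return mtbf_hours * (1.0 - target) / target


if __name__ == "__main__":
    mtbf = 1000.0  # hypothetical: one failure roughly every 1000 hours
    mttr = max_mttr_for_target(mtbf, 0.9999)
    print(f"MTTR must stay under {mttr * 60:.1f} minutes per failure")
```

With a fixed MTBF, reducing MTTR is the only way to raise availability, which is why automated failover features so heavily in the rest of this design.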

Failure Domains:

AZ failure probability: P_az = 0.001 (0.1%)
Region failure probability: P_region = 0.0001 (0.01%)
Multi-region availability (two regions, assuming independent failures): A_region = 1 - P_region^2 = 1 - (0.0001)^2 = 99.999999%

Database High Availability

# Aurora Multi-AZ
resource "aws_rds_cluster" "primary" {
  cluster_identifier     = "primary-cluster"
  engine                = "aurora-postgresql"
  engine_version        = "14.6"
  database_name         = "appdb"
  master_username       = var.db_username
  master_password       = var.db_password
  skip_final_snapshot   = false
  final_snapshot_identifier = "primary-cluster-final"

  backup_retention_period      = 35
  preferred_backup_window      = "03:00-04:00"
  preferred_maintenance_window = "sun:04:00-sun:05:00"

  storage_encrypted = true
  kms_key_id       = aws_kms_key.rds.arn

  vpc_security_group_ids = [aws_security_group.rds.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

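  # Aurora fails over automatically to another instance in the cluster;
  # no flag is required. The settings below only control upgrades:
  # major version upgrades are allowed, but changes wait for the
  # maintenance window instead of being applied immediately.
  allow_major_version_upgrade = true
  apply_immediately          = false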
}

resource "aws_rds_cluster_instance" "primary" {
  identifier         = "primary-instance"
  cluster_identifier = aws_rds_cluster.primary.id
  instance_class     = "db.r6g.xlarge"
  engine             = aws_rds_cluster.primary.engine
  engine_version     = aws_rds_cluster.primary.engine_version

  publicly_accessible = false
  monitoring_interval = 60

  performance_insights_enabled = true
  performance_insights_kms_key_id = aws_kms_key.rds.arn
}

resource "aws_rds_cluster_instance" "secondary" {
  identifier         = "secondary-instance"
  cluster_identifier = aws_rds_cluster.primary.id
  instance_class     = "db.r6g.xlarge"
  engine             = aws_rds_cluster.primary.engine
  engine_version     = aws_rds_cluster.primary.engine_version

  publicly_accessible = false
  monitoring_interval = 60

  performance_insights_enabled = true
  performance_insights_kms_key_id = aws_kms_key.rds.arn
}

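# Cross-region secondary cluster (Aurora Global Database)
resource "aws_rds_cluster" "dr" {
  provider = aws.dr_region

  cluster_identifier = "dr-cluster"
  engine             = "aurora-postgresql"
  engine_version     = "14.6"

  # Join the global database as a secondary cluster. The database name and
  # master credentials are inherited from the primary, so they are omitted.
  global_cluster_identifier = aws_rds_global_cluster.main.id

  # Create the secondary only after the primary has a writer instance.
  depends_on = [aws_rds_cluster_instance.primary]
}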

# Database failover automation
import boto3
import time
from typing import Dict, Any
from dataclasses import dataclass

@dataclass
class FailoverConfig:
    primary_cluster_id: str
    dr_cluster_id: str
    health_check_interval: int = 30
    failover_threshold: int = 3

class DatabaseFailoverManager:
    """Database failover automation"""

    def __init__(self, config: FailoverConfig):
        self.config = config
        self.rds = boto3.client('rds')
        self.cloudwatch = boto3.client('cloudwatch')
        self.failover_count = 0

    def monitor_and_failover(self):
        """Monitor database and failover if needed"""
        while True:
            if self._check_database_health():
                self.failover_count = 0
            else:
                self.failover_count += 1

                if self.failover_count >= self.config.failover_threshold:
                    self._execute_failover()
                    break

            time.sleep(self.config.health_check_interval)

    def _check_database_health(self) -> bool:
        """Check database health"""
        try:
            response = self.cloudwatch.get_metric_statistics(
                Namespace='AWS/RDS',
                MetricName='DatabaseConnections',
                Dimensions=[
                    {
                        'Name': 'DBClusterIdentifier',
                        'Value': self.config.primary_cluster_id
                    }
                ],
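                StartTime=time.time() - 300,  # 5-minute window tolerates metric delay
                EndTime=time.time(),
                Period=60,
                Statistics=['Average']
            )

            datapoints = response['Datapoints']
            if not datapoints:
                # No recent metrics usually means the cluster has stopped
                # reporting, so do not assume it is healthy.
                return False
            # Datapoints are not returned in time order; pick the latest.
            latest = max(datapoints, key=lambda d: d['Timestamp'])
            return latest['Average'] < 1000  # Connection-count threshold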

        except Exception as e:
            print(f"Health check failed: {e}")
            return False

    def _execute_failover(self):
        """Execute database failover"""
        try:
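            # Promote DR cluster. This call applies to a cross-region read
            # replica cluster and must be sent by a client for the DR region.
            # If the DR cluster belongs to an Aurora Global Database, use
            # failover_global_cluster or remove_from_global_cluster instead.
            self.rds.promote_read_replica_db_cluster(
                DBClusterIdentifier=self.config.dr_cluster_id
            )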

            # Wait for promotion
            self.rds.get_waiter('db_cluster_available').wait(
                DBClusterIdentifier=self.config.dr_cluster_id
            )

            # Update application endpoints
            self._update_application_endpoints()

            print(f"Failover completed: {self.config.dr_cluster_id}")

        except Exception as e:
            print(f"Failover failed: {e}")
            raise

    def _update_application_endpoints(self):
        """Update application endpoints to point to DR cluster"""
        # In production, update DNS or configuration
        pass
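
The monitor loop above declares a failure only after failover_threshold consecutive failed checks, spaced health_check_interval seconds apart, so detection alone consumes part of the RTO. The sketch below adds up the phases of a failover to see whether a target is reachable. The class name RtoBudget and the promotion and DNS TTL values are hypothetical placeholders, not measured figures.

```python
from dataclasses import dataclass


@dataclass
class RtoBudget:
    health_check_interval: int  # seconds between health checks
    failover_threshold: int     # consecutive failures before failover
    promotion_seconds: int      # assumed time to promote the DR cluster
    dns_ttl_seconds: int        # assumed worst-case client DNS cache lifetime

    def detection_seconds(self) -> int:
        """Worst-case time from the failure until the threshold is reached."""
        return self.health_check_interval * self.failover_threshold

    def total_seconds(self) -> int:
        """Detection plus promotion plus DNS cache expiry."""
        return (
            self.detection_seconds()
            + self.promotion_seconds
            + self.dns_ttl_seconds
        )

    def meets(self, rto_target_seconds: int) -> bool:
        """True if the summed phases fit within the RTO target."""
        return self.total_seconds() <= rto_target_seconds


if __name__ == "__main__":
    budget = RtoBudget(
        health_check_interval=30,
        failover_threshold=3,
        promotion_seconds=120,  # hypothetical
        dns_ttl_seconds=60,     # hypothetical
    )
    print("Detection:", budget.detection_seconds(), "s")
    print("Total:", budget.total_seconds(), "s")
    print("Meets 300 s RTO:", budget.meets(300))
```

Lowering the check interval or the threshold shortens detection but increases the risk of failing over on a transient blip, so the two settings are a trade-off rather than values to minimize.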

Load Balancer Configuration

# Application Load Balancer
resource "aws_lb" "main" {
  name               = "main-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets           = aws_subnet.public[*].id

  enable_deletion_protection = true

  access_logs {
    bucket  = aws_s3_bucket.alb_logs.id
    prefix  = "alb-logs"
    enabled = true
  }

  tags = {
    Environment = var.environment
  }
}

# Target group with health checks
resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    path                = "/health"
    port                = "traffic-port"
    matcher             = "200"
  }

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400
    enabled         = true
  }

  deregistration_delay = 300
}

# Listener with SSL
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate.main.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

# Route53 health check
resource "aws_route53_health_check" "alb" {
  fqdn              = aws_lb.main.dns_name
  port               = 443
  type               = "HTTPS"
  resource_path      = "/health"
  failure_threshold = 3
  request_interval  = 10
}

⚠️Load Balancer Best Practices

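An ALB already spans the AZs of its subnets, so attach subnets in at least two AZs and run one ALB per region, with Route53 health checks to fail over between regions. Enable connection draining (the deregistration delay) and slow start on the target group for smooth deployments.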
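
Both the target group and the Route53 health check above poll /health and expect a 200 response. A minimal sketch of such an endpoint using only the Python standard library follows. The CHECKS probes, the serve helper, and port 8080 are hypothetical stand-ins for real dependency checks and the application's actual port.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical dependency probes; replace with real database/cache checks.
CHECKS: Dict[str, Callable[[], bool]] = {
    "database": lambda: True,
    "cache": lambda: True,
}


def health_status() -> Tuple[int, Dict[str, bool]]:
    """Return (http_status, results); status is 200 only if every probe passes."""
    results: Dict[str, bool] = {}
    for name, probe in CHECKS.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises counts as a failed dependency
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, results = health_status()
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the listener. One design caution: if every instance probes the same shared dependency, a single dependency outage can mark all targets unhealthy at once, so keep load balancer checks shallow unless that behavior is intended.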

Disaster Recovery Testing

# DR testing automation
import boto3
import time  # used by the sleep calls in the tests below
from typing import List
from datetime import datetime
from dataclasses import dataclass

@dataclass
class DRTestResult:
    test_name: str
    status: str
    rto_actual: float
    rpo_actual: float
    rto_target: float
    rpo_target: float
    issues: List[str]

class DRTestManager:
    """Disaster recovery testing manager"""

    def __init__(self):
        self.rds = boto3.client('rds')
        self.route53 = boto3.client('route53')
        self.s3 = boto3.client('s3')

    def run_full_dr_test(self) -> DRTestResult:
        """Run full DR test"""
        start_time = datetime.utcnow()

        # Test 1: Database failover
        db_result = self._test_database_failover()

        # Test 2: DNS failover
        dns_result = self._test_dns_failover()

        # Test 3: Data integrity
        data_result = self._test_data_integrity()

        end_time = datetime.utcnow()
        rto_actual = (end_time - start_time).total_seconds()

        # Calculate RPO
        rpo_actual = self._calculate_rpo()

        return DRTestResult(
            test_name="Full DR Test",
            status="passed" if all([db_result, dns_result, data_result]) else "failed",
            rto_actual=rto_actual,
            rpo_actual=rpo_actual,
            rto_target=300,  # 5 minutes
            rpo_target=60,   # 1 minute
            issues=[]
        )

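    def _test_database_failover(self) -> bool:
        """Test Aurora failover within the primary cluster"""
        try:
            # Simulate primary failure by forcing a failover to the other
            # instance. (stop_db_instance has no SkipFinalSnapshot parameter
            # and cannot stop a single Aurora instance.)
            self.rds.failover_db_cluster(
                DBClusterIdentifier='primary-cluster',
                TargetDBInstanceIdentifier='secondary-instance'
            )

            # Wait for the failover to complete
            time.sleep(60)

            # Verify the cluster is available and the writer role has moved
            cluster = self.rds.describe_db_clusters(
                DBClusterIdentifier='primary-cluster'
            )['DBClusters'][0]
            writers = [
                member['DBInstanceIdentifier']
                for member in cluster['DBClusterMembers']
                if member['IsClusterWriter']
            ]

            # Checking 'dr-cluster' as well would need an RDS client
            # created for the DR region.
            return (
                cluster['Status'] == 'available'
                and writers == ['secondary-instance']
            )

        except Exception as e:
            print(f"Database failover test failed: {e}")
            return False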

    def _test_dns_failover(self) -> bool:
        """Test DNS failover"""
        try:
            # Upsert the SECONDARY failover record that points to DR.
            # Route 53 serves it only while the PRIMARY record's health
            # check is failing.
            self.route53.change_resource_record_sets(
                HostedZoneId='Z1234567890',
                ChangeBatch={
                    'Changes': [
                        {
                            'Action': 'UPSERT',
                            'ResourceRecordSet': {
                                'Name': 'app.example.com',
                                'Type': 'A',
                                'SetIdentifier': 'secondary',
                                'Failover': 'SECONDARY',
                                'TTL': 60,
                                'ResourceRecords': [
                                    {'Value': '203.0.113.10'}
                                ]
                            }
                        }
                    ]
                }
            )

            # Wait for DNS propagation
            time.sleep(120)

            # Verify DNS resolves to DR
            import socket
            ip = socket.gethostbyname('app.example.com')
            return ip == '203.0.113.10'

        except Exception as e:
            print(f"DNS failover test failed: {e}")
            return False

    def _test_data_integrity(self) -> bool:
        """Test data integrity"""
        try:
            # Compare data between primary and DR
            # In production, use database comparison tools
            return True

        except Exception as e:
            print(f"Data integrity test failed: {e}")
            return False

    def _calculate_rpo(self) -> float:
        """Calculate actual RPO"""
        # In production, compare timestamps of last replicated data
        return 0.0
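
run_full_dr_test above records the actual and target values but never compares them, so issues is always empty and a test can pass while missing its RTO. A sketch of the missing comparison follows; find_dr_issues is an illustrative helper name and all values are in seconds.

```python
from typing import List


def find_dr_issues(
    rto_actual: float,
    rpo_actual: float,
    rto_target: float,
    rpo_target: float,
) -> List[str]:
    """Return one human-readable issue per missed objective (values in seconds)."""
    issues: List[str] = []
    if rto_actual > rto_target:
        issues.append(
            f"RTO missed: {rto_actual:.0f}s actual vs {rto_target:.0f}s target"
        )
    if rpo_actual > rpo_target:
        issues.append(
            f"RPO missed: {rpo_actual:.0f}s actual vs {rpo_target:.0f}s target"
        )
    return issues


if __name__ == "__main__":
    # Hypothetical results: recovery took 7 minutes and lost 90 seconds of data
    for issue in find_dr_issues(420, 90, rto_target=300, rpo_target=60):
        print(issue)
```

The returned list could populate DRTestResult.issues, with the overall status set to failed whenever it is non-empty.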

Chaos Engineering

# Chaos engineering for resilience testing
import boto3
import random
import time
from typing import Any, Callable, Dict, List  # Dict and Any are used by run_experiment
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChaosExperiment:
    name: str
    description: str
    target: str
    action: Callable
    rollback: Callable
    duration: int  # seconds

class ChaosEngineer:
    """Chaos engineering automation"""

    def __init__(self):
        self.ec2 = boto3.client('ec2')
        self.rds = boto3.client('rds')
        self.experiments: List[ChaosExperiment] = []

    def add_experiment(self, experiment: ChaosExperiment):
        """Add chaos experiment"""
        self.experiments.append(experiment)

    def run_experiment(self, experiment: ChaosExperiment) -> Dict[str, Any]:
        """Run chaos experiment"""
        start_time = datetime.utcnow()

        try:
            # Execute chaos action
            experiment.action()

            # Monitor for duration
            import time
            time.sleep(experiment.duration)

            # Check if system is healthy
            healthy = self._check_system_health()

            # Rollback
            experiment.rollback()

            end_time = datetime.utcnow()
            duration = (end_time - start_time).total_seconds()

            return {
                'experiment': experiment.name,
                'status': 'passed' if healthy else 'failed',
                'duration': duration,
                'healthy': healthy
            }

        except Exception as e:
            # Rollback on error
            experiment.rollback()
            return {
                'experiment': experiment.name,
                'status': 'error',
                'error': str(e)
            }

    def _check_system_health(self) -> bool:
        """Check if system is healthy after chaos"""
        # In production, check health endpoints
        return True

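    def terminate_random_instance(self):
        """Terminate a random running EC2 instance"""
        # Caution: the filter below matches every running instance visible to
        # this client. Narrow it (for example by tag) before running this
        # outside a sandbox account.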
        instances = self.ec2.describe_instances(
            Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
        )

        all_instances = []
        for reservation in instances['Reservations']:
            for instance in reservation['Instances']:
                all_instances.append(instance['InstanceId'])

        if all_instances:
            target = random.choice(all_instances)
            self.ec2.terminate_instances(InstanceIds=[target])
            print(f"Terminated instance: {target}")

    def restore_instances(self):
        """Restore terminated instances"""
        # In production, use Auto Scaling or launch templates
        pass
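
In run_experiment above, an exception raised by rollback() itself is caught by the except branch, which then calls rollback() a second time and may raise again. A sketch of an alternative that attempts the rollback exactly once and records whether it succeeded; run_with_guaranteed_rollback is an illustrative name and the callables are supplied by the caller.

```python
from typing import Any, Callable, Dict


def run_with_guaranteed_rollback(
    name: str,
    action: Callable[[], None],
    rollback: Callable[[], None],
    health_check: Callable[[], bool],
) -> Dict[str, Any]:
    """Run a chaos action, check health, and always attempt the rollback once."""
    result: Dict[str, Any] = {"experiment": name}
    try:
        action()
        healthy = health_check()
        result["status"] = "passed" if healthy else "failed"
        result["healthy"] = healthy
    except Exception as exc:
        # The action or the health check raised
        result["status"] = "error"
        result["error"] = str(exc)
    finally:
        # Runs on every path, and a rollback failure is recorded, not raised
        try:
            rollback()
            result["rolled_back"] = True
        except Exception as exc:
            result["rolled_back"] = False
            result["rollback_error"] = str(exc)
    return result


if __name__ == "__main__":
    # Stub callables stand in for real fault injection and recovery
    outcome = run_with_guaranteed_rollback(
        "noop-experiment",
        action=lambda: None,
        rollback=lambda: None,
        health_check=lambda: True,
    )
    print(outcome)
```

A result with rolled_back set to False is the case that needs a human: the fault is still injected and the system may not be in its original state.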

✅HA/DR Benefits

High availability and disaster recovery ensure business continuity. Regular testing is essential to validate RPO/RTO targets and identify weaknesses.

Summary

Strategy	RPO	RTO	Cost	Complexity
Backup/Restore	Hours	Hours	Low	Low
Pilot Light	Minutes	Tens of minutes	Medium	Medium
Warm Standby	Seconds	Minutes	High	High
Multi-Site Active-Active	Near zero	Near zero	Very High	Very High