Multi-Cloud Data Strategy: Architecture & Governance

Difficulty: Staff Level | Companies: Netflix, Uber, Apple, Microsoft, Google

1. Why Multi-Cloud?

Data Sovereignty: Regulatory requirements (GDPR, data residency laws) dictate where data may be stored
Best-of-Breed: Use the strongest service from each cloud
Risk Mitigation: Avoid single-vendor lock-in
Cost Optimization: Compare pricing across clouds and place workloads accordingly
M&A: Acquired companies often run on a different cloud

2. Cross-Cloud Replication

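# S3Client, GCSClient, and ADLSClient are placeholder storage wrappers (not defined here);
# each is assumed to expose read(path) and write(path, data).
class CrossCloudReplicator:
    def __init__(self):
        self.replication_rules = []

    def add_rule(self, source_cloud, source_path, target_cloud, target_path, frequency="hourly"):
        # Register a one-way copy. frequency is stored but not acted on by this class.
        self.replication_rules.append({
            "source": {"cloud": source_cloud, "path": source_path},
            "target": {"cloud": target_cloud, "path": target_path},
            "frequency": frequency,
        })

    def replicate(self, rule):
        # Reads the whole object into memory before writing, so this suits small objects only.
        source_client = self._get_client(rule["source"]["cloud"])
        data = source_client.read(rule["source"]["path"])
        target_client = self._get_client(rule["target"]["cloud"])
        target_client.write(rule["target"]["path"], data)

    def _get_client(self, cloud):
        # Fail loudly on an unknown provider instead of silently returning None.
        clients = {"aws": S3Client, "gcp": GCSClient, "azure": ADLSClient}
        if cloud not in clients:
            raise ValueError(f"Unsupported cloud: {cloud!r}")
        return clients[cloud]()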
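
The replicator above assumes each cloud client exposes read and write, and it trusts that the copy arrived intact. A minimal sketch of one way to verify a copy is shown here, assuming a hypothetical ObjectStore interface and an in-memory stand-in (InMemoryStore) so the example runs without cloud credentials. The names replicate_verified, ObjectStore, and InMemoryStore are illustrative and not part of any SDK.

```python
import hashlib
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface each cloud client is assumed to satisfy."""

    def read(self, path: str) -> bytes: ...
    def write(self, path: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for a real client (S3, GCS, ADLS) so the sketch runs anywhere."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._objects[path]

    def write(self, path: str, data: bytes) -> None:
        self._objects[path] = data


def replicate_verified(source: ObjectStore, source_path: str,
                       target: ObjectStore, target_path: str) -> str:
    """Copy one object, then read it back and compare SHA-256 digests."""
    data = source.read(source_path)
    expected = hashlib.sha256(data).hexdigest()
    target.write(target_path, data)
    actual = hashlib.sha256(target.read(target_path)).hexdigest()
    if actual != expected:
        raise IOError(f"Checksum mismatch replicating to {target_path}")
    return expected


# Usage: two in-memory stores play the roles of the source and target clouds.
aws, gcp = InMemoryStore(), InMemoryStore()
aws.write("events/2024.parquet", b"example-bytes")
digest = replicate_verified(aws, "events/2024.parquet", gcp, "events/2024.parquet")
```

Reading the object back doubles the transfer, so for large objects a provider-reported checksum would likely be a better fit than this read-back comparison.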

3. Data Sovereignty

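class DataSovereigntyManager:
    # Illustrative internal policy, not legal guidance: the regulations govern
    # transfers and handling rather than naming cloud regions.
    # Region identifiers use AWS naming; the "encryption" flag is recorded but not checked below.
    COMPLIANCE_RULES = {
        "GDPR": {"regions": ["eu-west-1", "eu-central-1"], "encryption": True},
        "CCPA": {"regions": ["us-east-1", "us-west-2"], "encryption": True},
        "PDPA": {"regions": ["ap-southeast-1"], "encryption": True},
    }

    def validate_placement(self, classification, region):
        # Unknown classifications have no allowed regions, so placement is denied by default.
        rules = self.COMPLIANCE_RULES.get(classification, {})
        return region in rules.get("regions", [])

    def get_compliant_regions(self, classification):
        # Returns an empty list for an unknown classification.
        return self.COMPLIANCE_RULES.get(classification, {}).get("regions", [])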
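
The rule table above lists AWS-style region names only, while a multi-cloud deployment also has to check GCP and Azure regions. One possible way to extend it is to validate against a residency zone rather than a single provider's region list. The sketch below assumes a hypothetical ZONE_REGIONS mapping and CLASSIFICATION_ZONE policy table: the region identifiers are real provider region names, but the grouping and the policy are assumptions made for illustration.

```python
# Hypothetical mapping from a residency zone to each provider's region names.
# The region identifiers are real, but grouping them this way is an assumption.
ZONE_REGIONS = {
    "eu": {
        "aws": ["eu-west-1", "eu-central-1"],
        "gcp": ["europe-west1", "europe-west3"],
        "azure": ["westeurope", "northeurope"],
    },
    "us": {
        "aws": ["us-east-1", "us-west-2"],
        "gcp": ["us-east1", "us-west1"],
        "azure": ["eastus", "westus2"],
    },
}

# Assumed policy: which zone each data classification must stay in.
CLASSIFICATION_ZONE = {"GDPR": "eu", "CCPA": "us"}


def is_compliant(classification: str, cloud: str, region: str) -> bool:
    """Return True only if the region lies in the zone allowed for this classification."""
    zone = CLASSIFICATION_ZONE.get(classification)
    if zone is None:
        return False  # unknown classification: deny by default
    return region in ZONE_REGIONS[zone].get(cloud, [])


# Usage: the same check works for any of the three providers.
print(is_compliant("GDPR", "gcp", "europe-west1"))
print(is_compliant("GDPR", "aws", "us-east-1"))
```

Keeping the policy (classification to zone) separate from the inventory (zone to regions) means adding a provider or a region does not touch the compliance rules themselves.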

4. Vendor Lock-in Avoidance

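Use open formats and protocols: Parquet and Avro (file formats), Iceberg (table format), Arrow (in-memory format), and Kafka (streaming protocol). All are supported on every major cloud, so data and pipelines stay portable even though managed services and tuning differ between providers.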

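# Cloud-agnostic pipeline (assumes an active SparkSession named spark with Iceberg configured)
from pyspark.sql import functions as F

# Only the URI scheme changes per cloud: s3:// (AWS), gs:// (GCS), abfss:// (ADLS)
df = spark.read.format("parquet").load("s3://bucket/data")
result = df.filter(F.col("status") == "active").groupBy("category").count()
result.write.format("iceberg").mode("overwrite").save("s3://lakehouse/output")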
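
The storage path is the one cloud-specific piece of the pipeline above, so it helps to build it in a single place. The following is a minimal sketch, assuming a hypothetical helper named storage_uri. The URI schemes are the standard ones for S3, GCS, and ADLS Gen2, though some Spark distributions expect s3a:// rather than s3:// for AWS.

```python
def storage_uri(cloud: str, container: str, path: str, account: str = "") -> str:
    """Build an object-store URI for the given provider (hypothetical helper)."""
    path = path.lstrip("/")
    if cloud == "aws":
        return f"s3://{container}/{path}"
    if cloud == "gcp":
        return f"gs://{container}/{path}"
    if cloud == "azure":
        # ADLS Gen2 addresses a container within a named storage account.
        if not account:
            raise ValueError("ADLS Gen2 URIs need a storage account name")
        return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
    raise ValueError(f"Unsupported cloud: {cloud!r}")


# Usage: the same logical dataset addressed on each provider.
for cloud in ("aws", "gcp"):
    print(storage_uri(cloud, "bucket", "data"))
print(storage_uri("azure", "bucket", "data", account="myaccount"))
```

With a helper like this, the pipeline code can take the target cloud from configuration and stay otherwise unchanged.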

5. Cost Comparison Framework

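class MultiCloudCostOptimizer:
    # Illustrative rates in USD; real prices vary by region, tier, and commitment.
    def compare_storage(self, size_gb):
        # Monthly cost of standard object storage at a flat per-GB-month rate.
        return {
            "aws_s3": size_gb * 0.023,
            "gcp_gcs": size_gb * 0.020,
            "azure_blob": size_gb * 0.018,
        }

    def compare_compute(self, hours, spec="4vCPU/16GB"):
        # spec only documents the instance size these hourly rates assume;
        # it is not used in the calculation.
        return {
            "aws_ec2": hours * 0.50,
            "gcp_compute": hours * 0.45,
            "azure_vm": hours * 0.48,
            "aws_spot": hours * 0.15,
            "gcp_preemptible": hours * 0.13,
        }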
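
A lower storage rate does not by itself justify moving data, because leaving a cloud usually incurs a one-time egress charge. The sketch below estimates how many months of savings repay that charge. The function name breakeven_months is hypothetical, the two storage rates are the ones from the comparison above, and the egress rate is a placeholder assumption rather than a quoted price.

```python
def breakeven_months(size_gb: float, current_rate: float, target_rate: float,
                     egress_per_gb: float) -> float:
    """Months of storage savings needed to repay a one-time egress charge."""
    monthly_saving = size_gb * (current_rate - target_rate)
    if monthly_saving <= 0:
        return float("inf")  # the move never pays for itself
    return (size_gb * egress_per_gb) / monthly_saving


# 0.023 and 0.018 are the per-GB-month storage rates from the comparison above;
# 0.09 USD/GB egress is a placeholder assumption, not a quoted price.
months = breakeven_months(1000, current_rate=0.023, target_rate=0.018, egress_per_gb=0.09)
print(round(months, 1))  # prints 18.0
```

Under these assumed numbers the move takes about a year and a half to pay off, which is why egress belongs in any cross-cloud cost comparison.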

Best Practice: Start multi-cloud with a thin abstraction layer (Terraform + open formats). Don't build a custom multi-cloud platform unless you have 1000+ engineers.

Follow-Up Questions

How would you design a multi-cloud data lake for a global company?
Compare cross-cloud networking options (VPN, dedicated interconnects such as AWS Direct Connect, private endpoints such as PrivateLink).
How do you handle data residency requirements across 50 countries?
Design a unified governance layer across AWS, GCP, and Azure.
When does multi-cloud make sense vs. staying with a single provider?