Multi-Cloud Data Strategy: Architecture & Governance
Difficulty: Staff Level | Companies: Netflix, Uber, Apple, Microsoft, Google
1. Why Multi-Cloud?
- Data Sovereignty: Regulatory requirements (GDPR, data residency)
- Best-of-Breed: Use the best service from each cloud
- Risk Mitigation: Avoid single vendor lock-in
- Cost Optimization: Compare pricing across clouds
- M&A: Different companies on different clouds
2. Cross-Cloud Replication
class CrossCloudReplicator:
def __init__(self):
self.replication_rules = []
def add_rule(self, source_cloud, source_path, target_cloud, target_path, frequency="hourly"):
self.replication_rules.append({
"source": {"cloud": source_cloud, "path": source_path},
"target": {"cloud": target_cloud, "path": target_path},
"frequency": frequency,
})
def replicate(self, rule):
source_client = self._get_client(rule["source"]["cloud"])
data = source_client.read(rule["source"]["path"])
target_client = self._get_client(rule["target"]["cloud"])
target_client.write(rule["target"]["path"], data)
def _get_client(self, cloud):
if cloud == "aws": return S3Client()
elif cloud == "gcp": return GCSClient()
elif cloud == "azure": return ADLSClient()
3. Data Sovereignty
class DataSovereigntyManager:
COMPLIANCE_RULES = {
"GDPR": {"regions": ["eu-west-1", "eu-central-1"], "encryption": True},
"CCPA": {"regions": ["us-east-1", "us-west-2"], "encryption": True},
"PDPA": {"regions": ["ap-southeast-1"], "encryption": True},
}
def validate_placement(self, classification, region):
rules = self.COMPLIANCE_RULES.get(classification, {})
return region in rules.get("regions", [])
def get_compliant_regions(self, classification):
return self.COMPLIANCE_RULES.get(classification, {}).get("regions", [])
4. Vendor Lock-in Avoidance
Use open formats: Parquet, Avro, Iceberg, Kafka, Arrow. These work identically across all cloud providers.
# Cloud-agnostic pipeline
df = spark.read.format("parquet").load("s3://bucket/data") # Works on AWS, GCS, ADLS
result = df.filter(F.col("status") == "active").groupBy("category").count()
result.write.format("iceberg").mode("overwrite").save("s3://lakehouse/output")
5. Cost Comparison Framework
class MultiCloudCostOptimizer:
def compare_storage(self, size_gb):
return {
"aws_s3": size_gb * 0.023,
"gcp_gcs": size_gb * 0.020,
"azure_blob": size_gb * 0.018,
}
def compare_compute(self, hours, spec="4vCPU/16GB"):
return {
"aws_ec2": hours * 0.50,
"gcp_compute": hours * 0.45,
"azure_vm": hours * 0.48,
"aws_spot": hours * 0.15,
"gcp_preemptible": hours * 0.13,
}
βΉοΈ
Best Practice: Start multi-cloud with a thin abstraction layer (Terraform + open formats). Don't build a custom multi-cloud platform unless you have 1000+ engineers.
Follow-Up Questions
- How would you design a multi-cloud data lake for a global company?
- Compare cross-cloud networking options (VPN, Direct Connect, Private Link).
- How do you handle data residency requirements across 50 countries?
- Design a unified governance layer across AWS, GCP, and Azure.
- When does multi-cloud make sense vs. staying with a single provider?