Netflix & Uber Interview

Cloud Data Platform Cost Optimization Strategies

Reducing cloud costs while maintaining performance

Interview Question

"Your company spends $500K/month on cloud data platform costs. Design a cost optimization strategy that: (1) reduces costs by 30%, (2) maintains performance SLAs, (3) provides visibility into spending, (4) automates cost controls. Include specific tools and techniques for AWS, GCP, and Azure."

Difficulty: Hard | Frequently asked at Netflix, Uber, Amazon, Google

Theoretical Foundation

Cloud Cost Components

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│              Cloud Data Platform Costs                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Compute (40-60%):                                          │
│  - Virtual warehouses (Snowflake, Redshift)                │
│  - EMR/Dataproc clusters                                   │
│  - Kubernetes pods                                          │
│  - Serverless functions                                    │
│                                                             │
│  Storage (20-30%):                                          │
│  - S3/GCS/Blob storage                                     │
│  - Database storage                                         │
│  - Backup and snapshots                                     │
│                                                             │
│  Data Transfer (10-20%):                                    │
│  - Cross-region transfer                                    │
│  - Internet egress                                          │
│  - API calls                                                │
│                                                             │
│  Other (5-10%):                                             │
│  - Monitoring and logging                                   │
│  - Security and compliance                                  │
│  - Support                                                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
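
To make the 30% target concrete, the share ranges in the diagram can be applied to the $500K/month bill from the interview question. This is a minimal sketch: the constant and function names are illustrative, and the only inputs are the figures already quoted above.

```python
# Figures from the text: $500K/month bill, 30% reduction target,
# and the share ranges listed in the diagram (in percent).
MONTHLY_BILL_USD = 500_000
TARGET_REDUCTION_PCT = 30

SHARE_RANGES_PCT = {
    "compute": (40, 60),
    "storage": (20, 30),
    "data_transfer": (10, 20),
    "other": (5, 10),
}


def spend_ranges(bill_usd, share_ranges_pct):
    """Return {category: (low_usd, high_usd)} using integer arithmetic."""
    return {
        name: (bill_usd * low // 100, bill_usd * high // 100)
        for name, (low, high) in share_ranges_pct.items()
    }


def required_cut_pct(target_savings_usd, category_spend_usd):
    """Percent one category must shrink to deliver the whole target alone."""
    return 100 * target_savings_usd / category_spend_usd


ranges = spend_ranges(MONTHLY_BILL_USD, SHARE_RANGES_PCT)
target_savings = MONTHLY_BILL_USD * TARGET_REDUCTION_PCT // 100

for name, (low, high) in ranges.items():
    print(f"{name:14s} ${low:>9,} to ${high:>9,}")
print(f"target savings: ${target_savings:,}/month")
```

Under these assumptions compute alone would have to shrink by 50% to 75% to deliver the full $150K, which is why a realistic plan spreads the savings across compute, storage, and data transfer.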

Cost Optimization Strategies

1. Right-Sizing

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    Right-Sizing                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Problem: Over-provisioned resources                        │
│                                                             │
│  Current:                                                   │
│  - 10 x r5.4xlarge (16 vCPUs, 128 GiB each)                 │
│  - ~$1.008/hour each (on-demand) = ~$7,358/month            │
│  - Average utilization: 30%                                 │
│                                                             │
│  Optimized:                                                 │
│  - 10 x r5.2xlarge (8 vCPUs, 64 GiB each)                   │
│  - ~$0.504/hour each (on-demand) = ~$3,679/month            │
│  - Average utilization: ~60%                                │
│  - Savings: ~$3,679/month (50%)                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
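
Before resizing, it helps to estimate where utilization will land. The sketch below assumes the workload stays constant, so utilization scales inversely with capacity. The function names and the 70% safety ceiling are illustrative assumptions, not figures from the diagram.

```python
def projected_utilization(current_util_pct, current_capacity, new_capacity):
    """Estimate utilization after a resize, assuming an unchanged workload."""
    if new_capacity <= 0:
        raise ValueError("new_capacity must be positive")
    return current_util_pct * current_capacity / new_capacity


def is_safe_resize(current_util_pct, current_capacity, new_capacity, ceiling_pct=70):
    """True if the projected utilization stays at or below the chosen ceiling."""
    projected = projected_utilization(current_util_pct, current_capacity, new_capacity)
    return projected <= ceiling_pct


# Capacity is in arbitrary units (for example total vCPUs across the fleet).
half = projected_utilization(30, 100, 50)      # halving capacity
quarter = projected_utilization(30, 100, 25)   # quartering capacity
print(half, quarter)
print(is_safe_resize(30, 100, 50), is_safe_resize(30, 100, 25))
```

Halving capacity at 30% utilization lands at 60%, while quartering it would require 120% and therefore breaks the SLA. Peak utilization, not just the average, should be checked the same way.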

2. Reserved Instances / Savings Plans

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│              Reserved Instances                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  On-Demand: $0.50/hour                                      │
│  1-Year Reserved: $0.35/hour (30% savings)                 │
│  3-Year Reserved: $0.25/hour (50% savings)                 │
│                                                             │
│  Example:                                                   │
│  - 10 instances running 24/7                                │
│  - On-Demand: 10 × $0.50 × 730 = $3,650/month             │
│  - 3-Year Reserved: 10 × $0.25 × 730 = $1,825/month       │
│  - Savings: $1,825/month (50%)                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
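
A reservation is billed for every hour whether or not the instance runs, so it only pays off above a break-even utilization. The sketch below uses the example rates from the diagram. The function names are illustrative, and upfront-payment options are ignored.

```python
HOURS_PER_MONTH = 730


def break_even_utilization(on_demand_rate, reserved_rate):
    """Fraction of the month an instance must run for the reservation to win."""
    return reserved_rate / on_demand_rate


def cheaper_option(on_demand_rate, reserved_rate, hours_used):
    """Compare paying on-demand for hours_used against a full-month reservation."""
    on_demand_cost = on_demand_rate * hours_used
    reserved_cost = reserved_rate * HOURS_PER_MONTH
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"


print(break_even_utilization(0.50, 0.35))   # 1-year example rate
print(break_even_utilization(0.50, 0.25))   # 3-year example rate
print(cheaper_option(0.50, 0.25, hours_used=730))
print(cheaper_option(0.50, 0.25, hours_used=200))
```

With these rates the 3-year commitment needs the instance running at least half the time, and the 1-year commitment about 70% of the time. Workloads below that threshold are better left on-demand or moved to spot.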

3. Spot Instances / Preemptible VMs

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    Spot Instances                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  On-Demand: $0.50/hour                                      │
│  Spot: $0.15/hour (70% savings)                            │
│                                                             │
│  Use Cases:                                                 │
│  - Batch processing                                         │
│  - Data transformations                                     │
│  - CI/CD pipelines                                          │
│  - Development/testing                                      │
│                                                             │
│  Limitations:                                               │
│  - Can be reclaimed at short notice                         │
│    (2 minutes on AWS, 30 seconds on GCP and Azure)          │
│  - Not suitable for stateful workloads                      │
│  - May have availability constraints                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
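
Because spot capacity can be reclaimed, fleets usually keep a share of nodes on-demand. The sketch below estimates the blended rate for such a mix using the example prices from the diagram. The 70% spot share and the function names are assumptions chosen for illustration.

```python
def blended_hourly_rate(on_demand_rate, spot_rate, spot_fraction):
    """Average hourly rate when spot_fraction of the fleet runs on spot."""
    if not 0 <= spot_fraction <= 1:
        raise ValueError("spot_fraction must be between 0 and 1")
    return on_demand_rate * (1 - spot_fraction) + spot_rate * spot_fraction


def savings_pct(on_demand_rate, blended_rate):
    """Percentage saved relative to running everything on-demand."""
    return 100 * (1 - blended_rate / on_demand_rate)


rate = blended_hourly_rate(0.50, 0.15, spot_fraction=0.7)
print(f"blended rate: ${rate:.3f}/hour")
print(f"savings vs on-demand: {savings_pct(0.50, rate):.0f}%")
```

The estimate ignores work that has to be redone after an interruption, so real savings will be somewhat lower for jobs without checkpointing.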

4. Storage Optimization

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│              Storage Optimization                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Tiered Storage:                                            │
│  - Hot: S3 Standard ($0.023/GB/month)                       │
│  - Warm: S3 Infrequent Access ($0.0125/GB/month)            │
│  - Cold: S3 Glacier ($0.004/GB/month)                       │
│  - Archive: S3 Glacier Deep Archive ($0.00099/GB/month)     │
│                                                             │
│  Lifecycle Policies:                                        │
│  - Move to IA after 30 days                                │
│  - Move to Glacier after 90 days                           │
│  - Delete after 365 days                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
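
The effect of the lifecycle policy can be estimated by pricing each object at the tier its age implies. This is a rough sketch using the per-GB prices and day thresholds from the diagram. It deliberately ignores retrieval fees, transition request charges, and minimum storage durations, and the names are illustrative.

```python
PRICE_PER_GB_MONTH = {"standard": 0.023, "ia": 0.0125, "glacier": 0.004}


def tier_for_age(age_days):
    """Map object age to a tier under the 30/90/365-day policy above."""
    if age_days >= 365:
        return None  # Deleted by the policy, so it costs nothing
    if age_days >= 90:
        return "glacier"
    if age_days >= 30:
        return "ia"
    return "standard"


def monthly_storage_cost(objects):
    """objects: iterable of (size_gb, age_days). Returns USD per month."""
    total = 0.0
    for size_gb, age_days in objects:
        tier = tier_for_age(age_days)
        if tier is not None:
            total += size_gb * PRICE_PER_GB_MONTH[tier]
    return total


# Hypothetical data set: 1 TB in each age bracket
data = [(1000, 10), (1000, 45), (1000, 200), (1000, 400)]
with_policy = monthly_storage_cost(data)
all_standard = sum(size for size, _ in data) * PRICE_PER_GB_MONTH["standard"]
print(f"with lifecycle policy: ${with_policy:.2f}/month")
print(f"all in S3 Standard:    ${all_standard:.2f}/month")
```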

5. Serverless Optimization

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│              Serverless Optimization                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Traditional:                                               │
│  - Always-on cluster: $2,000/month                         │
│  - Utilization: 40%                                         │
│                                                             │
│  Serverless:                                                │
│  - Pay per query: $0.01/GB scanned                         │
│  - 1000 queries/day × 10GB × 30 days = $3,000/month       │
│                                                             │
│  When to use serverless:                                    │
│  - Variable workload                                        │
│  - Ad-hoc queries                                           │
│  - Development/testing                                      │
│                                                             │
│  When to use provisioned:                                   │
│  - Predictable workload                                     │
│  - High-throughput requirements                             │
│  - Latency-sensitive applications                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
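
The choice between serverless and provisioned comes down to a break-even scan volume. The sketch below uses the example prices from the diagram ($0.01/GB scanned, $2,000/month cluster). The function names are illustrative, and real per-GB pricing varies by service.

```python
def serverless_monthly_cost(queries_per_day, gb_per_query, price_per_gb=0.01, days=30):
    """Pay-per-scan cost for a steady daily query load."""
    return queries_per_day * gb_per_query * days * price_per_gb


def break_even_gb_per_month(cluster_monthly_cost, price_per_gb=0.01):
    """Monthly scan volume at which serverless costs the same as the cluster."""
    return cluster_monthly_cost / price_per_gb


def cheaper_option(cluster_monthly_cost, gb_scanned_per_month, price_per_gb=0.01):
    serverless_cost = gb_scanned_per_month * price_per_gb
    return "serverless" if serverless_cost < cluster_monthly_cost else "provisioned"


workload_cost = serverless_monthly_cost(1000, 10)
threshold = break_even_gb_per_month(2000)
print(f"serverless cost for the example workload: ${workload_cost:,.0f}/month")
print(f"break-even volume: {threshold:,.0f} GB/month")
print(cheaper_option(2000, 300_000), cheaper_option(2000, 50_000))
```

The example workload scans 300,000 GB per month, above the 200,000 GB break-even, so the always-on cluster is cheaper there even at 40% utilization. Serverless wins for the lighter, ad-hoc workloads listed in the diagram.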

Cost Allocation and Tagging

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│              Cost Allocation                                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Tags:                                                      │
│  - environment: production, staging, development           │
│  - team: data-engineering, data-science, analytics         │
│  - project: etl-pipeline, ml-training, reporting           │
│  - cost-center: engineering, research, operations          │
│                                                             │
│  Cost Allocation:                                           │
│  - Track costs by team/project                              │
│  - Chargeback to business units                             │
│  - Budget alerts and thresholds                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
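
Cost allocation only works if resources actually carry the tags. A simple compliance check can flag resources with missing tags before they reach the bill. The sketch below assumes resources have already been exported as a dictionary of tag maps. The sample resource IDs are made up, and the required keys are the four from the diagram.

```python
REQUIRED_TAGS = ("environment", "team", "project", "cost-center")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return required tag keys that are absent or empty on one resource."""
    return [key for key in required if not resource_tags.get(key)]


def untagged_report(resources):
    """resources: {resource_id: {tag_key: tag_value}} -> {resource_id: [missing keys]}"""
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report


# Hypothetical inventory
inventory = {
    "i-aaa": {"environment": "production", "team": "data-engineering",
              "project": "etl-pipeline", "cost-center": "engineering"},
    "i-bbb": {"environment": "staging", "team": ""},
    "bucket-raw": {"team": "analytics", "project": "reporting"},
}
report = untagged_report(inventory)
for resource_id, gaps in report.items():
    print(resource_id, "is missing", ", ".join(gaps))
```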

Code Implementation

Cost Monitoring Dashboard

import boto3
import pandas as pd
from datetime import datetime, timedelta
import json

class CostMonitor:
    """Monitor and optimize cloud costs"""
    
    def __init__(self, aws_access_key, aws_secret_key, region='us-east-1'):
        self.ce_client = boto3.client(
            'ce',
            aws_access_key_id=aws_access_key,
            aws_secret_access_key=aws_secret_key,
            region_name=region
        )
        
        self.cloudwatch = boto3.client(
            'cloudwatch',
            aws_access_key_id=aws_access_key,
            aws_secret_access_key=aws_secret_key,
            region_name=region
        )
    
    def get_cost_summary(self, start_date, end_date):
        """Get cost summary for date range"""
        
        response = self.ce_client.get_cost_and_usage(
            TimePeriod={
                'Start': start_date,
                'End': end_date
            },
            Granularity='MONTHLY',
            Metrics=['UnblendedCost', 'UsageQuantity'],
            GroupBy=[
                {'Type': 'DIMENSION', 'Key': 'SERVICE'},
                {'Type': 'TAG', 'Key': 'team'}
            ]
        )
        
        return response['ResultsByTime']
    
    def get_cost_by_service(self, days=30):
        """Get cost breakdown by service"""
        
        end_date = datetime.now().strftime('%Y-%m-%d')
        start_date = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')
        
        response = self.ce_client.get_cost_and_usage(
            TimePeriod={
                'Start': start_date,
                'End': end_date
            },
            Granularity='MONTHLY',
            Metrics=['UnblendedCost'],
            GroupBy=[
                {'Type': 'DIMENSION', 'Key': 'SERVICE'}
            ]
        )
        
        costs = {}
        for result in response['ResultsByTime']:
            for group in result['Groups']:
                service = group['Keys'][0]
                cost = float(group['Metrics']['UnblendedCost']['Amount'])
                costs[service] = costs.get(service, 0) + cost
        
        return sorted(costs.items(), key=lambda x: x[1], reverse=True)
    
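    def get_cost_trends(self, months=6):
        """Get total cost for each of the last `months` complete calendar months"""
        
        # Cost Explorer treats End as exclusive, so ending on the first day
        # of the current month covers complete months only
        end = datetime.now().replace(day=1)
        start = end
        for _ in range(months):
            start = (start - timedelta(days=1)).replace(day=1)
        
        # One ungrouped call returns a single total per calendar month
        response = self.ce_client.get_cost_and_usage(
            TimePeriod={
                'Start': start.strftime('%Y-%m-%d'),
                'End': end.strftime('%Y-%m-%d')
            },
            Granularity='MONTHLY',
            Metrics=['UnblendedCost']
        )
        
        trends = []
        for result in response['ResultsByTime']:
            trends.append({
                'month': result['TimePeriod']['Start'][:7],
                'cost': float(result['Total']['UnblendedCost']['Amount'])
            })
        
        return trends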
    
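    def identify_waste(self):
        """Identify cost waste and optimization opportunities"""
        
        waste = []
        end_time = datetime.now()
        start_time = end_time - timedelta(hours=24)
        
        # CloudWatch datapoints do not carry an instance ID, so each instance
        # is queried separately. list_metrics returns one entry per instance
        # that has recently reported CPUUtilization.
        paginator = self.cloudwatch.get_paginator('list_metrics')
        pages = paginator.paginate(Namespace='AWS/EC2', MetricName='CPUUtilization')
        
        for page in pages:
            for metric in page['Metrics']:
                dimensions = metric['Dimensions']
                if len(dimensions) != 1 or dimensions[0]['Name'] != 'InstanceId':
                    continue  # Skip aggregated metrics (e.g. per Auto Scaling group)
                
                response = self.cloudwatch.get_metric_statistics(
                    Namespace='AWS/EC2',
                    MetricName='CPUUtilization',
                    Dimensions=dimensions,
                    StartTime=start_time,
                    EndTime=end_time,
                    Period=3600,
                    Statistics=['Average']
                )
                
                datapoints = response['Datapoints']
                if not datapoints:
                    continue
                
                avg_cpu = sum(d['Average'] for d in datapoints) / len(datapoints)
                if avg_cpu < 10:  # Less than 10% CPU over the last 24 hours
                    waste.append({
                        'type': 'idle_instance',
                        'instance_id': dimensions[0]['Value'],
                        'utilization': avg_cpu,
                        'estimated_savings': 100  # Example
                    })
        
        return waste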

AWS Cost Optimization

# ============================================================
# AWS COST OPTIMIZATION
# ============================================================

import boto3

class AWSCostOptimizer:
    """AWS cost optimization strategies"""
    
    def __init__(self):
        self.ec2 = boto3.client('ec2')
        self.s3 = boto3.client('s3')
        self.rds = boto3.client('rds')
    
    def rightsize_instances(self):
        """Rightsize EC2 instances"""
        
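        # List running instances only; paginate so large fleets are fully covered
        paginator = self.ec2.get_paginator('describe_instances')
        pages = paginator.paginate(
            Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
        )
        reservations = [r for page in pages for r in page['Reservations']]
        
        recommendations = []
        for reservation in reservations: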
            for instance in reservation['Instances']:
                instance_id = instance['InstanceId']
                instance_type = instance['InstanceType']
                
                # Get CPU utilization
                cpu_util = self.get_cpu_utilization(instance_id)
                
                if cpu_util < 30:  # Under-utilized
                    recommended_type = self.get_recommended_type(instance_type, cpu_util)
                    savings = self.calculate_savings(instance_type, recommended_type)
                    
                    recommendations.append({
                        'instance_id': instance_id,
                        'current_type': instance_type,
                        'recommended_type': recommended_type,
                        'cpu_utilization': cpu_util,
                        'monthly_savings': savings
                    })
        
        return recommendations
    
    def optimize_s3(self):
        """Optimize S3 storage costs"""
        
        # Get S3 bucket inventory
        buckets = self.s3.list_buckets()['Buckets']
        
        recommendations = []
        for bucket in buckets:
            bucket_name = bucket['Name']
            
            # Analyze access patterns
            lifecycle_rules = self.analyze_lifecycle(bucket_name)
            
            if lifecycle_rules:
                recommendations.append({
                    'bucket': bucket_name,
                    'lifecycle_rules': lifecycle_rules,
                    'estimated_savings': self.estimate_s3_savings(bucket_name)
                })
        
        return recommendations
    
    def create_savings_plans(self):
        """Create savings plans for predictable workloads"""
        
        # Analyze usage patterns
        usage_patterns = self.analyze_usage_patterns()
        
        recommendations = []
        for pattern in usage_patterns:
            if pattern['utilization'] > 70 and pattern['commitment'] > 12:
                recommendations.append({
                    'instance_family': pattern['family'],
                    'recommendation': 'Compute Savings Plan',
                    'savings': pattern['potential_savings']
                })
        
        return recommendations
    
    def get_cpu_utilization(self, instance_id):
        """Get CPU utilization for instance"""
        # Simplified - in production, use CloudWatch
        return 25.0  # Example
    
    def get_recommended_type(self, current_type, utilization):
        """Get recommended instance type based on utilization"""
        # Simplified recommendation logic
        if utilization < 20:
            return 't3.medium'
        elif utilization < 40:
            return 't3.large'
        else:
            return current_type
    
    def calculate_savings(self, current_type, recommended_type):
        """Calculate monthly savings"""
        # Simplified pricing
        prices = {
            'm5.xlarge': 0.192,
            'm5.2xlarge': 0.384,
            't3.medium': 0.0416,
            't3.large': 0.0832
        }
        
        current_cost = prices.get(current_type, 0.192) * 730
        recommended_cost = prices.get(recommended_type, 0.0416) * 730
        
        return current_cost - recommended_cost
    
    def analyze_lifecycle(self, bucket_name):
        """Suggest lifecycle rules for a bucket that has none configured"""
        try:
            self.s3.get_bucket_lifecycle_configuration(Bucket=bucket_name)
            return []  # Rules already exist; nothing to suggest
        except self.s3.exceptions.ClientError as error:
            if error.response['Error']['Code'] != 'NoSuchLifecycleConfiguration':
                raise
        # Example rules only; tune to the bucket's real access patterns
        return [
            {'days': 30, 'storage_class': 'STANDARD_IA'},
            {'days': 90, 'storage_class': 'GLACIER'}
        ]
    
    def estimate_s3_savings(self, bucket_name):
        """Estimate monthly savings for bucket"""
        # Simplified - in production, use bucket size metrics and tier pricing
        return 0.0  # Placeholder
    
    def analyze_usage_patterns(self):
        """Get usage patterns per instance family"""
        # Simplified - in production, derive from Cost Explorer usage data.
        # Each item needs: family, utilization, commitment, potential_savings
        return []  # Placeholder

Snowflake Cost Optimization

-- ============================================================
-- SNOWFLAKE COST OPTIMIZATION
-- ============================================================

-- 1. Monitor credit usage
-- The $3 per credit rate is an example; substitute your contract rate
SELECT 
    warehouse_name,
    SUM(credits_used) AS total_credits,
    SUM(credits_used) * 3 AS total_cost_usd
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY warehouse_name
ORDER BY total_credits DESC;

-- 2. Identify expensive queries
-- QUERY_HISTORY has no per-query compute credits, so rank by execution time
SELECT 
    query_id,
    query_text,
    warehouse_name,
    execution_time AS execution_time_ms,
    bytes_scanned / 1024 / 1024 AS mb_scanned
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD(day, -7, CURRENT_DATE())
ORDER BY execution_time DESC
LIMIT 10;

-- 3. Right-size warehouses
-- Consistently low avg_running with no queuing suggests a smaller warehouse
SELECT 
    warehouse_name,
    AVG(avg_running) AS avg_running_queries,
    AVG(avg_queued_load) AS avg_queued_queries
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_LOAD_HISTORY
WHERE start_time >= DATEADD(day, -7, CURRENT_DATE())
GROUP BY warehouse_name;

-- 4. Auto-suspend and auto-resume
ALTER WAREHOUSE analytics_wh 
    SET AUTO_SUSPEND = 60 
    AUTO_RESUME = TRUE;

-- 5. Multi-cluster warehouse for concurrency
ALTER WAREHOUSE analytics_wh 
    SET MIN_CLUSTER_COUNT = 1 
    MAX_CLUSTER_COUNT = 5;

-- 6. Use serverless features (e.g. serverless tasks, Snowpipe) for intermittent jobs
-- Snowflake manages the compute, so no warehouse sits idle between runs

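-- 7. Improve partition pruning with a clustering key
-- Snowflake maintains micro-partitions automatically; there is no manual compaction.
-- order_date is an assumed column name. Automatic clustering itself consumes credits.
ALTER TABLE orders CLUSTER BY (order_date);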

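-- 8. Find tables not altered in 90 days (candidates to review, then drop)
-- last_altered tracks changes, not reads, so confirm a table is unused before dropping it
SELECT table_schema, table_name, last_altered 
FROM INFORMATION_SCHEMA.TABLES 
WHERE table_type = 'BASE TABLE'
  AND last_altered < DATEADD(day, -90, CURRENT_DATE());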

Cost Alerting

# ============================================================
# COST ALERTING
# ============================================================

import boto3

class CostAlerting:
    """Set up cost alerts and budgets"""
    
    def __init__(self):
        # Budgets and Cost Explorer are separate APIs with separate clients
        self.budgets = boto3.client('budgets')
        self.ce_client = boto3.client('ce')
        self.sts = boto3.client('sts')
    
    def create_budget(self, budget_name, limit, email):
        """Create a monthly cost budget that emails at 80% of the limit"""
        
        # Look up the account ID instead of hard-coding it
        account_id = self.sts.get_caller_identity()['Account']
        
        response = self.budgets.create_budget(
            AccountId=account_id,
            Budget={
                'BudgetName': budget_name,
                'BudgetLimit': {
                    'Amount': str(limit),
                    'Unit': 'USD'
                },
                'TimeUnit': 'MONTHLY',
                'BudgetType': 'COST',
                # Tag filters use the form 'user:<key>$<value>'
                'CostFilters': {
                    'TagKeyValue': ['user:team$data-engineering']
                }
            },
            NotificationsWithSubscribers=[
                {
                    'Notification': {
                        'NotificationType': 'ACTUAL',
                        'ComparisonOperator': 'GREATER_THAN',
                        'Threshold': 80,
                        'ThresholdType': 'PERCENTAGE'
                    },
                    'Subscribers': [
                        {
                            'SubscriptionType': 'EMAIL',
                            'Address': email
                        }
                    ]
                }
            ]
        )
        
        return response
    
    def create_cost_anomaly_detection(self):
        """Create a Cost Anomaly Detection monitor that tracks spend per service"""
        
        response = self.ce_client.create_anomaly_monitor(
            AnomalyMonitor={
                'MonitorName': 'cost-anomaly-detector',
                'MonitorType': 'DIMENSIONAL',
                'MonitorDimension': 'SERVICE'
            }
        )
        
        return response

Infrastructure as Code for Cost Optimization

# ============================================================
# TERRAFORM FOR COST OPTIMIZATION
# ============================================================

# main.tf
"""
# Spot instances for batch processing
resource "aws_instance" "batch" {
  count         = 3
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "c5.xlarge"
  
  instance_market_options {
    market_type = "spot"
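    spot_options {
      # Persistent requests must stop or hibernate on interruption, not terminate
      spot_instance_type             = "persistent"
      instance_interruption_behavior = "stop"
    }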
  }
  
  tags = {
    Name = "batch-worker-${count.index}"
    team = "data-engineering"
  }
}

# Reserved instances for predictable workloads
resource "aws_instance" "production" {
  count         = 2
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "m5.xlarge"
  
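  # Reserved Instance and Savings Plan discounts are applied at billing time
  # to matching usage, so nothing is configured here. prevent_destroy only
  # guards against accidental deletion of long-running instances.
  lifecycle {
    prevent_destroy = true
  }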
  
  tags = {
    Name = "production-worker-${count.index}"
    team = "data-engineering"
  }
}

# S3 lifecycle policy for cost optimization
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  
  rule {
    id     = "move-to-ia"
    status = "Enabled"
    
    # An empty filter applies the rule to every object in the bucket
    filter {}
    
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}
"""

💡

Production Tip: Start with the biggest cost drivers. Typically, compute is 40-60% of costs. Focus on right-sizing, reserved instances, and spot instances for batch workloads. Then optimize storage and data transfer.

Common Follow-Up Questions

Q1: How do you calculate ROI for cost optimization?

def calculate_roi(optimization_cost, monthly_savings, implementation_hours, hourly_rate):
    """Calculate first-year ROI for cost optimization"""
    
    # Total investment = one-off spend (tooling, licenses) + engineering time
    implementation_cost = optimization_cost + implementation_hours * hourly_rate
    if implementation_cost <= 0 or monthly_savings <= 0:
        raise ValueError('implementation cost and monthly savings must be positive')
    
    annual_savings = monthly_savings * 12
    
    roi = (annual_savings - implementation_cost) / implementation_cost * 100
    payback_months = implementation_cost / monthly_savings
    
    return {
        'annual_savings': annual_savings,
        'implementation_cost': implementation_cost,
        'roi_percent': roi,
        'payback_months': payback_months
    }

Q2: How do you handle cost allocation across teams?

-- Use tags for cost allocation
SELECT 
    tag_value as team,
    SUM(unblended_cost) as cost
FROM cost_data
WHERE tag_key = 'team'
GROUP BY tag_value
ORDER BY cost DESC;
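
In practice some spend is never tagged, and a chargeback model has to decide who pays for it. One common convention, assumed here rather than taken from the text, is to split untagged cost in proportion to each team's tagged spend. The function name and sample figures are illustrative.

```python
def chargeback(tagged_costs, untagged_cost):
    """Allocate untagged spend to teams in proportion to their tagged spend.

    tagged_costs: {team: cost_usd}. Returns {team: total_charge_usd}.
    """
    total_tagged = sum(tagged_costs.values())
    if total_tagged <= 0:
        raise ValueError("need positive tagged spend to allocate against")
    return {
        team: cost + untagged_cost * cost / total_tagged
        for team, cost in tagged_costs.items()
    }


# Hypothetical monthly figures in USD
charges = chargeback({"data-engineering": 300, "analytics": 100}, untagged_cost=40)
for team, amount in charges.items():
    print(f"{team}: ${amount:.2f}")
```

Proportional allocation keeps the totals reconciled with the invoice. Reporting the untagged share separately also gives teams an incentive to tag their resources.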

Q3: How do you automate cost optimization?

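# Auto-scaling based on utilization
# get_cluster_utilization, get_cluster_nodes and scale_cluster are placeholders
# for your platform's API (e.g. EMR, Dataproc, or a Kubernetes client)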
def auto_scale_cluster(cluster_id, target_utilization=70):
    """Auto-scale cluster based on utilization"""
    
    current_utilization = get_cluster_utilization(cluster_id)
    current_nodes = get_cluster_nodes(cluster_id)
    
    if current_utilization > target_utilization + 10:
        new_nodes = current_nodes + 1
    elif current_utilization < target_utilization - 10:
        new_nodes = max(1, current_nodes - 1)
    else:
        return
    
    scale_cluster(cluster_id, new_nodes)
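
Stepping one node at a time converges slowly after a large load swing. A proportional variant sizes the cluster by the ratio of observed to target utilization, which is the same rule the Kubernetes Horizontal Pod Autoscaler documents. The sketch below is a pure decision function that keeps the ±10 point dead band from the code above. The name `desired_node_count` and the node limits are assumptions.

```python
import math


def desired_node_count(current_nodes, utilization_pct, target_pct=70,
                       tolerance_pct=10, min_nodes=1, max_nodes=100):
    """Return the node count expected to bring utilization back to target."""
    if abs(utilization_pct - target_pct) <= tolerance_pct:
        return current_nodes  # Inside the dead band: do nothing to avoid flapping
    # Proportional rule: nodes scale with observed / target utilization
    wanted = math.ceil(current_nodes * utilization_pct / target_pct)
    # Clamp to the configured cluster limits
    return max(min_nodes, min(max_nodes, wanted))


print(desired_node_count(10, 90))   # overloaded cluster scales out
print(desired_node_count(10, 35))   # half-idle cluster scales in
print(desired_node_count(10, 75))   # within tolerance, unchanged
```

Keeping the decision separate from the platform call makes the policy easy to unit test. The caller then passes the result to whatever scaling API the cluster exposes.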

Q4: How do you handle multi-cloud cost optimization?

Use cloud-agnostic tools (Kubecost, CloudHealth)
Standardize tagging across clouds
Compare pricing across providers
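Buy committed-use discounts on each cloud separately (AWS Savings Plans, GCP committed use discounts, Azure Reservations), since commitments do not transfer between providers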

⚠️

Critical Consideration: Cost optimization is not a one-time project; it is an ongoing process. Set up regular reviews, automate monitoring, and create a culture of cost awareness.

Company-Specific Tips

Netflix Interview Tips

Discuss multi-cloud cost optimization
Explain spot instances for encoding
Mention CDN optimization for streaming
Talk about data transfer cost reduction

Uber Interview Tips

Focus on real-time cost optimization
Discuss geospatial data storage optimization
Mention ML training cost reduction
Talk about multi-region deployment costs

Amazon Interview Tips

Discuss AWS cost optimization services
Explain Savings Plans and Reserved Instances
Mention Spot Instances for batch processing
Talk about S3 Intelligent-Tiering

ℹ️

Final Takeaway: Cloud cost optimization requires continuous monitoring and tuning. Focus on the biggest cost drivers first, automate where possible, and create a culture of cost awareness. The goal is not just to cut costs, but to optimize value.