💰 Cost Optimization
Master cost optimization strategies for S3, compute, and data engineering workloads.
Module: AWS Data Engineering • Topic 28 of 65 • Premium Content
Cost Optimization Framework
Architecture Diagram
COST OPTIMIZATION FRAMEWORK

STORAGE COSTS
  S3 Standard:        $0.023/GB/mo      → Frequently accessed
  S3 IA:              $0.0125/GB/mo     → Infrequent (30-day min)
  S3 Glacier:         $0.004/GB/mo      → Archive (90-day min)
  S3 Deep Archive:    $0.00099/GB/mo    → Long-term (180-day min)

  Savings: Use lifecycle policies → up to 95% reduction

COMPUTE COSTS
  On-Demand:          Full price        → Dev/Test, variable workloads
  Reserved (1yr):     ~40% off          → Steady-state production
  Reserved (3yr):     ~60% off          → Long-term infrastructure
  Spot Instances:     Up to ~90% off    → Fault-tolerant batch jobs
  Savings Plans:      Up to 72% off     → Flexible usage patterns

  Serverless: Lambda, Glue, Athena → Pay only for what you use

DATA TRANSFER COSTS
  Inbound:            Free              → Most data ingestion
  Outbound (internet): $0.09/GB         → Minimize egress; use S3 gateway VPC endpoints to avoid NAT Gateway charges
  Cross-AZ:           $0.01/GB          → Keep services in same AZ
  Cross-Region:       $0.02/GB          → Minimize cross-region traffic
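To make the storage figures concrete, the following is a minimal sketch that turns the per-GB prices listed above into monthly costs and a saving relative to S3 Standard. The function names and the 10 TB data set are illustrative assumptions, and the calculation covers storage charges only (no request, retrieval, or minimum-duration fees).

```python
# Per-GB monthly prices copied from the storage table above.
# Real prices vary by Region and change over time.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}


def monthly_storage_cost(size_gb: float, storage_class: str) -> float:
    """Return the monthly storage charge in USD (storage only)."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class]


def savings_vs_standard(storage_class: str) -> float:
    """Return the fractional saving relative to S3 Standard."""
    standard = PRICE_PER_GB_MONTH["STANDARD"]
    return 1 - PRICE_PER_GB_MONTH[storage_class] / standard


if __name__ == "__main__":
    size_gb = 10_000  # hypothetical 10 TB data set
    for cls in PRICE_PER_GB_MONTH:
        cost = monthly_storage_cost(size_gb, cls)
        saving = savings_vs_standard(cls)
        print(f"{cls:<13} ${cost:>8.2f}/mo  ({saving:.0%} below Standard)")
```

The "up to 95%" figure corresponds to moving data from S3 Standard all the way to Deep Archive; intermediate tiers save proportionally less.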
S3 Lifecycle Policy Example
{
  "Rules": [
    {
      "ID": "OptimizeDataLake",
      "Status": "Enabled",
      "Filter": {"Prefix": "data-lake/"},
      "Transitions": [
        {"Days": 90, "StorageClass": "STANDARD_IA"},
        {"Days": 180, "StorageClass": "GLACIER"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
      ]
    }
  ]
}
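As a rough illustration of what the rule above does, this sketch maps an object's age to the storage class it should occupy, using the 90/180/365-day thresholds from the policy. The helper name `storage_class_at` is hypothetical, and real transitions are applied asynchronously by S3, so the boundaries are approximate.

```python
from bisect import bisect_right

# (age threshold in days, storage class) pairs taken from the rule above.
# Objects younger than the first threshold stay in S3 Standard.
TRANSITIONS = [(90, "STANDARD_IA"), (180, "GLACIER"), (365, "DEEP_ARCHIVE")]


def storage_class_at(age_days: int) -> str:
    """Return the storage class an object should be in at the given age."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    thresholds = [days for days, _ in TRANSITIONS]
    # Number of thresholds the object has already reached.
    reached = bisect_right(thresholds, age_days)
    return "STANDARD" if reached == 0 else TRANSITIONS[reached - 1][1]


if __name__ == "__main__":
    for age in (0, 89, 90, 200, 400):
        print(f"day {age:>3}: {storage_class_at(age)}")
```

Keeping the thresholds in one sorted list makes it easy to check that each transition happens later than the previous one, which S3 requires.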
Spot Instance Strategy
Architecture Diagram
SPOT INSTANCE STRATEGY

EMR Cluster Cost Comparison (10 nodes, 24/7):

  On-Demand:  10 × $0.52/hr × 730 hrs = $3,796/month
  Reserved:   10 × $0.31/hr × 730 hrs = $2,263/month (40% off)
  Spot:       10 × $0.10/hr × 730 hrs = $730/month   (81% off)

  Annual Savings:
    Reserved vs On-Demand: $18,396/year
    Spot vs On-Demand:     $36,792/year

Best Practices:
  • Use Spot for task nodes (fault-tolerant)
  • Use Reserved for core nodes (persistent)
  • Use On-Demand for master node (critical)
  • Set max price to On-Demand price
  • Enable graceful decommissioning
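The comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses the hourly rates and the 730-hour month from the diagram; the 1/3/6 split between master, core, and task nodes is a hypothetical mix added only to illustrate the best practices, not a figure from the source.

```python
HOURS_PER_MONTH = 730  # month length used in the comparison above

# Hourly per-node rates from the comparison above (illustrative, not live prices).
HOURLY_RATE = {"on_demand": 0.52, "reserved": 0.31, "spot": 0.10}


def monthly_cost(nodes_by_option: dict[str, int]) -> float:
    """Monthly cost in USD for a cluster given as {pricing option: node count}."""
    hourly = sum(HOURLY_RATE[opt] * count for opt, count in nodes_by_option.items())
    return round(hourly * HOURS_PER_MONTH, 2)


def annual_savings(baseline_monthly: float, alternative_monthly: float) -> float:
    """Yearly saving in USD when switching from the baseline to the alternative."""
    return round((baseline_monthly - alternative_monthly) * 12, 2)


if __name__ == "__main__":
    on_demand = monthly_cost({"on_demand": 10})
    reserved = monthly_cost({"reserved": 10})
    spot = monthly_cost({"spot": 10})
    # Hypothetical mix: 1 On-Demand master, 3 Reserved core, 6 Spot task nodes.
    mixed = monthly_cost({"on_demand": 1, "reserved": 3, "spot": 6})

    print(f"On-Demand: ${on_demand:,.0f}/month")
    print(f"Reserved:  ${reserved:,.0f}/month, saves ${annual_savings(on_demand, reserved):,.0f}/year")
    print(f"Spot:      ${spot:,.0f}/month, saves ${annual_savings(on_demand, spot):,.0f}/year")
    print(f"Mixed:     ${mixed:,.2f}/month")
```

An all-Spot cluster is the cheapest on paper, but a mixed cluster is usually the realistic target because master and core nodes should not be interrupted.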
Interview Q&A
Q1: What are the biggest cost drivers in data lakes?
Answer: Storage (S3) and compute (EMR/Glue). Reduce storage costs with lifecycle policies that move cold data to cheaper tiers, and reduce compute costs by running fault-tolerant jobs on Spot instances.
Q2: How do you estimate costs before deployment?
Answer: Use the AWS Pricing Calculator. Input expected storage, compute, and data transfer requirements.
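Q3: When should you use Savings Plans vs. Reserved Instances?
Answer: Use Savings Plans when the instance mix is likely to change: Compute Savings Plans apply across instance families, sizes, and Regions in exchange for an hourly spend commitment. Use Reserved Instances when the workload is stable: they are tied to a specific instance type and Region (or to a single AZ for zonal Reserved Instances, which also reserve capacity).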
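As a companion to Q2, a back-of-envelope estimate can be sketched before opening the AWS Pricing Calculator. The example below combines the storage, compute, and data transfer rates quoted earlier in this topic; the `Workload` class, its field names, and the sample numbers are hypothetical, and the result is no substitute for the calculator's Region-specific pricing.

```python
from dataclasses import dataclass

# Rates quoted earlier in this topic; treat them as placeholders, not live prices.
S3_STANDARD_PER_GB_MONTH = 0.023
ON_DEMAND_PER_NODE_HOUR = 0.52
EGRESS_PER_GB = 0.09
CROSS_AZ_PER_GB = 0.01


@dataclass
class Workload:
    """Hypothetical monthly requirements for one pipeline."""
    storage_gb: float
    node_hours: float
    egress_gb: float = 0.0
    cross_az_gb: float = 0.0


def estimate_monthly_cost(w: Workload) -> dict[str, float]:
    """Break a rough monthly estimate (USD) into storage, compute, and transfer."""
    parts = {
        "storage": w.storage_gb * S3_STANDARD_PER_GB_MONTH,
        "compute": w.node_hours * ON_DEMAND_PER_NODE_HOUR,
        "transfer": w.egress_gb * EGRESS_PER_GB + w.cross_az_gb * CROSS_AZ_PER_GB,
    }
    parts["total"] = sum(parts.values())
    return {name: round(value, 2) for name, value in parts.items()}


if __name__ == "__main__":
    sample = Workload(storage_gb=5_000, node_hours=1_000, egress_gb=200, cross_az_gb=1_000)
    for name, value in estimate_monthly_cost(sample).items():
        print(f"{name:<9} ${value:,.2f}")
```

Splitting the estimate by category shows which lever (tiering, purchase option, or network layout) is worth optimizing first.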
Summary
- Storage: Lifecycle policies for up to 95% savings on cold data
- Compute: Spot for up to 90% savings, Reserved for roughly 40-60% savings
- Transfer: S3 gateway VPC endpoints avoid NAT Gateway data processing charges for S3 traffic
- Serverless: Pay-per-use for variable workloads
- Monitoring: Cost Explorer and Budgets for visibility