Cost Optimization: Managing Data Platform Expenses

Cost Optimization: Maximizing Value per Dollar

Cost optimization in data engineering involves right-sizing compute resources, optimizing storage, eliminating waste, and implementing governance to control data platform expenses.

Why Cost Optimization Matters

Cost Risks:

A single poorly optimized query can cost hundreds of dollars
Unused warehouses running 24/7 waste thousands monthly
Cloud costs can spiral quickly without governance

FinOps Benefits:

Every dollar spent delivers business value
Right-sizing reduces waste
Auto-suspend saves costs when not in use

Key Insight: FinOps practices ensure every dollar spent on cloud data platforms delivers business value.

Architecture Overview

The Cost Optimization framework covers three dimensions:

Compute Optimization — Right-sizing (monitor queue times), Auto-Suspend (60-300s timeout), Auto-Resume (on-demand scaling), Reserved Capacity (20-40% savings).

Storage Optimization — Tiering (Hot/Warm/Cold), Compression (columnar Parquet), Partitioning (date-based pruning), Lifecycle Policies (auto-expire data).

Query Optimization — Filter Early (partition pruning), Materialized Views (pre-computed aggregations), Result Caching (repeat query acceleration), Cost Alerts (budget threshold notifications).

Compute Cost Optimization

Compute cost in cloud data warehouses is driven by the number of virtual warehouses (or slots/clusters), their size, and their utilization time. The primary optimization lever is right-sizing and auto-suspend.

Compute Cost Model

Snowflake: Cost = Credits_Used × Price_Per_Credit
- Credits_Per_Hour = Warehouse_Size_Credits (1-128)
- Price_Per_Credit = $2-4 (edition-dependent)
- Monthly Cost = Credits_Per_Hour × Hours_Active × Price
BigQuery: Cost = Data_Scanned_TB × $6.25 (on-demand) OR Slots × Time × Rate
Redshift: Cost = Nodes × Price_Per_Node × Hours
Break-Even: Flat-Rate vs On-Demand = (Flat_Rate_Monthly) / (Data_Scanned_TB × $6.25)

-- Snowflake: Monitor warehouse utilization
SELECT
    warehouse_name,
    warehouse_size,
    SUM(credits_used) AS total_credits,
    SUM(credits_used) * 3 AS total_cost_usd,  -- Assuming $3/credit
    SUM(query_count) AS total_queries,
    ROUND(SUM(credits_used) / NULLIF(SUM(query_count), 0), 4) AS credits_per_query,
    ROUND(AVG(avg_queue_time_ms), 0) AS avg_queue_ms,
    ROUND(AVG(avg_execution_time_ms), 0) AS avg_exec_ms
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_date >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY warehouse_name, warehouse_size
ORDER BY total_credits DESC;

-- Identify idle warehouses
SELECT
    warehouse_name,
    warehouse_size,
    SUM(CASE WHEN credits_used = 0 THEN 1 ELSE 0 END) AS idle_hours,
    SUM(credits_used) AS active_credits,
    ROUND(SUM(CASE WHEN credits_used = 0 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS idle_pct
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_date >= DATEADD(day, -7, CURRENT_DATE())
GROUP BY warehouse_name, warehouse_size
HAVING idle_pct > 50
ORDER BY idle_pct DESC;

-- Right-sizing recommendation
SELECT
    warehouse_name,
    warehouse_size,
    AVG(avg_queue_time_ms) AS avg_queue,
    CASE
        WHEN AVG(avg_queue_time_ms) > 10000 AND warehouse_size IN ('X-Small', 'Small')
            THEN 'SCALE UP: Queue time > 10s'
        WHEN AVG(avg_queue_time_ms) < 1000 AND warehouse_size IN ('Large', 'X-Large', '2X-Large')
            THEN 'SCALE DOWN: Queue time < 1s'
        ELSE 'ADEQUATE'
    END AS recommendation
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_date >= DATEADD(day, -7, CURRENT_DATE())
GROUP BY warehouse_name, warehouse_size;

-- Auto-suspend optimization
ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 60;  -- Suspend after 1 min
ALTER WAREHOUSE etl_wh SET AUTO_SUSPEND = 300;        -- Suspend after 5 min
ALTER WAREHOUSE adhoc_wh SET AUTO_SUSPEND = 60;       -- Suspend after 1 min

-- Cost by query pattern
SELECT
    query_type,
    COUNT(*) AS query_count,
    SUM(credits_used) AS total_credits,
    ROUND(AVG(credits_used), 4) AS avg_credits,
    SUM(bytes_scanned) / 1024 / 1024 / 1024 AS total_gb_scanned
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY query_type
ORDER BY total_credits DESC;

Storage Cost Optimization

Storage tiering classifies data by access frequency and stores it in cost-appropriate tiers: hot (frequently accessed, expensive), warm (occasionally accessed, moderate), and cold (rarely accessed, cheap).

Storage Cost Comparison

Hot Tier (SSD/Provisioned): $0.10-0.20/GB/month — frequent access
Warm Tier (Standard): $0.02-0.04/GB/month — occasional access
Cold Tier (Archive/Glacier): $0.004-0.01/GB/month — rare access
Deep Archive: $0.001/GB/month — compliance/backup only
Cost Savings: Moving 100 TB from hot to cold = $10,000/month savings

# Storage Tiering Implementation
import boto3
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import List

@dataclass
class StorageTier:
    name: str
    storage_class: str
    cost_per_gb_month: float
    min_access_frequency_days: int

class StorageTierManager:
    """Manage storage tiering across S3 buckets."""

    TIERS = {
        "hot": StorageTier("hot", "STANDARD", 0.023, 0),
        "warm": StorageTier("warm", "STANDARD_IA", 0.0125, 30),
        "cold": StorageTier("cold", "GLACIER", 0.004, 90),
        "deep_archive": StorageTier("deep_archive", "DEEP_ARCHIVE", 0.00099, 365)
    }

    def __init__(self, bucket_name: str, region: str = "us-east-1"):
        self.bucket_name = bucket_name
        self.s3 = boto3.client('s3', region_name=region)

    def transition_objects(self, prefix: str, rules: List[dict]):
        """Apply lifecycle rules for automatic tiering."""
        lifecycle_rules = []
        for rule in rules:
            lifecycle_rules.append({
                "ID": rule["name"],
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {
                        "Days": tier["days"],
                        "StorageClass": self.TIERS[tier["tier"]].storage_class
                    }
                    for tier in rule["transitions"]
                ],
                "Expiration": rule.get("expiration", {})
            })

        self.s3.put_bucket_lifecycle_configuration(
            Bucket=self.bucket_name,
            LifecycleConfiguration={"Rules": lifecycle_rules}
        )

    def analyze_storage_costs(self, prefix: str = "") -> dict:
        """Analyze current storage distribution and costs."""
        paginator = self.s3.get_paginator('list_objects_v2')
        total_size = 0
        tier_sizes = {tier: 0 for tier in self.TIERS}

        for page in paginator.paginate(Bucket=self.bucket_name, Prefix=prefix):
            for obj in page.get('Contents', []):
                size = obj['Size']
                total_size += size
                storage_class = obj.get('StorageClass', 'STANDARD')

                for tier_name, tier in self.TIERS.items():
                    if tier.storage_class == storage_class:
                        tier_sizes[tier_name] += size
                        break

        costs = {}
        for tier_name, size_bytes in tier_sizes.items():
            size_gb = size_bytes / (1024**3)
            costs[tier_name] = {
                "size_gb": round(size_gb, 2),
                "cost_per_month": round(size_gb * self.TIERS[tier_name].cost_per_gb_month, 2)
            }

        return {
            "total_size_gb": round(total_size / (1024**3), 2),
            "total_cost_per_month": sum(c["cost_per_month"] for c in costs.values()),
            "breakdown": costs
        }

# Usage
manager = StorageTierManager("my-data-lake")
manager.transition_objects("data/events/", [
    {
        "name": "events-tiering",
        "transitions": [
            {"tier": "warm", "days": 30},
            {"tier": "cold", "days": 90},
            {"tier": "deep_archive", "days": 365}
        ],
        "expiration": {"Days": 2555}  # 7 years
    }
])

costs = manager.analyze_storage_costs("data/")
print(f"Total storage: {costs['total_size_gb']} GB")
print(f"Monthly cost: ${costs['total_cost_per_month']}")

Cost Monitoring and Alerting

# Cost Monitoring Dashboard
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List
import json

@dataclass
class CostAlert:
    service: str
    threshold_usd: float
    period_days: int
    current_cost: float
    alert_level: str  # info, warning, critical

class CostMonitor:
    """Monitor and alert on data platform costs."""

    def __init__(self, budget_monthly_usd: float = 10000):
        self.budget = budget_monthly_usd
        self.alerts: List[CostAlert] = []

    def check_snowflake_costs(self, warehouse_costs: Dict[str, float]):
        """Check Snowflake warehouse costs against thresholds."""
        total = sum(warehouse_costs.values())

        # Budget alert
        if total > self.budget * 0.9:
            self.alerts.append(CostAlert(
                service="snowflake",
                threshold_usd=self.budget * 0.9,
                period_days=30,
                current_cost=total,
                alert_level="critical" if total > self.budget else "warning"
            ))

        # Individual warehouse alerts
        for warehouse, cost in warehouse_costs.items():
            if cost > self.budget * 0.3:  # Single warehouse > 30% of budget
                self.alerts.append(CostAlert(
                    service=f"snowflake.{warehouse}",
                    threshold_usd=self.budget * 0.3,
                    period_days=30,
                    current_cost=cost,
                    alert_level="warning"
                ))

        return self.alerts

    def generate_cost_report(self, costs: Dict[str, Dict]) -> str:
        """Generate a formatted cost report."""
        report = []
        report.append("=" * 60)
        report.append("DATA PLATFORM COST REPORT")
        report.append(f"Period: {datetime.now().strftime('%Y-%m')}")
        report.append("=" * 60)

        total = 0
        for service, breakdown in costs.items():
            service_total = sum(breakdown.values())
            total += service_total
            report.append(f"\n{service.upper()}")
            for component, cost in breakdown.items():
                report.append(f"  {component}: ${cost:,.2f}")

        report.append(f"\n{'=' * 60}")
        report.append(f"TOTAL: ${total:,.2f}")
        report.append(f"BUDGET: ${self.budget:,.2f}")
        report.append(f"REMAINING: ${self.budget - total:,.2f}")
        report.append(f"UTILIZATION: {total/self.budget*100:.1f}%")

        if self.alerts:
            report.append(f"\nALERTS ({len(self.alerts)}):")
            for alert in self.alerts:
                report.append(f"  [{alert.alert_level.upper()}] {alert.service}: "
                            f"${alert.current_cost:,.2f} > ${alert.threshold_usd:,.2f}")

        return "\n".join(report)

# Usage
monitor = CostMonitor(budget_monthly_usd=10000)
alerts = monitor.check_snowflake_costs({
    "analytics_wh": 2500,
    "etl_wh": 1800,
    "adhoc_wh": 400,
    "ml_wh": 3200
})

report = monitor.generate_cost_report({
    "Snowflake": {"analytics": 2500, "etl": 1800, "adhoc": 400, "ml": 3200},
    "S3": {"storage": 200, "transfer": 50},
    "Databricks": {"clusters": 1500, "storage": 100}
})
print(report)

Key Concepts Summary

Optimization Area	Strategy	Savings	Implementation
Warehouse Sizing	Right-size to workload	30-60%	Monitor queue times
Auto-Suspend	Suspend idle warehouses	20-40%	Set 60-300 second timeout
Reserved Capacity	Commit to usage	20-40%	Annual/monthly commitment
Storage Tiering	Hot/Warm/Cold	40-70%	Lifecycle policies
Partitioning	Reduce data scanned	70-95%	Partition by date
Clustering	Improve data skipping	20-60%	Cluster by filter columns
Materialized Views	Pre-computed results	30-70%	Auto-refresh aggregations
Result Caching	Avoid re-execution	50-80%	Enable caching
Query Optimization	Reduce data processed	20-50%	Filter early, select columns
Cost Monitoring	Catch anomalies	Variable	Alerts and dashboards

Performance Metrics

Metric	Without Optimization	With Optimization	Savings
Monthly Compute	$8,000 \|$ 3,500	56%
Monthly Storage	$2,000 \|$ 600	70%
Query Cost/GB	$0.00625 \|$ 0.002	68%
Idle Compute	40%	5%	87%
Storage Tier Utilization	100% hot	30% hot, 40% warm, 30% cold	60%
Budget Utilization	120%	85%	29%
Cost Anomaly Detection	Manual	Automated	90% faster

10 Best Practices

Set auto-suspend to 60-300 seconds on all warehouses — biggest single cost lever
Monitor warehouse utilization weekly — right-size based on queue times and credits used
Implement storage tiering — transition old data to cheaper storage automatically
Use partition pruning — filter by date in every query to minimize data scanned
Enable result caching — identical queries within 24 hours return cached results
Set up cost alerts — notify when approaching 80% and 100% of budget
Use reserved capacity for predictable workloads — 20-40% savings vs on-demand
Optimize file sizes — 128-256 MB compressed files balance parallelism and overhead
Clean up unused resources — drop orphaned tables, unused warehouses, expired stages
Tag all resources by team and project for cost allocation and accountability

Auto-suspend is the single most impactful cost lever — suspend idle warehouses immediately
Storage tiering (hot/warm/cold) reduces storage costs by 40-70%
Partition pruning reduces compute costs by minimizing data scanned
Cost monitoring with automated alerts catches anomalies before they become expensive
Right-sizing compute to workload (not over-provisioning) saves 30-60%