
Infrastructure as Code: Automating Data Platform Provisioning


Infrastructure as Code: Reproducible, Version-Controlled Infrastructure

Infrastructure as Code (IaC) manages and provisions data platform infrastructure through machine-readable configuration files, enabling version control, review, and automated deployment.

Why IaC Matters


Problems with Manual Provisioning:

  • Configuration drift
  • Undocumented changes
  • Deployment inconsistencies

IaC Benefits:

  1. Reproducible infrastructure: every environment is identical
  2. Version control: track changes over time
  3. Automated deployment: reduce human error
  4. Auditable: clear record of what was deployed

Key Insight: IaC ensures every environment is identical, reproducible, and auditable.
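To make "configuration drift" concrete, the following minimal Python sketch compares attributes declared in code with attributes observed on live resources. The resource names and values are hypothetical illustrations, not output from any real tool, and real drift detection reads the live side from the provider's API.

```python
# Hypothetical illustration of drift detection: declared vs. live attributes.

def detect_drift(declared: dict, live: dict) -> dict:
    """Return {resource: {attribute: (declared_value, live_value)}} for each mismatch."""
    drift = {}
    for resource, attrs in declared.items():
        live_attrs = live.get(resource, {})
        changed = {
            key: (value, live_attrs.get(key))
            for key, value in attrs.items()
            if live_attrs.get(key) != value
        }
        if changed:
            drift[resource] = changed
    return drift


# Declared in version control (assumed example values).
declared = {
    "ANALYTICS_WH_DEV": {"size": "medium", "auto_suspend": 300},
    "ETL_WH_DEV": {"size": "large", "auto_suspend": 600},
}

# Observed in the live account after someone resized a warehouse by hand.
live = {
    "ANALYTICS_WH_DEV": {"size": "x-large", "auto_suspend": 300},
    "ETL_WH_DEV": {"size": "large", "auto_suspend": 600},
}

print(detect_drift(declared, live))
# {'ANALYTICS_WH_DEV': {'size': ('medium', 'x-large')}}
```

An empty result means the live environment still matches what is in version control.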


Architecture Overview

[Figure: Infrastructure as Code Pipeline. Five stages, in order: IaC Repo (Terraform, Pulumi, .tf files, variables, Git versioned); Plan & Validate (terraform plan, policy checks, cost estimate, security scan, review required); CI Pipeline (lint config, unit tests, integration, drift detect, GitHub Actions); Apply (terraform apply, state lock, provision, output vars, auto-approve); Cloud (AWS, GCP, Azure, Snowflake, resources live).]


Terraform for Data Platforms

IaC Tool Comparison

Terraform

  • Language: HCL
  • State: Remote (S3 + DynamoDB)
  • Providers: 3000+ modules
  • Multi-cloud: AWS, GCP, Azure
  • Plan before apply
  • Mature ecosystem, large community
  • Policy as Code: Sentinel, OPA
  • Best for: Multi-cloud platforms

Pulumi

  • Language: Python, TS, Go
  • State: Pulumi Cloud / S3
  • Providers: AWS, GCP, Azure
  • Real programming language
  • Test with pytest, unittest
  • IDE support, code reuse
  • Dynamic infrastructure
  • Best for: Dev-heavy teams

CloudFormation

  • Language: YAML / JSON
  • State: AWS-managed
  • Providers: AWS only
  • Native AWS integration
  • Stacks and change sets
  • Free, no external tool
  • Drift detection built-in
  • Best for: AWS-only shops

IaC declares infrastructure resources in configuration files. The IaC tool (Terraform, Pulumi, CloudFormation) then provisions, manages, and destroys resources to match the declared state.
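The "match the declared state" step can be pictured as a diff between two dictionaries. The sketch below is a simplified model of that idea, not how Terraform, Pulumi, or CloudFormation are implemented internally; the resource keys and attributes are hypothetical.

```python
# Simplified model of how a declarative tool derives a plan from two states.

def plan(desired: dict, current: dict) -> dict:
    """Classify resources into create / update / delete actions."""
    return {
        "create": sorted(name for name in desired if name not in current),
        "update": sorted(
            name for name in desired
            if name in current and desired[name] != current[name]
        ),
        "delete": sorted(name for name in current if name not in desired),
    }


# What the configuration files declare (assumed example values).
desired = {
    "db.ANALYTICS_DEV": {"comment": "Analytics database for dev"},
    "schema.MARTS": {"comment": "Business-ready data marts"},
    "schema.STAGING": {"comment": "Staging area for transformations"},
}

# What currently exists in the target platform.
current = {
    "db.ANALYTICS_DEV": {"comment": "Analytics database for dev"},
    "schema.MARTS": {"comment": "old comment"},
    "schema.SCRATCH": {"comment": "created by hand"},
}

actions = plan(desired, current)
print(actions)
# {'create': ['schema.STAGING'], 'update': ['schema.MARTS'], 'delete': ['schema.SCRATCH']}
```

Reviewing this list of actions before anything is changed is the idea behind "plan before apply".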

# main.tf - Snowflake data warehouse infrastructure
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    snowflake = {
      source  = "Snowflake-Labs/snowflake"
      version = "~> 0.89"
    }
  }

  backend "s3" {
    bucket         = "terraform-state-data-platform"
    key            = "snowflake/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

provider "snowflake" {
  account  = var.snowflake_account
  username = var.snowflake_username
  password = var.snowflake_password
  role     = "ACCOUNTADMIN"
}

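# Variables
variable "snowflake_account" {
  type        = string
  description = "Snowflake account identifier"
}

variable "snowflake_username" {
  type        = string
  description = "Snowflake user that Terraform authenticates as"
}

# Marked sensitive so the value is redacted in plan output.
# Supply it via TF_VAR_snowflake_password or a secret store, never in Git.
variable "snowflake_password" {
  type        = string
  description = "Password for the Snowflake user"
  sensitive   = true
}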

variable "environment" {
  type        = string
  description = "Deployment environment"
  default     = "dev"
}

variable "warehouses" {
  type = map(object({
    size         = string
    auto_suspend = number
    min_clusters = number
    max_clusters = number
  }))
  default = {
    analytics = {
      size         = "medium"
      auto_suspend = 300
      min_clusters = 1
      max_clusters = 4
    }
    etl = {
      size         = "large"
      auto_suspend = 600
      min_clusters = 2
      max_clusters = 6
    }
    adhoc = {
      size         = "x-small"
      auto_suspend = 60
      min_clusters = 1
      max_clusters = 1
    }
  }
}

# Database
resource "snowflake_database" "analytics" {
  name    = "ANALYTICS_${upper(var.environment)}"
  comment = "Analytics database for ${var.environment}"
}

# Schema
resource "snowflake_schema" "marts" {
  database = snowflake_database.analytics.name
  name     = "MARTS"
  comment  = "Business-ready data marts"
}

resource "snowflake_schema" "staging" {
  database = snowflake_database.analytics.name
  name     = "STAGING"
  comment  = "Staging area for transformations"
}

# Virtual Warehouses
resource "snowflake_warehouse" "warehouses" {
  for_each = var.warehouses

  name           = "${upper(each.key)}_WH_${upper(var.environment)}"
  comment        = "${each.key} warehouse for ${var.environment}"
  warehouse_size = each.value.size

  auto_suspend      = each.value.auto_suspend
  auto_resume       = true
  min_cluster_count = each.value.min_clusters
  max_cluster_count = each.value.max_clusters
  scaling_policy    = "ECONOMY"
}

# Roles
resource "snowflake_role" "analyst" {
  name    = "DATA_ANALYST_${upper(var.environment)}"
  comment = "Data analyst role"
}

resource "snowflake_role" "engineer" {
  name    = "DATA_ENGINEER_${upper(var.environment)}"
  comment = "Data engineer role"
}

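# Grants
# Note: grant resource names differ between provider versions; verify these
# against the documentation for the provider version pinned above.

# Place the analyst role under SYSADMIN in the role hierarchy.
resource "snowflake_grant_account_role" "analyst_to_sysadmin" {
  role_name        = snowflake_role.analyst.name
  parent_role_name = "SYSADMIN"
}

# USAGE on the database
resource "snowflake_grant_privileges_to_account_role" "analyst_analytics" {
  account_role_name = snowflake_role.analyst.name
  privileges        = ["USAGE"]

  on_account_object {
    object_type = "DATABASE"
    object_name = snowflake_database.analytics.name
  }
}

# USAGE on the MARTS schema
resource "snowflake_grant_privileges_to_account_role" "analyst_marts" {
  account_role_name = snowflake_role.analyst.name
  privileges        = ["USAGE"]

  on_schema {
    schema_name = "\"${snowflake_database.analytics.name}\".\"${snowflake_schema.marts.name}\""
  }
}

# SELECT on all existing tables in MARTS
resource "snowflake_grant_privileges_to_account_role" "analyst_select" {
  account_role_name = snowflake_role.analyst.name
  privileges        = ["SELECT"]

  on_schema_object {
    all {
      object_type_plural = "TABLES"
      in_schema          = "\"${snowflake_database.analytics.name}\".\"${snowflake_schema.marts.name}\""
    }
  }
}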

# Outputs
output "database_name" {
  value = snowflake_database.analytics.name
}

output "warehouse_names" {
  value = { for k, v in snowflake_warehouse.warehouses : k => v.name }
}
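To see what the `for_each` block above produces, this short Python sketch mirrors the HCL naming expression `"${upper(each.key)}_WH_${upper(var.environment)}"`. It only reproduces the string logic; the function name is hypothetical and nothing here talks to Snowflake.

```python
# Mirrors the name expression used in the snowflake_warehouse resource.

def warehouse_names(warehouses: dict, environment: str) -> dict:
    """Return {map key: generated warehouse name}, one entry per for_each instance."""
    return {key: f"{key.upper()}_WH_{environment.upper()}" for key in warehouses}


# Same keys as the default value of var.warehouses (sizes omitted for brevity).
warehouses = {"analytics": {}, "etl": {}, "adhoc": {}}

print(warehouse_names(warehouses, "dev"))
# {'analytics': 'ANALYTICS_WH_DEV', 'etl': 'ETL_WH_DEV', 'adhoc': 'ADHOC_WH_DEV'}
```

Changing `environment` to `prod` yields a parallel set of names, which is how one configuration can serve several environments without name collisions.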

Terraform for S3 Data Lake

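# s3_data_lake.tf
variable "aws_account_id" {
  type        = string
  description = "AWS account ID, used to keep the bucket name globally unique"
}

resource "aws_s3_bucket" "data_lake" {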
  bucket = "data-lake-${var.environment}-${var.aws_account_id}"

  tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Project     = "data-platform"
  }
}

resource "aws_s3_bucket_versioning" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.data_lake.arn
    }
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "tier-to-warm"
    status = "Enabled"

    filter {
      prefix = "raw/"
    }

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}

resource "aws_s3_bucket_policy" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "EnforceTLS"
        Effect    = "Deny"
        Principal = "*"
        Action    = "s3:*"
        Resource = [
          aws_s3_bucket.data_lake.arn,
          "${aws_s3_bucket.data_lake.arn}/*"
        ]
        Condition = {
          Bool = {
            "aws:SecureTransport" = "false"
          }
        }
      },
      {
        Sid       = "EnforceEncryption"
        Effect    = "Deny"
        Principal = "*"
        Action    = "s3:PutObject"
        Resource  = "${aws_s3_bucket.data_lake.arn}/*"
        Condition = {
          StringNotEquals = {
            "s3:x-amz-server-side-encryption" = "aws:kms"
          }
        }
      }
    ]
  })
}

# KMS key for encryption
resource "aws_kms_key" "data_lake" {
  description             = "KMS key for data lake encryption"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

resource "aws_kms_alias" "data_lake" {
  name          = "alias/data-lake-${var.environment}"
  target_key_id = aws_kms_key.data_lake.key_id
}

# Glue Catalog
resource "aws_glue_catalog_database" "analytics" {
  name = "analytics_${var.environment}"
}

resource "aws_glue_catalog_table" "orders" {
  name          = "orders"
  database_name = aws_glue_catalog_database.analytics.name

  table_type = "EXTERNAL_TABLE"

  parameters = {
    "classification"      = "parquet"
    "parquet.compression" = "SNAPPY"
  }

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.data_lake.bucket}/silver/orders/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    ser_de_info {
      name                  = "parquet"
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }

    columns {
      name = "order_id"
      type = "string"
    }
    columns {
      name = "customer_id"
      type = "string"
    }
    columns {
      name = "order_date"
      type = "date"
    }
    columns {
      name = "total_amount"
      type = "decimal(14,2)"
    }
  }

  partition_keys {
    name = "order_year"
    type = "int"
  }
  partition_keys {
    name = "order_month"
    type = "int"
  }
}
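The `tier-to-warm` lifecycle rule above is easier to reason about as a lookup from object age to storage class. The sketch below copies the three thresholds from that rule; it assumes objects are uploaded as STANDARD and ignores S3's asynchronous transition timing and other eligibility rules, so treat it as an approximation.

```python
# Thresholds copied from the "tier-to-warm" rule: (minimum age in days, storage class).
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]


def storage_class_for_age(age_days: int, under_raw_prefix: bool = True) -> str:
    """Return the storage class the rule targets for an object of the given age."""
    if not under_raw_prefix:
        # The rule's filter only matches keys under "raw/".
        return "STANDARD"
    current = "STANDARD"
    for threshold, storage_class in TRANSITIONS:
        if age_days >= threshold:
            current = storage_class
    return current


for age in (10, 45, 120, 400):
    print(age, storage_class_for_age(age))
# 10 STANDARD
# 45 STANDARD_IA
# 120 GLACIER
# 400 DEEP_ARCHIVE
```

Note that objects outside `raw/`, such as the `silver/orders/` data referenced by the Glue table, are not tiered by this rule.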

Pulumi for Data Platform

# pulumi_data_platform.py
import pulumi
import pulumi_aws as aws
import pulumi_snowflake as snowflake

# Configuration
config = pulumi.Config()
environment = config.get("environment") or "dev"

# Snowflake Database
analytics_db = snowflake.Database("analytics_db",
    name=f"ANALYTICS_{environment.upper()}",
    comment=f"Analytics database for {environment}"
)

# Snowflake Schemas
marts_schema = snowflake.Schema("marts_schema",
    database=analytics_db.name,
    name="MARTS",
    comment="Business-ready data marts"
)

staging_schema = snowflake.Schema("staging_schema",
    database=analytics_db.name,
    name="STAGING",
    comment="Staging area"
)

# Virtual Warehouses
warehouses = {}
for name, config_dict in [
    ("analytics", {"size": "medium", "auto_suspend": 300}),
    ("etl", {"size": "large", "auto_suspend": 600}),
    ("adhoc", {"size": "x-small", "auto_suspend": 60})
]:
    warehouses[name] = snowflake.Warehouse(f"{name}_wh",
        name=f"{name.upper()}_WH_{environment.upper()}",
        warehouse_size=config_dict["size"],
        auto_suspend=config_dict["auto_suspend"],
        auto_resume=True,
        scaling_policy="ECONOMY"
    )

# S3 Bucket for Data Lake
data_lake_bucket = aws.s3.Bucket("data_lake",
    bucket=f"data-lake-{environment}",
    tags={
        "Environment": environment,
        "ManagedBy": "Pulumi"
    }
)

# Enable versioning
aws.s3.BucketVersioning("data_lake_versioning",
    bucket=data_lake_bucket.id,
    versioning_configuration={
        "status": "Enabled"
    }
)

# KMS Key
kms_key = aws.kms.Key("data_lake_key",
    description="KMS key for data lake",
    enable_key_rotation=True
)

# Export outputs
pulumi.export("database_name", analytics_db.name)
pulumi.export("bucket_name", data_lake_bucket.bucket)
pulumi.export("warehouses", {k: v.name for k, v in warehouses.items()})
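One benefit the comparison attributes to Pulumi is testability with ordinary Python tooling. A minimal way to exploit that is to keep the warehouse definitions in plain data structures and validate them before any resource is created. The `WarehouseSpec` class and the allowed-size set below are hypothetical helpers, and the size list is an illustrative subset rather than Snowflake's full list.

```python
from dataclasses import dataclass
from typing import List

# Assumption: an illustrative subset of sizes, not an authoritative list.
ALLOWED_SIZES = {"x-small", "small", "medium", "large", "x-large"}


@dataclass(frozen=True)
class WarehouseSpec:
    key: str
    size: str
    auto_suspend: int  # seconds of inactivity before suspending

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means valid."""
        errors = []
        if self.size not in ALLOWED_SIZES:
            errors.append(f"{self.key}: unknown size {self.size!r}")
        if self.auto_suspend <= 0:
            errors.append(f"{self.key}: auto_suspend must be positive")
        return errors


# Same values as the loop in the Pulumi program above.
SPECS = [
    WarehouseSpec("analytics", "medium", 300),
    WarehouseSpec("etl", "large", 600),
    WarehouseSpec("adhoc", "x-small", 60),
]


def validate_all(specs: List[WarehouseSpec]) -> List[str]:
    """Collect validation errors across all specs."""
    return [error for spec in specs for error in spec.validate()]


print(validate_all(SPECS))  # []
```

Because this logic has no Pulumi dependency, it can run in pytest or unittest without cloud credentials; the Pulumi program would then loop over `SPECS` instead of an inline list.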

Key Concepts Summary

Concept            | Description                | Tool                         | Use Case
-------------------|----------------------------|------------------------------|-------------------------
Declarative Config | Define desired state       | Terraform, Pulumi            | All infrastructure
State Management   | Track resource state       | S3+DynamoDB, Terraform Cloud | Multi-user collaboration
Drift Detection    | Detect manual changes      | Terraform Cloud, Spacelift   | Compliance
Module Reuse       | Share common patterns      | Terraform Modules            | Multi-environment
Policy as Code     | Enforce standards          | Sentinel, OPA, Checkov       | Governance
Secret Management  | Secure credentials         | Vault, AWS Secrets Manager   | Security
Cost Estimation    | Predict costs before apply | Infracost, Terraform Cloud   | Budget control
Multi-Cloud        | Deploy across providers    | Terraform, Pulumi            | Vendor flexibility
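The state-locking idea behind the State Management row can be modeled as a conditional write: a run may record its lock only if no other run already holds one. The toy class below keeps locks in memory purely to show that rule; it is a conceptual sketch with hypothetical names, not the actual backend implementation, which persists the lock remotely.

```python
class StateLockError(Exception):
    """Raised when another run already holds the lock."""


class InMemoryLockTable:
    """Toy stand-in for a remote lock table."""

    def __init__(self):
        self._locks = {}

    def acquire(self, state_key: str, owner: str) -> None:
        # Conditional write: succeed only if nobody holds the lock.
        if state_key in self._locks:
            raise StateLockError(
                f"{state_key} is locked by {self._locks[state_key]}"
            )
        self._locks[state_key] = owner

    def release(self, state_key: str, owner: str) -> None:
        # Only the current holder may release the lock.
        if self._locks.get(state_key) == owner:
            del self._locks[state_key]


table = InMemoryLockTable()
key = "snowflake/terraform.tfstate"

table.acquire(key, "ci-run-1")
try:
    table.acquire(key, "laptop-run")  # second run must wait
except StateLockError as error:
    print(error)  # snowflake/terraform.tfstate is locked by ci-run-1

table.release(key, "ci-run-1")
table.acquire(key, "laptop-run")  # succeeds once the first run is done
```

Without this rule, two concurrent applies could each write their own version of the state and silently overwrite one another.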

Performance Metrics

Metric                    | Manual Provisioning | IaC             | Improvement
--------------------------|---------------------|-----------------|------------
New Environment Setup     | Hours-Days          | Minutes         | 10-50x
Configuration Consistency | 60-80%              | 99-100%         | +20-40%
Drift Incidents           | Monthly             | Rare            | -90%
Deployment Rollback       | Hours               | Minutes         | 10-20x
Documentation Currency    | Outdated            | Always current  | 100%
Cost Visibility           | None                | Per-resource    | Full
Audit Trail               | None                | Git history     | Complete
Cross-Team Reuse          | Manual copy         | Module registry | Automated

10 Best Practices

  1. Store all IaC in version control: Git provides an audit trail and rollback capability
  2. Use remote state with locking: S3+DynamoDB or Terraform Cloud for team collaboration
  3. Implement policy as code: enforce naming, tagging, and security standards automatically
  4. Use modules for reuse: create reusable modules for warehouses, databases, and networking
  5. Plan before apply: always review terraform plan output before deploying changes
  6. Separate environments: use workspaces or separate state files for dev/staging/prod
  7. Implement drift detection: alert on manual changes made outside IaC
  8. Tag all resources: enable cost allocation and ownership tracking
  9. Use secret management: never commit credentials to Git; use Vault or a cloud secrets manager such as AWS Secrets Manager
  10. Test IaC changes: run terraform validate and terraform plan in CI before apply
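Practices 3 and 8 can be combined into a small CI check. The sketch below inspects a plan exported as JSON (for example with `terraform show -json` on a saved plan file) and reports resources that lack required tags. The required tag set is taken from the S3 bucket example earlier; the sample plan is a hand-written fragment that mimics the shape of that JSON, not real tool output, and the function name is hypothetical.

```python
REQUIRED_TAGS = {"Environment", "ManagedBy", "Project"}


def missing_tags(plan: dict) -> dict:
    """Map resource address -> sorted list of required tags it lacks."""
    violations = {}
    for change in plan.get("resource_changes", []):
        details = change.get("change", {})
        actions = details.get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue  # skip no-op and delete-only changes
        after = details.get("after") or {}
        if "tags" not in after:
            continue  # resource type has no tags attribute
        tags = after.get("tags") or {}
        missing = sorted(REQUIRED_TAGS - set(tags))
        if missing:
            violations[change["address"]] = missing
    return violations


# Hand-written fragment in the shape of a JSON plan (assumed example data).
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_s3_bucket.data_lake",
            "type": "aws_s3_bucket",
            "change": {
                "actions": ["create"],
                "after": {
                    "tags": {
                        "Environment": "dev",
                        "ManagedBy": "Terraform",
                        "Project": "data-platform",
                    }
                },
            },
        },
        {
            "address": "aws_kms_key.data_lake",
            "type": "aws_kms_key",
            "change": {"actions": ["create"], "after": {"tags": None}},
        },
    ]
}

print(missing_tags(sample_plan))
# {'aws_kms_key.data_lake': ['Environment', 'ManagedBy', 'Project']}
```

A CI job could fail the build whenever the returned dictionary is non-empty; dedicated tools such as Sentinel, OPA, or Checkov cover the same ground with richer rule languages.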

  • IaC enables reproducible, version-controlled infrastructure provisioning
  • Terraform and Pulumi provide cloud-agnostic infrastructure management
  • Remote state with locking enables team collaboration without conflicts
  • Policy as code enforces governance standards automatically
  • IaC reduces provisioning time from hours to minutes while improving consistency
