Infrastructure as Code: Reproducible, Version-Controlled Infrastructure
Infrastructure as Code (IaC) manages and provisions data platform infrastructure through machine-readable configuration files, enabling version control, review, and automated deployment.
Why IaC Matters
Problems with Manual Provisioning:
- Configuration drift
- Undocumented changes
- Deployment inconsistencies
IaC Benefits:
- Reproducible infrastructure: every environment is identical
- Version control: track changes over time
- Automated deployment: reduce human error
- Auditable: clear record of what was deployed
Key Insight: IaC ensures every environment is identical, reproducible, and auditable.
Architecture Overview
Terraform for Data Platforms
IaC declares infrastructure resources in configuration files. The IaC tool (Terraform, Pulumi, CloudFormation) then compares the declared state with what actually exists and creates, updates, or destroys resources until the two match.
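The reconcile step described above can be illustrated with a small Python sketch. It is not how Terraform or Pulumi are implemented (real tools also track state files and resource dependencies); the `plan` function and the sample warehouse dictionaries are hypothetical names chosen for illustration.

```python
from typing import Any

# Resource name -> attributes. Both structures are illustrative assumptions.
Resources = dict[str, dict[str, Any]]


def plan(desired: Resources, actual: Resources) -> dict[str, list[str]]:
    """Compare declared and live resources and list the actions needed."""
    create = sorted(name for name in desired if name not in actual)
    delete = sorted(name for name in actual if name not in desired)
    update = sorted(
        name
        for name in desired
        if name in actual and desired[name] != actual[name]
    )
    return {"create": create, "update": update, "delete": delete}


# Declared state, as it might appear in configuration files
desired = {
    "ANALYTICS_WH_DEV": {"size": "medium", "auto_suspend": 300},
    "ETL_WH_DEV": {"size": "large", "auto_suspend": 600},
}

# Live state, including one hand-edited and one undeclared warehouse
actual = {
    "ANALYTICS_WH_DEV": {"size": "small", "auto_suspend": 300},
    "LEGACY_WH_DEV": {"size": "x-small", "auto_suspend": 60},
}

actions = plan(desired, actual)
print(actions)
```

The hand-edited warehouse shows up as an update and the undeclared one as a delete, which is the same comparison that surfaces configuration drift.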
# main.tf - Snowflake data warehouse infrastructure
terraform {
required_version = ">= 1.5.0"
required_providers {
snowflake = {
source = "Snowflake-Labs/snowflake"
version = "~> 0.89"
}
}
backend "s3" {
bucket = "terraform-state-data-platform"
key = "snowflake/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}
provider "snowflake" {
account = var.snowflake_account
username = var.snowflake_username
password = var.snowflake_password
role = "ACCOUNTADMIN"
}
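# Variables
variable "snowflake_account" {
type = string
description = "Snowflake account identifier"
}
# The provider block references these two variables, so they must be declared
variable "snowflake_username" {
type = string
description = "Snowflake user that Terraform authenticates as"
}
variable "snowflake_password" {
type = string
description = "Password for the Terraform user; supply via TF_VAR_snowflake_password or a secret manager"
sensitive = true
}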
variable "environment" {
type = string
description = "Deployment environment"
default = "dev"
}
variable "warehouses" {
type = map(object({
size = string
auto_suspend = number
min_clusters = number
max_clusters = number
}))
default = {
analytics = {
size = "medium"
auto_suspend = 300
min_clusters = 1
max_clusters = 4
}
etl = {
size = "large"
auto_suspend = 600
min_clusters = 2
max_clusters = 6
}
adhoc = {
size = "x-small"
auto_suspend = 60
min_clusters = 1
max_clusters = 1
}
}
}
# Database
resource "snowflake_database" "analytics" {
name = "ANALYTICS_${upper(var.environment)}"
comment = "Analytics database for ${var.environment}"
}
# Schema
resource "snowflake_schema" "marts" {
database = snowflake_database.analytics.name
name = "MARTS"
comment = "Business-ready data marts"
}
resource "snowflake_schema" "staging" {
database = snowflake_database.analytics.name
name = "STAGING"
comment = "Staging area for transformations"
}
# Virtual Warehouses
resource "snowflake_warehouse" "warehouses" {
for_each = var.warehouses
name = "${upper(each.key)}_WH_${upper(var.environment)}"
comment = "${each.key} warehouse for ${var.environment}"
warehouse_size = each.value.size
auto_suspend = each.value.auto_suspend
auto_resume = true
min_cluster_count = each.value.min_clusters
max_cluster_count = each.value.max_clusters
scaling_policy = "ECONOMY"
}
# Roles
resource "snowflake_role" "analyst" {
name = "DATA_ANALYST_${upper(var.environment)}"
comment = "Data analyst role"
}
resource "snowflake_role" "engineer" {
name = "DATA_ENGINEER_${upper(var.environment)}"
comment = "Data engineer role"
}
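# Grants
# Attach the analyst role to SYSADMIN so it sits in the standard role hierarchy
resource "snowflake_grant_account_role" "analyst_to_sysadmin" {
role_name = snowflake_role.analyst.name
parent_role_name = "SYSADMIN"
}
# Privilege grants use snowflake_grant_privileges_to_account_role.
# Argument names can differ between provider releases; check them against the pinned version.
resource "snowflake_grant_privileges_to_account_role" "analyst_analytics" {
account_role_name = snowflake_role.analyst.name
privileges = ["USAGE"]
on_account_object {
object_type = "DATABASE"
object_name = snowflake_database.analytics.name
}
}
resource "snowflake_grant_privileges_to_account_role" "analyst_marts" {
account_role_name = snowflake_role.analyst.name
privileges = ["USAGE"]
on_schema {
schema_name = "\"${snowflake_database.analytics.name}\".\"${snowflake_schema.marts.name}\""
}
}
# SELECT on all existing tables in the MARTS schema
resource "snowflake_grant_privileges_to_account_role" "analyst_select" {
account_role_name = snowflake_role.analyst.name
privileges = ["SELECT"]
on_schema_object {
all {
object_type_plural = "TABLES"
in_schema = "\"${snowflake_database.analytics.name}\".\"${snowflake_schema.marts.name}\""
}
}
}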
# Outputs
output "database_name" {
value = snowflake_database.analytics.name
}
output "warehouse_names" {
value = { for k, v in snowflake_warehouse.warehouses : k => v.name }
}
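To see what the naming expressions in the configuration above produce for each environment, the interpolation can be mirrored in a few lines of Python. This is only an illustration of the naming convention, not part of the Terraform workflow; the helper names `warehouse_name` and `database_name` are hypothetical.

```python
def warehouse_name(key: str, environment: str) -> str:
    """Mirror of "${upper(each.key)}_WH_${upper(var.environment)}"."""
    return f"{key.upper()}_WH_{environment.upper()}"


def database_name(environment: str) -> str:
    """Mirror of "ANALYTICS_${upper(var.environment)}"."""
    return f"ANALYTICS_{environment.upper()}"


# Keys of the default value of var.warehouses
warehouse_keys = ["analytics", "etl", "adhoc"]

for env in ("dev", "prod"):
    names = [warehouse_name(key, env) for key in warehouse_keys]
    print(env, database_name(env), names)
```

Because the environment is the only input that changes, dev and prod end up with the same set of objects under predictable, non-colliding names.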
Terraform for S3 Data Lake
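# s3_data_lake.tf
# Assumes var.environment is declared elsewhere in the same module (for example in main.tf)
variable "aws_account_id" {
type = string
description = "AWS account ID, used to keep the bucket name globally unique"
}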
resource "aws_s3_bucket" "data_lake" {
bucket = "data-lake-${var.environment}-${var.aws_account_id}"
tags = {
Environment = var.environment
ManagedBy = "Terraform"
Project = "data-platform"
}
}
resource "aws_s3_bucket_versioning" "data_lake" {
bucket = aws_s3_bucket.data_lake.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
bucket = aws_s3_bucket.data_lake.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.data_lake.arn
}
}
}
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
bucket = aws_s3_bucket.data_lake.id
rule {
id = "tier-to-warm"
status = "Enabled"
filter {
prefix = "raw/"
}
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER"
}
transition {
days = 365
storage_class = "DEEP_ARCHIVE"
}
}
}
resource "aws_s3_bucket_policy" "data_lake" {
bucket = aws_s3_bucket.data_lake.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "EnforceTLS"
Effect = "Deny"
Principal = "*"
Action = "s3:*"
Resource = [
aws_s3_bucket.data_lake.arn,
"${aws_s3_bucket.data_lake.arn}/*"
]
Condition = {
Bool = {
"aws:SecureTransport" = "false"
}
}
},
{
Sid = "EnforceEncryption"
Effect = "Deny"
Principal = "*"
Action = "s3:PutObject"
Resource = "${aws_s3_bucket.data_lake.arn}/*"
Condition = {
StringNotEquals = {
"s3:x-amz-server-side-encryption" = "aws:kms"
}
}
}
]
})
}
# KMS key for encryption
resource "aws_kms_key" "data_lake" {
description = "KMS key for data lake encryption"
deletion_window_in_days = 30
enable_key_rotation = true
}
resource "aws_kms_alias" "data_lake" {
name = "alias/data-lake-${var.environment}"
target_key_id = aws_kms_key.data_lake.key_id
}
# Glue Catalog
resource "aws_glue_catalog_database" "analytics" {
name = "analytics_${var.environment}"
}
resource "aws_glue_catalog_table" "orders" {
name = "orders"
database_name = aws_glue_catalog_database.analytics.name
table_type = "EXTERNAL_TABLE"
parameters = {
"classification" = "parquet"
"parquet.compression" = "SNAPPY"
}
storage_descriptor {
location = "s3://${aws_s3_bucket.data_lake.bucket}/silver/orders/"
input_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
ser_de_info {
name = "parquet"
serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
}
columns {
name = "order_id"
type = "string"
}
columns {
name = "customer_id"
type = "string"
}
columns {
name = "order_date"
type = "date"
}
columns {
name = "total_amount"
type = "decimal(14,2)"
}
}
partition_keys {
name = "order_year"
type = "int"
}
partition_keys {
name = "order_month"
type = "int"
}
}
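The lifecycle rule in the configuration above moves objects under raw/ through progressively colder storage classes. A rough Python sketch of the intended tiering follows; it assumes objects start in STANDARD and ignores the fact that S3 applies transitions asynchronously, so actual timing can lag. The function name `storage_class_for_age` is hypothetical.

```python
# (minimum age in days, storage class), mirroring the transition blocks
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]


def storage_class_for_age(age_days: int, under_raw_prefix: bool = True) -> str:
    """Return the storage class the rule is expected to have reached."""
    if not under_raw_prefix:
        # The rule filters on the raw/ prefix; other objects are untouched
        return "STANDARD"
    current = "STANDARD"
    for min_age, storage_class in TRANSITIONS:
        if age_days >= min_age:
            current = storage_class
    return current


for age in (10, 45, 120, 400):
    print(age, storage_class_for_age(age))
```

Objects outside raw/ (for example the silver/orders/ location used by the Glue table) are not covered by this rule and stay in their original class.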
Pulumi for Data Platform
# pulumi_data_platform.py
import pulumi
import pulumi_aws as aws
import pulumi_snowflake as snowflake
# Configuration
config = pulumi.Config()
environment = config.get("environment") or "dev"
# Snowflake Database
analytics_db = snowflake.Database("analytics_db",
name=f"ANALYTICS_{environment.upper()}",
comment=f"Analytics database for {environment}"
)
# Snowflake Schemas
marts_schema = snowflake.Schema("marts_schema",
database=analytics_db.name,
name="MARTS",
comment="Business-ready data marts"
)
staging_schema = snowflake.Schema("staging_schema",
database=analytics_db.name,
name="STAGING",
comment="Staging area"
)
# Virtual Warehouses
warehouses = {}
for name, config_dict in [
    ("analytics", {"size": "medium", "auto_suspend": 300}),
    ("etl", {"size": "large", "auto_suspend": 600}),
    ("adhoc", {"size": "x-small", "auto_suspend": 60}),
]:
    # One Warehouse resource per entry, named like the Terraform version
    warehouses[name] = snowflake.Warehouse(
        f"{name}_wh",
        name=f"{name.upper()}_WH_{environment.upper()}",
        warehouse_size=config_dict["size"],
        auto_suspend=config_dict["auto_suspend"],
        auto_resume=True,
        scaling_policy="ECONOMY",
    )
# S3 Bucket for Data Lake
data_lake_bucket = aws.s3.Bucket("data_lake",
bucket=f"data-lake-{environment}",
tags={
"Environment": environment,
"ManagedBy": "Pulumi"
}
)
# Enable versioning
aws.s3.BucketVersioning("data_lake_versioning",
bucket=data_lake_bucket.id,
versioning_configuration={
"status": "Enabled"
}
)
# KMS Key
kms_key = aws.kms.Key("data_lake_key",
description="KMS key for data lake",
enable_key_rotation=True
)
# Export outputs
pulumi.export("database_name", analytics_db.name)
pulumi.export("bucket_name", data_lake_bucket.bucket)
pulumi.export("warehouses", {k: v.name for k, v in warehouses.items()})
Key Concepts Summary
| Concept | Description | Tool | Use Case |
|---|---|---|---|
| Declarative Config | Define desired state | Terraform, Pulumi | All infrastructure |
| State Management | Track resource state | S3+DynamoDB, Terraform Cloud | Multi-user collaboration |
| Drift Detection | Detect manual changes | Terraform Cloud, Spacelift | Compliance |
| Module Reuse | Share common patterns | Terraform Modules | Multi-environment |
| Policy as Code | Enforce standards | Sentinel, OPA, Checkov | Governance |
| Secret Management | Secure credentials | Vault, AWS Secrets Manager | Security |
| Cost Estimation | Predict costs before apply | Infracost, Terraform Cloud | Budget control |
| Multi-Cloud | Deploy across providers | Terraform, Pulumi | Vendor flexibility |
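As a rough illustration of the policy-as-code row in the table above, the following Python sketch checks that every resource carries a required set of tags. Tools such as Sentinel, OPA, and Checkov have their own policy languages and input formats; the `address` and `tags` fields and the `REQUIRED_TAGS` set here are assumptions made for the example, with tag names borrowed from the S3 bucket definition.

```python
# Tag keys every resource is expected to carry (assumed policy)
REQUIRED_TAGS = {"Environment", "ManagedBy", "Project"}


def missing_tags(resources: list[dict]) -> dict[str, list[str]]:
    """Map each non-compliant resource address to the tags it lacks."""
    violations: dict[str, list[str]] = {}
    for resource in resources:
        tags = resource.get("tags") or {}
        missing = sorted(REQUIRED_TAGS - set(tags))
        if missing:
            violations[resource["address"]] = missing
    return violations


# Hypothetical inventory of planned resources
resources = [
    {
        "address": "aws_s3_bucket.data_lake",
        "tags": {
            "Environment": "dev",
            "ManagedBy": "Terraform",
            "Project": "data-platform",
        },
    },
    {"address": "aws_kms_key.data_lake", "tags": {"Environment": "dev"}},
]

violations = missing_tags(resources)
print(violations)
```

A CI job could fail the build whenever the returned mapping is non-empty, which turns the tagging standard into an enforced check instead of a convention.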
Performance Metrics
| Metric | Manual Provisioning | IaC | Improvement |
|---|---|---|---|
| New Environment Setup | Hours-Days | Minutes | 10-50x |
| Configuration Consistency | 60-80% | 99-100% | +20-40% |
| Drift Incidents | Monthly | Rare | -90% |
| Deployment Rollback | Hours | Minutes | 10-20x |
| Documentation Currency | Outdated | Always current | 100% |
| Cost Visibility | None | Per-resource | Full |
| Audit Trail | None | Git history | Complete |
| Cross-Team Reuse | Manual copy | Module registry | Automated |
10 Best Practices
- Store all IaC in version control: Git provides an audit trail and rollback capability
- Use remote state with locking: S3+DynamoDB or Terraform Cloud for team collaboration
- Implement policy as code: enforce naming, tagging, and security standards automatically
- Use modules for reuse: create reusable modules for warehouses, databases, and networking
- Plan before apply: always review terraform plan output before deploying changes
- Separate environments: use workspaces or separate state files for dev/staging/prod
- Implement drift detection: alert on manual changes made outside IaC
- Tag all resources: enable cost allocation and ownership tracking
- Use secret management: never commit credentials to Git; use Vault or a cloud secrets manager such as AWS Secrets Manager
- Test IaC changes: run terraform validate and terraform plan in CI before apply
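The plan-before-apply and CI testing practices above can be partly automated by inspecting the machine-readable plan. The sketch below assumes a plan was saved with terraform plan -out=tfplan and converted with terraform show -json tfplan, and that the result was loaded into a dictionary (for example with json.load). It reads only the resource_changes entries and their change.actions lists; the sample plan is hand-written for illustration, and the function name `destructive_changes` is hypothetical.

```python
def destructive_changes(plan: dict) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        # A replacement appears as both "delete" and "create"
        if "delete" in actions:
            flagged.append(resource_change["address"])
    return flagged


# Hand-written sample in the shape of Terraform's JSON plan output
sample_plan = {
    "resource_changes": [
        {
            "address": "snowflake_schema.staging",
            "change": {"actions": ["no-op"]},
        },
        {
            "address": 'snowflake_warehouse.warehouses["etl"]',
            "change": {"actions": ["update"]},
        },
        {
            "address": "aws_s3_bucket.data_lake",
            "change": {"actions": ["delete", "create"]},
        },
    ]
}

flagged = destructive_changes(sample_plan)
print(flagged)
```

A pipeline might require manual approval whenever the list is non-empty, so that replacing a stateful resource such as the data lake bucket never happens unreviewed.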
Key Takeaways
- IaC enables reproducible, version-controlled infrastructure provisioning
- Terraform and Pulumi provide cloud-agnostic infrastructure management
- Remote state with locking enables team collaboration without conflicts
- Policy as code enforces governance standards automatically
- IaC reduces provisioning time from hours to minutes while improving consistency
See Also
- CI/CD for Data Pipelines: GitHub Actions for IaC deployment
- Snowflake Fundamentals: Terraform provider for Snowflake
- Cost Optimization: Cost estimation with Infracost
- Data Security & Compliance: Security policies as code
- Data Mesh Architecture: Self-serve platform provisioning
- Capstone: End-to-End: Terraform infrastructure for capstone project