Data Lake: The Foundation of Modern Data Platforms
A data lake is a centralized repository that stores structured, semi-structured, and unstructured data at any scale.
Why Data Lakes Matter
Data Types Not Handled by Warehouses:
- Logs
- JSON events
- Images
- IoT streams
Data Lake Benefits:
- Cost-effective storage β for all data types
- Exploration β raw data available for ad-hoc analysis
- Machine learning β access to unstructured data
- Advanced analytics β capabilities warehouses cannot support
Key Insight: Unlike data warehouses, data lakes store raw data in its native format and apply schema at query time.
Architecture Overview
Medallion Architecture
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs. Schema-on-Write
Schema-on-read applies structure to data at query time rather than write time. Raw data is stored in its native format and the schema is defined when the data is read by a query engine.
Schema-on-write applies structure when data is written. The data is validated, transformed, and stored in a predefined schema before it can be queried.
| Aspect | Schema-on-Read | Schema-on-Write |
|---|---|---|
| Write Speed | Fast (no validation) | Slow (validation + transform) |
| Read Speed | Slow (parse at read) | Fast (pre-structured) |
| Flexibility | High (any format) | Low (rigid schema) |
| Data Quality | Low (raw) | High (validated) |
| Storage Cost | Low (raw format) | High (processed + raw) |
| Query Complexity | High (user defines schema) | Low (pre-defined) |
| Use Case | Exploration, ML, logs | Reporting, dashboards |
Medallion Architecture
The Medallion Architecture organizes data into three progressively refined layers (Bronze -> Silver -> Gold) within a data lake. Each layer applies increasing levels of data quality, transformation, and governance.
# Medallion Architecture Implementation with PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder \
.appName("MedallionArchitecture") \
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.SparkSessionExtensions") \
.getOrCreate()
# ============================================================
# BRONZE LAYER: Raw Ingestion
# ============================================================
def ingest_to_bronze(spark, source_path, bronze_path):
"""
Ingest raw data into Bronze layer.
No transformations β preserve original format.
"""
raw_df = spark.read \
.format("json") \
.option("multiLine", "true") \
.load(source_path)
raw_df.withColumn("_ingestion_timestamp", current_timestamp()) \
.withColumn("_source_file", input_file_name()) \
.withColumn("_load_date", current_date()) \
.write \
.format("parquet") \
.mode("append") \
.partitionBy("_load_date") \
.save(bronze_path)
print(f"Bronze: Ingested {raw_df.count()} records")
# ============================================================
# SILVER LAYER: Cleaned & Conformed
# ============================================================
def transform_to_silver(spark, bronze_path, silver_path):
"""
Clean, deduplicate, validate, and enrich Bronze data.
"""
bronze_df = spark.read.format("parquet").load(bronze_path)
silver_df = bronze_df \
.dropDuplicates(["order_id"]) \
.filter(col("order_id").isNotNull()) \
.withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd")) \
.withColumn("amount", col("amount").cast(DecimalType(12, 2))) \
.withColumn("customer_id", col("customer_id").cast(IntegerType())) \
.withColumn("order_status",
when(col("status").isin("complete", "completed"), "completed")
.when(col("status").isin("pending", "processing"), "pending")
.when(col("status").isin("cancel", "cancelled"), "cancelled")
.otherwise("unknown")
) \
.withColumn("_cleaned_timestamp", current_timestamp()) \
.withColumn("_quality_score",
when(col("email").isNotNull() & col("address").isNotNull(), 1.0)
.when(col("email").isNotNull(), 0.7)
.otherwise(0.3)
)
silver_df.write \
.format("parquet") \
.mode("overwrite") \
.partitionBy("order_date") \
.save(silver_path)
print(f"Silver: Processed {silver_df.count()} records")
# ============================================================
# GOLD LAYER: Business-Ready Aggregations
# ============================================================
def aggregate_to_gold(spark, silver_path, gold_path):
"""
Create business-ready aggregations and dimensions.
"""
silver_df = spark.read.format("parquet").load(silver_path)
# Daily revenue aggregation
daily_metrics = silver_df \
.groupBy("order_date", "order_status") \
.agg(
count("*").alias("order_count"),
sum("amount").alias("total_revenue"),
avg("amount").alias("avg_order_value"),
countDistinct("customer_id").alias("unique_customers")
)
daily_metrics.write \
.format("parquet") \
.mode("overwrite") \
.save(f"{gold_path}/daily_metrics")
# Customer dimension
customer_dim = silver_df \
.groupBy("customer_id") \
.agg(
min("order_date").alias("first_order_date"),
max("order_date").alias("last_order_date"),
count("*").alias("total_orders"),
sum("amount").alias("lifetime_value")
) \
.withColumn("customer_tenure_days",
datediff(current_date(), col("first_order_date"))
) \
.withColumn("customer_segment",
when(col("lifetime_value") > 10000, "enterprise")
.when(col("lifetime_value") > 1000, "mid_market")
.otherwise("smb")
)
customer_dim.write \
.format("parquet") \
.mode("overwrite") \
.save(f"{gold_path}/dim_customers")
print(f"Gold: Created daily_metrics and dim_customers")
Data Swamp Prevention
A data swamp is a data lake that has become unusable due to lack of governance, metadata, quality controls, and documentation. Data accumulates without organization, making discovery and trust impossible.
| Prevention Strategy | Implementation | Impact |
|---|---|---|
| Data Catalog | Tag all datasets with metadata | Discoverability |
| Quality Monitoring | Automated freshness/completeness checks | Trust |
| Access Controls | Role-based access to lake layers | Security |
| Naming Conventions | Standardized naming for files/tables | Organization |
| Retention Policies | Auto-delete expired data | Cost control |
| Lineage Tracking | Track data flow from source to consumer | Transparency |
| Cost Tagging | Label resources by team/project | Accountability |
# Data Quality Monitoring with Great Expectations
import great_expectations as gx
from great_expectations.core import ExpectationSuite
context = gx.get_context()
# Define quality expectations for Bronze layer
bronze_suite = ExpectationSuite(expectation_suite_name="bronze_orders")
bronze_suite.add_expectation(
gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
bronze_suite.add_expectation(
gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)
bronze_suite.add_expectation(
gx.expectations.ExpectColumnValuesToBeBetween(
column="amount", min_value=0, max_value=1000000
)
)
bronze_suite.add_expectation(
gx.expectations.ExpectTableRowCountToBeBetween(
min_value=1000, max_value=10000000
)
)
# Run validation
result = context.run_checkpoint(
checkpoint_name="bronze_checkpoint",
batch_request={
"datasource_name": "s3_datasource",
"data_asset_name": "orders",
"options": {"path": "s3://data-lake/bronze/orders/"},
},
suite=bronze_suite
)
print(f"Success: {result.success}")
print(f"Statistics: {result.statistics}")
Data Lake File Formats
Columnar file formats (Parquet, ORC) store data by column rather than by row, enabling efficient compression and selective column retrieval. Each column is stored contiguously, allowing analytics queries to read only the columns they need.
| Format | Compression | Schema Evolution | Column Pruning | Ecosystem Support |
|---|---|---|---|---|
| Parquet | Snappy/Zstd | Via metadata | Yes (predicate pushdown) | Universal |
| ORC | Zlib/Snappy | Via metadata | Yes (bloom filters) | Hive, Spark |
| Avro | Snappy/Deflate | Schema resolution | No (row-based) | Kafka, Flink |
| JSON | None/Gzip | N/A | No | Universal |
| CSV | None/Gzip | N/A | No | Universal |
# File format comparison and selection
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FileFormats").getOrCreate()
# Read different formats
parquet_df = spark.read.format("parquet").load("s3://data-lake/events.parquet/")
orc_df = spark.read.format("orc").load("s3://data-lake/events.orc/")
json_df = spark.read.format("json").load("s3://data-lake/events.json/")
# Write optimized Parquet
parquet_df.write \
.format("parquet") \
.option("compression", "snappy") \
.option("parquet.block.size", 128 * 1024 * 1024) \ # 128 MB blocks
.partitionBy("event_date") \
.save("s3://data-lake/optimized-events/")
# Schema evolution in Parquet
spark.read \
.option("mergeSchema", "true") \
.format("parquet") \
.load("s3://data-lake/evolved-events/")
Data Lake Security Patterns
Data lake security encompasses encryption, access control, audit logging, and data masking across all layers. Defense-in-depth applies security at network, storage, and application levels.
| Security Layer | Implementation | Tool/Technology |
|---|---|---|
| Network | VPC, PrivateLink, Firewalls | AWS VPC, Security Groups |
| Storage | Encryption at rest (AES-256) | KMS, CloudHSM |
| Access | RBAC/ABAC policies | IAM, Lake Formation |
| Data | Column-level masking | Dynamic data masking |
| Audit | CloudTrail, access logs | CloudTrail, custom |
| Compliance | Retention policies, GDPR | Lifecycle policies |
-- AWS Lake Formation: Grant table permissions
-- Grant SELECT on specific columns
GRANT SELECT ON TABLE analytics.events (event_id, event_type, event_date)
TO IAM_ROLE 'arn:aws:iam::123456789:role/analyst-role';
-- Grant INSERT on staging schema
GRANT INSERT ON SCHEMA analytics.staging
TO IAM_ROLE 'arn:aws:iam::123456789:role/etl-role';
-- Row-level security
CREATE ROW ACCESS POLICY events_policy ON analytics.events
AS (event_type VARCHAR) RETURNS BOOLEAN ->
CASE
WHEN CURRENT_ROLE() = 'admin' THEN TRUE
WHEN CURRENT_ROLE() = 'analyst' AND event_type IN ('click', 'view') THEN TRUE
ELSE FALSE
END;
Key Concepts Summary
| Concept | Description | Cost ($/GB/month) | When to Use |
|---|---|---|---|
| Bronze Layer | Raw, immutable ingestion | $0.023 | All incoming data |
| Silver Layer | Cleaned, validated, conformed | $0.03 | Queryable by analysts |
| Gold Layer | Business-ready aggregations | $0.04 | Dashboards, reports |
| Schema-on-Read | Structure at query time | Minimal | Exploration, ML |
| Schema-on-Write | Structure at write time | Moderate | Reporting, BI |
| Data Swamp | Unusable, ungoverned lake | Cost center | Never (prevent) |
| Data Catalog | Metadata repository | $0.001 | All layers |
| Lineage | Data flow tracking | $0.0001 | Compliance |
| Partitioning | Data segmentation | Minimal | All large datasets |
| Compaction | File size optimization | I/O savings | Frequent access |
Performance Metrics
| Metric | Bronze | Silver | Gold |
|---|---|---|---|
| Storage Cost/GB/month | 0.03 | $0.04 | |
| Query Latency | High (raw) | Medium | Low (optimized) |
| Data Freshness | Real-time | 5-60 min | Hourly-Daily |
| Schema Flexibility | High | Medium | Low |
| Data Quality | None | High | Very High |
| Concurrency | Low | Medium | High |
| ML Readiness | High | High | Low |
| BI Readiness | Low | Medium | High |
| Compliance | Minimal | Moderate | High |
| Discoverability | Low | Medium | High |
10 Best Practices
- Enforce the Medallion Architecture β Bronze for raw, Silver for clean, Gold for business-ready
- Implement automated data quality checks at each layer transition
- Use a data catalog (AWS Glue, DataHub, Amundsen) to tag all datasets
- Partition data by access pattern β date partitioning for time-series, entity partitioning for lookups
- Implement lifecycle policies β auto-expire Bronze data after Silver ingestion
- Use columnar formats (Parquet, ORC) for all analytical workloads
- Enforce naming conventions β
bronze_source_table_date,silver_source_table - Track data lineage from source to consumer for compliance and debugging
- Implement access controls β separate read/write permissions per layer
- Monitor costs with resource tagging by team, project, and data domain
- The Medallion Architecture (Bronze -> Silver -> Gold) provides a governed data lifecycle
- Schema-on-read enables flexibility but requires quality controls to prevent data swamps
- Storage tiering (raw -> processed -> serving) balances cost against query performance
- Data catalogs and lineage tracking are essential for preventing data swamps
- Automated quality checks at each layer transition maintain data trustworthiness
See Also
- Delta Lake & Iceberg β Open table formats bringing ACID to data lakes
- Data Lakehouse β Unifying data lakes and warehouses
- Data Governance & Catalog β Metadata management and data discovery
- Data Security & Compliance β Encryption, access control, and GDPR compliance
- Cost Optimization β Storage tiering and lifecycle policies
- Performance Optimization β Query tuning and data format optimization