πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Data Lake Architecture: Designing Scalable Storage Systems

Module 3: Data Warehouses & StorageData Lake Design🟒 Free Lesson

Advertisement

Data Lake: The Foundation of Modern Data Platforms

A data lake is a centralized repository that stores structured, semi-structured, and unstructured data at any scale.

Why Data Lakes Matter


Data Types Not Handled by Warehouses:

  • Logs
  • JSON events
  • Images
  • IoT streams

Data Lake Benefits:

  1. Cost-effective storage β€” for all data types
  2. Exploration β€” raw data available for ad-hoc analysis
  3. Machine learning β€” access to unstructured data
  4. Advanced analytics β€” capabilities warehouses cannot support

Key Insight: Unlike data warehouses, data lakes store raw data in its native format and apply schema at query time.


Architecture Overview

Medallion Architecture

Medallion Architecture: Bronze, Silver, GoldBronze LayerRaw ingestion (as-is)Append-only, immutableSchema-on-readFull audit trailSilver LayerCleansed + validatedConformed schemasDeduplicationEnrichment joinsGold LayerBusiness-level aggregatesBI-ready, query-optimizedStar schema / cubesML feature storecleanseaggregate

Schema-on-Read vs Schema-on-Write

Schema-on-Read vs Schema-on-WriteSchema-on-Read (Data Lake)Write raw data, define schema when readingFlexible: any format (JSON, CSV, Parquet)Pros: fast ingest, schema evolutionCons: slower queries, data swamp riskTools: Spark, Presto, AthenaSchema-on-Write (Data Warehouse)Define schema, transform before writingStructured: Parquet, ORC, columnarPros: fast queries, data qualityCons: slower ingest, rigid schemaTools: Snowflake, BigQuery, Redshift

Schema-on-Read vs. Schema-on-Write

Schema-on-read applies structure to data at query time rather than write time. Raw data is stored in its native format and the schema is defined when the data is read by a query engine.

Schema-on-write applies structure when data is written. The data is validated, transformed, and stored in a predefined schema before it can be queried.

AspectSchema-on-ReadSchema-on-Write
Write SpeedFast (no validation)Slow (validation + transform)
Read SpeedSlow (parse at read)Fast (pre-structured)
FlexibilityHigh (any format)Low (rigid schema)
Data QualityLow (raw)High (validated)
Storage CostLow (raw format)High (processed + raw)
Query ComplexityHigh (user defines schema)Low (pre-defined)
Use CaseExploration, ML, logsReporting, dashboards

Medallion Architecture

The Medallion Architecture organizes data into three progressively refined layers (Bronze -> Silver -> Gold) within a data lake. Each layer applies increasing levels of data quality, transformation, and governance.

# Medallion Architecture Implementation with PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder \
    .appName("MedallionArchitecture") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.SparkSessionExtensions") \
    .getOrCreate()

# ============================================================
# BRONZE LAYER: Raw Ingestion
# ============================================================
def ingest_to_bronze(spark, source_path, bronze_path):
    """
    Ingest raw data into Bronze layer.
    No transformations β€” preserve original format.
    """
    raw_df = spark.read \
        .format("json") \
        .option("multiLine", "true") \
        .load(source_path)

    raw_df.withColumn("_ingestion_timestamp", current_timestamp()) \
        .withColumn("_source_file", input_file_name()) \
        .withColumn("_load_date", current_date()) \
        .write \
        .format("parquet") \
        .mode("append") \
        .partitionBy("_load_date") \
        .save(bronze_path)

    print(f"Bronze: Ingested {raw_df.count()} records")

# ============================================================
# SILVER LAYER: Cleaned & Conformed
# ============================================================
def transform_to_silver(spark, bronze_path, silver_path):
    """
    Clean, deduplicate, validate, and enrich Bronze data.
    """
    bronze_df = spark.read.format("parquet").load(bronze_path)

    silver_df = bronze_df \
        .dropDuplicates(["order_id"]) \
        .filter(col("order_id").isNotNull()) \
        .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd")) \
        .withColumn("amount", col("amount").cast(DecimalType(12, 2))) \
        .withColumn("customer_id", col("customer_id").cast(IntegerType())) \
        .withColumn("order_status",
            when(col("status").isin("complete", "completed"), "completed")
            .when(col("status").isin("pending", "processing"), "pending")
            .when(col("status").isin("cancel", "cancelled"), "cancelled")
            .otherwise("unknown")
        ) \
        .withColumn("_cleaned_timestamp", current_timestamp()) \
        .withColumn("_quality_score",
            when(col("email").isNotNull() & col("address").isNotNull(), 1.0)
            .when(col("email").isNotNull(), 0.7)
            .otherwise(0.3)
        )

    silver_df.write \
        .format("parquet") \
        .mode("overwrite") \
        .partitionBy("order_date") \
        .save(silver_path)

    print(f"Silver: Processed {silver_df.count()} records")

# ============================================================
# GOLD LAYER: Business-Ready Aggregations
# ============================================================
def aggregate_to_gold(spark, silver_path, gold_path):
    """
    Create business-ready aggregations and dimensions.
    """
    silver_df = spark.read.format("parquet").load(silver_path)

    # Daily revenue aggregation
    daily_metrics = silver_df \
        .groupBy("order_date", "order_status") \
        .agg(
            count("*").alias("order_count"),
            sum("amount").alias("total_revenue"),
            avg("amount").alias("avg_order_value"),
            countDistinct("customer_id").alias("unique_customers")
        )

    daily_metrics.write \
        .format("parquet") \
        .mode("overwrite") \
        .save(f"{gold_path}/daily_metrics")

    # Customer dimension
    customer_dim = silver_df \
        .groupBy("customer_id") \
        .agg(
            min("order_date").alias("first_order_date"),
            max("order_date").alias("last_order_date"),
            count("*").alias("total_orders"),
            sum("amount").alias("lifetime_value")
        ) \
        .withColumn("customer_tenure_days",
            datediff(current_date(), col("first_order_date"))
        ) \
        .withColumn("customer_segment",
            when(col("lifetime_value") > 10000, "enterprise")
            .when(col("lifetime_value") > 1000, "mid_market")
            .otherwise("smb")
        )

    customer_dim.write \
        .format("parquet") \
        .mode("overwrite") \
        .save(f"{gold_path}/dim_customers")

    print(f"Gold: Created daily_metrics and dim_customers")

Data Swamp Prevention

A data swamp is a data lake that has become unusable due to lack of governance, metadata, quality controls, and documentation. Data accumulates without organization, making discovery and trust impossible.

Prevention StrategyImplementationImpact
Data CatalogTag all datasets with metadataDiscoverability
Quality MonitoringAutomated freshness/completeness checksTrust
Access ControlsRole-based access to lake layersSecurity
Naming ConventionsStandardized naming for files/tablesOrganization
Retention PoliciesAuto-delete expired dataCost control
Lineage TrackingTrack data flow from source to consumerTransparency
Cost TaggingLabel resources by team/projectAccountability
# Data Quality Monitoring with Great Expectations
import great_expectations as gx
from great_expectations.core import ExpectationSuite

context = gx.get_context()

# Define quality expectations for Bronze layer
bronze_suite = ExpectationSuite(expectation_suite_name="bronze_orders")

bronze_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
bronze_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)
bronze_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="amount", min_value=0, max_value=1000000
    )
)
bronze_suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=1000, max_value=10000000
    )
)

# Run validation
result = context.run_checkpoint(
    checkpoint_name="bronze_checkpoint",
    batch_request={
        "datasource_name": "s3_datasource",
        "data_asset_name": "orders",
        "options": {"path": "s3://data-lake/bronze/orders/"},
    },
    suite=bronze_suite
)

print(f"Success: {result.success}")
print(f"Statistics: {result.statistics}")

Data Lake File Formats

Columnar file formats (Parquet, ORC) store data by column rather than by row, enabling efficient compression and selective column retrieval. Each column is stored contiguously, allowing analytics queries to read only the columns they need.

FormatCompressionSchema EvolutionColumn PruningEcosystem Support
ParquetSnappy/ZstdVia metadataYes (predicate pushdown)Universal
ORCZlib/SnappyVia metadataYes (bloom filters)Hive, Spark
AvroSnappy/DeflateSchema resolutionNo (row-based)Kafka, Flink
JSONNone/GzipN/ANoUniversal
CSVNone/GzipN/ANoUniversal
# File format comparison and selection
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FileFormats").getOrCreate()

# Read different formats
parquet_df = spark.read.format("parquet").load("s3://data-lake/events.parquet/")
orc_df = spark.read.format("orc").load("s3://data-lake/events.orc/")
json_df = spark.read.format("json").load("s3://data-lake/events.json/")

# Write optimized Parquet
parquet_df.write \
    .format("parquet") \
    .option("compression", "snappy") \
    .option("parquet.block.size", 128 * 1024 * 1024) \  # 128 MB blocks
    .partitionBy("event_date") \
    .save("s3://data-lake/optimized-events/")

# Schema evolution in Parquet
spark.read \
    .option("mergeSchema", "true") \
    .format("parquet") \
    .load("s3://data-lake/evolved-events/")

Data Lake Security Patterns

Data lake security encompasses encryption, access control, audit logging, and data masking across all layers. Defense-in-depth applies security at network, storage, and application levels.

Security LayerImplementationTool/Technology
NetworkVPC, PrivateLink, FirewallsAWS VPC, Security Groups
StorageEncryption at rest (AES-256)KMS, CloudHSM
AccessRBAC/ABAC policiesIAM, Lake Formation
DataColumn-level maskingDynamic data masking
AuditCloudTrail, access logsCloudTrail, custom
ComplianceRetention policies, GDPRLifecycle policies
-- AWS Lake Formation: Grant table permissions
-- Grant SELECT on specific columns
GRANT SELECT ON TABLE analytics.events (event_id, event_type, event_date)
    TO IAM_ROLE 'arn:aws:iam::123456789:role/analyst-role';

-- Grant INSERT on staging schema
GRANT INSERT ON SCHEMA analytics.staging
    TO IAM_ROLE 'arn:aws:iam::123456789:role/etl-role';

-- Row-level security
CREATE ROW ACCESS POLICY events_policy ON analytics.events
    AS (event_type VARCHAR) RETURNS BOOLEAN ->
        CASE
            WHEN CURRENT_ROLE() = 'admin' THEN TRUE
            WHEN CURRENT_ROLE() = 'analyst' AND event_type IN ('click', 'view') THEN TRUE
            ELSE FALSE
        END;

Key Concepts Summary

ConceptDescriptionCost ($/GB/month)When to Use
Bronze LayerRaw, immutable ingestion$0.023All incoming data
Silver LayerCleaned, validated, conformed$0.03Queryable by analysts
Gold LayerBusiness-ready aggregations$0.04Dashboards, reports
Schema-on-ReadStructure at query timeMinimalExploration, ML
Schema-on-WriteStructure at write timeModerateReporting, BI
Data SwampUnusable, ungoverned lakeCost centerNever (prevent)
Data CatalogMetadata repository$0.001All layers
LineageData flow tracking$0.0001Compliance
PartitioningData segmentationMinimalAll large datasets
CompactionFile size optimizationI/O savingsFrequent access

Performance Metrics

MetricBronzeSilverGold
Storage Cost/GB/month0.023∣0.023 |0.03$0.04
Query LatencyHigh (raw)MediumLow (optimized)
Data FreshnessReal-time5-60 minHourly-Daily
Schema FlexibilityHighMediumLow
Data QualityNoneHighVery High
ConcurrencyLowMediumHigh
ML ReadinessHighHighLow
BI ReadinessLowMediumHigh
ComplianceMinimalModerateHigh
DiscoverabilityLowMediumHigh

10 Best Practices

  1. Enforce the Medallion Architecture β€” Bronze for raw, Silver for clean, Gold for business-ready
  2. Implement automated data quality checks at each layer transition
  3. Use a data catalog (AWS Glue, DataHub, Amundsen) to tag all datasets
  4. Partition data by access pattern β€” date partitioning for time-series, entity partitioning for lookups
  5. Implement lifecycle policies β€” auto-expire Bronze data after Silver ingestion
  6. Use columnar formats (Parquet, ORC) for all analytical workloads
  7. Enforce naming conventions β€” bronze_source_table_date, silver_source_table
  8. Track data lineage from source to consumer for compliance and debugging
  9. Implement access controls β€” separate read/write permissions per layer
  10. Monitor costs with resource tagging by team, project, and data domain

  • The Medallion Architecture (Bronze -> Silver -> Gold) provides a governed data lifecycle
  • Schema-on-read enables flexibility but requires quality controls to prevent data swamps
  • Storage tiering (raw -> processed -> serving) balances cost against query performance
  • Data catalogs and lineage tracking are essential for preventing data swamps
  • Automated quality checks at each layer transition maintain data trustworthiness

See Also

⭐

Premium Content

Data Lake Architecture: Designing Scalable Storage Systems

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement