πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Data Lakehouse: Unifying Data Lakes and Data Warehouses

Module 3: Data Warehouses & StorageLakehouse Architecture🟒 Free Lesson

Advertisement

Data Lakehouse: The Best of Both Worlds

The data lakehouse combines the low-cost, flexible storage of data lakes with the ACID transactions, performance, and governance of data warehouses.

Why Data Lakehouse Matters


Historical Problem:

  • Separate data lakes (cheap storage, raw data)
  • Separate data warehouses (governed, performant queries)
  • Operational complexity and data duplication

Lakehouse Solution:

  • Single platform for both storage and governance
  • Reduced operational complexity
  • Lower costs through data deduplication
  • Maintains benefits of both architectures

Key Insight: The lakehouse eliminates the duality of data lakes and data warehouses, reducing operational complexity, data duplication, and cost while maintaining the benefits of both architectures.


Architecture Overview

Lakehouse Architecture Stack

Data Lakehouse Architecture StackGovernance Layer: Unity Catalog | Access Control | Data Lineage | Quality RulesCompute Layer: SQL Warehouse | All-Purpose Cluster | Jobs | ML RuntimeTable Format Layer: Delta Lake | Apache Iceberg | Apache Hudi | ParquetStorage Layer: S3 | Azure Blob Storage | Google Cloud Storage (GCS)

Lake vs Warehouse vs Lakehouse

Evolution: Lake, Warehouse, LakehouseData LakeCheap storage (S3)Any format (raw)Schema-on-readACID: NoGovernance: WeakCost: LowPerformance: VariableData WarehouseManaged storageStructured (SQL)Schema-on-writeACID: YesGovernance: StrongCost: HighPerformance: Fast (SQL)LakehouseCheap storage (S3)Open formats (Parquet)ACID on data lakeTime travelGovernance: UnifiedCost: Low-MediumPerformance: Best of both

Multi-Hop Architecture

The multi-hop architecture (also called Medallion) processes data through progressive refinement stages. Each hop applies specific transformations while maintaining data provenance and enabling reprocessing.

# Multi-Hop Lakehouse Implementation with Databricks
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder \
    .appName("LakehouseMultiHop") \
    .getOrCreate()

# HOP 1: RAW -> INGEST (Bronze)
def hop_raw_to_ingest(spark, source, target):
    """
    Ingest raw data with minimal transformation.
    Preserves original schema for traceability.
    """
    df = spark.read \
        .format("json") \
        .option("multiLine", "true") \
        .load(source)

    df.write \
        .format("delta") \
        .mode("append") \
        .option("mergeSchema", "true") \
        .partitionBy("_load_date") \
        .saveAsTable("bronze.raw_events")

    print(f"Bronze: {df.count()} records ingested")

# HOP 2: INGEST -> CLEAN (Silver)
def hop_ingest_to_clean(spark):
    """
    Clean, validate, deduplicate, and conform data.
    Apply data quality rules and type casting.
    """
    raw_df = spark.table("bronze.raw_events")

    clean_df = raw_df \
        .dropDuplicates(["event_id"]) \
        .filter(col("event_id").isNotNull()) \
        .withColumn("event_timestamp", to_timestamp(col("event_ts"))) \
        .withColumn("event_type",
            when(col("raw_type").isin("click", "tap"), "click")
            .when(col("raw_type").isin("view", "impression"), "view")
            .when(col("raw_type").isin("purchase", "buy"), "purchase")
            .otherwise("other")
        ) \
        .withColumn("user_id", col("uid").cast("long")) \
        .withColumn("properties", from_json(col("props_json"), MapType(StringType(), StringType()))) \
        .withColumn("_quality_check",
            when(col("user_id").isNotNull() & col("event_timestamp").isNotNull(), "PASS")
            .otherwise("FAIL")
        )

    clean_df.write \
        .format("delta") \
        .mode("overwrite") \
        .partitionBy("event_date") \
        .option("overwriteSchema", "true") \
        .saveAsTable("silver.clean_events")

    # Record quality metrics
    quality_metrics = clean_df.groupBy("_quality_check").count()
    quality_metrics.show()

# HOP 3: CLEAN -> CONFORM (Silver -> Gold)
def hop_clean_to_conform(spark):
    """
    Conform data into business domains with enrichment.
    Create dimensional model for analytics.
    """
    events = spark.table("silver.clean_events")
    users = spark.table("silver.clean_users")

    enriched = events \
        .join(users, events.user_id == users.user_id, "left") \
        .withColumn("session_duration",
            col("session_end") - col("session_start")
        ) \
        .withColumn("is_conversion",
            col("event_type") == "purchase"
        )

    # Daily aggregation
    daily_metrics = enriched \
        .groupBy("event_date", "event_type", "country") \
        .agg(
            count("*").alias("event_count"),
            countDistinct("user_id").alias("unique_users"),
            sum(when(col("is_conversion"), 1).otherwise(0)).alias("conversions"),
            avg("session_duration").alias("avg_session_duration")
        )

    daily_metrics.write \
        .format("delta") \
        .mode("overwrite") \
        .saveAsTable("gold.daily_event_metrics")

# HOP 4: CONFORM -> SERVE (Gold -> Platinum)
def hop_conform_to_serve(spark):
    """
    Serve pre-aggregated, BI-optimized datasets.
    Materialized for dashboard consumption.
    """
    daily = spark.table("gold.daily_event_metrics")

    # 30-day rolling metrics
    from pyspark.sql.window import Window
    window = Window.orderBy("event_date").rowsBetween(-29, 0)

    served = daily \
        .withColumn("rolling_30d_events", sum("event_count").over(window)) \
        .withColumn("rolling_30d_users", sum("unique_users").over(window)) \
        .withColumn("conversion_rate",
            col("conversions") / col("event_count")
        )

    served.write \
        .format("delta") \
        .mode("overwrite") \
        .saveAsTable("platinum.event_serving")

Unity Catalog

Unity Catalog is Databricks' unified governance solution for data and AI assets. It provides centralized access control, data lineage, data discovery, and auditing across all lakehouse resources.

-- Unity Catalog: Create catalog and schema
CREATE CATALOG IF NOT EXISTS analytics_catalog;
USE CATALOG analytics_catalog;

CREATE SCHEMA IF NOT EXISTS marketing
    COMMENT 'Marketing analytics datasets'
    WITH PROPERTIES ('owner' = 'marketing-team');

-- Create table with governance
CREATE TABLE marketing.campaign_performance (
    campaign_id     BIGINT,
    campaign_name   STRING,
    impressions     BIGINT,
    clicks          BIGINT,
    conversions     BIGINT,
    spend           DECIMAL(12,2),
    start_date      DATE,
    end_date        DATE
)
USING delta
COMMENT 'Daily campaign performance metrics'
TBLPROPERTIES (
    'quality' = 'gold',
    'owner' = 'marketing-team',
    'retention_days' = '365'
);

-- Row-level security
CREATE ROW ACCESS POLICY campaign_access
ON marketing.campaign_performance
AS (campaign_id BIGINT) ->
    CASE
        WHEN is_account_group_member('marketing_admin') THEN TRUE
        WHEN is_account_group_member('marketing_analyst')
             AND campaign_id IN (SELECT campaign_id FROM approved_campaigns) THEN TRUE
        ELSE FALSE
    END;

-- Column-level security
CREATE COLUMN MASKING POLICY email_mask
ON marketing.campaign_performance
AS (email STRING) ->
    CASE
        WHEN is_account_group_member('pii_viewers') THEN email
        ELSE CONCAT('***', SUBSTRING(email, POSITION('@' IN email)))
    END;

-- Data lineage (automatic with Unity Catalog)
SELECT * FROM system.access.column_lineage
WHERE target_table = 'marketing.campaign_performance';

-- Grant permissions
GRANT SELECT ON TABLE marketing.campaign_performance TO `marketing_analyst`;
GRANT MODIFY ON TABLE marketing.campaign_performance TO `marketing_engineer`;
GRANT USAGE ON SCHEMA marketing TO `data_scientist`;

Lakehouse vs. Warehouse vs. Lake

The choice between lakehouse, warehouse, and lake depends on data variety, query patterns, ML requirements, and cost constraints. Each architecture has distinct trade-offs in cost, performance, governance, and flexibility.

CriterionData LakeData WarehouseData Lakehouse
Data VarietyAll formatsStructured onlyAll formats
Cost (per GB)0.023∣0.023 |0.04-0.08$0.03
ACID TransactionsNoYesYes
Schema EnforcementSchema-on-readSchema-on-writeBoth
ML IntegrationNativeExternalNative
BI PerformancePoorExcellentGood-Excellent
GovernanceMinimalStrongStrong
Time TravelNoLimitedYes
StreamingLimitedLimitedNative
Open FormatYesVendor-dependentYes
Setup ComplexityLowMediumMedium-High
Best ForRaw storage, MLReporting, dashboardsUnified analytics
# Lakehouse architecture decision framework
def recommend_architecture(requirements: dict) -> str:
    """Recommend architecture based on requirements."""
    score = {
        "lake": 0,
        "warehouse": 0,
        "lakehouse": 0
    }

    if requirements.get("ml_workloads"):
        score["lake"] += 2
        score["lakehouse"] += 2

    if requirements.get("bi_dashboards"):
        score["warehouse"] += 3
        score["lakehouse"] += 2

    if requirements.get("streaming"):
        score["lakehouse"] += 2

    if requirements.get("structured_data_only"):
        score["warehouse"] += 2

    if requirements.get("cost_sensitive"):
        score["lake"] += 2
        score["lakehouse"] += 1

    if requirements.get("governance_required"):
        score["warehouse"] += 2
        score["lakehouse"] += 2

    return max(score, key=score.get)

# Example
recommendation = recommend_architecture({
    "ml_workloads": True,
    "bi_dashboards": True,
    "streaming": True,
    "governance_required": True
})
print(f"Recommended: {recommendation}")  # Output: lakehouse

Key Concepts Summary

ConceptDescriptionBenefitWhen to Use
LakehouseUnified lake + warehouseSingle platformAll data workloads
Multi-HopProgressive data refinementTraceabilityComplex pipelines
Unity CatalogCentralized governanceSecurity, lineageEnterprise deployments
Serverless SQLOn-demand computeNo cluster managementBI and ad-hoc queries
Delta LakeACID on data lakesTransaction safetyAll lakehouse tables
Auto LoaderIncremental file ingestionNear-real-timeStreaming + batch
Photon EngineVectorized query engine2-5x performanceInteractive queries
MLflowML lifecycle managementExperiment trackingML in production
Feature StoreFeature engineering platformFeature reuseML pipelines
Model ServingReal-time inferenceLow-latency predictionsOnline ML

Performance Metrics

MetricTraditional LakeLakehouseWarehouse
Storage Cost/GB/month0.023∣0.023 |0.03$0.04-0.08
Query Latency (1TB)300-600 sec60-180 sec30-120 sec
ACID TransactionsNoYesYes
Schema EvolutionManualEasyModerate
Concurrent WritesNoYesYes
ML IntegrationExternalNativeExternal
Streaming SupportLimitedNativeLimited
Time TravelNoYesYes
Open FormatYesYesVendor-dependent
EcosystemUniversalGrowingEstablished

10 Best Practices

  1. Use the Medallion Architecture consistently β€” Bronze (raw) -> Silver (clean) -> Gold (aggregated)
  2. Implement Unity Catalog from day one β€” retroactive governance is harder than proactive
  3. Enable Delta Lake auto-optimize for write-heavy workloads to prevent small files
  4. Use Auto Loader for incremental ingestion instead of manual file listing
  5. Separate compute by workload β€” SQL warehouses for BI, all-purpose for exploration, jobs for ETL
  6. Enable Photon engine for interactive queries β€” 2-5x performance improvement
  7. Use liquid clustering on high-cardinality columns for optimal data layout
  8. Implement data quality expectations at each Medallion hop transition
  9. Tag tables by quality level (bronze/silver/gold) for discovery and governance
  10. Monitor Unity Catalog lineage to understand data dependencies and impact analysis

  • The lakehouse unifies data lake storage with data warehouse governance on a single platform
  • Multi-hop architecture enables progressive data refinement with full traceability
  • Unity Catalog provides centralized security, lineage, and discovery across all assets
  • Delta Lake brings ACID transactions, time travel, and schema evolution to cloud storage
  • Serverless SQL and auto-scaling clusters eliminate infrastructure management

See Also

⭐

Premium Content

Data Lakehouse: Unifying Data Lakes and Data Warehouses

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement