Data Lakehouse: Unifying Data Lakes and Data Warehouses

Data Lakehouse: The Best of Both Worlds

The data lakehouse combines the low-cost, flexible storage of data lakes with the ACID transactions, performance, and governance of data warehouses.

Why Data Lakehouse Matters

Historical Problem:

Separate data lakes (cheap storage, raw data)
Separate data warehouses (governed, performant queries)
Operational complexity and data duplication

Lakehouse Solution:

Single platform for both storage and governance
Reduced operational complexity
Lower costs through data deduplication
Maintains benefits of both architectures

Key Insight: The lakehouse eliminates the duality of data lakes and data warehouses, reducing operational complexity, data duplication, and cost while maintaining the benefits of both architectures.

Architecture Overview

Lakehouse Architecture Stack

Lake vs Warehouse vs Lakehouse

Multi-Hop Architecture

The multi-hop architecture (also called Medallion) processes data through progressive refinement stages. Each hop applies specific transformations while maintaining data provenance and enabling reprocessing.

# Multi-Hop Lakehouse Implementation with Databricks
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder \
    .appName("LakehouseMultiHop") \
    .getOrCreate()

# HOP 1: RAW -> INGEST (Bronze)
def hop_raw_to_ingest(spark, source, target):
    """
    Ingest raw data with minimal transformation.
    Preserves original schema for traceability.
    """
    df = spark.read \
        .format("json") \
        .option("multiLine", "true") \
        .load(source)

    df.write \
        .format("delta") \
        .mode("append") \
        .option("mergeSchema", "true") \
        .partitionBy("_load_date") \
        .saveAsTable("bronze.raw_events")

    print(f"Bronze: {df.count()} records ingested")

# HOP 2: INGEST -> CLEAN (Silver)
def hop_ingest_to_clean(spark):
    """
    Clean, validate, deduplicate, and conform data.
    Apply data quality rules and type casting.
    """
    raw_df = spark.table("bronze.raw_events")

    clean_df = raw_df \
        .dropDuplicates(["event_id"]) \
        .filter(col("event_id").isNotNull()) \
        .withColumn("event_timestamp", to_timestamp(col("event_ts"))) \
        .withColumn("event_type",
            when(col("raw_type").isin("click", "tap"), "click")
            .when(col("raw_type").isin("view", "impression"), "view")
            .when(col("raw_type").isin("purchase", "buy"), "purchase")
            .otherwise("other")
        ) \
        .withColumn("user_id", col("uid").cast("long")) \
        .withColumn("properties", from_json(col("props_json"), MapType(StringType(), StringType()))) \
        .withColumn("_quality_check",
            when(col("user_id").isNotNull() & col("event_timestamp").isNotNull(), "PASS")
            .otherwise("FAIL")
        )

    clean_df.write \
        .format("delta") \
        .mode("overwrite") \
        .partitionBy("event_date") \
        .option("overwriteSchema", "true") \
        .saveAsTable("silver.clean_events")

    # Record quality metrics
    quality_metrics = clean_df.groupBy("_quality_check").count()
    quality_metrics.show()

# HOP 3: CLEAN -> CONFORM (Silver -> Gold)
def hop_clean_to_conform(spark):
    """
    Conform data into business domains with enrichment.
    Create dimensional model for analytics.
    """
    events = spark.table("silver.clean_events")
    users = spark.table("silver.clean_users")

    enriched = events \
        .join(users, events.user_id == users.user_id, "left") \
        .withColumn("session_duration",
            col("session_end") - col("session_start")
        ) \
        .withColumn("is_conversion",
            col("event_type") == "purchase"
        )

    # Daily aggregation
    daily_metrics = enriched \
        .groupBy("event_date", "event_type", "country") \
        .agg(
            count("*").alias("event_count"),
            countDistinct("user_id").alias("unique_users"),
            sum(when(col("is_conversion"), 1).otherwise(0)).alias("conversions"),
            avg("session_duration").alias("avg_session_duration")
        )

    daily_metrics.write \
        .format("delta") \
        .mode("overwrite") \
        .saveAsTable("gold.daily_event_metrics")

# HOP 4: CONFORM -> SERVE (Gold -> Platinum)
def hop_conform_to_serve(spark):
    """
    Serve pre-aggregated, BI-optimized datasets.
    Materialized for dashboard consumption.
    """
    daily = spark.table("gold.daily_event_metrics")

    # 30-day rolling metrics
    from pyspark.sql.window import Window
    window = Window.orderBy("event_date").rowsBetween(-29, 0)

    served = daily \
        .withColumn("rolling_30d_events", sum("event_count").over(window)) \
        .withColumn("rolling_30d_users", sum("unique_users").over(window)) \
        .withColumn("conversion_rate",
            col("conversions") / col("event_count")
        )

    served.write \
        .format("delta") \
        .mode("overwrite") \
        .saveAsTable("platinum.event_serving")

Unity Catalog

Unity Catalog is Databricks' unified governance solution for data and AI assets. It provides centralized access control, data lineage, data discovery, and auditing across all lakehouse resources.

-- Unity Catalog: Create catalog and schema
CREATE CATALOG IF NOT EXISTS analytics_catalog;
USE CATALOG analytics_catalog;

CREATE SCHEMA IF NOT EXISTS marketing
    COMMENT 'Marketing analytics datasets'
    WITH PROPERTIES ('owner' = 'marketing-team');

-- Create table with governance
CREATE TABLE marketing.campaign_performance (
    campaign_id     BIGINT,
    campaign_name   STRING,
    impressions     BIGINT,
    clicks          BIGINT,
    conversions     BIGINT,
    spend           DECIMAL(12,2),
    start_date      DATE,
    end_date        DATE
)
USING delta
COMMENT 'Daily campaign performance metrics'
TBLPROPERTIES (
    'quality' = 'gold',
    'owner' = 'marketing-team',
    'retention_days' = '365'
);

-- Row-level security
CREATE ROW ACCESS POLICY campaign_access
ON marketing.campaign_performance
AS (campaign_id BIGINT) ->
    CASE
        WHEN is_account_group_member('marketing_admin') THEN TRUE
        WHEN is_account_group_member('marketing_analyst')
             AND campaign_id IN (SELECT campaign_id FROM approved_campaigns) THEN TRUE
        ELSE FALSE
    END;

-- Column-level security
CREATE COLUMN MASKING POLICY email_mask
ON marketing.campaign_performance
AS (email STRING) ->
    CASE
        WHEN is_account_group_member('pii_viewers') THEN email
        ELSE CONCAT('***', SUBSTRING(email, POSITION('@' IN email)))
    END;

-- Data lineage (automatic with Unity Catalog)
SELECT * FROM system.access.column_lineage
WHERE target_table = 'marketing.campaign_performance';

-- Grant permissions
GRANT SELECT ON TABLE marketing.campaign_performance TO `marketing_analyst`;
GRANT MODIFY ON TABLE marketing.campaign_performance TO `marketing_engineer`;
GRANT USAGE ON SCHEMA marketing TO `data_scientist`;

Lakehouse vs. Warehouse vs. Lake

The choice between lakehouse, warehouse, and lake depends on data variety, query patterns, ML requirements, and cost constraints. Each architecture has distinct trade-offs in cost, performance, governance, and flexibility.

Criterion	Data Lake	Data Warehouse	Data Lakehouse
Data Variety	All formats	Structured only	All formats
Cost (per GB)	$0.023 \|$ 0.04-0.08	$0.03
ACID Transactions	No	Yes	Yes
Schema Enforcement	Schema-on-read	Schema-on-write	Both
ML Integration	Native	External	Native
BI Performance	Poor	Excellent	Good-Excellent
Governance	Minimal	Strong	Strong
Time Travel	No	Limited	Yes
Streaming	Limited	Limited	Native
Open Format	Yes	Vendor-dependent	Yes
Setup Complexity	Low	Medium	Medium-High
Best For	Raw storage, ML	Reporting, dashboards	Unified analytics

# Lakehouse architecture decision framework
def recommend_architecture(requirements: dict) -> str:
    """Recommend architecture based on requirements."""
    score = {
        "lake": 0,
        "warehouse": 0,
        "lakehouse": 0
    }

    if requirements.get("ml_workloads"):
        score["lake"] += 2
        score["lakehouse"] += 2

    if requirements.get("bi_dashboards"):
        score["warehouse"] += 3
        score["lakehouse"] += 2

    if requirements.get("streaming"):
        score["lakehouse"] += 2

    if requirements.get("structured_data_only"):
        score["warehouse"] += 2

    if requirements.get("cost_sensitive"):
        score["lake"] += 2
        score["lakehouse"] += 1

    if requirements.get("governance_required"):
        score["warehouse"] += 2
        score["lakehouse"] += 2

    return max(score, key=score.get)

# Example
recommendation = recommend_architecture({
    "ml_workloads": True,
    "bi_dashboards": True,
    "streaming": True,
    "governance_required": True
})
print(f"Recommended: {recommendation}")  # Output: lakehouse

Key Concepts Summary

Concept	Description	Benefit	When to Use
Lakehouse	Unified lake + warehouse	Single platform	All data workloads
Multi-Hop	Progressive data refinement	Traceability	Complex pipelines
Unity Catalog	Centralized governance	Security, lineage	Enterprise deployments
Serverless SQL	On-demand compute	No cluster management	BI and ad-hoc queries
Delta Lake	ACID on data lakes	Transaction safety	All lakehouse tables
Auto Loader	Incremental file ingestion	Near-real-time	Streaming + batch
Photon Engine	Vectorized query engine	2-5x performance	Interactive queries
MLflow	ML lifecycle management	Experiment tracking	ML in production
Feature Store	Feature engineering platform	Feature reuse	ML pipelines
Model Serving	Real-time inference	Low-latency predictions	Online ML

Performance Metrics

Metric	Traditional Lake	Lakehouse	Warehouse
Storage Cost/GB/month	$0.023 \|$ 0.03	$0.04-0.08
Query Latency (1TB)	300-600 sec	60-180 sec	30-120 sec
ACID Transactions	No	Yes	Yes
Schema Evolution	Manual	Easy	Moderate
Concurrent Writes	No	Yes	Yes
ML Integration	External	Native	External
Streaming Support	Limited	Native	Limited
Time Travel	No	Yes	Yes
Open Format	Yes	Yes	Vendor-dependent
Ecosystem	Universal	Growing	Established

10 Best Practices

Use the Medallion Architecture consistently — Bronze (raw) -> Silver (clean) -> Gold (aggregated)
Implement Unity Catalog from day one — retroactive governance is harder than proactive
Enable Delta Lake auto-optimize for write-heavy workloads to prevent small files
Use Auto Loader for incremental ingestion instead of manual file listing
Separate compute by workload — SQL warehouses for BI, all-purpose for exploration, jobs for ETL
Enable Photon engine for interactive queries — 2-5x performance improvement
Use liquid clustering on high-cardinality columns for optimal data layout
Implement data quality expectations at each Medallion hop transition
Tag tables by quality level (bronze/silver/gold) for discovery and governance
Monitor Unity Catalog lineage to understand data dependencies and impact analysis

The lakehouse unifies data lake storage with data warehouse governance on a single platform
Multi-hop architecture enables progressive data refinement with full traceability
Unity Catalog provides centralized security, lineage, and discovery across all assets
Delta Lake brings ACID transactions, time travel, and schema evolution to cloud storage
Serverless SQL and auto-scaling clusters eliminate infrastructure management