Data Lakehouse: The Best of Both Worlds
The data lakehouse combines the low-cost, flexible storage of data lakes with the ACID transactions, performance, and governance of data warehouses.
Why Data Lakehouse Matters
Historical Problem:
- Separate data lakes (cheap storage, raw data)
- Separate data warehouses (governed, performant queries)
- Operational complexity and data duplication
Lakehouse Solution:
- Single platform for both storage and governance
- Reduced operational complexity
- Lower costs through data deduplication
- Maintains benefits of both architectures
Key Insight: The lakehouse eliminates the duality of data lakes and data warehouses, reducing operational complexity, data duplication, and cost while maintaining the benefits of both architectures.
Architecture Overview
Lakehouse Architecture Stack
Lake vs Warehouse vs Lakehouse
Multi-Hop Architecture
The multi-hop architecture (also called Medallion) processes data through progressive refinement stages. Each hop applies specific transformations while maintaining data provenance and enabling reprocessing.
# Multi-Hop Lakehouse Implementation with Databricks
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder \
.appName("LakehouseMultiHop") \
.getOrCreate()
# HOP 1: RAW -> INGEST (Bronze)
def hop_raw_to_ingest(spark, source, target):
"""
Ingest raw data with minimal transformation.
Preserves original schema for traceability.
"""
df = spark.read \
.format("json") \
.option("multiLine", "true") \
.load(source)
df.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.partitionBy("_load_date") \
.saveAsTable("bronze.raw_events")
print(f"Bronze: {df.count()} records ingested")
# HOP 2: INGEST -> CLEAN (Silver)
def hop_ingest_to_clean(spark):
"""
Clean, validate, deduplicate, and conform data.
Apply data quality rules and type casting.
"""
raw_df = spark.table("bronze.raw_events")
clean_df = raw_df \
.dropDuplicates(["event_id"]) \
.filter(col("event_id").isNotNull()) \
.withColumn("event_timestamp", to_timestamp(col("event_ts"))) \
.withColumn("event_type",
when(col("raw_type").isin("click", "tap"), "click")
.when(col("raw_type").isin("view", "impression"), "view")
.when(col("raw_type").isin("purchase", "buy"), "purchase")
.otherwise("other")
) \
.withColumn("user_id", col("uid").cast("long")) \
.withColumn("properties", from_json(col("props_json"), MapType(StringType(), StringType()))) \
.withColumn("_quality_check",
when(col("user_id").isNotNull() & col("event_timestamp").isNotNull(), "PASS")
.otherwise("FAIL")
)
clean_df.write \
.format("delta") \
.mode("overwrite") \
.partitionBy("event_date") \
.option("overwriteSchema", "true") \
.saveAsTable("silver.clean_events")
# Record quality metrics
quality_metrics = clean_df.groupBy("_quality_check").count()
quality_metrics.show()
# HOP 3: CLEAN -> CONFORM (Silver -> Gold)
def hop_clean_to_conform(spark):
"""
Conform data into business domains with enrichment.
Create dimensional model for analytics.
"""
events = spark.table("silver.clean_events")
users = spark.table("silver.clean_users")
enriched = events \
.join(users, events.user_id == users.user_id, "left") \
.withColumn("session_duration",
col("session_end") - col("session_start")
) \
.withColumn("is_conversion",
col("event_type") == "purchase"
)
# Daily aggregation
daily_metrics = enriched \
.groupBy("event_date", "event_type", "country") \
.agg(
count("*").alias("event_count"),
countDistinct("user_id").alias("unique_users"),
sum(when(col("is_conversion"), 1).otherwise(0)).alias("conversions"),
avg("session_duration").alias("avg_session_duration")
)
daily_metrics.write \
.format("delta") \
.mode("overwrite") \
.saveAsTable("gold.daily_event_metrics")
# HOP 4: CONFORM -> SERVE (Gold -> Platinum)
def hop_conform_to_serve(spark):
"""
Serve pre-aggregated, BI-optimized datasets.
Materialized for dashboard consumption.
"""
daily = spark.table("gold.daily_event_metrics")
# 30-day rolling metrics
from pyspark.sql.window import Window
window = Window.orderBy("event_date").rowsBetween(-29, 0)
served = daily \
.withColumn("rolling_30d_events", sum("event_count").over(window)) \
.withColumn("rolling_30d_users", sum("unique_users").over(window)) \
.withColumn("conversion_rate",
col("conversions") / col("event_count")
)
served.write \
.format("delta") \
.mode("overwrite") \
.saveAsTable("platinum.event_serving")
Unity Catalog
Unity Catalog is Databricks' unified governance solution for data and AI assets. It provides centralized access control, data lineage, data discovery, and auditing across all lakehouse resources.
-- Unity Catalog: Create catalog and schema
CREATE CATALOG IF NOT EXISTS analytics_catalog;
USE CATALOG analytics_catalog;
CREATE SCHEMA IF NOT EXISTS marketing
COMMENT 'Marketing analytics datasets'
WITH PROPERTIES ('owner' = 'marketing-team');
-- Create table with governance
CREATE TABLE marketing.campaign_performance (
campaign_id BIGINT,
campaign_name STRING,
impressions BIGINT,
clicks BIGINT,
conversions BIGINT,
spend DECIMAL(12,2),
start_date DATE,
end_date DATE
)
USING delta
COMMENT 'Daily campaign performance metrics'
TBLPROPERTIES (
'quality' = 'gold',
'owner' = 'marketing-team',
'retention_days' = '365'
);
-- Row-level security
CREATE ROW ACCESS POLICY campaign_access
ON marketing.campaign_performance
AS (campaign_id BIGINT) ->
CASE
WHEN is_account_group_member('marketing_admin') THEN TRUE
WHEN is_account_group_member('marketing_analyst')
AND campaign_id IN (SELECT campaign_id FROM approved_campaigns) THEN TRUE
ELSE FALSE
END;
-- Column-level security
CREATE COLUMN MASKING POLICY email_mask
ON marketing.campaign_performance
AS (email STRING) ->
CASE
WHEN is_account_group_member('pii_viewers') THEN email
ELSE CONCAT('***', SUBSTRING(email, POSITION('@' IN email)))
END;
-- Data lineage (automatic with Unity Catalog)
SELECT * FROM system.access.column_lineage
WHERE target_table = 'marketing.campaign_performance';
-- Grant permissions
GRANT SELECT ON TABLE marketing.campaign_performance TO `marketing_analyst`;
GRANT MODIFY ON TABLE marketing.campaign_performance TO `marketing_engineer`;
GRANT USAGE ON SCHEMA marketing TO `data_scientist`;
Lakehouse vs. Warehouse vs. Lake
The choice between lakehouse, warehouse, and lake depends on data variety, query patterns, ML requirements, and cost constraints. Each architecture has distinct trade-offs in cost, performance, governance, and flexibility.
| Criterion | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Data Variety | All formats | Structured only | All formats |
| Cost (per GB) | 0.04-0.08 | $0.03 | |
| ACID Transactions | No | Yes | Yes |
| Schema Enforcement | Schema-on-read | Schema-on-write | Both |
| ML Integration | Native | External | Native |
| BI Performance | Poor | Excellent | Good-Excellent |
| Governance | Minimal | Strong | Strong |
| Time Travel | No | Limited | Yes |
| Streaming | Limited | Limited | Native |
| Open Format | Yes | Vendor-dependent | Yes |
| Setup Complexity | Low | Medium | Medium-High |
| Best For | Raw storage, ML | Reporting, dashboards | Unified analytics |
# Lakehouse architecture decision framework
def recommend_architecture(requirements: dict) -> str:
"""Recommend architecture based on requirements."""
score = {
"lake": 0,
"warehouse": 0,
"lakehouse": 0
}
if requirements.get("ml_workloads"):
score["lake"] += 2
score["lakehouse"] += 2
if requirements.get("bi_dashboards"):
score["warehouse"] += 3
score["lakehouse"] += 2
if requirements.get("streaming"):
score["lakehouse"] += 2
if requirements.get("structured_data_only"):
score["warehouse"] += 2
if requirements.get("cost_sensitive"):
score["lake"] += 2
score["lakehouse"] += 1
if requirements.get("governance_required"):
score["warehouse"] += 2
score["lakehouse"] += 2
return max(score, key=score.get)
# Example
recommendation = recommend_architecture({
"ml_workloads": True,
"bi_dashboards": True,
"streaming": True,
"governance_required": True
})
print(f"Recommended: {recommendation}") # Output: lakehouse
Key Concepts Summary
| Concept | Description | Benefit | When to Use |
|---|---|---|---|
| Lakehouse | Unified lake + warehouse | Single platform | All data workloads |
| Multi-Hop | Progressive data refinement | Traceability | Complex pipelines |
| Unity Catalog | Centralized governance | Security, lineage | Enterprise deployments |
| Serverless SQL | On-demand compute | No cluster management | BI and ad-hoc queries |
| Delta Lake | ACID on data lakes | Transaction safety | All lakehouse tables |
| Auto Loader | Incremental file ingestion | Near-real-time | Streaming + batch |
| Photon Engine | Vectorized query engine | 2-5x performance | Interactive queries |
| MLflow | ML lifecycle management | Experiment tracking | ML in production |
| Feature Store | Feature engineering platform | Feature reuse | ML pipelines |
| Model Serving | Real-time inference | Low-latency predictions | Online ML |
Performance Metrics
| Metric | Traditional Lake | Lakehouse | Warehouse |
|---|---|---|---|
| Storage Cost/GB/month | 0.03 | $0.04-0.08 | |
| Query Latency (1TB) | 300-600 sec | 60-180 sec | 30-120 sec |
| ACID Transactions | No | Yes | Yes |
| Schema Evolution | Manual | Easy | Moderate |
| Concurrent Writes | No | Yes | Yes |
| ML Integration | External | Native | External |
| Streaming Support | Limited | Native | Limited |
| Time Travel | No | Yes | Yes |
| Open Format | Yes | Yes | Vendor-dependent |
| Ecosystem | Universal | Growing | Established |
10 Best Practices
- Use the Medallion Architecture consistently β Bronze (raw) -> Silver (clean) -> Gold (aggregated)
- Implement Unity Catalog from day one β retroactive governance is harder than proactive
- Enable Delta Lake auto-optimize for write-heavy workloads to prevent small files
- Use Auto Loader for incremental ingestion instead of manual file listing
- Separate compute by workload β SQL warehouses for BI, all-purpose for exploration, jobs for ETL
- Enable Photon engine for interactive queries β 2-5x performance improvement
- Use liquid clustering on high-cardinality columns for optimal data layout
- Implement data quality expectations at each Medallion hop transition
- Tag tables by quality level (bronze/silver/gold) for discovery and governance
- Monitor Unity Catalog lineage to understand data dependencies and impact analysis
- The lakehouse unifies data lake storage with data warehouse governance on a single platform
- Multi-hop architecture enables progressive data refinement with full traceability
- Unity Catalog provides centralized security, lineage, and discovery across all assets
- Delta Lake brings ACID transactions, time travel, and schema evolution to cloud storage
- Serverless SQL and auto-scaling clusters eliminate infrastructure management
See Also
- Delta Lake & Iceberg β Open table formats underlying the lakehouse
- Data Lake Architecture β Medallion architecture patterns
- Data Governance & Catalog β Unity Catalog and metadata management
- Data Mesh Architecture β Domain-oriented data ownership
- Real-Time Analytics β Spark Structured Streaming for lakehouse
- MLOps for Data Engineering β ML pipelines on lakehouse platforms