Data Lake vs Data Warehouse vs Lakehouse
Understanding modern data architecture paradigms
Interview Question
"The CTO asks: 'Should we build a data lake, a data warehouse, or a lakehouse?' The company has: (1) 50TB of semi-structured logs, (2) 10TB of structured transactional data, (3) ML team needs raw data access, (4) BI team needs fast SQL queries, (5) budget is limited. Design the architecture and justify your choice."
Difficulty: Hard | Frequently asked at Databricks, Snowflake, Uber, Airbnb
Theoretical Foundation
Data Warehouse
A data warehouse is a centralized repository for structured, processed data optimized for analytical queries.
Characteristics:
- Schema-on-write: Data must conform to a predefined schema before loading
- Structured data only: Rows, columns, types
- ACID transactions: Full compliance guarantees
- Optimized for reads: Columnar storage, indexing, compression
- ETL required: Transform before loading
Pros:
- Fast SQL queries
- Data quality guaranteed
- ACID compliance
- Strong governance
Cons:
- Expensive (storage + compute)
- Inflexible schema
- Long development cycles
- Can't handle unstructured data
Data Lake
A data lake stores raw data in its native format, schema-on-read.
Characteristics:
- Schema-on-read: Schema applied when reading, not writing
- All data types: Structured, semi-structured, unstructured
- Low-cost storage: S3, ADLS, GCS
- ELT pattern: Load first, transform later
- Supports ML: Direct access to raw data
Pros:
- Low cost
- Flexible schema
- Supports all data types
- Good for ML/AI
- Scalable
Cons:
- No ACID transactions (with raw files)
- Data quality issues
- No built-in governance
- Query performance varies
Lakehouse
A lakehouse combines data lake storage with data warehouse management.
Characteristics:
- Open formats: Parquet, Delta Lake, Iceberg, Hudi
- ACID on data lakes: Transaction support on object storage
- Schema enforcement: Optional schema validation
- Time travel: Version control for data
- Unified analytics: SQL + ML + Streaming
Pros:
- Best of both worlds
- Open standards
- ACID on cheap storage
- Time travel
- Unified platform
Cons:
- Newer technology (less mature)
- Requires expertise
- Tooling still evolving
Comparison Matrix
| Feature | Data Warehouse | Data Lake | Lakehouse |
|---|---|---|---|
| Data Types | Structured only | All types | All types |
| Schema | On-write | On-read | Flexible |
| Storage Cost | High | Low | Low |
| Query Performance | Fast (optimized) | Variable | Fast (improving) |
| ACID | Yes | No | Yes |
| Time Travel | Limited | No | Yes |
| ML Support | Limited | Excellent | Good |
| Governance | Strong | Weak | Improving |
| Scalability | Good | Excellent | Excellent |
| Vendor Lock-in | High | Low | Low |
The Evolution of Data Architecture
2000s: Data Warehouses dominate
ETL β Warehouse β BI Tools
2010s: Data Lakes emerge
ELT β Data Lake β Spark/Presto β BI/ML
2020s: Lakehouse becomes mainstream
Streaming β Lakehouse β Unified Analytics
βΉοΈ
Key Insight: The lakehouse isn't just a combination of lake + warehouseβit's a new paradigm that uses open table formats (Delta Lake, Iceberg, Hudi) to bring warehouse-like features to data lakes. The key innovation is the transaction log that enables ACID, time travel, and schema enforcement on object storage.
Open Table Formats Deep Dive
Delta Lake
# Delta Lake: ACID transactions on S3
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.getOrCreate()
# Create Delta table
spark.sql("""
CREATE TABLE sales (
sale_id BIGINT,
product_id INT,
quantity INT,
amount DECIMAL(10,2),
sale_date DATE
) USING DELTA
LOCATION 's3://data-lake/sales/'
""")
# ACID transaction
spark.sql("""
INSERT INTO sales VALUES
(1, 101, 5, 49.99, '2024-01-15'),
(2, 102, 3, 29.99, '2024-01-15')
""")
# Time travel
spark.sql("SELECT * FROM sales VERSION AS OF 0") # Initial state
spark.sql("SELECT * FROM sales TIMESTAMP AS OF '2024-01-15 10:00:00'")
Apache Iceberg
# Iceberg: Hidden partitioning + partition evolution
spark.sql("""
CREATE TABLE catalog.db.events (
event_id BIGINT,
event_time TIMESTAMP,
user_id STRING,
event_type STRING
) USING iceberg
PARTITIONED BY (days(event_time))
""")
# Partition evolution (no data rewrite)
spark.sql("""
ALTER TABLE catalog.db.events
DROP PARTITION FIELD days(event_time)
""")
spark.sql("""
ALTER TABLE catalog.db.events
PARTITION BY (hours(event_time), user_id)
""")
Apache Hudi
# Hudi: Incremental processing + CDC
from hudi import DeltaStreamer
# Hudi table with upsert support
spark.sql("""
CREATE TABLE hudi_events (
uuid STRING,
user_id STRING,
event_time TIMESTAMP,
event_type STRING
) USING hudi
OPTIONS (
'hoodie.table.name' = 'events',
'hoodie.datasource.write.recordkey.field' = 'uuid',
'hoodie.datasource.write.precombine.field' = 'event_time',
'hoodie.datasource.write.operation' = 'upsert'
)
""")
Architecture Patterns
Pattern 1: Lambda Architecture
Pros: Low latency + high throughput Cons: Complex, duplicate logic
Pattern 2: Kappa Architecture
Pros: Simpler, single codebase Cons: Requires robust streaming infrastructure
Pattern 3: Lakehouse Architecture
βΉοΈ
Best Practice: The Medallion Architecture (Bronze β Silver β Gold) is the most popular pattern for lakehouses. Each layer adds value: Bronze preserves raw data, Silver ensures quality, Gold provides business-ready aggregates.
Code Implementation
Building a Lakehouse with Delta Lake
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from delta.tables import DeltaTable
spark = SparkSession.builder \
.appName("LakehouseArchitecture") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
# ============================================================
# BRONZE LAYER: Raw data ingestion
# ============================================================
# Read raw data from various sources
raw_logs = spark.read.text("s3://data-lake/raw/logs/")
raw_transactions = spark.read.json("s3://data-lake/raw/transactions/")
raw_customers = spark.read.parquet("s3://data-lake/raw/customers/")
# Write to Bronze (raw, no transformations)
raw_logs.write.format("delta") \
.mode("append") \
.save("s3://lakehouse/bronze/logs/")
raw_transactions.write.format("delta") \
.mode("append") \
.save("s3://lakehouse/bronze/transactions/")
raw_customers.write.format("delta") \
.mode("append") \
.save("s3://lakehouse/bronze/customers/")
# ============================================================
# SILVER LAYER: Cleaned and validated
# ============================================================
# Read from Bronze
bronze_transactions = spark.read.format("delta") \
.load("s3://lakehouse/bronze/transactions/")
# Clean and validate
silver_transactions = bronze_transactions \
.filter(F.col("amount").isNotNull()) \
.filter(F.col("amount") > 0) \
.filter(F.col("transaction_date").isNotNull()) \
.withColumn("amount", F.round("amount", 2)) \
.withColumn("transaction_date", F.to_date("transaction_date")) \
.withColumn("processed_timestamp", F.current_timestamp())
# Write to Silver with merge (upsert)
if DeltaTable.isDeltaTable(spark, "s3://lakehouse/silver/transactions/"):
delta_table = DeltaTable.forPath(spark, "s3://lakehouse/silver/transactions/")
delta_table.alias("target").merge(
silver_transactions.alias("source"),
"target.transaction_id = source.transaction_id"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
else:
silver_transactions.write.format("delta") \
.mode("overwrite") \
.partitionBy("transaction_date") \
.save("s3://lakehouse/silver/transactions/")
# ============================================================
# GOLD LAYER: Business aggregates
# ============================================================
# Read from Silver
silver_tx = spark.read.format("delta") \
.load("s3://lakehouse/silver/transactions/")
silver_customers = spark.read.format("delta") \
.load("s3://lakehouse/silver/customers/")
# Create business aggregates
gold_customer_summary = silver_tx \
.join(silver_customers, "customer_id") \
.groupBy("customer_id", "customer_name", "segment") \
.agg(
F.count("*").alias("total_transactions"),
F.sum("amount").alias("total_amount"),
F.avg("amount").alias("avg_amount"),
F.min("transaction_date").alias("first_transaction"),
F.max("transaction_date").alias("last_transaction")
)
# Write to Gold
gold_customer_summary.write.format("delta") \
.mode("overwrite") \
.save("s3://lakehouse/gold/customer_summary/")
Time Travel and Versioning
# ============================================================
# TIME TRAVEL: Query historical data
# ============================================================
# Query by version
df_v0 = spark.read.format("delta") \
.option("versionAsOf", 0) \
.load("s3://lakehouse/silver/transactions/")
# Query by timestamp
df_historical = spark.read.format("delta") \
.option("timestampAsOf", "2024-01-15") \
.load("s3://lakehouse/silver/transactions/")
# Compare versions
current = spark.read.format("delta").load("s3://lakehouse/silver/transactions/")
previous = spark.read.format("delta").option("versionAsOf", 5) \
.load("s3://lakehouse/silver/transactions/")
# Find changes
changes = current.alias("curr") \
.join(previous.alias("prev"),
F.col("curr.transaction_id") == F.col("prev.transaction_id"),
"left") \
.filter(F.col("prev.transaction_id").isNull() |
(F.col("curr.amount") != F.col("prev.amount")))
# Get Delta table history
delta_table = DeltaTable.forPath(spark, "s3://lakehouse/silver/transactions/")
history = delta_table.history()
history.show(truncate=False)
Data Quality Framework
# ============================================================
# DATA QUALITY: Validate at each layer
# ============================================================
def validate_data_quality(df, rules):
"""
Validate DataFrame against quality rules.
Args:
df: Input DataFrame
rules: Dict of {column: [rule_functions]}
"""
results = {}
for column, column_rules in rules.items():
for rule in column_rules:
rule_name = rule.__name__
violations = df.filter(~rule(F.col(column))).count()
total = df.count()
quality_score = 1 - (violations / total) if total > 0 else 1
results[f"{column}.{rule_name}"] = {
"violations": violations,
"total": total,
"quality_score": quality_score,
"passed": violations == 0
}
return results
# Define quality rules
quality_rules = {
"amount": [
lambda col: col.isNotNull(),
lambda col: col > 0,
lambda col: col < 1000000,
],
"transaction_date": [
lambda col: col.isNotNull(),
lambda col: col <= F.current_date(),
],
"customer_id": [
lambda col: col.isNotNull(),
lambda col: col.rlike("^CUST[0-9]{6}$"),
]
}
# Validate Silver layer
silver_df = spark.read.format("delta").load("s3://lakehouse/silver/transactions/")
quality_results = validate_data_quality(silver_df, quality_rules)
# Check results
for check, result in quality_results.items():
if not result["passed"]:
print(f"FAILED: {check} - {result['violations']} violations")
# Alert or quarantine
Cost Analysis
# ============================================================
# COST ANALYSIS: Lakehouse vs Warehouse vs Lake
# ============================================================
# Assumptions
data_size_tb = 50 # 50TB
queries_per_day = 1000
avg_query_gb = 10
# Option 1: Traditional Warehouse (e.g., Redshift)
# Storage: $0.025/GB/month Γ 50TB = $1,250/month
# Compute: 8-node ra3.4xlarge = $26.68/hr Γ 730hr = $19,477/month
# Total: ~$20,727/month
# Option 2: Data Lake (S3 + Athena)
# Storage: $0.023/GB/month Γ 50TB = $1,150/month
# Query: $5/TB scanned Γ 10GB Γ 1000 = $50/day = $1,500/month
# Total: ~$2,650/month
# Option 3: Lakehouse (S3 + Delta Lake + Spark)
# Storage: $0.023/GB/month Γ 50TB = $1,150/month
# Compute: EMR cluster (auto-scaling) = ~$5,000/month
# Total: ~$6,150/month
# Option 4: Snowflake
# Storage: $23/TB/month Γ 50TB = $1,150/month
# Compute: 10 credits/hr Γ $3/credit Γ 730hr = $21,900/month
# Total: ~$23,050/month
print("Cost Comparison (monthly):")
print(f"Warehouse: ${20727:,}")
print(f"Data Lake: ${2650:,}")
print(f"Lakehouse: ${6150:,}")
print(f"Snowflake: ${23050:,}")
β οΈ
Critical Consideration: Cost analysis depends heavily on usage patterns. Data lakes are cheapest for storage-heavy workloads. Warehouses are cheapest for compute-heavy analytical workloads. Lakehouses offer the best balance for mixed workloads.
Common Follow-Up Questions
Q1: When would you choose a warehouse over a lakehouse?
Choose a warehouse when:
- All data is structured
- SQL performance is critical
- Team has strong SQL skills
- Governance requirements are strict
- Budget allows for premium pricing
Q2: How do you migrate from a warehouse to a lakehouse?
# Migration strategy: Big Bang vs Phased
# Phased approach:
# Phase 1: Replicate to lakehouse (dual-write)
# Phase 2: Migrate read workloads one by one
# Phase 3: Stop writes to warehouse
# Phase 4: Decommission warehouse
# Example: Migrate a table
def migrate_table(spark, warehouse_table, lakehouse_path):
"""Migrate a table from warehouse to lakehouse"""
# Read from warehouse
df = spark.sql(f"SELECT * FROM {warehouse_table}")
# Write to lakehouse with same schema
df.write.format("delta") \
.mode("overwrite") \
.save(lakehouse_path)
# Validate row counts
warehouse_count = spark.sql(f"SELECT COUNT(*) FROM {warehouse_table}").collect()[0][0]
lakehouse_count = spark.read.format("delta").load(lakehouse_path).count()
assert warehouse_count == lakehouse_count, f"Count mismatch: {warehouse_count} vs {lakehouse_count}"
print(f"Successfully migrated {warehouse_table} ({warehouse_count} rows)")
Q3: How do you handle governance in a lakehouse?
# Unity Catalog for governance (Databricks)
spark.sql("""
-- Create catalog and schema
CREATE CATALOG IF NOT EXISTS production;
CREATE SCHEMA IF NOT EXISTS production.sales;
-- Create table with governance
CREATE TABLE production.sales.transactions (
transaction_id BIGINT,
customer_id STRING,
amount DECIMAL(10,2),
transaction_date DATE
) COMMENT 'Sales transactions'
TBLPROPERTIES (
'quality' = 'gold',
'owner' = 'sales-team',
'pii' = 'customer_id'
);
-- Grant access
GRANT SELECT ON production.sales.transactions TO `sales-analysts`;
GRANT MODIFY ON production.sales.transactions TO `sales-engineers`;
""")
Q4: What about real-time in a lakehouse?
# Real-time ingestion into lakehouse
# Using Spark Structured Streaming + Delta Lake
streaming_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka:9092") \
.option("subscribe", "transactions") \
.load()
# Parse and transform
parsed_df = streaming_df \
.select(F.from_json(F.col("value").cast("string"), schema).alias("data")) \
.select("data.*")
# Write to Delta Lake (micro-batch)
query = parsed_df.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "s3://checkpoints/transactions/") \
.start("s3://lakehouse/silver/transactions/")
π‘
Interview Tip: When discussing data architecture, always ask about: (1) data volume, (2) query patterns, (3) latency requirements, (4) team skills, (5) budget. The "right" architecture depends on these factors. There's no one-size-fits-all solution.
Company-Specific Tips
Databricks Interview Tips
- Emphasize Delta Lake and Lakehouse architecture
- Discuss Medallion Architecture (Bronze/Silver/Gold)
- Be ready to explain time travel and ACID on S3
- Talk about Photon engine and performance optimization
Snowflake Interview Tips
- Focus on data warehouse strengths
- Discuss separation of storage and compute
- Mention time travel and Zero-Copy Cloning
- Talk about Snowpark for ML workloads
Netflix Interview Tips
- Discuss content recommendation architecture
- Explain multi-tenant data platforms
- Mention cost optimization strategies
- Talk about real-time personalization pipelines
βΉοΈ
Final Takeaway: The lakehouse is becoming the default choice for new data platforms. It offers the flexibility of data lakes with the governance and performance of warehouses. However, traditional warehouses still make sense for purely SQL workloads with strict governance requirements. The key is understanding your use case and choosing the architecture that fits.