Data Lakehouse Architecture: Delta Lake, Iceberg, Hudi
Difficulty: Senior Level | Companies: Databricks, Netflix, Uber, Airbnb, Apple
1. What is the Data Lakehouse?
The lakehouse combines data lake flexibility with data warehouse reliability:
Architecture Diagram
Traditional Stack:
Data Lake (S3/GCS) β ETL β Data Warehouse β Analytics
(Cheap, flexible) (Slow) (Expensive, fast)
Lakehouse:
Data Lake + ACID + Schema Enforcement + Time Travel
(All in one layer)
Key Lakehouse Properties
| Feature | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Storage Cost | Low | High | Low |
| Schema Enforcement | None | Strict | Flexible |
| ACID Transactions | No | Yes | Yes |
| Time Travel | No | Limited | Yes |
| Format | Parquet/ORC | Proprietary | Open (Parquet) |
| Streaming Support | Good | Poor | Good |
βΉοΈ
Key Insight: The lakehouse eliminates the need for separate lake and warehouse by adding warehouse-like features directly on top of open file formats.
2. Delta Lake
ACID Transactions on Data Lakes
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("DeltaLakehouse") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
# Create Delta table with ACID
df = spark.read.parquet("s3://raw/events/")
df.write.format("delta") \
.mode("overwrite") \
.save("s3://lakehouse/events")
# MERGE (upsert) β impossible with plain Parquet
delta_table = DeltaTable.forPath(spark, "s3://lakehouse/events")
new_data = spark.read.parquet("s3://raw/events_new/")
delta_table.alias("target").merge(
new_data.alias("source"),
"target.event_id = source.event_id"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
# Time Travel β query historical data
df_yesterday = spark.read.format("delta") \
.option("versionAsOf", 123) \
.load("s3://lakehouse/events")
# Or by timestamp
df_jan1 = spark.read.format("delta") \
.option("timestampAsOf", "2024-01-01") \
.load("s3://lakehouse/events")
# VACUUM β clean up old files
delta_table.vacuum(retentionHours=168) # 7 days
Delta Lake Internals
Architecture Diagram
Delta Table on Storage:
s3://lakehouse/events/
βββ _delta_log/
β βββ 00000000000000000000.json (commit 0)
β βββ 00000000000000000001.json (commit 1)
β βββ 00000000000000000002.json (commit 2)
βββ part-00000-...-.parquet
βββ part-00001-...-.parquet
βββ _change_data/ (CDC files)
βββ cdc-00000-...-.parquet
3. Apache Iceberg
Table Format
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("IcebergLakehouse") \
.config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.iceberg.type", "hadoop") \
.config("spark.sql.catalog.iceberg.warehouse", "s3://lakehouse/iceberg") \
.getOrCreate()
# Create Iceberg table
spark.sql("""
CREATE TABLE iceberg.db.events (
event_id BIGINT,
user_id BIGINT,
event_type STRING,
event_timestamp TIMESTAMP,
properties MAP<STRING, STRING>
) USING iceberg
PARTITIONED BY (days(event_timestamp))
TBLPROPERTIES (
'format-version' = '2',
'write.parquet.compression-codec' = 'zstd'
)
""")
# Schema evolution (safe, column-level)
spark.sql("""
ALTER TABLE iceberg.db.events ADD COLUMN new_field STRING AFTER user_id
""")
# Hidden partitioning (auto-pruning)
spark.sql("""
SELECT * FROM iceberg.db.events
WHERE event_timestamp >= '2024-01-01'
LIMIT 100
""")
# Time travel with snapshots
spark.sql("""
SELECT * FROM iceberg.db.events
TIMESTAMP AS OF '2024-01-15 10:00:00'
""")
# Snapshot inspection
spark.sql("SELECT * FROM iceberg.db.events.snapshots")
Iceberg Internals
Architecture Diagram
Iceberg Table Metadata:
s3://lakehouse/iceberg/db/events/
βββ metadata/
β βββ v1.metadata.json (table metadata)
β βββ v2.metadata.json
β βββ snap-1234567890.avro (snapshot list)
β βββ 00000-1-uuid.manifest (manifest file)
βββ data/
β βββ 00000-0-uuid.parquet (data files)
β βββ 00001-1-uuid.parquet
βββ schema.json
4. Apache Hudi
Upsert-Optimized Tables
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("HudiLakehouse") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.config("spark.sql.hive.convertMetastoreParquet", "false") \
.getOrCreate()
hudi_options = {
'hoodie.table.name': 'events',
'hoodie.datasource.write.recordkey.field': 'event_id',
'hoodie.datasource.write.precombine.field': 'event_timestamp',
'hoodie.datasource.write.operation': 'upsert',
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'hoodie.compact.inline.max.delta.commits': '5',
}
# Upsert data
df.write.format("hudi") \
.options(**hudi_options) \
.mode("append") \
.save("s3://lakehouse/hudi/events")
# Read with time travel
spark.read.format("hudi") \
.option("as.of.instant", "20240115000000") \
.load("s3://lakehouse/hudi/events")
# Incremental read
spark.read.format("hudi") \
.option("hoodie.datasource.query.type", "incremental") \
.option("hoodie.datasource.read.begin.instanttime", "20240101000000") \
.option("hoodie.datasource.read.end.instanttime", "20240131000000") \
.load("s3://lakehouse/hudi/events")
5. Lakehouse Comparison
| Feature | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| ACID Transactions | Yes | Yes | Yes |
| Schema Evolution | Yes | Yes (better) | Yes |
| Time Travel | Yes | Yes | Yes |
| Hidden Partitioning | No | Yes | Yes |
| Upsert Performance | Good | Good | Excellent |
| Streaming Reads | Yes | Yes | Yes |
| Open Source | Yes (OSS) | Yes (ASF) | Yes (ASF) |
| Primary Backer | Databricks | Netflix | Uber |
β οΈ
Common Interview Trap: Don't say "Iceberg is better than Delta" or vice versa. Each has tradeoffs β Iceberg has better hidden partitioning, Delta has better streaming, Hudi has better upserts.
6. Lakehouse Best Practices
File Optimization
# Optimize file sizes (aim for 128MB-1GB per file)
spark.sql("""
OPTIMIZE delta.`s3://lakehouse/events`
ZORDER BY (user_id, event_timestamp)
""")
# Iceberg compaction
spark.sql("""
CALL iceberg.system.rewrite_data_files(
table => 'db.events',
options => map('target-file-size-bytes', '134217728')
)
""")
Partition Strategy
Architecture Diagram
Partition by date (not user_id):
β
Date pruning for time-range queries
β
Even file sizes (events distributed by time)
β User-specific queries scan all partitions
Partition by user_id:
β
User lookups are fast
β Hot partitions (power users)
β Time-range queries scan all partitions
Best: Partition by date, Z-order by user_id
βΉοΈ
Best Practice: Use Z-ORDER or data clustering for multi-dimensional queries. This creates zone maps that enable skipping irrelevant data blocks.
Follow-Up Questions
- How would you migrate from Parquet to Delta Lake without downtime?
- Compare write amplification between COW and MOR table types.
- Design a lakehouse architecture for 10PB of streaming + batch data.
- How do you handle GDPR deletions in a lakehouse?
- Design a cost optimization strategy for a lakehouse on S3.