Databricks & Snowflake Interview

Data Lake vs Data Warehouse vs Lakehouse

Understanding modern data architecture paradigms

Interview Question

"The CTO asks: 'Should we build a data lake, a data warehouse, or a lakehouse?' The company has: (1) 50TB of semi-structured logs, (2) 10TB of structured transactional data, (3) ML team needs raw data access, (4) BI team needs fast SQL queries, (5) budget is limited. Design the architecture and justify your choice."

Difficulty: Hard | Frequently asked at Databricks, Snowflake, Uber, Airbnb

Theoretical Foundation

Data Warehouse

A data warehouse is a centralized repository for structured, processed data optimized for analytical queries.

Characteristics:

Schema-on-write: Data must conform to a predefined schema before loading
Structured data only: Rows, columns, types
ACID transactions: Full compliance guarantees
Optimized for reads: Columnar storage, indexing, compression
ETL required: Transform before loading

Pros:

Fast SQL queries
Data quality guaranteed
ACID compliance
Strong governance

Cons:

Expensive (storage + compute)
Inflexible schema
Long development cycles
Can't handle unstructured data

Data Lake

A data lake stores raw data in its native format, schema-on-read.

Characteristics:

Schema-on-read: Schema applied when reading, not writing
All data types: Structured, semi-structured, unstructured
Low-cost storage: S3, ADLS, GCS
ELT pattern: Load first, transform later
Supports ML: Direct access to raw data

Pros:

Low cost
Flexible schema
Supports all data types
Good for ML/AI
Scalable

Cons:

No ACID transactions (with raw files)
Data quality issues
No built-in governance
Query performance varies

Lakehouse

A lakehouse combines data lake storage with data warehouse management.

Characteristics:

Open formats: Parquet, Delta Lake, Iceberg, Hudi
ACID on data lakes: Transaction support on object storage
Schema enforcement: Optional schema validation
Time travel: Version control for data
Unified analytics: SQL + ML + Streaming

Pros:

Best of both worlds
Open standards
ACID on cheap storage
Time travel
Unified platform

Cons:

Newer technology (less mature)
Requires expertise
Tooling still evolving

Comparison Matrix

Feature	Data Warehouse	Data Lake	Lakehouse
Data Types	Structured only	All types	All types
Schema	On-write	On-read	Flexible
Storage Cost	High	Low	Low
Query Performance	Fast (optimized)	Variable	Fast (improving)
ACID	Yes	No	Yes
Time Travel	Limited	No	Yes
ML Support	Limited	Excellent	Good
Governance	Strong	Weak	Improving
Scalability	Good	Excellent	Excellent
Vendor Lock-in	High	Low	Low

The Evolution of Data Architecture

2000s: Data Warehouses dominate

Architecture Diagram

ETL → Warehouse → BI Tools

2010s: Data Lakes emerge

Architecture Diagram

ELT → Data Lake → Spark/Presto → BI/ML

2020s: Lakehouse becomes mainstream

Architecture Diagram

Streaming → Lakehouse → Unified Analytics

ℹ️

Key Insight: The lakehouse isn't just a combination of lake + warehouse—it's a new paradigm that uses open table formats (Delta Lake, Iceberg, Hudi) to bring warehouse-like features to data lakes. The key innovation is the transaction log that enables ACID, time travel, and schema enforcement on object storage.

Open Table Formats Deep Dive

Delta Lake

# Delta Lake: ACID transactions on S3
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .getOrCreate()

# Create Delta table
spark.sql("""
    CREATE TABLE sales (
        sale_id BIGINT,
        product_id INT,
        quantity INT,
        amount DECIMAL(10,2),
        sale_date DATE
    ) USING DELTA
    LOCATION 's3://data-lake/sales/'
""")

# ACID transaction
spark.sql("""
    INSERT INTO sales VALUES
    (1, 101, 5, 49.99, '2024-01-15'),
    (2, 102, 3, 29.99, '2024-01-15')
""")

# Time travel
spark.sql("SELECT * FROM sales VERSION AS OF 0")  # Initial state
spark.sql("SELECT * FROM sales TIMESTAMP AS OF '2024-01-15 10:00:00'")

Apache Iceberg

# Iceberg: Hidden partitioning + partition evolution
spark.sql("""
    CREATE TABLE catalog.db.events (
        event_id BIGINT,
        event_time TIMESTAMP,
        user_id STRING,
        event_type STRING
    ) USING iceberg
    PARTITIONED BY (days(event_time))
""")

# Partition evolution (no data rewrite)
spark.sql("""
    ALTER TABLE catalog.db.events
    DROP PARTITION FIELD days(event_time)
""")

spark.sql("""
    ALTER TABLE catalog.db.events
    PARTITION BY (hours(event_time), user_id)
""")

Apache Hudi

# Hudi: Incremental processing + CDC
from hudi import DeltaStreamer

# Hudi table with upsert support
spark.sql("""
    CREATE TABLE hudi_events (
        uuid STRING,
        user_id STRING,
        event_time TIMESTAMP,
        event_type STRING
    ) USING hudi
    OPTIONS (
        'hoodie.table.name' = 'events',
        'hoodie.datasource.write.recordkey.field' = 'uuid',
        'hoodie.datasource.write.precombine.field' = 'event_time',
        'hoodie.datasource.write.operation' = 'upsert'
    )
""")

Architecture Patterns

Pattern 1: Lambda Architecture

Pros: Low latency + high throughput Cons: Complex, duplicate logic

Pattern 2: Kappa Architecture

Pros: Simpler, single codebase Cons: Requires robust streaming infrastructure

Pattern 3: Lakehouse Architecture

ℹ️

Best Practice: The Medallion Architecture (Bronze → Silver → Gold) is the most popular pattern for lakehouses. Each layer adds value: Bronze preserves raw data, Silver ensures quality, Gold provides business-ready aggregates.

Code Implementation

Building a Lakehouse with Delta Lake

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder \
    .appName("LakehouseArchitecture") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", 
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# ============================================================
# BRONZE LAYER: Raw data ingestion
# ============================================================

# Read raw data from various sources
raw_logs = spark.read.text("s3://data-lake/raw/logs/")
raw_transactions = spark.read.json("s3://data-lake/raw/transactions/")
raw_customers = spark.read.parquet("s3://data-lake/raw/customers/")

# Write to Bronze (raw, no transformations)
raw_logs.write.format("delta") \
    .mode("append") \
    .save("s3://lakehouse/bronze/logs/")

raw_transactions.write.format("delta") \
    .mode("append") \
    .save("s3://lakehouse/bronze/transactions/")

raw_customers.write.format("delta") \
    .mode("append") \
    .save("s3://lakehouse/bronze/customers/")

# ============================================================
# SILVER LAYER: Cleaned and validated
# ============================================================

# Read from Bronze
bronze_transactions = spark.read.format("delta") \
    .load("s3://lakehouse/bronze/transactions/")

# Clean and validate
silver_transactions = bronze_transactions \
    .filter(F.col("amount").isNotNull()) \
    .filter(F.col("amount") > 0) \
    .filter(F.col("transaction_date").isNotNull()) \
    .withColumn("amount", F.round("amount", 2)) \
    .withColumn("transaction_date", F.to_date("transaction_date")) \
    .withColumn("processed_timestamp", F.current_timestamp())

# Write to Silver with merge (upsert)
if DeltaTable.isDeltaTable(spark, "s3://lakehouse/silver/transactions/"):
    delta_table = DeltaTable.forPath(spark, "s3://lakehouse/silver/transactions/")
    
    delta_table.alias("target").merge(
        silver_transactions.alias("source"),
        "target.transaction_id = source.transaction_id"
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()
else:
    silver_transactions.write.format("delta") \
        .mode("overwrite") \
        .partitionBy("transaction_date") \
        .save("s3://lakehouse/silver/transactions/")

# ============================================================
# GOLD LAYER: Business aggregates
# ============================================================

# Read from Silver
silver_tx = spark.read.format("delta") \
    .load("s3://lakehouse/silver/transactions/")
silver_customers = spark.read.format("delta") \
    .load("s3://lakehouse/silver/customers/")

# Create business aggregates
gold_customer_summary = silver_tx \
    .join(silver_customers, "customer_id") \
    .groupBy("customer_id", "customer_name", "segment") \
    .agg(
        F.count("*").alias("total_transactions"),
        F.sum("amount").alias("total_amount"),
        F.avg("amount").alias("avg_amount"),
        F.min("transaction_date").alias("first_transaction"),
        F.max("transaction_date").alias("last_transaction")
    )

# Write to Gold
gold_customer_summary.write.format("delta") \
    .mode("overwrite") \
    .save("s3://lakehouse/gold/customer_summary/")

Time Travel and Versioning

# ============================================================
# TIME TRAVEL: Query historical data
# ============================================================

# Query by version
df_v0 = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("s3://lakehouse/silver/transactions/")

# Query by timestamp
df_historical = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-15") \
    .load("s3://lakehouse/silver/transactions/")

# Compare versions
current = spark.read.format("delta").load("s3://lakehouse/silver/transactions/")
previous = spark.read.format("delta").option("versionAsOf", 5) \
    .load("s3://lakehouse/silver/transactions/")

# Find changes
changes = current.alias("curr") \
    .join(previous.alias("prev"), 
          F.col("curr.transaction_id") == F.col("prev.transaction_id"), 
          "left") \
    .filter(F.col("prev.transaction_id").isNull() | 
            (F.col("curr.amount") != F.col("prev.amount")))

# Get Delta table history
delta_table = DeltaTable.forPath(spark, "s3://lakehouse/silver/transactions/")
history = delta_table.history()
history.show(truncate=False)

Data Quality Framework

# ============================================================
# DATA QUALITY: Validate at each layer
# ============================================================

def validate_data_quality(df, rules):
    """
    Validate DataFrame against quality rules.
    
    Args:
        df: Input DataFrame
        rules: Dict of {column: [rule_functions]}
    """
    results = {}
    
    for column, column_rules in rules.items():
        for rule in column_rules:
            rule_name = rule.__name__
            violations = df.filter(~rule(F.col(column))).count()
            total = df.count()
            quality_score = 1 - (violations / total) if total > 0 else 1
            
            results[f"{column}.{rule_name}"] = {
                "violations": violations,
                "total": total,
                "quality_score": quality_score,
                "passed": violations == 0
            }
    
    return results

# Define quality rules
quality_rules = {
    "amount": [
        lambda col: col.isNotNull(),
        lambda col: col > 0,
        lambda col: col < 1000000,
    ],
    "transaction_date": [
        lambda col: col.isNotNull(),
        lambda col: col <= F.current_date(),
    ],
    "customer_id": [
        lambda col: col.isNotNull(),
        lambda col: col.rlike("^CUST[0-9]{6}$"),
    ]
}

# Validate Silver layer
silver_df = spark.read.format("delta").load("s3://lakehouse/silver/transactions/")
quality_results = validate_data_quality(silver_df, quality_rules)

# Check results
for check, result in quality_results.items():
    if not result["passed"]:
        print(f"FAILED: {check} - {result['violations']} violations")
        # Alert or quarantine

Cost Analysis

# ============================================================
# COST ANALYSIS: Lakehouse vs Warehouse vs Lake
# ============================================================

# Assumptions
data_size_tb = 50  # 50TB
queries_per_day = 1000
avg_query_gb = 10

# Option 1: Traditional Warehouse (e.g., Redshift)
# Storage: $0.025/GB/month × 50TB = $1,250/month
# Compute: 8-node ra3.4xlarge = $26.68/hr × 730hr = $19,477/month
# Total: ~$20,727/month

# Option 2: Data Lake (S3 + Athena)
# Storage: $0.023/GB/month × 50TB = $1,150/month
# Query: $5/TB scanned × 10GB × 1000 = $50/day = $1,500/month
# Total: ~$2,650/month

# Option 3: Lakehouse (S3 + Delta Lake + Spark)
# Storage: $0.023/GB/month × 50TB = $1,150/month
# Compute: EMR cluster (auto-scaling) = ~$5,000/month
# Total: ~$6,150/month

# Option 4: Snowflake
# Storage: $23/TB/month × 50TB = $1,150/month
# Compute: 10 credits/hr × $3/credit × 730hr = $21,900/month
# Total: ~$23,050/month

print("Cost Comparison (monthly):")
print(f"Warehouse:  ${20727:,}")
print(f"Data Lake:  ${2650:,}")
print(f"Lakehouse:  ${6150:,}")
print(f"Snowflake:  ${23050:,}")

⚠️

Critical Consideration: Cost analysis depends heavily on usage patterns. Data lakes are cheapest for storage-heavy workloads. Warehouses are cheapest for compute-heavy analytical workloads. Lakehouses offer the best balance for mixed workloads.

Common Follow-Up Questions

Q1: When would you choose a warehouse over a lakehouse?

Choose a warehouse when:

All data is structured
SQL performance is critical
Team has strong SQL skills
Governance requirements are strict
Budget allows for premium pricing

Q2: How do you migrate from a warehouse to a lakehouse?

# Migration strategy: Big Bang vs Phased
# Phased approach:

# Phase 1: Replicate to lakehouse (dual-write)
# Phase 2: Migrate read workloads one by one
# Phase 3: Stop writes to warehouse
# Phase 4: Decommission warehouse

# Example: Migrate a table
def migrate_table(spark, warehouse_table, lakehouse_path):
    """Migrate a table from warehouse to lakehouse"""
    
    # Read from warehouse
    df = spark.sql(f"SELECT * FROM {warehouse_table}")
    
    # Write to lakehouse with same schema
    df.write.format("delta") \
        .mode("overwrite") \
        .save(lakehouse_path)
    
    # Validate row counts
    warehouse_count = spark.sql(f"SELECT COUNT(*) FROM {warehouse_table}").collect()[0][0]
    lakehouse_count = spark.read.format("delta").load(lakehouse_path).count()
    
    assert warehouse_count == lakehouse_count, f"Count mismatch: {warehouse_count} vs {lakehouse_count}"
    
    print(f"Successfully migrated {warehouse_table} ({warehouse_count} rows)")

Q3: How do you handle governance in a lakehouse?

# Unity Catalog for governance (Databricks)
spark.sql("""
    -- Create catalog and schema
    CREATE CATALOG IF NOT EXISTS production;
    CREATE SCHEMA IF NOT EXISTS production.sales;
    
    -- Create table with governance
    CREATE TABLE production.sales.transactions (
        transaction_id BIGINT,
        customer_id STRING,
        amount DECIMAL(10,2),
        transaction_date DATE
    ) COMMENT 'Sales transactions'
    TBLPROPERTIES (
        'quality' = 'gold',
        'owner' = 'sales-team',
        'pii' = 'customer_id'
    );
    
    -- Grant access
    GRANT SELECT ON production.sales.transactions TO `sales-analysts`;
    GRANT MODIFY ON production.sales.transactions TO `sales-engineers`;
""")

Q4: What about real-time in a lakehouse?

# Real-time ingestion into lakehouse
# Using Spark Structured Streaming + Delta Lake

streaming_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "transactions") \
    .load()

# Parse and transform
parsed_df = streaming_df \
    .select(F.from_json(F.col("value").cast("string"), schema).alias("data")) \
    .select("data.*")

# Write to Delta Lake (micro-batch)
query = parsed_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://checkpoints/transactions/") \
    .start("s3://lakehouse/silver/transactions/")

💡

Interview Tip: When discussing data architecture, always ask about: (1) data volume, (2) query patterns, (3) latency requirements, (4) team skills, (5) budget. The "right" architecture depends on these factors. There's no one-size-fits-all solution.

Company-Specific Tips

Databricks Interview Tips

Emphasize Delta Lake and Lakehouse architecture
Discuss Medallion Architecture (Bronze/Silver/Gold)
Be ready to explain time travel and ACID on S3
Talk about Photon engine and performance optimization

Snowflake Interview Tips

Focus on data warehouse strengths
Discuss separation of storage and compute
Mention time travel and Zero-Copy Cloning
Talk about Snowpark for ML workloads

Netflix Interview Tips

Discuss content recommendation architecture
Explain multi-tenant data platforms
Mention cost optimization strategies
Talk about real-time personalization pipelines

ℹ️

Final Takeaway: The lakehouse is becoming the default choice for new data platforms. It offers the flexibility of data lakes with the governance and performance of warehouses. However, traditional warehouses still make sense for purely SQL workloads with strict governance requirements. The key is understanding your use case and choosing the architecture that fits.