Google & Meta Interview

SCD Type 1 vs 2 vs 3: Complete Guide

Handling historical changes in dimension attributes

Interview Question

"A customer moves from New York to San Francisco. Their segment changes from 'Premium' to 'Standard'. Design a slowly changing dimension strategy that: (1) tracks the customer's current location, (2) preserves their purchase history tied to the correct address, and (3) allows time-travel queries to see what their address was at any point in the past."

Difficulty: Medium-Hard | Frequently asked at Google, Meta, Netflix

Theoretical Foundation

What Are Slowly Changing Dimensions (SCDs)?

SCDs are dimension attributes that change infrequently but unpredictably over time. The challenge is deciding how to store these changes in a data warehouse.

Example scenarios:

Customer changes address
Product price increases
Employee department transfer
Store closes and reopens under new name

SCD Type 1: Overwrite

The simplest approach: overwrite the old value with the new value. No history is maintained.

Characteristics:

Simplest to implement
No history preserved
Lowest storage requirements
Fastest queries (no versioning logic)

When to use:

Attribute corrections (typo fixes)
Attributes where history doesn't matter
Real-time dashboards showing current state only

SCD Type 2: Add New Row

The most common approach for historical tracking: create a new row for each change, with effective dates.

Characteristics:

Full history preserved
Each version gets a unique surrogate key
Requires effective_date and expiry_date (or is_current flag)
Storage grows with number of changes

When to use:

Compliance requirements (audit trail)
Time-travel analysis
When historical context matters

SCD Type 3: Add New Column

A middle ground: add a new column for the previous value, keeping only one generation of history.

Characteristics:

Limited history (one or two generations)
Low storage overhead
Simple queries (no date range logic)
Limited to pre-defined number of history columns

When to use:

When only previous value matters
Low-cardinality attributes
When query simplicity is critical

SCD Type 4: History Table

A hybrid approach: keep current in main table, history in separate table.

Characteristics:

Current table remains small and fast
History preserved in separate table
Queries require UNION or separate lookups
Good for very large dimensions

SCD Type 6: Hybrid (Type 1 + Type 2)

Combines Type 2 row versioning with Type 1 current value columns:

ℹ️

Key Insight: Type 6 gives you the best of both worlds—historical rows for time-travel analysis AND a quick way to get current values without filtering. This is the most common pattern in modern data warehouses.

Mathematical Model for SCD Storage

For a dimension with n rows and c changes per row on average over t time periods:

Type 1 storage:

S_1 = n \times r \times s

Type 2 storage:

S_2 = n \times (1 + c) \times r \times s

Type 3 storage:

S_3 = n \times r \times s + n \times h \times s

where:

$r$ = number of regular columns
$s$ = average column size in bytes
$h$ = number of history columns

Example: 1M customers, 10 columns, 20 bytes average, 3 changes/customer:

Type 1: $1M \times 10 \times 20 = 200MB$
Type 2: $1M \times 4 \times 10 \times 20 = 800MB$ (4x)
Type 3: $1M \times 10 \times 20 + 1M \times 3 \times 20 = 260MB$ (1.3x)

⚠️

Common Interview Trap: Candidates often recommend Type 2 for everything because it "preserves history." But Type 2 has significant performance implications: table scans are slower, JOINs are more complex, and storage costs are higher. Always justify your choice based on business requirements.

Query Patterns for Each SCD Type

Type 1 - Current value only:

-- Simple: just read the table
SELECT customer_key, city, segment 
FROM dim_customer 
WHERE customer_key = 1001;

Type 2 - Current version:

-- Must filter for current row
SELECT customer_key, city, segment 
FROM dim_customer 
WHERE customer_key = 1001 
  AND is_current = TRUE;

-- Or using date range
SELECT customer_key, city, segment 
FROM dim_customer 
WHERE customer_key = 1001 
  AND effective_date <= CURRENT_DATE 
  AND expiry_date > CURRENT_DATE;

Type 2 - Point-in-time query:

-- What was the city on March 15, 2024?
SELECT customer_key, city, segment 
FROM dim_customer 
WHERE customer_key = 1001 
  AND effective_date <= '2024-03-15' 
  AND expiry_date > '2024-03-15';

Type 3 - Previous value:

-- Current and previous
SELECT customer_key, city_current, city_previous 
FROM dim_customer 
WHERE customer_key = 1001;

💡

Production Tip: For Type 2, always create a view that filters for current records. This prevents accidental analysis of historical rows and simplifies downstream queries:

CREATE VIEW dim_customer_current AS
SELECT * FROM dim_customer WHERE is_current = TRUE;

Code Implementation

SCD Type 1 Implementation

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SCD_Type1").getOrCreate()

# Read incoming data
incoming_customers = spark.read.parquet("s3://data-lake/staging/customers/")

# Read current dimension
dim_customer = spark.read.format("delta").load("s3://data-warehouse/dim_customer/")

# SCD Type 1: Overwrite existing records
# Join on business key to find matches
updated = incoming_customers.alias("src") \
    .join(dim_customer.alias("tgt"), 
          F.col("src.customer_id") == F.col("tgt.business_customer_id"), 
          "left") \
    .withColumn("is_match", F.col("tgt.business_customer_id").isNotNull()) \
    .select(
        F.when(F.col("is_match"), F.col("tgt.customer_key"))
         .otherwise(F.monotonically_increasing_id()).alias("customer_key"),
        F.col("src.customer_id").alias("business_customer_id"),
        F.col("src.customer_name"),
        F.col("src.city"),
        F.col("src.segment"),
        F.current_timestamp().alias("last_updated")
    )

# Merge (upsert) into dimension table
updated.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO dim_customer tgt
    USING updates src
    ON tgt.business_customer_id = src.business_customer_id
    WHEN MATCHED THEN UPDATE SET
        tgt.customer_name = src.customer_name,
        tgt.city = src.city,
        tgt.segment = src.segment,
        tgt.last_updated = src.last_updated
    WHEN NOT MATCHED THEN INSERT (
        customer_key,
        business_customer_id,
        customer_name,
        city,
        segment,
        last_updated
    ) VALUES (
        src.customer_key,
        src.business_customer_id,
        src.customer_name,
        src.city,
        src.segment,
        src.last_updated
    )
""")

SCD Type 2 Implementation

# SCD Type 2: Add new row for changes
incoming_customers = spark.read.parquet("s3://data-lake/staging/customers/")
dim_customer = spark.read.format("delta").load("s3://data-warehouse/dim_customer/")

# Identify changes
changes = incoming_customers.alias("src") \
    .join(dim_customer.filter(F.col("is_current") == True).alias("tgt"),
          F.col("src.customer_id") == F.col("tgt.business_customer_id"),
          "left") \
    .withColumn("is_changed",
        (F.col("tgt.business_customer_id").isNull()) |  # New record
        (F.col("src.customer_name") != F.col("tgt.customer_name")) |
        (F.col("src.city") != F.col("tgt.city")) |
        (F.col("src.segment") != F.col("tgt.segment"))
    ) \
    .filter(F.col("is_changed") == True)

# Expire old records
expired_records = changes.filter(F.col("tgt.business_customer_id").isNotNull()) \
    .select(
        F.col("tgt.customer_key"),
        F.current_timestamp().alias("expiry_date"),
        F.lit(False).alias("is_current")
    )

# Insert new records
new_records = changes.select(
    F.monotonically_increasing_id().alias("customer_key"),
    F.col("src.customer_id").alias("business_customer_id"),
    F.col("src.customer_name"),
    F.col("src.city"),
    F.col("src.segment"),
    F.current_timestamp().alias("effective_date"),
    F.lit(F.to_date("9999-12-31")).alias("expiry_date"),
    F.lit(True).alias("is_current")
)

# Apply changes using Delta Lake's MERGE
# Step 1: Expire old records
expired_records.createOrReplaceTempView("expired")
spark.sql("""
    MERGE INTO dim_customer tgt
    USING expired src
    ON tgt.customer_key = src.customer_key
    WHEN MATCHED THEN UPDATE SET
        tgt.expiry_date = src.expiry_date,
        tgt.is_current = src.is_current
""")

# Step 2: Insert new records
new_records.write.format("delta") \
    .mode("append") \
    .save("s3://data-warehouse/dim_customer/")

SCD Type 2 with dbt (Modern Approach)

-- models/dim_customer.sql
{{
  config(
    materialized='incremental',
    unique_key='customer_key',
    incremental_strategy='merge',
    merge_update_columns=['expiry_date', 'is_current']
  )
}}

WITH current_version AS (
    SELECT 
        customer_key,
        business_customer_id,
        customer_name,
        city,
        segment,
        effective_date,
        expiry_date,
        is_current
    FROM {{ ref('dim_customer') }}
    WHERE is_current = TRUE
),

incoming AS (
    SELECT 
        customer_id as business_customer_id,
        customer_name,
        city,
        segment
    FROM {{ source('staging', 'customers') }}
),

changes AS (
    SELECT 
        i.business_customer_id,
        i.customer_name,
        i.city,
        i.segment,
        CASE 
            WHEN c.customer_key IS NULL THEN 'INSERT'
            WHEN i.customer_name != c.customer_name 
              OR i.city != c.city 
              OR i.segment != c.segment THEN 'UPDATE'
            ELSE 'NO_CHANGE'
        END as change_type
    FROM incoming i
    LEFT JOIN current_version c 
        ON i.business_customer_id = c.business_customer_id
)

-- Records to expire (updates and deletes)
, to_expire AS (
    SELECT customer_key
    FROM current_version c
    WHERE EXISTS (
        SELECT 1 FROM changes ch 
        WHERE ch.business_customer_id = c.business_customer_id
        AND ch.change_type = 'UPDATE'
    )
)

-- New and updated records
, new_versions AS (
    SELECT 
        ROW_NUMBER() OVER (ORDER BY NOW()) + 
            COALESCE((SELECT MAX(customer_key) FROM {{ ref('dim_customer') }}), 0) 
            as customer_key,
        i.business_customer_id,
        i.customer_name,
        i.city,
        i.segment,
        CURRENT_TIMESTAMP as effective_date,
        TO_DATE('9999-12-31') as expiry_date,
        TRUE as is_current
    FROM incoming i
    WHERE EXISTS (
        SELECT 1 FROM changes ch 
        WHERE ch.business_customer_id = i.business_customer_id
        AND ch.change_type IN ('INSERT', 'UPDATE')
    )
)

SELECT * FROM new_versions

SCD Type 3 Implementation

# SCD Type 3: Add columns for previous values
incoming_customers = spark.read.parquet("s3://data-lake/staging/customers/")
dim_customer = spark.read.format("delta").load("s3://data-warehouse/dim_customer/")

# SCD Type 3 logic
updated = incoming_customers.alias("src") \
    .join(dim_customer.alias("tgt"),
          F.col("src.customer_id") == F.col("tgt.business_customer_id"),
          "left") \
    .select(
        F.col("tgt.customer_key"),
        F.col("src.customer_id").alias("business_customer_id"),
        F.col("src.customer_name"),
        # For city: move current to previous, set new as current
        F.when(F.col("tgt.business_customer_id").isNotNull(),
            F.col("tgt.city_current")  # Keep old previous
        ).alias("city_previous_old"),
        F.col("src.city").alias("city_current"),
        F.when(F.col("tgt.business_customer_id").isNotNull(),
            F.coalesce(F.col("tgt.city_current"), F.lit(None))
        ).alias("city_previous"),
        # Same logic for segment
        F.when(F.col("tgt.business_customer_id").isNotNull(),
            F.col("tgt.segment_current")
        ).alias("segment_previous_old"),
        F.col("src.segment").alias("segment_current"),
        F.when(F.col("tgt.business_customer_id").isNotNull(),
            F.coalesce(F.col("tgt.segment_current"), F.lit(None))
        ).alias("segment_previous"),
        F.current_timestamp().alias("last_updated")
    )

SCD Type 6 (Hybrid) Implementation

# SCD Type 6: Type 2 rows + Type 1 current columns
incoming_customers = spark.read.parquet("s3://data-lake/staging/customers/")
dim_customer = spark.read.format("delta").load("s3://data-warehouse/dim_customer/")

# Get current values for quick access
current_values = dim_customer.filter(F.col("is_current") == True) \
    .select(
        F.col("business_customer_id"),
        F.col("city").alias("current_city"),
        F.col("segment").alias("current_segment")
    )

# Apply SCD Type 2 logic (same as before)
# ... (Type 2 merge logic)

# Then update current values columns
updated_with_current = updated.alias("new") \
    .join(current_values.alias("cv"),
          F.col("new.business_customer_id") == F.col("cv.business_customer_id"),
          "left") \
    .withColumn("current_city", 
        F.when(F.col("new.is_current"), F.col("new.city"))
         .otherwise(F.col("cv.current_city"))) \
    .withColumn("current_segment",
        F.when(F.col("new.is_current"), F.col("new.segment"))
         .otherwise(F.col("cv.current_segment")))

ℹ️

Best Practice: Use Delta Lake or Apache Iceberg for SCD Type 2. These formats support:

Time travel (built-in history)
MERGE operations (atomic upserts)
Schema evolution (add columns without rewriting)
Partition pruning (efficient queries)

Advanced: SCD with Late-Arriving Data

# Handle late-arriving data in SCD Type 2
def handle_late_arriving_scd(incoming, dim_table, max_lag_days=30):
    """
    Late-arriving data: records arrive after their effective date
    Strategy: Gap detection and backfill
    """
    
    # Find records that should have been inserted earlier
    late_records = incoming.alias("src") \
        .join(dim_table.alias("tgt"),
              (F.col("src.customer_id") == F.col("tgt.business_customer_id")) &
              (F.col("src.effective_date") < F.col("tgt.effective_date")),
              "left") \
        .filter(F.col("tgt.customer_key").isNull()) \
        .filter(F.datediff(F.current_date(), F.col("src.effective_date")) <= max_lag_days)
    
    # For late records, insert with appropriate effective dates
    # and adjust expiry dates of overlapping records
    return late_records

⚠️

Critical Issue: Late-arriving data is one of the hardest problems in SCD management. Always have a clear policy: how far back will you accept changes? What happens to aggregated reports when historical data changes?

Common Follow-Up Questions

Q1: How do you handle SCD for rapidly changing attributes?

For attributes that change frequently (e.g., stock price, user session data):

Don't use SCD Type 2 - it would create too many rows
Use SCD Type 1 for the latest value
Store history in a separate fact table (e.g., fact_price_history)
Consider time-series databases (InfluxDB, TimescaleDB)

Q2: How do you handle SCD in real-time pipelines?

# Real-time SCD Type 2 using Kafka + Flink
from pyflink.table import EnvironmentSettings, TableEnvironment

env_settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(env_settings)

# Define SCD Type 2 logic in Flink SQL
t_env.execute_sql("""
    CREATE TABLE dim_customer_scd (
        customer_key BIGINT,
        customer_id STRING,
        customer_name STRING,
        city STRING,
        effective TIMESTAMP(3),
        expiry TIMESTAMP(3),
        is_current BOOLEAN,
        PRIMARY KEY (customer_key) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://...',
        'table-name' = 'dim_customer'
    );
    
    -- Kafka source with CDC
    CREATE TABLE customers_cdc (
        customer_id STRING,
        customer_name STRING,
        city STRING,
        op_type STRING,
        op_ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'customers',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'debezium-json'
    );
    
    -- SCD Type 2 merge logic
    INSERT INTO dim_customer_scd
    SELECT 
        ROW_NUMBER() OVER (ORDER BY op_ts) + COALESCE(max_key, 0),
        customer_id, customer_name, city,
        op_ts as effective,
        TO_TIMESTAMP('9999-12-31') as expiry,
        TRUE as is_current
    FROM customers_cdc
    WHERE op_type = 'c' OR op_type = 'u';
""")

Q3: How do you test SCD logic?

# Unit tests for SCD Type 2
def test_scd_type2_insert():
    """Test that new records are inserted correctly"""
    initial = create_dataframe([
        {"id": 1, "name": "John", "city": "NY", "is_current": True}
    ])
    incoming = create_dataframe([
        {"id": 1, "name": "John", "city": "SF"}  # Change
    ])
    
    result = apply_scd_type2(initial, incoming)
    
    # Should have 2 rows
    assert result.count() == 2
    # Old record should be expired
    assert result.filter("is_current = False").count() == 1
    # New record should be current
    assert result.filter("is_current = True AND city = 'SF'").count() == 1

def test_scd_type2_no_change():
    """Test that unchanged records are not modified"""
    initial = create_dataframe([
        {"id": 1, "name": "John", "city": "NY", "is_current": True}
    ])
    incoming = create_dataframe([
        {"id": 1, "name": "John", "city": "NY"}  # No change
    ])
    
    result = apply_scd_type2(initial, incoming)
    
    # Should still have 1 row
    assert result.count() == 1
    assert result.filter("is_current = True").count() == 1

Q4: How do you handle SCD in a data mesh architecture?

In a data mesh, each domain owns its dimensions:

Domain teams implement SCD for their domains
Central platform provides SCD templates and tools
Cross-domain queries use conformed dimensions
Data contracts define SCD behavior between domains

💡

Interview Tip: When discussing SCDs, always mention the business context. Type 2 is great for compliance but expensive. Type 1 is cheap but loses history. The right choice depends on: (1) regulatory requirements, (2) query patterns, (3) data volume, and (4) latency requirements.

Company-Specific Tips

Google Interview Tips

Discuss BigQuery's time-travel feature as an alternative to SCD Type 2
Mention partitioned tables for efficient SCD queries
Be ready to discuss SCD in data pipelines (Dataflow, Pub/Sub)
Talk about cost implications of each SCD type in cloud warehouses

Meta Interview Tips

Focus on user dimension changes (profile updates, privacy settings)
Discuss real-time SCD for ad targeting
Mention data governance requirements for user data
Talk about schema evolution in Hive/Delta Lake

Netflix Interview Tips

Discuss content metadata changes (genre reclassification, ratings)
Explain how viewing history affects SCD design
Mention A/B testing dimensions and their SCD requirements
Talk about cost optimization for large content catalogs

ℹ️

Final Takeaway: SCD Type 2 is the most commonly used in production, but it's not always the right choice. Start with business requirements: Do you need history? How much? What's the query pattern? Then choose the simplest SCD type that meets those requirements.