🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

SCD Type 1 vs 2 vs 3: Complete Guide

Data EngineeringData Modeling⭐ Premium

Advertisement

Google & Meta Interview

SCD Type 1 vs 2 vs 3: Complete Guide

Handling historical changes in dimension attributes

Interview Question

"A customer moves from New York to San Francisco. Their segment changes from 'Premium' to 'Standard'. Design a slowly changing dimension strategy that: (1) tracks the customer's current location, (2) preserves their purchase history tied to the correct address, and (3) allows time-travel queries to see what their address was at any point in the past."

Difficulty: Medium-Hard | Frequently asked at Google, Meta, Netflix


Theoretical Foundation

What Are Slowly Changing Dimensions (SCDs)?

SCDs are dimension attributes that change infrequently but unpredictably over time. The challenge is deciding how to store these changes in a data warehouse.

Example scenarios:

  • Customer changes address
  • Product price increases
  • Employee department transfer
  • Store closes and reopens under new name

SCD Type 1: Overwrite

The simplest approach: overwrite the old value with the new value. No history is maintained.

BEFORE UPDATEcustomer_keynamecitysegment1001John DoeNew YorkPremiumAFTER UPDATE (SCD Type 1)customer_keynamecitysegment1001John DoeSan FranciscoStandardHistorical data: LOSTOld value (New York / Premium) is overwrittenNo audit trail, no time-travel capabilityBest for: typo fixes, non-critical attributes

Characteristics:

  • Simplest to implement
  • No history preserved
  • Lowest storage requirements
  • Fastest queries (no versioning logic)

When to use:

  • Attribute corrections (typo fixes)
  • Attributes where history doesn't matter
  • Real-time dashboards showing current state only

SCD Type 2: Add New Row

The most common approach for historical tracking: create a new row for each change, with effective dates.

BEFORE UPDATEcustomer_keynamecitysegmenteffectiveexpirycurrent1001John DoeNew YorkPremium2024-01-019999-12-31TRUEAFTER UPDATE (SCD Type 2)customer_keynamecitysegmenteffectiveexpirycurrent1001John DoeNew YorkPremium2024-01-012024-06-15FALSE1001John DoeSan FranciscoStandard2024-06-159999-12-31TRUEHistorical data: PRESERVEDFull version history with effective/expiry datesBest for: compliance, audit trails, time-travel

Characteristics:

  • Full history preserved
  • Each version gets a unique surrogate key
  • Requires effective_date and expiry_date (or is_current flag)
  • Storage grows with number of changes

When to use:

  • Compliance requirements (audit trail)
  • Time-travel analysis
  • When historical context matters

SCD Type 3: Add New Column

A middle ground: add a new column for the previous value, keeping only one generation of history.

BEFORE UPDATEcustomer_keynamecity_currentcity_previous1001John DoeNew YorkNULLAFTER UPDATE (SCD Type 3)customer_keynamecity_currentcity_previous1001John DoeSan FranciscoNew YorkHistorical data: PartialOnly previous value retained. Low storage overhead.

Characteristics:

  • Limited history (one or two generations)
  • Low storage overhead
  • Simple queries (no date range logic)
  • Limited to pre-defined number of history columns

When to use:

  • When only previous value matters
  • Low-cardinality attributes
  • When query simplicity is critical

SCD Type 4: History Table

A hybrid approach: keep current in main table, history in separate table.

CURRENT TABLEcustomer_keynamecitysegment1001John DoeSan FranciscoStandardHISTORY TABLEcustomer_keycitysegmentchanged_date1001New YorkPremium2024-01-011001San FranciscoStandard2024-06-15Type 4: Current + History TablesCurrent table stays small and fast. History preserved separately.Queries require UNION or separate lookups.

Characteristics:

  • Current table remains small and fast
  • History preserved in separate table
  • Queries require UNION or separate lookups
  • Good for very large dimensions

SCD Type 6: Hybrid (Type 1 + Type 2)

Combines Type 2 row versioning with Type 1 current value columns:

customer_keynamecity_currentcity_preveffectiveexpirycurrent_citycurrent_segment1001John DoeNew YorkNULL2024-01-012024-06-15San FranciscoStandard1001John DoeSan FranciscoNew York2024-06-159999-12-31San FranciscoStandardBest of both worlds: history + current access

ℹ️

Key Insight: Type 6 gives you the best of both worlds—historical rows for time-travel analysis AND a quick way to get current values without filtering. This is the most common pattern in modern data warehouses.

Mathematical Model for SCD Storage

For a dimension with n rows and c changes per row on average over t time periods:

Type 1 storage:

S1=n×r×sS_1 = n \times r \times s

Type 2 storage:

S2=n×(1+c)×r×sS_2 = n \times (1 + c) \times r \times s

Type 3 storage:

S3=n×r×s+n×h×sS_3 = n \times r \times s + n \times h \times s

where:

  • rr = number of regular columns
  • ss = average column size in bytes
  • hh = number of history columns

Example: 1M customers, 10 columns, 20 bytes average, 3 changes/customer:

  • Type 1: 1M×10×20=200MB1M \times 10 \times 20 = 200MB
  • Type 2: 1M×4×10×20=800MB1M \times 4 \times 10 \times 20 = 800MB (4x)
  • Type 3: 1M×10×20+1M×3×20=260MB1M \times 10 \times 20 + 1M \times 3 \times 20 = 260MB (1.3x)

⚠️

Common Interview Trap: Candidates often recommend Type 2 for everything because it "preserves history." But Type 2 has significant performance implications: table scans are slower, JOINs are more complex, and storage costs are higher. Always justify your choice based on business requirements.

Query Patterns for Each SCD Type

Type 1 - Current value only:

-- Simple: just read the table
SELECT customer_key, city, segment 
FROM dim_customer 
WHERE customer_key = 1001;

Type 2 - Current version:

-- Must filter for current row
SELECT customer_key, city, segment 
FROM dim_customer 
WHERE customer_key = 1001 
  AND is_current = TRUE;

-- Or using date range
SELECT customer_key, city, segment 
FROM dim_customer 
WHERE customer_key = 1001 
  AND effective_date <= CURRENT_DATE 
  AND expiry_date > CURRENT_DATE;

Type 2 - Point-in-time query:

-- What was the city on March 15, 2024?
SELECT customer_key, city, segment 
FROM dim_customer 
WHERE customer_key = 1001 
  AND effective_date <= '2024-03-15' 
  AND expiry_date > '2024-03-15';

Type 3 - Previous value:

-- Current and previous
SELECT customer_key, city_current, city_previous 
FROM dim_customer 
WHERE customer_key = 1001;

💡

Production Tip: For Type 2, always create a view that filters for current records. This prevents accidental analysis of historical rows and simplifies downstream queries:

CREATE VIEW dim_customer_current AS
SELECT * FROM dim_customer WHERE is_current = TRUE;

Code Implementation

SCD Type 1 Implementation

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SCD_Type1").getOrCreate()

# Read incoming data
incoming_customers = spark.read.parquet("s3://data-lake/staging/customers/")

# Read current dimension
dim_customer = spark.read.format("delta").load("s3://data-warehouse/dim_customer/")

# SCD Type 1: Overwrite existing records
# Join on business key to find matches
updated = incoming_customers.alias("src") \
    .join(dim_customer.alias("tgt"), 
          F.col("src.customer_id") == F.col("tgt.business_customer_id"), 
          "left") \
    .withColumn("is_match", F.col("tgt.business_customer_id").isNotNull()) \
    .select(
        F.when(F.col("is_match"), F.col("tgt.customer_key"))
         .otherwise(F.monotonically_increasing_id()).alias("customer_key"),
        F.col("src.customer_id").alias("business_customer_id"),
        F.col("src.customer_name"),
        F.col("src.city"),
        F.col("src.segment"),
        F.current_timestamp().alias("last_updated")
    )

# Merge (upsert) into dimension table
updated.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO dim_customer tgt
    USING updates src
    ON tgt.business_customer_id = src.business_customer_id
    WHEN MATCHED THEN UPDATE SET
        tgt.customer_name = src.customer_name,
        tgt.city = src.city,
        tgt.segment = src.segment,
        tgt.last_updated = src.last_updated
    WHEN NOT MATCHED THEN INSERT (
        customer_key,
        business_customer_id,
        customer_name,
        city,
        segment,
        last_updated
    ) VALUES (
        src.customer_key,
        src.business_customer_id,
        src.customer_name,
        src.city,
        src.segment,
        src.last_updated
    )
""")

SCD Type 2 Implementation

# SCD Type 2: Add new row for changes
incoming_customers = spark.read.parquet("s3://data-lake/staging/customers/")
dim_customer = spark.read.format("delta").load("s3://data-warehouse/dim_customer/")

# Identify changes
changes = incoming_customers.alias("src") \
    .join(dim_customer.filter(F.col("is_current") == True).alias("tgt"),
          F.col("src.customer_id") == F.col("tgt.business_customer_id"),
          "left") \
    .withColumn("is_changed",
        (F.col("tgt.business_customer_id").isNull()) |  # New record
        (F.col("src.customer_name") != F.col("tgt.customer_name")) |
        (F.col("src.city") != F.col("tgt.city")) |
        (F.col("src.segment") != F.col("tgt.segment"))
    ) \
    .filter(F.col("is_changed") == True)

# Expire old records
expired_records = changes.filter(F.col("tgt.business_customer_id").isNotNull()) \
    .select(
        F.col("tgt.customer_key"),
        F.current_timestamp().alias("expiry_date"),
        F.lit(False).alias("is_current")
    )

# Insert new records
new_records = changes.select(
    F.monotonically_increasing_id().alias("customer_key"),
    F.col("src.customer_id").alias("business_customer_id"),
    F.col("src.customer_name"),
    F.col("src.city"),
    F.col("src.segment"),
    F.current_timestamp().alias("effective_date"),
    F.lit(F.to_date("9999-12-31")).alias("expiry_date"),
    F.lit(True).alias("is_current")
)

# Apply changes using Delta Lake's MERGE
# Step 1: Expire old records
expired_records.createOrReplaceTempView("expired")
spark.sql("""
    MERGE INTO dim_customer tgt
    USING expired src
    ON tgt.customer_key = src.customer_key
    WHEN MATCHED THEN UPDATE SET
        tgt.expiry_date = src.expiry_date,
        tgt.is_current = src.is_current
""")

# Step 2: Insert new records
new_records.write.format("delta") \
    .mode("append") \
    .save("s3://data-warehouse/dim_customer/")

SCD Type 2 with dbt (Modern Approach)

-- models/dim_customer.sql
{{
  config(
    materialized='incremental',
    unique_key='customer_key',
    incremental_strategy='merge',
    merge_update_columns=['expiry_date', 'is_current']
  )
}}

WITH current_version AS (
    SELECT 
        customer_key,
        business_customer_id,
        customer_name,
        city,
        segment,
        effective_date,
        expiry_date,
        is_current
    FROM {{ ref('dim_customer') }}
    WHERE is_current = TRUE
),

incoming AS (
    SELECT 
        customer_id as business_customer_id,
        customer_name,
        city,
        segment
    FROM {{ source('staging', 'customers') }}
),

changes AS (
    SELECT 
        i.business_customer_id,
        i.customer_name,
        i.city,
        i.segment,
        CASE 
            WHEN c.customer_key IS NULL THEN 'INSERT'
            WHEN i.customer_name != c.customer_name 
              OR i.city != c.city 
              OR i.segment != c.segment THEN 'UPDATE'
            ELSE 'NO_CHANGE'
        END as change_type
    FROM incoming i
    LEFT JOIN current_version c 
        ON i.business_customer_id = c.business_customer_id
)

-- Records to expire (updates and deletes)
, to_expire AS (
    SELECT customer_key
    FROM current_version c
    WHERE EXISTS (
        SELECT 1 FROM changes ch 
        WHERE ch.business_customer_id = c.business_customer_id
        AND ch.change_type = 'UPDATE'
    )
)

-- New and updated records
, new_versions AS (
    SELECT 
        ROW_NUMBER() OVER (ORDER BY NOW()) + 
            COALESCE((SELECT MAX(customer_key) FROM {{ ref('dim_customer') }}), 0) 
            as customer_key,
        i.business_customer_id,
        i.customer_name,
        i.city,
        i.segment,
        CURRENT_TIMESTAMP as effective_date,
        TO_DATE('9999-12-31') as expiry_date,
        TRUE as is_current
    FROM incoming i
    WHERE EXISTS (
        SELECT 1 FROM changes ch 
        WHERE ch.business_customer_id = i.business_customer_id
        AND ch.change_type IN ('INSERT', 'UPDATE')
    )
)

SELECT * FROM new_versions

SCD Type 3 Implementation

# SCD Type 3: Add columns for previous values
incoming_customers = spark.read.parquet("s3://data-lake/staging/customers/")
dim_customer = spark.read.format("delta").load("s3://data-warehouse/dim_customer/")

# SCD Type 3 logic
updated = incoming_customers.alias("src") \
    .join(dim_customer.alias("tgt"),
          F.col("src.customer_id") == F.col("tgt.business_customer_id"),
          "left") \
    .select(
        F.col("tgt.customer_key"),
        F.col("src.customer_id").alias("business_customer_id"),
        F.col("src.customer_name"),
        # For city: move current to previous, set new as current
        F.when(F.col("tgt.business_customer_id").isNotNull(),
            F.col("tgt.city_current")  # Keep old previous
        ).alias("city_previous_old"),
        F.col("src.city").alias("city_current"),
        F.when(F.col("tgt.business_customer_id").isNotNull(),
            F.coalesce(F.col("tgt.city_current"), F.lit(None))
        ).alias("city_previous"),
        # Same logic for segment
        F.when(F.col("tgt.business_customer_id").isNotNull(),
            F.col("tgt.segment_current")
        ).alias("segment_previous_old"),
        F.col("src.segment").alias("segment_current"),
        F.when(F.col("tgt.business_customer_id").isNotNull(),
            F.coalesce(F.col("tgt.segment_current"), F.lit(None))
        ).alias("segment_previous"),
        F.current_timestamp().alias("last_updated")
    )

SCD Type 6 (Hybrid) Implementation

# SCD Type 6: Type 2 rows + Type 1 current columns
incoming_customers = spark.read.parquet("s3://data-lake/staging/customers/")
dim_customer = spark.read.format("delta").load("s3://data-warehouse/dim_customer/")

# Get current values for quick access
current_values = dim_customer.filter(F.col("is_current") == True) \
    .select(
        F.col("business_customer_id"),
        F.col("city").alias("current_city"),
        F.col("segment").alias("current_segment")
    )

# Apply SCD Type 2 logic (same as before)
# ... (Type 2 merge logic)

# Then update current values columns
updated_with_current = updated.alias("new") \
    .join(current_values.alias("cv"),
          F.col("new.business_customer_id") == F.col("cv.business_customer_id"),
          "left") \
    .withColumn("current_city", 
        F.when(F.col("new.is_current"), F.col("new.city"))
         .otherwise(F.col("cv.current_city"))) \
    .withColumn("current_segment",
        F.when(F.col("new.is_current"), F.col("new.segment"))
         .otherwise(F.col("cv.current_segment")))

ℹ️

Best Practice: Use Delta Lake or Apache Iceberg for SCD Type 2. These formats support:

  1. Time travel (built-in history)
  2. MERGE operations (atomic upserts)
  3. Schema evolution (add columns without rewriting)
  4. Partition pruning (efficient queries)

Advanced: SCD with Late-Arriving Data

# Handle late-arriving data in SCD Type 2
def handle_late_arriving_scd(incoming, dim_table, max_lag_days=30):
    """
    Late-arriving data: records arrive after their effective date
    Strategy: Gap detection and backfill
    """
    
    # Find records that should have been inserted earlier
    late_records = incoming.alias("src") \
        .join(dim_table.alias("tgt"),
              (F.col("src.customer_id") == F.col("tgt.business_customer_id")) &
              (F.col("src.effective_date") < F.col("tgt.effective_date")),
              "left") \
        .filter(F.col("tgt.customer_key").isNull()) \
        .filter(F.datediff(F.current_date(), F.col("src.effective_date")) <= max_lag_days)
    
    # For late records, insert with appropriate effective dates
    # and adjust expiry dates of overlapping records
    return late_records

⚠️

Critical Issue: Late-arriving data is one of the hardest problems in SCD management. Always have a clear policy: how far back will you accept changes? What happens to aggregated reports when historical data changes?


Common Follow-Up Questions

Q1: How do you handle SCD for rapidly changing attributes?

For attributes that change frequently (e.g., stock price, user session data):

  • Don't use SCD Type 2 - it would create too many rows
  • Use SCD Type 1 for the latest value
  • Store history in a separate fact table (e.g., fact_price_history)
  • Consider time-series databases (InfluxDB, TimescaleDB)

Q2: How do you handle SCD in real-time pipelines?

# Real-time SCD Type 2 using Kafka + Flink
from pyflink.table import EnvironmentSettings, TableEnvironment

env_settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(env_settings)

# Define SCD Type 2 logic in Flink SQL
t_env.execute_sql("""
    CREATE TABLE dim_customer_scd (
        customer_key BIGINT,
        customer_id STRING,
        customer_name STRING,
        city STRING,
        effective TIMESTAMP(3),
        expiry TIMESTAMP(3),
        is_current BOOLEAN,
        PRIMARY KEY (customer_key) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://...',
        'table-name' = 'dim_customer'
    );
    
    -- Kafka source with CDC
    CREATE TABLE customers_cdc (
        customer_id STRING,
        customer_name STRING,
        city STRING,
        op_type STRING,
        op_ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'customers',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'debezium-json'
    );
    
    -- SCD Type 2 merge logic
    INSERT INTO dim_customer_scd
    SELECT 
        ROW_NUMBER() OVER (ORDER BY op_ts) + COALESCE(max_key, 0),
        customer_id, customer_name, city,
        op_ts as effective,
        TO_TIMESTAMP('9999-12-31') as expiry,
        TRUE as is_current
    FROM customers_cdc
    WHERE op_type = 'c' OR op_type = 'u';
""")

Q3: How do you test SCD logic?

# Unit tests for SCD Type 2
def test_scd_type2_insert():
    """Test that new records are inserted correctly"""
    initial = create_dataframe([
        {"id": 1, "name": "John", "city": "NY", "is_current": True}
    ])
    incoming = create_dataframe([
        {"id": 1, "name": "John", "city": "SF"}  # Change
    ])
    
    result = apply_scd_type2(initial, incoming)
    
    # Should have 2 rows
    assert result.count() == 2
    # Old record should be expired
    assert result.filter("is_current = False").count() == 1
    # New record should be current
    assert result.filter("is_current = True AND city = 'SF'").count() == 1

def test_scd_type2_no_change():
    """Test that unchanged records are not modified"""
    initial = create_dataframe([
        {"id": 1, "name": "John", "city": "NY", "is_current": True}
    ])
    incoming = create_dataframe([
        {"id": 1, "name": "John", "city": "NY"}  # No change
    ])
    
    result = apply_scd_type2(initial, incoming)
    
    # Should still have 1 row
    assert result.count() == 1
    assert result.filter("is_current = True").count() == 1

Q4: How do you handle SCD in a data mesh architecture?

In a data mesh, each domain owns its dimensions:

  • Domain teams implement SCD for their domains
  • Central platform provides SCD templates and tools
  • Cross-domain queries use conformed dimensions
  • Data contracts define SCD behavior between domains

💡

Interview Tip: When discussing SCDs, always mention the business context. Type 2 is great for compliance but expensive. Type 1 is cheap but loses history. The right choice depends on: (1) regulatory requirements, (2) query patterns, (3) data volume, and (4) latency requirements.


Company-Specific Tips

Google Interview Tips

  • Discuss BigQuery's time-travel feature as an alternative to SCD Type 2
  • Mention partitioned tables for efficient SCD queries
  • Be ready to discuss SCD in data pipelines (Dataflow, Pub/Sub)
  • Talk about cost implications of each SCD type in cloud warehouses

Meta Interview Tips

  • Focus on user dimension changes (profile updates, privacy settings)
  • Discuss real-time SCD for ad targeting
  • Mention data governance requirements for user data
  • Talk about schema evolution in Hive/Delta Lake

Netflix Interview Tips

  • Discuss content metadata changes (genre reclassification, ratings)
  • Explain how viewing history affects SCD design
  • Mention A/B testing dimensions and their SCD requirements
  • Talk about cost optimization for large content catalogs

ℹ️

Final Takeaway: SCD Type 2 is the most commonly used in production, but it's not always the right choice. Start with business requirements: Do you need history? How much? What's the query pattern? Then choose the simplest SCD type that meets those requirements.

Advertisement