SCD Type 1 vs 2 vs 3: Complete Guide
Handling historical changes in dimension attributes
Interview Question
"A customer moves from New York to San Francisco. Their segment changes from 'Premium' to 'Standard'. Design a slowly changing dimension strategy that: (1) tracks the customer's current location, (2) preserves their purchase history tied to the correct address, and (3) allows time-travel queries to see what their address was at any point in the past."
Difficulty: Medium-Hard | Frequently asked at Google, Meta, Netflix
Theoretical Foundation
What Are Slowly Changing Dimensions (SCDs)?
SCDs are dimension attributes that change infrequently but unpredictably over time. The challenge is deciding how to store these changes in a data warehouse.
Example scenarios:
- Customer changes address
- Product price increases
- Employee department transfer
- Store closes and reopens under new name
SCD Type 1: Overwrite
The simplest approach: overwrite the old value with the new value. No history is maintained.
Characteristics:
- Simplest to implement
- No history preserved
- Lowest storage requirements
- Fastest queries (no versioning logic)
When to use:
- Attribute corrections (typo fixes)
- Attributes where history doesn't matter
- Real-time dashboards showing current state only
SCD Type 2: Add New Row
The most common approach for historical tracking: create a new row for each change, with effective dates.
Characteristics:
- Full history preserved
- Each version gets a unique surrogate key
- Requires effective_date and expiry_date (or is_current flag)
- Storage grows with number of changes
When to use:
- Compliance requirements (audit trail)
- Time-travel analysis
- When historical context matters
SCD Type 3: Add New Column
A middle ground: add a new column for the previous value, keeping only one generation of history.
Characteristics:
- Limited history (one or two generations)
- Low storage overhead
- Simple queries (no date range logic)
- Limited to pre-defined number of history columns
When to use:
- When only previous value matters
- Low-cardinality attributes
- When query simplicity is critical
SCD Type 4: History Table
A hybrid approach: keep current in main table, history in separate table.
Characteristics:
- Current table remains small and fast
- History preserved in separate table
- Queries require UNION or separate lookups
- Good for very large dimensions
SCD Type 6: Hybrid (Type 1 + Type 2)
Combines Type 2 row versioning with Type 1 current value columns:
ℹ️
Key Insight: Type 6 gives you the best of both worlds—historical rows for time-travel analysis AND a quick way to get current values without filtering. This is the most common pattern in modern data warehouses.
Mathematical Model for SCD Storage
For a dimension with n rows and c changes per row on average over t time periods:
Type 1 storage:
Type 2 storage:
Type 3 storage:
where:
- = number of regular columns
- = average column size in bytes
- = number of history columns
Example: 1M customers, 10 columns, 20 bytes average, 3 changes/customer:
- Type 1:
- Type 2: (4x)
- Type 3: (1.3x)
⚠️
Common Interview Trap: Candidates often recommend Type 2 for everything because it "preserves history." But Type 2 has significant performance implications: table scans are slower, JOINs are more complex, and storage costs are higher. Always justify your choice based on business requirements.
Query Patterns for Each SCD Type
Type 1 - Current value only:
-- Simple: just read the table
SELECT customer_key, city, segment
FROM dim_customer
WHERE customer_key = 1001;
Type 2 - Current version:
-- Must filter for current row
SELECT customer_key, city, segment
FROM dim_customer
WHERE customer_key = 1001
AND is_current = TRUE;
-- Or using date range
SELECT customer_key, city, segment
FROM dim_customer
WHERE customer_key = 1001
AND effective_date <= CURRENT_DATE
AND expiry_date > CURRENT_DATE;
Type 2 - Point-in-time query:
-- What was the city on March 15, 2024?
SELECT customer_key, city, segment
FROM dim_customer
WHERE customer_key = 1001
AND effective_date <= '2024-03-15'
AND expiry_date > '2024-03-15';
Type 3 - Previous value:
-- Current and previous
SELECT customer_key, city_current, city_previous
FROM dim_customer
WHERE customer_key = 1001;
💡
Production Tip: For Type 2, always create a view that filters for current records. This prevents accidental analysis of historical rows and simplifies downstream queries:
CREATE VIEW dim_customer_current AS
SELECT * FROM dim_customer WHERE is_current = TRUE;
Code Implementation
SCD Type 1 Implementation
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("SCD_Type1").getOrCreate()
# Read incoming data
incoming_customers = spark.read.parquet("s3://data-lake/staging/customers/")
# Read current dimension
dim_customer = spark.read.format("delta").load("s3://data-warehouse/dim_customer/")
# SCD Type 1: Overwrite existing records
# Join on business key to find matches
updated = incoming_customers.alias("src") \
.join(dim_customer.alias("tgt"),
F.col("src.customer_id") == F.col("tgt.business_customer_id"),
"left") \
.withColumn("is_match", F.col("tgt.business_customer_id").isNotNull()) \
.select(
F.when(F.col("is_match"), F.col("tgt.customer_key"))
.otherwise(F.monotonically_increasing_id()).alias("customer_key"),
F.col("src.customer_id").alias("business_customer_id"),
F.col("src.customer_name"),
F.col("src.city"),
F.col("src.segment"),
F.current_timestamp().alias("last_updated")
)
# Merge (upsert) into dimension table
updated.createOrReplaceTempView("updates")
spark.sql("""
MERGE INTO dim_customer tgt
USING updates src
ON tgt.business_customer_id = src.business_customer_id
WHEN MATCHED THEN UPDATE SET
tgt.customer_name = src.customer_name,
tgt.city = src.city,
tgt.segment = src.segment,
tgt.last_updated = src.last_updated
WHEN NOT MATCHED THEN INSERT (
customer_key,
business_customer_id,
customer_name,
city,
segment,
last_updated
) VALUES (
src.customer_key,
src.business_customer_id,
src.customer_name,
src.city,
src.segment,
src.last_updated
)
""")
SCD Type 2 Implementation
# SCD Type 2: Add new row for changes
incoming_customers = spark.read.parquet("s3://data-lake/staging/customers/")
dim_customer = spark.read.format("delta").load("s3://data-warehouse/dim_customer/")
# Identify changes
changes = incoming_customers.alias("src") \
.join(dim_customer.filter(F.col("is_current") == True).alias("tgt"),
F.col("src.customer_id") == F.col("tgt.business_customer_id"),
"left") \
.withColumn("is_changed",
(F.col("tgt.business_customer_id").isNull()) | # New record
(F.col("src.customer_name") != F.col("tgt.customer_name")) |
(F.col("src.city") != F.col("tgt.city")) |
(F.col("src.segment") != F.col("tgt.segment"))
) \
.filter(F.col("is_changed") == True)
# Expire old records
expired_records = changes.filter(F.col("tgt.business_customer_id").isNotNull()) \
.select(
F.col("tgt.customer_key"),
F.current_timestamp().alias("expiry_date"),
F.lit(False).alias("is_current")
)
# Insert new records
new_records = changes.select(
F.monotonically_increasing_id().alias("customer_key"),
F.col("src.customer_id").alias("business_customer_id"),
F.col("src.customer_name"),
F.col("src.city"),
F.col("src.segment"),
F.current_timestamp().alias("effective_date"),
F.lit(F.to_date("9999-12-31")).alias("expiry_date"),
F.lit(True).alias("is_current")
)
# Apply changes using Delta Lake's MERGE
# Step 1: Expire old records
expired_records.createOrReplaceTempView("expired")
spark.sql("""
MERGE INTO dim_customer tgt
USING expired src
ON tgt.customer_key = src.customer_key
WHEN MATCHED THEN UPDATE SET
tgt.expiry_date = src.expiry_date,
tgt.is_current = src.is_current
""")
# Step 2: Insert new records
new_records.write.format("delta") \
.mode("append") \
.save("s3://data-warehouse/dim_customer/")
SCD Type 2 with dbt (Modern Approach)
-- models/dim_customer.sql
{{
config(
materialized='incremental',
unique_key='customer_key',
incremental_strategy='merge',
merge_update_columns=['expiry_date', 'is_current']
)
}}
WITH current_version AS (
SELECT
customer_key,
business_customer_id,
customer_name,
city,
segment,
effective_date,
expiry_date,
is_current
FROM {{ ref('dim_customer') }}
WHERE is_current = TRUE
),
incoming AS (
SELECT
customer_id as business_customer_id,
customer_name,
city,
segment
FROM {{ source('staging', 'customers') }}
),
changes AS (
SELECT
i.business_customer_id,
i.customer_name,
i.city,
i.segment,
CASE
WHEN c.customer_key IS NULL THEN 'INSERT'
WHEN i.customer_name != c.customer_name
OR i.city != c.city
OR i.segment != c.segment THEN 'UPDATE'
ELSE 'NO_CHANGE'
END as change_type
FROM incoming i
LEFT JOIN current_version c
ON i.business_customer_id = c.business_customer_id
)
-- Records to expire (updates and deletes)
, to_expire AS (
SELECT customer_key
FROM current_version c
WHERE EXISTS (
SELECT 1 FROM changes ch
WHERE ch.business_customer_id = c.business_customer_id
AND ch.change_type = 'UPDATE'
)
)
-- New and updated records
, new_versions AS (
SELECT
ROW_NUMBER() OVER (ORDER BY NOW()) +
COALESCE((SELECT MAX(customer_key) FROM {{ ref('dim_customer') }}), 0)
as customer_key,
i.business_customer_id,
i.customer_name,
i.city,
i.segment,
CURRENT_TIMESTAMP as effective_date,
TO_DATE('9999-12-31') as expiry_date,
TRUE as is_current
FROM incoming i
WHERE EXISTS (
SELECT 1 FROM changes ch
WHERE ch.business_customer_id = i.business_customer_id
AND ch.change_type IN ('INSERT', 'UPDATE')
)
)
SELECT * FROM new_versions
SCD Type 3 Implementation
# SCD Type 3: Add columns for previous values
incoming_customers = spark.read.parquet("s3://data-lake/staging/customers/")
dim_customer = spark.read.format("delta").load("s3://data-warehouse/dim_customer/")
# SCD Type 3 logic
updated = incoming_customers.alias("src") \
.join(dim_customer.alias("tgt"),
F.col("src.customer_id") == F.col("tgt.business_customer_id"),
"left") \
.select(
F.col("tgt.customer_key"),
F.col("src.customer_id").alias("business_customer_id"),
F.col("src.customer_name"),
# For city: move current to previous, set new as current
F.when(F.col("tgt.business_customer_id").isNotNull(),
F.col("tgt.city_current") # Keep old previous
).alias("city_previous_old"),
F.col("src.city").alias("city_current"),
F.when(F.col("tgt.business_customer_id").isNotNull(),
F.coalesce(F.col("tgt.city_current"), F.lit(None))
).alias("city_previous"),
# Same logic for segment
F.when(F.col("tgt.business_customer_id").isNotNull(),
F.col("tgt.segment_current")
).alias("segment_previous_old"),
F.col("src.segment").alias("segment_current"),
F.when(F.col("tgt.business_customer_id").isNotNull(),
F.coalesce(F.col("tgt.segment_current"), F.lit(None))
).alias("segment_previous"),
F.current_timestamp().alias("last_updated")
)
SCD Type 6 (Hybrid) Implementation
# SCD Type 6: Type 2 rows + Type 1 current columns
incoming_customers = spark.read.parquet("s3://data-lake/staging/customers/")
dim_customer = spark.read.format("delta").load("s3://data-warehouse/dim_customer/")
# Get current values for quick access
current_values = dim_customer.filter(F.col("is_current") == True) \
.select(
F.col("business_customer_id"),
F.col("city").alias("current_city"),
F.col("segment").alias("current_segment")
)
# Apply SCD Type 2 logic (same as before)
# ... (Type 2 merge logic)
# Then update current values columns
updated_with_current = updated.alias("new") \
.join(current_values.alias("cv"),
F.col("new.business_customer_id") == F.col("cv.business_customer_id"),
"left") \
.withColumn("current_city",
F.when(F.col("new.is_current"), F.col("new.city"))
.otherwise(F.col("cv.current_city"))) \
.withColumn("current_segment",
F.when(F.col("new.is_current"), F.col("new.segment"))
.otherwise(F.col("cv.current_segment")))
ℹ️
Best Practice: Use Delta Lake or Apache Iceberg for SCD Type 2. These formats support:
- Time travel (built-in history)
- MERGE operations (atomic upserts)
- Schema evolution (add columns without rewriting)
- Partition pruning (efficient queries)
Advanced: SCD with Late-Arriving Data
# Handle late-arriving data in SCD Type 2
def handle_late_arriving_scd(incoming, dim_table, max_lag_days=30):
"""
Late-arriving data: records arrive after their effective date
Strategy: Gap detection and backfill
"""
# Find records that should have been inserted earlier
late_records = incoming.alias("src") \
.join(dim_table.alias("tgt"),
(F.col("src.customer_id") == F.col("tgt.business_customer_id")) &
(F.col("src.effective_date") < F.col("tgt.effective_date")),
"left") \
.filter(F.col("tgt.customer_key").isNull()) \
.filter(F.datediff(F.current_date(), F.col("src.effective_date")) <= max_lag_days)
# For late records, insert with appropriate effective dates
# and adjust expiry dates of overlapping records
return late_records
⚠️
Critical Issue: Late-arriving data is one of the hardest problems in SCD management. Always have a clear policy: how far back will you accept changes? What happens to aggregated reports when historical data changes?
Common Follow-Up Questions
Q1: How do you handle SCD for rapidly changing attributes?
For attributes that change frequently (e.g., stock price, user session data):
- Don't use SCD Type 2 - it would create too many rows
- Use SCD Type 1 for the latest value
- Store history in a separate fact table (e.g.,
fact_price_history) - Consider time-series databases (InfluxDB, TimescaleDB)
Q2: How do you handle SCD in real-time pipelines?
# Real-time SCD Type 2 using Kafka + Flink
from pyflink.table import EnvironmentSettings, TableEnvironment
env_settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(env_settings)
# Define SCD Type 2 logic in Flink SQL
t_env.execute_sql("""
CREATE TABLE dim_customer_scd (
customer_key BIGINT,
customer_id STRING,
customer_name STRING,
city STRING,
effective TIMESTAMP(3),
expiry TIMESTAMP(3),
is_current BOOLEAN,
PRIMARY KEY (customer_key) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:postgresql://...',
'table-name' = 'dim_customer'
);
-- Kafka source with CDC
CREATE TABLE customers_cdc (
customer_id STRING,
customer_name STRING,
city STRING,
op_type STRING,
op_ts TIMESTAMP(3)
) WITH (
'connector' = 'kafka',
'topic' = 'customers',
'properties.bootstrap.servers' = 'kafka:9092',
'format' = 'debezium-json'
);
-- SCD Type 2 merge logic
INSERT INTO dim_customer_scd
SELECT
ROW_NUMBER() OVER (ORDER BY op_ts) + COALESCE(max_key, 0),
customer_id, customer_name, city,
op_ts as effective,
TO_TIMESTAMP('9999-12-31') as expiry,
TRUE as is_current
FROM customers_cdc
WHERE op_type = 'c' OR op_type = 'u';
""")
Q3: How do you test SCD logic?
# Unit tests for SCD Type 2
def test_scd_type2_insert():
"""Test that new records are inserted correctly"""
initial = create_dataframe([
{"id": 1, "name": "John", "city": "NY", "is_current": True}
])
incoming = create_dataframe([
{"id": 1, "name": "John", "city": "SF"} # Change
])
result = apply_scd_type2(initial, incoming)
# Should have 2 rows
assert result.count() == 2
# Old record should be expired
assert result.filter("is_current = False").count() == 1
# New record should be current
assert result.filter("is_current = True AND city = 'SF'").count() == 1
def test_scd_type2_no_change():
"""Test that unchanged records are not modified"""
initial = create_dataframe([
{"id": 1, "name": "John", "city": "NY", "is_current": True}
])
incoming = create_dataframe([
{"id": 1, "name": "John", "city": "NY"} # No change
])
result = apply_scd_type2(initial, incoming)
# Should still have 1 row
assert result.count() == 1
assert result.filter("is_current = True").count() == 1
Q4: How do you handle SCD in a data mesh architecture?
In a data mesh, each domain owns its dimensions:
- Domain teams implement SCD for their domains
- Central platform provides SCD templates and tools
- Cross-domain queries use conformed dimensions
- Data contracts define SCD behavior between domains
💡
Interview Tip: When discussing SCDs, always mention the business context. Type 2 is great for compliance but expensive. Type 1 is cheap but loses history. The right choice depends on: (1) regulatory requirements, (2) query patterns, (3) data volume, and (4) latency requirements.
Company-Specific Tips
Google Interview Tips
- Discuss BigQuery's time-travel feature as an alternative to SCD Type 2
- Mention partitioned tables for efficient SCD queries
- Be ready to discuss SCD in data pipelines (Dataflow, Pub/Sub)
- Talk about cost implications of each SCD type in cloud warehouses
Meta Interview Tips
- Focus on user dimension changes (profile updates, privacy settings)
- Discuss real-time SCD for ad targeting
- Mention data governance requirements for user data
- Talk about schema evolution in Hive/Delta Lake
Netflix Interview Tips
- Discuss content metadata changes (genre reclassification, ratings)
- Explain how viewing history affects SCD design
- Mention A/B testing dimensions and their SCD requirements
- Talk about cost optimization for large content catalogs
ℹ️
Final Takeaway: SCD Type 2 is the most commonly used in production, but it's not always the right choice. Start with business requirements: Do you need history? How much? What's the query pattern? Then choose the simplest SCD type that meets those requirements.