🏗️ SQL Data Modeling

Amazon & Microsoft Interview Deep Dive

🏢 Amazon🏢 Microsoft⚡ Difficulty: Hard⏱️ 50 min

📋 Interview Question

ℹ️🔴 Amazon/Microsoft Interview Question

"Design a data warehouse schema for an e-commerce platform. The system needs to track: products, customers, orders, order items, and inventory. Consider: 1) Star vs Snowflake schema trade-offs, 2) Slowly Changing Dimensions (SCD), 3) Fact table granularity, 4) Partitioning strategy for 1B+ rows."

Companies: Amazon, Microsoft | Difficulty: Hard | Time: 50 minutes

📊 Part 1: Star Schema Design

ℹ️🔍 Star Schema

Star schema is the foundation of dimensional modeling:

Fact table: Contains measurements/metrics (center)
Dimension tables: Contain descriptive attributes (points of star)
Benefits: Simple queries, fewer JOINs, optimal for OLAP

Dimension Tables

-- Dimension: Products
CREATE TABLE dim_product (
    product_key SERIAL PRIMARY KEY,  -- Surrogate key
    product_id VARCHAR(50),          -- Natural/Business key
    product_name VARCHAR(200),
    category VARCHAR(100),
    subcategory VARCHAR(100),
    brand VARCHAR(100),
    unit_cost DECIMAL(12, 2),
    unit_price DECIMAL(12, 2),
    supplier_id VARCHAR(50),
    supplier_name VARCHAR(200),
    is_active BOOLEAN DEFAULT true,
    effective_date TIMESTAMP,
    expiration_date TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Dimension: Customers
CREATE TABLE dim_customer (
    customer_key SERIAL PRIMARY KEY,
    customer_id VARCHAR(50),
    customer_name VARCHAR(200),
    email VARCHAR(200),
    phone VARCHAR(50),
    address_line1 VARCHAR(200),
    address_line2 VARCHAR(200),
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(100),
    postal_code VARCHAR(20),
    segment VARCHAR(50),  -- Premium, Regular, New
    loyalty_tier VARCHAR(20),  -- Gold, Silver, Bronze
    registration_date DATE,
    is_active BOOLEAN DEFAULT true,
    effective_date TIMESTAMP,
    expiration_date TIMESTAMP
);

-- Dimension: Time (Date dimension)
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,  -- YYYYMMDD format
    full_date DATE UNIQUE,
    year INT,
    quarter INT,
    month INT,
    month_name VARCHAR(20),
    week INT,
    day_of_month INT,
    day_of_week INT,
    day_name VARCHAR(20),
    is_weekend BOOLEAN,
    is_holiday BOOLEAN,
    holiday_name VARCHAR(100),
    fiscal_year INT,
    fiscal_quarter INT
);

-- Dimension: Stores/Channels
CREATE TABLE dim_store (
    store_key SERIAL PRIMARY KEY,
    store_id VARCHAR(50),
    store_name VARCHAR(200),
    store_type VARCHAR(50),  -- Online, Retail, Marketplace
    region VARCHAR(100),
    country VARCHAR(100),
    city VARCHAR(100),
    manager VARCHAR(200),
    square_footage INT,
    is_active BOOLEAN DEFAULT true,
    effective_date TIMESTAMP,
    expiration_date TIMESTAMP
);

-- Dimension: Promotions
CREATE TABLE dim_promotion (
    promotion_key SERIAL PRIMARY KEY,
    promotion_id VARCHAR(50),
    promotion_name VARCHAR(200),
    promotion_type VARCHAR(50),
    discount_percent DECIMAL(5, 2),
    start_date DATE,
    end_date DATE,
    is_active BOOLEAN DEFAULT true
);

Fact Table

-- Fact Table: Orders (Transaction grain)
CREATE TABLE fact_orders (
    order_key BIGSERIAL PRIMARY KEY,
    order_id VARCHAR(50),           -- Natural key
    product_key INT REFERENCES dim_product(product_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    date_key INT REFERENCES dim_date(date_key),
    store_key INT REFERENCES dim_store(store_key),
    promotion_key INT REFERENCES dim_promotion(promotion_key),
    -- Measures
    quantity INT,
    unit_price DECIMAL(12, 2),
    unit_cost DECIMAL(12, 2),
    discount_amount DECIMAL(12, 2),
    tax_amount DECIMAL(12, 2),
    shipping_amount DECIMAL(12, 2),
    total_amount DECIMAL(12, 2),    -- quantity * unit_price - discount + tax + shipping
    profit DECIMAL(12, 2),          -- total_amount - (quantity * unit_cost)
    -- Status
    order_status VARCHAR(20),       -- Pending, Shipped, Delivered, Cancelled
    payment_method VARCHAR(30),
    -- Timestamps
    order_timestamp TIMESTAMP,
    ship_date TIMESTAMP,
    delivery_date TIMESTAMP,
    -- Metadata
    etl_batch_id INT,
    etl_load_timestamp TIMESTAMP DEFAULT NOW()
) PARTITION BY RANGE (date_key);

-- Create partitions by year
CREATE TABLE fact_orders_2023 PARTITION OF fact_orders
FOR VALUES FROM (20230101) TO (20240101);

CREATE TABLE fact_orders_2024 PARTITION OF fact_orders
FOR VALUES FROM (20240101) TO (20250101);

❄️ Part 2: Snowflake Schema

ℹ️🔍 Snowflake Schema

Snowflake schema normalizes dimension tables into sub-dimensions:

Benefits: Less data redundancy, smaller dimension tables
Drawbacks: More JOINs, more complex queries
Use when: Dimension tables are very large with many attributes

-- Snowflake: Normalize Product dimension
CREATE TABLE dim_category (
    category_key SERIAL PRIMARY KEY,
    category_name VARCHAR(100),
    category_description TEXT
);

CREATE TABLE dim_subcategory (
    subcategory_key SERIAL PRIMARY KEY,
    category_key INT REFERENCES dim_category(category_key),
    subcategory_name VARCHAR(100)
);

CREATE TABLE dim_brand (
    brand_key SERIAL PRIMARY KEY,
    brand_name VARCHAR(100),
    brand_description TEXT
);

CREATE TABLE dim_supplier (
    supplier_key SERIAL PRIMARY KEY,
    supplier_name VARCHAR(200),
    contact_name VARCHAR(200),
    phone VARCHAR(50),
    email VARCHAR(200)
);

-- Modified Product dimension (snowflaked)
CREATE TABLE dim_product_snowflake (
    product_key SERIAL PRIMARY KEY,
    product_id VARCHAR(50),
    product_name VARCHAR(200),
    subcategory_key INT REFERENCES dim_subcategory(subcategory_key),
    brand_key INT REFERENCES dim_brand(brand_key),
    supplier_key INT REFERENCES dim_supplier(supplier_key),
    unit_cost DECIMAL(12, 2),
    unit_price DECIMAL(12, 2),
    effective_date TIMESTAMP,
    expiration_date TIMESTAMP
);

-- Query comparison
-- Star schema: 1 JOIN for product info
SELECT p.product_name, p.category, f.total_amount
FROM fact_orders f
JOIN dim_product p ON f.product_key = p.product_key;

-- Snowflake schema: 3 JOINs for same info
SELECT
    p.product_name,
    sc.subcategory_name,
    c.category_name,
    f.total_amount
FROM fact_orders f
JOIN dim_product_snowflake p ON f.product_key = p.product_key
JOIN dim_subcategory sc ON p.subcategory_key = sc.subcategory_key
JOIN dim_category c ON sc.category_key = c.category_key;

🏦 Part 3: Data Vault Modeling

ℹ️🔍 Data Vault

Data Vault is designed for auditability and flexibility:

Hubs: Business keys (entities)
Links: Relationships between hubs
Satellites: Descriptive attributes and history
Benefits: Full historical tracking, parallel loading, audit trail

-- Data Vault: Hubs (Business Keys)
CREATE TABLE hub_customer (
    hub_customer_hk CHAR(32) PRIMARY KEY,  -- Hash key
    customer_id VARCHAR(50),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

CREATE TABLE hub_product (
    hub_product_hk CHAR(32) PRIMARY KEY,
    product_id VARCHAR(50),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

CREATE TABLE hub_order (
    hub_order_hk CHAR(32) PRIMARY KEY,
    order_id VARCHAR(50),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

-- Data Vault: Links (Relationships)
CREATE TABLE link_order_customer (
    link_order_customer_hk CHAR(32) PRIMARY KEY,
    hub_order_hk CHAR(32) REFERENCES hub_order(hub_order_hk),
    hub_customer_hk CHAR(32) REFERENCES hub_customer(hub_customer_hk),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

CREATE TABLE link_order_product (
    link_order_product_hk CHAR(32) PRIMARY KEY,
    hub_order_hk CHAR(32) REFERENCES hub_order(hub_order_hk),
    hub_product_hk CHAR(32) REFERENCES hub_product(hub_product_hk),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

-- Data Vault: Satellites (Attributes & History)
CREATE TABLE sat_customer (
    sat_customer_hk CHAR(32) PRIMARY KEY,
    hub_customer_hk CHAR(32) REFERENCES hub_customer(hub_customer_hk),
    load_date TIMESTAMP,
    load_end_date TIMESTAMP,
    hash_diff CHAR(32),  -- For detecting changes
    customer_name VARCHAR(200),
    email VARCHAR(200),
    phone VARCHAR(50),
    address VARCHAR(500),
    segment VARCHAR(50),
    record_source VARCHAR(100)
);

CREATE TABLE sat_product (
    sat_product_hk CHAR(32) PRIMARY KEY,
    hub_product_hk CHAR(32) REFERENCES hub_product(hub_product_hk),
    load_date TIMESTAMP,
    load_end_date TIMESTAMP,
    hash_diff CHAR(32),
    product_name VARCHAR(200),
    category VARCHAR(100),
    brand VARCHAR(100),
    unit_cost DECIMAL(12, 2),
    unit_price DECIMAL(12, 2),
    record_source VARCHAR(100)
);

CREATE TABLE sat_order (
    sat_order_hk CHAR(32) PRIMARY KEY,
    hub_order_hk CHAR(32) REFERENCES hub_order(hub_order_hk),
    load_date TIMESTAMP,
    load_end_date TIMESTAMP,
    hash_diff CHAR(32),
    order_date TIMESTAMP,
    status VARCHAR(20),
    total_amount DECIMAL(12, 2),
    record_source VARCHAR(100)
);

Hash Function for Data Vault

-- Generate hash keys using MD5
CREATE OR REPLACE FUNCTION generate_hash_key(VARIADIC args TEXT[])
RETURNS CHAR(32) AS $$
BEGIN
    RETURN MD5(ARRAY_TO_STRING(args, '||'));
END;
$$ LANGUAGE plpgsql;

-- Example usage
INSERT INTO hub_customer (hub_customer_hk, customer_id, load_date, record_source)
VALUES (
    generate_hash_key('CUST-001'),
    'CUST-001',
    NOW(),
    'source_system_a'
);

🔄 Part 4: Slowly Changing Dimensions (SCD)

SCD Type 1: Overwrite

-- Type 1: No history, just update current value
CREATE TABLE dim_customer_scd1 (
    customer_key SERIAL PRIMARY KEY,
    customer_id VARCHAR(50),
    customer_name VARCHAR(200),
    email VARCHAR(200),
    city VARCHAR(100),
    segment VARCHAR(50),
    last_updated TIMESTAMP DEFAULT NOW()
);

-- Update procedure
CREATE OR REPLACE PROCEDURE update_customer_scd1(
    p_customer_id VARCHAR,
    p_name VARCHAR,
    p_email VARCHAR,
    p_city VARCHAR,
    p_segment VARCHAR
)
LANGUAGE plpgsql
AS $$
BEGIN
    UPDATE dim_customer_scd1
    SET customer_name = p_name,
        email = p_email,
        city = p_city,
        segment = p_segment,
        last_updated = NOW()
    WHERE customer_id = p_customer_id;

    IF NOT FOUND THEN
        INSERT INTO dim_customer_scd1 (customer_id, customer_name, email, city, segment)
        VALUES (p_customer_id, p_name, p_email, p_city, p_segment);
    END IF;
END;
$$;

SCD Type 2: Full History

-- Type 2: Keep full history with effective dates
CREATE TABLE dim_customer_scd2 (
    customer_key SERIAL PRIMARY KEY,
    customer_id VARCHAR(50),
    customer_name VARCHAR(200),
    email VARCHAR(200),
    city VARCHAR(100),
    segment VARCHAR(50),
    effective_date TIMESTAMP,
    expiration_date TIMESTAMP,
    is_current BOOLEAN DEFAULT true,
    version INT DEFAULT 1
);

-- Insert/Update procedure for SCD Type 2
CREATE OR REPLACE PROCEDURE update_customer_scd2(
    p_customer_id VARCHAR,
    p_name VARCHAR,
    p_email VARCHAR,
    p_city VARCHAR,
    p_segment VARCHAR
)
LANGUAGE plpgsql
AS $$
DECLARE
    v_current_key INT;
    v_hash_diff CHAR(32);
BEGIN
    -- Calculate hash of new values
    v_hash_diff := MD5(p_name || p_email || p_city || p_segment);

    -- Check if record exists and has changed
    SELECT customer_key INTO v_current_key
    FROM dim_customer_scd2
    WHERE customer_id = p_customer_id
    AND is_current = true;

    IF v_current_key IS NOT NULL THEN
        -- Check if values changed
        IF (SELECT hash_diff FROM dim_customer_scd2 WHERE customer_key = v_current_key) != v_hash_diff THEN
            -- Expire current record
            UPDATE dim_customer_scd2
            SET is_current = false,
                expiration_date = NOW()
            WHERE customer_key = v_current_key;

            -- Insert new version
            INSERT INTO dim_customer_scd2 (
                customer_id, customer_name, email, city, segment,
                effective_date, expiration_date, is_current, version
            )
            SELECT
                customer_id, p_name, p_email, p_city, p_segment,
                NOW(), NULL, true, version + 1
            FROM dim_customer_scd2
            WHERE customer_key = v_current_key;
        END IF;
    ELSE
        -- Insert new record
        INSERT INTO dim_customer_scd2 (
            customer_id, customer_name, email, city, segment,
            effective_date, expiration_date, is_current, version
        )
        VALUES (
            p_customer_id, p_name, p_email, p_city, p_segment,
            NOW(), NULL, true, 1
        );
    END IF;
END;
$$;

SCD Type 3: Limited History

-- Type 3: Keep previous and current values
CREATE TABLE dim_customer_scd3 (
    customer_key SERIAL PRIMARY KEY,
    customer_id VARCHAR(50),
    customer_name VARCHAR(200),
    email VARCHAR(200),
    city_current VARCHAR(100),
    city_previous VARCHAR(100),
    segment_current VARCHAR(50),
    segment_previous VARCHAR(50),
    effective_date TIMESTAMP,
    previous_effective_date TIMESTAMP
);

📊 Part 5: Aggregate Fact Tables

-- Daily aggregate fact table
CREATE TABLE fact_orders_daily (
    date_key INT REFERENCES dim_date(date_key),
    product_key INT REFERENCES dim_product(product_key),
    store_key INT REFERENCES dim_store(store_key),
    -- Aggregated measures
    order_count INT,
    quantity_sum INT,
    total_amount_sum DECIMAL(15, 2),
    profit_sum DECIMAL(15, 2),
    avg_order_value DECIMAL(12, 2),
    unique_customers INT,
    PRIMARY KEY (date_key, product_key, store_key)
);

-- Populate aggregate from transaction fact
INSERT INTO fact_orders_daily (
    date_key, product_key, store_key,
    order_count, quantity_sum, total_amount_sum,
    profit_sum, avg_order_value, unique_customers
)
SELECT
    date_key,
    product_key,
    store_key,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(quantity) AS quantity_sum,
    SUM(total_amount) AS total_amount_sum,
    SUM(profit) AS profit_sum,
    AVG(total_amount) AS avg_order_value,
    COUNT(DISTINCT customer_key) AS unique_customers
FROM fact_orders
GROUP BY date_key, product_key, store_key;

📈 Part 6: Comparison Table

💡✅ Schema Selection Guide

Feature	Star Schema	Snowflake	Data Vault
Simplicity	High	Medium	Low
Query Performance	Fastest	Slower (more JOINs)	Slowest (many JOINs)
Storage Efficiency	Lower	Higher	Highest
Historical Tracking	Limited	Limited	Full
Audit Trail	No	No	Yes
Load Performance	Good	Good	Best (parallel)
Flexibility	Medium	Medium	High
Best For	OLAP/BI	Normalized dimensions	Enterprise DW

🎯 Quiz Section

🏆 Best Practices for Interviews

💡✅ Data Modeling Best Practices

1. Define Grain First: Always clarify what each fact table row represents before designing the schema.

2. Use Surrogate Keys:

-- BAD: Using natural keys in fact table
order_id VARCHAR(50)  -- May change, not unique across sources

-- GOOD: Surrogate key from dimension
product_key SERIAL  -- Always unique, never changes

3. Conformed Dimensions: Ensure dimensions are consistent across fact tables so you can JOIN different facts.

4. Handle NULLs Appropriately:

-- Create Unknown member for dimensions
INSERT INTO dim_product (product_key, product_name, ...)
VALUES (-1, 'Unknown', ...);

-- Use -1 as foreign key for NULL dimension references

5. Consider Partitioning Early: For large fact tables, design partitioning strategy from the start (by date, region, etc.).

🔗 Common Follow-Up Questions

"How do you handle late-arriving facts?" — Buffer period, ETL procedures, SCD handling
"When would you use a factless fact table?" — Coverage tables, event tracking without measures
"How do you handle many-to-many in dimensional modeling?" — Bridge tables, junk dimensions
"What is a junk dimension?" — Combines low-cardinality flags/statuses into one dimension

⚠️⚠️ Interview Tip

When discussing data modeling, always mention:

Business requirements driving the design
Query patterns you're optimizing for
Data volume and growth expectations
ETL/ELT complexity considerations

Topic: SQL Data Modeling for FAANG Interviews