ποΈ SQL Data Modeling
Amazon & Microsoft Interview Deep Dive
π Interview Question
βΉοΈπ΄ Amazon/Microsoft Interview Question
"Design a data warehouse schema for an e-commerce platform. The system needs to track: products, customers, orders, order items, and inventory. Consider: 1) Star vs Snowflake schema trade-offs, 2) Slowly Changing Dimensions (SCD), 3) Fact table granularity, 4) Partitioning strategy for 1B+ rows."
Companies: Amazon, Microsoft | Difficulty: Hard | Time: 50 minutes
π Part 1: Star Schema Design
βΉοΈπ Star Schema
Star schema is the foundation of dimensional modeling:
- Fact table: Contains measurements/metrics (center)
- Dimension tables: Contain descriptive attributes (points of star)
- Benefits: Simple queries, fewer JOINs, optimal for OLAP
Dimension Tables
-- Dimension: Products
CREATE TABLE dim_product (
product_key SERIAL PRIMARY KEY, -- Surrogate key
product_id VARCHAR(50), -- Natural/Business key
product_name VARCHAR(200),
category VARCHAR(100),
subcategory VARCHAR(100),
brand VARCHAR(100),
unit_cost DECIMAL(12, 2),
unit_price DECIMAL(12, 2),
supplier_id VARCHAR(50),
supplier_name VARCHAR(200),
is_active BOOLEAN DEFAULT true,
effective_date TIMESTAMP,
expiration_date TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
-- Dimension: Customers
CREATE TABLE dim_customer (
customer_key SERIAL PRIMARY KEY,
customer_id VARCHAR(50),
customer_name VARCHAR(200),
email VARCHAR(200),
phone VARCHAR(50),
address_line1 VARCHAR(200),
address_line2 VARCHAR(200),
city VARCHAR(100),
state VARCHAR(50),
country VARCHAR(100),
postal_code VARCHAR(20),
segment VARCHAR(50), -- Premium, Regular, New
loyalty_tier VARCHAR(20), -- Gold, Silver, Bronze
registration_date DATE,
is_active BOOLEAN DEFAULT true,
effective_date TIMESTAMP,
expiration_date TIMESTAMP
);
-- Dimension: Time (Date dimension)
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- YYYYMMDD format
full_date DATE UNIQUE,
year INT,
quarter INT,
month INT,
month_name VARCHAR(20),
week INT,
day_of_month INT,
day_of_week INT,
day_name VARCHAR(20),
is_weekend BOOLEAN,
is_holiday BOOLEAN,
holiday_name VARCHAR(100),
fiscal_year INT,
fiscal_quarter INT
);
-- Dimension: Stores/Channels
CREATE TABLE dim_store (
store_key SERIAL PRIMARY KEY,
store_id VARCHAR(50),
store_name VARCHAR(200),
store_type VARCHAR(50), -- Online, Retail, Marketplace
region VARCHAR(100),
country VARCHAR(100),
city VARCHAR(100),
manager VARCHAR(200),
square_footage INT,
is_active BOOLEAN DEFAULT true,
effective_date TIMESTAMP,
expiration_date TIMESTAMP
);
-- Dimension: Promotions
CREATE TABLE dim_promotion (
promotion_key SERIAL PRIMARY KEY,
promotion_id VARCHAR(50),
promotion_name VARCHAR(200),
promotion_type VARCHAR(50),
discount_percent DECIMAL(5, 2),
start_date DATE,
end_date DATE,
is_active BOOLEAN DEFAULT true
);
Fact Table
-- Fact Table: Orders (Transaction grain)
CREATE TABLE fact_orders (
order_key BIGSERIAL PRIMARY KEY,
order_id VARCHAR(50), -- Natural key
product_key INT REFERENCES dim_product(product_key),
customer_key INT REFERENCES dim_customer(customer_key),
date_key INT REFERENCES dim_date(date_key),
store_key INT REFERENCES dim_store(store_key),
promotion_key INT REFERENCES dim_promotion(promotion_key),
-- Measures
quantity INT,
unit_price DECIMAL(12, 2),
unit_cost DECIMAL(12, 2),
discount_amount DECIMAL(12, 2),
tax_amount DECIMAL(12, 2),
shipping_amount DECIMAL(12, 2),
total_amount DECIMAL(12, 2), -- quantity * unit_price - discount + tax + shipping
profit DECIMAL(12, 2), -- total_amount - (quantity * unit_cost)
-- Status
order_status VARCHAR(20), -- Pending, Shipped, Delivered, Cancelled
payment_method VARCHAR(30),
-- Timestamps
order_timestamp TIMESTAMP,
ship_date TIMESTAMP,
delivery_date TIMESTAMP,
-- Metadata
etl_batch_id INT,
etl_load_timestamp TIMESTAMP DEFAULT NOW()
) PARTITION BY RANGE (date_key);
-- Create partitions by year
CREATE TABLE fact_orders_2023 PARTITION OF fact_orders
FOR VALUES FROM (20230101) TO (20240101);
CREATE TABLE fact_orders_2024 PARTITION OF fact_orders
FOR VALUES FROM (20240101) TO (20250101);
βοΈ Part 2: Snowflake Schema
βΉοΈπ Snowflake Schema
Snowflake schema normalizes dimension tables into sub-dimensions:
- Benefits: Less data redundancy, smaller dimension tables
- Drawbacks: More JOINs, more complex queries
- Use when: Dimension tables are very large with many attributes
-- Snowflake: Normalize Product dimension
CREATE TABLE dim_category (
category_key SERIAL PRIMARY KEY,
category_name VARCHAR(100),
category_description TEXT
);
CREATE TABLE dim_subcategory (
subcategory_key SERIAL PRIMARY KEY,
category_key INT REFERENCES dim_category(category_key),
subcategory_name VARCHAR(100)
);
CREATE TABLE dim_brand (
brand_key SERIAL PRIMARY KEY,
brand_name VARCHAR(100),
brand_description TEXT
);
CREATE TABLE dim_supplier (
supplier_key SERIAL PRIMARY KEY,
supplier_name VARCHAR(200),
contact_name VARCHAR(200),
phone VARCHAR(50),
email VARCHAR(200)
);
-- Modified Product dimension (snowflaked)
CREATE TABLE dim_product_snowflake (
product_key SERIAL PRIMARY KEY,
product_id VARCHAR(50),
product_name VARCHAR(200),
subcategory_key INT REFERENCES dim_subcategory(subcategory_key),
brand_key INT REFERENCES dim_brand(brand_key),
supplier_key INT REFERENCES dim_supplier(supplier_key),
unit_cost DECIMAL(12, 2),
unit_price DECIMAL(12, 2),
effective_date TIMESTAMP,
expiration_date TIMESTAMP
);
-- Query comparison
-- Star schema: 1 JOIN for product info
SELECT p.product_name, p.category, f.total_amount
FROM fact_orders f
JOIN dim_product p ON f.product_key = p.product_key;
-- Snowflake schema: 3 JOINs for same info
SELECT
p.product_name,
sc.subcategory_name,
c.category_name,
f.total_amount
FROM fact_orders f
JOIN dim_product_snowflake p ON f.product_key = p.product_key
JOIN dim_subcategory sc ON p.subcategory_key = sc.subcategory_key
JOIN dim_category c ON sc.category_key = c.category_key;
π¦ Part 3: Data Vault Modeling
βΉοΈπ Data Vault
Data Vault is designed for auditability and flexibility:
- Hubs: Business keys (entities)
- Links: Relationships between hubs
- Satellites: Descriptive attributes and history
- Benefits: Full historical tracking, parallel loading, audit trail
-- Data Vault: Hubs (Business Keys)
CREATE TABLE hub_customer (
hub_customer_hk CHAR(32) PRIMARY KEY, -- Hash key
customer_id VARCHAR(50),
load_date TIMESTAMP,
record_source VARCHAR(100)
);
CREATE TABLE hub_product (
hub_product_hk CHAR(32) PRIMARY KEY,
product_id VARCHAR(50),
load_date TIMESTAMP,
record_source VARCHAR(100)
);
CREATE TABLE hub_order (
hub_order_hk CHAR(32) PRIMARY KEY,
order_id VARCHAR(50),
load_date TIMESTAMP,
record_source VARCHAR(100)
);
-- Data Vault: Links (Relationships)
CREATE TABLE link_order_customer (
link_order_customer_hk CHAR(32) PRIMARY KEY,
hub_order_hk CHAR(32) REFERENCES hub_order(hub_order_hk),
hub_customer_hk CHAR(32) REFERENCES hub_customer(hub_customer_hk),
load_date TIMESTAMP,
record_source VARCHAR(100)
);
CREATE TABLE link_order_product (
link_order_product_hk CHAR(32) PRIMARY KEY,
hub_order_hk CHAR(32) REFERENCES hub_order(hub_order_hk),
hub_product_hk CHAR(32) REFERENCES hub_product(hub_product_hk),
load_date TIMESTAMP,
record_source VARCHAR(100)
);
-- Data Vault: Satellites (Attributes & History)
CREATE TABLE sat_customer (
sat_customer_hk CHAR(32) PRIMARY KEY,
hub_customer_hk CHAR(32) REFERENCES hub_customer(hub_customer_hk),
load_date TIMESTAMP,
load_end_date TIMESTAMP,
hash_diff CHAR(32), -- For detecting changes
customer_name VARCHAR(200),
email VARCHAR(200),
phone VARCHAR(50),
address VARCHAR(500),
segment VARCHAR(50),
record_source VARCHAR(100)
);
CREATE TABLE sat_product (
sat_product_hk CHAR(32) PRIMARY KEY,
hub_product_hk CHAR(32) REFERENCES hub_product(hub_product_hk),
load_date TIMESTAMP,
load_end_date TIMESTAMP,
hash_diff CHAR(32),
product_name VARCHAR(200),
category VARCHAR(100),
brand VARCHAR(100),
unit_cost DECIMAL(12, 2),
unit_price DECIMAL(12, 2),
record_source VARCHAR(100)
);
CREATE TABLE sat_order (
sat_order_hk CHAR(32) PRIMARY KEY,
hub_order_hk CHAR(32) REFERENCES hub_order(hub_order_hk),
load_date TIMESTAMP,
load_end_date TIMESTAMP,
hash_diff CHAR(32),
order_date TIMESTAMP,
status VARCHAR(20),
total_amount DECIMAL(12, 2),
record_source VARCHAR(100)
);
Hash Function for Data Vault
-- Generate hash keys using MD5
CREATE OR REPLACE FUNCTION generate_hash_key(VARIADIC args TEXT[])
RETURNS CHAR(32) AS $$
BEGIN
RETURN MD5(ARRAY_TO_STRING(args, '||'));
END;
$$ LANGUAGE plpgsql;
-- Example usage
INSERT INTO hub_customer (hub_customer_hk, customer_id, load_date, record_source)
VALUES (
generate_hash_key('CUST-001'),
'CUST-001',
NOW(),
'source_system_a'
);
π Part 4: Slowly Changing Dimensions (SCD)
SCD Type 1: Overwrite
-- Type 1: No history, just update current value
CREATE TABLE dim_customer_scd1 (
customer_key SERIAL PRIMARY KEY,
customer_id VARCHAR(50),
customer_name VARCHAR(200),
email VARCHAR(200),
city VARCHAR(100),
segment VARCHAR(50),
last_updated TIMESTAMP DEFAULT NOW()
);
-- Update procedure
CREATE OR REPLACE PROCEDURE update_customer_scd1(
p_customer_id VARCHAR,
p_name VARCHAR,
p_email VARCHAR,
p_city VARCHAR,
p_segment VARCHAR
)
LANGUAGE plpgsql
AS $$
BEGIN
UPDATE dim_customer_scd1
SET customer_name = p_name,
email = p_email,
city = p_city,
segment = p_segment,
last_updated = NOW()
WHERE customer_id = p_customer_id;
IF NOT FOUND THEN
INSERT INTO dim_customer_scd1 (customer_id, customer_name, email, city, segment)
VALUES (p_customer_id, p_name, p_email, p_city, p_segment);
END IF;
END;
$$;
SCD Type 2: Full History
-- Type 2: Keep full history with effective dates
CREATE TABLE dim_customer_scd2 (
customer_key SERIAL PRIMARY KEY,
customer_id VARCHAR(50),
customer_name VARCHAR(200),
email VARCHAR(200),
city VARCHAR(100),
segment VARCHAR(50),
effective_date TIMESTAMP,
expiration_date TIMESTAMP,
is_current BOOLEAN DEFAULT true,
version INT DEFAULT 1
);
-- Insert/Update procedure for SCD Type 2
CREATE OR REPLACE PROCEDURE update_customer_scd2(
p_customer_id VARCHAR,
p_name VARCHAR,
p_email VARCHAR,
p_city VARCHAR,
p_segment VARCHAR
)
LANGUAGE plpgsql
AS $$
DECLARE
v_current_key INT;
v_hash_diff CHAR(32);
BEGIN
-- Calculate hash of new values
v_hash_diff := MD5(p_name || p_email || p_city || p_segment);
-- Check if record exists and has changed
SELECT customer_key INTO v_current_key
FROM dim_customer_scd2
WHERE customer_id = p_customer_id
AND is_current = true;
IF v_current_key IS NOT NULL THEN
-- Check if values changed
IF (SELECT hash_diff FROM dim_customer_scd2 WHERE customer_key = v_current_key) != v_hash_diff THEN
-- Expire current record
UPDATE dim_customer_scd2
SET is_current = false,
expiration_date = NOW()
WHERE customer_key = v_current_key;
-- Insert new version
INSERT INTO dim_customer_scd2 (
customer_id, customer_name, email, city, segment,
effective_date, expiration_date, is_current, version
)
SELECT
customer_id, p_name, p_email, p_city, p_segment,
NOW(), NULL, true, version + 1
FROM dim_customer_scd2
WHERE customer_key = v_current_key;
END IF;
ELSE
-- Insert new record
INSERT INTO dim_customer_scd2 (
customer_id, customer_name, email, city, segment,
effective_date, expiration_date, is_current, version
)
VALUES (
p_customer_id, p_name, p_email, p_city, p_segment,
NOW(), NULL, true, 1
);
END IF;
END;
$$;
SCD Type 3: Limited History
-- Type 3: Keep previous and current values
CREATE TABLE dim_customer_scd3 (
customer_key SERIAL PRIMARY KEY,
customer_id VARCHAR(50),
customer_name VARCHAR(200),
email VARCHAR(200),
city_current VARCHAR(100),
city_previous VARCHAR(100),
segment_current VARCHAR(50),
segment_previous VARCHAR(50),
effective_date TIMESTAMP,
previous_effective_date TIMESTAMP
);
π Part 5: Aggregate Fact Tables
-- Daily aggregate fact table
CREATE TABLE fact_orders_daily (
date_key INT REFERENCES dim_date(date_key),
product_key INT REFERENCES dim_product(product_key),
store_key INT REFERENCES dim_store(store_key),
-- Aggregated measures
order_count INT,
quantity_sum INT,
total_amount_sum DECIMAL(15, 2),
profit_sum DECIMAL(15, 2),
avg_order_value DECIMAL(12, 2),
unique_customers INT,
PRIMARY KEY (date_key, product_key, store_key)
);
-- Populate aggregate from transaction fact
INSERT INTO fact_orders_daily (
date_key, product_key, store_key,
order_count, quantity_sum, total_amount_sum,
profit_sum, avg_order_value, unique_customers
)
SELECT
date_key,
product_key,
store_key,
COUNT(DISTINCT order_id) AS order_count,
SUM(quantity) AS quantity_sum,
SUM(total_amount) AS total_amount_sum,
SUM(profit) AS profit_sum,
AVG(total_amount) AS avg_order_value,
COUNT(DISTINCT customer_key) AS unique_customers
FROM fact_orders
GROUP BY date_key, product_key, store_key;
π Part 6: Comparison Table
π‘β Schema Selection Guide
| Feature | Star Schema | Snowflake | Data Vault |
|---|---|---|---|
| Simplicity | High | Medium | Low |
| Query Performance | Fastest | Slower (more JOINs) | Slowest (many JOINs) |
| Storage Efficiency | Lower | Higher | Highest |
| Historical Tracking | Limited | Limited | Full |
| Audit Trail | No | No | Yes |
| Load Performance | Good | Good | Best (parallel) |
| Flexibility | Medium | Medium | High |
| Best For | OLAP/BI | Normalized dimensions | Enterprise DW |
π― Quiz Section
π Best Practices for Interviews
π‘β Data Modeling Best Practices
1. Define Grain First: Always clarify what each fact table row represents before designing the schema.
2. Use Surrogate Keys:
-- BAD: Using natural keys in fact table
order_id VARCHAR(50) -- May change, not unique across sources
-- GOOD: Surrogate key from dimension
product_key SERIAL -- Always unique, never changes
3. Conformed Dimensions: Ensure dimensions are consistent across fact tables so you can JOIN different facts.
4. Handle NULLs Appropriately:
-- Create Unknown member for dimensions
INSERT INTO dim_product (product_key, product_name, ...)
VALUES (-1, 'Unknown', ...);
-- Use -1 as foreign key for NULL dimension references
5. Consider Partitioning Early: For large fact tables, design partitioning strategy from the start (by date, region, etc.).
π Common Follow-Up Questions
- "How do you handle late-arriving facts?" β Buffer period, ETL procedures, SCD handling
- "When would you use a factless fact table?" β Coverage tables, event tracking without measures
- "How do you handle many-to-many in dimensional modeling?" β Bridge tables, junk dimensions
- "What is a junk dimension?" β Combines low-cardinality flags/statuses into one dimension
β οΈβ οΈ Interview Tip
When discussing data modeling, always mention:
- Business requirements driving the design
- Query patterns you're optimizing for
- Data volume and growth expectations
- ETL/ELT complexity considerations