Data Warehouse Concepts: The Foundation of Analytical Systems
What is a Data Warehouse?
A data warehouse is a centralized repository designed for analytical processing (OLAP) that integrates data from multiple heterogeneous sources to support decision-making.
Key Characteristics:
- Integrates data from multiple heterogeneous sources
- Optimized for complex read-heavy queries
- Designed for massive historical datasets
- Supports decision-making and business intelligence
Why Data Warehouses Matter
Problems Without a Data Warehouse:
- Data silos β information trapped in disconnected systems
- Inconsistent metrics β different teams calculate KPIs differently
- Slow reporting β weeks to generate insights instead of minutes
- No single source of truth β conflicting data across departments
How Data Warehouses Solve These Problems:
- Unified data layer β consolidate CRM, ERP, IoT, and web data
- Consistent metrics β standardized definitions across the organization
- Predictable query performance β optimized for analytical workloads
- Single source of truth β one version of the facts for all stakeholders
Key Insight: Modern enterprises generate petabytes of operational data. Without a unified analytical layer, organizations face data silos, inconsistent metrics, and slow reporting cycles.
Architecture Overview
Star Schema Diagram
Inmon vs Kimball
Inmon vs. Kimball Approaches
Star Schema
A star schema is a dimensional model consisting of one or more fact tables referencing one or more dimension tables, where each dimension is a single flat denormalized table directly connected to the fact table. The schema visually resembles a star with the fact table at the center.
Formal Properties
Star Schema Properties
- F = Fact table containing foreign keys to dimensions and measures
- D = {dβ, dβ, ..., dβ} = Set of dimension tables
- Join Pattern: F.dα΅’_key -> dα΅’.dα΅’_key for each i β {1, ..., n}
- Cardinality: |dα΅’| >> |F| typically (dimensions are smaller but denormalized)
- Query Pattern: SELECT ... FROM F JOIN dβ JOIN dβ ... WHERE dβ.attr AND dβ.attr
- Denormalization Factor: DF = (redundant_rows Γ attributes) / normalized_rows
-- Star Schema: Fact Table with Surrogate Keys
CREATE TABLE fact_sales (
sale_key BIGINT IDENTITY(1,1) PRIMARY KEY,
customer_key INT NOT NULL REFERENCES dim_customer(customer_key),
product_key INT NOT NULL REFERENCES dim_product(product_key),
store_key INT NOT NULL REFERENCES dim_store(store_key),
date_key INT NOT NULL REFERENCES dim_date(date_key),
promotion_key INT NOT NULL REFERENCES dim_promotion(promotion_key),
quantity_sold INT,
unit_price DECIMAL(12,2),
discount_pct DECIMAL(5,4),
gross_amount DECIMAL(14,2),
net_amount DECIMAL(14,2),
cost_of_goods DECIMAL(14,2),
profit_margin DECIMAL(5,4),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Star Schema: Denormalized Customer Dimension
CREATE TABLE dim_customer (
customer_key INT IDENTITY(1,1) PRIMARY KEY,
customer_id VARCHAR(50) UNIQUE NOT NULL,
full_name VARCHAR(200),
email VARCHAR(300),
phone VARCHAR(50),
address_line1 VARCHAR(200),
address_line2 VARCHAR(200),
city VARCHAR(100),
state_province VARCHAR(100),
postal_code VARCHAR(20),
country VARCHAR(100),
region VARCHAR(50),
segment VARCHAR(50),
loyalty_tier VARCHAR(20),
lifetime_value DECIMAL(14,2),
valid_from TIMESTAMP,
valid_to TIMESTAMP,
is_current BOOLEAN DEFAULT TRUE
);
-- Star Schema: Date Dimension
CREATE TABLE dim_date (
date_key INT PRIMARY KEY,
full_date DATE UNIQUE NOT NULL,
day_of_week SMALLINT,
day_name VARCHAR(10),
month_number SMALLINT,
month_name VARCHAR(10),
quarter SMALLINT,
year INT,
fiscal_year INT,
fiscal_quarter SMALLINT,
is_weekend BOOLEAN,
is_holiday BOOLEAN
);
-- Sample Star Schema Query
SELECT
d.year,
d.quarter,
c.country,
c.segment,
p.product_category,
SUM(f.net_amount) AS total_revenue,
SUM(f.quantity_sold) AS units_sold,
AVG(f.profit_margin) AS avg_margin,
COUNT(DISTINCT f.customer_key) AS unique_customers
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2025 AND c.country = 'United States'
GROUP BY d.year, d.quarter, c.country, c.segment, p.product_category
ORDER BY total_revenue DESC;
Snowflake Schema
A snowflake schema extends the star schema by normalizing dimension tables into sub-dimension tables. Each dimension is decomposed into multiple related tables forming a hierarchy, reducing redundancy at the cost of additional joins.
Normalization Cost Analysis
Snowflake Join Complexity
- Let K = number of dimensions, N = average normalization depth per dimension
- Star JOINs: K joins (one per dimension)
- Snowflake JOINs: Ξ£α΅’ββα΄· nα΅’ joins where nα΅’ is the depth of dimension i
- Example: 5 dimensions each normalized to depth 3 -> 15 joins vs 5 joins
- Storage Savings: S = Ξ£α΅’ββα΄· (redundancy_i Γ row_count_i Γ row_size_i)
-- Snowflake Schema: Normalized Dimensions
CREATE TABLE dim_customer (
customer_key INT IDENTITY(1,1) PRIMARY KEY,
customer_id VARCHAR(50) UNIQUE NOT NULL,
full_name VARCHAR(200),
email VARCHAR(300),
city_key INT REFERENCES dim_city(city_key),
segment_key INT REFERENCES dim_segment(segment_key),
valid_from TIMESTAMP,
valid_to TIMESTAMP,
is_current BOOLEAN DEFAULT TRUE
);
CREATE TABLE dim_city (
city_key INT IDENTITY(1,1) PRIMARY KEY,
city_name VARCHAR(100),
state_key INT REFERENCES dim_state(state_key)
);
CREATE TABLE dim_state (
state_key INT IDENTITY(1,1) PRIMARY KEY,
state_name VARCHAR(100),
country_key INT REFERENCES dim_country(country_key)
);
CREATE TABLE dim_country (
country_key INT IDENTITY(1,1) PRIMARY KEY,
country_name VARCHAR(100),
continent VARCHAR(50),
currency_code VARCHAR(3)
);
-- Snowflake Query with Multiple Joins
SELECT
co.country_name,
st.state_name,
ci.city_name,
sg.segment_name,
SUM(f.net_amount) AS total_revenue
FROM fact_sales f
JOIN dim_customer cu ON f.customer_key = cu.customer_key
JOIN dim_city ci ON cu.city_key = ci.city_key
JOIN dim_state st ON ci.state_key = st.state_key
JOIN dim_country co ON st.country_key = co.country_key
JOIN dim_segment sg ON cu.segment_key = sg.segment_key
GROUP BY co.country_name, st.state_name, ci.city_name, sg.segment_name;
Slowly Changing Dimensions (SCD)
SCD strategies manage how changes in dimension attributes are tracked over time. The type determines whether historical values are preserved, overwritten, or partially tracked.
| SCD Type | Strategy | Storage Impact | Historical Accuracy | Query Complexity | Use Case |
|---|---|---|---|---|---|
| Type 0 | Retain original | Minimal | Original only | Low | Immutable attributes |
| Type 1 | Overwrite | Minimal | No history | Low | Correcting errors |
| Type 2 | Add row | High (2x-5x) | Full history | Medium | Audit/compliance |
| Type 3 | Add column | Moderate | Limited (N versions) | Medium | Limited history |
| Type 4 | History table | High | Full history | High | Very large dimensions |
| Type 6 | Hybrid | High | Full + current | High | Complex requirements |
-- SCD Type 2 Implementation
CREATE TABLE dim_customer_scd2 (
customer_key BIGINT IDENTITY(1,1) PRIMARY KEY,
customer_id VARCHAR(50) NOT NULL,
full_name VARCHAR(200),
email VARCHAR(300),
segment VARCHAR(50),
city VARCHAR(100),
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP NOT NULL DEFAULT '9999-12-31',
is_current BOOLEAN DEFAULT TRUE,
row_hash VARCHAR(64) -- MD5 hash for change detection
);
-- Merge logic for SCD Type 2
MERGE INTO dim_customer_scd2 AS target
USING (
SELECT
customer_id,
full_name,
email,
segment,
city,
MD5(CONCAT(full_name, '|', email, '|', segment, '|', city)) AS row_hash
FROM staging.customers
) AS source
ON target.customer_id = source.customer_id AND target.is_current = TRUE
WHEN MATCHED AND target.row_hash != source.row_hash THEN
UPDATE SET
is_current = FALSE,
valid_to = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (customer_id, full_name, email, segment, city, valid_from, is_current, row_hash)
VALUES (source.customer_id, source.full_name, source.email,
source.segment, source.city, CURRENT_TIMESTAMP, TRUE, source.row_hash);
-- Insert new versions for changed records
INSERT INTO dim_customer_scd2
(customer_id, full_name, email, segment, city, valid_from, is_current, row_hash)
SELECT
source.customer_id, source.full_name, source.email,
source.segment, source.city, CURRENT_TIMESTAMP, TRUE, source.row_hash
FROM staging.customers source
JOIN dim_customer_scd2 target
ON source.customer_id = target.customer_id
AND target.row_hash != MD5(CONCAT(source.full_name, '|', source.email,
'|', source.segment, '|', source.city))
WHERE target.is_current = TRUE;
Fact Table Types
Grain defines what a single row in a fact table represents. The grain determines the most atomic level of measurement and directly impacts query flexibility, storage requirements, and aggregation performance.
| Fact Type | Grain | Measures | Null Behavior | Use Case |
|---|---|---|---|---|
| Transaction | One row per event | Additive | No nulls possible | Sales, clicks, logs |
| Periodic Snapshot | One row per period per entity | Additive | Nulls for inactive periods | Monthly balances |
| Accumulating Snapshot | One row per lifecycle | Semi-additive | Nulls for incomplete | Order fulfillment |
| Factless Fact | Events without measures | Count only | N/A | Attendance, status changes |
-- Transaction Fact Table
CREATE TABLE fact_web_events (
event_key BIGINT IDENTITY(1,1) PRIMARY KEY,
visitor_key INT NOT NULL,
session_key INT NOT NULL,
page_key INT NOT NULL,
event_date_key INT NOT NULL,
event_timestamp TIMESTAMP,
page_view_count INT DEFAULT 1,
click_count INT DEFAULT 0,
bounce_flag BOOLEAN,
session_duration_sec INT,
referrer_domain VARCHAR(200)
);
-- Periodic Snapshot Fact Table (Monthly)
CREATE TABLE fact_inventory_monthly (
inventory_key BIGINT IDENTITY(1,1) PRIMARY KEY,
product_key INT NOT NULL,
warehouse_key INT NOT NULL,
month_date_key INT NOT NULL, -- First day of month
quantity_on_hand INT,
quantity_reserved INT,
quantity_available INT,
unit_cost DECIMAL(12,4),
total_value DECIMAL(14,2),
reorder_point INT,
stockout_flag BOOLEAN
);
-- Accumulating Snapshot Fact Table
CREATE TABLE fact_order_lifecycle (
order_key BIGINT PRIMARY KEY,
customer_key INT NOT NULL,
product_key INT NOT NULL,
order_date_key INT NOT NULL,
ship_date_key INT, -- NULL until shipped
delivery_date_key INT, -- NULL until delivered
return_date_key INT, -- NULL if not returned
order_amount DECIMAL(14,2),
shipping_cost DECIMAL(10,2),
days_to_ship INT, -- Calculated: ship_date - order_date
days_to_deliver INT -- Calculated: delivery_date - ship_date
);
Key Concepts Summary
| Concept | Definition | When to Use | Trade-off |
|---|---|---|---|
| Star Schema | Denormalized dimensions around fact table | Fast reads, simple queries | Redundant storage |
| Snowflake Schema | Normalized dimension hierarchy | Storage efficiency | Complex joins |
| Fact Table | Stores business metrics/measures | All quantitative analysis | Large row counts |
| Dimension Table | Descriptive attributes for filtering/grouping | Filtering, grouping, labeling | Redundancy |
| Surrogate Key | System-generated unique identifier | All dimensions | Extra join columns |
| Natural Key | Business-meaningful identifier | Source alignment | May change |
| Grain | Atomic unit of measurement | Schema design | Flexibility vs storage |
| Conformed Dimension | Shared across multiple facts | Cross-subject analysis | Governance overhead |
| Degenerate Dimension | Dimension in fact table (no dim table) | Transaction identifiers | Limited attributes |
| Junk Dimension | Combines low-cardinality flags/statuses | Small categorical attributes | Complexity |
| Role-Playing Dimension | Same dimension used for multiple roles | Multiple date references | Join complexity |
| Hierarchy | Drill-down path in dimension | Organizational structures | Normalization depth |
Performance Metrics
| Metric | Star Schema | Snowflake | 3NF (Inmon) |
|---|---|---|---|
| Avg JOINs per query | K | Ξ£nα΅’ | 3K-5K |
| Storage efficiency | 60-70% | 80-85% | 95-100% |
| Query readability | High | Medium | Low |
| Aggregation speed | Fastest | Fast | Slow |
| ETL complexity | Low | Medium | High |
| BI tool compatibility | Excellent | Good | Poor |
| Update anomalies | Possible | Rare | Prevented |
10 Best Practices
- Always use surrogate keys for dimension tables β natural keys can change and break historical joins
- Define grain before designing facts β every fact table must answer "what does one row represent?"
- Conform dimensions early β shared dimensions enable cross-subject analytics
- Partition large fact tables by date to enable partition pruning and incremental loads
- Implement SCD Type 2 for any dimension requiring historical analysis or audit trails
- Denormalize dimensions for star schema in OLAP systems β storage is cheap, joins are expensive
- Pre-aggregate into summary tables for dashboards with fixed query patterns
- Use columnar storage (Parquet, ORC, column-store) for analytical workloads
- Separate hot and cold data β keep recent data in fast storage, archive old data to cheaper tiers
- Document every metric β create a data dictionary defining each measure, grain, and calculation
- Star schemas optimize for query simplicity and speed via denormalized dimensions
- Snowflake schemas trade join complexity for storage normalization
- SCD Type 2 is the gold standard for dimensional history tracking
- Fact grain definition is the single most important design decision
- Conformed dimensions enable enterprise-wide analytics across subject areas
See Also
- Snowflake Fundamentals β Cloud data warehouse platform with compute-storage separation
- dbt Fundamentals β SQL-first transformation layer for analytics engineering
- Data Lake Architecture β Designing scalable raw data storage systems
- Partitioning & Indexing β Optimizing data access patterns for warehouses
- Project 3: Warehouse Build β End-to-end data warehouse capstone project
- Data Governance & Catalog β Managing data metadata and discoverability at scale