πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Data Warehouse Concepts: Star Schema, Snowflake Schema & Dimensional Modeling

Module 3: Data Warehouses & StorageData Storage & Warehousing🟒 Free Lesson

Advertisement

Data Warehouse Concepts: The Foundation of Analytical Systems


What is a Data Warehouse?

A data warehouse is a centralized repository designed for analytical processing (OLAP) that integrates data from multiple heterogeneous sources to support decision-making.

Key Characteristics:

  • Integrates data from multiple heterogeneous sources
  • Optimized for complex read-heavy queries
  • Designed for massive historical datasets
  • Supports decision-making and business intelligence

Why Data Warehouses Matter


Problems Without a Data Warehouse:

  • Data silos β€” information trapped in disconnected systems
  • Inconsistent metrics β€” different teams calculate KPIs differently
  • Slow reporting β€” weeks to generate insights instead of minutes
  • No single source of truth β€” conflicting data across departments

How Data Warehouses Solve These Problems:

  1. Unified data layer β€” consolidate CRM, ERP, IoT, and web data
  2. Consistent metrics β€” standardized definitions across the organization
  3. Predictable query performance β€” optimized for analytical workloads
  4. Single source of truth β€” one version of the facts for all stakeholders

Key Insight: Modern enterprises generate petabytes of operational data. Without a unified analytical layer, organizations face data silos, inconsistent metrics, and slow reporting cycles.


Architecture Overview

Star Schema Diagram

Star Schema: Sales Data Warehousefact_salesdate_key | product_keycustomer_key | store_keyquantity | revenue | costdim_datedate | day | monthquarter | yeardim_productname | categorybrand | supplierdim_customername | segmentregion | loyaltydim_storename | citystate | countryFact: measures + FKs | Dimensions: descriptive attributes | Joins via dimension keys

Inmon vs Kimball

Inmon vs Kimball MethodologyInmon (Top-Down)Enterprise-wide 3NF model firstETL: Source -> Staging -> EDW -> MartsPros: single source of truthCons: long implementation cyclesSchema: 3NF (normalized)Best for: large enterprisesKimball (Bottom-Up)Business process dimensional modelETL: Source -> Dimensional DW -> MartsPros: fast time-to-valueCons: can have redundancySchema: Star/SnowflakeBest for: agile analytics

Inmon vs. Kimball Approaches


Star Schema

A star schema is a dimensional model consisting of one or more fact tables referencing one or more dimension tables, where each dimension is a single flat denormalized table directly connected to the fact table. The schema visually resembles a star with the fact table at the center.

Formal Properties

Star Schema Properties

  • F = Fact table containing foreign keys to dimensions and measures
  • D = {d₁, dβ‚‚, ..., dβ‚™} = Set of dimension tables
  • Join Pattern: F.dα΅’_key -> dα΅’.dα΅’_key for each i ∈ {1, ..., n}
  • Cardinality: |dα΅’| >> |F| typically (dimensions are smaller but denormalized)
  • Query Pattern: SELECT ... FROM F JOIN d₁ JOIN dβ‚‚ ... WHERE d₁.attr AND dβ‚‚.attr
  • Denormalization Factor: DF = (redundant_rows Γ— attributes) / normalized_rows
-- Star Schema: Fact Table with Surrogate Keys
CREATE TABLE fact_sales (
    sale_key        BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_key    INT NOT NULL REFERENCES dim_customer(customer_key),
    product_key     INT NOT NULL REFERENCES dim_product(product_key),
    store_key       INT NOT NULL REFERENCES dim_store(store_key),
    date_key        INT NOT NULL REFERENCES dim_date(date_key),
    promotion_key   INT NOT NULL REFERENCES dim_promotion(promotion_key),
    quantity_sold   INT,
    unit_price      DECIMAL(12,2),
    discount_pct    DECIMAL(5,4),
    gross_amount    DECIMAL(14,2),
    net_amount      DECIMAL(14,2),
    cost_of_goods   DECIMAL(14,2),
    profit_margin   DECIMAL(5,4),
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Star Schema: Denormalized Customer Dimension
CREATE TABLE dim_customer (
    customer_key    INT IDENTITY(1,1) PRIMARY KEY,
    customer_id     VARCHAR(50) UNIQUE NOT NULL,
    full_name       VARCHAR(200),
    email           VARCHAR(300),
    phone           VARCHAR(50),
    address_line1   VARCHAR(200),
    address_line2   VARCHAR(200),
    city            VARCHAR(100),
    state_province  VARCHAR(100),
    postal_code     VARCHAR(20),
    country         VARCHAR(100),
    region          VARCHAR(50),
    segment         VARCHAR(50),
    loyalty_tier    VARCHAR(20),
    lifetime_value  DECIMAL(14,2),
    valid_from      TIMESTAMP,
    valid_to        TIMESTAMP,
    is_current      BOOLEAN DEFAULT TRUE
);

-- Star Schema: Date Dimension
CREATE TABLE dim_date (
    date_key        INT PRIMARY KEY,
    full_date       DATE UNIQUE NOT NULL,
    day_of_week     SMALLINT,
    day_name        VARCHAR(10),
    month_number    SMALLINT,
    month_name      VARCHAR(10),
    quarter         SMALLINT,
    year            INT,
    fiscal_year     INT,
    fiscal_quarter  SMALLINT,
    is_weekend      BOOLEAN,
    is_holiday      BOOLEAN
);

-- Sample Star Schema Query
SELECT
    d.year,
    d.quarter,
    c.country,
    c.segment,
    p.product_category,
    SUM(f.net_amount)           AS total_revenue,
    SUM(f.quantity_sold)        AS units_sold,
    AVG(f.profit_margin)        AS avg_margin,
    COUNT(DISTINCT f.customer_key) AS unique_customers
FROM fact_sales f
JOIN dim_customer c  ON f.customer_key = c.customer_key
JOIN dim_product p   ON f.product_key = p.product_key
JOIN dim_date d      ON f.date_key = d.date_key
WHERE d.year = 2025 AND c.country = 'United States'
GROUP BY d.year, d.quarter, c.country, c.segment, p.product_category
ORDER BY total_revenue DESC;

Snowflake Schema

A snowflake schema extends the star schema by normalizing dimension tables into sub-dimension tables. Each dimension is decomposed into multiple related tables forming a hierarchy, reducing redundancy at the cost of additional joins.

Normalization Cost Analysis

Snowflake Join Complexity

  • Let K = number of dimensions, N = average normalization depth per dimension
  • Star JOINs: K joins (one per dimension)
  • Snowflake JOINs: Ξ£α΅’β‚Œβ‚α΄· nα΅’ joins where nα΅’ is the depth of dimension i
  • Example: 5 dimensions each normalized to depth 3 -> 15 joins vs 5 joins
  • Storage Savings: S = Ξ£α΅’β‚Œβ‚α΄· (redundancy_i Γ— row_count_i Γ— row_size_i)
-- Snowflake Schema: Normalized Dimensions
CREATE TABLE dim_customer (
    customer_key    INT IDENTITY(1,1) PRIMARY KEY,
    customer_id     VARCHAR(50) UNIQUE NOT NULL,
    full_name       VARCHAR(200),
    email           VARCHAR(300),
    city_key        INT REFERENCES dim_city(city_key),
    segment_key     INT REFERENCES dim_segment(segment_key),
    valid_from      TIMESTAMP,
    valid_to        TIMESTAMP,
    is_current      BOOLEAN DEFAULT TRUE
);

CREATE TABLE dim_city (
    city_key        INT IDENTITY(1,1) PRIMARY KEY,
    city_name       VARCHAR(100),
    state_key       INT REFERENCES dim_state(state_key)
);

CREATE TABLE dim_state (
    state_key       INT IDENTITY(1,1) PRIMARY KEY,
    state_name      VARCHAR(100),
    country_key     INT REFERENCES dim_country(country_key)
);

CREATE TABLE dim_country (
    country_key     INT IDENTITY(1,1) PRIMARY KEY,
    country_name    VARCHAR(100),
    continent       VARCHAR(50),
    currency_code   VARCHAR(3)
);

-- Snowflake Query with Multiple Joins
SELECT
    co.country_name,
    st.state_name,
    ci.city_name,
    sg.segment_name,
    SUM(f.net_amount) AS total_revenue
FROM fact_sales f
JOIN dim_customer cu  ON f.customer_key = cu.customer_key
JOIN dim_city ci      ON cu.city_key = ci.city_key
JOIN dim_state st     ON ci.state_key = st.state_key
JOIN dim_country co   ON st.country_key = co.country_key
JOIN dim_segment sg   ON cu.segment_key = sg.segment_key
GROUP BY co.country_name, st.state_name, ci.city_name, sg.segment_name;

Slowly Changing Dimensions (SCD)

SCD strategies manage how changes in dimension attributes are tracked over time. The type determines whether historical values are preserved, overwritten, or partially tracked.

SCD TypeStrategyStorage ImpactHistorical AccuracyQuery ComplexityUse Case
Type 0Retain originalMinimalOriginal onlyLowImmutable attributes
Type 1OverwriteMinimalNo historyLowCorrecting errors
Type 2Add rowHigh (2x-5x)Full historyMediumAudit/compliance
Type 3Add columnModerateLimited (N versions)MediumLimited history
Type 4History tableHighFull historyHighVery large dimensions
Type 6HybridHighFull + currentHighComplex requirements
-- SCD Type 2 Implementation
CREATE TABLE dim_customer_scd2 (
    customer_key    BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_id     VARCHAR(50) NOT NULL,
    full_name       VARCHAR(200),
    email           VARCHAR(300),
    segment         VARCHAR(50),
    city            VARCHAR(100),
    valid_from      TIMESTAMP NOT NULL,
    valid_to        TIMESTAMP NOT NULL DEFAULT '9999-12-31',
    is_current      BOOLEAN DEFAULT TRUE,
    row_hash        VARCHAR(64)  -- MD5 hash for change detection
);

-- Merge logic for SCD Type 2
MERGE INTO dim_customer_scd2 AS target
USING (
    SELECT
        customer_id,
        full_name,
        email,
        segment,
        city,
        MD5(CONCAT(full_name, '|', email, '|', segment, '|', city)) AS row_hash
    FROM staging.customers
) AS source
ON target.customer_id = source.customer_id AND target.is_current = TRUE
WHEN MATCHED AND target.row_hash != source.row_hash THEN
    UPDATE SET
        is_current = FALSE,
        valid_to = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
    INSERT (customer_id, full_name, email, segment, city, valid_from, is_current, row_hash)
    VALUES (source.customer_id, source.full_name, source.email,
            source.segment, source.city, CURRENT_TIMESTAMP, TRUE, source.row_hash);

-- Insert new versions for changed records
INSERT INTO dim_customer_scd2
    (customer_id, full_name, email, segment, city, valid_from, is_current, row_hash)
SELECT
    source.customer_id, source.full_name, source.email,
    source.segment, source.city, CURRENT_TIMESTAMP, TRUE, source.row_hash
FROM staging.customers source
JOIN dim_customer_scd2 target
    ON source.customer_id = target.customer_id
    AND target.row_hash != MD5(CONCAT(source.full_name, '|', source.email,
        '|', source.segment, '|', source.city))
WHERE target.is_current = TRUE;

Fact Table Types

Grain defines what a single row in a fact table represents. The grain determines the most atomic level of measurement and directly impacts query flexibility, storage requirements, and aggregation performance.

Fact TypeGrainMeasuresNull BehaviorUse Case
TransactionOne row per eventAdditiveNo nulls possibleSales, clicks, logs
Periodic SnapshotOne row per period per entityAdditiveNulls for inactive periodsMonthly balances
Accumulating SnapshotOne row per lifecycleSemi-additiveNulls for incompleteOrder fulfillment
Factless FactEvents without measuresCount onlyN/AAttendance, status changes
-- Transaction Fact Table
CREATE TABLE fact_web_events (
    event_key       BIGINT IDENTITY(1,1) PRIMARY KEY,
    visitor_key     INT NOT NULL,
    session_key     INT NOT NULL,
    page_key        INT NOT NULL,
    event_date_key  INT NOT NULL,
    event_timestamp TIMESTAMP,
    page_view_count INT DEFAULT 1,
    click_count     INT DEFAULT 0,
    bounce_flag     BOOLEAN,
    session_duration_sec INT,
    referrer_domain VARCHAR(200)
);

-- Periodic Snapshot Fact Table (Monthly)
CREATE TABLE fact_inventory_monthly (
    inventory_key   BIGINT IDENTITY(1,1) PRIMARY KEY,
    product_key     INT NOT NULL,
    warehouse_key   INT NOT NULL,
    month_date_key  INT NOT NULL,  -- First day of month
    quantity_on_hand INT,
    quantity_reserved INT,
    quantity_available INT,
    unit_cost       DECIMAL(12,4),
    total_value     DECIMAL(14,2),
    reorder_point   INT,
    stockout_flag   BOOLEAN
);

-- Accumulating Snapshot Fact Table
CREATE TABLE fact_order_lifecycle (
    order_key           BIGINT PRIMARY KEY,
    customer_key        INT NOT NULL,
    product_key         INT NOT NULL,
    order_date_key      INT NOT NULL,
    ship_date_key       INT,  -- NULL until shipped
    delivery_date_key   INT,  -- NULL until delivered
    return_date_key     INT,  -- NULL if not returned
    order_amount        DECIMAL(14,2),
    shipping_cost       DECIMAL(10,2),
    days_to_ship        INT,  -- Calculated: ship_date - order_date
    days_to_deliver     INT   -- Calculated: delivery_date - ship_date
);

Key Concepts Summary

ConceptDefinitionWhen to UseTrade-off
Star SchemaDenormalized dimensions around fact tableFast reads, simple queriesRedundant storage
Snowflake SchemaNormalized dimension hierarchyStorage efficiencyComplex joins
Fact TableStores business metrics/measuresAll quantitative analysisLarge row counts
Dimension TableDescriptive attributes for filtering/groupingFiltering, grouping, labelingRedundancy
Surrogate KeySystem-generated unique identifierAll dimensionsExtra join columns
Natural KeyBusiness-meaningful identifierSource alignmentMay change
GrainAtomic unit of measurementSchema designFlexibility vs storage
Conformed DimensionShared across multiple factsCross-subject analysisGovernance overhead
Degenerate DimensionDimension in fact table (no dim table)Transaction identifiersLimited attributes
Junk DimensionCombines low-cardinality flags/statusesSmall categorical attributesComplexity
Role-Playing DimensionSame dimension used for multiple rolesMultiple date referencesJoin complexity
HierarchyDrill-down path in dimensionOrganizational structuresNormalization depth

Performance Metrics

MetricStar SchemaSnowflake3NF (Inmon)
Avg JOINs per queryKΞ£nα΅’3K-5K
Storage efficiency60-70%80-85%95-100%
Query readabilityHighMediumLow
Aggregation speedFastestFastSlow
ETL complexityLowMediumHigh
BI tool compatibilityExcellentGoodPoor
Update anomaliesPossibleRarePrevented

10 Best Practices

  1. Always use surrogate keys for dimension tables β€” natural keys can change and break historical joins
  2. Define grain before designing facts β€” every fact table must answer "what does one row represent?"
  3. Conform dimensions early β€” shared dimensions enable cross-subject analytics
  4. Partition large fact tables by date to enable partition pruning and incremental loads
  5. Implement SCD Type 2 for any dimension requiring historical analysis or audit trails
  6. Denormalize dimensions for star schema in OLAP systems β€” storage is cheap, joins are expensive
  7. Pre-aggregate into summary tables for dashboards with fixed query patterns
  8. Use columnar storage (Parquet, ORC, column-store) for analytical workloads
  9. Separate hot and cold data β€” keep recent data in fast storage, archive old data to cheaper tiers
  10. Document every metric β€” create a data dictionary defining each measure, grain, and calculation

  • Star schemas optimize for query simplicity and speed via denormalized dimensions
  • Snowflake schemas trade join complexity for storage normalization
  • SCD Type 2 is the gold standard for dimensional history tracking
  • Fact grain definition is the single most important design decision
  • Conformed dimensions enable enterprise-wide analytics across subject areas

See Also

⭐

Premium Content

Data Warehouse Concepts: Star Schema, Snowflake Schema & Dimensional Modeling

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement