🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Data Modeling Basics — ERD, Normalization, and Dimensional Modeling

Data Engineering FoundationsData Engineering Fundamentals🟢 Free Lesson

Advertisement

Data Modeling Basics — ERD, Normalization, and Dimensional Modeling

Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Data modeling is the foundation of every well-designed data system.

Overview

Entity-Relationship Diagrams (ERD)

ERD Symbols and Notation

ERD Entities and Relationships:

EntityPrimary KeyForeign KeysAttributes
CUSTOMERcustomer_idfirst_name, last_name, email, created_at
ORDERorder_idcustomer_id (FK)order_date, amount, status
ORDER_ITEMitem_idorder_id (FK), product_id (FK)quantity, unit_price
PRODUCTproduct_idname, category, price

Relationships:

  • CUSTOMER ||--o{ ORDER : places (one-to-many)
  • ORDER ||--|{ ORDER_ITEM : contains (one-to-many)
  • PRODUCT ||--o{ ORDER_ITEM : "ordered in" (one-to-many)

Relationship Types

RelationshipNotationDescriptionExample
One-to-One1:1Each record maps to exactly one otherUser and Profile
One-to-Many1:NOne record maps to many othersCustomer and Orders
Many-to-ManyM:NRecords on both sides map to manyStudents and Courses
Self-referentialN:1 (self)Record relates to others in same tableEmployee and Manager

ERD Design Process

Normalization Deep Dive

Normal Forms with Examples

1NF — Atomic Values (no repeating groups)

Before 1NFAfter 1NF
order_id: 1, items: "Widget A, Widget B"order_id: 1, item: "Widget A"
order_id: 1, item: "Widget B"

2NF — No Partial Dependencies

Before 2NFAfter 2NF
order_id, product_id, product_name, qtyorder_id, product_id, qty
product_name depends only on product_idproduct_id, product_name (separate table)

3NF — No Transitive Dependencies

Before 3NFAfter 3NF
order_id, customer_id, customer_zip, customer_cityorder_id, customer_id
customer_city depends on customer_zip, not order_idcustomer_zip, customer_city (separate table)

Normalization vs Denormalization

FactorNormalized (OLTP)Denormalized (OLAP)
RedundancyLowHigh
Write performanceBetter (one place to update)Worse (update multiple places)
Read performanceRequires JOINsPre-joined, faster reads
StorageLessMore
Use caseTransactional systemsAnalytics / data warehouses
Data integrityStrong (constraints)Application responsibility

Dimensional Modeling

Star Schema

Star Schema DesignFACT_SALESsale_id, quantityrevenue, costFKs: date, product, storeDIM_DATEdate_key, yearquarter, monthDIM_PRODUCTproduct_id, namecategory, brandDIM_STOREstore_id, namecity, regionDIM_CUSTOMERcustomer_id, namesegment, city

Star Schema Design:

TableTypePrimary KeyForeign KeysAttributes
FACT_SALESFactsale_idsale_date (FK), product_id (FK), store_id (FK), customer_id (FK)quantity, revenue, cost
DIM_DATEDimensiondate_keyyear, quarter, month, day_of_week, is_holiday
DIM_PRODUCTDimensionproduct_idproduct_name, category, brand, unit_cost
DIM_STOREDimensionstore_idstore_name, city, state, region
DIM_CUSTOMERDimensioncustomer_idcustomer_name, segment, city, state

Star Schema Relationships:

  • FACT_SALES ||--o{ DIM_DATE : "sold on"
  • FACT_SALES ||--o{ DIM_PRODUCT : "contains"
  • FACT_SALES ||--o{ DIM_STORE : "sold at"
  • FACT_SALES ||--o{ DIM_CUSTOMER : "bought by"

Snowflake Schema

Snowflake Schema Design:

TableTypePrimary KeyForeign KeysAttributes
FACT_SALESFactsale_idsale_date (FK), product_id (FK), store_id (FK)quantity, revenue
DIM_PRODUCTDimensionproduct_idcategory_id (FK)product_name
DIM_CATEGORYDimensioncategory_iddepartment_id (FK)category_name
DIM_DEPARTMENTDimensiondepartment_iddepartment_name
DIM_REGIONDimensionregion_idregion_name, country
Snowflake Schema DesignFACT_SALESsale_id, quantityrevenueDIM_PRODUCTproduct_nameDIM_CATEGORYcategory_nameDIM_DEPTdepartment_nameDIM_DATEdate_key, yearDIM_STOREstore_name, citycategory_id FKdept_id FK

Snowflake Schema Relationships:

  • FACT_SALES ||--o{ DIM_DATE : "sold on"
  • FACT_SALES ||--o{ DIM_PRODUCT : "contains"
  • DIM_PRODUCT ||--o{ DIM_CATEGORY : "belongs to"
  • DIM_CATEGORY ||--o{ DIM_DEPARTMENT : "part of"
  • DIM_STORE ||--o{ DIM_REGION : "located in"

Star vs Snowflake Comparison

FactorStar SchemaSnowflake Schema
NormalizationDenormalized dimensionsNormalized dimensions
Query complexitySimple JOINsMultiple JOINs
Query performanceFaster (fewer JOINs)Slower (more JOINs)
StorageMore (redundant data)Less (normalized)
MaintenanceEasier (fewer tables)Harder (more tables)
Best forMost data warehousesWhen dimension data is very large

Slowly Changing Dimensions (SCD)

SCD Types

TypeBehaviorUse CaseImplementation
SCD Type 1Overwrite old valueCorrecting mistakesUPDATE SET col = new_val
SCD Type 2Add new row, keep historyTrack changes over timeAdd valid_from, valid_to, is_current
SCD Type 3Add column for previous valueLimited historyAdd prev_col column
SCD Type 4History in separate tableFull audit trailSeparate dim_history table

SCD Type 2 Implementation

-- SCD Type 2 table design
CREATE TABLE dim_customer_scd2 (
    customer_sk SERIAL PRIMARY KEY,
    customer_id INTEGER,
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    email VARCHAR(255),
    segment VARCHAR(50),
    valid_from TIMESTAMP NOT NULL,
    valid_to TIMESTAMP,
    is_current BOOLEAN DEFAULT TRUE
);

-- Insert initial record
INSERT INTO dim_customer_scd2 (customer_id, first_name, last_name, email, segment, valid_from, is_current)
VALUES (1001, 'John', 'Doe', 'john@example.com', 'Premium', '2024-01-01', TRUE);

-- When customer changes segment, expire old record and insert new
UPDATE dim_customer_scd2
SET valid_to = CURRENT_TIMESTAMP, is_current = FALSE
WHERE customer_id = 1001 AND is_current = TRUE;

INSERT INTO dim_customer_scd2 (customer_id, first_name, last_name, email, segment, valid_from, is_current)
VALUES (1001, 'John', 'Doe', 'john@example.com', 'Enterprise', CURRENT_TIMESTAMP, TRUE);

-- Query current state
SELECT * FROM dim_customer_scd2 WHERE is_current = TRUE;

-- Query history
SELECT * FROM dim_customer_scd2 WHERE customer_id = 1001 ORDER BY valid_from;

SCD Decision Flow

Data Vault Modeling

ComponentPurposeExample
HubBusiness keys and identifiershub_customer(customer_id, load_date, record_source)
LinkRelationships between hubslink_order_customer(order_id, customer_id, load_date)
SatelliteDescriptive attributes and historysat_customer_details(customer_id, name, email, load_date, load_end_date)

Best Practices for Data Modeling

PracticeRationale
Start with requirementsModel for the queries you need, not all possible queries
Use surrogate keysSurrogate keys (SK) decouple from source system keys
Name consistentlydim_* for dimensions, fact_* for facts, stg_* for staging
Document the modelERDs, data dictionaries, and lineage diagrams
Test data qualityValidate relationships, nulls, and referential integrity
Plan for growthPartition fact tables by date, use appropriate data types
Version your modelsTrack schema changes in Git alongside code

MathSummary Takeaways

  1. ERDs visualize data relationships — use them to communicate entity structures before writing any SQL.
  2. Normalization eliminates redundancy — follow 1NF through 3NF for transactional (OLTP) systems.
  3. Dimensional modeling optimizes analytics — star schemas with fact and dimension tables are the standard for data warehouses.
  4. SCD Type 2 preserves history — track how dimension attributes change over time using validity windows.
  5. Denormalize intentionally — sacrifice normalization for query performance in OLAP workloads.
  6. Surrogate keys are preferred — they protect against source system key changes and enable SCD tracking.
  7. Data Vault enables auditing — Hubs, Links, and Satellites provide full lineage and historization.
  8. Model for the queries you run — the best model depends on access patterns, not theoretical purity.

See Also

Practice Exercises

  1. ERD design: Create an ERD for a library management system with books, authors, members, loans, and fines.

  2. Normalization exercise: Take a denormalized spreadsheet and normalize it to 3NF. Document each step.

  3. Star schema design: Design a star schema for a retail sales data warehouse with appropriate fact and dimension tables.

  4. SCD implementation: Implement SCD Type 2 for a customer dimension table in PostgreSQL. Write the merge logic.

  5. Data Vault: Convert a star schema to a Data Vault model. Compare the number of tables and query complexity.

Premium Content

Data Modeling Basics — ERD, Normalization, and Dimensional Modeling

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement