Data Modeling Basics — ERD, Normalization, and Dimensional Modeling

Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Data modeling is the foundation of every well-designed data system.

Overview

Entity-Relationship Diagrams (ERD)

ERD Symbols and Notation

ERD Entities and Relationships:

Entity	Primary Key	Foreign Keys	Attributes
CUSTOMER	customer_id	—	first_name, last_name, email, created_at
ORDER	order_id	customer_id (FK)	order_date, amount, status
ORDER_ITEM	item_id	order_id (FK), product_id (FK)	quantity, unit_price
PRODUCT	product_id	—	name, category, price

Relationships:

CUSTOMER ||--o{ ORDER : places (one-to-many)
ORDER ||--|{ ORDER_ITEM : contains (one-to-many)
PRODUCT ||--o{ ORDER_ITEM : "ordered in" (one-to-many)

Relationship Types

Relationship	Notation	Description	Example
One-to-One	`1:1`	Each record maps to exactly one other	User and Profile
One-to-Many	`1:N`	One record maps to many others	Customer and Orders
Many-to-Many	`M:N`	Records on both sides map to many	Students and Courses
Self-referential	`N:1` (self)	Record relates to others in same table	Employee and Manager

ERD Design Process

Normalization Deep Dive

Normal Forms with Examples

1NF — Atomic Values (no repeating groups)

Before 1NF	After 1NF
order_id: 1, items: "Widget A, Widget B"	order_id: 1, item: "Widget A"
	order_id: 1, item: "Widget B"

2NF — No Partial Dependencies

Before 2NF	After 2NF
order_id, product_id, product_name, qty	order_id, product_id, qty
product_name depends only on product_id	product_id, product_name (separate table)

3NF — No Transitive Dependencies

Before 3NF	After 3NF
order_id, customer_id, customer_zip, customer_city	order_id, customer_id
customer_city depends on customer_zip, not order_id	customer_zip, customer_city (separate table)

Normalization vs Denormalization

Factor	Normalized (OLTP)	Denormalized (OLAP)
Redundancy	Low	High
Write performance	Better (one place to update)	Worse (update multiple places)
Read performance	Requires JOINs	Pre-joined, faster reads
Storage	Less	More
Use case	Transactional systems	Analytics / data warehouses
Data integrity	Strong (constraints)	Application responsibility

Dimensional Modeling

Star Schema

Star Schema Design:

Table	Type	Primary Key	Foreign Keys	Attributes
FACT_SALES	Fact	sale_id	sale_date (FK), product_id (FK), store_id (FK), customer_id (FK)	quantity, revenue, cost
DIM_DATE	Dimension	date_key	—	year, quarter, month, day_of_week, is_holiday
DIM_PRODUCT	Dimension	product_id	—	product_name, category, brand, unit_cost
DIM_STORE	Dimension	store_id	—	store_name, city, state, region
DIM_CUSTOMER	Dimension	customer_id	—	customer_name, segment, city, state

Star Schema Relationships:

FACT_SALES ||--o{ DIM_DATE : "sold on"
FACT_SALES ||--o{ DIM_PRODUCT : "contains"
FACT_SALES ||--o{ DIM_STORE : "sold at"
FACT_SALES ||--o{ DIM_CUSTOMER : "bought by"

Snowflake Schema

Snowflake Schema Design:

Table	Type	Primary Key	Foreign Keys	Attributes
FACT_SALES	Fact	sale_id	sale_date (FK), product_id (FK), store_id (FK)	quantity, revenue
DIM_PRODUCT	Dimension	product_id	category_id (FK)	product_name
DIM_CATEGORY	Dimension	category_id	department_id (FK)	category_name
DIM_DEPARTMENT	Dimension	department_id	—	department_name
DIM_REGION	Dimension	region_id	—	region_name, country

Snowflake Schema Relationships:

FACT_SALES ||--o{ DIM_DATE : "sold on"
FACT_SALES ||--o{ DIM_PRODUCT : "contains"
DIM_PRODUCT ||--o{ DIM_CATEGORY : "belongs to"
DIM_CATEGORY ||--o{ DIM_DEPARTMENT : "part of"
DIM_STORE ||--o{ DIM_REGION : "located in"

Star vs Snowflake Comparison

Factor	Star Schema	Snowflake Schema
Normalization	Denormalized dimensions	Normalized dimensions
Query complexity	Simple JOINs	Multiple JOINs
Query performance	Faster (fewer JOINs)	Slower (more JOINs)
Storage	More (redundant data)	Less (normalized)
Maintenance	Easier (fewer tables)	Harder (more tables)
Best for	Most data warehouses	When dimension data is very large

Slowly Changing Dimensions (SCD)

SCD Types

Type	Behavior	Use Case	Implementation
SCD Type 1	Overwrite old value	Correcting mistakes	`UPDATE SET col = new_val`
SCD Type 2	Add new row, keep history	Track changes over time	Add `valid_from`, `valid_to`, `is_current`
SCD Type 3	Add column for previous value	Limited history	Add `prev_col` column
SCD Type 4	History in separate table	Full audit trail	Separate `dim_history` table

SCD Type 2 Implementation

-- SCD Type 2 table design
CREATE TABLE dim_customer_scd2 (
    customer_sk SERIAL PRIMARY KEY,
    customer_id INTEGER,
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    email VARCHAR(255),
    segment VARCHAR(50),
    valid_from TIMESTAMP NOT NULL,
    valid_to TIMESTAMP,
    is_current BOOLEAN DEFAULT TRUE
);

-- Insert initial record
INSERT INTO dim_customer_scd2 (customer_id, first_name, last_name, email, segment, valid_from, is_current)
VALUES (1001, 'John', 'Doe', 'john@example.com', 'Premium', '2024-01-01', TRUE);

-- When customer changes segment, expire old record and insert new
UPDATE dim_customer_scd2
SET valid_to = CURRENT_TIMESTAMP, is_current = FALSE
WHERE customer_id = 1001 AND is_current = TRUE;

INSERT INTO dim_customer_scd2 (customer_id, first_name, last_name, email, segment, valid_from, is_current)
VALUES (1001, 'John', 'Doe', 'john@example.com', 'Enterprise', CURRENT_TIMESTAMP, TRUE);

-- Query current state
SELECT * FROM dim_customer_scd2 WHERE is_current = TRUE;

-- Query history
SELECT * FROM dim_customer_scd2 WHERE customer_id = 1001 ORDER BY valid_from;

SCD Decision Flow

Data Vault Modeling

Component	Purpose	Example
Hub	Business keys and identifiers	`hub_customer(customer_id, load_date, record_source)`
Link	Relationships between hubs	`link_order_customer(order_id, customer_id, load_date)`
Satellite	Descriptive attributes and history	`sat_customer_details(customer_id, name, email, load_date, load_end_date)`

Best Practices for Data Modeling

Practice	Rationale
Start with requirements	Model for the queries you need, not all possible queries
Use surrogate keys	Surrogate keys (SK) decouple from source system keys
Name consistently	`dim_` for dimensions, `fact_` for facts, `stg_*` for staging
Document the model	ERDs, data dictionaries, and lineage diagrams
Test data quality	Validate relationships, nulls, and referential integrity
Plan for growth	Partition fact tables by date, use appropriate data types
Version your models	Track schema changes in Git alongside code

MathSummary Takeaways

ERDs visualize data relationships — use them to communicate entity structures before writing any SQL.
Normalization eliminates redundancy — follow 1NF through 3NF for transactional (OLTP) systems.
Dimensional modeling optimizes analytics — star schemas with fact and dimension tables are the standard for data warehouses.
SCD Type 2 preserves history — track how dimension attributes change over time using validity windows.
Denormalize intentionally — sacrifice normalization for query performance in OLAP workloads.
Surrogate keys are preferred — they protect against source system key changes and enable SCD tracking.
Data Vault enables auditing — Hubs, Links, and Satellites provide full lineage and historization.
Model for the queries you run — the best model depends on access patterns, not theoretical purity.

Practice Exercises

ERD design: Create an ERD for a library management system with books, authors, members, loans, and fines.
Normalization exercise: Take a denormalized spreadsheet and normalize it to 3NF. Document each step.
Star schema design: Design a star schema for a retail sales data warehouse with appropriate fact and dimension tables.
SCD implementation: Implement SCD Type 2 for a customer dimension table in PostgreSQL. Write the merge logic.
Data Vault: Convert a star schema to a Data Vault model. Compare the number of tables and query complexity.

Data Modeling Basics — ERD, Normalization, and Dimensional Modeling

Data Modeling Basics — ERD, Normalization, and Dimensional Modeling

Overview

Entity-Relationship Diagrams (ERD)

ERD Symbols and Notation

Relationship Types

ERD Design Process

Normalization Deep Dive

Normal Forms with Examples

Normalization vs Denormalization

Dimensional Modeling

Star Schema

Snowflake Schema

Star vs Snowflake Comparison

Slowly Changing Dimensions (SCD)

SCD Types

SCD Type 2 Implementation

SCD Decision Flow

Data Vault Modeling

Best Practices for Data Modeling

MathSummary Takeaways

See Also

Practice Exercises

Premium Content

Need Expert Data Engineering Help?