Data Modeling Basics — ERD, Normalization, and Dimensional Modeling
Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Data modeling is the foundation of every well-designed data system.
Overview
Entity-Relationship Diagrams (ERD)
ERD Symbols and Notation
ERD Entities and Relationships:
| Entity | Primary Key | Foreign Keys | Attributes |
|---|---|---|---|
| CUSTOMER | customer_id | — | first_name, last_name, email, created_at |
| ORDER | order_id | customer_id (FK) | order_date, amount, status |
| ORDER_ITEM | item_id | order_id (FK), product_id (FK) | quantity, unit_price |
| PRODUCT | product_id | — | name, category, price |
Relationships:
- CUSTOMER ||--o{ ORDER : places (one-to-many)
- ORDER ||--|{ ORDER_ITEM : contains (one-to-many)
- PRODUCT ||--o{ ORDER_ITEM : "ordered in" (one-to-many)
Relationship Types
| Relationship | Notation | Description | Example |
|---|---|---|---|
| One-to-One | 1:1 | Each record maps to exactly one other | User and Profile |
| One-to-Many | 1:N | One record maps to many others | Customer and Orders |
| Many-to-Many | M:N | Records on both sides map to many | Students and Courses |
| Self-referential | N:1 (self) | Record relates to others in same table | Employee and Manager |
ERD Design Process
Normalization Deep Dive
Normal Forms with Examples
1NF — Atomic Values (no repeating groups)
| Before 1NF | After 1NF |
|---|---|
| order_id: 1, items: "Widget A, Widget B" | order_id: 1, item: "Widget A" |
| order_id: 1, item: "Widget B" |
2NF — No Partial Dependencies
| Before 2NF | After 2NF |
|---|---|
| order_id, product_id, product_name, qty | order_id, product_id, qty |
| product_name depends only on product_id | product_id, product_name (separate table) |
3NF — No Transitive Dependencies
| Before 3NF | After 3NF |
|---|---|
| order_id, customer_id, customer_zip, customer_city | order_id, customer_id |
| customer_city depends on customer_zip, not order_id | customer_zip, customer_city (separate table) |
Normalization vs Denormalization
| Factor | Normalized (OLTP) | Denormalized (OLAP) |
|---|---|---|
| Redundancy | Low | High |
| Write performance | Better (one place to update) | Worse (update multiple places) |
| Read performance | Requires JOINs | Pre-joined, faster reads |
| Storage | Less | More |
| Use case | Transactional systems | Analytics / data warehouses |
| Data integrity | Strong (constraints) | Application responsibility |
Dimensional Modeling
Star Schema
Star Schema Design:
| Table | Type | Primary Key | Foreign Keys | Attributes |
|---|---|---|---|---|
| FACT_SALES | Fact | sale_id | sale_date (FK), product_id (FK), store_id (FK), customer_id (FK) | quantity, revenue, cost |
| DIM_DATE | Dimension | date_key | — | year, quarter, month, day_of_week, is_holiday |
| DIM_PRODUCT | Dimension | product_id | — | product_name, category, brand, unit_cost |
| DIM_STORE | Dimension | store_id | — | store_name, city, state, region |
| DIM_CUSTOMER | Dimension | customer_id | — | customer_name, segment, city, state |
Star Schema Relationships:
- FACT_SALES ||--o{ DIM_DATE : "sold on"
- FACT_SALES ||--o{ DIM_PRODUCT : "contains"
- FACT_SALES ||--o{ DIM_STORE : "sold at"
- FACT_SALES ||--o{ DIM_CUSTOMER : "bought by"
Snowflake Schema
Snowflake Schema Design:
| Table | Type | Primary Key | Foreign Keys | Attributes |
|---|---|---|---|---|
| FACT_SALES | Fact | sale_id | sale_date (FK), product_id (FK), store_id (FK) | quantity, revenue |
| DIM_PRODUCT | Dimension | product_id | category_id (FK) | product_name |
| DIM_CATEGORY | Dimension | category_id | department_id (FK) | category_name |
| DIM_DEPARTMENT | Dimension | department_id | — | department_name |
| DIM_REGION | Dimension | region_id | — | region_name, country |
Snowflake Schema Relationships:
- FACT_SALES ||--o{ DIM_DATE : "sold on"
- FACT_SALES ||--o{ DIM_PRODUCT : "contains"
- DIM_PRODUCT ||--o{ DIM_CATEGORY : "belongs to"
- DIM_CATEGORY ||--o{ DIM_DEPARTMENT : "part of"
- DIM_STORE ||--o{ DIM_REGION : "located in"
Star vs Snowflake Comparison
| Factor | Star Schema | Snowflake Schema |
|---|---|---|
| Normalization | Denormalized dimensions | Normalized dimensions |
| Query complexity | Simple JOINs | Multiple JOINs |
| Query performance | Faster (fewer JOINs) | Slower (more JOINs) |
| Storage | More (redundant data) | Less (normalized) |
| Maintenance | Easier (fewer tables) | Harder (more tables) |
| Best for | Most data warehouses | When dimension data is very large |
Slowly Changing Dimensions (SCD)
SCD Types
| Type | Behavior | Use Case | Implementation |
|---|---|---|---|
| SCD Type 1 | Overwrite old value | Correcting mistakes | UPDATE SET col = new_val |
| SCD Type 2 | Add new row, keep history | Track changes over time | Add valid_from, valid_to, is_current |
| SCD Type 3 | Add column for previous value | Limited history | Add prev_col column |
| SCD Type 4 | History in separate table | Full audit trail | Separate dim_history table |
SCD Type 2 Implementation
-- SCD Type 2 table design
CREATE TABLE dim_customer_scd2 (
customer_sk SERIAL PRIMARY KEY,
customer_id INTEGER,
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(255),
segment VARCHAR(50),
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP,
is_current BOOLEAN DEFAULT TRUE
);
-- Insert initial record
INSERT INTO dim_customer_scd2 (customer_id, first_name, last_name, email, segment, valid_from, is_current)
VALUES (1001, 'John', 'Doe', 'john@example.com', 'Premium', '2024-01-01', TRUE);
-- When customer changes segment, expire old record and insert new
UPDATE dim_customer_scd2
SET valid_to = CURRENT_TIMESTAMP, is_current = FALSE
WHERE customer_id = 1001 AND is_current = TRUE;
INSERT INTO dim_customer_scd2 (customer_id, first_name, last_name, email, segment, valid_from, is_current)
VALUES (1001, 'John', 'Doe', 'john@example.com', 'Enterprise', CURRENT_TIMESTAMP, TRUE);
-- Query current state
SELECT * FROM dim_customer_scd2 WHERE is_current = TRUE;
-- Query history
SELECT * FROM dim_customer_scd2 WHERE customer_id = 1001 ORDER BY valid_from;
SCD Decision Flow
Data Vault Modeling
| Component | Purpose | Example |
|---|---|---|
| Hub | Business keys and identifiers | hub_customer(customer_id, load_date, record_source) |
| Link | Relationships between hubs | link_order_customer(order_id, customer_id, load_date) |
| Satellite | Descriptive attributes and history | sat_customer_details(customer_id, name, email, load_date, load_end_date) |
Best Practices for Data Modeling
| Practice | Rationale |
|---|---|
| Start with requirements | Model for the queries you need, not all possible queries |
| Use surrogate keys | Surrogate keys (SK) decouple from source system keys |
| Name consistently | dim_* for dimensions, fact_* for facts, stg_* for staging |
| Document the model | ERDs, data dictionaries, and lineage diagrams |
| Test data quality | Validate relationships, nulls, and referential integrity |
| Plan for growth | Partition fact tables by date, use appropriate data types |
| Version your models | Track schema changes in Git alongside code |
MathSummary Takeaways
- ERDs visualize data relationships — use them to communicate entity structures before writing any SQL.
- Normalization eliminates redundancy — follow 1NF through 3NF for transactional (OLTP) systems.
- Dimensional modeling optimizes analytics — star schemas with fact and dimension tables are the standard for data warehouses.
- SCD Type 2 preserves history — track how dimension attributes change over time using validity windows.
- Denormalize intentionally — sacrifice normalization for query performance in OLAP workloads.
- Surrogate keys are preferred — they protect against source system key changes and enable SCD tracking.
- Data Vault enables auditing — Hubs, Links, and Satellites provide full lineage and historization.
- Model for the queries you run — the best model depends on access patterns, not theoretical purity.
See Also
- What is Data Engineering — Introduction to data engineering
- SQL Fundamentals — Essential SQL skills
- Advanced SQL — Advanced SQL techniques
- Databases Fundamentals — Relational vs NoSQL databases
- Data Formats — JSON, Parquet, Avro comparison
Practice Exercises
-
ERD design: Create an ERD for a library management system with books, authors, members, loans, and fines.
-
Normalization exercise: Take a denormalized spreadsheet and normalize it to 3NF. Document each step.
-
Star schema design: Design a star schema for a retail sales data warehouse with appropriate fact and dimension tables.
-
SCD implementation: Implement SCD Type 2 for a customer dimension table in PostgreSQL. Write the merge logic.
-
Data Vault: Convert a star schema to a Data Vault model. Compare the number of tables and query complexity.