Database Normalization
Design efficient databases by eliminating data redundancy through normalization.
- Normal Forms β Structured approach from 1NF to BCNF for organizing data
- Data Integrity β Prevent anomalies during insert, update, and delete operations
- Schema Design β Build flexible databases that scale with your application Good database design is the foundation of efficient, maintainable data systems.
What is Normalization?
DfNormalization
Normalization is the process of organizing relational database tables to reduce data redundancy and improve data integrity. It involves decomposing tables into smaller, well-structured tables and defining relationships between them, following a series of normal forms (1NF, 2NF, 3NF, BCNF).
Without normalization, databases suffer from insertion anomalies, update anomalies, and deletion anomalies β situations where adding, changing, or removing data causes unintended loss of information.
| Anomaly | Description | Example |
|---|---|---|
| Insertion | Cannot add data without unrelated data | Need a customer to add an order |
| Update | Changing one row requires updating multiple rows | Customer address stored in every order |
| Deletion | Deleting a row loses unrelated data | Canceling an order deletes customer info |
Normal Forms Explained
First Normal Form (1NF)
DfFirst Normal Form (1NF)
A table is in 1NF if all columns contain atomic (indivisible) values and each row is unique. No repeating groups or arrays are allowed in a single column.
-- VIOLATION: Repeating group in a single column
CREATE TABLE orders_bad (
order_id INTEGER PRIMARY KEY,
customer_name TEXT,
products TEXT -- "Widget A, Gadget B, Widget C" (not atomic!)
);
-- COMPLIANT: 1NF with atomic values
CREATE TABLE orders_1nf (
order_id INTEGER PRIMARY KEY,
customer_name TEXT NOT NULL,
product_name TEXT NOT NULL,
quantity INTEGER NOT NULL
);
-- Insert data in 1NF format
INSERT INTO orders_1nf (order_id, customer_name, product_name, quantity)
VALUES
(1, 'Alice Smith', 'Widget A', 3),
(1, 'Alice Smith', 'Gadget B', 1),
(2, 'Bob Jones', 'Widget C', 5);
Second Normal Form (2NF)
DfSecond Normal Form (2NF)
A table is in 2NF if it is in 1NF and all non-key columns are fully dependent on the entire primary key. This eliminates partial dependencies where a column depends on only part of a composite key.
-- VIOLATION: Partial dependency (product_name depends only on product_id, not order_id)
CREATE TABLE order_items_bad (
order_id INTEGER,
product_id INTEGER,
product_name TEXT, -- Depends only on product_id
quantity INTEGER,
PRIMARY KEY (order_id, product_id)
);
-- COMPLIANT: 2NF by splitting into separate tables
CREATE TABLE products_2nf (
product_id INTEGER PRIMARY KEY,
product_name TEXT NOT NULL,
unit_price DECIMAL(10,2)
);
CREATE TABLE order_items_2nf (
order_id INTEGER,
product_id INTEGER,
quantity INTEGER NOT NULL,
PRIMARY KEY (order_id, product_id),
FOREIGN KEY (product_id) REFERENCES products_2nf(product_id)
);
Third Normal Form (3NF)
DfThird Normal Form (3NF)
A table is in 3NF if it is in 2NF and no non-key column depends on another non-key column (no transitive dependencies). Every non-key attribute must depend only on the primary key.
-- VIOLATION: Transitive dependency (city -> state)
CREATE TABLE customers_bad (
customer_id INTEGER PRIMARY KEY,
customer_name TEXT NOT NULL,
city TEXT NOT NULL,
state TEXT NOT NULL -- Depends on city, not customer_id!
);
-- COMPLIANT: 3NF by extracting city-state relationship
CREATE TABLE locations_3nf (
city TEXT PRIMARY KEY,
state TEXT NOT NULL
);
CREATE TABLE customers_3nf (
customer_id INTEGER PRIMARY KEY,
customer_name TEXT NOT NULL,
city TEXT NOT NULL,
FOREIGN KEY (city) REFERENCES locations_3nf(city)
);
Boyce-Codd Normal Form (BCNF)
DfBoyce-Codd Normal Form (BCNF)
BCNF is a stricter version of 3NF. A table is in BCNF if for every functional dependency X β Y, X is a superkey. This handles edge cases where 3NF is insufficient due to overlapping candidate keys.
-- BCNF example: Instructor-Course-Student relationship
CREATE TABLE teaches (
instructor_id INTEGER,
course_id INTEGER,
student_id INTEGER,
semester TEXT NOT NULL,
PRIMARY KEY (instructor_id, course_id, student_id)
);
-- If course_id determines instructor_id (each course has one instructor),
-- we decompose to satisfy BCNF:
CREATE TABLE course_instructor (
course_id INTEGER PRIMARY KEY,
instructor_id INTEGER NOT NULL
);
CREATE TABLE enrollment (
course_id INTEGER,
student_id INTEGER,
semester TEXT NOT NULL,
PRIMARY KEY (course_id, student_id),
FOREIGN KEY (course_id) REFERENCES course_instructor(course_id)
);
Normalization Summary Table
| Normal Form | Requirement | What It Eliminates |
|---|---|---|
| 1NF | Atomic values, unique rows | Repeating groups |
| 2NF | 1NF + full functional dependency | Partial dependencies |
| 3NF | 2NF + no transitive dependencies | Transitive dependencies |
| BCNF | Stricter 3NF for all FDs | Anomalies from overlapping keys |
| 4NF | BCNF + no multi-valued dependencies | Independent multi-valued facts |
| 5NF | 4NF + no join dependencies | Lossy decompositions |
Practical Example: E-Commerce Database
-- Denormalized (bad) single table
CREATE TABLE order_data_denorm (
order_id INTEGER PRIMARY KEY,
customer_name TEXT,
customer_email TEXT,
product_name TEXT,
product_category TEXT,
product_price DECIMAL(10,2),
order_date DATE,
ship_address TEXT
);
-- Normalized (good) schema
CREATE TABLE customers_norm (
customer_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
email TEXT NOT NULL UNIQUE
);
CREATE TABLE addresses (
address_id INTEGER PRIMARY KEY,
customer_id INTEGER NOT NULL,
street TEXT NOT NULL,
city TEXT NOT NULL,
state TEXT NOT NULL,
zip_code TEXT NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers_norm(customer_id)
);
CREATE TABLE categories (
category_id INTEGER PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE products_norm (
product_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
category_id INTEGER NOT NULL,
price DECIMAL(10,2) NOT NULL,
FOREIGN KEY (category_id) REFERENCES categories(category_id)
);
CREATE TABLE orders_norm (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER NOT NULL,
address_id INTEGER NOT NULL,
order_date DATE NOT NULL DEFAULT CURRENT_DATE,
FOREIGN KEY (customer_id) REFERENCES customers_norm(customer_id),
FOREIGN KEY (address_id) REFERENCES addresses(address_id)
);
CREATE TABLE order_items_norm (
order_id INTEGER,
product_id INTEGER,
quantity INTEGER NOT NULL,
unit_price DECIMAL(10,2) NOT NULL,
PRIMARY KEY (order_id, product_id),
FOREIGN KEY (order_id) REFERENCES orders_norm(order_id),
FOREIGN KEY (product_id) REFERENCES products_norm(product_id)
);
Checking Normalization Violations
-- Find duplicate customer emails (potential data integrity issue)
SELECT email, COUNT(*) AS occurrences
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
-- Detect transitive dependency: city determines state
SELECT city, COUNT(DISTINCT state) AS unique_states
FROM customers
GROUP BY city
HAVING COUNT(DISTINCT state) > 1;
-- Find tables with redundant data patterns
SELECT
table_name,
column_name,
COUNT(*) OVER (PARTITION BY column_name) AS duplicate_count
FROM information_schema.columns
WHERE table_schema = 'public';
When to Denormalize
Sometimes breaking normalization rules is the right choice:
| Scenario | Reason | Example |
|---|---|---|
| Read-heavy analytics | Reduce JOINs for reporting | Materialized views |
| Caching frequently accessed data | Avoid repeated computations | Denormalized totals in orders table |
| Write-once data | Simplicity over normalization | Log tables, event data |
| Performance-critical paths | Sub-millisecond latency required | User profile with embedded address |
-- Controlled denormalization: add computed column for fast reads
ALTER TABLE orders
ADD COLUMN total_amount DECIMAL(12,2)
GENERATED ALWAYS AS (
(SELECT SUM(quantity * unit_price) FROM order_items WHERE order_id = orders.order_id)
) STORED;
-- Use a materialized view for analytical queries
CREATE MATERIALIZED VIEW customer_order_summary AS
SELECT
c.customer_id,
c.name,
COUNT(o.order_id) AS total_orders,
SUM(o.total_amount) AS lifetime_value,
MAX(o.order_date) AS last_order_date
FROM customers_norm c
LEFT JOIN orders_norm o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;
Key Takeaways
- Normalization reduces redundancy β Each fact is stored in exactly one place
- Follow normal forms sequentially β 1NF β 2NF β 3NF β BCNF for most applications
- 3NF is the practical sweet spot β Handles most real-world scenarios without over-complication
- Denormalize strategically β Use it as a performance optimization, not a default
- Document design decisions β Future developers need to understand your normalization choices
Quick Reference Card
| Term | Definition |
|---|---|
| Functional Dependency | Column A determines column B (A β B) |
| Partial Dependency | Non-key column depends on part of a composite key |
| Transitive Dependency | Non-key column depends on another non-key column |
| Superkey | Set of columns that uniquely identifies a row |
| Candidate Key | Minimal superkey (no redundant columns) |
| Surrogate Key | Artificial primary key (auto-increment, UUID) |
| Natural Key | Business-meaningful primary key (email, SSN) |