Database Fundamentals — Relational vs NoSQL

Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Understanding database fundamentals is critical to building reliable, scalable data systems.

Overview

Relational Databases (RDBMS)

Core Concepts

Relational databases store data in tables with rows and columns, enforcing a predefined schema. Relationships between tables are expressed through foreign keys.

-- Create tables with relationships
CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    order_date DATE NOT NULL,
    amount DECIMAL(10,2) NOT NULL,
    status VARCHAR(50) DEFAULT 'pending'
);

-- Query with JOIN
SELECT c.first_name, c.last_name, COUNT(o.order_id) AS order_count,
       SUM(o.amount) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
HAVING SUM(o.amount) > 1000
ORDER BY total_spent DESC;

ACID Properties

Property	Description	Example
Atomicity	All operations succeed or all fail	Transfer: debit and credit happen together or not at all
Consistency	Data always satisfies constraints	A foreign key always references an existing row
Isolation	Concurrent transactions don't interfere	Two users can't buy the last item simultaneously
Durability	Committed data survives crashes	Power failure doesn't lose confirmed transactions

Normalization

Normalization reduces data redundancy by splitting data into related tables.

Normal Form	Rule	Example Problem Solved
1NF	Each column contains atomic values	Remove comma-separated lists from a cell
2NF	No partial dependency on composite key	Move customer name out of orders table
3NF	No transitive dependency	Move product category name to its own table

Indexing Strategies

-- B-tree index (default, good for equality and range)
CREATE INDEX idx_orders_date ON orders(order_date);

-- Composite index (order matters!)
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);

-- Partial index (index only relevant rows)
CREATE INDEX idx_orders_pending ON orders(order_date)
WHERE status = 'pending';

-- GIN index for full-text search
CREATE INDEX idx_products_search ON products
USING gin(to_tsvector('english', name || ' ' || description));

Index Type	Best For	Not Good For
B-tree	Equality, range queries, sorting	Full-text search, geometric data
Hash	Equality lookups only	Range queries, sorting
GIN	Full-text search, arrays, JSONB	Simple equality
GiST	Geometric, full-text search	Simple lookups
BRIN	Large, naturally ordered tables	Random access patterns

NoSQL Databases

Types of NoSQL Databases

Document Store (MongoDB)

// MongoDB document (flexible schema)
db.orders.insertOne({
    order_id: "ORD-001",
    customer: {
        name: "John Doe",
        email: "john@example.com",
        address: {
            city: "New York",
            state: "NY"
        }
    },
    items: [
        { product: "Widget A", qty: 2, price: 29.99 },
        { product: "Widget B", qty: 1, price: 49.99 }
    ],
    total: 109.97,
    created_at: new Date()
});

// Aggregation pipeline
db.orders.aggregate([
    { $match: { "customer.address.state": "NY" } },
    { $unwind: "$items" },
    { $group: {
        _id: "$items.product",
        total_qty: { $sum: "$items.qty" },
        total_revenue: { $sum: { $multiply: ["$items.qty", "$items.price"] } }
    }},
    { $sort: { total_revenue: -1 } }
]);

Wide-Column Store (Cassandra)

-- Cassandra CQL (column-family model)
CREATE TABLE orders_by_customer (
    customer_id UUID,
    order_date TIMESTAMP,
    order_id UUID,
    amount DECIMAL,
    status TEXT,
    PRIMARY KEY ((customer_id), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC);

-- Query (must use partition key)
SELECT * FROM orders_by_customer
WHERE customer_id = ? AND order_date >= '2024-01-01';

Key-Value Store (Redis)

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Simple key-value operations
r.set('user:1001:name', 'John Doe')
r.get('user:1001:name')

# Cache query results
import json

def get_cached_orders(customer_id, db):
    cache_key = f"orders:{customer_id}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)
    
    orders = db.execute_query(
        "SELECT * FROM orders WHERE customer_id = %s",
        (customer_id,)
    )
    r.setex(cache_key, 3600, json.dumps(orders))  # TTL: 1 hour
    return orders

CAP Theorem

System	Consistency	Availability	Partition Tolerance	Type
PostgreSQL	Yes	Yes	No (single node)	CP
MongoDB (default)	Yes	Yes	Yes	CP
Cassandra	Tunable	Yes	Yes	AP
DynamoDB	Tunable	Yes	Yes	AP
Redis (standalone)	Yes	Yes	No	CP
Riak	No	Yes	Yes	AP

When to Use Which

Use Case	Recommended	Why
Transactional ERP system	PostgreSQL / MySQL	ACID compliance, complex joins
User session storage	Redis	Sub-millisecond reads, TTL support
Product catalog (varied attributes)	MongoDB	Flexible schema, nested documents
Time-series metrics	Cassandra / TimescaleDB	High write throughput, time-based partitioning
Social network / recommendations	Neo4j	Graph traversal queries
Analytics warehouse	PostgreSQL / Snowflake	Complex aggregations, columnar storage
IoT sensor data	Cassandra / InfluxDB	High write volume, time-series pattern
Real-time leaderboards	Redis Sorted Sets	O(log N) rank queries

Performance Comparison

Metric	PostgreSQL	MongoDB	Cassandra	Redis
Read latency	1-10ms	1-10ms	1-5ms	0.1-1ms
Write throughput	1K-10K/s	10K-100K/s	10K-1M/s	100K-1M/s
Max data size	TB	PB	PB	GB (RAM)
Schema flexibility	Low	High	Medium	None
Complex queries	Excellent	Good	Limited	Key-based only
JOIN support	Full	Limited ($lookup)	None	None

Best Practices for Data Engineers

Practice	Rationale
Choose the right database for the workload	Don't use MongoDB for complex joins; don't use PostgreSQL for 1M writes/sec
Use connection pooling	Prevent exhausting DB connections in pipelines
Index columns used in WHERE/JOIN	Dramatically speeds up query performance
Partition large tables	Improve query performance and maintenance
Monitor slow queries	Use `pg_stat_statements` or MongoDB profiler
Test with production-like data volume	Performance characteristics change with scale

MathSummary Takeaways

Relational databases enforce ACID — use them when data consistency and complex joins are critical (e.g., financial transactions).
NoSQL databases optimize for specific access patterns — document stores for flexible schemas, key-value for caching, wide-column for time-series.
The CAP theorem forces trade-offs — you cannot simultaneously guarantee consistency, availability, and partition tolerance.
Normalization reduces redundancy — follow 1NF/2NF/3NF for transactional systems; denormalize intentionally for analytics.
Indexing is critical — a missing index can turn a 1ms query into a 10-second table scan.
Schema design affects performance — partitioning, clustering keys, and composite indexes must match your query patterns.
Connection pooling prevents outages — always use SQLAlchemy pool or pgbouncer in production.
Choose the tool for the job — PostgreSQL for analytics, Redis for caching, Cassandra for high-throughput writes.

Practice Exercises

Database design: Design a normalized schema (3NF) for an e-commerce platform with customers, orders, products, and inventory tables.
Index optimization: Given a slow query, analyze the execution plan and create appropriate indexes.
NoSQL modeling: Design a MongoDB document schema for a blog with posts, comments, and authors. Compare with a relational approach.
Cassandra partitioning: Design a Cassandra table for time-series IoT sensor data with partition key and clustering key decisions.
Database comparison: Benchmark the same dataset on PostgreSQL and MongoDB for three different query patterns. Document performance differences.

Database Fundamentals — Relational vs NoSQL

Database Fundamentals — Relational vs NoSQL

Overview

Relational Databases (RDBMS)

Core Concepts

ACID Properties

Normalization

Indexing Strategies

NoSQL Databases

Types of NoSQL Databases

Document Store (MongoDB)

Wide-Column Store (Cassandra)

Key-Value Store (Redis)

CAP Theorem

When to Use Which

Performance Comparison

Best Practices for Data Engineers

MathSummary Takeaways

See Also

Practice Exercises

Premium Content

Need Expert Data Engineering Help?