Database Fundamentals — Relational vs NoSQL
Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Understanding database fundamentals is critical to building reliable, scalable data systems.
Overview
Relational Databases (RDBMS)
Core Concepts
Relational databases store data in tables with rows and columns, enforcing a predefined schema. Relationships between tables are expressed through foreign keys.
-- Create tables with relationships
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
order_date DATE NOT NULL,
amount DECIMAL(10,2) NOT NULL,
status VARCHAR(50) DEFAULT 'pending'
);
-- Query with JOIN
SELECT c.first_name, c.last_name, COUNT(o.order_id) AS order_count,
SUM(o.amount) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
HAVING SUM(o.amount) > 1000
ORDER BY total_spent DESC;
ACID Properties
| Property | Description | Example |
|---|---|---|
| Atomicity | All operations succeed or all fail | Transfer: debit and credit happen together or not at all |
| Consistency | Data always satisfies constraints | A foreign key always references an existing row |
| Isolation | Concurrent transactions don't interfere | Two users can't buy the last item simultaneously |
| Durability | Committed data survives crashes | Power failure doesn't lose confirmed transactions |
Normalization
Normalization reduces data redundancy by splitting data into related tables.
| Normal Form | Rule | Example Problem Solved |
|---|---|---|
| 1NF | Each column contains atomic values | Remove comma-separated lists from a cell |
| 2NF | No partial dependency on composite key | Move customer name out of orders table |
| 3NF | No transitive dependency | Move product category name to its own table |
Indexing Strategies
-- B-tree index (default, good for equality and range)
CREATE INDEX idx_orders_date ON orders(order_date);
-- Composite index (order matters!)
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
-- Partial index (index only relevant rows)
CREATE INDEX idx_orders_pending ON orders(order_date)
WHERE status = 'pending';
-- GIN index for full-text search
CREATE INDEX idx_products_search ON products
USING gin(to_tsvector('english', name || ' ' || description));
| Index Type | Best For | Not Good For |
|---|---|---|
| B-tree | Equality, range queries, sorting | Full-text search, geometric data |
| Hash | Equality lookups only | Range queries, sorting |
| GIN | Full-text search, arrays, JSONB | Simple equality |
| GiST | Geometric, full-text search | Simple lookups |
| BRIN | Large, naturally ordered tables | Random access patterns |
NoSQL Databases
Types of NoSQL Databases
Document Store (MongoDB)
// MongoDB document (flexible schema)
db.orders.insertOne({
order_id: "ORD-001",
customer: {
name: "John Doe",
email: "john@example.com",
address: {
city: "New York",
state: "NY"
}
},
items: [
{ product: "Widget A", qty: 2, price: 29.99 },
{ product: "Widget B", qty: 1, price: 49.99 }
],
total: 109.97,
created_at: new Date()
});
// Aggregation pipeline
db.orders.aggregate([
{ $match: { "customer.address.state": "NY" } },
{ $unwind: "$items" },
{ $group: {
_id: "$items.product",
total_qty: { $sum: "$items.qty" },
total_revenue: { $sum: { $multiply: ["$items.qty", "$items.price"] } }
}},
{ $sort: { total_revenue: -1 } }
]);
Wide-Column Store (Cassandra)
-- Cassandra CQL (column-family model)
CREATE TABLE orders_by_customer (
customer_id UUID,
order_date TIMESTAMP,
order_id UUID,
amount DECIMAL,
status TEXT,
PRIMARY KEY ((customer_id), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC);
-- Query (must use partition key)
SELECT * FROM orders_by_customer
WHERE customer_id = ? AND order_date >= '2024-01-01';
Key-Value Store (Redis)
import redis
r = redis.Redis(host='localhost', port=6379, db=0)
# Simple key-value operations
r.set('user:1001:name', 'John Doe')
r.get('user:1001:name')
# Cache query results
import json
def get_cached_orders(customer_id, db):
cache_key = f"orders:{customer_id}"
cached = r.get(cache_key)
if cached:
return json.loads(cached)
orders = db.execute_query(
"SELECT * FROM orders WHERE customer_id = %s",
(customer_id,)
)
r.setex(cache_key, 3600, json.dumps(orders)) # TTL: 1 hour
return orders
CAP Theorem
| System | Consistency | Availability | Partition Tolerance | Type |
|---|---|---|---|---|
| PostgreSQL | Yes | Yes | No (single node) | CP |
| MongoDB (default) | Yes | Yes | Yes | CP |
| Cassandra | Tunable | Yes | Yes | AP |
| DynamoDB | Tunable | Yes | Yes | AP |
| Redis (standalone) | Yes | Yes | No | CP |
| Riak | No | Yes | Yes | AP |
When to Use Which
| Use Case | Recommended | Why |
|---|---|---|
| Transactional ERP system | PostgreSQL / MySQL | ACID compliance, complex joins |
| User session storage | Redis | Sub-millisecond reads, TTL support |
| Product catalog (varied attributes) | MongoDB | Flexible schema, nested documents |
| Time-series metrics | Cassandra / TimescaleDB | High write throughput, time-based partitioning |
| Social network / recommendations | Neo4j | Graph traversal queries |
| Analytics warehouse | PostgreSQL / Snowflake | Complex aggregations, columnar storage |
| IoT sensor data | Cassandra / InfluxDB | High write volume, time-series pattern |
| Real-time leaderboards | Redis Sorted Sets | O(log N) rank queries |
Performance Comparison
| Metric | PostgreSQL | MongoDB | Cassandra | Redis |
|---|---|---|---|---|
| Read latency | 1-10ms | 1-10ms | 1-5ms | 0.1-1ms |
| Write throughput | 1K-10K/s | 10K-100K/s | 10K-1M/s | 100K-1M/s |
| Max data size | TB | PB | PB | GB (RAM) |
| Schema flexibility | Low | High | Medium | None |
| Complex queries | Excellent | Good | Limited | Key-based only |
| JOIN support | Full | Limited ($lookup) | None | None |
Best Practices for Data Engineers
| Practice | Rationale |
|---|---|
| Choose the right database for the workload | Don't use MongoDB for complex joins; don't use PostgreSQL for 1M writes/sec |
| Use connection pooling | Prevent exhausting DB connections in pipelines |
| Index columns used in WHERE/JOIN | Dramatically speeds up query performance |
| Partition large tables | Improve query performance and maintenance |
| Monitor slow queries | Use pg_stat_statements or MongoDB profiler |
| Test with production-like data volume | Performance characteristics change with scale |
MathSummary Takeaways
- Relational databases enforce ACID — use them when data consistency and complex joins are critical (e.g., financial transactions).
- NoSQL databases optimize for specific access patterns — document stores for flexible schemas, key-value for caching, wide-column for time-series.
- The CAP theorem forces trade-offs — you cannot simultaneously guarantee consistency, availability, and partition tolerance.
- Normalization reduces redundancy — follow 1NF/2NF/3NF for transactional systems; denormalize intentionally for analytics.
- Indexing is critical — a missing index can turn a 1ms query into a 10-second table scan.
- Schema design affects performance — partitioning, clustering keys, and composite indexes must match your query patterns.
- Connection pooling prevents outages — always use SQLAlchemy pool or pgbouncer in production.
- Choose the tool for the job — PostgreSQL for analytics, Redis for caching, Cassandra for high-throughput writes.
See Also
- What is Data Engineering — Introduction to data engineering
- SQL Fundamentals — Essential SQL skills
- Advanced SQL — Advanced SQL techniques
- Data Modeling Basics — ERD, normalization, dimensional modeling
- Docker for Data Engineers — Containerizing data pipelines
Practice Exercises
-
Database design: Design a normalized schema (3NF) for an e-commerce platform with customers, orders, products, and inventory tables.
-
Index optimization: Given a slow query, analyze the execution plan and create appropriate indexes.
-
NoSQL modeling: Design a MongoDB document schema for a blog with posts, comments, and authors. Compare with a relational approach.
-
Cassandra partitioning: Design a Cassandra table for time-series IoT sensor data with partition key and clustering key decisions.
-
Database comparison: Benchmark the same dataset on PostgreSQL and MongoDB for three different query patterns. Document performance differences.