🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Database Fundamentals — Relational vs NoSQL

Data Engineering FoundationsData Engineering Fundamentals🟢 Free Lesson

Advertisement

Database Fundamentals — Relational vs NoSQL

Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Understanding database fundamentals is critical to building reliable, scalable data systems.

CAP Theorem TriangleConsistencyAll nodes seesame dataAvailabilityEvery requestgets responsePartitionToleranceNetwork splitCAPostgreSQLCPMongoDBAPCassandraPick 2 of 3

Overview

Relational Databases (RDBMS)

Core Concepts

Relational databases store data in tables with rows and columns, enforcing a predefined schema. Relationships between tables are expressed through foreign keys.

-- Create tables with relationships
CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    order_date DATE NOT NULL,
    amount DECIMAL(10,2) NOT NULL,
    status VARCHAR(50) DEFAULT 'pending'
);

-- Query with JOIN
SELECT c.first_name, c.last_name, COUNT(o.order_id) AS order_count,
       SUM(o.amount) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
HAVING SUM(o.amount) > 1000
ORDER BY total_spent DESC;

ACID Properties

PropertyDescriptionExample
AtomicityAll operations succeed or all failTransfer: debit and credit happen together or not at all
ConsistencyData always satisfies constraintsA foreign key always references an existing row
IsolationConcurrent transactions don't interfereTwo users can't buy the last item simultaneously
DurabilityCommitted data survives crashesPower failure doesn't lose confirmed transactions
ACID Properties ExplainedAtomicityAll or nothingBEGIN → UPDATE → COMMITIf any step failsall changes rollbackConsistencyRules always enforcedForeign keys validCHECK constraintsNOT NULL enforcedIsolationConcurrent safetyREAD COMMITTEDREPEATABLE READSERIALIZABLEDurabilitySurvives crashesWAL (Write-Ahead Log)fsync to diskReplication

Normalization

Normalization reduces data redundancy by splitting data into related tables.

Normal FormRuleExample Problem Solved
1NFEach column contains atomic valuesRemove comma-separated lists from a cell
2NFNo partial dependency on composite keyMove customer name out of orders table
3NFNo transitive dependencyMove product category name to its own table

Indexing Strategies

-- B-tree index (default, good for equality and range)
CREATE INDEX idx_orders_date ON orders(order_date);

-- Composite index (order matters!)
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);

-- Partial index (index only relevant rows)
CREATE INDEX idx_orders_pending ON orders(order_date)
WHERE status = 'pending';

-- GIN index for full-text search
CREATE INDEX idx_products_search ON products
USING gin(to_tsvector('english', name || ' ' || description));
Index TypeBest ForNot Good For
B-treeEquality, range queries, sortingFull-text search, geometric data
HashEquality lookups onlyRange queries, sorting
GINFull-text search, arrays, JSONBSimple equality
GiSTGeometric, full-text searchSimple lookups
BRINLarge, naturally ordered tablesRandom access patterns

NoSQL Databases

Types of NoSQL Databases

Document Store (MongoDB)

// MongoDB document (flexible schema)
db.orders.insertOne({
    order_id: "ORD-001",
    customer: {
        name: "John Doe",
        email: "john@example.com",
        address: {
            city: "New York",
            state: "NY"
        }
    },
    items: [
        { product: "Widget A", qty: 2, price: 29.99 },
        { product: "Widget B", qty: 1, price: 49.99 }
    ],
    total: 109.97,
    created_at: new Date()
});

// Aggregation pipeline
db.orders.aggregate([
    { $match: { "customer.address.state": "NY" } },
    { $unwind: "$items" },
    { $group: {
        _id: "$items.product",
        total_qty: { $sum: "$items.qty" },
        total_revenue: { $sum: { $multiply: ["$items.qty", "$items.price"] } }
    }},
    { $sort: { total_revenue: -1 } }
]);

Wide-Column Store (Cassandra)

-- Cassandra CQL (column-family model)
CREATE TABLE orders_by_customer (
    customer_id UUID,
    order_date TIMESTAMP,
    order_id UUID,
    amount DECIMAL,
    status TEXT,
    PRIMARY KEY ((customer_id), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC);

-- Query (must use partition key)
SELECT * FROM orders_by_customer
WHERE customer_id = ? AND order_date >= '2024-01-01';

Key-Value Store (Redis)

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Simple key-value operations
r.set('user:1001:name', 'John Doe')
r.get('user:1001:name')

# Cache query results
import json

def get_cached_orders(customer_id, db):
    cache_key = f"orders:{customer_id}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)
    
    orders = db.execute_query(
        "SELECT * FROM orders WHERE customer_id = %s",
        (customer_id,)
    )
    r.setex(cache_key, 3600, json.dumps(orders))  # TTL: 1 hour
    return orders

CAP Theorem

SystemConsistencyAvailabilityPartition ToleranceType
PostgreSQLYesYesNo (single node)CP
MongoDB (default)YesYesYesCP
CassandraTunableYesYesAP
DynamoDBTunableYesYesAP
Redis (standalone)YesYesNoCP
RiakNoYesYesAP

When to Use Which

Use CaseRecommendedWhy
Transactional ERP systemPostgreSQL / MySQLACID compliance, complex joins
User session storageRedisSub-millisecond reads, TTL support
Product catalog (varied attributes)MongoDBFlexible schema, nested documents
Time-series metricsCassandra / TimescaleDBHigh write throughput, time-based partitioning
Social network / recommendationsNeo4jGraph traversal queries
Analytics warehousePostgreSQL / SnowflakeComplex aggregations, columnar storage
IoT sensor dataCassandra / InfluxDBHigh write volume, time-series pattern
Real-time leaderboardsRedis Sorted SetsO(log N) rank queries

Performance Comparison

MetricPostgreSQLMongoDBCassandraRedis
Read latency1-10ms1-10ms1-5ms0.1-1ms
Write throughput1K-10K/s10K-100K/s10K-1M/s100K-1M/s
Max data sizeTBPBPBGB (RAM)
Schema flexibilityLowHighMediumNone
Complex queriesExcellentGoodLimitedKey-based only
JOIN supportFullLimited ($lookup)NoneNone

Best Practices for Data Engineers

PracticeRationale
Choose the right database for the workloadDon't use MongoDB for complex joins; don't use PostgreSQL for 1M writes/sec
Use connection poolingPrevent exhausting DB connections in pipelines
Index columns used in WHERE/JOINDramatically speeds up query performance
Partition large tablesImprove query performance and maintenance
Monitor slow queriesUse pg_stat_statements or MongoDB profiler
Test with production-like data volumePerformance characteristics change with scale

MathSummary Takeaways

  1. Relational databases enforce ACID — use them when data consistency and complex joins are critical (e.g., financial transactions).
  2. NoSQL databases optimize for specific access patterns — document stores for flexible schemas, key-value for caching, wide-column for time-series.
  3. The CAP theorem forces trade-offs — you cannot simultaneously guarantee consistency, availability, and partition tolerance.
  4. Normalization reduces redundancy — follow 1NF/2NF/3NF for transactional systems; denormalize intentionally for analytics.
  5. Indexing is critical — a missing index can turn a 1ms query into a 10-second table scan.
  6. Schema design affects performance — partitioning, clustering keys, and composite indexes must match your query patterns.
  7. Connection pooling prevents outages — always use SQLAlchemy pool or pgbouncer in production.
  8. Choose the tool for the job — PostgreSQL for analytics, Redis for caching, Cassandra for high-throughput writes.

See Also

Practice Exercises

  1. Database design: Design a normalized schema (3NF) for an e-commerce platform with customers, orders, products, and inventory tables.

  2. Index optimization: Given a slow query, analyze the execution plan and create appropriate indexes.

  3. NoSQL modeling: Design a MongoDB document schema for a blog with posts, comments, and authors. Compare with a relational approach.

  4. Cassandra partitioning: Design a Cassandra table for time-series IoT sensor data with partition key and clustering key decisions.

  5. Database comparison: Benchmark the same dataset on PostgreSQL and MongoDB for three different query patterns. Document performance differences.

Premium Content

Database Fundamentals — Relational vs NoSQL

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement