Data Systems
NoSQL Deep Dive
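NoSQL databases typically relax some guarantees of relational systems (fixed schemas, joins, and in many cases multi-record ACID transactions) in exchange for horizontal scalability and flexible data models. Master the four categories of NoSQL databases and their optimal use cases.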
- Scalability — Horizontal scaling across commodity servers
- Flexibility — Schema-less or schema-on-read data models
- Performance — Optimized for specific access patterns
NoSQL is not "No SQL"—it's "Not Only SQL." Choose the right tool for the job.
The Four Categories of NoSQL
Document Databases (MongoDB)
Definition: Document Database
A document database stores data as documents (typically JSON/BSON) with nested structures. Each document can have a different schema, allowing flexible data models. Documents are retrieved by their unique key and can be queried using field values, array elements, and nested document fields.
MongoDB Internals
| Component | Purpose |
|---|---|
| WiredTiger | Storage engine with document-level locking |
| B-tree indexes | Primary index structure |
| Replica Set | Primary-secondary replication for HA |
| Sharding | Horizontal scaling across clusters |
| Aggregation Pipeline | Server-side data processing |
Data Modeling Patterns
MongoDB Embedded vs Referenced Design
Embedded (denormalized):
    {
      "_id": "user123",
      "name": "Alice",
      "orders": [
        { "item": "laptop", "price": 999, "date": "2024-01-15" },
        { "item": "mouse", "price": 29, "date": "2024-01-16" }
      ]
    }
Pros: Single query, atomic updates, no JOINs
Cons: Document size limit (16 MB), data duplication
Referenced (normalized):
{ "_id": "user123", "name": "Alice", "order_ids": ["o1", "o2"] }
{ "_id": "o1", "item": "laptop", "price": 999, "user_id": "user123" }
Pros: No duplication, unlimited related data
Cons: Requires multiple queries or $lookup aggregation
Choose embedded design when related data is always accessed together and the total size is bounded. Choose referenced design when related data is accessed independently or grows unbounded.
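The trade-off can be sketched with plain Python dictionaries standing in for collections. This is an illustration only, not MongoDB driver code: the dictionary names and the two helper functions are hypothetical, and the second order document (`o2`) is assumed from the embedded example.

```python
# In-memory stand-ins for MongoDB collections (assumption: no real database involved).
users_embedded = {
    "user123": {
        "_id": "user123",
        "name": "Alice",
        "orders": [
            {"item": "laptop", "price": 999, "date": "2024-01-15"},
            {"item": "mouse", "price": 29, "date": "2024-01-16"},
        ],
    }
}

users_referenced = {
    "user123": {"_id": "user123", "name": "Alice", "order_ids": ["o1", "o2"]}
}
orders = {
    "o1": {"_id": "o1", "item": "laptop", "price": 999, "user_id": "user123"},
    # "o2" is assumed here; the referenced example above only shows "o1".
    "o2": {"_id": "o2", "item": "mouse", "price": 29, "user_id": "user123"},
}


def total_spent_embedded(user_id: str) -> int:
    """One lookup: the orders travel with the user document."""
    user = users_embedded[user_id]
    return sum(order["price"] for order in user["orders"])


def total_spent_referenced(user_id: str) -> int:
    """Two steps: fetch the user, then fetch every referenced order."""
    user = users_referenced[user_id]
    return sum(orders[order_id]["price"] for order_id in user["order_ids"])


print(total_spent_embedded("user123"))    # 1028
print(total_spent_referenced("user123"))  # 1028
```

Both functions return the same total, but the referenced version needs one extra lookup per order, which is the cost that `$lookup` or application-side joins pay in a real deployment.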
Key-Value Databases (Redis, DynamoDB)
Definition: Key-Value Store
A key-value store is the simplest NoSQL model: data is stored as key-value pairs, and access is primarily through key-based lookups. It provides O(1) average-case performance for reads and writes, making it ideal for caching, session storage, and high-throughput simple operations.
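To make the model concrete, here is a minimal sketch of a key-value store with per-key expiry, the behavior that caching and session storage rely on. The `TTLCache` class and its method names are hypothetical and only loosely inspired by the SET/GET-with-expiry style of stores such as Redis; it is single-process and in-memory.

```python
import time
from typing import Any, Optional


class TTLCache:
    """Hypothetical in-memory key-value store with optional per-key expiry."""

    def __init__(self) -> None:
        # key -> (value, expiry time on the monotonic clock, or None for no expiry)
        self._data: dict[str, tuple[Any, Optional[float]]] = {}

    def set(self, key: str, value: Any, ttl_seconds: Optional[float] = None) -> None:
        expires_at = None if ttl_seconds is None else time.monotonic() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key: str) -> Any:
        entry = self._data.get(key)  # average-case O(1) hash lookup
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # lazy expiry: drop the key when it is next read
            return None
        return value


cache = TTLCache()
cache.set("session:abc", {"user_id": "user123"}, ttl_seconds=0.05)
print(cache.get("session:abc"))  # {'user_id': 'user123'}
time.sleep(0.1)
print(cache.get("session:abc"))  # None (expired)
```

Expiry is checked lazily on read to keep the sketch short; production stores generally combine this with background eviction.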
Redis Data Structures
| Structure | Use Case | Example |
|---|---|---|
| String | Cache, counters | SET user:123 "Alice" |
| Hash | Object storage | HSET user:123 name "Alice" age 30 |
| List | Message queues | LPUSH queue "msg1" "msg2" |
| Set | Tags, unique items | SADD tags "python" "system-design" |
| Sorted Set | Leaderboards | ZADD leaderboard 100 "player1" |
| Stream | Event sourcing | XADD events * type "click" page "/home" |
Redis Latency
$$P_{99} < 1\,\text{ms}$$
Here,
- $P_{99}$ = 99th percentile latency
- $< 1\,\text{ms}$ = Sub-millisecond response time
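A 99th percentile latency is computed from measured samples, not from an average. The sketch below uses the nearest-rank method; the `percentile` helper is hypothetical and the latency values are made up for illustration, not Redis measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# 100 made-up request latencies in milliseconds: mostly fast, with one slow outlier.
latencies_ms = [0.2] * 98 + [0.9, 5.0]

p99 = percentile(latencies_ms, 99)
print(p99)        # 0.9
print(p99 < 1.0)  # True: sub-millisecond at P99 despite the 5.0 ms outlier
```

The single 5.0 ms request does not affect P99 here, which is why tail percentiles are usually reported alongside the maximum.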
DynamoDB Partitioning
Definition: DynamoDB Partitioning
DynamoDB automatically partitions data across multiple servers using the partition key. It hashes each item's partition key and uses the hash value to decide which partition stores the item. Load is spread evenly only when you choose partition keys that produce uniform access patterns. Hot partitions (uneven key distribution) are the primary performance concern.
A common DynamoDB anti-pattern is choosing a partition key with low cardinality (e.g., "status" with only a few values). This creates hot partitions where most traffic hits a single partition. Choose partition keys with high cardinality for uniform distribution.
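The effect of key cardinality can be simulated in a few lines. This is a sketch under stated assumptions: DynamoDB's real hash function and partition count are internal, so SHA-256, `NUM_PARTITIONS = 4`, and the helper names below are illustrative stand-ins.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 4  # illustrative; real partition counts are managed by the service


def partition_for(key: str) -> int:
    """Map a partition key to a partition by hashing it (SHA-256 is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def load_per_partition(keys) -> Counter:
    """Count how many requests land on each partition."""
    return Counter(partition_for(key) for key in keys)


# 9,000 requests keyed by a low-cardinality attribute (three status values).
statuses = ["pending", "shipped", "delivered"]
low_cardinality = [statuses[i % 3] for i in range(9000)]

# 9,000 requests keyed by a high-cardinality attribute (one key per user).
high_cardinality = [f"user{i}" for i in range(9000)]

print("status key:", dict(load_per_partition(low_cardinality)))
print("user key:  ", dict(load_per_partition(high_cardinality)))
```

With only three distinct keys, at most three partitions can ever receive traffic and each one takes at least a third of it. The per-user keys spread across all partitions in roughly equal shares.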
Column-Family Databases (Cassandra)
Definition: Column-Family Store
A column-family store (wide-column store) organizes data into column families (similar to tables), rows, and columns. Unlike relational tables, each row can have a different set of columns. This model is optimized for write-heavy workloads and time-series data, with excellent write throughput and horizontal scalability.
Cassandra Data Model
Cassandra Query Patterns
| Pattern | Description | Example |
|---|---|---|
| Partition lookup | Get all data for a partition key | WHERE user_id = 'abc' |
| Range within partition | Query by clustering key | WHERE user_id = 'abc' AND timestamp > '2024-01-01' |
| Time-series | Latest events for a user | ORDER BY timestamp DESC LIMIT 10 |
Cassandra does not support JOINs, offers only limited built-in aggregations, and restricts WHERE clauses largely to primary-key columns. You must model your tables to match your query patterns (query-first design). This is the opposite of relational modeling, where you normalize first and optimize queries later.
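Query-first design can be illustrated with a toy in-memory table that supports exactly the three patterns listed above and nothing else. The `EventsByUser` class is hypothetical: it treats `user_id` as the partition key and `timestamp` as the clustering key, and it is not Cassandra driver code.

```python
import bisect
from collections import defaultdict


class EventsByUser:
    """Toy table: partition key = user_id, clustering key = timestamp (ISO date strings)."""

    def __init__(self) -> None:
        self._timestamps = defaultdict(list)  # user_id -> timestamps, kept sorted
        self._events = defaultdict(list)      # user_id -> events, in the same order

    def insert(self, user_id: str, timestamp: str, event: str) -> None:
        # Rows inside a partition stay ordered by the clustering key.
        idx = bisect.bisect_right(self._timestamps[user_id], timestamp)
        self._timestamps[user_id].insert(idx, timestamp)
        self._events[user_id].insert(idx, event)

    def partition(self, user_id: str):
        """Partition lookup: WHERE user_id = ?"""
        return list(zip(self._timestamps.get(user_id, []), self._events.get(user_id, [])))

    def events_after(self, user_id: str, timestamp: str):
        """Range within partition: WHERE user_id = ? AND timestamp > ?"""
        rows = self.partition(user_id)
        start = bisect.bisect_right(self._timestamps.get(user_id, []), timestamp)
        return rows[start:]

    def latest(self, user_id: str, limit: int):
        """Time-series: ORDER BY timestamp DESC LIMIT ?"""
        return self.partition(user_id)[::-1][:limit]


table = EventsByUser()
table.insert("abc", "2024-01-03", "logout")
table.insert("abc", "2023-12-31", "signup")
table.insert("abc", "2024-01-02", "login")

print(table.events_after("abc", "2024-01-01"))
# [('2024-01-02', 'login'), ('2024-01-03', 'logout')]
print(table.latest("abc", 1))
# [('2024-01-03', 'logout')]
```

There is deliberately no method for "all logout events across users": answering that efficiently would need a second table keyed by event type, which is the essence of modeling one table per query.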
Graph Databases (Neo4j)
Definition: Graph Database
A graph database stores data as nodes (entities) and edges (relationships), with properties on both. It excels at queries that traverse relationships, such as finding shortest paths, detecting cycles, or recommending connections. Graph databases use index-free adjacency, meaning each node directly references its neighbors.
Graph Query Patterns
| Query Type | Description | Example |
|---|---|---|
| Path finding | Shortest path between nodes | Find 3-degree connections |
| Pattern matching | Find specific subgraphs | Users who bought X and Y |
| Centrality | Most connected nodes | Influencer detection |
| Community detection | Cluster related nodes | Social group identification |
Graph Traversal Complexity
$$T = O(d^k)$$
Here,
- $T$ = Traversal time
- $d$ = Average degree (connections per node)
- $k$ = Traversal depth
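As a rough worked example, assume an average degree of $d = 50$ and a traversal depth of $k = 3$ (illustrative numbers, not measurements). The number of nodes a traversal may touch at the deepest level grows as $d^k$:

$$d^k = 50^3 = 125{,}000$$

One more hop multiplies this by 50 again, to 6,250,000. Traversal cost therefore depends on depth and local connectivity, not on the total size of the graph.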
Social Network Query
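Find all users within 3 degrees of separation from user "Alice" (excluding Alice herself, who can otherwise be reached again through a cycle):
MATCH (alice:User {name: 'Alice'})-[:FRIEND*1..3]-(friend)
WHERE friend <> alice
RETURN DISTINCT friend.name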
In a relational database, this requires up to three self-joins on the friends table (or a recursive CTE), which becomes expensive at scale. In Neo4j, it is a traversal that follows direct node-to-node references (index-free adjacency).
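The traversal idea can be sketched as a depth-limited breadth-first search over an adjacency map, where each node holds direct references to its neighbors. This is a simplified illustration, not how Neo4j is implemented; the `within_degrees` function and the sample `friends` graph are hypothetical.

```python
from collections import deque


def within_degrees(graph, start, max_depth):
    """Return every node reachable from start in 1..max_depth hops, excluding start."""
    seen = {start}
    found = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in graph.get(node, ()):  # follow direct neighbor references
            if neighbor not in seen:
                seen.add(neighbor)
                found.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return found


# Undirected friendships stored as adjacency lists (sample data).
friends = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Alice", "Dave"],
    "Carol": ["Alice"],
    "Dave": ["Bob", "Erin"],
    "Erin": ["Dave", "Frank"],
    "Frank": ["Erin"],
}

print(sorted(within_degrees(friends, "Alice", 3)))
# ['Bob', 'Carol', 'Dave', 'Erin']
```

Frank is four hops from Alice and is therefore excluded. Each step costs time proportional to the neighbors visited, independent of how many users exist overall.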
NoSQL Comparison Matrix
| Criteria | Document | Key-Value | Column-Family | Graph |
|---|---|---|---|---|
| Data model | JSON docs | Key→Value | Wide columns | Nodes + edges |
| Query flexibility | High | Low (key only) | Moderate | High (traversals) |
| Write throughput | Good | Excellent | Excellent | Moderate |
| Read throughput | Good | Excellent | Good | Depends on query |
| Horizontal scaling | Good | Excellent | Excellent | Hard |
| Consistency | Configurable | Configurable | Tunable | Strong |
| Best use case | Content management | Caching, sessions | Time-series, IoT | Social networks |
Practice Exercises
- Data Modeling: Design the MongoDB schema for a blogging platform with users, posts, comments, and tags. Decide which fields to embed vs reference. Justify your choices.
- Key-Value Design: Using Redis, design a rate limiter that allows 100 requests per minute per user. What data structures would you use? How do you handle expiration?
- Column-Family Design: Design the Cassandra table schema for a time-series IoT sensor data system. What is the partition key? What is the clustering key?
- Graph Query: Given a social network graph, write the Cypher query to find "friends of friends who live in the same city and share at least 3 interests."
Key Takeaways:
- Document databases (MongoDB) excel at flexible, nested data with schema-on-read
- Key-value stores (Redis) provide O(1) lookups for caching and sessions
- Column-family stores (Cassandra) optimize for write-heavy time-series workloads
- Graph databases (Neo4j) use index-free adjacency for relationship queries
- Choose based on your primary access pattern, not on popularity
What to Learn Next
- SQL Deep Dive: PostgreSQL, MySQL, indexing strategies, and query optimization.
- MongoDB Deep Dive: Advanced MongoDB features, aggregation pipeline, and sharding.
- Redis Deep Dive: Redis data structures, persistence, clustering, and use cases.
- Cassandra Deep Dive: Cassandra architecture, data modeling, and operational patterns.
- Choosing the Right Database: Systematic framework for database selection.
- DynamoDB Deep Dive: DynamoDB internals, partitioning, and global tables.