MongoDB Deep Dive
MongoDB is a widely used document database. Master its flexible document model, powerful aggregation pipeline, indexing strategies, and horizontal scaling through sharding.
- Flexible Schema: Evolve your data model without mandatory schema migrations
- Aggregation Pipeline: Server-side data processing with composable stages
- Horizontal Scaling: Built-in sharding across the servers of a cluster
MongoDB's flexibility is its strength—and its danger. Use it wisely.
MongoDB Architecture
Definition: MongoDB
MongoDB is a document database that stores data as BSON (Binary JSON) documents with dynamic schemas. It provides a rich query language, secondary indexes, aggregation pipelines, and built-in sharding. The document model maps naturally to objects in application code, reducing the object-relational impedance mismatch found with relational databases.
Document Model
Definition: BSON Document
BSON (Binary JSON) is MongoDB's document format. It extends JSON with additional data types (Date, ObjectId, Binary, Decimal128) and is designed to be fast to encode, decode, and traverse. Documents are limited to 16 MB in size and are stored in collections (analogous to tables in SQL).
MongoDB Document
{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "name": "Alice Johnson",
  "email": "alice@example.com",
  "address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "state": "CA",
    "zip": "94105"
  },
  "orders": [
    {
      "order_id": ObjectId("..."),
      "item": "laptop",
      "price": 999.99,
      "date": ISODate("2024-01-15")
    }
  ],
  "created_at": ISODate("2023-06-01"),
  "tags": ["premium", "tech"]
}
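Embedded documents and arrays like the ones above are addressed with dot notation in queries. The following is a minimal sketch of that addressing scheme, not the server's query engine: `getPath` is a hypothetical helper, the `users` collection name is an assumption, and the `db` calls only run when the script is loaded in mongosh.

```javascript
// Toy resolver: walks a dot-separated path through embedded documents.
// It only illustrates the addressing scheme and ignores array traversal.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value == null ? undefined : value[key]),
    doc
  );
}

const user = {
  name: "Alice Johnson",
  address: { city: "San Francisco", state: "CA" },
  tags: ["premium", "tech"],
};

console.log(getPath(user, "address.city")); // San Francisco
console.log(getPath(user, "address.country")); // undefined

// Equivalent query filters (collection name `users` is assumed).
const byCity = { "address.city": "San Francisco" }; // field of an embedded document
const byTag = { tags: "premium" }; // matches when the array contains "premium"

// Only runs inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  printjson(db.users.find(byCity).toArray());
  printjson(db.users.find(byTag).toArray());
}
```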
Aggregation Pipeline
Definition: Aggregation Pipeline
The aggregation pipeline processes documents through an ordered series of stages. Each stage transforms its input documents and passes the result to the next stage. It covers much of what SQL expresses with WHERE, GROUP BY, JOIN, and window functions, with each step written as a separate, composable stage.
| Stage | Purpose | SQL Equivalent |
|---|---|---|
| $match | Filter documents | WHERE |
| $group | Group by field(s) | GROUP BY |
| $project | Select/reshape fields | SELECT |
| $sort | Sort results | ORDER BY |
| $limit | Limit results | LIMIT |
| $lookup | Join with another collection | JOIN |
| $unwind | Output one document per array element | UNNEST / LATERAL VIEW explode |
| $bucket | Group into ranges | GROUP BY on a CASE expression |
| $facet | Run several sub-pipelines on the same input | Multiple subqueries |
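The stage-by-stage flow can be imitated in plain JavaScript to make the composition model concrete. This is a toy in-memory sketch rather than how the server executes a pipeline; the helper names (`match`, `groupSum`, `sortDesc`, `limit`, `runPipeline`) and the sample orders are made up for illustration.

```javascript
// Each "stage" is a function from an array of documents to a new array.
const match = (predicate) => (docs) => docs.filter(predicate);

const groupSum = (keyField, sumField) => (docs) => {
  const totals = new Map();
  for (const doc of docs) {
    totals.set(doc[keyField], (totals.get(doc[keyField]) || 0) + doc[sumField]);
  }
  return [...totals].map(([key, total]) => ({ _id: key, total }));
};

const sortDesc = (field) => (docs) => [...docs].sort((a, b) => b[field] - a[field]);
const limit = (n) => (docs) => docs.slice(0, n);

// Run stages in order, feeding each stage's output to the next one.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

// Made-up sample data.
const orders = [
  { category: "laptops", amount: 1000, status: "paid" },
  { category: "phones", amount: 600, status: "paid" },
  { category: "laptops", amount: 500, status: "refunded" },
  { category: "phones", amount: 700, status: "paid" },
];

const result = runPipeline(orders, [
  match((o) => o.status === "paid"), // like $match
  groupSum("category", "amount"),    // like $group with $sum
  sortDesc("total"),                 // like $sort
  limit(1),                          // like $limit
]);
console.log(result); // [ { _id: 'phones', total: 1300 } ]
```

Filtering first, as here, mirrors the usual advice to place $match early so later stages handle fewer documents.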
Aggregation Pipeline for E-commerce Analytics
db.orders.aggregate([
  // Filter orders from the last 30 days
  { $match: {
    created_at: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) }
  }},
  // Join with products collection
  { $lookup: {
    from: "products",
    localField: "product_id",
    foreignField: "_id",
    as: "product"
  }},
  // Unwind the product array (orders with no matching product are dropped)
  { $unwind: "$product" },
  // Group by category, calculate metrics
  { $group: {
    _id: "$product.category",
    total_orders: { $sum: 1 },
    total_revenue: { $sum: "$amount" },
    avg_order_value: { $avg: "$amount" },
    unique_customers: { $addToSet: "$customer_id" }
  }},
  // Sort by revenue descending
  { $sort: { total_revenue: -1 } },
  // Reshape the output and replace the customer set with its size
  { $project: {
    _id: 0,
    category: "$_id",
    total_orders: 1,
    total_revenue: 1,
    avg_order_value: { $round: ["$avg_order_value", 2] },
    customer_count: { $size: "$unique_customers" }
  }}
]);
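The stage table also lists $bucket and $facet, which the pipeline above does not use. A sketch of how they might be combined follows; the field names reuse those from the example above, while the function name `buildOrderSummaryPipeline` and the bucket boundaries are arbitrary illustration choices. The `db` call only runs in mongosh.

```javascript
// Builds a pipeline that computes two summaries over the same recent orders.
function buildOrderSummaryPipeline(since) {
  return [
    { $match: { created_at: { $gte: since } } },
    { $facet: {
      // Sub-pipeline 1: histogram of order amounts
      by_amount: [
        { $bucket: {
          groupBy: "$amount",
          boundaries: [0, 50, 200, 1000], // illustrative ranges
          default: "1000+",               // bucket for values outside the ranges
          output: { orders: { $sum: 1 } },
        } },
      ],
      // Sub-pipeline 2: five highest-spending customers
      top_customers: [
        { $group: { _id: "$customer_id", spent: { $sum: "$amount" } } },
        { $sort: { spent: -1 } },
        { $limit: 5 },
      ],
    } },
  ];
}

const since = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
const pipeline = buildOrderSummaryPipeline(since);

// Only runs inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  printjson(db.orders.aggregate(pipeline).toArray());
}
```

$facet returns a single document with one array per sub-pipeline, so that document is still subject to the 16 MB limit.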
Indexing Strategies
Definition: MongoDB Indexes
MongoDB indexes are B-tree data structures that improve query performance. Without a usable index, MongoDB performs a collection scan (it reads every document). Indexes enable efficient equality and range queries, sorting, and text search. MongoDB supports single-field, compound, multikey, geospatial, text, hashed, and TTL indexes.
| Index Type | Use Case | Example |
|---|---|---|
| Single field | Simple equality/range | db.users.createIndex({email: 1}) |
| Compound | Multi-field queries | db.users.createIndex({city: 1, age: -1}) |
| Multikey | Array field queries | db.users.createIndex({tags: 1}) |
| Text | Full-text search | db.articles.createIndex({content: "text"}) |
| Hashed | Equality only (sharding) | db.users.createIndex({email: "hashed"}) |
| TTL | Auto-expire documents | db.sessions.createIndex({created_at: 1}, {expireAfterSeconds: 3600}) |
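One way to check whether a query uses an index is to inspect the winning plan returned by `explain()` for COLLSCAN versus IXSCAN stages. The helper below is a sketch under assumptions: `collectStages` and `usesCollectionScan` are hypothetical names, the sample plans are hand-written rather than real output, and plan shapes differ across server versions and on sharded clusters.

```javascript
// Collects stage names from a winning plan by following
// inputStage / inputStages links. Newer servers may nest the tree
// under `queryPlan`, which is handled on a best-effort basis.
function collectStages(plan) {
  if (!plan) return [];
  const root = plan.queryPlan || plan;
  const children = [root.inputStage, ...(root.inputStages || [])].filter(Boolean);
  return [root.stage, ...children.flatMap(collectStages)];
}

const usesCollectionScan = (plan) => collectStages(plan).includes("COLLSCAN");

// Hand-written sample plans shaped like explain output (not real output).
const indexedPlan = {
  stage: "FETCH",
  inputStage: { stage: "IXSCAN", indexName: "email_1" },
};
const scanPlan = { stage: "COLLSCAN" };

console.log(collectStages(indexedPlan)); // [ 'FETCH', 'IXSCAN' ]
console.log(usesCollectionScan(scanPlan)); // true

// Only runs inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  const explained = db.users.find({ email: "alice@example.com" }).explain("queryPlanner");
  print(usesCollectionScan(explained.queryPlanner.winningPlan));
}
```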
A common anti-pattern is creating too many indexes. Each index slows down writes, because every insert, delete, or update that touches an indexed field must also update the affected indexes, and each index consumes memory and disk. Profile your queries and create indexes only for the most frequent and slowest ones.
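To find candidates for removal, the `$indexStats` aggregation stage reports how often each index has been used. A minimal sketch, assuming a `users` collection; `findUnusedIndexes` is a hypothetical helper and the sample statistics are hand-written. The counters reset when the server restarts and are tracked per node, so a zero count is a hint, not proof.

```javascript
// Returns names of indexes with zero recorded accesses, skipping the
// mandatory _id index. Input is the array produced by $indexStats.
function findUnusedIndexes(indexStats) {
  return indexStats
    .filter((s) => s.name !== "_id_" && Number(s.accesses.ops) === 0)
    .map((s) => s.name);
}

// Hand-written sample in the shape of $indexStats output (not real data).
const sampleStats = [
  { name: "_id_", accesses: { ops: 0 } },
  { name: "email_1", accesses: { ops: 1520 } },
  { name: "city_1_age_-1", accesses: { ops: 0 } },
];
console.log(findUnusedIndexes(sampleStats)); // [ 'city_1_age_-1' ]

// Only runs inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  const stats = db.users.aggregate([{ $indexStats: {} }]).toArray();
  printjson(findUnusedIndexes(stats));
}
```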
Sharding
Definition: MongoDB Sharding
MongoDB sharding distributes data across multiple servers (shards). Each shard holds a subset of the data, determined by the shard key. The mongos router directs queries to the appropriate shard(s). A good shard key spreads data and writes evenly and lets most queries target a single shard instead of all of them.
| Shard Key Strategy | Description | Best For |
|---|---|---|
| Hashed | Hash of shard key | Even distribution, equality queries |
| Ranged | Range-based distribution | Range queries, time-series |
| Zone | Geographic distribution | Data residency requirements |
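The difference between hashed and ranged distribution shows up most clearly with monotonically increasing keys. The simulation below is a rough sketch under simplifying assumptions: three shards with fixed ranges, no balancer or chunk splitting, and a stand-in hash function (`mix32`, the MurmurHash3 finalizer) rather than the hash MongoDB actually uses.

```javascript
const SHARDS = 3;

// Stand-in 32-bit integer mixer. Only its scattering effect matters here.
function mix32(n) {
  let h = n >>> 0;
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// Ranged: shard 0 owns keys < 1000, shard 1 owns 1000..1999, shard 2 the rest.
const rangedShard = (key) => (key < 1000 ? 0 : key < 2000 ? 1 : 2);
// Hashed: the shard is chosen from the hash of the key, not the key itself.
const hashedShard = (key) => mix32(key) % SHARDS;

function countWrites(shardFor, keys) {
  const counts = new Array(SHARDS).fill(0);
  for (const key of keys) counts[shardFor(key)] += 1;
  return counts;
}

// 1000 new, monotonically increasing keys (like counters or timestamps).
const newKeys = Array.from({ length: 1000 }, (_, i) => 5000 + i);

const rangedCounts = countWrites(rangedShard, newKeys);
const hashedCounts = countWrites(hashedShard, newKeys);
console.log("ranged:", rangedCounts); // every write lands on the last shard
console.log("hashed:", hashedCounts); // writes are spread across the shards
```

The trade-off runs the other way for reads: a range query over consecutive keys touches one shard under ranged sharding but may touch all of them under hashed sharding.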
Shard Key Selection
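Bad shard key: _id (ObjectId) with ranged sharding. ObjectId values increase monotonically, so all new writes land on the shard that owns the highest range.
Good shard key: { user_id: "hashed" } gives even write distribution across shards, at the cost of scattering range queries on user_id.
Good shard key: { user_id: 1, created_at: 1 } keeps each user's documents together and supports time-range queries per user. Leading with created_at instead would reintroduce the monotonic hot spot.
Choose the shard key carefully, because changing it later is costly. Before MongoDB 4.4 the key was immutable; 4.4 added refineCollectionShardKey to append suffix fields, and 5.0 added reshardCollection to change the key entirely.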
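For reference, sharding a collection is done with the `shardCollection` admin command (or the `sh.shardCollection` helper). The sketch below only builds the command documents; the `app.events` namespace and the field names are hypothetical, and the server calls run only in mongosh connected to a mongos router.

```javascript
// Hypothetical namespace: database "app", collection "events".
const namespace = "app.events";

// Hashed key: spreads monotonically increasing values across shards.
const hashedKeyCmd = { shardCollection: namespace, key: { user_id: "hashed" } };

// Compound ranged key that leads with user_id: groups each user's
// documents and keeps them ordered by time within the user.
const compoundKeyCmd = {
  shardCollection: namespace,
  key: { user_id: 1, created_at: 1 },
};

console.log(JSON.stringify(hashedKeyCmd));
console.log(JSON.stringify(compoundKeyCmd));

// Only runs inside mongosh against a sharded cluster; skipped elsewhere.
if (typeof db !== "undefined" && typeof sh !== "undefined") {
  sh.enableSharding("app");
  printjson(db.adminCommand(hashedKeyCmd)); // a collection gets exactly one shard key
}
```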
Replica Sets
Definition: Replica Set
A replica set is a group of MongoDB nodes that maintain the same data set. One node is elected as primary; all writes go to the primary. Secondaries replicate the primary's data asynchronously and can serve read queries, though those reads may be slightly stale. If the primary fails, the remaining members automatically elect a new primary.
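Elections need a majority of voting members, which is why replica sets are usually deployed with an odd number of them. The sketch below works through that arithmetic and then shows a majority write concern and a secondary read preference; the `orders` collection and its fields are assumptions, and the `db` calls only run in mongosh.

```javascript
// Votes needed to elect a primary, and how many voting members can be
// down while an election can still succeed.
const majority = (votingMembers) => Math.floor(votingMembers / 2) + 1;
const faultTolerance = (votingMembers) => votingMembers - majority(votingMembers);

for (const n of [3, 4, 5]) {
  console.log(`${n} members: majority ${majority(n)}, tolerates ${faultTolerance(n)} down`);
}
// 3 members: majority 2, tolerates 1 down
// 4 members: majority 3, tolerates 1 down
// 5 members: majority 3, tolerates 2 down

// Wait until a majority of members has acknowledged the write.
const writeOptions = { writeConcern: { w: "majority", wtimeout: 5000 } };

// Only runs inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  db.orders.insertOne({ item: "laptop", amount: 999.99 }, writeOptions);
  // Prefer a secondary for this read; the result may lag behind the primary.
  printjson(db.orders.find({ item: "laptop" }).readPref("secondaryPreferred").toArray());
}
```

A fourth member adds cost without adding fault tolerance, as the output shows.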
Common Anti-Patterns
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Unbounded arrays | Document growth, poor performance | Limit array size, normalize |
| Massive documents | Slow queries, memory pressure | Keep documents small |
| Missing indexes | Collection scans | Add appropriate indexes |
| Wrong shard key | Hot partitions, scatter-gather | Choose a high-cardinality key that common queries include |
| Over-normalization | Excessive $lookup | Denormalize for read patterns |
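For the unbounded-array case, one common mitigation is to cap the array at write time with `$push` combined with `$each` and `$slice`. A minimal sketch follows; the `recent_orders` field, the `users` collection, and the cap of 3 are illustrative assumptions, and `pushBounded` merely imitates the update's effect in memory.

```javascript
const MAX_RECENT = 3; // small cap so the demo output is easy to read

// In-memory imitation of $push with $each and a negative $slice:
// append, then keep only the last `max` elements.
const pushBounded = (items, item, max) => [...items, item].slice(-max);

let recent = [];
for (const id of ["o1", "o2", "o3", "o4", "o5"]) {
  recent = pushBounded(recent, id, MAX_RECENT);
}
console.log(recent); // [ 'o3', 'o4', 'o5' ]

// The corresponding update document.
const boundedPush = (order) => ({
  $push: { recent_orders: { $each: [order], $slice: -MAX_RECENT } },
});

// Only runs inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  db.users.updateOne({ email: "alice@example.com" }, boundedPush({ item: "laptop" }));
}
```

Older entries that still matter would live in their own collection and reference the parent document.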
Practice Exercises
- Document Design: Design the MongoDB document schema for a blogging platform with users, posts, comments, tags, and categories. Decide what to embed vs reference.
- Aggregation Pipeline: Write an aggregation pipeline to find the top 5 most popular products in the last 7 days, including average rating and total sales.
- Sharding Design: Design the sharding strategy for a social media app with 100M users. What shard key would you choose? How do you handle hot users?
- Index Optimization: Given a MongoDB collection with 50M documents and these query patterns, design the optimal indexes:
  - Find users by email (unique)
  - Find users by city and age range
  - Search users by name (partial match)
  - Get recent users by creation date
Key Takeaways:
- MongoDB stores flexible BSON documents with dynamic schemas
- The aggregation pipeline provides composable server-side data processing
- Choose indexes carefully, because each one adds write overhead
- Shard keys are costly to change later; choose high-cardinality keys for even distribution
- Replica sets provide automatic failover and read scaling
- Keep documents small and avoid unbounded arrays
What to Learn Next
-> NoSQL Deep Dive: Document, key-value, column-family, and graph databases overview.
-> SQL Deep Dive: PostgreSQL, MySQL, indexing strategies, and query optimization.
-> PostgreSQL Deep Dive: Advanced PostgreSQL features, extensions, and optimization.
-> Redis Deep Dive: Redis data structures, persistence, clustering, and use cases.
-> Data Partitioning: Sharding strategies, consistent hashing, and partition keys.
-> Choosing the Right Database: Systematic framework for database selection.