MongoDB Deep Dive

MongoDB is a widely used document database. This guide covers its flexible document model, aggregation pipeline, indexing strategies, and horizontal scaling through sharding.

Flexible Schema — Evolve your data model without migrations
Aggregation Pipeline — Server-side data processing with composable stages
Horizontal Scaling — Automatic sharding across clusters

MongoDB's flexibility is its strength—and its danger. Use it wisely.

MongoDB Architecture

Definition: MongoDB

MongoDB is a document database that stores data as BSON (Binary JSON) documents with dynamic schemas. It provides a rich query language, secondary indexes, aggregation pipelines, and built-in sharding. The document model maps naturally to objects in application code, which reduces the object-relational impedance mismatch common with relational databases.

Document Model

Definition: BSON Document

BSON (Binary JSON) is MongoDB's binary document format. It extends JSON with additional data types (Date, ObjectId, Binary, Decimal128) and is designed to be fast to encode, decode, and traverse. A single document is limited to 16 MB, and documents are stored in collections (analogous to tables in SQL).

MongoDB Document

{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "name": "Alice Johnson",
  "email": "alice@example.com",
  "address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "state": "CA",
    "zip": "94105"
  },
  "orders": [
    {
      "order_id": ObjectId("..."),
      "item": "laptop",
      "price": 999.99,
      "date": ISODate("2024-01-15")
    }
  ],
  "created_at": ISODate("2023-06-01"),
  "tags": ["premium", "tech"]
}
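
Because related data is embedded in one document, a nested field such as the city inside `address` is addressed by its path. The sketch below emulates that lookup in plain JavaScript so it can run without a database. `getPath` is a hypothetical helper, not a MongoDB API, and it deliberately handles embedded documents only, not arrays.

```javascript
// Plain-JavaScript emulation of addressing an embedded field by a dotted path.
// getPath is a hypothetical helper, not part of MongoDB or its drivers.
function getPath(doc, path) {
  // Walk one key at a time; yield undefined as soon as a level is missing.
  return path.split(".").reduce(
    (value, key) => (value != null && typeof value === "object" ? value[key] : undefined),
    doc
  );
}

const user = {
  name: "Alice Johnson",
  address: { city: "San Francisco", state: "CA" },
  tags: ["premium", "tech"],
};

const city = getPath(user, "address.city");       // "San Francisco"
const missing = getPath(user, "address.country"); // undefined
console.log(city, missing);
```

In mongosh, the same path would appear as a dot-notation filter, for example `db.users.find({ "address.city": "San Francisco" })`, assuming the document lives in a collection named `users`.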

Aggregation Pipeline

Definition: Aggregation Pipeline

The aggregation pipeline processes documents through a series of stages. Each stage transforms the documents and passes the result to the next stage. It is MongoDB's counterpart to SQL's GROUP BY, JOIN, and window functions, expressed as an ordered list of stages that can be composed and reused.

Stage	Purpose	SQL Equivalent
$match	Filter documents	WHERE
$group	Group by field(s)	GROUP BY
$project	Select/reshape fields	SELECT
$sort	Sort results	ORDER BY
$limit	Limit results	LIMIT
$lookup	Join with another collection	JOIN
$unwind	Deconstruct arrays	UNNEST / LATERAL JOIN
$bucket	Group into ranges	GROUP BY over a CASE expression
$facet	Multiple pipelines in parallel	Subqueries
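
The stage-by-stage flow described above can be emulated in plain JavaScript, which makes the composition visible without a running server. This is only a sketch: `match`, `groupSum`, `sortDesc`, `limit`, and `runPipeline` are hypothetical helpers that mimic a small subset of what the real stages do, and the sample orders are made up.

```javascript
// Emulates how stages compose: each stage is a function from an array of
// documents to a new array. These helpers are illustrative, not MongoDB APIs.
const match = (predicate) => (docs) => docs.filter(predicate);

const groupSum = (keyField, sumField) => (docs) => {
  const totals = new Map();
  for (const doc of docs) {
    totals.set(doc[keyField], (totals.get(doc[keyField]) ?? 0) + doc[sumField]);
  }
  return [...totals].map(([_id, total]) => ({ _id, total }));
};

const sortDesc = (field) => (docs) => [...docs].sort((a, b) => b[field] - a[field]);

const limit = (n) => (docs) => docs.slice(0, n);

// Run the stages in order, feeding each stage's output to the next.
const runPipeline = (docs, stages) => stages.reduce((current, stage) => stage(current), docs);

const orders = [
  { category: "laptops", amount: 1000, status: "paid" },
  { category: "phones", amount: 600, status: "paid" },
  { category: "laptops", amount: 500, status: "paid" },
  { category: "phones", amount: 900, status: "refunded" },
];

const topCategories = runPipeline(orders, [
  match((o) => o.status === "paid"), // like $match
  groupSum("category", "amount"),    // like $group with $sum
  sortDesc("total"),                 // like $sort
  limit(1),                          // like $limit
]);

console.log(topCategories); // [ { _id: 'laptops', total: 1500 } ]
```

Placing the filter first means the later stages see fewer documents, which is the same reason `$match` is usually put as early as possible in a real pipeline.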

Aggregation Pipeline for E-commerce Analytics

db.orders.aggregate([
  // Filter orders from the last 30 days
  { $match: {
    created_at: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) }
  }},
  
  // Join with products collection
  { $lookup: {
    from: "products",
    localField: "product_id",
    foreignField: "_id",
    as: "product"
  }},
  
  // Unwind the product array
  { $unwind: "$product" },
  
  // Group by category, calculate metrics
  { $group: {
    _id: "$product.category",
    total_orders: { $sum: 1 },
    total_revenue: { $sum: "$amount" },
    avg_order_value: { $avg: "$amount" },
    unique_customers: { $addToSet: "$customer_id" }
  }},
  
  // Sort by revenue descending
  { $sort: { total_revenue: -1 } },
  
  // Add customer count field
  { $project: {
    category: "$_id",
    total_orders: 1,
    total_revenue: 1,
    avg_order_value: { $round: ["$avg_order_value", 2] },
    customer_count: { $size: "$unique_customers" }
  }}
]);

Indexing Strategies

Definition: MongoDB Indexes

MongoDB indexes are B-tree data structures that improve query performance. Without indexes, MongoDB performs a collection scan (reads every document). Indexes enable efficient range queries, sorting, and text search. MongoDB supports compound, multikey, geospatial, text, and hashed indexes.
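
To make the difference between a collection scan and an index lookup concrete, the sketch below counts how many entries each approach examines when searching 1,000 documents. It is a toy model under stated assumptions: a sorted array with binary search stands in for the B-tree, and `scanFind` and `indexFind` are hypothetical names.

```javascript
// Compares a full scan with a lookup through a sorted index, counting how many
// entries each approach examines. This is a toy model, not MongoDB's B-tree.
const users = Array.from({ length: 1000 }, (_, i) => ({ _id: i, email: `user${i}@example.com` }));

// Collection scan: examine documents one by one until a match is found.
function scanFind(docs, email) {
  let examined = 0;
  for (const doc of docs) {
    examined++;
    if (doc.email === email) return { doc, examined };
  }
  return { doc: null, examined };
}

// "Index": (key, position) pairs kept sorted by key, searched by bisection.
const emailIndex = users
  .map((doc, position) => ({ key: doc.email, position }))
  .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));

function indexFind(docs, index, email) {
  let examined = 0;
  let lo = 0;
  let hi = index.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    examined++;
    if (index[mid].key === email) return { doc: docs[index[mid].position], examined };
    if (index[mid].key < email) lo = mid + 1;
    else hi = mid - 1;
  }
  return { doc: null, examined };
}

const target = "user999@example.com";
const viaScan = scanFind(users, target);
const viaIndex = indexFind(users, emailIndex, target);
console.log(viaScan.examined, viaIndex.examined);
```

The scan examines every document before reaching the last one, while the sorted structure needs at most about log2(1000), roughly ten, comparisons. In mongosh, `explain("executionStats")` on a query reports the real counterparts of these counters.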

Index Type	Use Case	Example
Single field	Simple equality/range	`db.users.createIndex({email: 1})`
Compound	Multi-field queries	`db.users.createIndex({city: 1, age: -1})`
Multikey	Array field queries	`db.users.createIndex({tags: 1})`
Text	Full-text search	`db.articles.createIndex({content: "text"})`
Hashed	Equality only (sharding)	`db.users.createIndex({email: "hashed"})`
TTL	Auto-expire documents	`db.sessions.createIndex({created_at: 1}, {expireAfterSeconds: 3600})`

A common anti-pattern is creating too many indexes. Every index adds write overhead: an insert or delete must update every index on the collection, and an update must touch every index whose fields changed. Profile your queries and create indexes only for the most frequent and slowest ones.
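
The write cost of extra indexes can be illustrated with a toy model that counts index maintenance work per insert. `ToyCollection` is a hypothetical class written for this sketch, not a driver or server API, and it models inserts only.

```javascript
// Toy model of write amplification: each insert adds one entry to every index.
// ToyCollection is hypothetical and exists only for this illustration.
class ToyCollection {
  constructor(indexedFields) {
    this.docs = [];
    // One Map per indexed field: key value -> array of document positions.
    this.indexes = new Map(indexedFields.map((field) => [field, new Map()]));
    this.indexWrites = 0;
  }

  insert(doc) {
    const position = this.docs.push(doc) - 1;
    for (const [field, index] of this.indexes) {
      const key = doc[field];
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(position);
      this.indexWrites++; // one extra write per index entry
    }
  }
}

const lean = new ToyCollection(["email"]);
const heavy = new ToyCollection(["email", "city", "age", "created_at", "name"]);

for (let i = 0; i < 100; i++) {
  const doc = { email: `u${i}@example.com`, city: "SF", age: 30, created_at: i, name: `U${i}` };
  lean.insert(doc);
  heavy.insert(doc);
}

console.log(lean.indexWrites, heavy.indexWrites); // 100 500
```

The same 100 inserts cost five times as much index maintenance on the heavily indexed collection, which is why unused indexes are worth dropping.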

Sharding

Definition: MongoDB Sharding

MongoDB sharding distributes data across multiple servers (shards). Each shard holds a subset of the data, determined by the shard key. The mongos router directs queries to the appropriate shard(s). A good shard key ensures even distribution and query isolation.

Shard Key Strategy	Description	Best For
Hashed	Hash of shard key	Even distribution, equality queries
Ranged	Range-based distribution	Range queries, time-series
Zone	Geographic distribution	Data residency requirements

Shard Key Selection

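Bad shard key: _id (ObjectId) with ranged sharding. Values increase monotonically, so all new writes hit one shard.
Good shard key: { user_id: "hashed" }. Writes are distributed evenly across shards.
Good shard key: { user_id: 1, created_at: 1 }. Writes are distributed by user, and time-range queries for a given user are targeted rather than broadcast. Putting created_at first would recreate the monotonic hot spot described above.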

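Changing a shard key is costly, so choose carefully. Before MongoDB 4.4 the key was immutable; 4.4 added refineCollectionShardKey to append suffix fields, and 5.0 added reshardCollection to replace the key entirely.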
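
The hot-shard effect of a monotonically increasing key can be simulated in a few lines. The sketch below is a simplification under loud assumptions: four shards, fixed range boundaries, and a 32-bit FNV-1a hash chosen only for brevity. MongoDB's real hashed sharding uses a different hash function and splits data into chunks that the balancer moves around.

```javascript
// Simulates where 1,000 inserts with increasing keys land on 4 shards.
// Assumptions: fixed range boundaries and an FNV-1a hash chosen for brevity;
// MongoDB's real hashed sharding uses a different hash function and chunks.
const SHARDS = 4;

// Ranged: shard i owns keys from boundaries[i] up to the next boundary.
const boundaries = [0, 250, 500, 750];
function rangedShard(key) {
  let shard = 0;
  for (let i = 0; i < boundaries.length; i++) {
    if (key >= boundaries[i]) shard = i;
  }
  return shard;
}

// Hashed: 32-bit FNV-1a over the key's decimal digits, then modulo shard count.
function hashedShard(key) {
  let hash = 0x811c9dc5;
  for (const ch of String(key)) {
    hash ^= ch.charCodeAt(0);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash % SHARDS;
}

function countWrites(pickShard, keys) {
  const counts = new Array(SHARDS).fill(0);
  for (const key of keys) counts[pickShard(key)]++;
  return counts;
}

// New documents get keys 1000..1999, all above the existing ranges, which is
// what happens with a monotonically increasing key such as an ObjectId.
const newKeys = Array.from({ length: 1000 }, (_, i) => 1000 + i);

const rangedCounts = countWrites(rangedShard, newKeys);
const hashedCounts = countWrites(hashedShard, newKeys);
console.log("ranged:", rangedCounts);
console.log("hashed:", hashedCounts);
```

With ranged placement every new write lands on the last shard, while hashing spreads the same keys across all four. The trade-off is that a hashed key gives up efficient range queries on that field.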

Replica Sets

Definition: Replica Set

A replica set is a group of MongoDB nodes that maintain the same data set. One node is elected as primary; all writes go to the primary. Secondaries replicate the primary's data and can serve read queries when the client's read preference allows it, although such reads may return slightly stale data. If the primary fails, the remaining members automatically elect a new primary.
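
An election succeeds only when a candidate can gather votes from a strict majority of the voting members, which is why replica sets are usually deployed with an odd number of nodes. The sketch below models just that one rule; `canElectPrimary` and `faultTolerance` are hypothetical helpers, and priorities, arbiters, and timeouts are ignored.

```javascript
// A candidate needs votes from a strict majority of the voting members.
// canElectPrimary is a hypothetical helper that models only this one rule.
function canElectPrimary(votingMembers, reachableMembers) {
  return reachableMembers > votingMembers / 2;
}

// Largest number of members that can fail while a primary can still be elected.
function faultTolerance(votingMembers) {
  const majority = Math.floor(votingMembers / 2) + 1;
  return votingMembers - majority;
}

console.log(canElectPrimary(3, 2)); // true: 2 of 3 is a majority
console.log(canElectPrimary(4, 2)); // false: 2 of 4 is only half
console.log(faultTolerance(3), faultTolerance(4), faultTolerance(5)); // 1 1 2
```

Going from three members to four adds cost without adding fault tolerance, since both configurations survive only one failure.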

Common Anti-Patterns

Anti-Pattern	Problem	Solution
Unbounded arrays	Document growth, poor performance	Limit array size, normalize
Massive documents	Slow queries, memory pressure	Keep documents small
Missing indexes	Collection scans	Add appropriate indexes
Wrong shard key	Hot partitions, scatter-gather	Choose high-cardinality key
Over-normalization	Excessive $lookup	Denormalize for read patterns
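
One way to apply the "limit array size" advice from the table is to cap an embedded array at write time. The sketch below builds such an update document. `buildCappedPush` and the `recent_orders` field are hypothetical names introduced for illustration, while `$push`, `$each`, and `$slice` are MongoDB's own update operators.

```javascript
// Builds an update document that appends to an array but keeps only the
// newest `maxItems` entries. buildCappedPush is a hypothetical helper; the
// $push, $each, and $slice operators are MongoDB's own.
function buildCappedPush(field, item, maxItems) {
  if (!Number.isInteger(maxItems) || maxItems <= 0) {
    throw new RangeError("maxItems must be a positive integer");
  }
  return {
    $push: {
      [field]: {
        $each: [item],
        $slice: -maxItems, // a negative slice keeps the last N elements
      },
    },
  };
}

const update = buildCappedPush("recent_orders", { item: "laptop", price: 999.99 }, 50);
console.log(JSON.stringify(update));
// In mongosh this could be applied as (userId is assumed to be defined):
//   db.users.updateOne({ _id: userId }, update)
```

The full order history would then live in its own collection, with the user document holding only a bounded window of recent entries.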

Practice Exercises

Document Design: Design the MongoDB document schema for a blogging platform with users, posts, comments, tags, and categories. Decide what to embed vs reference.
Aggregation Pipeline: Write an aggregation pipeline to find the top 5 most popular products in the last 7 days, including average rating and total sales.
Sharding Design: Design the sharding strategy for a social media app with 100M users. What shard key would you choose? How do you handle hot users?
Index Optimization: Given a MongoDB collection with 50M documents and these query patterns, design the optimal indexes:
- Find users by email (unique)
- Find users by city and age range
- Search users by name (partial match)
- Get recent users by creation date

Key Takeaways:

MongoDB stores flexible BSON documents with dynamic schemas
The aggregation pipeline provides composable server-side data processing
Choose indexes carefully—each slows down writes
Shard keys are costly to change; choose high-cardinality keys for even distribution
Replica sets provide automatic failover and read scaling
Keep documents small and avoid unbounded arrays

What to Learn Next

-> NoSQL Deep Dive: Document, key-value, column-family, and graph databases overview.

-> SQL Deep Dive: PostgreSQL, MySQL, indexing strategies, and query optimization.

-> PostgreSQL Deep Dive: Advanced PostgreSQL features, extensions, and optimization.

-> Redis Deep Dive: Redis data structures, persistence, clustering, and use cases.

-> Data Partitioning: Sharding strategies, consistent hashing, and partition keys.

-> Choosing the Right Database: Systematic framework for database selection.
