System Design — Data Systems
Databases
Databases are the backbone of every system. Choosing the right database, understanding its trade-offs, and designing for scale is one of the most impactful decisions in system design.
- SQL vs NoSQL — Structured vs flexible data models
- Indexing — Making reads fast at the cost of writes
- Sharding — Distributing data across machines for horizontal scaling
Data is the new oil—but only if you can store, query, and scale it effectively.
Database Taxonomy
SQL vs NoSQL
DfRelational Database (SQL)
A relational database stores data in structured tables with predefined schemas, using SQL for querying. It guarantees ACID properties (Atomicity, Consistency, Isolation, Durability) for transactional integrity. Best for data with clear relationships and when consistency is critical.
DfNoSQL Database
A NoSQL database encompasses various non-relational data stores—document, key-value, column-family, and graph databases. They prioritize availability, partition tolerance, and horizontal scalability over strict consistency. Best for large-scale, high-throughput workloads with flexible schemas.
| Dimension | SQL | NoSQL |
|---|---|---|
| Schema | Fixed, predefined | Dynamic, flexible |
| Relationships | JOINs, foreign keys | Denormalized, embedded |
| Scaling | Vertical (scale up) | Horizontal (scale out) |
| Consistency | Strong (ACID) | Eventual (BASE) |
| Query Language | SQL | Varies by database |
| Best For | Complex queries, transactions | High throughput, flexible data |
Indexing
Indexes are data structures that speed up reads at the cost of write performance and storage.
DfDatabase Index
A database index is a data structure (typically a B-tree or hash table) that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space. Indexes allow the database engine to find rows without scanning the entire table.
B-Tree Index Structure
Index Types
| Type | Structure | Best For |
|---|---|---|
| B-tree | Balanced tree | Range queries, equality, ORDER BY |
| Hash | Hash table | Exact equality lookups only |
| GIN | Inverted index | Full-text search, array containment |
| GiST | Generalized search tree | Geometric, full-text, range |
| Composite | Multi-column | Queries filtering on multiple columns |
Index Design Rules
- Index selective columns: Low-cardinality columns (e.g., boolean) benefit less
- Covering indexes: Include all columns needed for a query to avoid table lookups
- Leftmost prefix: Composite indexes work left-to-right
- Avoid over-indexing: Each index slows writes and uses storage
- Partial indexes: Index only rows matching a condition
A common misconception is that more indexes always mean faster queries. In reality, each index consumes storage, slows down writes, and can cause the query planner to choose suboptimal plans. Profile your actual queries before adding indexes.
Replication
Replication copies data across multiple nodes for availability and read scaling.
DfDatabase Replication
Replication is the process of copying data from one database server (primary/master) to one or more other servers (replicas/slaves). It provides high availability (failover), read scalability (distribute read traffic), and geographic distribution (reduce latency).
Replication Strategies
| Strategy | Consistency | Availability | Use Case |
|---|---|---|---|
| Synchronous | Strong | Lower (waits for replicas) | Financial, critical data |
| Asynchronous | Eventual | Higher (doesn't wait) | Social media, analytics |
| Semi-synchronous | Between | Balanced | Most production systems |
Replication Topology
Replication Lag
Replication Lag Impact
Here,
- =Time between primary write and replica update
- =Time between consecutive writes to the same data
Sharding
Sharding distributes data across multiple database instances to achieve horizontal scalability.
DfSharding
Sharding is a database architecture pattern that partitions data horizontally across multiple database instances (shards). Each shard contains a subset of the data and operates independently, allowing the system to scale beyond the capacity of a single machine.
Sharding Strategies
| Strategy | How It Works | Trade-off |
|---|---|---|
| Hash-based | hash(key) % num_shards | Even distribution, range queries hard |
| Range-based | Shard by key ranges | Range queries easy, hotspots possible |
| Directory-based | Lookup table maps keys to shards | Flexible, single point of failure |
| Geographic | Shard by region | Data locality, regulatory compliance |
Sharding Challenges
Sharding introduces significant complexity:
- Cross-shard queries: JOINs across shards are expensive
- Rebalancing: Adding/removing shards requires data migration
- Hotspots: Uneven data distribution can overwhelm single shards
- Distributed transactions: ACID across shards requires 2PC or Saga pattern
Hash-Based Sharding Example
Shard 1: hash(user_id) % 4 == 0 → users: 4, 8, 12, 16, ...
Shard 2: hash(user_id) % 4 == 1 → users: 1, 5, 9, 13, ...
Shard 3: hash(user_id) % 4 == 2 → users: 2, 6, 10, 14, ...
Shard 4: hash(user_id) % 4 == 3 → users: 3, 7, 11, 15, ...
Consistent hashing (used in DynamoDB, Cassandra) minimizes data movement when adding/removing shards. Instead of hash % N, it uses a hash ring where each shard owns an arc. Adding a shard only redistributes data from adjacent shards.
Database Selection Guide
| Use Case | Recommended Database |
|---|---|
| Complex transactions (banking) | PostgreSQL |
| High-throughput key-value | Redis, DynamoDB |
| Document store (CMS, profiles) | MongoDB |
| Time-series data (metrics) | InfluxDB, TimescaleDB |
| Graph relationships (social) | Neo4j |
| Full-text search | Elasticsearch |
| Wide-column analytics | Cassandra, ScyllaDB |
| Caching layer | Redis, Memcached |
Practice Exercises
-
Design: Design a database schema for a URL shortener. Include tables for URLs, users, and analytics. What indexes would you create?
-
Trade-offs: Compare PostgreSQL and MongoDB for an e-commerce product catalog. Consider: schema flexibility, query patterns, scaling needs, and transaction requirements.
-
Architecture: Design a sharding strategy for a social media platform with 1 billion users. How do you handle user profile lookups, friend lists, and news feeds?
-
Analysis: Given a PostgreSQL database serving 10,000 QPS with 500ms p99 latency, identify the top 3 things you would investigate to improve performance.
Key Takeaways:
- SQL databases provide ACID guarantees and complex querying; NoSQL databases provide flexibility and horizontal scaling
- Indexes speed reads (O(log n) lookups) but slow writes and consume storage
- Replication provides availability and read scaling; synchronous replication sacrifices latency for consistency
- Sharding distributes data across machines; consistent hashing minimizes redistribution on resharding
- Choose the database based on access patterns, not hype—polyglot persistence is common
What to Learn Next
-> Caching Strategies Redis, Memcached, cache invalidation, and write strategies.
-> Load Balancing Algorithms, health checks, and L4 vs L7.
-> Message Queues Kafka, RabbitMQ, event-driven architecture.
-> CAP Theorem Consistency models, availability, and partition tolerance.
-> Microservices Service decomposition, discovery, and API gateways.
-> Scalability Fundamentals Vertical vs horizontal scaling and capacity planning.