System Design — Data Systems

Databases

Databases are the backbone of every system. Choosing the right database, understanding its trade-offs, and designing for scale is one of the most impactful decisions in system design.

SQL vs NoSQL — Structured vs flexible data models
Indexing — Making reads fast at the cost of writes
Sharding — Distributing data across machines for horizontal scaling

Data is the new oil—but only if you can store, query, and scale it effectively.

Database Taxonomy

SQL vs NoSQL

DfRelational Database (SQL)

A relational database stores data in structured tables with predefined schemas, using SQL for querying. It guarantees ACID properties (Atomicity, Consistency, Isolation, Durability) for transactional integrity. Best for data with clear relationships and when consistency is critical.

DfNoSQL Database

A NoSQL database encompasses various non-relational data stores—document, key-value, column-family, and graph databases. They prioritize availability, partition tolerance, and horizontal scalability over strict consistency. Best for large-scale, high-throughput workloads with flexible schemas.

Dimension	SQL	NoSQL
Schema	Fixed, predefined	Dynamic, flexible
Relationships	JOINs, foreign keys	Denormalized, embedded
Scaling	Vertical (scale up)	Horizontal (scale out)
Consistency	Strong (ACID)	Eventual (BASE)
Query Language	SQL	Varies by database
Best For	Complex queries, transactions	High throughput, flexible data

Indexing

Indexes are data structures that speed up reads at the cost of write performance and storage.

DfDatabase Index

A database index is a data structure (typically a B-tree or hash table) that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space. Indexes allow the database engine to find rows without scanning the entire table.

B-Tree Index Structure

Index Types

Type	Structure	Best For
B-tree	Balanced tree	Range queries, equality, ORDER BY
Hash	Hash table	Exact equality lookups only
GIN	Inverted index	Full-text search, array containment
GiST	Generalized search tree	Geometric, full-text, range
Composite	Multi-column	Queries filtering on multiple columns

Index Design Rules

Index selective columns: Low-cardinality columns (e.g., boolean) benefit less
Covering indexes: Include all columns needed for a query to avoid table lookups
Leftmost prefix: Composite indexes work left-to-right
Avoid over-indexing: Each index slows writes and uses storage
Partial indexes: Index only rows matching a condition

A common misconception is that more indexes always mean faster queries. In reality, each index consumes storage, slows down writes, and can cause the query planner to choose suboptimal plans. Profile your actual queries before adding indexes.

Replication

Replication copies data across multiple nodes for availability and read scaling.

DfDatabase Replication

Replication is the process of copying data from one database server (primary/master) to one or more other servers (replicas/slaves). It provides high availability (failover), read scalability (distribute read traffic), and geographic distribution (reduce latency).

Replication Strategies

Strategy	Consistency	Availability	Use Case
Synchronous	Strong	Lower (waits for replicas)	Financial, critical data
Asynchronous	Eventual	Higher (doesn't wait)	Social media, analytics
Semi-synchronous	Between	Balanced	Most production systems

Replication Topology

Replication Lag

Replication Lag Impact

\text{Consistent Read Probability} = 1 - \frac{\text{Lag}}{\text{Write Interval}}

Here,

$Lag$ =Time between primary write and replica update
$Write Interval$ =Time between consecutive writes to the same data

Sharding

Sharding distributes data across multiple database instances to achieve horizontal scalability.

DfSharding

Sharding is a database architecture pattern that partitions data horizontally across multiple database instances (shards). Each shard contains a subset of the data and operates independently, allowing the system to scale beyond the capacity of a single machine.

Sharding Strategies

Strategy	How It Works	Trade-off
Hash-based	hash(key) % num_shards	Even distribution, range queries hard
Range-based	Shard by key ranges	Range queries easy, hotspots possible
Directory-based	Lookup table maps keys to shards	Flexible, single point of failure
Geographic	Shard by region	Data locality, regulatory compliance

Sharding Challenges

Sharding introduces significant complexity:

Cross-shard queries: JOINs across shards are expensive
Rebalancing: Adding/removing shards requires data migration
Hotspots: Uneven data distribution can overwhelm single shards
Distributed transactions: ACID across shards requires 2PC or Saga pattern

Hash-Based Sharding Example

Architecture Diagram

Shard 1: hash(user_id) % 4 == 0  →  users: 4, 8, 12, 16, ...
Shard 2: hash(user_id) % 4 == 1  →  users: 1, 5, 9, 13, ...
Shard 3: hash(user_id) % 4 == 2  →  users: 2, 6, 10, 14, ...
Shard 4: hash(user_id) % 4 == 3  →  users: 3, 7, 11, 15, ...

Consistent hashing (used in DynamoDB, Cassandra) minimizes data movement when adding/removing shards. Instead of hash % N, it uses a hash ring where each shard owns an arc. Adding a shard only redistributes data from adjacent shards.

Database Selection Guide

Use Case	Recommended Database
Complex transactions (banking)	PostgreSQL
High-throughput key-value	Redis, DynamoDB
Document store (CMS, profiles)	MongoDB
Time-series data (metrics)	InfluxDB, TimescaleDB
Graph relationships (social)	Neo4j
Full-text search	Elasticsearch
Wide-column analytics	Cassandra, ScyllaDB
Caching layer	Redis, Memcached

Practice Exercises

Design: Design a database schema for a URL shortener. Include tables for URLs, users, and analytics. What indexes would you create?
Trade-offs: Compare PostgreSQL and MongoDB for an e-commerce product catalog. Consider: schema flexibility, query patterns, scaling needs, and transaction requirements.
Architecture: Design a sharding strategy for a social media platform with 1 billion users. How do you handle user profile lookups, friend lists, and news feeds?
Analysis: Given a PostgreSQL database serving 10,000 QPS with 500ms p99 latency, identify the top 3 things you would investigate to improve performance.

Key Takeaways:

SQL databases provide ACID guarantees and complex querying; NoSQL databases provide flexibility and horizontal scaling
Indexes speed reads (O(log n) lookups) but slow writes and consume storage
Replication provides availability and read scaling; synchronous replication sacrifices latency for consistency
Sharding distributes data across machines; consistent hashing minimizes redistribution on resharding
Choose the database based on access patterns, not hype—polyglot persistence is common

What to Learn Next

-> Caching Strategies Redis, Memcached, cache invalidation, and write strategies.

-> Load Balancing Algorithms, health checks, and L4 vs L7.

-> Message Queues Kafka, RabbitMQ, event-driven architecture.

-> CAP Theorem Consistency models, availability, and partition tolerance.

-> Microservices Service decomposition, discovery, and API gateways.

-> Scalability Fundamentals Vertical vs horizontal scaling and capacity planning.

Databases

Databases

Database Taxonomy

SQL vs NoSQL

DfRelational Database (SQL)

DfNoSQL Database

Indexing

DfDatabase Index

B-Tree Index Structure

Index Types

Index Design Rules

Replication

DfDatabase Replication

Replication Strategies

Replication Topology

Replication Lag

Replication Lag Impact

Sharding

DfSharding

Sharding Strategies

Sharding Challenges

Hash-Based Sharding Example

Database Selection Guide

Practice Exercises

What to Learn Next

Premium Content

Need Expert System Design Help?