🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Databases

Data SystemsData Storage🟢 Free Lesson

Advertisement

System Design — Data Systems

Databases

Databases are the backbone of every system. Choosing the right database, understanding its trade-offs, and designing for scale is one of the most impactful decisions in system design.

  • SQL vs NoSQL — Structured vs flexible data models
  • Indexing — Making reads fast at the cost of writes
  • Sharding — Distributing data across machines for horizontal scaling

Data is the new oil—but only if you can store, query, and scale it effectively.

Database Taxonomy

Database TaxonomyRelational (SQL)ACID, Schema, JOINsPostgreSQLMySQLBest for:Complex queries, transactionsData integrity, ACID complianceStructured, relational dataSmall-to-medium datasetsPostgreSQL, MySQL, Oracle, SQL ServerNon-Relational (NoSQL)BASE, Flexible Schema, ScaleDocumentKey-ValWideBest for:High write throughputFlexible schema, rapid iterationHorizontal scalingLarge-scale, high availabilityMongoDB, Redis, Cassandra, DynamoDB

SQL vs NoSQL

DfRelational Database (SQL)

A relational database stores data in structured tables with predefined schemas, using SQL for querying. It guarantees ACID properties (Atomicity, Consistency, Isolation, Durability) for transactional integrity. Best for data with clear relationships and when consistency is critical.

DfNoSQL Database

A NoSQL database encompasses various non-relational data stores—document, key-value, column-family, and graph databases. They prioritize availability, partition tolerance, and horizontal scalability over strict consistency. Best for large-scale, high-throughput workloads with flexible schemas.

DimensionSQLNoSQL
SchemaFixed, predefinedDynamic, flexible
RelationshipsJOINs, foreign keysDenormalized, embedded
ScalingVertical (scale up)Horizontal (scale out)
ConsistencyStrong (ACID)Eventual (BASE)
Query LanguageSQLVaries by database
Best ForComplex queries, transactionsHigh throughput, flexible data

Indexing

Indexes are data structures that speed up reads at the cost of write performance and storage.

DfDatabase Index

A database index is a data structure (typically a B-tree or hash table) that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space. Indexes allow the database engine to find rows without scanning the entire table.

B-Tree Index Structure

[30 | 60][10 | 20][40 | 50][70 | 80]DataDataDataDataDataDataDataB-tree: O(log n) lookup, range scans, ordered traversal

Index Types

TypeStructureBest For
B-treeBalanced treeRange queries, equality, ORDER BY
HashHash tableExact equality lookups only
GINInverted indexFull-text search, array containment
GiSTGeneralized search treeGeometric, full-text, range
CompositeMulti-columnQueries filtering on multiple columns

Index Design Rules

  1. Index selective columns: Low-cardinality columns (e.g., boolean) benefit less
  2. Covering indexes: Include all columns needed for a query to avoid table lookups
  3. Leftmost prefix: Composite indexes work left-to-right
  4. Avoid over-indexing: Each index slows writes and uses storage
  5. Partial indexes: Index only rows matching a condition

A common misconception is that more indexes always mean faster queries. In reality, each index consumes storage, slows down writes, and can cause the query planner to choose suboptimal plans. Profile your actual queries before adding indexes.

Replication

Replication copies data across multiple nodes for availability and read scaling.

DfDatabase Replication

Replication is the process of copying data from one database server (primary/master) to one or more other servers (replicas/slaves). It provides high availability (failover), read scalability (distribute read traffic), and geographic distribution (reduce latency).

Replication Strategies

StrategyConsistencyAvailabilityUse Case
SynchronousStrongLower (waits for replicas)Financial, critical data
AsynchronousEventualHigher (doesn't wait)Social media, analytics
Semi-synchronousBetweenBalancedMost production systems

Replication Topology

PrimaryReplica 1Replica 2Replica 3Writes go to primary, reads distributed across replicas

Replication Lag

Replication Lag Impact

Consistent Read Probability=1LagWrite Interval\text{Consistent Read Probability} = 1 - \frac{\text{Lag}}{\text{Write Interval}}

Here,

  • LagLag=Time between primary write and replica update
  • WriteIntervalWrite Interval=Time between consecutive writes to the same data

Sharding

Sharding distributes data across multiple database instances to achieve horizontal scalability.

DfSharding

Sharding is a database architecture pattern that partitions data horizontally across multiple database instances (shards). Each shard contains a subset of the data and operates independently, allowing the system to scale beyond the capacity of a single machine.

Sharding Strategies

StrategyHow It WorksTrade-off
Hash-basedhash(key) % num_shardsEven distribution, range queries hard
Range-basedShard by key rangesRange queries easy, hotspots possible
Directory-basedLookup table maps keys to shardsFlexible, single point of failure
GeographicShard by regionData locality, regulatory compliance

Sharding Challenges

Sharding introduces significant complexity:

  • Cross-shard queries: JOINs across shards are expensive
  • Rebalancing: Adding/removing shards requires data migration
  • Hotspots: Uneven data distribution can overwhelm single shards
  • Distributed transactions: ACID across shards requires 2PC or Saga pattern

Hash-Based Sharding Example

Architecture Diagram
Shard 1: hash(user_id) % 4 == 0  →  users: 4, 8, 12, 16, ...
Shard 2: hash(user_id) % 4 == 1  →  users: 1, 5, 9, 13, ...
Shard 3: hash(user_id) % 4 == 2  →  users: 2, 6, 10, 14, ...
Shard 4: hash(user_id) % 4 == 3  →  users: 3, 7, 11, 15, ...

Consistent hashing (used in DynamoDB, Cassandra) minimizes data movement when adding/removing shards. Instead of hash % N, it uses a hash ring where each shard owns an arc. Adding a shard only redistributes data from adjacent shards.

Database Selection Guide

Use CaseRecommended Database
Complex transactions (banking)PostgreSQL
High-throughput key-valueRedis, DynamoDB
Document store (CMS, profiles)MongoDB
Time-series data (metrics)InfluxDB, TimescaleDB
Graph relationships (social)Neo4j
Full-text searchElasticsearch
Wide-column analyticsCassandra, ScyllaDB
Caching layerRedis, Memcached

Practice Exercises

  1. Design: Design a database schema for a URL shortener. Include tables for URLs, users, and analytics. What indexes would you create?

  2. Trade-offs: Compare PostgreSQL and MongoDB for an e-commerce product catalog. Consider: schema flexibility, query patterns, scaling needs, and transaction requirements.

  3. Architecture: Design a sharding strategy for a social media platform with 1 billion users. How do you handle user profile lookups, friend lists, and news feeds?

  4. Analysis: Given a PostgreSQL database serving 10,000 QPS with 500ms p99 latency, identify the top 3 things you would investigate to improve performance.

Key Takeaways:

  • SQL databases provide ACID guarantees and complex querying; NoSQL databases provide flexibility and horizontal scaling
  • Indexes speed reads (O(log n) lookups) but slow writes and consume storage
  • Replication provides availability and read scaling; synchronous replication sacrifices latency for consistency
  • Sharding distributes data across machines; consistent hashing minimizes redistribution on resharding
  • Choose the database based on access patterns, not hype—polyglot persistence is common

What to Learn Next

-> Caching Strategies Redis, Memcached, cache invalidation, and write strategies.

-> Load Balancing Algorithms, health checks, and L4 vs L7.

-> Message Queues Kafka, RabbitMQ, event-driven architecture.

-> CAP Theorem Consistency models, availability, and partition tolerance.

-> Microservices Service decomposition, discovery, and API gateways.

-> Scalability Fundamentals Vertical vs horizontal scaling and capacity planning.

Premium Content

Databases

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert System Design Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement