Data Systems
SQL Deep Dive
SQL databases remain the backbone of most systems. Master PostgreSQL and MySQL internals, indexing strategies, transaction isolation levels, and scaling patterns.
- ACID — Strong consistency guarantees for critical data
- Indexing — B-trees, hash indexes, and query optimization
- Scaling — Read replicas, connection pooling, and sharding
SQL databases are not legacy—they are battle-tested foundations for reliable systems.
SQL Database Architecture
Understanding the internal architecture of SQL databases is essential for making informed design decisions.
DfRelational Database
A relational database organizes data into tables (relations) with rows (tuples) and columns (attributes). It enforces a predefined schema, supports ACID transactions, and uses SQL for querying. The relational model provides strong data integrity through constraints, foreign keys, and normalization.
PostgreSQL vs MySQL
| Feature | PostgreSQL | MySQL |
|---|---|---|
| ACID Compliance | Full | InnoDB only |
| JSON Support | Excellent (JSONB) | Good (JSON) |
| Full-Text Search | Built-in | Built-in |
| Replication | Streaming, logical | Asynchronous, semi-sync |
| Extensions | Rich ecosystem | Plugin-based |
| Performance | Complex queries | Simple queries |
| Concurrency | MVCC | MVCC (InnoDB) |
| Best For | Complex analytics, geospatial | Web apps, read-heavy |
Storage Engine Internals
B-Tree Indexing
DfB-Tree
A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. It is the standard index structure for most SQL databases, optimized for disk I/O with high branching factor.
B-Tree Search Complexity
Here,
- =Search time complexity
- =Branching factor (typically 100-500)
- =Number of records
B-Tree Height Calculation
For a table with 1 billion rows and branching factor of 500:
Height = log_500(1,000,000,000) ≈ 5 levels
With each node fitting in a 16KB page, the entire index tree for 1B rows is only 5 disk reads deep. This is why B-trees are so efficient for disk-based storage.
Buffer Pool and Page Management
Transaction Isolation Levels
DfTransaction Isolation
Transaction isolation defines how and when changes made by one transaction are visible to other transactions. Higher isolation prevents more anomalies but reduces concurrency. The four standard levels represent a trade-off between correctness and performance.
| Isolation Level | Phenomena Prevented | Performance |
|---|---|---|
| Read Uncommitted | Dirty reads | Highest |
| Read Committed | Dirty reads | High |
| Repeatable Read | Dirty, non-repeatable reads | Moderate |
| Serializable | All anomalies | Lowest |
Isolation-Performance Trade-off
Here,
- =Transactions per second
- =Strictness of isolation (higher = more restrictive)
The default isolation level in PostgreSQL is Read Committed, while MySQL's InnoDB uses Repeatable Read. This can cause subtle bugs when migrating between databases. Always explicitly set your isolation level.
Indexing Strategies
When to Create an Index
| Access Pattern | Index Type | Example |
|---|---|---|
| Equality lookup | B-tree | WHERE id = 123 |
| Range query | B-tree | WHERE date > '2024-01-01' |
| Full-text search | GIN/GiST | WHERE content ILIKE '%keyword%' |
| Geospatial | GiST | WHERE ST_DWithin(loc, point, 1000) |
| Array contains | GIN | WHERE tags @> ARRAY['sql'] |
| JSON field | GIN | WHERE metadata @> '{"key": "value"}' |
Composite Indexes
DfComposite Index
A composite index (multi-column index) covers multiple columns in a single index. The order of columns matters—queries must use the leftmost columns to benefit from the index. This is known as the leftmost prefix rule.
Composite Index Design
Given: CREATE INDEX idx_user_date ON orders(user_id, order_date);
This index supports:
WHERE user_id = 100(uses index)WHERE user_id = 100 AND order_date > '2024-01-01'(uses index)WHERE order_date > '2024-01-01'(does NOT use index—skips leftmost column)
Query Optimization
EXPLAIN ANALYZE
The most important tool for query optimization:
EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id)
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at > '2024-01-01'
GROUP BY u.name
ORDER BY COUNT(o.id) DESC
LIMIT 10;
Common Performance Anti-Patterns
| Anti-Pattern | Problem | Solution |
|---|---|---|
SELECT * | Fetches unnecessary columns | Select only needed columns |
| N+1 queries | Repeated queries in loop | Use JOIN or batch loading |
| Missing indexes | Full table scans | Add appropriate indexes |
| Over-indexing | Slow writes, storage waste | Remove unused indexes |
| Implicit type conversion | Prevents index usage | Match types explicitly |
Always run EXPLAIN ANALYZE on slow queries. The output shows the actual execution plan, including which indexes are used, how many rows are scanned, and where the time is spent.
Scaling SQL Databases
Read Replicas
DfRead Replica
A read replica is a copy of the primary database that accepts read queries. Writes go to the primary, then replicate to replicas. This distributes read load across multiple servers while maintaining a single source of truth for writes.
Connection Pooling
Connection Pool Sizing
Here,
- =Recommended connection pool size
- =Queries per second
- =Average query latency in seconds
- =CPU cores on the database server
Connection Pool Calculation
For a database handling 5000 QPS with 10ms average query latency and 16 CPU cores:
Pool_size = (5000 × 0.01) / 16 ≈ 3 connections
This seems low, but it's correct—each connection can handle many queries per second. Over-provisioning connections causes context switching overhead.
Practice Exercises
-
Index Design: Given a users table with 100M rows and these query patterns, design the optimal indexes:
- Lookup by email (unique)
- Find users by city and age range
- Search users by name (partial match)
- Get recent users by creation date
-
Query Optimization: Rewrite this query for better performance:
SELECT * FROM orders WHERE YEAR(created_at) = 2024 AND status IN ('pending', 'processing') ORDER BY created_at DESC; -
Scaling Decision: Your PostgreSQL database handles 10K QPS reads and 1K QPS writes. The read latency is 50ms P99. Design a scaling strategy that maintains <100ms P99 read latency while supporting 2x growth.
-
Transaction Design: Design the transaction boundaries for a banking transfer operation. Consider isolation levels, deadlock prevention, and recovery from failures.
Key Takeaways:
- PostgreSQL excels at complex queries; MySQL at simple reads
- B-tree indexes provide O(log N) lookups with high branching factor
- Transaction isolation levels trade consistency for concurrency
- Composite indexes follow the leftmost prefix rule
- Read replicas distribute read load; connection pooling manages connections
- Always use EXPLAIN ANALYZE to diagnose slow queries
What to Learn Next
-> NoSQL Deep Dive Document, key-value, column-family, and graph databases.
-> PostgreSQL Deep Dive Advanced PostgreSQL features, extensions, and optimization.
-> Database Indexing B-trees, hash indexes, and indexing strategies.
-> Data Replication Sync vs async replication, leader election, and consistency.
-> Choosing the Right Database Systematic framework for database selection.
-> Data Partitioning Sharding strategies, consistent hashing, and partition keys.