Data Systems

SQL Deep Dive

SQL databases remain the backbone of most systems. Master PostgreSQL and MySQL internals, indexing strategies, transaction isolation levels, and scaling patterns.

ACID — Strong consistency guarantees for critical data
Indexing — B-trees, hash indexes, and query optimization
Scaling — Read replicas, connection pooling, and sharding

SQL databases are not legacy—they are battle-tested foundations for reliable systems.

SQL Database Architecture

Understanding the internal architecture of SQL databases is essential for making informed design decisions.

DfRelational Database

A relational database organizes data into tables (relations) with rows (tuples) and columns (attributes). It enforces a predefined schema, supports ACID transactions, and uses SQL for querying. The relational model provides strong data integrity through constraints, foreign keys, and normalization.

PostgreSQL vs MySQL

Feature	PostgreSQL	MySQL
ACID Compliance	Full	InnoDB only
JSON Support	Excellent (JSONB)	Good (JSON)
Full-Text Search	Built-in	Built-in
Replication	Streaming, logical	Asynchronous, semi-sync
Extensions	Rich ecosystem	Plugin-based
Performance	Complex queries	Simple queries
Concurrency	MVCC	MVCC (InnoDB)
Best For	Complex analytics, geospatial	Web apps, read-heavy

Storage Engine Internals

B-Tree Indexing

DfB-Tree

A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. It is the standard index structure for most SQL databases, optimized for disk I/O with high branching factor.

B-Tree Search Complexity

T_{search} = O(\log_B N)

Here,

$T_{search}$ =Search time complexity
$B$ =Branching factor (typically 100-500)
$N$ =Number of records

B-Tree Height Calculation

For a table with 1 billion rows and branching factor of 500:

Height = log_500(1,000,000,000) ≈ 5 levels

With each node fitting in a 16KB page, the entire index tree for 1B rows is only 5 disk reads deep. This is why B-trees are so efficient for disk-based storage.

Buffer Pool and Page Management

Transaction Isolation Levels

DfTransaction Isolation

Transaction isolation defines how and when changes made by one transaction are visible to other transactions. Higher isolation prevents more anomalies but reduces concurrency. The four standard levels represent a trade-off between correctness and performance.

Isolation Level	Phenomena Prevented	Performance
Read Uncommitted	Dirty reads	Highest
Read Committed	Dirty reads	High
Repeatable Read	Dirty, non-repeatable reads	Moderate
Serializable	All anomalies	Lowest

Isolation-Performance Trade-off

Throughput \propto \frac{1}{Isolation\ Level}

Here,

$Throughput$ =Transactions per second
$Isolation Level$ =Strictness of isolation (higher = more restrictive)

The default isolation level in PostgreSQL is Read Committed, while MySQL's InnoDB uses Repeatable Read. This can cause subtle bugs when migrating between databases. Always explicitly set your isolation level.

Indexing Strategies

When to Create an Index

Access Pattern	Index Type	Example
Equality lookup	B-tree	`WHERE id = 123`
Range query	B-tree	`WHERE date > '2024-01-01'`
Full-text search	GIN/GiST	`WHERE content ILIKE '%keyword%'`
Geospatial	GiST	`WHERE ST_DWithin(loc, point, 1000)`
Array contains	GIN	`WHERE tags @> ARRAY['sql']`
JSON field	GIN	`WHERE metadata @> '{"key": "value"}'`

Composite Indexes

DfComposite Index

A composite index (multi-column index) covers multiple columns in a single index. The order of columns matters—queries must use the leftmost columns to benefit from the index. This is known as the leftmost prefix rule.

Composite Index Design

Given: CREATE INDEX idx_user_date ON orders(user_id, order_date);

This index supports:

WHERE user_id = 100 (uses index)
WHERE user_id = 100 AND order_date > '2024-01-01' (uses index)
WHERE order_date > '2024-01-01' (does NOT use index—skips leftmost column)

Query Optimization

EXPLAIN ANALYZE

The most important tool for query optimization:

EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id)
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at > '2024-01-01'
GROUP BY u.name
ORDER BY COUNT(o.id) DESC
LIMIT 10;

Common Performance Anti-Patterns

Anti-Pattern	Problem	Solution
`SELECT *`	Fetches unnecessary columns	Select only needed columns
N+1 queries	Repeated queries in loop	Use JOIN or batch loading
Missing indexes	Full table scans	Add appropriate indexes
Over-indexing	Slow writes, storage waste	Remove unused indexes
Implicit type conversion	Prevents index usage	Match types explicitly

Always run EXPLAIN ANALYZE on slow queries. The output shows the actual execution plan, including which indexes are used, how many rows are scanned, and where the time is spent.

Scaling SQL Databases

Read Replicas

DfRead Replica

A read replica is a copy of the primary database that accepts read queries. Writes go to the primary, then replicate to replicas. This distributes read load across multiple servers while maintaining a single source of truth for writes.

Connection Pooling

Connection Pool Sizing

Pool_{size} = \frac{QPS \times Latency_{query}}{CPU_{cores}}

Here,

$Pool_{size}$ =Recommended connection pool size
$QPS$ =Queries per second
$Latency_{query}$ =Average query latency in seconds
$CPU_{cores}$ =CPU cores on the database server

Connection Pool Calculation

For a database handling 5000 QPS with 10ms average query latency and 16 CPU cores:

Pool_size = (5000 × 0.01) / 16 ≈ 3 connections

This seems low, but it's correct—each connection can handle many queries per second. Over-provisioning connections causes context switching overhead.

Practice Exercises

Index Design: Given a users table with 100M rows and these query patterns, design the optimal indexes:
- Lookup by email (unique)
- Find users by city and age range
- Search users by name (partial match)
- Get recent users by creation date

Query Optimization: Rewrite this query for better performance:

SELECT * FROM orders 
WHERE YEAR(created_at) = 2024 
AND status IN ('pending', 'processing')
ORDER BY created_at DESC;

Scaling Decision: Your PostgreSQL database handles 10K QPS reads and 1K QPS writes. The read latency is 50ms P99. Design a scaling strategy that maintains <100ms P99 read latency while supporting 2x growth.
Transaction Design: Design the transaction boundaries for a banking transfer operation. Consider isolation levels, deadlock prevention, and recovery from failures.

Key Takeaways:

PostgreSQL excels at complex queries; MySQL at simple reads
B-tree indexes provide O(log N) lookups with high branching factor
Transaction isolation levels trade consistency for concurrency
Composite indexes follow the leftmost prefix rule
Read replicas distribute read load; connection pooling manages connections
Always use EXPLAIN ANALYZE to diagnose slow queries

What to Learn Next

-> NoSQL Deep Dive Document, key-value, column-family, and graph databases.

-> PostgreSQL Deep Dive Advanced PostgreSQL features, extensions, and optimization.

-> Database Indexing B-trees, hash indexes, and indexing strategies.

-> Data Replication Sync vs async replication, leader election, and consistency.

-> Choosing the Right Database Systematic framework for database selection.

-> Data Partitioning Sharding strategies, consistent hashing, and partition keys.

SQL Deep Dive

SQL Deep Dive

SQL Database Architecture

DfRelational Database

PostgreSQL vs MySQL

Storage Engine Internals

B-Tree Indexing

DfB-Tree

B-Tree Search Complexity

B-Tree Height Calculation

Buffer Pool and Page Management

Transaction Isolation Levels

DfTransaction Isolation

Isolation-Performance Trade-off

Indexing Strategies

When to Create an Index

Composite Indexes

DfComposite Index

Composite Index Design

Query Optimization

EXPLAIN ANALYZE

Common Performance Anti-Patterns

Scaling SQL Databases

Read Replicas

DfRead Replica

Connection Pooling

Connection Pool Sizing

Connection Pool Calculation

Practice Exercises

What to Learn Next

Premium Content

Need Expert System Design Help?