Data Systems

Database Indexing

Indexes enable efficient data retrieval without scanning entire tables. Choosing the right index structure and strategy is critical for query performance at scale.

B-Tree — Balanced tree for reads and range queries
LSM Tree — Write-optimized structure for high throughput
Composite Index — Multi-column indexes for complex queries

Indexing is the single most impactful optimization for database performance.

Why Index?

Without an index, every query scans the entire table. Indexes create auxiliary data structures that map search keys to row locations.

DfDatabase Index

A database index is a data structure that improves the speed of data retrieval operations at the cost of additional storage and write overhead. An index stores a copy of selected columns with pointers to the original rows, organized in a structure (B-tree, hash, etc.) optimized for fast lookups.

Index Scan vs Full Table Scan

cost_{index} = O(\log n) \ll cost_{full} = O(n)

Here,

$n$ =Number of rows in the table
$cost_{index}$ =B-tree index lookup cost
$cost_{full}$ =Full table scan cost

Index Impact

Table with 100M rows, 1KB per row:

Full table scan: 100GB read
B-tree index lookup: ~3-4 disk reads (index height) = ~16KB

For a query returning 1 row, the index is ~6,000x faster.

B-Tree Index

The most common index structure in relational databases.

DfB-Tree

A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in O(log n) time. Each node contains multiple keys and child pointers. B-trees are optimized for systems that read and write large blocks of data.

B-Tree Properties

Property	Value
Height	O(log n) — typically 3-4 for billions of rows
Lookup	O(log n) — follow path from root to leaf
Range scan	O(k + log n) — traverse linked leaves
Insert/Delete	O(log n) — with rebalancing
Block size	Typically 4KB-16KB (aligns with disk pages)

B-trees are cache-friendly because each node fits in a disk page. A B-tree with branching factor 500 and height 3 can store 125 million keys with only 3 disk reads. This is why B-trees dominate database indexing.

LSM Tree

Write-optimized index for high-throughput systems.

DfLSM Tree

A Log-Structured Merge Tree (LSM tree) buffers writes in memory (memtable), then flushes sorted runs to disk. Background compaction merges runs. LSM trees excel at write-heavy workloads because writes are sequential (append-only). Reads may require checking multiple levels.

B-Tree vs LSM Tree

Aspect	B-Tree	LSM Tree
Write pattern	Random I/O	Sequential I/O
Read pattern	O(log n) single lookup	May check multiple levels
Write throughput	Moderate	High (10-100x)
Read throughput	High	Moderate
Space efficiency	Fragmentation over time	Better (compaction)
Write amplification	Low	High (compaction)
Use case	OLTP, mixed workloads	Write-heavy, time-series

LSM trees are used by RocksDB, LevelDB, Cassandra, HBase, and ScyllaDB. B-trees are used by PostgreSQL, MySQL (InnoDB), SQL Server, and Oracle. Choose based on your read/write ratio.

Index Types

Primary Index

DfPrimary Index

A primary index is built on the primary key of a table. In InnoDB, the primary key is the clustered index — the actual row data is stored in the primary key's B-tree. All secondary indexes point to the primary key value.

Secondary Index

DfSecondary Index

A secondary index is built on non-primary-key columns. It stores the indexed column value plus a pointer to the primary key. To fetch additional columns, the database performs a secondary lookup using the primary key.

Composite Index

DfComposite Index

A composite index (multi-column index) is built on multiple columns. The order of columns matters: the index can be used for queries filtering by the leftmost prefix of columns. An index on (a, b, c) supports queries on (a), (a, b), and (a, b, c), but not (b) or (c) alone.

The column order in a composite index should match the query's WHERE clause. Put high-selectivity columns first for point queries, or put range columns last to avoid limiting index usage for equality conditions.

Covering Index

DfCovering Index

A covering index includes all columns needed by a query, eliminating the need to access the table data. The query is answered entirely from the index, dramatically reducing I/O.

Covering Index Benefit

Query: SELECT user_id, name FROM users WHERE email = 'alice@example.com'

Without covering index:

B-tree lookup on email → find primary key
Secondary index lookup → find row pointer
Table read → fetch user_id, name

With covering index on (email, user_id, name):

B-tree lookup on email → find entry with user_id, name
Done — no table access needed

Index Selection

Index Selectivity

selectivity = \frac{distinct\_values}{total\_rows}

Here,

$selectivity$ =Fraction of unique values in the indexed column
$distinct_values$ =Number of distinct values
$total_rows$ =Total number of rows

High selectivity columns (e.g., email, UUID) are excellent index candidates. Low selectivity columns (e.g., boolean, gender) often benefit more from full table scans. Index columns that appear in WHERE, JOIN, ORDER BY, and GROUP BY clauses.

Practice Exercises

Analysis: A table has 100M rows with a B-tree index on column status. There are only 3 distinct values (active, inactive, pending). Should you use this index for a query filtering status = 'active'?
Design: Design composite indexes for these queries:
- SELECT * FROM orders WHERE user_id = ? AND status = 'shipped' ORDER BY created_at DESC
- SELECT * FROM orders WHERE created_at > ? AND total > 100
Comparison: Compare B-tree and LSM tree performance for a time-series database receiving 100K writes/second with occasional range queries.
Optimization: A query takes 5 seconds with a full table scan. The table has 50M rows. After adding an index, it takes 50ms. Explain the improvement and identify what else could be done.

Key Takeaways:

Indexes enable O(log n) lookups vs O(n) full table scans
B-trees optimize for reads; LSM trees optimize for writes
Composite indexes follow leftmost prefix rule
Covering indexes eliminate table access for read-only queries
Index selectivity determines index effectiveness
Indexes slow down writes — only create indexes that improve query performance

What to Learn Next

-> Databases SQL vs NoSQL, indexing, replication, and sharding.

-> Data Partitioning Horizontal partitioning, range vs hash partitioning.

-> Data Replication Leader-follower, multi-leader, and conflict resolution.

-> Scalability Fundamentals Vertical vs horizontal scaling and capacity planning.

-> Caching Redis, Memcached, cache strategies, and invalidation.

-> Event-Driven Architecture Event sourcing, CQRS, and saga patterns.

Database Indexing

Database Indexing

Why Index?

DfDatabase Index

Index Scan vs Full Table Scan

Index Impact

B-Tree Index

DfB-Tree

B-Tree Properties

LSM Tree

DfLSM Tree

B-Tree vs LSM Tree

Index Types

Primary Index

DfPrimary Index

Secondary Index

DfSecondary Index

Composite Index

DfComposite Index

Covering Index

DfCovering Index

Covering Index Benefit

Index Selection

Index Selectivity

Practice Exercises

What to Learn Next

Premium Content

Need Expert System Design Help?