πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Database Indexing

Data SystemsStorage & Retrieval🟒 Free Lesson

Advertisement

Data Systems

Database Indexing

Indexes enable efficient data retrieval without scanning entire tables. Choosing the right index structure and strategy is critical for query performance at scale.

  • B-Tree β€” Balanced tree for reads and range queries
  • LSM Tree β€” Write-optimized structure for high throughput
  • Composite Index β€” Multi-column indexes for complex queries

Indexing is the single most impactful optimization for database performance.

Why Index?

Without an index, every query scans the entire table. Indexes create auxiliary data structures that map search keys to row locations.

DfDatabase Index

A database index is a data structure that improves the speed of data retrieval operations at the cost of additional storage and write overhead. An index stores a copy of selected columns with pointers to the original rows, organized in a structure (B-tree, hash, etc.) optimized for fast lookups.

Index Scan vs Full Table Scan

costindex=O(log⁑n)β‰ͺcostfull=O(n)cost_{index} = O(\log n) \ll cost_{full} = O(n)

Here,

  • nn=Number of rows in the table
  • costindexcost_{index}=B-tree index lookup cost
  • costfullcost_{full}=Full table scan cost

Index Impact

Table with 100M rows, 1KB per row:

  • Full table scan: 100GB read
  • B-tree index lookup: ~3-4 disk reads (index height) = ~16KB

For a query returning 1 row, the index is ~6,000x faster.

B-Tree Index

The most common index structure in relational databases.

DfB-Tree

A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in O(log n) time. Each node contains multiple keys and child pointers. B-trees are optimized for systems that read and write large blocks of data.

B-Tree Index Structure[10 | 20 | 30][3 | 5 | 7][12 | 15 | 18][22 | 25 | 28][32 | 35]Leaf: 1-9β†’ Row pointersLeaf: 10-19β†’ Row pointersLeaf: 20-29β†’ Row pointersLeaf: 30-39β†’ Row pointersLeaf nodes linked for efficient range scans

B-Tree Properties

PropertyValue
HeightO(log n) β€” typically 3-4 for billions of rows
LookupO(log n) β€” follow path from root to leaf
Range scanO(k + log n) β€” traverse linked leaves
Insert/DeleteO(log n) β€” with rebalancing
Block sizeTypically 4KB-16KB (aligns with disk pages)

B-trees are cache-friendly because each node fits in a disk page. A B-tree with branching factor 500 and height 3 can store 125 million keys with only 3 disk reads. This is why B-trees dominate database indexing.

LSM Tree

Write-optimized index for high-throughput systems.

DfLSM Tree

A Log-Structured Merge Tree (LSM tree) buffers writes in memory (memtable), then flushes sorted runs to disk. Background compaction merges runs. LSM trees excel at write-heavy workloads because writes are sequential (append-only). Reads may require checking multiple levels.

LSM Tree StructureMemtableIn-memory sortedflushSSTables (Disk)L0: NewestL1: Sorted runsL2: MergedL3: Largestcompaction

B-Tree vs LSM Tree

AspectB-TreeLSM Tree
Write patternRandom I/OSequential I/O
Read patternO(log n) single lookupMay check multiple levels
Write throughputModerateHigh (10-100x)
Read throughputHighModerate
Space efficiencyFragmentation over timeBetter (compaction)
Write amplificationLowHigh (compaction)
Use caseOLTP, mixed workloadsWrite-heavy, time-series

LSM trees are used by RocksDB, LevelDB, Cassandra, HBase, and ScyllaDB. B-trees are used by PostgreSQL, MySQL (InnoDB), SQL Server, and Oracle. Choose based on your read/write ratio.

Index Types

Primary Index

DfPrimary Index

A primary index is built on the primary key of a table. In InnoDB, the primary key is the clustered index β€” the actual row data is stored in the primary key's B-tree. All secondary indexes point to the primary key value.

Secondary Index

DfSecondary Index

A secondary index is built on non-primary-key columns. It stores the indexed column value plus a pointer to the primary key. To fetch additional columns, the database performs a secondary lookup using the primary key.

Composite Index

DfComposite Index

A composite index (multi-column index) is built on multiple columns. The order of columns matters: the index can be used for queries filtering by the leftmost prefix of columns. An index on (a, b, c) supports queries on (a), (a, b), and (a, b, c), but not (b) or (c) alone.

The column order in a composite index should match the query's WHERE clause. Put high-selectivity columns first for point queries, or put range columns last to avoid limiting index usage for equality conditions.

Covering Index

DfCovering Index

A covering index includes all columns needed by a query, eliminating the need to access the table data. The query is answered entirely from the index, dramatically reducing I/O.

Covering Index Benefit

Query: SELECT user_id, name FROM users WHERE email = 'alice@example.com'

Without covering index:

  1. B-tree lookup on email β†’ find primary key
  2. Secondary index lookup β†’ find row pointer
  3. Table read β†’ fetch user_id, name

With covering index on (email, user_id, name):

  1. B-tree lookup on email β†’ find entry with user_id, name
  2. Done β€” no table access needed

Index Selection

Index Selectivity

selectivity=distinct_valuestotal_rowsselectivity = \frac{distinct\_values}{total\_rows}

Here,

  • selectivityselectivity=Fraction of unique values in the indexed column
  • distinctvaluesdistinct_values=Number of distinct values
  • totalrowstotal_rows=Total number of rows

High selectivity columns (e.g., email, UUID) are excellent index candidates. Low selectivity columns (e.g., boolean, gender) often benefit more from full table scans. Index columns that appear in WHERE, JOIN, ORDER BY, and GROUP BY clauses.

Practice Exercises

  1. Analysis: A table has 100M rows with a B-tree index on column status. There are only 3 distinct values (active, inactive, pending). Should you use this index for a query filtering status = 'active'?

  2. Design: Design composite indexes for these queries:

    • SELECT * FROM orders WHERE user_id = ? AND status = 'shipped' ORDER BY created_at DESC
    • SELECT * FROM orders WHERE created_at > ? AND total > 100
  3. Comparison: Compare B-tree and LSM tree performance for a time-series database receiving 100K writes/second with occasional range queries.

  4. Optimization: A query takes 5 seconds with a full table scan. The table has 50M rows. After adding an index, it takes 50ms. Explain the improvement and identify what else could be done.

Key Takeaways:

  • Indexes enable O(log n) lookups vs O(n) full table scans
  • B-trees optimize for reads; LSM trees optimize for writes
  • Composite indexes follow leftmost prefix rule
  • Covering indexes eliminate table access for read-only queries
  • Index selectivity determines index effectiveness
  • Indexes slow down writes β€” only create indexes that improve query performance

What to Learn Next

-> Databases SQL vs NoSQL, indexing, replication, and sharding.

-> Data Partitioning Horizontal partitioning, range vs hash partitioning.

-> Data Replication Leader-follower, multi-leader, and conflict resolution.

-> Scalability Fundamentals Vertical vs horizontal scaling and capacity planning.

-> Caching Redis, Memcached, cache strategies, and invalidation.

-> Event-Driven Architecture Event sourcing, CQRS, and saga patterns.

⭐

Premium Content

Database Indexing

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert System Design Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement