System Design Problems

Design a Search Engine

A search engine indexes the entire web and returns relevant results in milliseconds. Google processes 8.5 billion searches daily across 100+ billion web pages, requiring inverted indices, relevance ranking (PageRank), and distributed query processing.

Crawling — Discover and download web pages continuously
Indexing — Build inverted indices mapping words to documents
Ranking — Score and order results by relevance

The core challenge is balancing index completeness (covering the entire web) with query speed (returning results in < 200ms) across billions of documents.

Requirements

Functional Requirements

Users can search with keywords and get ranked results
Results include title, snippet, URL, and metadata
Support for operators (AND, OR, NOT, exact phrase)
Autocomplete and spell correction
Search in multiple languages
Personalized results based on search history

Non-Functional Requirements

Latency: Results in < 200ms (p99)
Relevance: Top 10 results must be highly relevant
Scale: 100 billion indexed pages, 100K QPS
Freshness: New pages indexed within 1 hour
Availability: 99.99%

Search engines face a fundamental tension: larger indices provide better coverage but slower queries. The inverted index data structure solves this by enabling O(1) lookup from terms to documents.

Back-of-the-Envelope Estimation

Search Engine Capacity

100 billion pages × 1 KB average content = 100 TB raw
Inverted index: ~10 TB (terms → doc IDs with positions)
Query QPS: 8.5B / 86400 ≈ 100K QPS
Peak QPS (3x): 300K QPS
Index partitioning: 1000 shards × 10 GB each

Inverted Index

DfInverted Index

An inverted index maps each term to a list of documents that contain it, along with positions and frequencies. This data structure enables O(1) term lookup, making it the foundation of search engines.

TF-IDF Score

tfidf(t, d) = tf(t, d) \times idf(t)

Here,

$tf(t, d)$ =Term frequency: how often term t appears in document d
$idf(t)$ =Inverse document frequency: how rare term t is across all documents
$idf(t)$ == log(N / df(t)) where N = total docs, df(t) = docs containing t

TF-IDF Calculation

Term "distributed" appears 3 times in Doc1 (TF=3). 10 billion total docs, 100 million contain "distributed": IDF = log(10B / 100M) = log(100) = 2

TF-IDF = 3 × 2 = 6

The term "the" might have TF=10 but IDF=0.01 (appears in almost every doc). TF-IDF = 10 × 0.01 = 0.1 (low relevance)

PageRank

DfPageRank

PageRank scores web pages based on the quantity and quality of inbound links. A page linked from many high-quality pages receives a higher score.

PageRank Iteration

PR(d) = \frac{1-d}{N} + d \times \sum_{p \in B(d)} \frac{PR(p)}{L(p)}

Here,

$PR(d)$ =PageRank of document d
$d$ =Damping factor (typically 0.85)
$N$ =Total number of pages
$B(d)$ =Pages that link to d
$L(p)$ =Number of outbound links from page p

PageRank Calculation

Page A has PageRank 10 and 5 outbound links. Page B is linked from Page A and Page C (PR=5, 2 outbound links).

PR(B) = (1-0.85)/1B + 0.85 × (10/5 + 5/2) = 0.15/B + 0.85 × (2 + 2.5) ≈ 3.825

The page receives more rank from links on pages with fewer outbound links.

High-Level Architecture

Query Processing

Parse: Tokenize query, correct spelling, expand synonyms
Planner: Determine which index shards to query (scatter-gather)
Retrieve: Look up inverted index for each term
Score: Compute TF-IDF + PageRank + personalization
Rank: Sort by relevance score
Rerank: Apply ML model for final ranking
Return: Top 10 results with snippets

The scatter-gather pattern sends the query to all index shards in parallel, each shard returns top-K results, and the coordinator merges and re-ranks. This enables horizontal scaling of the index.

Practice Exercises

Algorithm: Implement an inverted index in code. What data structure maps terms to posting lists efficiently?
Scale: If the index has 100 billion documents and 1 billion unique terms, estimate the memory needed for the inverted index with 4-byte doc IDs and TF values.
Ranking: Design a ranking function that combines TF-IDF, PageRank, click-through rate, and freshness. How would you weight these signals?
Freshness: How would you update the inverted index when a page changes without rebuilding the entire index? Design an incremental update strategy.

Key Takeaways:

The inverted index is the fundamental data structure mapping terms to documents
TF-IDF measures term importance within a document; PageRank measures page importance globally
Scatter-gather across index shards enables horizontal scaling
Query processing: parse → retrieve → score → rank → rerank
Incremental index updates enable freshness without full rebuilds

What to Learn Next

-> Design Web Crawler Discovering and downloading web pages for indexing.

-> Design Search Autocomplete Trie-based typeahead suggestions.

-> Database Indexing B-tree and LSM-tree index structures.

-> Design Recommendation System ML-based ranking and personalization.

-> CDNs Serving search results from edge locations.

-> Databases Distributed storage for inverted indices.

Design a Search Engine

Design a Search Engine

Requirements

Functional Requirements

Non-Functional Requirements

Back-of-the-Envelope Estimation

Search Engine Capacity

Inverted Index

DfInverted Index

TF-IDF Score

TF-IDF Calculation

PageRank

DfPageRank

PageRank Iteration

PageRank Calculation

High-Level Architecture

Query Processing

Practice Exercises

What to Learn Next

Premium Content

Need Expert System Design Help?