System Design Problems
Design a Search Engine
A search engine indexes the entire web and returns relevant results in milliseconds. Google processes 8.5 billion searches daily across 100+ billion web pages, requiring inverted indices, relevance ranking (PageRank), and distributed query processing.
- Crawling β Discover and download web pages continuously
- Indexing β Build inverted indices mapping words to documents
- Ranking β Score and order results by relevance
The core challenge is balancing index completeness (covering the entire web) with query speed (returning results in < 200ms) across billions of documents.
Requirements
Functional Requirements
- Users can search with keywords and get ranked results
- Results include title, snippet, URL, and metadata
- Support for operators (AND, OR, NOT, exact phrase)
- Autocomplete and spell correction
- Search in multiple languages
- Personalized results based on search history
Non-Functional Requirements
- Latency: Results in < 200ms (p99)
- Relevance: Top 10 results must be highly relevant
- Scale: 100 billion indexed pages, 100K QPS
- Freshness: New pages indexed within 1 hour
- Availability: 99.99%
Search engines face a fundamental tension: larger indices provide better coverage but slower queries. The inverted index data structure solves this by enabling O(1) lookup from terms to documents.
Back-of-the-Envelope Estimation
Search Engine Capacity
- 100 billion pages Γ 1 KB average content = 100 TB raw
- Inverted index: ~10 TB (terms β doc IDs with positions)
- Query QPS: 8.5B / 86400 β 100K QPS
- Peak QPS (3x): 300K QPS
- Index partitioning: 1000 shards Γ 10 GB each
Inverted Index
DfInverted Index
An inverted index maps each term to a list of documents that contain it, along with positions and frequencies. This data structure enables O(1) term lookup, making it the foundation of search engines.
TF-IDF Score
Here,
- =Term frequency: how often term t appears in document d
- =Inverse document frequency: how rare term t is across all documents
- == log(N / df(t)) where N = total docs, df(t) = docs containing t
TF-IDF Calculation
Term "distributed" appears 3 times in Doc1 (TF=3). 10 billion total docs, 100 million contain "distributed": IDF = log(10B / 100M) = log(100) = 2
TF-IDF = 3 Γ 2 = 6
The term "the" might have TF=10 but IDF=0.01 (appears in almost every doc). TF-IDF = 10 Γ 0.01 = 0.1 (low relevance)
PageRank
DfPageRank
PageRank scores web pages based on the quantity and quality of inbound links. A page linked from many high-quality pages receives a higher score.
PageRank Iteration
Here,
- =PageRank of document d
- =Damping factor (typically 0.85)
- =Total number of pages
- =Pages that link to d
- =Number of outbound links from page p
PageRank Calculation
Page A has PageRank 10 and 5 outbound links. Page B is linked from Page A and Page C (PR=5, 2 outbound links).
PR(B) = (1-0.85)/1B + 0.85 Γ (10/5 + 5/2) = 0.15/B + 0.85 Γ (2 + 2.5) β 3.825
The page receives more rank from links on pages with fewer outbound links.
High-Level Architecture
Query Processing
- Parse: Tokenize query, correct spelling, expand synonyms
- Planner: Determine which index shards to query (scatter-gather)
- Retrieve: Look up inverted index for each term
- Score: Compute TF-IDF + PageRank + personalization
- Rank: Sort by relevance score
- Rerank: Apply ML model for final ranking
- Return: Top 10 results with snippets
The scatter-gather pattern sends the query to all index shards in parallel, each shard returns top-K results, and the coordinator merges and re-ranks. This enables horizontal scaling of the index.
Practice Exercises
-
Algorithm: Implement an inverted index in code. What data structure maps terms to posting lists efficiently?
-
Scale: If the index has 100 billion documents and 1 billion unique terms, estimate the memory needed for the inverted index with 4-byte doc IDs and TF values.
-
Ranking: Design a ranking function that combines TF-IDF, PageRank, click-through rate, and freshness. How would you weight these signals?
-
Freshness: How would you update the inverted index when a page changes without rebuilding the entire index? Design an incremental update strategy.
Key Takeaways:
- The inverted index is the fundamental data structure mapping terms to documents
- TF-IDF measures term importance within a document; PageRank measures page importance globally
- Scatter-gather across index shards enables horizontal scaling
- Query processing: parse β retrieve β score β rank β rerank
- Incremental index updates enable freshness without full rebuilds
What to Learn Next
-> Design Web Crawler Discovering and downloading web pages for indexing.
-> Design Search Autocomplete Trie-based typeahead suggestions.
-> Database Indexing B-tree and LSM-tree index structures.
-> Design Recommendation System ML-based ranking and personalization.
-> CDNs Serving search results from edge locations.
-> Databases Distributed storage for inverted indices.