πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Design a Search Engine

System Design ProblemsWeb Search🟒 Free Lesson

Advertisement

System Design Problems

Design a Search Engine

A search engine indexes the entire web and returns relevant results in milliseconds. Google processes 8.5 billion searches daily across 100+ billion web pages, requiring inverted indices, relevance ranking (PageRank), and distributed query processing.

  • Crawling β€” Discover and download web pages continuously
  • Indexing β€” Build inverted indices mapping words to documents
  • Ranking β€” Score and order results by relevance

The core challenge is balancing index completeness (covering the entire web) with query speed (returning results in < 200ms) across billions of documents.

Requirements

Functional Requirements

  • Users can search with keywords and get ranked results
  • Results include title, snippet, URL, and metadata
  • Support for operators (AND, OR, NOT, exact phrase)
  • Autocomplete and spell correction
  • Search in multiple languages
  • Personalized results based on search history

Non-Functional Requirements

  • Latency: Results in < 200ms (p99)
  • Relevance: Top 10 results must be highly relevant
  • Scale: 100 billion indexed pages, 100K QPS
  • Freshness: New pages indexed within 1 hour
  • Availability: 99.99%

Search engines face a fundamental tension: larger indices provide better coverage but slower queries. The inverted index data structure solves this by enabling O(1) lookup from terms to documents.

Back-of-the-Envelope Estimation

Search Engine Capacity

  • 100 billion pages Γ— 1 KB average content = 100 TB raw
  • Inverted index: ~10 TB (terms β†’ doc IDs with positions)
  • Query QPS: 8.5B / 86400 β‰ˆ 100K QPS
  • Peak QPS (3x): 300K QPS
  • Index partitioning: 1000 shards Γ— 10 GB each

Inverted Index

DfInverted Index

An inverted index maps each term to a list of documents that contain it, along with positions and frequencies. This data structure enables O(1) term lookup, making it the foundation of search engines.

"distributed""database""scale""design"Posting ListDoc1:3Doc5:1Doc12:2Doc89:1Doc1:2Doc3:4Doc12:3Doc1:1Doc7:2Term β†’ Posting List (DocID, TF, positions)

TF-IDF Score

tfidf(t,d)=tf(t,d)Γ—idf(t)tfidf(t, d) = tf(t, d) \times idf(t)

Here,

  • tf(t,d)tf(t, d)=Term frequency: how often term t appears in document d
  • idf(t)idf(t)=Inverse document frequency: how rare term t is across all documents
  • idf(t)idf(t)== log(N / df(t)) where N = total docs, df(t) = docs containing t

TF-IDF Calculation

Term "distributed" appears 3 times in Doc1 (TF=3). 10 billion total docs, 100 million contain "distributed": IDF = log(10B / 100M) = log(100) = 2

TF-IDF = 3 Γ— 2 = 6

The term "the" might have TF=10 but IDF=0.01 (appears in almost every doc). TF-IDF = 10 Γ— 0.01 = 0.1 (low relevance)

PageRank

DfPageRank

PageRank scores web pages based on the quantity and quality of inbound links. A page linked from many high-quality pages receives a higher score.

PageRank Iteration

PR(d)=1βˆ’dN+dΓ—βˆ‘p∈B(d)PR(p)L(p)PR(d) = \frac{1-d}{N} + d \times \sum_{p \in B(d)} \frac{PR(p)}{L(p)}

Here,

  • PR(d)PR(d)=PageRank of document d
  • dd=Damping factor (typically 0.85)
  • NN=Total number of pages
  • B(d)B(d)=Pages that link to d
  • L(p)L(p)=Number of outbound links from page p

PageRank Calculation

Page A has PageRank 10 and 5 outbound links. Page B is linked from Page A and Page C (PR=5, 2 outbound links).

PR(B) = (1-0.85)/1B + 0.85 Γ— (10/5 + 5/2) = 0.15/B + 0.85 Γ— (2 + 2.5) β‰ˆ 3.825

The page receives more rank from links on pages with fewer outbound links.

High-Level Architecture

Web CrawlerBFS/DFSRaw Content(S3/HDFS)IndexerTokenizeBuild Inverted IndexIndex Store(1000 shards)UserQuery Parser+ Spell CheckQuery PlannerScatter-gatherRanker+ RerankerSearch Engine Architecture

Query Processing

  1. Parse: Tokenize query, correct spelling, expand synonyms
  2. Planner: Determine which index shards to query (scatter-gather)
  3. Retrieve: Look up inverted index for each term
  4. Score: Compute TF-IDF + PageRank + personalization
  5. Rank: Sort by relevance score
  6. Rerank: Apply ML model for final ranking
  7. Return: Top 10 results with snippets

The scatter-gather pattern sends the query to all index shards in parallel, each shard returns top-K results, and the coordinator merges and re-ranks. This enables horizontal scaling of the index.

Practice Exercises

  1. Algorithm: Implement an inverted index in code. What data structure maps terms to posting lists efficiently?

  2. Scale: If the index has 100 billion documents and 1 billion unique terms, estimate the memory needed for the inverted index with 4-byte doc IDs and TF values.

  3. Ranking: Design a ranking function that combines TF-IDF, PageRank, click-through rate, and freshness. How would you weight these signals?

  4. Freshness: How would you update the inverted index when a page changes without rebuilding the entire index? Design an incremental update strategy.

Key Takeaways:

  • The inverted index is the fundamental data structure mapping terms to documents
  • TF-IDF measures term importance within a document; PageRank measures page importance globally
  • Scatter-gather across index shards enables horizontal scaling
  • Query processing: parse β†’ retrieve β†’ score β†’ rank β†’ rerank
  • Incremental index updates enable freshness without full rebuilds

What to Learn Next

-> Design Web Crawler Discovering and downloading web pages for indexing.

-> Design Search Autocomplete Trie-based typeahead suggestions.

-> Database Indexing B-tree and LSM-tree index structures.

-> Design Recommendation System ML-based ranking and personalization.

-> CDNs Serving search results from edge locations.

-> Databases Distributed storage for inverted indices.

⭐

Premium Content

Design a Search Engine

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert System Design Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement