πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Design Google Search

System Design ProblemsSearch Systems🟒 Free Lesson

Advertisement

System Design Problems

Design Google Search

Google processes 8.5B+ searches per day across 100M+ web pages. This design covers web crawling, inverted index construction, and the PageRank ranking algorithm.

  • Scale β€” 8.5B searches/day, 100B+ pages indexed
  • Index β€” Petabytes of inverted index data
  • Latency β€” Search results in < 200ms

Google Search is the ultimate big data problem: crawling the entire web, building a searchable index, and ranking results in real-time.

Requirements Clarification

Functional Requirements

  1. Search web pages by keywords
  2. Return ranked results with snippets
  3. Support boolean operators (AND, OR, NOT)
  4. Autocomplete and spell correction
  5. Search images, videos, news
  6. Personalized results based on history

Non-Functional Requirements

  1. Availability: 99.99% uptime
  2. Latency: Results < 200ms
  3. Freshness: New pages indexed within hours
  4. Scale: 8.5B searches/day, 100B+ pages

Google Search is a three-phase system: (1) Crawling discovers pages, (2) Indexing builds the inverted index, (3) Serving handles queries in real-time. The index is rebuilt continuously.

Back-of-the-Envelope Estimation

Search QPS

QPS=8.5Γ—10986400β‰ˆ98KΒ QPS\text{QPS} = \frac{8.5 \times 10^9}{86400} \approx 98K \text{ QPS}

Here,

  • 8.5B8.5B=Searches per day
  • 8640086400=Seconds in a day

Index Size Estimation

Average web page: 100KB Unique words per page: 1000 Posting list per word: 10M pages x 8 bytes = 80 MB

Total index size: 1M unique words x 80 MB = 80 PB

High-Level Architecture

Search ClientLoad Balancer / DNSQuery SvcIndex SvcRanking SvcCrawler SvcSnippet SvcSpell SvcMapReduce PipelineInverted IndexPageRank DBURL StoreDoc Store (GFS)Cache (Bigtable)

Web Crawler

DfFocused BFS Crawler

Google's crawler uses breadth-first search with priority. High-quality pages (high PageRank) are crawled more frequently. The crawler respects robots.txt and uses a politeness policy to avoid overwhelming servers.

Crawl Frequency

f(page)=fbaseΓ—PageRank(page)Γ—change_rate(page)f(page) = f_{\text{base}} \times \text{PageRank}(page) \times \text{change\_rate}(page)

Here,

  • fbasef_base=Base crawl frequency
  • PageRankPageRank=Page importance score
  • changeratechange_rate=How often the page changes

Inverted Index

DfInverted Index Structure

An inverted index maps each word to a list of documents (posting list) containing that word. The posting list includes document ID, term frequency, and position. This enables O(1) lookup for single-word queries.

Posting List Entry

posting=(doc_id,tf,positions[])\text{posting} = (doc\_id, tf, positions[])

Here,

  • dociddoc_id=Document identifier
  • tftf=Term frequency in document
  • positions[]positions[]=Word positions for phrase queries

PageRank Algorithm

DfPageRank

PageRank models a "random surfer" who follows links randomly. The probability of being on a page equals its PageRank. Pages with more inbound links from high-PageRank pages have higher PageRank.

PageRank Formula

PR(pi)=1βˆ’dN+dβˆ‘pj∈M(pi)PR(pj)L(pj)PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

Here,

  • PR(pi)PR(p_i)=PageRank of page i
  • dd=Damping factor (0.85)
  • NN=Total number of pages
  • L(pj)L(p_j)=Number of outbound links from page j

PageRank Calculation

Page A has PageRank 10 and 5 outbound links. Page B gets 1/5 of A's rank.

PR(B) from A = 10 / 5 = 2

If 100 pages link to B with similar rank distribution, B's total PageRank accumulates from all of them.

Query Processing Pipeline

ParseSpell FixExpandRetrieveRankReturn

Data Model

Document Schema

Document=(doc_id,url,title,content,pagerank,crawl_time,freshness)\text{Document} = (doc\_id, url, title, content, pagerank, crawl\_time, freshness)

Here,

  • dociddoc_id=Unique document identifier
  • pagerankpagerank=Pre-computed PageRank score
  • freshnessfreshness=Last crawl timestamp

Practice Exercises

  1. Crawler Design: How would you prioritize crawling 1B new pages daily while respecting politeness limits?
  2. Index Design: Design a distributed inverted index that supports 100K QPS queries.
  3. Ranking: How would you combine PageRank with user-specific signals for personalized search?
  4. Freshness: Design a system that indexes breaking news within minutes of publication.

Key Takeaways:

  • Google Search uses a three-phase pipeline: crawl, index, serve
  • Inverted index enables O(1) single-word lookup
  • PageRank models link structure as a random walk
  • MapReduce builds the index in parallel across thousands of machines
  • Query processing includes spell correction and result ranking

What to Learn Next

-> Design Leaderboard Sorted sets and real-time ranking.

-> Design Dropbox File storage and sync systems.

-> Design Amazon E-commerce search and catalog.

-> Design Twitter Trending topics and search.

-> Back Pressure Managing load in crawling systems.

-> Circuit Breaker Resilient web crawling.

⭐

Premium Content

Design Google Search

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert System Design Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement