System Design Problems

Design Google Search

Google processes 8.5B+ searches per day across 100M+ web pages. This design covers web crawling, inverted index construction, and the PageRank ranking algorithm.

Scale — 8.5B searches/day, 100B+ pages indexed
Index — Petabytes of inverted index data
Latency — Search results in < 200ms

Google Search is the ultimate big data problem: crawling the entire web, building a searchable index, and ranking results in real-time.

Requirements Clarification

Functional Requirements

Search web pages by keywords
Return ranked results with snippets
Support boolean operators (AND, OR, NOT)
Autocomplete and spell correction
Search images, videos, news
Personalized results based on history

Non-Functional Requirements

Availability: 99.99% uptime
Latency: Results < 200ms
Freshness: New pages indexed within hours
Scale: 8.5B searches/day, 100B+ pages

Google Search is a three-phase system: (1) Crawling discovers pages, (2) Indexing builds the inverted index, (3) Serving handles queries in real-time. The index is rebuilt continuously.

Back-of-the-Envelope Estimation

Search QPS

\text{QPS} = \frac{8.5 \times 10^9}{86400} \approx 98K \text{ QPS}

Here,

$8.5B$ =Searches per day
$86400$ =Seconds in a day

Index Size Estimation

Average web page: 100KB Unique words per page: 1000 Posting list per word: 10M pages x 8 bytes = 80 MB

Total index size: 1M unique words x 80 MB = 80 PB

High-Level Architecture

Web Crawler

DfFocused BFS Crawler

Google's crawler uses breadth-first search with priority. High-quality pages (high PageRank) are crawled more frequently. The crawler respects robots.txt and uses a politeness policy to avoid overwhelming servers.

Crawl Frequency

f(page) = f_{\text{base}} \times \text{PageRank}(page) \times \text{change\_rate}(page)

Here,

$f_base$ =Base crawl frequency
$PageRank$ =Page importance score
$change_rate$ =How often the page changes

Inverted Index

DfInverted Index Structure

An inverted index maps each word to a list of documents (posting list) containing that word. The posting list includes document ID, term frequency, and position. This enables O(1) lookup for single-word queries.

Posting List Entry

\text{posting} = (doc\_id, tf, positions[])

Here,

$doc_id$ =Document identifier
$tf$ =Term frequency in document
$positions[]$ =Word positions for phrase queries

PageRank Algorithm

DfPageRank

PageRank models a "random surfer" who follows links randomly. The probability of being on a page equals its PageRank. Pages with more inbound links from high-PageRank pages have higher PageRank.

PageRank Formula

PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

Here,

$PR(p_i)$ =PageRank of page i
$d$ =Damping factor (0.85)
$N$ =Total number of pages
$L(p_j)$ =Number of outbound links from page j

PageRank Calculation

Page A has PageRank 10 and 5 outbound links. Page B gets 1/5 of A's rank.

PR(B) from A = 10 / 5 = 2

If 100 pages link to B with similar rank distribution, B's total PageRank accumulates from all of them.

Query Processing Pipeline

Data Model

Document Schema

\text{Document} = (doc\_id, url, title, content, pagerank, crawl\_time, freshness)

Here,

$doc_id$ =Unique document identifier
$pagerank$ =Pre-computed PageRank score
$freshness$ =Last crawl timestamp

Practice Exercises

Crawler Design: How would you prioritize crawling 1B new pages daily while respecting politeness limits?
Index Design: Design a distributed inverted index that supports 100K QPS queries.
Ranking: How would you combine PageRank with user-specific signals for personalized search?
Freshness: Design a system that indexes breaking news within minutes of publication.

Key Takeaways:

Google Search uses a three-phase pipeline: crawl, index, serve
Inverted index enables O(1) single-word lookup
PageRank models link structure as a random walk
MapReduce builds the index in parallel across thousands of machines
Query processing includes spell correction and result ranking

What to Learn Next

-> Design Leaderboard Sorted sets and real-time ranking.

-> Design Dropbox File storage and sync systems.

-> Design Amazon E-commerce search and catalog.

-> Design Twitter Trending topics and search.

-> Back Pressure Managing load in crawling systems.

-> Circuit Breaker Resilient web crawling.

Design Google Search

Design Google Search

Requirements Clarification

Functional Requirements

Non-Functional Requirements

Back-of-the-Envelope Estimation

Search QPS

Index Size Estimation

High-Level Architecture

Web Crawler

DfFocused BFS Crawler

Crawl Frequency

Inverted Index

DfInverted Index Structure

Posting List Entry

PageRank Algorithm

DfPageRank

PageRank Formula

PageRank Calculation

Query Processing Pipeline

Data Model

Document Schema

Practice Exercises

What to Learn Next

Premium Content

Need Expert System Design Help?