System Design Problems
Design Google Search
Google processes 8.5B+ searches per day across 100M+ web pages. This design covers web crawling, inverted index construction, and the PageRank ranking algorithm.
- Scale β 8.5B searches/day, 100B+ pages indexed
- Index β Petabytes of inverted index data
- Latency β Search results in < 200ms
Google Search is the ultimate big data problem: crawling the entire web, building a searchable index, and ranking results in real-time.
Requirements Clarification
Functional Requirements
- Search web pages by keywords
- Return ranked results with snippets
- Support boolean operators (AND, OR, NOT)
- Autocomplete and spell correction
- Search images, videos, news
- Personalized results based on history
Non-Functional Requirements
- Availability: 99.99% uptime
- Latency: Results < 200ms
- Freshness: New pages indexed within hours
- Scale: 8.5B searches/day, 100B+ pages
Google Search is a three-phase system: (1) Crawling discovers pages, (2) Indexing builds the inverted index, (3) Serving handles queries in real-time. The index is rebuilt continuously.
Back-of-the-Envelope Estimation
Search QPS
Here,
- =Searches per day
- =Seconds in a day
Index Size Estimation
Average web page: 100KB Unique words per page: 1000 Posting list per word: 10M pages x 8 bytes = 80 MB
Total index size: 1M unique words x 80 MB = 80 PB
High-Level Architecture
Web Crawler
DfFocused BFS Crawler
Google's crawler uses breadth-first search with priority. High-quality pages (high PageRank) are crawled more frequently. The crawler respects robots.txt and uses a politeness policy to avoid overwhelming servers.
Crawl Frequency
Here,
- =Base crawl frequency
- =Page importance score
- =How often the page changes
Inverted Index
DfInverted Index Structure
An inverted index maps each word to a list of documents (posting list) containing that word. The posting list includes document ID, term frequency, and position. This enables O(1) lookup for single-word queries.
Posting List Entry
Here,
- =Document identifier
- =Term frequency in document
- =Word positions for phrase queries
PageRank Algorithm
DfPageRank
PageRank models a "random surfer" who follows links randomly. The probability of being on a page equals its PageRank. Pages with more inbound links from high-PageRank pages have higher PageRank.
PageRank Formula
Here,
- =PageRank of page i
- =Damping factor (0.85)
- =Total number of pages
- =Number of outbound links from page j
PageRank Calculation
Page A has PageRank 10 and 5 outbound links. Page B gets 1/5 of A's rank.
PR(B) from A = 10 / 5 = 2
If 100 pages link to B with similar rank distribution, B's total PageRank accumulates from all of them.
Query Processing Pipeline
Data Model
Document Schema
Here,
- =Unique document identifier
- =Pre-computed PageRank score
- =Last crawl timestamp
Practice Exercises
- Crawler Design: How would you prioritize crawling 1B new pages daily while respecting politeness limits?
- Index Design: Design a distributed inverted index that supports 100K QPS queries.
- Ranking: How would you combine PageRank with user-specific signals for personalized search?
- Freshness: Design a system that indexes breaking news within minutes of publication.
Key Takeaways:
- Google Search uses a three-phase pipeline: crawl, index, serve
- Inverted index enables O(1) single-word lookup
- PageRank models link structure as a random walk
- MapReduce builds the index in parallel across thousands of machines
- Query processing includes spell correction and result ranking
What to Learn Next
-> Design Leaderboard Sorted sets and real-time ranking.
-> Design Dropbox File storage and sync systems.
-> Design Amazon E-commerce search and catalog.
-> Design Twitter Trending topics and search.
-> Back Pressure Managing load in crawling systems.
-> Circuit Breaker Resilient web crawling.