System Design Problems
Design a Web Crawler
A web crawler systematically discovers and downloads web pages from the internet. Search engines like Google use crawlers to discover billions of pages, respecting robots.txt, managing politeness, and handling the enormous scale of the web.
- Discovery β Find new pages by following links from known pages
- Politeness β Respect crawl rate limits and robots.txt directives
- Freshness β Recrawl pages periodically to detect changes
The web has over 4 billion pages. A crawler must be distributed, polite, and fault-tolerant while discovering new content efficiently and avoiding duplicate work.
Requirements
Functional Requirements
- Crawl web pages starting from a set of seed URLs
- Extract links from pages and discover new URLs
- Respect robots.txt and crawl rate limits
- Store crawled content for indexing
- Recrawl pages based on change frequency
- Support priority-based crawling (important pages first)
Non-Functional Requirements
- Throughput: Crawl 1 billion pages per day
- Politeness: Max 1 request/second per domain
- Freshness: High-priority pages recrawled every 24 hours
- Robustness: Handle malformed HTML, timeouts, and failures
- Scalability: Horizontal scaling across thousands of machines
Web crawling is I/O bound (network requests), not CPU bound. The bottleneck is network bandwidth and DNS resolution, not computation.
Back-of-the-Envelope Estimation
Crawl Capacity Estimation
- 1 billion pages/day
- Average page size: 500 KB
- Daily bandwidth: 1B Γ 500 KB = 500 TB/day
- QPS_crawl = 1B / 86400 β 11,600 pages/sec
- DNS lookups: 11,600 QPS (with caching)
- 1000 crawler machines: ~12 pages/sec per machine
API Design
POST /api/v1/crawl/submit
Request: { "urls": ["https://..."], "priority": "high" }
Response: { "job_id": "j_123", "status": "queued" }
GET /api/v1/crawl/status/{job_id}
Response: { "status": "running", "pages_crawled": 50000, "errors": 120 }
GET /api/v1/crawl/robots/{domain}
Response: { "allow": ["/"], "disallow": ["/admin"], "crawl_delay": 1 }
High-Level Architecture
Detailed Design
URL Frontier
The URL frontier is a priority queue that manages URLs to be crawled:
DfURL Frontier
The URL frontier is a data structure that stores URLs waiting to be crawled. It supports priority ordering, politeness constraints (per-domain rate limiting), and deduplication.
| Component | Purpose |
|---|---|
| Priority Queue | Order URLs by importance (PageRank, freshness) |
| Politeness Queue | Per-domain queues with crawl delay enforcement |
| URL Dedup Set | Bloom filter or set to avoid re-crawling |
| DNS Resolver Cache | Cache DNS lookups to reduce DNS load |
Crawl Priority
Here,
- =Page importance score (0-1)
- =Time since last crawl (older = higher priority)
- =Link depth from seed (shallower = higher priority)
Politeness
Respect website crawl rate limits:
DfPoliteness Policy
Each domain has a crawl delay specified in robots.txt. The crawler enforces this delay between consecutive requests to the same domain using per-domain queues.
Crawl Delay Enforcement
Here,
- =Minimum seconds between requests to same domain
- =Timestamp of last request to this domain
Politeness Enforcement
Domain: example.com, crawl_delay: 2s
- Request 1: t=0s
- Request 2: t=2s (enforced delay)
- Request 3: t=4s (enforced delay)
Always respect robots.txt. Violating robots.txt can result in IP bans and legal issues. Cache robots.txt per domain with a reasonable TTL (24 hours).
URL Deduplication
Avoid crawling the same page twice:
DfBloom Filter
A Bloom filter is a probabilistic data structure that tests whether an element is a member of a set. It has zero false negatives but may have false positives. Space-efficient for billions of URLs.
Bloom Filter False Positive Rate
Here,
- =Size of bit array
- =Number of inserted elements
- =Number of hash functions
Bloom Filter Sizing
For 10 billion URLs with 1% false positive rate:
- Optimal k = (m/n) Γ ln(2) β 7 hash functions
- m = -n Γ ln(f) / (ln(2))Β² β 96 GB
This fits in memory across a distributed cluster.
Politeness Queue Design
Use a per-domain queue with a delay queue:
- Hash each URL to its domain
- Place URL in the domain's queue
- Domain queue has a timer enforcing crawl_delay
- When timer fires, dequeue next URL and fetch
Practice Exercises
-
Design: How would you handle JavaScript-rendered pages (SPAs) that require a headless browser to render? What are the resource implications?
-
Scale: If the crawler needs to crawl 1 billion pages per day across 1000 machines, estimate the network bandwidth required and the per-machine crawl rate.
-
Freshness: Design a recrawl scheduling algorithm that adapts to page change frequency. How would you prioritize recrawling?
-
Robustness: How would you handle a crawler trap (infinite URL space, recursive redirects)? Design detection and mitigation mechanisms.
Key Takeaways:
- The URL frontier manages priority, politeness, and deduplication for efficient crawling
- Per-domain politeness queues enforce robots.txt crawl delay requirements
- Bloom filters enable space-efficient URL deduplication across billions of pages
- Priority scheduling uses PageRank, freshness, and link depth signals
- Recrawl scheduling adapts to page change frequency for freshness optimization
What to Learn Next
-> Design Search Engine Inverted indices, relevance ranking, and search infrastructure.
-> Design Web Crawler This article (web crawler design deep dive).
-> Message Queues Async task distribution for crawl workers.
-> Rate Limiting Politeness and per-domain rate limiting strategies.
-> Databases Storing crawled content and URL metadata.
-> Caching Strategies DNS caching and robots.txt caching.