System Design Problems

Design a Web Crawler

A web crawler systematically discovers and downloads web pages from the internet. Search engines like Google use crawlers to discover billions of pages, respecting robots.txt, managing politeness, and handling the enormous scale of the web.

Discovery — Find new pages by following links from known pages
Politeness — Respect crawl rate limits and robots.txt directives
Freshness — Recrawl pages periodically to detect changes

The web has over 4 billion pages. A crawler must be distributed, polite, and fault-tolerant while discovering new content efficiently and avoiding duplicate work.

Requirements

Functional Requirements

Crawl web pages starting from a set of seed URLs
Extract links from pages and discover new URLs
Respect robots.txt and crawl rate limits
Store crawled content for indexing
Recrawl pages based on change frequency
Support priority-based crawling (important pages first)

Non-Functional Requirements

Throughput: Crawl 1 billion pages per day
Politeness: Max 1 request/second per domain
Freshness: High-priority pages recrawled every 24 hours
Robustness: Handle malformed HTML, timeouts, and failures
Scalability: Horizontal scaling across thousands of machines

Web crawling is I/O bound (network requests), not CPU bound. The bottleneck is network bandwidth and DNS resolution, not computation.

Back-of-the-Envelope Estimation

Crawl Capacity Estimation

1 billion pages/day
Average page size: 500 KB
Daily bandwidth: 1B × 500 KB = 500 TB/day
QPS_crawl = 1B / 86400 ≈ 11,600 pages/sec
DNS lookups: 11,600 QPS (with caching)
1000 crawler machines: ~12 pages/sec per machine

API Design

Architecture Diagram

POST /api/v1/crawl/submit
Request:  { "urls": ["https://..."], "priority": "high" }
Response: { "job_id": "j_123", "status": "queued" }

GET /api/v1/crawl/status/{job_id}
Response: { "status": "running", "pages_crawled": 50000, "errors": 120 }

GET /api/v1/crawl/robots/{domain}
Response: { "allow": ["/"], "disallow": ["/admin"], "crawl_delay": 1 }

High-Level Architecture

Detailed Design

URL Frontier

The URL frontier is a priority queue that manages URLs to be crawled:

DfURL Frontier

The URL frontier is a data structure that stores URLs waiting to be crawled. It supports priority ordering, politeness constraints (per-domain rate limiting), and deduplication.

Component	Purpose
Priority Queue	Order URLs by importance (PageRank, freshness)
Politeness Queue	Per-domain queues with crawl delay enforcement
URL Dedup Set	Bloom filter or set to avoid re-crawling
DNS Resolver Cache	Cache DNS lookups to reduce DNS load

Crawl Priority

priority(url) = \alpha \cdot PageRank(url) + \beta \cdot freshness(url) + \gamma \cdot depth(url)

Here,

$PageRank(url)$ =Page importance score (0-1)
$freshness(url)$ =Time since last crawl (older = higher priority)
$depth(url)$ =Link depth from seed (shallower = higher priority)

Politeness

Respect website crawl rate limits:

DfPoliteness Policy

Each domain has a crawl delay specified in robots.txt. The crawler enforces this delay between consecutive requests to the same domain using per-domain queues.

Crawl Delay Enforcement

next\_request\_time = max(current\_time, last\_request\_time + crawl\_delay)

Here,

$crawl_delay$ =Minimum seconds between requests to same domain
$last_request_time$ =Timestamp of last request to this domain

Politeness Enforcement

Domain: example.com, crawl_delay: 2s

Request 1: t=0s
Request 2: t=2s (enforced delay)
Request 3: t=4s (enforced delay)

Always respect robots.txt. Violating robots.txt can result in IP bans and legal issues. Cache robots.txt per domain with a reasonable TTL (24 hours).

URL Deduplication

Avoid crawling the same page twice:

DfBloom Filter

A Bloom filter is a probabilistic data structure that tests whether an element is a member of a set. It has zero false negatives but may have false positives. Space-efficient for billions of URLs.

Bloom Filter False Positive Rate

f = (1 - e^{-kn/m})^k

Here,

$m$ =Size of bit array
$n$ =Number of inserted elements
$k$ =Number of hash functions

Bloom Filter Sizing

For 10 billion URLs with 1% false positive rate:

Optimal k = (m/n) × ln(2) ≈ 7 hash functions
m = -n × ln(f) / (ln(2))² ≈ 96 GB

This fits in memory across a distributed cluster.

Politeness Queue Design

Use a per-domain queue with a delay queue:

Hash each URL to its domain
Place URL in the domain's queue
Domain queue has a timer enforcing crawl_delay
When timer fires, dequeue next URL and fetch

Practice Exercises

Design: How would you handle JavaScript-rendered pages (SPAs) that require a headless browser to render? What are the resource implications?
Scale: If the crawler needs to crawl 1 billion pages per day across 1000 machines, estimate the network bandwidth required and the per-machine crawl rate.
Freshness: Design a recrawl scheduling algorithm that adapts to page change frequency. How would you prioritize recrawling?
Robustness: How would you handle a crawler trap (infinite URL space, recursive redirects)? Design detection and mitigation mechanisms.

Key Takeaways:

The URL frontier manages priority, politeness, and deduplication for efficient crawling
Per-domain politeness queues enforce robots.txt crawl delay requirements
Bloom filters enable space-efficient URL deduplication across billions of pages
Priority scheduling uses PageRank, freshness, and link depth signals
Recrawl scheduling adapts to page change frequency for freshness optimization

What to Learn Next

-> Design Search Engine Inverted indices, relevance ranking, and search infrastructure.

-> Design Web Crawler This article (web crawler design deep dive).

-> Message Queues Async task distribution for crawl workers.

-> Rate Limiting Politeness and per-domain rate limiting strategies.

-> Databases Storing crawled content and URL metadata.

-> Caching Strategies DNS caching and robots.txt caching.

Design a Web Crawler

Design a Web Crawler

Requirements

Functional Requirements

Non-Functional Requirements

Back-of-the-Envelope Estimation

Crawl Capacity Estimation

API Design

High-Level Architecture

Detailed Design

URL Frontier

DfURL Frontier

Crawl Priority

Politeness

DfPoliteness Policy

Crawl Delay Enforcement

Politeness Enforcement

URL Deduplication

DfBloom Filter

Bloom Filter False Positive Rate

Bloom Filter Sizing

Politeness Queue Design

Practice Exercises

What to Learn Next

Premium Content

Need Expert System Design Help?