πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Design a Web Crawler

System Design ProblemsWeb Crawling🟒 Free Lesson

Advertisement

System Design Problems

Design a Web Crawler

A web crawler systematically discovers and downloads web pages from the internet. Search engines like Google use crawlers to discover billions of pages, respecting robots.txt, managing politeness, and handling the enormous scale of the web.

  • Discovery β€” Find new pages by following links from known pages
  • Politeness β€” Respect crawl rate limits and robots.txt directives
  • Freshness β€” Recrawl pages periodically to detect changes

The web has over 4 billion pages. A crawler must be distributed, polite, and fault-tolerant while discovering new content efficiently and avoiding duplicate work.

Requirements

Functional Requirements

  • Crawl web pages starting from a set of seed URLs
  • Extract links from pages and discover new URLs
  • Respect robots.txt and crawl rate limits
  • Store crawled content for indexing
  • Recrawl pages based on change frequency
  • Support priority-based crawling (important pages first)

Non-Functional Requirements

  • Throughput: Crawl 1 billion pages per day
  • Politeness: Max 1 request/second per domain
  • Freshness: High-priority pages recrawled every 24 hours
  • Robustness: Handle malformed HTML, timeouts, and failures
  • Scalability: Horizontal scaling across thousands of machines

Web crawling is I/O bound (network requests), not CPU bound. The bottleneck is network bandwidth and DNS resolution, not computation.

Back-of-the-Envelope Estimation

Crawl Capacity Estimation

  • 1 billion pages/day
  • Average page size: 500 KB
  • Daily bandwidth: 1B Γ— 500 KB = 500 TB/day
  • QPS_crawl = 1B / 86400 β‰ˆ 11,600 pages/sec
  • DNS lookups: 11,600 QPS (with caching)
  • 1000 crawler machines: ~12 pages/sec per machine

API Design

Architecture Diagram
POST /api/v1/crawl/submit
Request:  { "urls": ["https://..."], "priority": "high" }
Response: { "job_id": "j_123", "status": "queued" }

GET /api/v1/crawl/status/{job_id}
Response: { "status": "running", "pages_crawled": 50000, "errors": 120 }

GET /api/v1/crawl/robots/{domain}
Response: { "allow": ["/"], "disallow": ["/admin"], "crawl_delay": 1 }

High-Level Architecture

Seed URLsURL FrontierPriority QueueURL DedupDNS CacheCrawler WorkersFetcher (HTTP)HTML ParserLink ExtractorContent FilterContent StoreURL DatabaseIndexerWeb Crawler Architecture

Detailed Design

URL Frontier

The URL frontier is a priority queue that manages URLs to be crawled:

DfURL Frontier

The URL frontier is a data structure that stores URLs waiting to be crawled. It supports priority ordering, politeness constraints (per-domain rate limiting), and deduplication.

ComponentPurpose
Priority QueueOrder URLs by importance (PageRank, freshness)
Politeness QueuePer-domain queues with crawl delay enforcement
URL Dedup SetBloom filter or set to avoid re-crawling
DNS Resolver CacheCache DNS lookups to reduce DNS load

Crawl Priority

priority(url)=Ξ±β‹…PageRank(url)+Ξ²β‹…freshness(url)+Ξ³β‹…depth(url)priority(url) = \alpha \cdot PageRank(url) + \beta \cdot freshness(url) + \gamma \cdot depth(url)

Here,

  • PageRank(url)PageRank(url)=Page importance score (0-1)
  • freshness(url)freshness(url)=Time since last crawl (older = higher priority)
  • depth(url)depth(url)=Link depth from seed (shallower = higher priority)

Politeness

Respect website crawl rate limits:

DfPoliteness Policy

Each domain has a crawl delay specified in robots.txt. The crawler enforces this delay between consecutive requests to the same domain using per-domain queues.

Crawl Delay Enforcement

next_request_time=max(current_time,last_request_time+crawl_delay)next\_request\_time = max(current\_time, last\_request\_time + crawl\_delay)

Here,

  • crawldelaycrawl_delay=Minimum seconds between requests to same domain
  • lastrequesttimelast_request_time=Timestamp of last request to this domain

Politeness Enforcement

Domain: example.com, crawl_delay: 2s

  • Request 1: t=0s
  • Request 2: t=2s (enforced delay)
  • Request 3: t=4s (enforced delay)

Always respect robots.txt. Violating robots.txt can result in IP bans and legal issues. Cache robots.txt per domain with a reasonable TTL (24 hours).

URL Deduplication

Avoid crawling the same page twice:

DfBloom Filter

A Bloom filter is a probabilistic data structure that tests whether an element is a member of a set. It has zero false negatives but may have false positives. Space-efficient for billions of URLs.

Bloom Filter False Positive Rate

f=(1βˆ’eβˆ’kn/m)kf = (1 - e^{-kn/m})^k

Here,

  • mm=Size of bit array
  • nn=Number of inserted elements
  • kk=Number of hash functions

Bloom Filter Sizing

For 10 billion URLs with 1% false positive rate:

  • Optimal k = (m/n) Γ— ln(2) β‰ˆ 7 hash functions
  • m = -n Γ— ln(f) / (ln(2))Β² β‰ˆ 96 GB

This fits in memory across a distributed cluster.

Politeness Queue Design

Use a per-domain queue with a delay queue:

  1. Hash each URL to its domain
  2. Place URL in the domain's queue
  3. Domain queue has a timer enforcing crawl_delay
  4. When timer fires, dequeue next URL and fetch
example.comgoogle.comgithub.com...delay: 2sdelay: 1sdelay: 5sPer-domain politeness queues

Practice Exercises

  1. Design: How would you handle JavaScript-rendered pages (SPAs) that require a headless browser to render? What are the resource implications?

  2. Scale: If the crawler needs to crawl 1 billion pages per day across 1000 machines, estimate the network bandwidth required and the per-machine crawl rate.

  3. Freshness: Design a recrawl scheduling algorithm that adapts to page change frequency. How would you prioritize recrawling?

  4. Robustness: How would you handle a crawler trap (infinite URL space, recursive redirects)? Design detection and mitigation mechanisms.

Key Takeaways:

  • The URL frontier manages priority, politeness, and deduplication for efficient crawling
  • Per-domain politeness queues enforce robots.txt crawl delay requirements
  • Bloom filters enable space-efficient URL deduplication across billions of pages
  • Priority scheduling uses PageRank, freshness, and link depth signals
  • Recrawl scheduling adapts to page change frequency for freshness optimization

What to Learn Next

-> Design Search Engine Inverted indices, relevance ranking, and search infrastructure.

-> Design Web Crawler This article (web crawler design deep dive).

-> Message Queues Async task distribution for crawl workers.

-> Rate Limiting Politeness and per-domain rate limiting strategies.

-> Databases Storing crawled content and URL metadata.

-> Caching Strategies DNS caching and robots.txt caching.

⭐

Premium Content

Design a Web Crawler

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert System Design Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement