Advanced · System Design
45 min

Design a Web Crawler & Search Index

Tags: Search, IR, Crawling, Indexing
Interview Question

Design an internet-scale crawler with deduplication, politeness, incremental updates, and a query-time search index.

Key Points to Cover
  • Crawl frontier, politeness (robots.txt, rate limits), DNS performance
  • Content processing: normalization, dedup (shingling/simhash), language detection
  • Indexing: inverted index, postings compression, shard/replica strategy
  • Freshness & recrawl scheduling; change detection
  • Query layer: ranking signals, caching, snippet/preview generation
  • Abuse handling: spam detection, safe browsing integration
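As an illustration of the dedup point above, here is a minimal simhash sketch over word shingles. The shingle size (3 words), the 64-bit fingerprint width, the choice of BLAKE2b as the per-shingle hash, and all function names are assumptions made for this example, not something the question prescribes. A real pipeline would also weight features and index fingerprints so that near matches can be found without pairwise comparison.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed width; matches the 8-byte digest used below


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short documents collapse to a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text: str) -> int:
    """Compute a 64-bit simhash fingerprint from the text's shingles."""
    counts = [0] * FINGERPRINT_BITS
    for sh in shingles(text):
        digest = hashlib.blake2b(sh.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(FINGERPRINT_BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two pages would then be treated as near-duplicates when `hamming(simhash(a), simhash(b))` falls below a small threshold. The threshold itself is a tuning choice that depends on the corpus.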
Evaluation Rubric
  • Robust, polite crawler design (25% weight)
  • Effective content processing/dedup (25% weight)
  • Scalable indexing and querying (25% weight)
  • Spam/safety considerations (25% weight)
Hints
  • 💡 Keep DNS resolution and robots.txt fetching efficient and cache-friendly.
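One way to read the hint above is a per-origin robots.txt cache with a time-to-live, so that each host's rules are fetched once and reused across many URL checks. The sketch below uses the standard-library `urllib.robotparser.RobotFileParser`. The `RobotsCache` class, the injected `fetch` callable, and the 24-hour default TTL are hypothetical choices for this example. Error handling for failed fetches is omitted.

```python
import time
from typing import Callable, Dict, Tuple
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Per-origin robots.txt cache with a TTL (illustrative helper)."""

    def __init__(self, fetch: Callable[[str], str],
                 ttl_seconds: float = 86400.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # returns the robots.txt body for a URL
        self._ttl = ttl_seconds  # assumed default: 24 hours
        self._clock = clock
        self._entries: Dict[str, Tuple[float, RobotFileParser]] = {}

    def _parser_for(self, url: str) -> RobotFileParser:
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        now = self._clock()
        cached = self._entries.get(origin)
        if cached is not None and cached[0] > now:
            return cached[1]  # still fresh, no network round trip
        parser = RobotFileParser()
        parser.parse(self._fetch(origin + "/robots.txt").splitlines())
        self._entries[origin] = (now + self._ttl, parser)
        return parser

    def allowed(self, user_agent: str, url: str) -> bool:
        """True if the cached rules permit user_agent to fetch url."""
        return self._parser_for(url).can_fetch(user_agent, url)
```

Injecting `fetch` and `clock` keeps the cache testable without network access. The same TTL pattern could sit in front of a DNS resolver, although that part is not shown here.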
Potential Follow-up Questions
  • How do you schedule recrawls?
  • How do you fight link spam/cloaking?
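For the recrawl-scheduling follow-up, one possible answer is an adaptive interval: shorten a page's revisit interval when its content changed since the last fetch, lengthen it when it did not, and keep due times in a priority queue. The sketch below assumes change detection by exact content hash and uses made-up constants (halve on change, multiply by 1.5 when unchanged, clamp between one hour and 30 days). The names `PageState`, `next_interval`, and `RecrawlQueue` are hypothetical.

```python
import hashlib
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple

MIN_INTERVAL = 3600.0        # assumed floor: 1 hour
MAX_INTERVAL = 30 * 86400.0  # assumed ceiling: 30 days


@dataclass
class PageState:
    interval: float = 86400.0           # current revisit interval in seconds
    content_hash: Optional[str] = None  # hash of the last fetched body


def next_interval(state: PageState, body: bytes) -> float:
    """Update the page's revisit interval based on whether the body changed."""
    digest = hashlib.sha256(body).hexdigest()
    if state.content_hash is not None:
        if digest != state.content_hash:
            state.interval = max(MIN_INTERVAL, state.interval / 2)
        else:
            state.interval = min(MAX_INTERVAL, state.interval * 1.5)
    # On the first fetch there is nothing to compare, so the interval stays put.
    state.content_hash = digest
    return state.interval


class RecrawlQueue:
    """Min-heap of (due_time, url) pairs."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str]] = []

    def schedule(self, url: str, due_at: float) -> None:
        heapq.heappush(self._heap, (due_at, url))

    def pop_due(self, now: float) -> List[str]:
        """Remove and return every URL whose due time has passed, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

An exact hash flags any byte-level difference as a change, including rotating ads or timestamps. Comparing near-duplicate fingerprints of the extracted main content would be a reasonable refinement to discuss in the interview.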