Advanced · System Design · 45 min

Design a Web Crawler & Search Index

Search · IR · Crawling · Indexing
Interview Question

Design an internet-scale crawler with deduplication, politeness, incremental updates, and a query-time search index.

Key Points to Cover
  • Crawl frontier, politeness (robots.txt, rate limits), DNS performance
  • Content processing: normalization, dedup (shingling/simhash), language detection
  • Indexing: inverted index, postings compression, shard/replica strategy
  • Freshness & recrawl scheduling; change detection
  • Query layer: ranking signals, caching, snippet/preview generation
  • Abuse handling: spam detection, safe browsing integration
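The dedup bullet above (shingling/SimHash) can be sketched as follows. This is a minimal illustration, not a production implementation; hashing tokens with MD5 and using word tokens instead of shingles are simplifications:

```python
import hashlib

def simhash(tokens, bits=64):
    """Compute a SimHash fingerprint: sum per-bit votes across token hashes."""
    votes = [0] * bits
    for tok in tokens:
        # MD5 is for illustration only; any well-mixed 64-bit hash works.
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if votes[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Near-duplicate pages typically land within a few bits of each other,
# so a small Hamming-distance threshold (e.g. 3) flags them as dups.
a = simhash("the quick brown fox jumps over the lazy dog".split())
b = simhash("the quick brown fox leaps over the lazy dog".split())
```

In practice candidates are found without pairwise comparison by splitting the 64-bit fingerprint into bands and indexing each band, so only pages sharing a band are compared.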
Evaluation Rubric
  • Robust, polite crawler design (25% weight)
  • Effective content processing/dedup (25% weight)
  • Scalable indexing and querying (25% weight)
  • Spam/safety considerations (25% weight)
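The "postings compression" key point above can be illustrated with the classic gap-plus-varint scheme: store each sorted postings list as deltas between consecutive doc IDs, then encode each delta in 7-bit groups with a continuation bit. A minimal sketch:

```python
def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs, then varint-encode each gap."""
    out = bytearray()
    prev = 0
    for d in doc_ids:
        gap = d - prev
        prev = d
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)  # final group, continuation bit clear
    return bytes(out)

def decode_postings(data):
    """Invert encode_postings: rebuild gaps, then prefix-sum back to doc IDs."""
    ids, cur, shift, prev = [], 0, 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += cur
            ids.append(prev)
            cur, shift = 0, 0
    return ids
```

Because gaps are small for frequent terms, most entries fit in one byte instead of four or eight; real systems go further with block-based schemes such as PForDelta.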
Hints
  • 💡 Keep DNS and robots fetching efficient and cache-friendly.
Common Pitfalls to Avoid
  • ⚠️ Underestimating the scale and complexity of distributed systems, leading to single points of failure or performance bottlenecks.
  • ⚠️ Neglecting `robots.txt` and aggressive crawling, resulting in IP bans or legal issues.
  • ⚠️ Inefficient deduplication algorithms that are too slow or too memory-intensive for internet-scale data.
  • ⚠️ Poorly designed or uncompressed inverted indexes that lead to excessive storage requirements and slow query times.
  • ⚠️ Lack of proper error handling and retry mechanisms, leading to data loss or incomplete crawls.
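The last pitfall suggests wrapping every fetch in a retry loop. A minimal sketch with exponential backoff and jitter; the function name, attempt count, and base delay are illustrative:

```python
import random
import time

def fetch_with_retry(fetch, url, max_attempts=4, base_delay=0.5):
    """Retry a transient-failure-prone fetch with exponential backoff + jitter.

    Delay grows as base_delay * 2^attempt, randomized to avoid thundering herds.
    Re-raises the last error once attempts are exhausted so failures are visible
    upstream rather than silently dropped.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except IOError:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)
```

Real crawlers also distinguish retryable errors (timeouts, 5xx) from permanent ones (404, robots-disallowed) and record the failure so the frontier can deprioritize the URL instead of retrying forever.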
Potential Follow-up Questions
  • How do you schedule recrawls?
  • How do you fight link spam/cloaking?
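For the recrawl-scheduling follow-up, one common approach is adaptive intervals: halve a page's recrawl interval when it changed since the last visit and double it when it did not, clamped to sane bounds, with a min-heap ordering pages by next-due time. A minimal sketch; the class name and bounds are illustrative:

```python
import heapq
import time

class RecrawlScheduler:
    """Adaptive recrawl: shrink the interval on change, grow it on no-change."""

    def __init__(self, min_interval=3600, max_interval=30 * 86400):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.intervals = {}   # url -> current recrawl interval (seconds)
        self.heap = []        # (next_due_timestamp, url)

    def schedule(self, url, now=None):
        now = time.time() if now is None else now
        interval = self.intervals.setdefault(url, self.min_interval)
        heapq.heappush(self.heap, (now + interval, url))

    def report(self, url, changed):
        """After a recrawl, adapt the interval multiplicatively (AIMD-style)."""
        interval = self.intervals.get(url, self.min_interval)
        interval = interval / 2 if changed else interval * 2
        self.intervals[url] = max(self.min_interval, min(self.max_interval, interval))

    def due(self, now):
        """Pop every URL whose next-due time has passed."""
        out = []
        while self.heap and self.heap[0][0] <= now:
            out.append(heapq.heappop(self.heap)[1])
        return out
```

Change detection itself can use `ETag`/`Last-Modified` headers or a content fingerprint (e.g. the SimHash above), so an unchanged fetch costs little and still feeds the scheduler.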