Advanced · System-Design
45 min
Design a Web Crawler & Search Index
Search · IR · Crawling · Indexing
Interview Question
Design an internet-scale crawler with deduplication, politeness, incremental updates, and a query-time search index.
Key Points to Cover
- Crawl frontier, politeness (robots.txt, rate limits), DNS performance
- Content processing: normalization, dedup (shingling/simhash), language detection
- Indexing: inverted index, postings compression, shard/replica strategy
- Freshness & recrawl scheduling; change detection
- Query layer: ranking signals, caching, snippet/preview generation
- Abuse handling: spam detection, safe browsing integration
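The near-duplicate detection point (shingling/simhash) can be illustrated with a minimal sketch. It assumes 3-word shingles, a 64-bit fingerprint, and BLAKE2b as the per-shingle hash; the function names `shingles`, `simhash`, and `hamming` are illustrative and not taken from any particular crawler. A real system would also weight features and index fingerprints so that candidates within a small Hamming distance can be found without pairwise comparison.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumption for this sketch)


def shingles(text, k=3):
    """Return the set of k-word shingles after lowercasing and tokenizing."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text):
    """Each shingle votes +1/-1 on every bit; the sign of the sum sets the bit."""
    counts = [0] * BITS
    for sh in shingles(text):
        digest = hashlib.blake2b(sh.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two pages would then be treated as near-duplicates when `hamming(simhash(a), simhash(b))` falls below a tuned threshold; the threshold itself is a design parameter to discuss, not something fixed by the technique.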
Evaluation Rubric
- Robust, polite crawler design (25% weight)
- Effective content processing/dedup (25% weight)
- Scalable indexing and querying (25% weight)
- Spam/safety considerations (25% weight)
Hints
- 💡 Keep DNS lookups and robots.txt fetches efficient and cache-friendly.
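One way to make robots.txt handling cache-friendly is a per-host cache with a time-to-live, sketched below. It uses the standard library's `urllib.robotparser`; the `RobotsCache` class, the injected `fetch` callable (host in, robots.txt body out), the one-hour TTL, and the 1-second default delay are all assumptions for illustration. The same pattern (keyed by host, with an expiry) would apply to a DNS cache.

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit


class RobotsCache:
    """Caches parsed robots.txt rules per host and refetches after a TTL."""

    def __init__(self, fetch, ttl_seconds=3600.0, clock=time.monotonic):
        self._fetch = fetch    # hypothetical callable: host -> robots.txt text
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so expiry can be tested
        self._entries = {}     # host -> (expires_at, RobotFileParser)

    def _parser_for(self, host):
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[0] > now:
            return entry[1]    # cache hit: no network fetch
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(self._fetch(host).splitlines())
        self._entries[host] = (now + self._ttl, parser)
        return parser

    def allowed(self, url, user_agent):
        """True if the cached rules for the URL's host permit fetching it."""
        host = urlsplit(url).netloc
        return self._parser_for(host).can_fetch(user_agent, url)

    def crawl_delay(self, url, user_agent, default=1.0):
        """Crawl-delay in seconds for this agent, or a default if unspecified."""
        host = urlsplit(url).netloc
        delay = self._parser_for(host).crawl_delay(user_agent)
        return float(delay) if delay is not None else default
```

A frontier worker could call `allowed(...)` before dequeuing a URL and use `crawl_delay(...)` to set the next permitted fetch time for that host, so that politeness checks rarely trigger a network round trip.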
Potential Follow-up Questions
- ❓ How do you schedule recrawls?
- ❓ How do you fight link spam/cloaking?