Advanced · System Design · 45 min

Design a Web Crawler & Search Index

Search · IR · Crawling · Indexing
Interview Question

Design an internet-scale crawler with deduplication, politeness, incremental updates, and a query-time search index.

Key Points to Cover
  • Crawl frontier, politeness (robots.txt, rate limits), DNS performance
  • Content processing: normalization, dedup (shingling/simhash), language detection
  • Indexing: inverted index, postings compression, shard/replica strategy
  • Freshness & recrawl scheduling; change detection
  • Query layer: ranking signals, caching, snippet/preview generation
  • Abuse handling: spam detection, safe browsing integration
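The dedup bullet above (shingling/SimHash) can be sketched as follows. This is a minimal illustration, not a production implementation; hashing tokens with MD5 and using word tokens instead of shingles are simplifications:

```python
import hashlib

def simhash(tokens, bits=64):
    """Compute a SimHash fingerprint: sum per-bit votes across token hashes."""
    votes = [0] * bits
    for tok in tokens:
        # MD5 is for illustration only; any well-mixed 64-bit hash works.
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if votes[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Near-duplicate pages typically land within a few bits of each other,
# so a small Hamming-distance threshold (e.g. 3) flags them as dups.
a = simhash("the quick brown fox jumps over the lazy dog".split())
b = simhash("the quick brown fox leaps over the lazy dog".split())
```

In practice candidates are found without pairwise comparison by splitting the 64-bit fingerprint into bands and indexing each band, so only pages sharing a band are compared.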
Evaluation Rubric
  • Robust, polite crawler design (25% weight)
  • Effective content processing/dedup (25% weight)
  • Scalable indexing and querying (25% weight)
  • Spam/safety considerations (25% weight)
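The "postings compression" key point above can be illustrated with the classic gap-plus-varint scheme: store each sorted postings list as deltas between consecutive doc IDs, then encode each delta in 7-bit groups with a continuation bit. A minimal sketch:

```python
def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs, then varint-encode each gap."""
    out = bytearray()
    prev = 0
    for d in doc_ids:
        gap = d - prev
        prev = d
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)  # final group, continuation bit clear
    return bytes(out)

def decode_postings(data):
    """Invert encode_postings: rebuild gaps, then prefix-sum back to doc IDs."""
    ids, cur, shift, prev = [], 0, 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += cur
            ids.append(prev)
            cur, shift = 0, 0
    return ids
```

Because gaps are small for frequent terms, most entries fit in one byte instead of four or eight; real systems go further with block-based schemes such as PForDelta.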
Hints
  • 💡 Keep DNS and robots fetching efficient and cache-friendly.
Common Pitfalls to Avoid
  • ⚠️ Underestimating the scale and complexity of distributed systems, leading to single points of failure or performance bottlenecks.
  • ⚠️ Neglecting `robots.txt` and aggressive crawling, resulting in IP bans or legal issues.
  • ⚠️ Inefficient deduplication algorithms that are too slow or too memory-intensive for internet-scale data.
  • ⚠️ Poorly designed or uncompressed inverted indexes that lead to excessive storage requirements and slow query times.
  • ⚠️ Lack of proper error handling and retry mechanisms, leading to data loss or incomplete crawls.
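The last pitfall suggests wrapping every fetch in a retry loop. A minimal sketch with exponential backoff and jitter; the function name, attempt count, and base delay are illustrative:

```python
import random
import time

def fetch_with_retry(fetch, url, max_attempts=4, base_delay=0.5):
    """Retry a transient-failure-prone fetch with exponential backoff + jitter.

    Delay grows as base_delay * 2^attempt, randomized to avoid thundering herds.
    Re-raises the last error once attempts are exhausted so failures are visible
    upstream rather than silently dropped.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except IOError:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)
```

Real crawlers also distinguish retryable errors (timeouts, 5xx) from permanent ones (404, robots-disallowed) and record the failure so the frontier can deprioritize the URL instead of retrying forever.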
Potential Follow-up Questions
  • How do you schedule recrawls?
  • How do you fight link spam/cloaking?
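For the recrawl-scheduling follow-up, one common approach is adaptive intervals: halve a page's recrawl interval when it changed since the last visit and double it when it did not, clamped to sane bounds, with a min-heap ordering pages by next-due time. A minimal sketch; the class name and bounds are illustrative:

```python
import heapq
import time

class RecrawlScheduler:
    """Adaptive recrawl: shrink the interval on change, grow it on no-change."""

    def __init__(self, min_interval=3600, max_interval=30 * 86400):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.intervals = {}   # url -> current recrawl interval (seconds)
        self.heap = []        # (next_due_timestamp, url)

    def schedule(self, url, now=None):
        now = time.time() if now is None else now
        interval = self.intervals.setdefault(url, self.min_interval)
        heapq.heappush(self.heap, (now + interval, url))

    def report(self, url, changed):
        """After a recrawl, adapt the interval multiplicatively (AIMD-style)."""
        interval = self.intervals.get(url, self.min_interval)
        interval = interval / 2 if changed else interval * 2
        self.intervals[url] = max(self.min_interval, min(self.max_interval, interval))

    def due(self, now):
        """Pop every URL whose next-due time has passed."""
        out = []
        while self.heap and self.heap[0][0] <= now:
            out.append(heapq.heappop(self.heap)[1])
        return out
```

Change detection itself can use `ETag`/`Last-Modified` headers or a content fingerprint (e.g. the SimHash above), so an unchanged fetch costs little and still feeds the scheduler.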