Design Questions
- What is a web crawler and what is its purpose?
- Web crawlers collect content from a page and follow the links on that page to discover and collect new content.
- Example purpose: search engine indexing. A web crawler can collect page content to build a search index.
- Example purpose: web monitoring. A web crawler can check for copyright and trademark infringement.
- Storage calculation:
- Per month: 1 billion pages * 500KB average page size = 500TB.
- Store monthly snapshots for 5 years: 500TB * 12 months * 5 years = 30PB.
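A quick back-of-the-envelope check of the numbers above (a rough sketch; the 1 billion pages per month and 500KB average page size are assumed inputs):

```python
PAGES_PER_MONTH = 1_000_000_000   # assumed crawl volume
AVG_PAGE_SIZE_KB = 500            # assumed average page size

monthly_tb = PAGES_PER_MONTH * AVG_PAGE_SIZE_KB / 1_000_000_000   # KB -> TB
five_year_pb = monthly_tb * 12 * 5 / 1_000                        # 60 monthly snapshots, TB -> PB

print(f"Monthly storage: {monthly_tb:.0f} TB")      # 500 TB
print(f"Five-year storage: {five_year_pb:.0f} PB")  # 30 PB
```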
High Level Design
Seed URLs
→ URL Frontier
→ HTML Downloader / Content Parser / Content Storage
→ Link Extractor / URL Filter
→ URL Frontier …
- Seed URLs: the starting points of the crawl; we can choose different seed websites for different countries, topics, etc.
- URL Frontier: the component that stores and prioritizes the URLs waiting to be downloaded.
- HTML Downloader / Content Parser / Content Storage: download HTML pages, then parse and store their content.
- Link Extractor / URL Filter: extract links from each page and filter them before adding them back to the frontier (see the sketch after this list).
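A minimal single-process sketch of this loop, assuming `requests` as the HTML downloader and `BeautifulSoup` as the content parser (in the real system each component would be a separate, horizontally scaled service):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                   # assumed HTTP client for the HTML Downloader
from bs4 import BeautifulSoup     # assumed HTML parser for the Content Parser

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)   # URL Frontier: URLs waiting to be downloaded
    seen = set(seed_urls)         # dedup so the same URL is not queued twice
    storage = {}                  # Content Storage: url -> raw HTML

    while frontier and len(storage) < max_pages:
        url = frontier.popleft()                       # BFS order
        try:
            html = requests.get(url, timeout=5).text   # HTML Downloader
        except requests.RequestException:
            continue
        storage[url] = html                            # Content Storage

        soup = BeautifulSoup(html, "html.parser")      # Content Parser
        for tag in soup.find_all("a", href=True):      # Link Extractor
            link = urljoin(url, tag["href"])
            # URL Filter: keep http(s) links we have not seen yet
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return storage
```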
DFS or BFS
- DFS: the search tree can get very deep, so DFS is not suitable.
- BFS: more commonly used, but a plain BFS does not consider URL priority or politeness (not overloading a single host) when crawling.
- We can use sub-queues to prioritize URLs and enforce per-host delays when crawling, as sketched below.
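A rough sketch of such a frontier, assuming front queues keyed by priority and back queues keyed by host with a fixed politeness delay (the queue counts and the delay value are illustrative, not from the original notes):

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class UrlFrontier:
    """Frontier with priority sub-queues (front) and per-host sub-queues (back)."""

    def __init__(self, num_priorities=3, delay_seconds=1.0):
        self.front = [deque() for _ in range(num_priorities)]  # index 0 = highest priority
        self.back = defaultdict(deque)                         # host -> queue of URLs
        self.next_allowed = defaultdict(float)                 # host -> earliest fetch time
        self.delay = delay_seconds

    def add(self, url, priority=1):
        self.front[priority].append(url)

    def _refill_back_queues(self):
        # Move URLs into per-host queues; higher-priority URLs are moved first,
        # so they sit earlier in each host's queue.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                self.back[urlparse(url).netloc].append(url)

    def next_url(self):
        self._refill_back_queues()
        now = time.time()
        for host, queue in self.back.items():
            # Politeness: only hit a host once its delay has elapsed.
            if queue and now >= self.next_allowed[host]:
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None   # nothing fetchable right now; caller can wait and retry
```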
Other topics
- Storage: keep most of the frontier URLs on disk and maintain in-memory buffers for enqueue/dequeue (see the sketch after this list).
- Server-side rendering: dynamically render pages before parsing them so that links generated by scripts are not missed (see the headless-browser example after this list).
- Filter out pages: implement an anti-spam component to exclude spam and low-quality pages.
- Reliability, scalability, and maintainability: horizontal scaling and pluggable components.
- Analytics: implement a component to collect and analyze crawl data.
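A minimal sketch of the disk-backed frontier buffer mentioned in the Storage bullet; the spill-to-file format, buffer size, and flush policy are illustrative assumptions, not part of the original notes:

```python
import os
from collections import deque

class DiskBackedQueue:
    """Keep a small in-memory buffer of URLs; spill the overflow to a file on disk."""

    def __init__(self, path, buffer_size=10_000):
        self.path = path
        self.buffer = deque()
        self.buffer_size = buffer_size

    def push(self, url):
        self.buffer.append(url)
        if len(self.buffer) > self.buffer_size:
            self._spill_to_disk()

    def pop(self):
        if not self.buffer:
            self._load_from_disk()
        return self.buffer.popleft() if self.buffer else None

    def _spill_to_disk(self):
        # Append the newest half of the buffer to the on-disk queue file.
        with open(self.path, "a") as f:
            while len(self.buffer) > self.buffer_size // 2:
                f.write(self.buffer.pop() + "\n")

    def _load_from_disk(self):
        # Refill the in-memory buffer from disk when it runs empty.
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            self.buffer.extend(f.read().splitlines())
        os.remove(self.path)
```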
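For the server-side rendering bullet, one option (an assumption, not specified in the notes) is to fetch pages through a headless browser such as Selenium with Chrome, so the downloader sees the HTML after JavaScript has run:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")            # render without opening a browser window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")             # hypothetical page with script-generated links
rendered_html = driver.page_source            # HTML after scripts have executed
driver.quit()

# rendered_html can now be passed to the content parser / link extractor.
```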