Design Questions
- What is a web crawler and what is its purpose?
- Web crawlers collect content from a page and follow the links on that page to discover and collect new content.
- Example purpose: search engine indexing. A web crawler can collect page content to build a search index.
- Example purpose: web monitoring. A web crawler can check for copyright and trademark infringement.
- Storage calculation:
- Per month: 1 billion pages * 500KB average page size = 500TB.
- Store monthly snapshots for 5 years: 500TB * 12 months * 5 years = 30PB.
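A quick back-of-the-envelope check of the numbers above (a rough sketch; the 1 billion pages per month and 500KB average page size are assumed inputs):

```python
PAGES_PER_MONTH = 1_000_000_000   # assumed crawl volume
AVG_PAGE_SIZE_KB = 500            # assumed average page size

monthly_tb = PAGES_PER_MONTH * AVG_PAGE_SIZE_KB / 1_000_000_000   # KB -> TB
five_year_pb = monthly_tb * 12 * 5 / 1_000                        # 60 monthly snapshots, TB -> PB

print(f"Monthly storage: {monthly_tb:.0f} TB")      # 500 TB
print(f"Five-year storage: {five_year_pb:.0f} PB")  # 30 PB
```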
High Level Design
Seed URLs
→ URL Frontier
→ HTML Downloader / Content Parser / Content Storage
→ Link Extractor / URL Filter
→ URL Frontier …
- Seed URLs: the starting points of the crawl; we can choose different seed websites for different countries, topics, etc.
- URL Frontier: the component that stores and prioritizes the URLs waiting to be downloaded.
- HTML Downloader / Content Parser / Content Storage: download HTML pages, then parse and store their content.
- Link Extractor / URL Filter: extract links from each page and filter them before adding them back to the frontier (see the sketch after this list).
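A minimal single-process sketch of this loop, assuming `requests` as the HTML downloader and `BeautifulSoup` as the content parser (in the real system each component would be a separate, horizontally scaled service):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                   # assumed HTTP client for the HTML Downloader
from bs4 import BeautifulSoup     # assumed HTML parser for the Content Parser

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)   # URL Frontier: URLs waiting to be downloaded
    seen = set(seed_urls)         # dedup so the same URL is not queued twice
    storage = {}                  # Content Storage: url -> raw HTML

    while frontier and len(storage) < max_pages:
        url = frontier.popleft()                       # BFS order
        try:
            html = requests.get(url, timeout=5).text   # HTML Downloader
        except requests.RequestException:
            continue
        storage[url] = html                            # Content Storage

        soup = BeautifulSoup(html, "html.parser")      # Content Parser
        for tag in soup.find_all("a", href=True):      # Link Extractor
            link = urljoin(url, tag["href"])
            # URL Filter: keep http(s) links we have not seen yet
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return storage
```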
DFS or BFS
- DFS: the search tree can get very deep, so DFS is not suitable.
- BFS: more commonly used, but a plain BFS does not consider URL priority or politeness (not overloading a single host) when crawling.
- We can use sub-queues to prioritize URLs and enforce per-host delays when crawling, as sketched below.
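A rough sketch of such a frontier, assuming front queues keyed by priority and back queues keyed by host with a fixed politeness delay (the queue counts and the delay value are illustrative, not from the original notes):

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class UrlFrontier:
    """Frontier with priority sub-queues (front) and per-host sub-queues (back)."""

    def __init__(self, num_priorities=3, delay_seconds=1.0):
        self.front = [deque() for _ in range(num_priorities)]  # index 0 = highest priority
        self.back = defaultdict(deque)                         # host -> queue of URLs
        self.next_allowed = defaultdict(float)                 # host -> earliest fetch time
        self.delay = delay_seconds

    def add(self, url, priority=1):
        self.front[priority].append(url)

    def _refill_back_queues(self):
        # Move URLs into per-host queues; higher-priority URLs are moved first,
        # so they sit earlier in each host's queue.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                self.back[urlparse(url).netloc].append(url)

    def next_url(self):
        self._refill_back_queues()
        now = time.time()
        for host, queue in self.back.items():
            # Politeness: only hit a host once its delay has elapsed.
            if queue and now >= self.next_allowed[host]:
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None   # nothing fetchable right now; caller can wait and retry
```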
Other topics
- Storage: keep most of the frontier URLs on disk and maintain in-memory buffers for enqueue/dequeue (see the sketch after this list).
- Server-side rendering: dynamically render pages before parsing them so that links generated by scripts are not missed (see the headless-browser example after this list).
- Filter out pages: implement an anti-spam component to exclude spam and low-quality pages.
- Reliability, scalability, and maintainability: horizontal scaling and pluggable components.
- Analytics: implement a component to collect and analyze crawl data.
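A minimal sketch of the disk-backed frontier buffer mentioned in the Storage bullet; the spill-to-file format, buffer size, and flush policy are illustrative assumptions, not part of the original notes:

```python
import os
from collections import deque

class DiskBackedQueue:
    """Keep a small in-memory buffer of URLs; spill the overflow to a file on disk."""

    def __init__(self, path, buffer_size=10_000):
        self.path = path
        self.buffer = deque()
        self.buffer_size = buffer_size

    def push(self, url):
        self.buffer.append(url)
        if len(self.buffer) > self.buffer_size:
            self._spill_to_disk()

    def pop(self):
        if not self.buffer:
            self._load_from_disk()
        return self.buffer.popleft() if self.buffer else None

    def _spill_to_disk(self):
        # Append the newest half of the buffer to the on-disk queue file.
        with open(self.path, "a") as f:
            while len(self.buffer) > self.buffer_size // 2:
                f.write(self.buffer.pop() + "\n")

    def _load_from_disk(self):
        # Refill the in-memory buffer from disk when it runs empty.
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            self.buffer.extend(f.read().splitlines())
        os.remove(self.path)
```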
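For the server-side rendering bullet, one option (an assumption, not specified in the notes) is to fetch pages through a headless browser such as Selenium with Chrome, so the downloader sees the HTML after JavaScript has run:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")            # render without opening a browser window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")             # hypothetical page with script-generated links
rendered_html = driver.page_source            # HTML after scripts have executed
driver.quit()

# rendered_html can now be passed to the content parser / link extractor.
```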