[SysDes1][Chap. 9] Design a Web-Crawler

Design Questions

  • What is a web crawler and what is its purpose?
    • A web crawler collects content from a page and follows the links on that page to discover and collect new content.
    • Example purpose: Search engine indexing. A web crawler can collect page information to build a search index.
    • Example purpose: Web monitoring. A web crawler can be used to detect copyright and trademark infringement.
  • Storage calculation:
    • For one month: 1 billion pages * 500KB = 500TB.
    • Store monthly versions for 5 years: 500TB * 12 * 5 = 30PB (see the quick check below).
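
A quick sanity check of the storage math, keeping the assumed inputs from the notes (1 billion pages per month, 500KB average page size, 5-year retention):

```python
# Back-of-the-envelope storage estimate (all inputs are the assumptions above).
PAGES_PER_MONTH = 1_000_000_000     # 1 billion pages crawled per month
AVG_PAGE_SIZE_KB = 500              # average page size: 500 KB
MONTHS = 12
RETENTION_YEARS = 5

monthly_tb = PAGES_PER_MONTH * AVG_PAGE_SIZE_KB / 1e9    # KB -> TB
total_pb = monthly_tb * MONTHS * RETENTION_YEARS / 1e3   # TB -> PB

print(f"Per month: {monthly_tb:.0f} TB")   # 500 TB
print(f"Over 5 years: {total_pb:.0f} PB")  # 30 PB
```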

High Level Design

Seed URLs
→ URL Frontier
→ HTML Downloader / Content Parser / Content Storage
→ Link Extractor / URL Filter
→ URL Frontier …

  • Seed URLs: the starting points of the crawl; we can choose different seed sites for different countries, topics, etc.
  • URL Frontier: the component that stores and orders the URLs still to be downloaded.
  • HTML Downloader / Content Parser / Content Storage: download HTML pages, then parse and store their content.
  • Link Extractor / URL Filter: extract links from downloaded pages and filter them before adding them to the frontier (a minimal end-to-end sketch follows this list).
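
A minimal, single-threaded sketch of this pipeline using only the Python standard library; the regex-based link extraction and the in-memory storage dict are simplifications, not the book's implementation:

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def download(url):
    """HTML Downloader: fetch a page, return None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return None

def extract_links(base_url, html):
    """Link Extractor: naive href extraction (a real HTML parser would be better)."""
    hrefs = re.findall(r'href=["\'](.*?)["\']', html)
    return [urljoin(base_url, h) for h in hrefs]

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)   # URL Frontier (plain FIFO here, i.e. BFS)
    visited = set(seed_urls)
    storage = {}                  # Content Storage (in-memory stand-in)

    while frontier and len(storage) < max_pages:
        url = frontier.popleft()
        html = download(url)      # HTML Downloader
        if html is None:
            continue
        storage[url] = html       # Content Parser / Content Storage stand-in

        for link in extract_links(url, html):                    # Link Extractor
            if link.startswith("http") and link not in visited:  # URL Filter
                visited.add(link)
                frontier.append(link)
    return storage
```

In a real system each stage would be a separate, horizontally scaled component, and the frontier would be the prioritized, polite structure discussed next.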

DFS or BFS

  • DFS: the search tree can get very deep, so depth-first traversal is not suitable.
  • BFS: more commonly used, but plain BFS does not consider the priority of URLs or politeness (avoiding too many requests to the same host) when crawling.
    • We can use sub-queues to prioritize different URLs and enforce per-host delays for politeness, as sketched below.
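
A rough sketch of that idea, assuming one sub-queue per priority level and a fixed per-host delay; the 1-second delay and integer priorities are illustrative choices, not the book's exact design:

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class URLFrontier:
    """Frontier with priority sub-queues and a per-host politeness delay."""

    def __init__(self, politeness_delay=1.0):
        self.queues = defaultdict(deque)   # priority level -> sub-queue of URLs
        self.last_fetch = {}               # host -> timestamp of last fetch
        self.delay = politeness_delay      # minimum seconds between hits to one host

    def add(self, url, priority=1):
        self.queues[priority].append(url)

    def next_url(self):
        # Scan from the highest priority (smallest number) down, skipping
        # hosts that were fetched too recently.
        for priority in sorted(self.queues):
            queue = self.queues[priority]
            for _ in range(len(queue)):
                url = queue.popleft()
                host = urlparse(url).netloc
                if time.time() - self.last_fetch.get(host, 0) >= self.delay:
                    self.last_fetch[host] = time.time()
                    return url
                queue.append(url)          # host hit too recently; re-queue
        return None                        # nothing eligible right now
```

A downloader worker would call add(url, priority) for new links, repeatedly call next_url(), and sleep briefly whenever it returns None.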

Other topics

  • Storage: keep most of the frontier URLs on disk and maintain buffers in RAM for enqueue/dequeue.
  • Server-side rendering: render dynamic pages on the server before parsing them, so dynamically generated links are not missed.
  • Filter out pages: implement an anti-spam component to exclude low-quality or spam pages (see the URL filter sketch after this list).
  • Reliability, scalability and maintainability: horizontal scaling, pluggable components.
  • Analytics: implement a component to collect and analyze crawl data.
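
A small sketch of the URL filtering / anti-spam idea; the blocked extensions, the example spam domain, and the in-memory seen set are placeholders (a production system might use a Bloom filter or a disk-backed store instead):

```python
from urllib.parse import urlparse, urlunparse

BLOCKED_EXTENSIONS = (".jpg", ".png", ".gif", ".pdf", ".zip")  # example list only
BLOCKED_DOMAINS = {"spam.example.com"}                         # example list only
seen = set()  # stand-in for a Bloom filter / disk-backed "already seen" store

def normalize(url):
    """Lowercase scheme and host, drop fragments, strip trailing slashes."""
    p = urlparse(url)
    path = p.path.rstrip("/") or "/"
    return urlunparse((p.scheme.lower(), p.netloc.lower(), path, "", p.query, ""))

def url_filter(url):
    """Return a normalized URL if it should be crawled, otherwise None."""
    u = normalize(url)
    p = urlparse(u)
    if p.scheme not in ("http", "https"):
        return None
    if p.netloc in BLOCKED_DOMAINS:
        return None
    if p.path.lower().endswith(BLOCKED_EXTENSIONS):
        return None
    if u in seen:                 # skip URLs we have already queued/crawled
        return None
    seen.add(u)
    return u
```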
