Intro to Web Crawler



  1. Introduction to Web Crawlers. Presented by Rehab.
  2. INTRODUCTION
     • Definition: A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. (Wikipedia)
     • Uses: gathering pages from the Web, supporting a search engine, performing data mining, and so on.
  3. WHY CRAWLERS?
     • The Internet holds a vast expanse of information.
     • Finding relevant information requires an efficient mechanism.
     • Web crawlers provide that mechanism to the search engine.
  4. FEATURES OF A CRAWLER
     • Must provide:
       • Robustness: resilience to spider traps (see the sketch after this list)
       • Politeness: respect rules about which pages can be crawled and which cannot
     • Should provide:
       • Distributed operation
       • Scalability
       • Performance and efficiency
       • Quality
       • Freshness
       • Extensibility
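To make the robustness requirement concrete, here is a minimal illustrative sketch (in Python) of a heuristic spider-trap filter. The thresholds and the helper name looks_like_trap are assumptions for illustration, not anything from the slides.

    # A minimal robustness sketch: reject URLs that look like spider traps.
    # The limits below are illustrative thresholds, not values from the slides.
    from urllib.parse import urlparse

    MAX_URL_LENGTH = 2000       # extremely long URLs often come from traps
    MAX_PATH_DEPTH = 16         # e.g. /a/b/a/b/a/b/... generated endlessly
    MAX_REPEATED_SEGMENT = 3    # same path segment repeated many times

    def looks_like_trap(url: str) -> bool:
        """Heuristic check for dynamically generated infinite URL spaces."""
        if len(url) > MAX_URL_LENGTH:
            return True
        segments = [s for s in urlparse(url).path.split("/") if s]
        if len(segments) > MAX_PATH_DEPTH:
            return True
        # A single path segment repeating many times is a classic trap signature.
        return any(segments.count(s) > MAX_REPEATED_SEGMENT for s in set(segments))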
  5. HOW DOES A WEB CRAWLER WORK?
  6. ARCHITECTURE OF A CRAWLER
     [Diagram: WWW → DNS → Fetch → Parse → Content Seen? → URL Filter → Dup URL Elim → back to the URL Frontier; supporting stores: Doc Fingerprint, robots templates, URL set]
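The diagram can be read as one loop around the URL frontier. Below is a minimal, single-threaded Python sketch of that loop; fetch, extract_links, and allowed_by_robots are hypothetical stand-ins for the components described on the next slides.

    # A minimal sketch of the pipeline in the diagram:
    # frontier -> fetch -> parse -> Content Seen? -> URL Filter -> Dup URL Elim.
    from collections import deque
    import hashlib

    def crawl(seeds, fetch, extract_links, allowed_by_robots, max_pages=100):
        frontier = deque(seeds)         # URL frontier, seeded at the start
        seen_urls = set(seeds)          # Dup URL Elim: URLs already enqueued
        seen_fingerprints = set()       # Content Seen?: document fingerprints
        while frontier and max_pages > 0:
            url = frontier.popleft()
            html = fetch(url)           # DNS resolution + HTTP fetch
            if html is None:
                continue
            fingerprint = hashlib.sha1(html.encode()).hexdigest()
            if fingerprint in seen_fingerprints:
                continue                # same content already seen at another URL
            seen_fingerprints.add(fingerprint)
            max_pages -= 1
            for link in extract_links(url, html):        # Parse step
                if allowed_by_robots(link) and link not in seen_urls:
                    seen_urls.add(link)                  # URL Filter + Dup URL Elim
                    frontier.append(link)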
  7. ARCHITECTURE OF A CRAWLER
     • URL Frontier: holds the URLs yet to be fetched in the current crawl. At first, a seed set is stored in the URL frontier, and the crawler begins by taking a URL from the seed set.
     • DNS: domain name service resolution; looks up the IP address for a domain name.
     • Fetch: generally uses the HTTP protocol to fetch the URL.
     • Parse: the fetched page is parsed; text (images, videos, etc.) and links are extracted (see the sketch below).
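As an illustration of the Fetch and Parse steps, here is a sketch using only Python's standard library; a production crawler would use a hardened HTTP client and HTML parser, so treat this as a minimal demonstration of the flow.

    # Fetch a page over HTTP (urlopen performs the DNS lookup), then parse
    # out its links, resolving relative hrefs against the page's URL.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    def fetch_and_parse(url):
        with urlopen(url, timeout=10) as response:    # DNS lookup + HTTP GET
            html = response.read().decode("utf-8", errors="replace")
        extractor = LinkExtractor(url)
        extractor.feed(html)
        return html, extractor.links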
  8. ARCHITECTURE OF A CRAWLER
     • Content Seen?: tests whether a web page with the same content has already been seen at another URL. This requires a way to compute a fingerprint of a web page.
     • URL Filter: decides whether an extracted URL should be excluded from the frontier (e.g. because of robots.txt). URLs must also be normalized: a relative link such as <a href="disclaimer">Disclaimers</a> has to be resolved against the URL of the page it appears on.
     • Dup URL Elim: the URL is checked for duplicate elimination, so that a URL already in the frontier or already crawled is not added again. (These three steps are sketched below.)
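These steps can be sketched as follows. The SHA-1 fingerprint is a deliberately simple stand-in (real systems often use shingle-based near-duplicate detection), and normalize and allowed_by_robots are illustrative helper names; the robots.txt check uses Python's built-in robotparser.

    # Content Seen? fingerprinting, URL normalization, and the robots.txt filter.
    import hashlib
    from urllib.parse import urljoin, urlsplit, urlunsplit
    from urllib.robotparser import RobotFileParser

    def fingerprint(html: str) -> str:
        """Doc fingerprint used by the Content Seen? test."""
        return hashlib.sha1(html.encode("utf-8")).hexdigest()

    def normalize(base_url: str, href: str) -> str:
        """Resolve a relative link (e.g. href="disclaimer") and drop the fragment."""
        scheme, netloc, path, query, _fragment = urlsplit(urljoin(base_url, href))
        return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))

    def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
        """URL Filter: consult the host's robots.txt before admitting the URL."""
        parts = urlsplit(url)
        robots = RobotFileParser()
        robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            robots.read()
        except OSError:
            return False        # be conservative if robots.txt is unreachable
        return robots.can_fetch(user_agent, url)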
  9. ARCHITECTURE OF A CRAWLER
     • Other issues:
       • Housekeeping tasks:
         • Log crawl progress statistics (URLs crawled, frontier size, etc.) every few seconds.
         • Checkpointing: a snapshot of the crawler's state (the URL frontier) is committed to disk every few hours.
       • Priority of URLs in the URL frontier, based on:
         • Rate of change
         • Quality
       • Politeness (see the sketch after this list):
         • Avoid repeated fetch requests to a host within a short time span; otherwise the crawler may be blocked.
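A minimal sketch of the politeness and checkpointing points; the two-second per-host delay and the checkpoint filename are illustrative assumptions, not values from the slides.

    # Enforce a minimum delay between fetches to the same host, and
    # periodically commit a snapshot of the URL frontier to disk.
    import json
    import time
    from urllib.parse import urlsplit

    MIN_DELAY_SECONDS = 2.0     # illustrative per-host politeness delay
    last_fetch_time = {}        # host -> timestamp of the last request

    def wait_for_politeness(url: str) -> None:
        host = urlsplit(url).netloc
        elapsed = time.monotonic() - last_fetch_time.get(host, 0.0)
        if elapsed < MIN_DELAY_SECONDS:
            time.sleep(MIN_DELAY_SECONDS - elapsed)
        last_fetch_time[host] = time.monotonic()

    def checkpoint_frontier(frontier, path="frontier_checkpoint.json"):
        """Commit a snapshot of the URL frontier to disk (run every few hours)."""
        with open(path, "w") as f:
            json.dump(list(frontier), f)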
  10. Thank You for Your Attention