Working of a Web Crawler

A presentation to give an idea as to how a web crawler works


  1. How does a Web Crawler work?
  2. Steps Involved in the Crawling Process
     • The user specifies the starting URL through the crawler's GUI.
     • All the links on that page are retrieved and added to the "crawl frontier", the list of URLs still to be visited.
     • The links in the crawl frontier are then checked in turn, and the links found on those pages are retrieved.
     • This process repeats recursively (a minimal loop is sketched below).
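A minimal sketch of this loop in Python, assuming the `requests` and `beautifulsoup4` libraries; the names `crawl_frontier`, `extract_links`, and `max_pages` are illustrative and not taken from any particular crawler:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def extract_links(base_url, html):
    """Return absolute URLs for every <a href=...> found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def crawl(start_url, max_pages=100):
    crawl_frontier = deque([start_url])   # URLs still to visit
    visited = set()                       # URLs already fetched

    while crawl_frontier and len(visited) < max_pages:
        url = crawl_frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                      # unreachable page: move on

        for link in extract_links(url, response.text):
            if link not in visited:
                crawl_frontier.append(link)   # grow the crawl frontier

    return visited
```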
  3. WebSPHINX Web Crawler's GUI (screenshot; the starting URL is specified here)
  4. Starting URL, or root of the tree. The crawler "checks" whether the URL exists by sending an HTTP request to the server, then parses the page, retrieves all of its links, and repeats this process on the links obtained.
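The "check" step can be sketched as a lightweight HTTP request made before the page is parsed; a HEAD request is one common way to do this. The function name and status handling below are assumptions for illustration:

```python
import requests


def url_exists(url, timeout=5):
    """Send an HTTP HEAD request; treat any 2xx/3xx status as 'exists'."""
    try:
        response = requests.head(url, timeout=timeout, allow_redirects=True)
        return response.status_code < 400
    except requests.RequestException:
        # DNS failure, connection refused, timeout, etc.
        return False
```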
  5. The process is applied recursively down the tree: the root, a son of the root, a son of that son, and so on. The crawler constantly checks links for duplication; this avoids redundancies that would otherwise take a toll on the efficiency of the crawling process.
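Duplicate checking is usually done against a set of normalized URLs, so that trivially different forms of the same address (upper-case host, trailing slash, #fragment) are not crawled twice. The normalization rules below are one possible choice, not a standard:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so duplicates compare equal."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    return urlunsplit((
        scheme.lower(),
        netloc.lower(),
        path.rstrip("/") or "/",   # treat '/page' and '/page/' as the same
        query,
        "",                        # drop the #fragment entirely
    ))


seen = set()


def is_duplicate(url):
    """Record the URL and report whether it was already seen."""
    key = normalize(url)
    if key in seen:
        return True
    seen.add(key)
    return False
```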
  6. Theoretically, this process should continue until every link has been retrieved. In practice, the crawler goes only five levels deep from the home page of the URL it visits; beyond that, it is assumed there is no need to go further, although crawling can still continue from other URLs. Here, the process stops after five depths.
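Depth limiting can be expressed by storing each URL in the frontier together with its distance from the seed. A sketch under the assumption that the caller supplies `fetch` and `extract_links` functions (illustrative interfaces), with the slide's limit of five levels:

```python
from collections import deque

MAX_DEPTH = 5


def crawl_limited(start_url, fetch, extract_links):
    """Breadth-first crawl that never goes more than MAX_DEPTH levels
    below the starting URL."""
    frontier = deque([(start_url, 0)])   # (url, depth) pairs
    visited = set()

    while frontier:
        url, depth = frontier.popleft()
        if url in visited or depth > MAX_DEPTH:
            continue
        visited.add(url)

        html = fetch(url)
        if depth == MAX_DEPTH:
            continue                     # depth limit reached: do not expand further
        for link in extract_links(url, html):
            frontier.append((link, depth + 1))

    return visited
```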
  7. The red crosses signify that crawling cannot continue from a particular URL. This can arise in the following cases: 1) the server hosting the URL is taking too long to respond; 2) the server does not allow the crawler access; 3) the URL is skipped to avoid duplication; 4) the crawler has been specifically designed to ignore such pages (the "politeness" of a crawler).
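Politeness is commonly implemented by consulting the site's robots.txt before fetching, and slow servers are handled with a timeout; Python's standard library includes a robots.txt parser. The user-agent string and timeout value below are assumptions:

```python
from urllib import robotparser
from urllib.parse import urlsplit

import requests

USER_AGENT = "ExampleCrawler/0.1"   # illustrative user-agent string


def allowed_by_robots(url):
    """Return False if the site's robots.txt forbids this crawler."""
    parts = urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = robotparser.RobotFileParser(robots_url)
    try:
        parser.read()
    except OSError:
        return True                 # no readable robots.txt: assume allowed
    return parser.can_fetch(USER_AGENT, url)


def fetch_if_permitted(url):
    """Skip slow or disallowed URLs, mirroring the 'red cross' cases."""
    if not allowed_by_robots(url):
        return None                 # server does not allow the crawler
    try:
        return requests.get(url, timeout=5).text
    except requests.RequestException:
        return None                 # server taking too long, or unreachable
```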
  8. Why go only to 5 depths?
     • Normally, five depths (levels) of search are enough to gather the majority of the information reachable from the website's home page.
     • It is also a precaution against "spider traps".
     Spider traps are web pages that contain an infinite loop, e.g. http://webcrawl.com/web/crawl/web/crawl ... The crawler gets trapped in the page and can even crash. Traps can be intentional or unintentional: they are created intentionally to trap crawlers, since crawlers eat up the page's bandwidth, and unintentionally in cases such as a dynamically generated calendar, where each date points to the next date and each year to the next year. A crawler's ability to avoid spider traps is known as its "robustness".
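Beyond the depth limit, one simple robustness heuristic (an assumption here, not something from the slides) is to reject URLs whose paths are unusually deep or that repeat the same segments over and over, as in the /web/crawl/web/crawl example:

```python
from urllib.parse import urlsplit


def looks_like_spider_trap(url, max_segments=10, max_repeats=2):
    """Heuristic: flag URLs with very deep paths or with any path
    segment repeated more than `max_repeats` times."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_segments:
        return True
    return any(segments.count(s) > max_repeats for s in set(segments))


# The looping URL pattern from the slide is flagged; a normal page is not.
print(looks_like_spider_trap("http://webcrawl.com/web/crawl/web/crawl/web/crawl"))  # True
print(looks_like_spider_trap("http://webcrawl.com/web/crawl"))                      # False
```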
