Working of a Web Crawler

A presentation giving an idea of how a web crawler works.

Working of a Web Crawler: Presentation Transcript

  • How does a Web Crawler work?
    Steps Involved in the Crawling Process
    • The user specifies the starting URL through the GUI of the crawler.
    • All the links on that page are retrieved and added to the “crawl frontier”, the list of URLs still to visit.
    • Each link in the crawl frontier is then checked in turn, and the links found on those pages are retrieved as well.
    • This process keeps happening recursively (a minimal sketch of this loop, in Python, follows below).
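The slides describe this loop in prose only; the following is a minimal Python sketch of it, added for illustration. The helper fetch_links() is an assumed stand-in for the “check the URL and retrieve its links” step covered on the later slides.

```python
# Minimal sketch of the crawl loop described above (illustrative only).
# fetch_links(url) is an assumed helper that returns the URLs found on a page.
from collections import deque

def crawl(start_url, fetch_links):
    frontier = deque([start_url])   # the "crawl frontier": URLs still to visit
    seen = {start_url}              # URLs already queued, so they are not re-added

    while frontier:
        url = frontier.popleft()            # take the next URL to visit
        for link in fetch_links(url):       # retrieve all the links on that page
            if link not in seen:
                seen.add(link)
                frontier.append(link)       # the process repeats on these links
```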
  • WebSPHINX Web Crawler’s GUI: the starting URL is specified here.
  • Starting URL, or root of the tree: the crawler “checks” whether the URL exists by sending an HTTP request to the server, parses the page, retrieves all of its links, and then repeats the process on the links so obtained (see the sketch below).
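A possible shape for that “check and retrieve” step, using only Python’s standard library. This is a sketch under those assumptions, not the actual WebSPHINX implementation (WebSPHINX itself is a Java tool).

```python
# Send an HTTP request to the server, confirm the URL responds, and
# collect the links on the returned page.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def fetch_links(url):
    """Return the links found at url, or an empty list if it cannot be read."""
    with urlopen(url, timeout=10) as response:      # the HTTP request to the server
        if response.status != 200:                  # the URL "exists" only if the server says so
            return []
        html = response.read().decode("utf-8", errors="replace")
    collector = LinkCollector(url)
    collector.feed(html)
    return collector.links
```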
  • The process is done recursively down the tree: root of the tree, son of the root, son of that son, and so on. The crawler has to constantly check the links for duplication; redundant visits would otherwise take a toll on the efficiency of the crawling process (one possible duplicate check is sketched below).
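One way the duplicate check could be implemented: reduce each URL to a canonical form before comparing it with the set of URLs already seen. The normalization rules below (dropping #fragments, lower-casing scheme and host) are illustrative assumptions rather than the scheme of any particular crawler.

```python
# Duplicate detection by normalizing URLs before membership tests.
from urllib.parse import urldefrag, urlsplit, urlunsplit

visited = set()

def normalize(url):
    """Reduce equivalent URLs to one canonical form (illustrative rules only)."""
    url, _fragment = urldefrag(url)          # "#section" anchors name the same page
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def is_new(url):
    """True the first time a URL is seen; False for duplicates."""
    key = normalize(url)
    if key in visited:
        return False
    visited.add(key)
    return True
```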
  • Theoretically, this process should continue until all the links have been retrieved. In practice, the crawler goes only 5 levels of depth from the home page of the URL it visits; after that, it is concluded that there is no need to go further, though crawling can still continue from other URLs. Here, the process stops after five depths (a depth-limited version of the crawl loop is sketched below).
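A depth-limited variant of the earlier crawl loop, reflecting the 5-level cutoff just described. Each URL is queued together with its distance from the starting page; fetch_links() is again an assumed helper.

```python
# Crawl loop that stops expanding pages deeper than MAX_DEPTH levels.
from collections import deque

MAX_DEPTH = 5

def crawl_depth_limited(start_url, fetch_links):
    frontier = deque([(start_url, 0)])      # each entry: (url, depth from the start page)
    seen = {start_url}
    while frontier:
        url, depth = frontier.popleft()
        if depth >= MAX_DEPTH:              # no need to go further than 5 levels
            continue
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
```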
  • The red crosses signify that crawling cannot be continued from that particular URL. This can arise in the following cases: 1) the server hosting the URL is taking too long to respond; 2) the server is not allowing the crawler access; 3) the URL is left out to avoid duplication; 4) the crawler has been specifically designed to ignore such pages (the “politeness” of a crawler). The sketch below shows how some of these cases might be handled.
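A sketch of how those cases might look in code: a timeout abandons servers that respond too slowly, exceptions cover servers that refuse access, and robots.txt is consulted before a page is requested at all, which is one common form of crawler politeness. The user-agent name is a placeholder.

```python
# Politeness and failure handling around a single page fetch.
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"   # placeholder name, not a real crawler

def may_fetch(url):
    """Consult robots.txt: a polite crawler skips pages the server disallows."""
    robots = RobotFileParser(urljoin(url, "/robots.txt"))
    try:
        robots.read()
    except OSError:                 # no reachable robots.txt: assume allowed
        return True
    return robots.can_fetch(USER_AGENT, url)

def fetch(url):
    """Return page bytes, or None when crawling cannot continue from this URL."""
    if not may_fetch(url):                          # server policy forbids access
        return None
    try:
        with urlopen(url, timeout=5) as response:   # give up on servers that are too slow
            return response.read()
    except OSError:                                 # refused access, network error, timeout
        return None
```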
  • Why only go to 5 levels of depth?
    • Normally, 5 depths or levels of search are enough to gather the majority of the information reachable from the home page of the website.
    • It is also a precaution against “spider traps”.
    Spider traps are web pages containing an infinite loop within them, e.g. http://webcrawl.com/web/crawl/web/crawl ... A crawler caught in such a page is trapped and can even crash. Traps can be intentional or unintentional: they are created intentionally to trap crawlers, which would otherwise eat up the page’s bandwidth, and unintentionally, as in a dynamically created calendar where each date points to the next date and each year to the next year. A crawler’s ability to avoid spider traps is known as the “robustness” of the crawler (a simple trap-detection heuristic is sketched below).
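A simple heuristic that could help a crawler sidestep traps like the repeating path in the example: reject URLs whose path is unusually deep or made up largely of repeating segments. The thresholds here are illustrative assumptions; in practice such checks are combined with the depth limit already described.

```python
# Heuristic guard against spider-trap URLs (illustrative thresholds).
from urllib.parse import urlsplit

MAX_PATH_SEGMENTS = 8

def looks_like_trap(url):
    """Flag URLs with suspiciously deep or largely repeating paths."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    too_deep = len(segments) > MAX_PATH_SEGMENTS
    repeating = len(segments) >= 4 and len(set(segments)) <= len(segments) // 2
    return too_deep or repeating

print(looks_like_trap("http://webcrawl.com/web/crawl/web/crawl"))   # True
print(looks_like_trap("http://example.com/blog/2011/05/post"))      # False
```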