Working of a Web Crawler

Steps Involved in the Crawling Process

The starting URL is specified in the WebSPHINX web crawler’s GUI.

Starting URL (root of the tree): The crawler checks that the URL exists by sending an HTTP request to the server. It then parses the page, retrieves all the links on it, and repeats the process on each link obtained.
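The check-and-parse step can be sketched with Python's standard library. This is only an illustration, not how WebSPHINX itself is implemented. The names `LinkParser`, `extract_links` and `url_exists` are hypothetical, and using a HEAD request for the existence check is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return the absolute URLs linked from a page, without #fragments."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        links.append(absolute)
    return links


def url_exists(url, timeout=5.0):
    """Assumed existence check: send a HEAD request, treat any error as 'no'."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are all subclasses of OSError.
        return False
```

A crawler would call `url_exists` on a URL, fetch the page body, pass it to `extract_links`, and then repeat the same steps on each returned link.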

The process is carried out recursively: from the root of the tree to a child of the root, then to a child of that child, and so on. The crawler has to check the links constantly for duplicates. This avoids redundant fetches, which would otherwise take a toll on the efficiency of the crawling process.
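One common way to do the duplicate check is to keep a set of URLs already seen and to compare URLs only after normalizing them. The sketch below assumes that approach. `normalize`, `crawl_unique` and the `get_links` callback are hypothetical names, and the traversal uses a queue instead of literal recursion so that a deep tree cannot exhaust the call stack.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lower-case the scheme and host, drop the #fragment, default the path to '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def crawl_unique(start_url, get_links):
    """Visit every reachable URL once; get_links(url) returns the links on a page."""
    start = normalize(start_url)
    seen = {start}            # every URL ever queued
    queue = deque([start])    # URLs waiting to be visited
    order = []                # URLs in the order they were visited
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            link = normalize(link)
            if link not in seen:   # the duplicate check
                seen.add(link)
                queue.append(link)
    return order
```

Marking a URL as seen when it is queued, not when it is visited, keeps the same link from being queued twice by two different pages.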

In theory, this process should continue until all the links have been retrieved. In practice, the crawler goes only 5 levels deep from the home page of the site it visits. Beyond that, it concludes that there is no need to go further, although crawling can still continue from other URLs. In the example shown, the process stops after five levels.
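A depth limit can be enforced by recording each URL's distance from the starting page. The sketch below assumes the home page counts as depth 0, so the deepest pages collected are at depth 5. `crawl_to_depth` and the `get_links` callback are hypothetical names.

```python
from collections import deque

MAX_DEPTH = 5  # the five-level limit described above


def crawl_to_depth(start_url, get_links, max_depth=MAX_DEPTH):
    """Return {url: depth} for every URL within max_depth links of start_url."""
    depth_of = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = depth_of[url]
        if depth == max_depth:
            continue  # pages at the limit are kept, but their links are not followed
        for link in get_links(url):
            if link not in depth_of:
                depth_of[link] = depth + 1
                queue.append(link)
    return depth_of
```

Because the traversal is breadth-first, the depth recorded for a URL is its shortest distance from the start, so a page is not dropped just because it was first reached along a longer path.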

The red crosses signify that crawling cannot be continued from that particular URL. This can arise in the following cases:
1) The server hosting the URL is taking too long to respond.
2) The server is not allowing access to the crawler.
3) The URL is left out to avoid duplication.
4) The crawler has been specifically designed to ignore such pages (the “politeness” of a crawler).
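The four cases above can be sketched as a single decision function. This is an illustration under several assumptions: politeness is modelled here by honouring robots.txt rules, which is one common interpretation and not necessarily the one the slides intend. The `fetch` callback is hypothetical and is assumed to raise `TimeoutError` for a slow server and `PermissionError` for a refused request. `skip_reason` and the user agent `ExampleCrawler` are made-up names.

```python
from urllib.robotparser import RobotFileParser


def skip_reason(url, seen, robots, fetch, user_agent="ExampleCrawler"):
    """Return why crawling stops at this URL, or None if the page can be crawled."""
    if url in seen:
        return "duplicate"                    # case 3
    if not robots.can_fetch(user_agent, url):
        return "ignored out of politeness"    # case 4 (assumed: robots.txt rules)
    try:
        fetch(url)
    except TimeoutError:
        return "server too slow"              # case 1
    except PermissionError:
        return "access denied"                # case 2
    return None


# Example robots.txt rules, parsed from memory instead of fetched from a server.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
```

The cheap checks come first: the duplicate and robots.txt tests need no network traffic, so the request is sent only for URLs that pass both.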

Why go only 5 levels deep?

Spider traps: web pages that contain an infinite loop. E.g. http://webcrawl.com/web/crawl/web/crawl ...
The crawler gets trapped in the page, or can even crash.
Traps can be intentional or unintentional. Intentional traps are set up to catch crawlers, because crawlers eat up the site’s bandwidth. Unintentional traps arise, for example, in a dynamically generated calendar, where each date links to the next date and each year to the next year.
A crawler's ability to avoid spider traps is known as the “robustness” of the crawler.
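Besides a depth limit, a crawler can apply simple URL heuristics to spot traps like the looping path in the example. The sketch below is one such heuristic, not a complete defence. `looks_like_trap` and its thresholds are hypothetical and would need tuning for real sites.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url, max_repeats=2, max_segments=12, max_length=256):
    """Flag URLs that are very long, very deep, or repeat a path segment too often."""
    if len(url) > max_length:
        return True
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_segments:
        return True
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())
```

A heuristic like this catches repeating paths, but not the calendar case, where every URL looks different. For that kind of trap, a depth limit or a cap on the number of pages fetched per site is the more dependable safeguard.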
