Steps Involved in the Crawling Process
<ul><li>The user specifies the starting URL through the crawler's GUI.</li><li>All links on the page at this URL are retrieved and added to the “crawl frontier”, the list of URLs still to visit.</li><li>Each URL in the crawl frontier is then visited in turn, and the links found on those pages are retrieved as well.</li><li>This process repeats recursively.</li></ul>
WebSPHINX Web Crawler’s GUI (figure): the starting URL is specified here.
Starting URL (root of the tree): The crawler “checks” whether the URL exists by sending an HTTP request to the server. It then parses the page, retrieves all the links in it, and repeats this process on the links obtained.
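The "parse and retrieve all the links" step can be sketched with Python's standard library. This is a minimal illustration, not how WebSPHINX does it: the class `LinkExtractor`, the helper `extract_links`, and the sample HTML are hypothetical names made up here, and the page is passed in as a string rather than fetched over HTTP.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every link found in the given HTML text (hypothetical helper)."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


sample = '<a href="/about">About</a> <a href="news.html">News</a>'
print(extract_links("http://example.com/index.html", sample))
```

In a real crawler the HTML would come from the HTTP response for the URL being checked; the extracted links are what gets added to the crawl frontier.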
Figure labels: root of the tree; child of the root; child of that child. The process is done recursively. The crawler has to constantly check the links for duplicates. This avoids redundant visits, which would otherwise take a toll on the efficiency of the crawling process.
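One simple way to do the duplicate check is to keep a set of URLs already seen, after reducing each URL to a normalized form. The sketch below is an assumption about how this might look, not the method of any particular crawler; `normalize` and `SeenSet` are hypothetical names, and the normalization rules (lower-casing the scheme and host, dropping the fragment, treating an empty path as `/`) are only one reasonable choice.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to a canonical form so trivial variants compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    # The fragment (#...) never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class SeenSet:
    """Remembers which URLs have already been queued or visited."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Return True if the URL is new, False if it is a duplicate."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


seen = SeenSet()
print(seen.add_if_new("http://Example.com"))       # first time seen
print(seen.add_if_new("http://example.com/#top"))  # same page, different spelling
```

A set gives constant-time membership checks on average, which matters because this check runs once for every link the crawler finds.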
Theoretically, this process should continue until all the links have been retrieved. In practice, the crawler goes only 5 levels deep from the home page of the site it visits. Beyond that depth, it is concluded that there is no need to go further, although crawling can still continue from other URLs. (Figure: here, the process stops after five levels of depth.)
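The depth limit can be sketched by storing each URL in the crawl frontier together with its depth from the starting page. This is a minimal illustration under stated assumptions: `crawl`, `get_links`, and the in-memory `toy_web` dictionary are hypothetical stand-ins for a real fetch-and-parse step, so no network access is involved.

```python
from collections import deque

MAX_DEPTH = 5  # depth limit described above


def crawl(start_url, get_links, max_depth=MAX_DEPTH):
    """Breadth-first crawl that stops expanding pages at max_depth."""
    frontier = deque([(start_url, 0)])  # the crawl frontier: (url, depth)
    seen = {start_url}
    visited = []
    while frontier:
        url, depth = frontier.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # do not follow links from pages at the limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited


# Toy "web": a chain p0 -> p1 -> ... -> p7, standing in for real pages.
toy_web = {f"p{i}": [f"p{i + 1}"] for i in range(7)}
result = crawl("p0", lambda url: toy_web.get(url, []))
print(result)
```

With this chain, pages `p0` through `p5` are visited and `p6` is never reached, because `p5` already sits at depth 5.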
The red crosses signify that crawling cannot continue from that particular URL. This can happen in the following cases: 1) The server hosting the URL is taking too long to respond. 2) The server is not allowing access to the crawler. 3) The URL is left out to avoid duplication. 4) The crawler has been specifically designed to ignore such pages (the “Politeness” of a crawler).
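A common way for a server to tell crawlers which pages to stay away from is a `robots.txt` file; the slides do not name this mechanism, so treating it as the basis for cases 2 and 4 is an assumption here. The sketch uses Python's standard `urllib.robotparser` and parses the rules from an in-memory string, so nothing is fetched. The user agent name `MyCrawler` and the rules themselves are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for example.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="MyCrawler"):
    """Return True if the parsed rules allow this user agent to fetch url."""
    return rules.can_fetch(user_agent, url)


print(may_crawl("http://example.com/index.html"))
print(may_crawl("http://example.com/private/data.html"))
```

A URL for which `may_crawl` returns False would be one of the red crosses: the crawler skips it rather than sending the request.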
Why go only 5 levels deep?
<ul><li>Normally, 5 depths or levels of search are enough to gather the majority of the information reachable from the home page of the website.</li><li>It is a precaution to avoid “Spider Traps”.</li></ul>
Spider Traps: Web pages containing an infinite loop within them, e.g. http://webcrawl.com/web/crawl/web/crawl ... The crawler is trapped in the page or can even crash. Traps can be intentional or unintentional. They are set intentionally to trap crawlers, because crawlers eat up the page’s bandwidth. They are created unintentionally in cases such as a dynamically generated calendar, where each date points to the next date and each year to the next year. A crawler's ability to avoid spider traps is known as the “Robustness” of the crawler.
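Besides the depth limit, a crawler can also inspect the URL itself for signs of a trap. The heuristic below is only a sketch, not a standard algorithm: the function `looks_like_trap` and its thresholds `max_repeats` and `max_segments` are assumptions chosen for illustration, and the test URLs are hypothetical finite versions of the looping pattern shown above.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url, max_repeats=2, max_segments=10):
    """Flag URLs whose path is very long or keeps repeating a segment."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_segments:
        return True
    counts = Counter(segments)
    # A segment appearing more than max_repeats times suggests a loop.
    return any(n > max_repeats for n in counts.values())


print(looks_like_trap("http://webcrawl.com/web/crawl"))
print(looks_like_trap("http://webcrawl.com/web/crawl/web/crawl/web/crawl"))
```

A check like this does not catch every trap (the calendar case produces ever-new URLs without repeated segments), which is why it complements the depth limit rather than replacing it.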