Coding for a wget based Web Crawler

A web crawler I created for a project on Tools for Web Crawling

  1. Making a Web Crawler
  2. Code
  3. Understanding the code: Why has the “-r” option been included? -r turns on recursive retrieving, which is essential to the working of a crawler. Without it, the crawler cannot retrieve the links, as can be seen when we remove this option.
  5. Why has the “--spider” option been included? This option makes wget behave like a spider, i.e., it will not download the web pages; it will just check that they are there.
  7. Why has the “--domains” option been included? This option specifies the domain of the search. We have limited the crawling to the domain of the URL specified by the user. The next slide shows the crawler's response when this is not done.
  9. Clearly, the crawler cannot access www.google.co.in, as its host name is not the same as www.google.com.
  10. Why has the “-l 5” option been included? This option specifies the depth of the search; it is a precaution against spider traps. Why has the “--tries=5” option been included? This option specifies the number of retries the crawler will make if the connection to the URL fails. (A sketch combining all of these options into a single command follows below.)
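
Putting the options described in the slides together, the wget invocation behind the “Code” slide would look roughly like the sketch below. The target URL (http://www.google.com) is an assumption inferred from the www.google.com / www.google.co.in example; the exact URL and option order in the original slides may differ.

    # Check that pages exist without downloading them (--spider), recursing
    # through links (-r) to a depth of 5 levels (-l 5), following only links
    # inside www.google.com (--domains), and retrying a failed connection
    # up to 5 times (--tries=5). The starting URL is assumed here.
    wget -r --spider --domains=www.google.com -l 5 --tries=5 http://www.google.com

Removing --spider from the same command would make wget actually download the pages it visits instead of only checking that they exist.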
