Coding for a wget based Web Crawler

A web crawler I created for a project on Tools for Web Crawling

  1. Making a Web Crawler
  2. Code (a sketch of the command appears after this list)
  3. Understanding the code: Why has the “-r” option been included? -r turns on recursive retrieving, which is essential to the working of a crawler. Without it the crawler cannot retrieve the links, as can be seen when we remove this option.
  4. Why has the “--spider” option been included? This option makes wget behave like a spider, i.e., it will not download the web pages; it will just check that they are there.
  5. Why has the “--domains” option been included? This option specifies the domain of the search. We have limited the crawling to the URL specified by the user only. The next slide shows the crawler's response when this is not done (see the second sketch after this list).
  6. Clearly, the crawler cannot access www.google.co.in, as the host name is not the same as www.google.com.
  7. Why has the “-l 5” option been included? This option specifies the depth of the search. It is a precaution to avoid spider traps. Why has the “--tries=5” option been included? This option specifies the number of retries the crawler will make in case the connection with the URL has failed.
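
Only the title of the “Code” slide appears in this transcript, not the command itself. A minimal sketch, assuming www.google.com stands in for the user-supplied URL and simply combining the options discussed above, might look like this:

  # Check (but do not download) pages recursively, up to 5 levels deep,
  # retrying failed connections up to 5 times and staying on www.google.com.
  wget --spider -r --domains=www.google.com -l 5 --tries=5 www.google.com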

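The comparison referred to in items 5 and 6 (“the next slide”) is not transcribed either; a sketch of the two runs being compared, under the same assumptions, would be:

  # Without the domain restriction; per the slides, the crawler also
  # wanders off to www.google.co.in:
  wget --spider -r -l 5 --tries=5 www.google.com

  # With the restriction, only URLs on www.google.com are followed:
  wget --spider -r -l 5 --tries=5 --domains=www.google.com www.google.com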