Coding for a wget based Web Crawler
A web crawler I created for a project on Tools for Web Crawling

Transcript

  • 1. Making a Web Crawler
  • 2. Code (shown as a screenshot in the original slides; a sketch of the full command appears after this transcript)
  • 3. Understanding the code: Why has the "-r" option been included? -r turns on recursive retrieval, which is essential to the working of a crawler; without it the crawler cannot follow links, as can be seen when we remove this option.
  • 4. (screenshot: the crawler's output when the -r option is removed)
  • 5. Why has the "--spider" option been included? This option makes wget behave like a spider, i.e. it will not download the web pages; it will just check that they are there.
  • 6. (screenshot: the crawler's output with the --spider option)
  • 7. Why has the "--domains" option been included? This option restricts the crawl to the specified domain(s); we have limited crawling to the host of the URL given by the user. The next slide shows the crawler's response when this restriction is removed.
  • 8. (screenshot: the crawler's response without the --domains restriction)
  • 9. Clearly, the crawler cannot access www.google.co.in, as its host name is not the same as www.google.com.
  • 10. Why has the "-l 5" option been included? This option limits the depth of the search to five levels, as a precaution against spider traps. Why has the "--tries=5" option been included? This option specifies the number of times the crawler will retry a URL when the connection fails.
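
The wget invocation itself appears only as a screenshot on slide 2, so the following is a minimal sketch reconstructed from the options discussed in the slides. The target URL (www.google.com, taken from slide 9) and the exact way the options are combined are assumptions, not the author's original command.

    # Minimal sketch of the kind of command the slides describe (assumed, not the original).
    #   --spider      : act as a spider; check that pages exist instead of saving them
    #   -r            : recursive retrieval, i.e. follow the links found on each page
    #   -l 5          : limit the recursion depth to 5 levels, a guard against spider traps
    #   --domains=... : only follow links whose host matches the listed domain(s)
    #   --tries=5     : retry a failed connection up to 5 times
    wget --spider -r -l 5 --domains=www.google.com --tries=5 http://www.google.com

Because --spider discards the page bodies, the pages the crawler checks are reported in wget's log output (standard error, or a file named with -o) rather than saved to disk.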
