Coding for a wget based Web Crawler

A web crawler I created for a project on Tools for Web Crawling

    Presentation Transcript

    • Making a Web Crawler
    • Code (an example invocation is sketched after this transcript)
    • Understanding the code
    • Why has the “-r” option been included? -r turns on recursive retrieval, which is essential to the working of a crawler. Without it the crawler cannot follow and retrieve links, as can be seen when the option is removed.
    • Why has the “--spider” option been included? This option makes wget behave like a spider, i.e., it will not download the web pages; it will just check that they are there (a quick check with this option on its own is shown after this transcript).
    • Why has the “--domains” option been included? This option specifies the domain of the search; we have limited crawling to the domain of the URL specified by the user. The next slide shows the crawler's response when this is not done.
    • Clearly, the crawler cannot access www.google.co.in, because its host name is not the same as www.google.com.
    • Why has the “-l 5” option been included? This option specifies the depth of the search; it is a precaution against spider traps.
    • Why has the “--tries=5” option been included? This option specifies the number of times the crawler will retry when a connection to a URL fails.
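
    The exact command from the “Code” slide is not captured in this transcript, so the sketch below simply puts the options discussed above together; www.google.com stands in for the URL supplied by the user, since that is the example mentioned in the slides.

        # -r            : recursive retrieval, so the crawler follows links
        # --spider      : do not save the pages, only check that they exist
        # --domains=... : restrict the crawl to the user-specified domain
        # -l 5          : limit the recursion depth to 5 (precaution against spider traps)
        # --tries=5     : retry a failed connection up to 5 times
        wget -r --spider --domains=www.google.com -l 5 --tries=5 http://www.google.com/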
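
    As a quick check of the “--spider” behaviour on its own, the option can be run against a single page: wget then reports whether the page is reachable without saving anything, and exits with a non-zero status if it is not. Again, the URL is only the example used in the slides.

        wget --spider --tries=5 http://www.google.com/
        echo "Exit status: $?"    # 0 if the page was found, non-zero if it could not be reached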