Why has the “ –r ” option been included? -r turns on recursive retrieving, which is essential to the working of a crawler. Without it the crawler cannot retrieve the links as can be seen when we remove this option. Understanding the code
Why has the “ –spider ” option been included? This option makes wget behave like a spider, i.e, it will not download the web pages, it will just check that they are there.
Why has the “ -domains ” option been included? This option specifies the domain of the search. We have limited the crawling to the URL specified by the user only. The next slide shows the crawler's response when this is not done
Clearly, the crawler cannot access www.google.co.in as the host name is not the same as www.google.com
Why has the “ -l 5 ” option been included? This option specifies the depth of the search. It is a precaution to avoid spider traps. Why has the “ --tries = 5 ” option been included? This option specifies the number of retries which the crawler will make in case connection with the URL has failed.