2. ...programs which scan the web in a methodical and automated way. ...they copy all the pages they visit and leave them to the search engine for indexing. ...not all spiders have the same job though, some check links, or collect email addresses, or validate code for example. Spiders are... ...some people call them crawlers, bots and even ants or worms. (“Spidering” means to request every page on a site)
3. A spider's architecture: Downloads web pages Stuff is stored URLs get queued Co-ordinates the processes
5. The crawl list would look like this (although it would be much much bigger than this small sample): http://www.techcrunch.com/ http://www.crunchgear.com/ http://www.mobilecrunch.com/ http://www.techcrunchit.com/ http://www.crunchbase.com/ http://www.techcrunch.com/# http://www.inviteshare.com/ http://pitches.techcrunch.com/ http://gillmorgang.techcrunch.com/ http://www.talkcrunch.com/ http://www.techcrunch50.com/ http://uk.techcrunch.com/ http://fr.techcrunch.com/ http://jp.techcrunch.com/ The spider will also save a copy of each page it visits in a database. The search engine will then index those. The first URLs given to the spider as a starting point are called “seeds”. The list gets bigger and bigger and in order to make sure that the search engine index is current, the spider will need to re-visit those links often to track any changes. There are 2 lists: a list of URLs visited and a list of URLs to visit. This list is known as “The crawl frontier”.
6.
7.
8. A re-visit policy that states when to check for changes to the pages.
11. Build a spider You can use any programming language that you feel comfortable with, although JAVA, Perl and C# ones are the most popular. You can also use these tutorials: Java sun spider - http://tiny.cc/e2KAy Chilkat in python - http://tiny.cc/WH7eh Swish-e in Perl - http://tiny.cc/nNF5Q Remember that a poorly designed spider can impact overall network and server performance.
12. OpenSource spiders You can use one of these for free (some knowledge of programming can help in setting them up): OpenWebSpider in C# - http://www.openwebspider.org Arachnid in Java - http://arachnid.sourceforge.net/ Java-web-spider - http://code.google.com/p/java-web-spider/ MOMSpider in perl - http://tiny.cc/36XQA
13. Robots.txt This is a file that allows webmasters to give instructions to visiting spiders who must respect it. Some areas are off-limits. Disallow spider from everything User-agent: * Disallow: / Disallow all except Googlebot and BackRub, which can access /private User-agent: Googlebot User-agent: BackRub Disallow: /private and churl, which can access everything User-agent: churl Disallow:
20. Share your results List your spider in the database http://www.robotstxt.org/db.html
21. Spider traps Intentionally and non-intentionally, traps crop up on the spider's path sometimes and stop it functioning properly. Dynamic pages, deep directories that never end, pages with special links and commands pointing the spider to other directories...anything that can put the spider into an infinite loop is an issue. You might however want to deploy a spider trap if you know that one is visiting your site and not respecting your robots.txt for example or because it's a spambot.
22. Fleiner's spider trap <html><head><title> You are a bad netizen if you are a web bot! </title> <body><h1><b> You are a bad netizen if you are a web bot! </h1></b> <!--#config timefmt="%y%j%H%M%S" --> <!-- of date string --> <!--#exec cmd="sleep 20" --> <!-- make this page sloooow to load --> To give robots some work here some special links: these are <a href=a<!--#echo var="DATE_GMT" -->.html> some links </a> to this <a href=b<!--#echo var="DATE_GMT" -->.html> very page </a> but with <a href=c<!--#echo var="DATE_GMT" -->.html> different names </a> You can download spider traps and find out more at Fleiner's page: http://www.fleiner.com/bots/#trap