WHAT ARE WEB ROBOTS?
Web Robots (also known as Web Wanderers, Crawlers, or Spiders) are programs that traverse the Web automatically. Search engines such as Google use them to index web content, spammers use them to scan for email addresses, and they have many other uses.
WHAT IS ROBOTS.TXT?
Robots.txt is a plain text file that you upload to the root directory of your site. When web spiders (ants, bots, indexers) visit your site to index it, they first look for this file and process it. Put differently, robots.txt tells spiders which pages they may crawl.
THE SIMPLEST VERSION OF ROBOTS.TXT
    User-agent: *
    Disallow:
The first line, "User-agent: *", indicates that the following lines apply to all robots. An empty value after "Disallow:" means that nothing is restricted. This robots.txt file effectively does nothing: it allows all robots to see everything on the site.
SOME MORE EXAMPLES OF ROBOTS.TXT
To exclude all robots from the entire server:
    User-agent: *
    Disallow: /
To allow all robots complete access:
    User-agent: *
    Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)
To exclude all robots from part of the server:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /~joe/
To exclude a single robot:
    User-agent: BadBot
    Disallow: /
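One way to see how a crawler might interpret the partial-exclusion rules is to feed them to Python's standard urllib.robotparser module. This is only a sketch: the robot name "ExampleBot", the host www.example.com, and the helper is_allowed are assumptions made for illustration, and real crawlers may apply their own matching rules.

```python
from urllib.robotparser import RobotFileParser

# The partial-exclusion rules from the slide, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` (hypothetical host and robot name)."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


for path in ("/index.html", "/cgi-bin/search", "/tmp/cache.html"):
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Each Disallow value is treated as a path prefix, so anything under /cgi-bin/ or /tmp/ is refused while the rest of the site stays open.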
To allow a single robot:
    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /
You can disallow single pages:
    User-agent: *
    Disallow: /~joe/junk.html
    Disallow: /~joe/foo.html
    Disallow: /~joe/bar.html
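The "allow a single robot" example relies on a robot picking the record that names it before falling back to the "*" record. A minimal sketch with Python's urllib.robotparser illustrates this; "OtherBot" and the URL are made-up names for the demonstration.

```python
from urllib.robotparser import RobotFileParser

# Two records: one for Googlebot (nothing disallowed), one for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://www.example.com/products.html"  # hypothetical page
googlebot_ok = parser.can_fetch("Googlebot", url)
otherbot_ok = parser.can_fetch("OtherBot", url)  # hypothetical robot name

print("Googlebot:", googlebot_ok)
print("OtherBot:", otherbot_ok)
```

Googlebot matches its own record, whose empty Disallow permits everything; any other robot falls through to the "*" record and is blocked from the whole site.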
You can specify the Sitemap location in your robots.txt file:
    User-agent: *
    Disallow: /
    Sitemap: http://www.example.com/sitemap.xml
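A crawler can read the Sitemap line with the same standard-library parser. The sketch below assumes Python 3.8 or later, where RobotFileParser.site_maps() is available; "ExampleBot" is a hypothetical robot name.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from the slide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of Sitemap URLs, or None when there are none.
sitemaps = parser.site_maps() or []
# The Disallow rule is evaluated independently of the Sitemap line.
blocked = not parser.can_fetch("ExampleBot", "http://www.example.com/")

print(sitemaps)
print("site blocked:", blocked)
```

Note that in this particular example the "Disallow: /" rule still blocks the whole site for compliant robots; the Sitemap line only announces where the Sitemap lives.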
ABOUT THE ROBOTS <META> TAG
You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not to scan it for links to follow.
    <html>
    <head>
    <title>...</title>
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    </head>
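To illustrate how a robot might read this tag, here is a small sketch using Python's standard html.parser module. The class name RobotsMetaParser is invented for this example, and an actual crawler would likely use a more forgiving HTML parser.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not attribute values.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


PAGE = """<html> <head> <title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> </head>"""

meta_parser = RobotsMetaParser()
meta_parser.feed(PAGE)

may_index = "noindex" not in meta_parser.directives
may_follow = "nofollow" not in meta_parser.directives
print(sorted(meta_parser.directives), may_index, may_follow)
```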
WHAT ARE SITEMAPS?
Sitemaps tell search engines which pages are available for crawling. A Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL:
    when it was last updated
    how often it usually changes
    how important it is, relative to other URLs in the site
SITEMAPS XML FORMAT
The Sitemap must:
    Begin with an opening <urlset> tag and end with a closing </urlset> tag.
    Specify the namespace (protocol standard) within the <urlset> tag.
    Include a <url> entry for each URL, as a parent XML tag.
    Include a <loc> child entry for each <url> parent tag.
All URLs in a Sitemap must be from a single host, such as www.example.com or store.example.com.
The Sitemap file must:
    Be UTF-8 encoded.
    Contain no more than 50,000 URLs.
    Be no larger than 50MB (52,428,800 bytes) uncompressed; earlier versions of the protocol set this limit at 10MB.
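The requirements above can be sketched with Python's standard xml.etree.ElementTree module. The function name build_sitemap, the example URLs, and the metadata values are assumptions for illustration; the <lastmod>, <changefreq>, and <priority> elements are the Sitemap protocol's names for the metadata listed earlier. The xml_declaration argument requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit from the Sitemap protocol


def build_sitemap(entries):
    """Build a UTF-8 encoded Sitemap from a list of dicts with a 'loc' key."""
    if len(entries) > MAX_URLS:
        raise ValueError("a single Sitemap may list at most 50,000 URLs")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        # Optional metadata: last update, change frequency, relative priority.
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical entries, all on the same host.
sitemap_xml = build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2024-01-15",
     "changefreq": "weekly", "priority": 1.0},
    {"loc": "http://www.example.com/about.html"},
])
print(sitemap_xml.decode("utf-8"))
```

The sketch does not check the file-size limit or that all URLs share one host; a production generator would need both checks.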
USING SITEMAP INDEX FILES (TO GROUP MULTIPLE SITEMAP FILES)
The Sitemap index file must:
    Begin with an opening <sitemapindex> tag and end with a closing </sitemapindex> tag.
    Include a <sitemap> entry for each Sitemap as a parent XML tag.
    Include a <loc> child entry for each <sitemap> parent tag.
The optional <lastmod> tag is also available for Sitemap index files.
Note: A Sitemap index file can only specify Sitemaps that are found on the same site as the Sitemap index file. For example, http://www.yoursite.com/sitemap_index.xml can include Sitemaps on http://www.yoursite.com but not on http://www.example.com or http://yourhost.yoursite.com.
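A Sitemap index can be generated in much the same way as a Sitemap. The sketch below also enforces the same-site rule from the note by comparing scheme and host; the function name build_sitemap_index and the file names sitemap1.xml and sitemap2.xml are hypothetical, and the xml_declaration argument requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(index_url, sitemap_urls, lastmod=None):
    """Build a Sitemap index, rejecting Sitemaps hosted on another site."""
    index_site = urlsplit(index_url)[:2]  # (scheme, host)
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        if urlsplit(loc)[:2] != index_site:
            raise ValueError("Sitemap is not on the same site as the index: " + loc)
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        if lastmod is not None:
            # Optional: when the referenced Sitemap file was last modified.
            ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


index_xml = build_sitemap_index(
    "http://www.yoursite.com/sitemap_index.xml",
    ["http://www.yoursite.com/sitemap1.xml",
     "http://www.yoursite.com/sitemap2.xml"],
    lastmod="2024-01-15",
)
print(index_xml.decode("utf-8"))
```

Passing a Sitemap on http://yourhost.yoursite.com or http://www.example.com to this function raises ValueError, mirroring the restriction described in the note.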
SITEMAP FILE LOCATION
The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://example.com/catalog/sitemap.xml can include any URLs starting with http://example.com/catalog/ but cannot include URLs starting with http://example.com/images/.
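The location rule amounts to a prefix check on the directory that contains the Sitemap file. A minimal sketch, assuming a hypothetical helper named in_sitemap_scope and made-up page URLs:

```python
from urllib.parse import urlsplit


def in_sitemap_scope(sitemap_url, page_url):
    """Return True if page_url may be listed in the Sitemap at sitemap_url."""
    sitemap = urlsplit(sitemap_url)
    page = urlsplit(page_url)
    # Directory holding the Sitemap file, e.g. "/catalog/" for "/catalog/sitemap.xml".
    base_dir = sitemap.path.rsplit("/", 1)[0] + "/"
    same_site = (sitemap.scheme, sitemap.netloc) == (page.scheme, page.netloc)
    return same_site and page.path.startswith(base_dir)


SITEMAP = "http://example.com/catalog/sitemap.xml"
catalog_ok = in_sitemap_scope(SITEMAP, "http://example.com/catalog/show?item=1")
images_ok = in_sitemap_scope(SITEMAP, "http://example.com/images/logo.png")
print(catalog_ok, images_ok)
```

By the same logic, a Sitemap placed at the root of the site, such as http://example.com/sitemap.xml, has "/" as its base directory and can therefore list any URL on that host.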
THANK YOU
ADITYA TODAWAL
PROJECT COORDINATOR (SEO)
SEARCH RESULTS MEDIA – INTERNET MARKETING TORONTO