Robots.txt is a file used to exclude content from crawling by search engine spiders
(bots). The convention it follows is also called the Robots Exclusion Protocol.
In general, we want our webpages to be indexed by the search engines. But there may be
some content that we don’t want to be crawled and indexed: a personal images folder,
a website administration folder, a web developer’s test folder for a customer, folders
with no search value such as cgi-bin, and many more. The main idea is that we don’t want
them to be indexed.
No, not every bot obeys robots.txt. Standards-based bots such as those of Google, Yahoo,
and other big search engines honor your robots.txt file because they are programmed to.
Compliance is voluntary, though: any bot can be configured to ignore the robots.txt file.
The robots.txt file has some simple directives that manage the bots. These are:
User-agent: * # applies to all search engine spiders
Disallow: /secretcontent/ # do not crawl the /secretcontent/ folder
User-agent: [Spider or Bot name]
Disallow: [Directory or File Name]
User-agent: this parameter defines which bots the following directives apply to. * is a
wildcard that matches all bots; a specific name such as Googlebot targets only that bot.
Disallow: defines which folders or files will be excluded. An empty value means nothing
will be excluded, / means everything will be excluded, and /foldername/ or /filename can
be used to exclude specific paths. Values are matched as path prefixes. A folder name
between slashes, like /foldername/, excludes everything inside that folder. A single
leading slash, like /foldername, excludes every path that starts with /foldername, which
covers the folder's contents and also files such as /foldername.html.
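The effect of the trailing slash can be checked with a small sketch. It uses Python's standard urllib.robotparser module, which is only one implementation of the protocol, so other crawlers may differ in details. The bot name ExampleBot and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Rule with a trailing slash: only paths inside the folder match.
WITH_SLASH = """\
User-agent: *
Disallow: /secretcontent/
"""

# Rule without a trailing slash: any path starting with the text matches.
WITHOUT_SLASH = """\
User-agent: *
Disallow: /secretcontent
"""

with_slash = build_parser(WITH_SLASH)
without_slash = build_parser(WITHOUT_SLASH)


def allowed(parser: RobotFileParser, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the (hypothetical) agent may fetch the path."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/secretcontent/page.html", "/secretcontent.html", "/index.html"):
        print(path, allowed(with_slash, path), allowed(without_slash, path))
```

In this parser a Disallow value is a plain prefix test, which is why /secretcontent.html is blocked only by the rule without the trailing slash.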
This file says that any robot (User-agent: *) that accesses it should ignore the
directories /cgi-bin/ and /temp/ (Disallow: /cgi-bin/ Disallow: /temp/).
This file says that any robot (User-agent: *) that accesses it should ignore the files
/jenns-stuff.htm and /private.php (Disallow: /jenns-stuff.htm Disallow: /private.php).
This file says that the Lycos bot (User-agent: Lycos/x.x) is not allowed access
anywhere on the site (Disallow: /).
This file first disallows all robots, as in the example above, and then explicitly lets
Googlebot (User-agent: Googlebot) have access to everything (an empty Disallow:).
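A minimal sketch of such a file, checked with Python's standard urllib.robotparser, might look like the following. The robots.txt text is reconstructed from the description above, and SomeOtherBot and the test path are hypothetical names used only for the check.

```python
from urllib.robotparser import RobotFileParser

# Block every robot, then make an exception for Googlebot.
# An empty "Disallow:" means nothing is excluded for that bot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A bot with its own section uses that section instead of the * section.
googlebot_ok = parser.can_fetch("Googlebot", "/any/page.html")
other_ok = parser.can_fetch("SomeOtherBot", "/any/page.html")

if __name__ == "__main__":
    print("Googlebot allowed:", googlebot_ok)
    print("SomeOtherBot allowed:", other_ok)
```

The order of the two sections does not matter to this parser: it looks for a section naming the bot first and falls back to the * section only when none matches.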
Canonical means original. When two or more webpages have
almost the same content, those URLs are duplicates of each
other. We can use the canonical tag on the webpage where the duplicate
content is and mention the canonical URL there, so that search
engines understand that the original content is somewhere else.
In other words, Google lets you choose which of your duplicates
"wins". But the duplicates are still crawled and indexed, which means Google
is spending time crawling pages that you know to be duplicates.
All the URLs below are pointing to one and the same page:
While we see all the web addresses above as referring to the same page, search engines see
them as separate addresses.
Normally, search engines will choose the one that they think is your primary address, but it
may not always be the best for your website.
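To make the idea of one primary address concrete, here is a rough sketch of how URL variants could be collapsed into a single form. The domain example.com, the choice of the www host as the preferred one, and the list of default document names are all assumptions for illustration. Real search engines apply their own, more involved logic.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for this sketch: the site owner prefers the www host, and
# these file names all serve the same content as the bare folder URL.
BARE_HOST = "example.com"
PREFERRED_HOST = "www.example.com"
DEFAULT_DOCS = ("index.html", "index.htm", "index.php", "default.html")


def canonicalize(url: str) -> str:
    """Map www/non-www and default-document variants to one address."""
    if "://" not in url:
        url = "http://" + url  # treat scheme-less input as http
    parts = urlsplit(url)

    host = parts.netloc.lower()
    if host in (BARE_HOST, PREFERRED_HOST):
        host = PREFERRED_HOST

    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail.lower() in DEFAULT_DOCS:
        path = head + "/"  # drop the default document name

    # The fragment is dropped because it never reaches the server.
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


VARIANTS = [
    "example.com",
    "http://example.com/",
    "http://www.example.com",
    "www.example.com/index.html",
    "http://EXAMPLE.com/index.htm",
]

if __name__ == "__main__":
    for variant in VARIANTS:
        print(variant, "->", canonicalize(variant))
```

Every variant in the list ends up as the same address, which is the address a site owner would then declare as canonical.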
Search engines see http://www.vedaseo.com and http://vedaseo.com as different web pages.
Since you have more than one URL that serves the same content, they might flag one of
those pages as duplicate (plagiarized) content.
The result can be lower rankings in search results and, at worst, removal from the
index of search engines.
Another problem that may arise from a URL having a www and a non-www version is
split link popularity.
Let’s say that your homepage receives 300 inbound links. If 100 point to mywebsite.com
and 200 point to www.mywebsite.com, then your website won’t get the full benefit of
all your 300 incoming links. In fact, your Google PageRank may only reflect the 200
incoming links instead of the 300.
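The arithmetic behind this example can be sketched in a few lines. The link list is synthetic, built only to mirror the 100 and 200 figures above, and the counts are plain tallies of inbound links, not an actual PageRank calculation.

```python
from collections import Counter
from urllib.parse import urlsplit

# Synthetic data mirroring the example: 100 links to the non-www
# address and 200 links to the www address.
inbound_targets = (
    ["http://mywebsite.com/"] * 100
    + ["http://www.mywebsite.com/"] * 200
)


def host_of(url: str) -> str:
    """Return the lower-cased host part of a URL."""
    return urlsplit(url).netloc.lower()


def merge_www(host: str) -> str:
    """Treat the non-www host as the www host (assumed preferred form)."""
    return host if host.startswith("www.") else "www." + host


# Counted per host, the links are split across two addresses.
split_counts = Counter(host_of(u) for u in inbound_targets)

# Counted after consolidation, all links credit one address.
merged_counts = Counter(merge_www(host_of(u)) for u in inbound_targets)

if __name__ == "__main__":
    print("split: ", dict(split_counts))
    print("merged:", dict(merged_counts))
```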
In addition, because your internal links have a bearing on your link popularity, those links
pointing to your homepage with a file name at the end (e.g.
www.mywebsite.com/index.html) might also “steal” some of the link juice from the true
homepage address.
<link rel="canonical" href="http://www.vedaseo.com/cascade-magnum-kittens.htm" />
<link rel="canonical" href="http://www.vedaseo.com/cascade-magnum-kittens.html" />
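One way to confirm which canonical URL a page declares is to read the tag back out of its HTML. The sketch below uses Python's standard html.parser module. The class and function names are made up for this example, and the sample page is a minimal stand-in built around the second tag above.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also arrive here.
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the declared canonical URL, or None if the page has none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


SAMPLE_PAGE = """\
<html>
  <head>
    <title>Sample page</title>
    <link rel="canonical" href="http://www.vedaseo.com/cascade-magnum-kittens.html" />
  </head>
  <body></body>
</html>
"""

if __name__ == "__main__":
    print(find_canonical(SAMPLE_PAGE))
```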
Rel=canonical is supposed to be really easy to throw up