Canonical and robotos (2)


Published in: Technology, Design

  1. 1. Robots.txt is a file used to exclude content from the crawling process of search engine spiders (bots). The convention it follows is called the Robots Exclusion Protocol. In general, we want our webpages to be indexed by search engines, but some content should not be crawled and indexed: a personal images folder, the website administration folder, a web developer's folder of customer test sites, folders with no search value such as cgi-bin, and so on. Not every bot obeys the file, however. Standards-based bots such as those of Google, Yahoo and the other big search engines honor robots.txt because they are programmed to; a bot that is configured to do so can simply ignore it.
  2. 2. The robots.txt file has a few simple directives that manage the bots:
       User-agent: *               # the rules that follow apply to all search engine spiders
       Disallow: /secretcontent/   # do not crawl the secretcontent folder
     The general form is:
       User-agent: [spider or bot name]
       Disallow: [directory or file name]
     User-agent: defines which bots the following rules apply to. * is a wildcard that means all bots; a name such as Googlebot targets Google's bot only.
     Disallow: defines which folders or files are excluded. An empty value means nothing is excluded, / means everything is excluded, and /foldername/ or /filename excludes specific paths. Values are matched as path prefixes: /foldername/ (with a trailing slash) excludes everything inside that folder, while /foldername (without it) excludes every path that starts with those characters, such as /foldername.html as well as the folder itself.
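The way a Disallow value is applied can be sketched in a few lines of Python. This is a minimal illustration, assuming plain path-prefix matching as in the original robots exclusion convention; it ignores wildcards and Allow rules that some search engines also support. The function name `is_excluded` and the `/private` paths are hypothetical, chosen only for the example.

```python
def is_excluded(path: str, disallow_rules: list[str]) -> bool:
    """Return True if `path` starts with any non-empty Disallow value."""
    for rule in disallow_rules:
        if rule == "":
            # An empty Disallow value excludes nothing.
            continue
        if path.startswith(rule):
            return True
    return False


# "/" is a prefix of every path, so everything is excluded.
print(is_excluded("/any/page.html", ["/"]))           # True
# With a trailing slash, only paths inside the folder match.
print(is_excluded("/private/a.html", ["/private/"]))  # True
print(is_excluded("/private.html", ["/private/"]))    # False
# Without it, any path starting with these characters matches.
print(is_excluded("/private.html", ["/private"]))     # True
# An empty value excludes nothing.
print(is_excluded("/private/a.html", [""]))           # False
```

The trailing slash is the detail most often misread: it narrows the rule to the folder's contents rather than widening it.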
  3. 3. Example:
       User-agent: *
       Disallow: /cgi-bin/
       Disallow: /temp/
     This file says that any robot (User-agent: *) that reads it should ignore the directories /cgi-bin/ and /temp/.
     Example:
       User-agent: *
       Disallow: /jenns-stuff.htm
       Disallow: /private.php
     This file says that any robot (User-agent: *) that reads it should ignore the files /jenns-stuff.htm and /private.php.
  4. 4. Example:
       User-agent: Lycos/x.x
       Disallow: /
     This file says that the Lycos bot (User-agent: Lycos/x.x) is not allowed access anywhere on the site (Disallow: /).
     Example:
       User-agent: *
       Disallow: /
       User-agent: Googlebot
       Disallow:
     This file first disallows all robots, as above, and then explicitly lets Googlebot (User-agent: Googlebot) have access to everything (an empty Disallow:).
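Example files like these can be checked with Python's standard `urllib.robotparser` module. The sketch below feeds it two of the files from the slides and asks whether a given bot may fetch a given path. The helper `parse_rules`, the bot names `AnyBot` and `OtherBot`, and the tested paths are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# All robots are kept out of two folders.
block_folders = parse_rules("""\
User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
""")

# All robots are kept out of the whole site, except Googlebot.
googlebot_only = parse_rules("""\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
""")

print(block_folders.can_fetch("AnyBot", "/cgi-bin/script.cgi"))  # False
print(block_folders.can_fetch("AnyBot", "/index.html"))          # True
print(googlebot_only.can_fetch("Googlebot", "/index.html"))      # True
print(googlebot_only.can_fetch("OtherBot", "/index.html"))       # False
```

A parser like this only reports what the file asks for. A bot that never consults robots.txt is not affected by it.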
  5. 5. Canonical means original. When two or more webpages have almost the same content, their URLs are duplicates of each other. We can put a canonical tag on the page that holds the duplicate content and name the canonical URL there, so that search engines understand that the original content lives somewhere else. In other words, Google lets you choose which of your duplicates "wins". The duplicates are still crawled, however, so Google spends time crawling pages that you already know to be duplicates.
  6. 6. Several different URLs can point to one and the same page. While we see all those web addresses as referring to the same page, search engines see them as separate addresses. Normally, search engines will choose the one that they think is your primary address, but their choice may not always be the best one for your website.
  7. 7. Search engines see the www and non-www versions of a URL as different web pages. Since you have more than one URL that serves the same content, they might flag one of those pages as duplicated or plagiarized content. The result can be lower rankings in search results and, in the worst case, removal from the search engine's index.
  8. 8. Another problem that may arise from a URL having a www and a non-www version is split link popularity. Let's say that your homepage receives 300 inbound links. If 100 point to one version and 200 point to the other, then your website won't get the full benefit of all 300 incoming links. In fact, your Google PageRank may reflect only the 200 incoming links instead of the 300. In addition, because your internal links have a bearing on your link popularity, links that point to your homepage with a file name at the end might also "steal" some of the link juice from the true URL.
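The split described above can be made concrete with a small Python sketch that folds URL variants into one form and adds up their inbound links. Everything specific here is an assumption for illustration: the `example.com` addresses, the choice of the non-www form as the preferred one, and the list of default document names. It only demonstrates the arithmetic and does not model how any search engine counts links.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit

# Assumed default document names; a real site may use others.
DEFAULT_DOCUMENTS = ("index.html", "index.php", "default.html")


def normalize(url: str) -> str:
    """Fold www/non-www and trailing default-document variants into one URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # assumed preference: the non-www form
    path = parts.path or "/"
    for name in DEFAULT_DOCUMENTS:
        if path.endswith("/" + name):
            path = path[: -len(name)]  # keep the trailing slash
            break
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


# Hypothetical inbound-link counts, matching the 100 / 200 split in the text.
inbound = {
    "http://www.example.com/": 100,
    "http://example.com/": 200,
}

merged = Counter()
for url, count in inbound.items():
    merged[normalize(url)] += count

print(merged["http://example.com/"])  # 300
```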
  9. 9. On the duplicate page: <link rel="canonical" href="" /> On the proper page: <link rel="canonical" href="" /> In both cases the href holds the URL of the proper page. rel=canonical is meant to be really easy to put in place.
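As a small illustration of how such a tag could be generated, the Python sketch below builds the link element for a given canonical URL and escapes it for use inside an HTML attribute. The function name `canonical_tag` and the `example.com` URL are hypothetical. The same tag would go in the head of the duplicate page and of the proper page.

```python
from html import escape


def canonical_tag(canonical_url: str) -> str:
    """Return a rel=canonical link element pointing at `canonical_url`."""
    # escape() makes characters such as & and " safe inside the attribute.
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}" />'


print(canonical_tag("http://example.com/page?a=1&b=2"))
# <link rel="canonical" href="http://example.com/page?a=1&amp;b=2" />
```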