A presentation I shared with members of the Ask team before we embarked on a major duplicate content cleanup project.

  Dealing with Duplicate Content
  Ehren Reilly, May 8, 2012
  Agenda
• What is duplicate content? Why is it bad?
• Examples of duplicate content on our site and other notable sites
• Techniques
– robots.txt disallow
– Meta robots tag
– 301 redirect dupe URL to primary URL
– Canonical URL tag
– Prevent duplicate page from being created in the first place
• Related topics
– rel="alternate" for language/regional support
– Expired and no-longer-relevant content, and strategic content deletion
• Suggested reading
  What is duplicate content?
• The same content often appears at more than one URL
• Intentional duplication
– Quotation
– Re-use
    • E.g., Wikipedia
– Content syndication
• Inadvertent duplication
– Separate mobile-optimized or printer-optimized version of page.
– Separate regional design or branding
    • E.g., vs
– Dynamic content where different queries return same results
    • E.g., vs about/iphone vs
– Extra junk in URL that does not substantively change content.
    • E.g., vs
– Pagination and filtering
    • E.g.,
• Manipulative duplication
– Scraper sites
– Blatant copyright infringement / plagiarism
– SEO spam
  Why is duplicate content a problem?
• Fragmentation of link equity, authority & anchor text
– If there are 100 links to “iphone” and 50 links to “iPhones”:
    • Do I treat this as a single page with 150 links?
    • Do I treat both pages as separate and important? If so, “iphone” is 100 links worth of importance, and “iPhones” is 50 links worth of importance.
• Lower confidence in single, definitive source
– If there are many versions, which version is the definitive one?
– Which URL has the most relevant/reliable copy of this for a given search query?
– “I know is a good source for X, but I can’t figure out which of these URLs is’ definitive page on X.”
• Penalties for manipulative and non-user-friendly duplication
– Posting exact same content on multiple different sites
– Panda penalty for “thin content”
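The link-count question on this slide can be worked through with a small sketch. The counts (100 and 50) come from the slide; the URL paths and the `CANONICAL` mapping are hypothetical, and real search engines weigh links in far more complex ways than a simple sum.

```python
from collections import Counter

# Inbound link counts per URL variant (numbers taken from the slide).
inbound_links = {"/iphone": 100, "/iPhones": 50}

# Hypothetical mapping from each duplicate variant to its primary URL.
CANONICAL = {"/iPhones": "/iphone"}

def consolidate(link_counts, canonical_map):
    """Sum inbound links per primary URL, folding duplicates into it."""
    totals = Counter()
    for url, count in link_counts.items():
        totals[canonical_map.get(url, url)] += count
    return dict(totals)

consolidated = consolidate(inbound_links, CANONICAL)
print(consolidated)  # {'/iphone': 150}
```

Without the mapping, the two URLs stay fragmented at 100 and 50; with it, the primary URL is credited with all 150.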
  Examples from
• Case-insensitive URLs
– /q/ vs /Q/
    • Sepsis
    • Sepsis
– Questions About page paths
• Duplicate questions in Community (e.g.)
• US vs UK wiki
– vs
• Accidentally indexable weird subdomains
  Examples from Other Sites
Same people’s bios used verbatim for two different brands’ websites.
Google will never show you both versions for a single search query.
  Examples from Other Sites
• Facebook has massive duplicate content issues, and as a result deep pages do not rank well in Google search results.
– Five different versions of NYC Ballet’s “Videos” page.
– None of them is on the first page in Google for New York City Ballet Videos.
    • Instead, main page is #8 in Google.
• In this heavily re-blogged post from Google’s blog
– Many people quoted this passage.
– The original source shows up first in the SERP
  Techniques for Managing Duplicate Content
• Duplicate content can arise in many ways, and there are many techniques to manage it.
• Different techniques are better suited to different situations.
• Things to consider about each method:
– Prevents penalties?
– Allows for alternate styling?
– Speed/effectiveness?
– Propagates link equity to all outbound links?
– Consolidates link equity from all inbound links?
  Robots.txt “Disallow”
• What it is: File on site that tells bots how to crawl various sections of your site. Specific to each subdomain.
• Message to bots: “Don’t crawl this content, don’t put it in your index, and disregard any links that point here. Go away.”
• What it’s good for:
– Sections of the site that have no SEO value.
– Secret stuff that you don’t want getting crawled.
• What it’s bad for:
– Inelegant, brute-force way of dealing with duplicate content.
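As a rough illustration of what a Disallow rule does, the sketch below feeds a small robots.txt body to Python's standard `urllib.robotparser` and asks which URLs a compliant bot may crawl. The `/print/` and `/internal/` paths and the `www.example.com` host are assumptions made up for the example, not rules from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; the /print/ and /internal/ paths are
# illustrative only. A real file lives at the root of each subdomain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /internal/
"""

def build_parser(robots_txt):
    """Parse a robots.txt body held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# Printer-friendly duplicate: blocked for every user agent.
print_ok = parser.can_fetch("*", "http://www.example.com/print/sepsis")
# Primary page: not matched by any Disallow rule, so crawlable.
primary_ok = parser.can_fetch("*", "http://www.example.com/q/sepsis")

print(print_ok, primary_ok)
```

Because the rules match by path prefix, one line blocks a whole section at once, which is what makes the technique brute-force.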
  Meta Robots
• What it is: Meta tag on individual page, which is like a more targeted version of robots.txt.
• Message to bots: Has two separate parameters.
– index/noindex: Should this page be crawled & indexed by the bot?
– follow/nofollow: Should links out from this page be allowed to propagate link equity?
☞ Usually, if you’re trying to block a page from the index, but it has links to other indexed pages, you want <meta name="robots" content="noindex,follow">
• What it’s good for:
– More targeted version of robots.txt
– Allows you to block from index but still propagate link equity.
– Great for deep pages of paginated/listed content.
• What it’s bad for:
– Alternate versions of content that users might actually want to find from search.
• Suggested Use: eHow Content Pages that they won’t let us use for SEO.
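The two switches can be sketched as a tiny tag builder. The function names and the pagination policy (index only page 1 of a listing) are assumptions for illustration, chosen to match the "deep pages of paginated content" use case on the slide.

```python
def meta_robots_tag(index, follow):
    """Build a meta robots tag from the two independent switches."""
    directives = [
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ]
    return '<meta name="robots" content="{}">'.format(",".join(directives))

def tag_for_listing_page(page_number):
    """Hypothetical policy: only page 1 of a paginated listing is indexed.

    Deeper pages get noindex,follow so their links still pass equity.
    """
    return meta_robots_tag(index=(page_number == 1), follow=True)

print(tag_for_listing_page(3))
# <meta name="robots" content="noindex,follow">
```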
  301 Redirect
• What it is: Permanent redirection of duplicate/old URL to primary/new URL.
• Message to bots: “Don’t go to that old URL, go to this new one. Remove the old one from the index, and forward all link equity to the new one.”
• What it’s good for:
– Consolidating content that exists in unnecessary variations.
– Preserving the value of links, no matter which URL they link to.
• What it’s bad for:
– Not possible to maintain alternate versions of content, since both users and bots are redirected to a different URL.
• Suggested Use: Content deleted from Community because it is redundant/duplicate.
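A minimal sketch of the redirect behavior, written as a plain WSGI application so it runs without any framework. The `REDIRECTS` table and its paths are hypothetical; a production setup would more likely configure this in the web server or routing layer.

```python
# Hypothetical map of duplicate/old paths to their primary paths.
REDIRECTS = {
    "/Q/Sepsis": "/q/sepsis",
    "/community/old-question": "/community/primary-question",
}

def app(environ, start_response):
    """Minimal WSGI app: 301 for known duplicates, 200 otherwise."""
    path = environ.get("PATH_INFO", "/")
    target = REDIRECTS.get(path)
    if target is not None:
        # 301 tells users and bots alike that the move is permanent.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"primary content\n"]

def call(path):
    """Invoke the app directly (no server); return (status, headers, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["status"], captured["headers"], body

status, headers, _ = call("/Q/Sepsis")
print(status, headers["Location"])  # 301 Moved Permanently /q/sepsis
```

To try it in a browser, the same `app` can be served with the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`.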
  Canonical URL Tag
• What it is: Tag in the page <head> that tells bots which instance of the page to index. If there are multiple instances of the same page, they are consolidated together into a single URL.
– E.g., <link rel="canonical" href=""/>
• Message to bots: “For purposes of search listings, this content belongs to such-and-such URL”.
• What it’s good for:
– Consolidating link equity among various versions of the same content.
– Allows you to maintain different versions without incurring a penalty or forfeiting any link equity.
– Prevents accidental indexing of trivially/accidentally different URLs.
• What it’s bad for:
– Slow to work.
– Officially just a “suggestion”, not a “rule”.
– Not 100% effective at keeping pages out of index.
• Suggested Use: Any page that gets listed in search engines. Especially:
– Pages with lots of meaningless URL parameters
– Pages with case-insensitive URLs (choose a single, canonical capitalization format)
– Pages that can be accessed on multiple weird subdomains
• Warning: Do not just dynamically put the page URL here.
– Make sure it is actually a canonical version of the URL.
– Should not vary based on capitalization.
– Should not dynamically insert domain. Should specify the correct domain for search index.
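The warning on this slide can be sketched as a normalization function that derives the canonical URL from rules rather than echoing the requested URL. Everything configurable here is an assumption for illustration: the `www.example.com` host, lowercase as the chosen capitalization, and `q` as the only parameter that changes page content.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumptions for illustration: one fixed scheme and host for the search
# index, lowercase paths as the chosen capitalization, and "q" as the
# only query parameter that changes page content.
CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.example.com"
MEANINGFUL_PARAMS = {"q"}

def canonical_url(requested_url):
    """Map any variant of a URL to its single canonical form."""
    parts = urlsplit(requested_url)
    path = parts.path.lower() or "/"
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in MEANINGFUL_PARAMS]
    query = urlencode(sorted(kept))
    # Host and scheme are hard-coded, never copied from the request.
    return urlunsplit((CANONICAL_SCHEME, CANONICAL_HOST, path, query, ""))

def canonical_link_tag(requested_url):
    """Render the canonical tag for the <head> of the requested page."""
    href = escape(canonical_url(requested_url), quote=True)
    return '<link rel="canonical" href="{}"/>'.format(href)

print(canonical_link_tag("http://m.example.com/Q/Sepsis?utm_source=feed"))
# <link rel="canonical" href="http://www.example.com/q/sepsis"/>
```

Since the capitalization, the subdomain, and the tracking parameter are all normalized away, every variant of the page emits the same tag.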
  Prevention
• Nothing else is more effective than abstinence.
• Foresee potential duplicate content issues, and build technologies to prevent them.
– For user-generated pages, suggest an already-created page rather than create a new one with the same topic. (E.g., Quora)
– For automatically-created pages, do programmatic de-duplication before pages are created.
    • Does this query return all the same content as some other query?
    • Don’t include words/characters in URL that don’t affect the query results.
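One possible way to do the programmatic de-duplication check is to fingerprint the set of results a query returns and reuse the existing page when the fingerprint has been seen before. This is a sketch under assumptions: `PageRegistry`, the `doc-*` result IDs, and the `/about/...` paths are all hypothetical, and a real system would need persistent storage and a policy for near-identical result sets.

```python
import hashlib

def result_fingerprint(result_ids):
    """Hash the set of result IDs so equal result sets compare equal."""
    joined = "\n".join(sorted(set(result_ids)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

class PageRegistry:
    """Tracks which result set already has a page (hypothetical helper)."""

    def __init__(self):
        self._url_by_fingerprint = {}

    def page_for(self, proposed_url, result_ids):
        """Return the existing URL for these results, else register a new one."""
        key = result_fingerprint(result_ids)
        return self._url_by_fingerprint.setdefault(key, proposed_url)

registry = PageRegistry()
first = registry.page_for("/about/iphone", ["doc-1", "doc-2", "doc-3"])
second = registry.page_for("/about/iphones", ["doc-3", "doc-1", "doc-2"])
print(first, second)  # /about/iphone /about/iphone
```

The second query returns the same results in a different order, so no new page is created for it.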
  Technique Comparison Chart
Technique | Prevents penalties | Allows for alternate styling | Fast & effective removal from index | Propagates link equity to all outbound links | Consolidates link equity from all inbound links
Robots.txt | ✔ | ✔ | ✔ | ✗ | ✗
Meta Robots “noindex,nofollow” | ✔ | ✔ | ✔ | ✗ | ✗
Meta Robots “noindex,follow” | ✔ | ✔ | ✔ | ✔ | ✗
301 Redirect | ✔ | ✗ | ✔ | ✔ | ✔
Canonical URL Tag | ✔ | ✔ | ✗ | ✔ | ✔
Prevention | ✔ | ✗ | ✔ | ✔ | ✔
  Related: International versions with rel="alternate"
• Used in combination with rel="canonical"
• Tells Google if and when there are country/language specific versions of the page.
• Different versions share link equity and other ranking signals.
• Google SERP links to the appropriate country-specific version for each user.
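The regional annotations can be sketched as one `rel="alternate"` link tag per version, each carrying an `hreflang` attribute. The `REGIONAL_VERSIONS` table and its `example.com` URLs are hypothetical stand-ins for a US and a UK version of the same page.

```python
from html import escape

# Hypothetical regional versions of one page, keyed by hreflang value.
REGIONAL_VERSIONS = {
    "en-us": "http://www.example.com/wiki/sepsis",
    "en-gb": "http://uk.example.com/wiki/sepsis",
}

def alternate_link_tags(versions):
    """Build one rel="alternate" hreflang tag per regional version."""
    return [
        '<link rel="alternate" hreflang="{}" href="{}"/>'.format(
            escape(lang, quote=True), escape(url, quote=True)
        )
        for lang, url in sorted(versions.items())
    ]

tags = alternate_link_tags(REGIONAL_VERSIONS)
for tag in tags:
    print(tag)
```

The same full set of tags would normally go in the head of every regional version, including a tag pointing at the page itself.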
  Related: Deleted and Expired Content
• Sometimes content gets intentionally deleted.
– Community Terms violation.
– Legal/copyright issues.
– Terminated partnerships.
– Expired or no longer valuable.
• User experience options for deleted/empty pages
    a) 301 redirect to another relevant page
    b) Replace with “content deleted” message and links to other relevant pages.
    c) Generic error message.
• HTTP/robots treatment of deleted/empty pages
    a) 301
    b) 404
    c) 200 with meta robots “noindex,follow”
    d) 200 that can be indexed → Duplicate content
  Suggested Reading
• content
• best-practices-in-plain-english-44475
• the-most-important-advancement-in-seo-practices-since-sitemaps