
Dealing with duplicate content



A presentation I shared with members of the Ask team before we embarked on a major duplicate content cleanup project.

Published in: Technology, Design


  1. Dealing with Duplicate Content. Ehren Reilly, May 8, 2012
  2. Agenda
     Why focus on Answers?
     • What is duplicate content? Why is it bad?
     • Examples of duplicate content on our site and other notable sites
     • Techniques
       – robots.txt disallow
       – Meta robots tag
       – 301 redirect dupe URL to primary URL
       – Canonical URL tag
       – Prevent duplicate page from being created in the first place
     • Related topics
       – rel="alternate" for language/regional support
       – Expired and no-longer-relevant content, and strategic content deletion
     • Suggested reading
  3. Agenda
     • What is duplicate content & why is it bad?
     • Examples of duplicate content on our site and other notable sites
     • Techniques
       – robots.txt disallow
       – Meta robots tag
       – 301 redirect dupe URL to primary URL
       – Canonical URL tag
       – Prevent duplicate page from being created in the first place
     • Related topics
       – rel="alternate" for language/regional support
       – Expired and no-longer-relevant content, and strategic content deletion
     • Suggested reading
  4. What is duplicate content?
     • The same content often appears at more than one URL
     • Intentional duplication
       – Quotation
       – Re-use
         • E.g. Wikipedia
       – Content syndication
     • Inadvertent duplication
       – Separate mobile-optimized or printer-optimized version of page.
       – Separate regional design or branding
         • E.g., vs
       – Dynamic content where different queries return same results
         • E.g., vs about/iphone vs
       – Extra junk in URL that does not substantively change content.
         • E.g., vs
       – Pagination and filtering
         • E.g.,
     • Manipulative duplication
       – Scraper sites
       – Blatant copyright infringement / plagiarism
       – SEO spam
  5. Why is duplicate content a problem?
     • Fragmentation of link equity, authority & anchor text
       – If there are 100 links to “iphone” and 50 links to “iPhones”:
         • Do I treat this as a single page with 150 links?
         • Do I treat both pages as separate and important? If so, “iphone” is 100 links worth of importance, and “iPhones” is 50 links worth of importance.
     • Lower confidence in single, definitive source
       – If there are many versions, which version is the definitive one?
       – Which URL has the most relevant/reliable copy of this for a given search query?
       – “I know is a good source for X, but I can’t figure out which of these URLs is’ definitive page on X.”
     • Penalties for manipulative and non-user-friendly duplication
       – Posting exact same content on multiple different sites
       – Panda penalty for “thin content”
  6. Examples from
     • Case-insensitive URLs
       – /q/ vs /Q/
         • Sepsis
         • Sepsis
       – Questions About page paths
     • Duplicate questions in Community (e.g.)
     • US vs UK wiki
       – vs
     • Accidentally indexable weird subdomains
  7. Examples from Other Sites
     Same people’s bios used verbatim for two different brands’ websites. Google will never show you both versions for a single search query.
  8. Examples from Other Sites
     • Facebook has massive duplicate content issues, and as a result deep pages do not rank well in Google search results.
       – Five different versions of NYC Ballet’s “Videos” page.
       – None of them is on the first page in Google for New York City Ballet Videos.
         • Instead, main page is #8 in Google.
     • In this heavily re-blogged post from Google’s blog
       – Many people quoted this passage.
       – The original source shows up first in the SERP
  9. Techniques for Managing Duplicate Content
     • There are many ways duplicate content can arise and many techniques to manage it.
     • Different techniques are better suited to different situations.
     • Things to consider about each method:
       – Prevents penalties?
       – Allows for alternate styling?
       – Speed/effectiveness?
       – Propagates link equity to all outbound links?
       – Consolidates link equity from all inbound links?
  10. Robots.txt “Disallow”
      • What it is: File on site that tells bots how to crawl various sections of your site. Specific to each subdomain.
      • Message to bots: “Don’t crawl this content, don’t put it in your index, and disregard any links that point here. Go away.”
      • What it’s good for:
        – Sections of the site that have no SEO value.
        – Secret stuff that you don’t want getting crawled.
      • What it’s bad for:
        – Inelegant, brute-force way of dealing with duplicate content.
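As a rough illustration of the robots.txt slide, the sketch below uses Python's standard `urllib.robotparser` to check whether a URL falls under a `Disallow` rule. The robots.txt body, the `/print/` and `/internal/` paths, the `example.com` host, and the `is_crawlable` helper are all assumptions made for the example; none of them come from the deck.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; the disallowed sections are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /internal/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent crawl url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under these assumptions, a printer-friendly URL beneath `/print/` is reported as blocked, while a regular `/q/` page stays crawlable. Because a robots.txt file applies per subdomain, each subdomain would need its own file and its own check.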
  11. Meta Robots
      • What it is: Meta tag on individual page, which is like a more targeted version of robots.txt.
      • Message to bots: Has two separate parameters.
        – index/noindex: Should this page be crawled & indexed by the bot?
        – follow/nofollow: Should links out from this page be allowed to propagate link equity?
        ☞ Usually, if you’re trying to block a page from the index, but it has links to other indexed pages, you want <meta name="robots" content="noindex,follow">
      • What it’s good for:
        – More targeted version of robots.txt
        – Allows you to block from index but still propagate link equity.
        – Great for deep pages of paginated/listed content.
      • What it’s bad for:
        – Alternate versions of content that users might actually want to find from search.
      • Suggested Use: eHow Content Pages that they won’t let us use for SEO.
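To make the two meta robots parameters concrete, here is a minimal sketch that reads a page's meta robots tag with Python's standard `html.parser` and reports the resulting index/follow policy. The `MetaRobotsParser` class and `robots_policy` function are hypothetical names for this example, and it assumes the usual default that a page with no tag is treated as index, follow.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html: str) -> dict:
    """Return whether the page may be indexed and whether its links are followed."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }
```

For a page carrying `<meta name="robots" content="noindex,follow">`, this sketch reports `index` as false and `follow` as true, which is the combination the slide recommends for deep paginated pages.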
  12. 301 Redirect
      • What it is: Permanent redirection of duplicate/old URL to primary/new URL.
      • Message to bots: “Don’t go to that old URL, go to this new one. Remove the old one from the index, and forward all link equity to the new one.”
      • What it’s good for:
        – Consolidating content that exists in unnecessary variations.
        – Preserving the value of links, no matter which URL they link to.
      • What it’s bad for:
        – Not possible to maintain alternate versions of content, since both users and bots are redirected to a different URL.
      • Suggested Use: Content deleted from Community because it is redundant/duplicate.
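The redirect behaviour described on this slide can be sketched as a lookup from duplicate paths to their primary path. The `REDIRECTS` table, its example paths, and the `resolve` helper are hypothetical; a real site would normally configure this in its web server or routing layer rather than in a dictionary.

```python
from typing import Optional, Tuple

# Hypothetical mapping from duplicate/old paths to the primary/new path.
REDIRECTS = {
    "/Q/sepsis": "/q/sepsis",
    "/q/sepsis/": "/q/sepsis",
}


def resolve(path: str) -> Tuple[int, Optional[str]]:
    """Return (status, location): 301 plus the primary path for known duplicates,
    otherwise 200 with no redirect target."""
    target = REDIRECTS.get(path)
    if target is not None:
        return 301, target
    return 200, None
```

Since users and bots both receive the 301, the duplicate path stops being reachable as a page in its own right, which matches the trade-off the slide lists under "What it's bad for".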
  13. Canonical URL Tag
      • What it is: Meta tag that tells bots which instance of the page to index. If there are multiple instances of the same content, they are consolidated together into a single URL.
        – E.g., <link rel="canonical" href=""/>
      • Message to bots: “For purposes of search listings, this content belongs to such-and-such URL”.
      • What it’s good for:
        – Consolidating link equity among various versions of the same content.
        – Allows you to maintain different versions without incurring a penalty or forfeiting any link equity.
        – Prevents accidental indexing of trivially/accidentally different URLs.
      • What it’s bad for:
        – Slow to work.
        – Officially just a “suggestion”, not a “rule”.
        – Not 100% effective at keeping pages out of index.
      • Suggested Use: Any page that gets listed in search engines. Especially:
        – Pages with lots of meaningless URL parameters
        – Pages with case-insensitive URLs (choose a single, canonical capitalization format)
        – Pages that can be accessed on multiple weird subdomains
      • Warning: Do not just dynamically put the page URL here.
        – Make sure it is actually a canonical version of the URL.
        – Should not vary based on capitalization.
        – Should not dynamically insert domain. Should specify the correct domain for search index.
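One way to follow the warning on this slide is to compute the canonical URL from fixed rules instead of echoing the requested URL. The sketch below assumes a single hard-coded search domain (`CANONICAL_HOST`), case-insensitive paths that are canonicalized to lowercase, and a whitelist of query parameters that actually change the content (`MEANINGFUL_PARAMS`). All of these names and values are illustrative assumptions.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"   # assumption: the one domain meant for the search index
MEANINGFUL_PARAMS = {"q", "page"}    # assumption: parameters that change the content


def canonical_url(url: str) -> str:
    """Map any variant of a URL to one canonical form."""
    parts = urlsplit(url)
    # Pick a single capitalization; lowercase is an arbitrary but consistent choice.
    path = parts.path.lower()
    # Drop parameters that do not affect the content, and sort the rest.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in MEANINGFUL_PARAMS
    )
    # The host is fixed rather than copied from the request.
    return urlunsplit(("https", CANONICAL_HOST, path, urlencode(kept), ""))


def canonical_tag(url: str) -> str:
    """Render the canonical link element for the page at url."""
    return '<link rel="canonical" href="%s"/>' % escape(canonical_url(url), quote=True)
```

With these assumptions, a request for a mixed-case path on an odd subdomain with tracking parameters yields the same tag as the plain page, so capitalization, junk parameters, and stray subdomains all collapse to one URL.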
  14. Prevention
      • Nothing else is more effective than abstinence.
      • Foresee potential duplicate content issues, and build technologies to prevent them.
        – For user-generated pages, suggest an already-created page rather than create a new one with the same topic. (E.g., Quora)
        – For automatically-created pages, do programmatic de-duplication before pages are created.
          • Does this query return all the same content as some other query?
          • Don’t include words/characters in URL that don’t affect the query results.
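The prevention idea can be sketched as a registry that normalizes a proposed page title before creating anything, and returns the existing page when the normalized key is already taken. The `topic_key` normalization, the `PageRegistry` class, and the `/q/` URL pattern are hypothetical; real de-duplication would likely need fuzzier matching than this exact-key comparison.

```python
import re


def topic_key(title: str) -> str:
    """Normalize a title so trivially different titles produce the same key."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


class PageRegistry:
    """Creates a page only when no page with the same normalized topic exists."""

    def __init__(self):
        self._pages = {}

    def create_or_suggest(self, title: str):
        key = topic_key(title)
        if key in self._pages:
            # Suggest the already-created page instead of making a duplicate.
            return ("existing", self._pages[key])
        url = "/q/" + key
        self._pages[key] = url
        return ("created", url)
```

Because the URL is built from the normalized key, capitalization and punctuation never reach the URL, which also covers the slide's point about leaving out characters that do not affect the results.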
  15. Technique Comparison Chart
      Columns: (1) Prevents penalties; (2) Allows for alternate styling; (3) Fast & effective removal from index; (4) Propagates link equity to all outbound links; (5) Consolidates link equity from all inbound links

      Technique                         (1)  (2)  (3)  (4)  (5)
      Robots.txt                         ✔    ✔    ✔    ✗    ✗
      Meta Robots “noindex,nofollow”     ✔    ✔    ✔    ✗    ✗
      Meta Robots “noindex,follow”       ✔    ✔    ✔    ✔    ✗
      301 Redirect                       ✔    ✗    ✔    ✔    ✔
      Canonical URL Tag                  ✔    ✔    ✗    ✔    ✔
      Prevention                         ✔    ✗    ✔    ✔    ✔
  16. Related: International versions with rel="alternate"
      • Used in combination with rel="canonical"
      • Tells Google if and when there are country/language specific versions of the page.
      • Different versions share link equity and other ranking signals.
      • Google SERP links to the appropriate country-specific version for each user.
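As a small sketch of the rel="alternate" annotation, the helper below renders one link element per regional version, using the `hreflang` attribute that accompanies rel="alternate" for language and country targeting. The `alternate_links` function, the language codes, and the `example.com` URLs are assumptions for illustration; the deck does not show its own markup.

```python
from html import escape


def alternate_links(versions: dict) -> list:
    """Render a rel="alternate" link element for each language/region version.

    versions maps a language-region code (e.g. "en-gb") to that version's URL.
    """
    return [
        '<link rel="alternate" hreflang="%s" href="%s"/>'
        % (escape(code, quote=True), escape(url, quote=True))
        for code, url in sorted(versions.items())
    ]
```

In this sketch the same set of link elements would be emitted on every regional version of the page, so each version points at all of its siblings.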
  17. Related: Deleted and Expired Content
      • Sometimes content gets intentionally deleted.
        – Community Terms violation.
        – Legal/copyright issues.
        – Terminated partnerships.
        – Expired or no longer valuable.
      • User experience options for deleted/empty pages
        a) 301 redirect to another relevant page
        b) Replace with “content deleted” message and links to other relevant pages.
        c) Generic error message.
      • HTTP/robots treatment of deleted/empty pages
        a) 301
        b) 404
        c) 200 with meta robots “noindex,follow”
        d) 200 that can be indexed → Duplicate content
  18. Suggested Reading
      • content
      • best-practices-in-plain-english-44475
      • the-most-important-advancement-in-seo-practices-since-sitemaps