Dealing with Duplicate Content
Ehren Reilly
May 8, 2012
A presentation I shared with members of the Ask team before we embarked on a major duplicate content cleanup project.

Published in: Technology, Design
  1. Dealing with Duplicate Content. Ehren Reilly, May 8, 2012
  2. Agenda
     • What is duplicate content? Why is it bad?
     • Examples of duplicate content on our site and other notable sites
     • Techniques
       – robots.txt disallow
       – Meta robots tag
       – 301 redirect dupe URL to primary URL
       – Canonical URL tag
       – Prevent duplicate page from being created in the first place
     • Related topics
       – rel="alternate" for language/regional support
       – Expired and no-longer-relevant content, and strategic content deletion
     • Suggested reading
  4. What is duplicate content?
     • The same content often appears at more than one URL
     • Intentional duplication
       – Quotation
       – Re-use
         • E.g., Ask.com Wikipedia
       – Content syndication
     • Inadvertent duplication
       – Separate mobile-optimized or printer-optimized version of page
       – Separate regional design or branding
         • E.g., uk.ask.com/wiki/Rihanna vs www.ask.com/wiki/Rihanna
       – Dynamic content where different queries return same results
         • E.g., ask.com/questions-about/iPhones vs ask.com/questions-about/iphone vs ask.com/questions-about/iPhone
       – Extra junk in URL that does not substantively change content
         • E.g., www.ask.com/wiki/Symbolics?qsrc=3044 vs www.ask.com/wiki/Symbolics
       – Pagination and filtering
         • E.g.,
     • Manipulative duplication
       – Scraper sites
       – Blatant copyright infringement / plagiarism
       – SEO spam
  5. Why is duplicate content a problem?
     • Fragmentation of link equity, authority & anchor text
       – If there are 100 links to “iphone” and 50 links to “iPhones”:
         • Do I treat this as a single page with 150 links?
         • Do I treat both pages as separate and important? If so, “iphone” has 100 links' worth of importance, and “iPhones” has 50 links' worth of importance.
     • Lower confidence in single, definitive source
       – If there are many versions, which version is the definitive one?
       – Which URL has the most relevant/reliable copy of this for a given search query?
       – “I know ask.com is a good source for X, but I can’t figure out which of these URLs is ask.com’s definitive page on X.”
     • Penalties for manipulative and non-user-friendly duplication
       – Posting exact same content on multiple different sites
       – Panda penalty for “thin content”
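As a rough illustration of the 100-vs-50 example above, the sketch below adds up inbound links per URL, first with the two variants treated as separate pages and then with one mapped onto the other. The simple additive model and the `consolidate` helper are assumptions made for illustration; search engines do not publish how they actually combine link signals.

```python
from collections import Counter

def consolidate(link_counts, canonical_of):
    """Sum inbound link counts per URL after mapping duplicates to a primary URL.

    canonical_of maps a duplicate URL to its primary URL (hypothetical helper data).
    """
    totals = Counter()
    for url, count in link_counts.items():
        totals[canonical_of.get(url, url)] += count
    return dict(totals)

links = {"/questions-about/iphone": 100, "/questions-about/iPhones": 50}

# No consolidation: two pages, each with only part of the link equity.
fragmented = consolidate(links, {})

# With consolidation: one page credited with all 150 links.
merged = consolidate(links, {"/questions-about/iPhones": "/questions-about/iphone"})
```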
  6. Examples from Ask.com
     • Case-insensitive URLs
       – /q/ vs /Q/
         • www.ask.com/q/What-Causes-Sepsis
         • www.ask.com/Q/What-Causes-Sepsis
       – Questions About page paths
         • ask.com/questions-about/t-rex
         • ask.com/questions-about/T-Rex
     • Duplicate questions in Ask.com Community (e.g.)
     • US vs UK Ask.com wiki
       – uk.ask.com/wiki/Rihanna vs www.ask.com/wiki/Rihanna
     • Accidentally indexable weird subdomains
       – replyask.lc.iad.www.ask.com
  7. Examples from Other Sites
     • Same people’s bios used verbatim for two different brands’ websites.
       – http://www.google.com/search?q=pangea+media+snapapp+management+team
       – http://www.google.com/search?q=snapapp+management+team
     • Google will never show you both versions for a single search query.
  8. Examples from Other Sites
     • Facebook has massive duplicate content issues, and as a result deep pages do not rank well in Google search results.
       – Five different versions of NYC Ballet’s “Videos” page.
       – None of them is on the first page in Google for New York City Ballet Videos.
         • Instead, main facebook.com/nycballet page is #8 in Google.
     • In this heavily re-blogged post from Google’s blog
       – Many people quoted this passage.
       – The original source shows up first in the SERP.
  9. Techniques for Managing Duplicate Content
     • Many ways duplicate content can arise and many techniques to manage it.
     • Different techniques are better suited to different situations.
     • Things to consider about each method:
       – Prevents penalties?
       – Allows for alternate styling?
       – Speed/effectiveness?
       – Propagates link equity to all outbound links?
       – Consolidates link equity from all inbound links?
  10. Robots.txt “Disallow”
     • What it is: File on site that tells bots how to crawl various sections of your site. Specific to each subdomain.
     • Message to bots: “Don’t crawl this content, don’t put it in your index, and disregard any links that point here. Go away.”
     • What it’s good for:
       – Sections of the site that have no SEO value.
       – Secret stuff that you don’t want getting crawled.
     • What it’s bad for:
       – Inelegant, brute-force way of dealing with duplicate content.
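A minimal sketch of what a Disallow rule looks like and how a well-behaved crawler would interpret it, using Python's standard `urllib.robotparser`. The `/print/` and `/internal/` sections are hypothetical paths chosen for illustration; they are not taken from Ask.com's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for one subdomain; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /internal/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A printer-optimized duplicate under /print/ is off limits to all bots...
print_ok = parser.can_fetch("*", "http://www.ask.com/print/What-Causes-Sepsis")

# ...while the primary URL is still crawlable.
question_ok = parser.can_fetch("*", "http://www.ask.com/q/What-Causes-Sepsis")
```

Because the rule works on whole path prefixes, everything under a disallowed section is cut off at once, which is the brute-force quality the slide warns about.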
  11. Meta Robots
     • What it is: Meta tag on individual page, which is like a more targeted version of robots.txt.
     • Message to bots: Has two separate parameters.
       – index/noindex: Should this page be crawled & indexed by the bot?
       – follow/nofollow: Should links out from this page be allowed to propagate link equity?
       ☞ Usually, if you’re trying to block a page from the index, but it has links to other indexed pages, you want <meta name="robots" content="noindex,follow">
     • What it’s good for:
       – More targeted version of robots.txt
       – Allows you to block from index but still propagate link equity.
       – Great for deep pages of paginated/listed content.
     • What it’s bad for:
       – Alternate versions of content that users might actually want to find from search.
     • Suggested Use: eHow Content Pages that they won’t let us use for SEO.
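The two parameters can be combined programmatically. The sketch below builds the tag from two booleans and applies it to paginated content, keeping page 1 indexable and marking deeper pages "noindex,follow". The function names and the page-1 policy are assumptions made for illustration, not a description of how Ask.com generates its tags.

```python
def robots_meta(index: bool, follow: bool) -> str:
    """Build a meta robots tag from the two independent parameters."""
    index_part = "index" if index else "noindex"
    follow_part = "follow" if follow else "nofollow"
    return f'<meta name="robots" content="{index_part},{follow_part}">'

def robots_meta_for_page(page_number: int) -> str:
    """Hypothetical policy: only the first page of a paginated list is indexed,
    but links on every page still propagate link equity."""
    return robots_meta(index=(page_number == 1), follow=True)

first_page_tag = robots_meta_for_page(1)
deep_page_tag = robots_meta_for_page(3)
```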
  12. 301 Redirect
     • What it is: Permanent redirection of duplicate/old URL to primary/new URL.
     • Message to bots: “Don’t go to that old URL, go to this new one. Remove the old one from the index, and forward all link equity to the new one.”
     • What it’s good for:
       – Consolidating content that exists in unnecessary variations.
       – Preserving the value of links, no matter which URL they link to.
     • What it’s bad for:
       – Not possible to maintain alternate versions of content, since both users and bots are redirected to a different URL.
     • Suggested Use: Content deleted from Community because it is redundant/duplicate.
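One way to picture the redirect decision is a small function that inspects the requested path and returns a 301 with the primary URL when the path is only a case variant of it. This is a sketch for the /q/ vs /Q/ case from the earlier examples; `redirect_for` and `CANONICAL_PREFIX` are hypothetical names, and a real site would wire this logic into its web server or framework.

```python
from typing import Optional, Tuple

CANONICAL_PREFIX = "/q/"  # assumed primary form of the question path prefix

def redirect_for(path: str) -> Optional[Tuple[int, str]]:
    """Return (301, primary_path) if the prefix differs from the primary form
    only by capitalization; return None if no redirect is needed."""
    prefix = path[:len(CANONICAL_PREFIX)]
    rest = path[len(CANONICAL_PREFIX):]
    if prefix != CANONICAL_PREFIX and prefix.lower() == CANONICAL_PREFIX:
        return 301, CANONICAL_PREFIX + rest
    return None

duplicate = redirect_for("/Q/What-Causes-Sepsis")
primary = redirect_for("/q/What-Causes-Sepsis")
```

Since users and bots both receive the redirect, only one version of the page remains reachable, which is why this technique rules out alternate versions.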
  13. Canonical URL Tag
     • What it is: A <link> tag in the page head that tells bots which instance of the page to index. If there are multiple instances of the same content, they are consolidated together into a single URL.
       – E.g., <link rel="canonical" href="http://www.ask.com/questions-about/T-Rex"/>
     • Message to bots: “For purposes of search listings, this content belongs to such-and-such URL.”
     • What it’s good for:
       – Consolidating link equity among various versions of the same content.
       – Allows you to maintain different versions without incurring a penalty or forfeiting any link equity.
       – Prevents accidental indexing of trivially/accidentally different URLs.
     • What it’s bad for:
       – Slow to work.
       – Officially just a “suggestion”, not a “rule”.
       – Not 100% effective at keeping pages out of index.
     • Suggested Use: Any page that gets listed in search engines. Especially:
       – Pages with lots of meaningless URL parameters
       – Pages with case-insensitive URLs (choose a single, canonical capitalization format)
       – Pages that can be accessed on multiple weird subdomains
     • Warning: Do not just dynamically put the page URL here.
       – Make sure it is actually a canonical version of the URL.
       – Should not vary based on capitalization.
       – Should not dynamically insert domain. Should specify the correct domain for search index.
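The warnings above can be sketched as a function that derives the canonical URL from fixed rules instead of echoing the requested URL: the domain is hard-coded, known junk parameters are dropped, and capitalization comes from a lookup. The `IGNORED_PARAMS` set and the `CANONICAL_PATHS` table are assumptions made for illustration; a real site would populate them from its own URL scheme.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

CANONICAL_HOST = "www.ask.com"   # fixed domain, never taken from the request
IGNORED_PARAMS = {"qsrc"}        # assumed list of parameters that do not change content
# Assumed lookup: lowercase path -> the single chosen capitalization.
CANONICAL_PATHS = {"/questions-about/t-rex": "/questions-about/T-Rex"}

def canonical_url(requested_url: str) -> str:
    """Map any variant of a URL to its one canonical form."""
    parts = urlsplit(requested_url)
    path = CANONICAL_PATHS.get(parts.path.lower(), parts.path)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunsplit(("http", CANONICAL_HOST, path, urlencode(kept), ""))

def canonical_link_tag(requested_url: str) -> str:
    """Render the tag to place in the page head."""
    return f'<link rel="canonical" href="{escape(canonical_url(requested_url))}"/>'

tag = canonical_link_tag(
    "http://replyask.lc.iad.www.ask.com/questions-about/t-rex?qsrc=3044"
)
```

Here a request on a stray subdomain, with lowercase capitalization and a tracking parameter, still yields the same tag as the primary URL would.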
  14. Prevention
     • Nothing else is more effective than abstinence.
     • Foresee potential duplicate content issues, and build technologies to prevent them.
       – For user-generated pages, suggest an already-created page rather than create a new one with the same topic. (E.g., Quora)
       – For automatically-created pages, do programmatic de-duplication before pages are created.
         • Does this query return all the same content as some other query?
         • Don’t include words/characters in URL that don’t affect the query results.
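Programmatic de-duplication for automatically created pages might look like the sketch below: before a page is created for a query, its result set is compared against pages that already exist, and the existing URL is reused on a match. Comparing sets of result IDs is an assumption about how "returns all the same content" would be checked, and `PageRegistry` is a hypothetical name.

```python
class PageRegistry:
    """Tracks which result sets already have a page, so that a second query
    returning the same results reuses the first page's URL."""

    def __init__(self):
        self._url_by_results = {}

    def url_for(self, query, result_ids):
        # Order-insensitive fingerprint of what the page would contain.
        key = frozenset(result_ids)
        if key not in self._url_by_results:
            self._url_by_results[key] = "/questions-about/" + query
        return self._url_by_results[key]

registry = PageRegistry()
first = registry.url_for("iPhone", [1, 2, 3])
second = registry.url_for("iphone", [3, 2, 1])   # same results, different query text
other = registry.url_for("t-rex", [9])
```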
  15. Technique Comparison Chart

     Columns:
       (A) Prevents penalties
       (B) Allows for alternate styling
       (C) Fast & effective removal from index
       (D) Propagates link equity to all outbound links
       (E) Consolidates link equity from all inbound links

     Technique                        | A | B | C | D | E
     Robots.txt                       | ✔ | ✔ | ✔ | ✗ | ✗
     Meta Robots “noindex,nofollow”   | ✔ | ✔ | ✔ | ✗ | ✗
     Meta Robots “noindex,follow”     | ✔ | ✔ | ✔ | ✔ | ✗
     301 Redirect                     | ✔ | ✗ | ✔ | ✔ | ✔
     Canonical URL Tag                | ✔ | ✔ | ✗ | ✔ | ✔
     Prevention                       | ✔ | ✗ | ✔ | ✔ | ✔
  16. Related: International versions with rel="alternate"
     • Used in combination with rel="canonical"
     • Tells Google if and when there are country/language specific versions of the page.
     • Different versions share link equity and other ranking signals.
     • Google SERP links to the appropriate country-specific version for each user.
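In practice, rel="alternate" is paired with an hreflang attribute that names the language or region each version targets. The sketch below renders one tag per version of the Rihanna wiki page from the earlier examples. The "en-us" and "en-gb" codes are assumptions about which audience each domain serves, and `alternate_link_tags` is a hypothetical helper.

```python
from html import escape

def alternate_link_tags(versions):
    """Render one rel="alternate" tag per language/region version.

    versions maps an hreflang code to the URL of that version."""
    return [
        f'<link rel="alternate" hreflang="{escape(code)}" href="{escape(url)}"/>'
        for code, url in versions.items()
    ]

# Assumed language/region codes for the two domains.
RIHANNA_VERSIONS = {
    "en-us": "http://www.ask.com/wiki/Rihanna",
    "en-gb": "http://uk.ask.com/wiki/Rihanna",
}

tags = alternate_link_tags(RIHANNA_VERSIONS)
```

The same full set of tags would normally appear on every version of the page, so that each version points at all of its siblings.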
  17. Related: Deleted and Expired Content
     • Sometimes content gets intentionally deleted.
       – Community Terms violation.
       – Legal/copyright issues.
       – Terminated partnerships.
       – Expired or no longer valuable.
     • User experience options for deleted/empty pages
       a) 301 redirect to another relevant page
       b) Replace with “content deleted” message and links to other relevant pages.
       c) Generic error message.
     • HTTP/robots treatment of deleted/empty pages
       a) 301
       b) 404
       c) 200 with meta robots “noindex,follow”
       d) 200 that can be indexed → duplicate content
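The options above can be combined into a single decision function. The sketch below is one possible policy, not one the slides prescribe: redirect when a relevant replacement page exists, serve a "content deleted" page with "noindex,follow" when there is a message to show, and otherwise return 404. It deliberately never returns an indexable 200, since that is the option that produces duplicate content. The function name and return shape are assumptions.

```python
def deleted_page_response(redirect_target=None, has_replacement_message=False):
    """Return (status, headers, meta_robots) for a deleted or empty page.

    One possible policy, chosen for illustration:
      - relevant replacement page exists -> 301 to it
      - "content deleted" page available -> 200 with noindex,follow
      - otherwise                        -> 404
    """
    if redirect_target:
        return 301, {"Location": redirect_target}, None
    if has_replacement_message:
        return 200, {}, "noindex,follow"
    return 404, {}, None

redirected = deleted_page_response("/q/What-Causes-Sepsis")
message_page = deleted_page_response(has_replacement_message=True)
gone = deleted_page_response()
```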
  18. Suggested Reading
     • http://www.seomoz.org/learn-seo/duplicate-content
     • http://searchengineland.com/8-canonicalization-best-practices-in-plain-english-44475
     • http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139394
     • http://www.seomoz.org/blog/canonical-url-tag-the-most-important-advancement-in-seo-practices-since-sitemaps