Dealing with duplicate content



A presentation I shared with members of the Ask team before we embarked on a major duplicate content cleanup project.




Dealing with duplicate content Presentation Transcript

  • 1. Dealing with Duplicate Content
    Ehren Reilly
    May 8, 2012
  • 2. Agenda
    Why focus on Answers?
    • What is duplicate content? Why is it bad?
    • Examples of duplicate content on our site and other notable sites
    • Techniques
      – robots.txt disallow
      – Meta robots tag
      – 301 redirect dupe URL to primary URL
      – Canonical URL tag
      – Prevent duplicate page from being created in the first place
    • Related topics
      – rel="alternate" for language/regional support
      – Expired and no-longer-relevant content, and strategic content deletion
    • Suggested reading
  • 3. Agenda
    • What is duplicate content & why is it bad?
    • Examples of duplicate content on our site and other notable sites
    • Techniques
      – robots.txt disallow
      – Meta robots tag
      – 301 redirect dupe URL to primary URL
      – Canonical URL tag
      – Prevent duplicate page from being created in the first place
    • Related topics
      – rel="alternate" for language/regional support
      – Expired and no-longer-relevant content, and strategic content deletion
    • Suggested reading
  • 4. What is duplicate content?
    • The same content often appears at more than one URL
    • Intentional duplication
      – Quotation
      – Re-use
        • E.g. Ask.com Wikipedia
      – Content syndication
    • Inadvertent duplication
      – Separate mobile-optimized or printer-optimized version of page.
      – Separate regional design or branding
        • E.g., uk.ask.com/wiki/Rihanna vs www.ask.com/wiki/Rihanna
      – Dynamic content where different queries return same results
        • E.g., ask.com/questions-about/iPhones vs ask.com/questions-about/iphone vs ask.com/questions-about/iPhone
      – Extra junk in URL that does not substantively change content.
        • E.g., www.ask.com/wiki/Symbolics?qsrc=3044 vs www.ask.com/wiki/Symbolics
      – Pagination and filtering
        • E.g.,
    • Manipulative duplication
      – Scraper sites
      – Blatant copyright infringement / plagiarism
      – SEO spam
  • 5. Why is duplicate content a problem?
    • Fragmentation of link equity, authority & anchor text
      – If there are 100 links to “iphone” and 50 links to “iPhones”:
        • Do I treat this as a single page with 150 links?
        • Do I treat both pages as separate and important? If so, “iphone” is 100 links' worth of importance, and “iPhones” is 50 links' worth of importance.
    • Lower confidence in single, definitive source
      – If there are many versions, which version is the definitive one?
      – Which URL has the most relevant/reliable copy of this for a given search query?
      – “I know ask.com is a good source for X, but I can’t figure out which of these URLs is ask.com’s definitive page on X.”
    • Penalties for manipulative and non-user-friendly duplication
      – Posting exact same content on multiple different sites
      – Panda penalty for “thin content”
  • 6. Examples from Ask.com
    • Case-insensitive URLs
      – /q/ vs /Q/
        • www.ask.com/q/What-Causes-Sepsis
        • www.ask.com/Q/What-Causes-Sepsis
      – Questions About page paths
        • ask.com/questions-about/t-rex
        • ask.com/questions-about/T-Rex
    • Duplicate questions in Ask.com Community (e.g.)
    • US vs UK Ask.com wiki
      – uk.ask.com/wiki/Rihanna vs www.ask.com/wiki/Rihanna
    • Accidentally indexable weird subdomains
      – replyask.lc.iad.www.ask.com
  • 7. Examples from Other Sites
    Same people’s bios used verbatim for two different brands’ websites.
    http://www.google.com/search?q=pangea+media+snapapp+management+team
    http://www.google.com/search?q=snapapp+management+team
    Google will never show you both versions for a single search query.
  • 8. Examples from Other Sites
    • Facebook has massive duplicate content issues, and as a result deep pages do not rank well in Google search results.
      – Five different versions of NYC Ballet’s “Videos” page.
      – None of them is on the first page in Google for New York City Ballet Videos.
        • Instead, main facebook.com/nycballet page is #8 in Google.
    • In this heavily re-blogged post from Google’s blog
      – Many people quoted this passage.
      – The original source shows up first in the SERP
  • 9. Techniques for Managing Duplicate Content
    • Many ways duplicate content can arise and many techniques to manage it.
    • Different techniques are better suited to different situations.
    • Things to consider about each method:
      – Prevents penalties?
      – Allows for alternate styling?
      – Speed/effectiveness?
      – Propagates link equity to all outbound links?
      – Consolidates link equity from all inbound links?
  • 10. Robots.txt “Disallow”
    • What it is: File on site that tells bots how to crawl various sections of your site. Specific to each subdomain.
    • Message to bots: “Don’t crawl this content, don’t put it in your index, and disregard any links that point here. Go away.”
    • What it’s good for:
      – Sections of the site that have no SEO value.
      – Secret stuff that you don’t want getting crawled.
    • What it’s bad for:
      – Inelegant, brute-force way of dealing with duplicate content.
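A Disallow rule can be checked offline before it ships. The sketch below uses Python's standard `urllib.robotparser`; the `/print/` section, the `www.example.com` host, and the helper name `is_crawl_allowed` are illustrative assumptions, not rules from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block a printer-friendly section for all bots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/print/wiki/Rihanna",  # under the disallowed path
        "http://www.example.com/wiki/Rihanna",        # not covered by any rule
    ):
        print(url, is_crawl_allowed(ROBOTS_TXT, url))
```

Because robots.txt is specific to each subdomain, a check like this would need to run against the file served by every host that can answer for the content.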
  • 11. Meta Robots
    • What it is: Meta tag on individual page, which is like a more targeted version of robots.txt.
    • Message to bots: Has two separate parameters.
      – index/noindex: Should this page be crawled & indexed by the bot?
      – follow/nofollow: Should links out from this page be allowed to propagate link equity?
      ☞ Usually, if you’re trying to block a page from the index, but it has links to other indexed pages, you want <meta name="robots" content="noindex,follow">
    • What it’s good for:
      – More targeted version of robots.txt
      – Allows you to block from index but still propagate link equity.
      – Great for deep pages of paginated/listed content.
    • What it’s bad for:
      – Alternate versions of content that users might actually want to find from search.
    • Suggested Use: eHow Content Pages that they won’t let us use for SEO.
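The two parameters combine into a single tag, which a template helper could emit. This is a minimal sketch; the function names and the rule "index only the first page of a paginated list" are assumptions chosen to illustrate the "deep pages of paginated content" case above.

```python
def meta_robots_tag(index: bool, follow: bool) -> str:
    """Build a meta robots tag from the two independent parameters."""
    directives = [
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ]
    return '<meta name="robots" content="{}">'.format(",".join(directives))


def tag_for_paginated_page(page_number: int) -> str:
    """Assumed policy: index page 1 only; deeper pages still pass link equity."""
    return meta_robots_tag(index=(page_number == 1), follow=True)


if __name__ == "__main__":
    print(tag_for_paginated_page(1))
    print(tag_for_paginated_page(7))
```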
  • 12. 301 Redirect
    • What it is: Permanent redirection of duplicate/old URL to primary/new URL.
    • Message to bots: “Don’t go to that old URL, go to this new one. Remove the old one from the index, and forward all link equity to the new one.”
    • What it’s good for:
      – Consolidating content that exists in unnecessary variations.
      – Preserving the value of links, no matter which URL they link to.
    • What it’s bad for:
      – Not possible to maintain alternate versions of content, since both users and bots are redirected to a different URL.
    • Suggested Use: Content deleted from Community because it is redundant/duplicate.
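At the server level, a 301 is a status code plus a `Location` header. The sketch below models that as a pure function with a lookup table; the table name, the `respond` helper, and the choice of the lowercase `/q/` path as the primary URL are assumptions for illustration.

```python
from http import HTTPStatus
from typing import Dict, Tuple

# Hypothetical mapping from duplicate paths to the primary path.
PRIMARY_FOR_DUPLICATE: Dict[str, str] = {
    "/Q/What-Causes-Sepsis": "/q/What-Causes-Sepsis",
}


def respond(path: str) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers): 301 to the primary URL for known duplicates."""
    primary = PRIMARY_FOR_DUPLICATE.get(path)
    if primary is not None:
        return int(HTTPStatus.MOVED_PERMANENTLY), {"Location": primary}
    return int(HTTPStatus.OK), {}


if __name__ == "__main__":
    print(respond("/Q/What-Causes-Sepsis"))
    print(respond("/q/What-Causes-Sepsis"))
```

Both users and bots receive the same response, which is why this approach cannot keep an alternate version of the page available.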
  • 13. Canonical URL Tag
    • What it is: Link tag in the page head that tells bots which instance of the page to index. If there are multiple instances of the same page, they are consolidated together into a single URL.
      – E.g., <link rel="canonical" href="http://www.ask.com/questions-about/T-Rex"/>
    • Message to bots: “For purposes of search listings, this content belongs to such-and-such URL”.
    • What it’s good for:
      – Consolidating link equity among various versions of the same content.
      – Allows you to maintain different versions without incurring a penalty or forfeiting any link equity.
      – Prevents accidental indexing of trivially/accidentally different URLs.
    • What it’s bad for:
      – Slow to work.
      – Officially just a “suggestion”, not a “rule”.
      – Not 100% effective at keeping pages out of index.
    • Suggested Use: Any page that gets listed in search engines. Especially:
      – Pages with lots of meaningless URL parameters
      – Pages with case-insensitive URLs (choose a single, canonical capitalization format)
      – Pages that can be accessed on multiple weird subdomains
    • Warning: Do not just dynamically put the page URL here.
      – Make sure it is actually a canonical version of the URL.
      – Should not vary based on capitalization.
      – Should not dynamically insert domain. Should specify the correct domain for search index.
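The warning above amounts to computing the canonical URL rather than echoing the requested one. A minimal sketch follows, under these assumptions: the canonical host is a fixed constant (`www.example.com` here), lowercase is the chosen capitalization format, and `qsrc` is the only parameter known to be meaningless. None of these choices come from the original deck, whose own example keeps mixed case.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.example.com"  # fixed, never taken from the request
IGNORED_PARAMS = {"qsrc"}           # parameters that do not change content


def canonical_url(requested_url: str) -> str:
    """Map any variant of a URL onto one canonical form."""
    parts = urlsplit(requested_url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    # Assumed capitalization policy: lowercase path.
    return urlunsplit(
        (CANONICAL_SCHEME, CANONICAL_HOST, parts.path.lower(), urlencode(kept), "")
    )


def canonical_link_tag(requested_url: str) -> str:
    """Render the <link rel="canonical"> tag for the page head."""
    href = escape(canonical_url(requested_url), quote=True)
    return '<link rel="canonical" href="{}"/>'.format(href)


if __name__ == "__main__":
    print(canonical_link_tag("http://uk.example.com/Wiki/Symbolics?qsrc=3044"))
```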
  • 14. Prevention
    • Nothing else is more effective than abstinence.
    • Foresee potential duplicate content issues, and build technologies to prevent them.
      – For user-generated pages, suggest an already-created page rather than create a new one with the same topic. (E.g., Quora)
      – For automatically-created pages, do programmatic de-duplication before pages are created.
        • Does this query return all the same content as some other query?
        • Don’t include words/characters in URL that don’t affect the query results.
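Programmatic de-duplication for automatically created pages can be sketched as a registry keyed by the set of results a query returns: if a new query returns exactly the same results as an existing page, the existing URL slug is reused. The class, the `slugify` helper, and the result IDs are hypothetical; a real system would also need persistence and a rule for near-identical result sets.

```python
import re
from typing import Dict, FrozenSet, Iterable


def slugify(query: str) -> str:
    """Lowercase the query and collapse other characters into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", query.lower()).strip("-")


class PageRegistry:
    """Remembers which slug already serves a given set of results."""

    def __init__(self) -> None:
        self._slug_by_results: Dict[FrozenSet[str], str] = {}

    def slug_for(self, query: str, result_ids: Iterable[str]) -> str:
        key = frozenset(result_ids)
        existing = self._slug_by_results.get(key)
        if existing is not None:
            return existing  # same content: do not create a second page
        slug = slugify(query)
        self._slug_by_results[key] = slug
        return slug


if __name__ == "__main__":
    registry = PageRegistry()
    print(registry.slug_for("iPhone", ["a1", "a2"]))
    print(registry.slug_for("iPhones", ["a2", "a1"]))  # reuses the first slug
```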
  • 15. Technique Comparison Chart
    Technique | Prevents penalties | Allows for alternate styling | Fast & effective removal from index | Propagates link equity to all outbound links | Consolidates link equity from all inbound links
    Robots.txt | ✔ | ✔ | ✔ | ✗ | ✗
    Meta Robots “noindex,nofollow” | ✔ | ✔ | ✔ | ✗ | ✗
    Meta Robots “noindex,follow” | ✔ | ✔ | ✔ | ✔ | ✗
    301 Redirect | ✔ | ✗ | ✔ | ✔ | ✔
    Canonical URL Tag | ✔ | ✔ | ✗ | ✔ | ✔
    Prevention | ✔ | ✗ | ✔ | ✔ | ✔
  • 16. Related: International versions with rel="alternate"
    • Used in combination with rel="canonical"
    • Tells Google if and when there are country/language specific versions of the page.
    • Different versions share link equity and other ranking signals.
    • Google SERP links to the appropriate country-specific version for each user.
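In markup, each regional version is declared with a `<link rel="alternate">` tag carrying an `hreflang` attribute. The sketch below renders one tag per version; the language codes, the `example.com` hosts, and the function name are assumptions for illustration.

```python
from html import escape
from typing import Dict, List


def alternate_link_tags(versions: Dict[str, str]) -> List[str]:
    """Render one rel="alternate" tag per language/region code, sorted by code."""
    return [
        '<link rel="alternate" hreflang="{}" href="{}"/>'.format(
            escape(code, quote=True), escape(url, quote=True)
        )
        for code, url in sorted(versions.items())
    ]


if __name__ == "__main__":
    # Hypothetical US and UK versions of the same page.
    tags = alternate_link_tags(
        {
            "en-us": "http://www.example.com/wiki/Rihanna",
            "en-gb": "http://uk.example.com/wiki/Rihanna",
        }
    )
    print("\n".join(tags))
```

The same full set of tags would typically appear on every regional version, so that each page lists all of its alternates.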
  • 17. Related: Deleted and Expired Content
    • Sometimes content gets intentionally deleted.
      – Community Terms violation.
      – Legal/copyright issues.
      – Terminated partnerships.
      – Expired or no longer valuable.
    • User experience options for deleted/empty pages
      a) 301 redirect to another relevant page
      b) Replace with “content deleted” message and links to other relevant pages.
      c) Generic error message.
    • HTTP/robots treatment of deleted/empty pages
      a) 301
      b) 404
      c) 200 with meta robots “noindex,follow”
      d) 200 that can be indexed → Duplicate content
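One possible way to pair the user experience options with an HTTP/robots treatment is sketched below. The pairing itself is an assumption, not a recommendation from the deck: redirect when a relevant replacement exists, serve a "content deleted" page with noindex,follow when only related links exist, and fall back to 404 otherwise. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeletedPage:
    replacement_url: Optional[str] = None  # another relevant page, if any
    has_related_links: bool = False        # can we show links to related pages?


@dataclass
class Treatment:
    status: int
    location: Optional[str] = None
    meta_robots: Optional[str] = None


def treatment_for(page: DeletedPage) -> Treatment:
    """Pick an HTTP/robots treatment for a deleted or empty page."""
    if page.replacement_url:
        return Treatment(status=301, location=page.replacement_url)
    if page.has_related_links:
        # "Content deleted" message stays out of the index but passes link equity.
        return Treatment(status=200, meta_robots="noindex,follow")
    return Treatment(status=404)


if __name__ == "__main__":
    print(treatment_for(DeletedPage(replacement_url="/q/What-Causes-Sepsis")))
    print(treatment_for(DeletedPage(has_related_links=True)))
    print(treatment_for(DeletedPage()))
```

The option this mapping never returns is an indexable 200, since that is the case that leaves empty or duplicate pages in the index.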
  • 18. Suggested Reading
    • http://www.seomoz.org/learn-seo/duplicate-content
    • http://searchengineland.com/8-canonicalization-best-practices-in-plain-english-44475
    • http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139394
    • http://www.seomoz.org/blog/canonical-url-tag-the-most-important-advancement-in-seo-practices-since-sitemaps