2. Why focus on Answers?
Agenda
• What is duplicate content? Why is it bad?
• Examples of duplicate content on our site and other notable sites
• Techniques
– robots.txt disallow
– Meta robots tag
– 301 redirect dupe URL to primary URL
– Canonical URL tag
– Prevent duplicate page from being created in the first place
• Related topics
– rel=“alternate” for language/regional support
– Expired and no-longer-relevant content, and strategic content deletion
• Suggested reading
4. What is duplicate content?
• The same content often appears at more than one URL
• Intentional duplication
– Quotation
– Re-use
• E.g., Ask.com Wikipedia
– Content syndication
• Inadvertent duplication
– Separate mobile-optimized or printer-optimized version of a page
– Separate regional design or branding
• E.g., uk.ask.com/wiki/Rihanna vs www.ask.com/wiki/Rihanna
– Dynamic content where different queries return the same results
• E.g., ask.com/questions-about/iPhones vs ask.com/questions-about/iphone vs ask.com/questions-about/iPhone
– Extra junk in the URL that does not substantively change the content
• E.g., www.ask.com/wiki/Symbolics?qsrc=3044 vs www.ask.com/wiki/Symbolics
– Pagination and filtering
• E.g.,
• Manipulative duplication
– Scraper sites
– Blatant copyright infringement / plagiarism
– SEO spam
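Inadvertent duplicates like the ones listed on this slide can be surfaced mechanically by fingerprinting page text and grouping the URLs that share a fingerprint. The following is a minimal Python sketch, not a description of how any search engine works. It assumes you already have a mapping of URL to extracted page text; the `pages` dict, its placeholder text, and the helper names are hypothetical, and exact hashing only catches copies that are identical after whitespace and case normalization.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lower-casing it."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate_groups(pages: dict[str, str]) -> list[list[str]]:
    """Return sorted groups of URLs whose page text has the same fingerprint."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    # Only groups with more than one URL are duplicates.
    return sorted(sorted(urls) for urls in groups.values() if len(urls) > 1)


# Hypothetical crawl output: URL -> extracted body text (placeholder text).
pages = {
    "www.ask.com/wiki/Symbolics": "Symbolics was a computer manufacturer.",
    "www.ask.com/wiki/Symbolics?qsrc=3044": "Symbolics  was a computer manufacturer.",
    "www.ask.com/wiki/Rihanna": "Rihanna is a singer.",
}
duplicate_groups = find_duplicate_groups(pages)
```

Here the two Symbolics URLs land in one group because the `qsrc` parameter does not change the page text.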
5. Why is duplicate content a problem?
• Fragmentation of link equity, authority & anchor text
– If there are 100 links to “iphone” and 50 links to “iPhones”, the search engine has to ask:
• Do I treat this as a single page with 150 links?
• Do I treat both pages as separate and important? If so, “iphone” has 100 links’ worth of importance, and “iPhones” has 50 links’ worth of importance.
• Lower confidence in a single, definitive source
– If there are many versions, which version is the definitive one?
– Which URL has the most relevant/reliable copy of this content for a given search query?
– “I know ask.com is a good source for X, but I can’t figure out which of these URLs is ask.com’s definitive page on X.”
• Penalties for manipulative and non-user-friendly duplication
– Posting the exact same content on multiple different sites
– Panda penalty for “thin content”
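The fragmentation arithmetic from this slide can be written out as a tiny Python sketch. It treats raw inbound link counts as a stand-in for link equity, which is a simplification, and the `primary_page` mapping is a hypothetical choice of which variant is the primary one.

```python
from collections import Counter

# Inbound link counts per URL variant, taken from the example on this slide.
inbound_links = {"iphone": 100, "iPhones": 50}

# Hypothetical mapping from each variant to the single primary page.
primary_page = {"iphone": "iphone", "iPhones": "iphone"}

consolidated = Counter()
for variant, links in inbound_links.items():
    consolidated[primary_page[variant]] += links

# Fragmented: the strongest single variant only gets its own links.
fragmented_best = max(inbound_links.values())
# Consolidated: every link counts toward the one primary page.
consolidated_total = consolidated["iphone"]
```

Fragmented, the best page competes with 100 links; consolidated, the same content competes with 150.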
6. Examples from Ask.com
• Case-insensitive URLs
– /q/ vs /Q/
• www.ask.com/q/What-Causes-Sepsis
• www.ask.com/Q/What-Causes-Sepsis
– Questions About page paths
• ask.com/questions-about/t-rex
• ask.com/questions-about/T-Rex
• Duplicate questions in Ask.com Community (e.g.)
• US vs UK Ask.com wiki
– uk.ask.com/wiki/Rihanna vs www.ask.com/wiki/Rihanna
• Accidentally indexable weird subdomains
– replyask.lc.iad.www.ask.com
7. Examples from Other Sites
Same people’s bios used verbatim for two different brands’ websites.
http://www.google.com/search?q=pangea+media+snapapp+management+team
http://www.google.com/search?q=snapapp+management+team
Google will never show you both versions for a single search query.
8. Examples from Other Sites
• Facebook has massive duplicate content issues, and as a result deep pages do not rank well in Google search results.
– Five different versions of NYC Ballet’s “Videos” page.
– None of them is on the first page in Google for “New York City Ballet Videos”.
• Instead, main facebook.com/nycballet page is #8 in Google.
• In this heavily re-blogged post from Google’s blog:
– Many people quoted this passage.
– The original source shows up first in the SERP.
9. Techniques for Managing Duplicate Content
• Duplicate content can arise in many ways, and there are many techniques to manage it.
• Different techniques are better suited to different situations.
• Things to consider about each method:
– Prevents penalties?
– Allows for alternate styling?
– Speed/effectiveness?
– Propagates link equity to all outbound links?
– Consolidates link equity from all inbound links?
10. Robots.txt “Disallow”
• What it is: File on site that tells bots how to crawl various sections of your site. Specific to each subdomain.
• Message to bots: “Don’t crawl this content, and disregard any links that point here. Go away.”
• What it’s good for:
– Sections of the site that have no SEO value.
– Secret stuff that you don’t want getting crawled.
• What it’s bad for:
– Inelegant, brute-force way of dealing with duplicate content.
– Blocks crawling rather than indexing: a disallowed URL can still be listed in search results if other pages link to it.
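To see how a Disallow rule is interpreted, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `/internal/` and `/print/` paths, and the `ExampleBot` user agent are all made up for illustration; a real file would live at the root of each subdomain it applies to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for www.ask.com; the disallowed paths are made up.
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A printer-optimized duplicate under /print/ may not be crawled...
can_crawl_print = parser.can_fetch(
    "ExampleBot", "http://www.ask.com/print/What-Causes-Sepsis"
)
# ...while the primary page is still crawlable.
can_crawl_question = parser.can_fetch(
    "ExampleBot", "http://www.ask.com/q/What-Causes-Sepsis"
)
```

The rule is all-or-nothing per path prefix, which is why the slide calls it a brute-force tool.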
11. Meta Robots
• What it is: Meta tag on an individual page, which works like a more targeted version of robots.txt.
• Message to bots: Has two separate parameters.
– index/noindex: Should this page be included in the index? (The bot still has to crawl the page to see the tag.)
– follow/nofollow: Should links out from this page be allowed to propagate link equity?
☞ Usually, if you’re trying to block a page from the index, but it has links to other indexed pages, you want <meta name="robots" content="noindex,follow">
• What it’s good for:
– More targeted than robots.txt.
– Allows you to block from index but still propagate link equity.
– Great for deep pages of paginated/listed content.
• What it’s bad for:
– Alternate versions of content that users might actually want to find from search.
• Suggested Use: eHow Content Pages that they won’t let us use for SEO.
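The two parameters are independent, which a short Python sketch makes explicit. The `tag_for_listing_page` policy (index only the first page of a paginated list) is a hypothetical example of the "deep pages of paginated content" use case, not a rule stated in the deck.

```python
def meta_robots_tag(index: bool, follow: bool) -> str:
    """Build a meta robots tag from the two independent parameters."""
    index_value = "index" if index else "noindex"
    follow_value = "follow" if follow else "nofollow"
    return f'<meta name="robots" content="{index_value},{follow_value}">'


def tag_for_listing_page(page_number: int) -> str:
    """Hypothetical policy: index page 1 of a paginated list, noindex deeper pages."""
    return meta_robots_tag(index=(page_number == 1), follow=True)
```

With this policy, page 3 of a listing stays out of the index but its outbound links still pass link equity.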
12. 301 Redirect
• What it is: Permanent redirection of duplicate/old URL to primary/new URL.
• Message to bots: “Don’t go to that old URL, go to this new one. Remove the old one from the index, and forward all link equity to the new one.”
• What it’s good for:
– Consolidating content that exists in unnecessary variations.
– Preserving the value of links, no matter which URL they link to.
• What it’s bad for:
– Not possible to maintain alternate versions of content, since both users and bots are redirected to a different URL.
• Suggested Use: Content deleted from Community because it is redundant/duplicate.
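At the HTTP level, a 301 is just a status code plus a `Location` header. The Python sketch below shows that decision as a pure function; the `PRIMARY_PATH` table and the `respond` helper are hypothetical, and a real site would do this in its web server or framework routing.

```python
from http import HTTPStatus

# Hypothetical map of duplicate/old paths to their primary path.
PRIMARY_PATH = {
    "/Q/What-Causes-Sepsis": "/q/What-Causes-Sepsis",
    "/questions-about/t-rex": "/questions-about/T-Rex",
}


def respond(path: str) -> tuple[int, dict[str, str]]:
    """Return (status, headers): a 301 for known duplicates, otherwise 200."""
    target = PRIMARY_PATH.get(path)
    if target is not None:
        # Both users and bots are sent to the primary URL.
        return HTTPStatus.MOVED_PERMANENTLY.value, {"Location": target}
    return HTTPStatus.OK.value, {}
```

Because every visitor is redirected, the duplicate URL can no longer serve its own variant of the page, which is the trade-off noted on the slide.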
13. Canonical URL Tag
• What it is: A <link> element in the page’s <head> that tells bots which instance of the page to index. If there are multiple instances of the same content, they are consolidated into a single URL.
– E.g., <link rel="canonical" href="http://www.ask.com/questions-about/T-Rex"/>
• Message to bots: “For purposes of search listings, this content belongs to such-and-such URL.”
• What it’s good for:
– Consolidating link equity among various versions of the same content.
– Allows you to maintain different versions without incurring a penalty or forfeiting any link equity.
– Prevents accidental indexing of trivially/accidentally different URLs.
• What it’s bad for:
– Slow to work.
– Officially just a “suggestion”, not a “rule”.
– Not 100% effective at keeping pages out of index.
• Suggested Use: Any page that gets listed in search engines. Especially:
– Pages with lots of meaningless URL parameters
– Pages with case-insensitive URLs (choose a single, canonical capitalization format)
– Pages that can be accessed on multiple weird subdomains
• Warning: Do not just dynamically put the current page URL here.
– Make sure it is actually the canonical version of the URL.
– Should not vary based on capitalization.
– Should not dynamically insert the domain. Should specify the correct domain for the search index.
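The warning on this slide can be sketched in Python: compute the canonical URL from fixed settings rather than echoing the request URL. Everything named here is an assumption for illustration: the `MEANINGFUL_PARAMS` allowlist, the choice of lower-case as the single capitalization format (the slide's own example uses `T-Rex`, so a real site would pick whichever format it has standardized on), and the helper names.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.ask.com"   # fixed; never copied from the incoming request
MEANINGFUL_PARAMS = {"page"}     # hypothetical allowlist; all other params are dropped


def canonical_url(request_url: str) -> str:
    """Map any variant of a URL to one canonical form."""
    parts = urlsplit(request_url)
    # Assumption: lower-case is the chosen capitalization format.
    path = parts.path.lower()
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in MEANINGFUL_PARAMS]
    query = urlencode(sorted(kept))
    return urlunsplit((CANONICAL_SCHEME, CANONICAL_HOST, path, query, ""))


def canonical_link_tag(request_url: str) -> str:
    """Render the <link> element for the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(request_url))}"/>'
```

A request on a stray subdomain with a tracking parameter, such as `http://replyask.lc.iad.www.ask.com/wiki/Symbolics?qsrc=3044`, then yields the same tag as the clean URL.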
14. Prevention
• Nothing else is more effective than abstinence.
• Foresee potential duplicate content issues, and build technologies to prevent them.
– For user-generated pages, suggest an already-created page rather than creating a new one on the same topic. (E.g., Quora)
– For automatically-created pages, do programmatic de-duplication before pages are created.
• Does this query return all the same content as some other query?
• Don’t include words/characters in the URL that don’t affect the query results.
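The "does this query return all the same content as some other query?" check can be sketched in Python as a lookup keyed by the set of results. The `PageRegistry` class, the result IDs, and the rule that the first query seen becomes the page are all hypothetical; this is one possible shape for programmatic de-duplication, not the deck's implementation.

```python
def result_signature(result_ids: list[str]) -> frozenset[str]:
    """Two queries are duplicates if they return the same set of results."""
    return frozenset(result_ids)


class PageRegistry:
    """Hypothetical registry that creates at most one page per result set."""

    def __init__(self) -> None:
        self._page_by_signature: dict[frozenset[str], str] = {}

    def page_for(self, query: str, result_ids: list[str]) -> str:
        signature = result_signature(result_ids)
        # Reuse the page already created for this result set,
        # otherwise register this query as the page for it.
        return self._page_by_signature.setdefault(signature, query)


registry = PageRegistry()
first = registry.page_for("iphone", ["q1", "q2", "q3"])
second = registry.page_for("iPhones", ["q3", "q1", "q2"])
```

Both queries resolve to the "iphone" page, so no second page is ever created for "iPhones".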
15. Technique Comparison Chart
Technique | Prevents penalties | Allows for alternate styling | Fast & effective removal from index | Propagates link equity to all outbound links | Consolidates link equity from all inbound links
Robots.txt | ✔ | ✔ | ✔ | ✗ | ✗
Meta Robots “noindex,nofollow” | ✔ | ✔ | ✔ | ✗ | ✗
Meta Robots “noindex,follow” | ✔ | ✔ | ✔ | ✔ | ✗
301 Redirect | ✔ | ✗ | ✔ | ✔ | ✔
Canonical URL Tag | ✔ | ✔ | ✗ | ✔ | ✔
Prevention | ✔ | ✗ | ✔ | ✔ | ✔
16. Related: International versions with rel=“alternate”
• Used in combination with rel=“canonical”
• Tells Google if and when there are country/language-specific versions of the page.
• Different versions share link equity and other ranking signals.
• Google SERP links to the appropriate country-specific version for each user.
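In markup, the regional versions are declared with `<link rel="alternate" hreflang="...">` elements. The Python sketch below generates them from a table of versions; the `VERSIONS` mapping and the specific `en-us` / `en-gb` codes assigned to the two Rihanna URLs are assumptions for illustration.

```python
from html import escape

# Hypothetical regional versions of one page, keyed by hreflang value.
VERSIONS = {
    "en-us": "http://www.ask.com/wiki/Rihanna",
    "en-gb": "http://uk.ask.com/wiki/Rihanna",
}


def alternate_link_tags(versions: dict[str, str]) -> list[str]:
    """Build one rel="alternate" link element per language/region version."""
    return [
        f'<link rel="alternate" hreflang="{escape(code)}" href="{escape(url)}"/>'
        for code, url in sorted(versions.items())
    ]
```

The same set of tags would typically be emitted on every version of the page, so that each version points at all of its alternates.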
17. Related: Deleted and Expired Content
• Sometimes content gets intentionally deleted.
– Community Terms violation.
– Legal/copyright issues.
– Terminated partnerships.
– Expired or no longer valuable.
• User experience options for deleted/empty pages
a) 301 redirect to another relevant page
b) Replace with “content deleted” message and links to other relevant pages.
c) Generic error message.
• HTTP/robots treatment of deleted/empty pages
a) 301
b) 404
c) 200 with meta robots “noindex,follow”
d) 200 that can be indexed
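The slide lists the options without choosing among them. As one possible way to combine the user-experience and HTTP/robots choices, here is a Python sketch of a decision function. The policy itself (redirect when a relevant page exists, otherwise a noindexed "content deleted" page when there are related links, otherwise 404) is an assumption, as are the `Response` type and the function name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    status: int
    location: Optional[str] = None
    meta_robots: Optional[str] = None


def response_for_deleted_page(
    replacement_url: Optional[str], has_related_links: bool
) -> Response:
    """Hypothetical policy for choosing among the HTTP/robots treatments."""
    if replacement_url:
        # (a) A relevant page exists: redirect permanently to it.
        return Response(status=301, location=replacement_url)
    if has_related_links:
        # (c) Show a "content deleted" page with links, kept out of the index.
        return Response(status=200, meta_robots="noindex,follow")
    # (b) Nothing useful to show: plain not-found.
    return Response(status=404)
```

Option (d), an indexable 200, is deliberately never returned here, since it would leave an empty page in the index.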