Duplicate Content Filters, Penalties and Other Content Minefields

27th March 2012
Search Quality – the Duplicate Content Headache

Google can’t afford a SERP made up of:

1) Search engine optimization
   Search engine optimization (SEO) is the process of improving the
   visibility of a website or a web page in search engines...
2) Search engine optimization
   Search engine optimization (SEO) is the process of improving the
   visibility of a website or a web page in search engines...
3) Search engine optimization
   Search engine optimization (SEO) is the process of improving the
   visibility of a website or a web page in search engines...
4) Search engine optimization
   Search engine optimization (SEO) is the process of improving the
   visibility of a website or a web page in search engines...

Resource – the Duplicate Content Headache
Duplicate content has resource consequences for search engines:

Wastes Crawler resources - finite number of crawlers

Wastes Bandwidth – how often can you crawl 1 trillion documents and
keep your index fresh?

Increases Query CPU time – how do you search 1 trillion documents as
quickly as possible?




Document importance – Duplicate Content Headache
 Duplicate content can be a signal of an important document:

 • Song lyrics

 • Scholarly texts and historical documents, e.g. the Bible (1,000 pages)

 • The Linux manual (2,000 pages)

 • Breaking news – Associated Press, Reuters

 etc.




Types of Duplicate Content
Duplicate content comes in many forms



Intentional vs non-intentional

On-site vs off-site




On-Site Duplicate Content (Impacts Quality Score)
Intentional
• Printer-friendly pages
• Different font sizes
• PDF documents
• Archive (non-graphics versions)
• Shopping filters (sort-by and pagination)
• RSS feeds

Non-intentional
• Affiliate URLs - www.example.com/?btag=123
• AdWords campaigns - www.example.com/?utc=google
• Search results
• www vs non-www URLs
• https vs http
• Stubs/plugins
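
Several of the non-intentional cases above (tracking parameters, www vs non-www, https vs http) are just different spellings of one URL. A minimal sketch of collapsing them to one preferred form is given here. The function name canonical_url, the preferred scheme and host, and the assumption that btag, utc and affid never change the page content are all illustrative, not something a search engine or any tool prescribes.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters only carry tracking data. The names are the
# examples used in this deck; a real site would list its own.
TRACKING_PARAMS = {"btag", "utc", "affid"}


def strip_www(host):
    """Return the host without a leading 'www.' label."""
    return host[4:] if host.startswith("www.") else host


def canonical_url(url, scheme="https", host="www.example.com"):
    """Collapse tracking, www/non-www and http/https variants into one URL."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    if strip_www(netloc) != strip_www(host):
        return url  # Not our site: leave it untouched.
    # Keep only the query parameters that are not known tracking parameters.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    path = parts.path or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, host, path, urlencode(query), ""))


for variant in (
    "http://example.com/?btag=123",
    "https://www.example.com/?utc=google",
    "http://www.example.com/cars?page=2&affid=123456",
):
    print(variant, "->", canonical_url(variant))
```

The output of a function like this is what you would 301 to, or put in a canonical tag; URLs must include the scheme for the host to be recognised.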

On-Site Duplicate Content (Impacts Quality Score)
Worst-case scenario example: tens of thousands of stub pages




  This was 2 weeks after Andy had removed the duplicate links from the search pages on our advice, e.g.:
  http://www.motors.co.uk/Ford-Escort-0-9999999---2
  http://www.motors.co.uk/Ford-Escort-0-9999999--U-2-
  http://www.motors.co.uk/Ford-Escort-0-9999999---2%20-

Off-Site Duplicate Content (Filters and Penalties)
The line between intentional and non-intentional is somewhat grey

Domain branding, e.g. .com, .co.za
(Mobile website)
Content syndication
Content theft
Staging websites are a common problem!



Quality signals are often used to filter off-site duplicates!




How Does Google Filter Off-site Duplicate Content
Authors feel they have a right to rank for their own content –
Google’s loyalty is to its users!

Google doesn’t necessarily reward the source or original, but assesses:

• Relevance (e.g. is an article in context)
• Domain authority & links (e.g. Google Knol, Facebook)
• Fresh content boost

• Site quality signals (e.g. internal duplicate content!)




Examples of Off-site Duplicate Content and Quality
Client with a .com.au and a .com with https duplicates

Casino client with a lot of stub pages (pre-Panda)

Casino site – severe health issues




How to Diagnose (on-site) Duplicate Content
Link building will exacerbate duplicate content indexing

Keep an eye on indexed pages (weekly) and look for spikes in Google
indexing (and in Yahoo and Bing)

Look for duplicates with site:example.com

Use Xenu link checker

Heed any Webmaster Tools warnings

Check your crawling and cache dates
        Frequent updates but stale cache dates = dupe content issues
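
Alongside the checks above, exact on-site duplicates can be found by fingerprinting the text of each crawled page and grouping URLs that share a fingerprint. A minimal sketch follows. It assumes you already have a mapping of URL to extracted body text (from a crawl export, for instance); the example URLs and text are made up. It only catches pages whose text is identical after whitespace and case are normalised, so near-duplicates need a fuzzier comparison that is not shown here.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash the text after lowercasing and collapsing runs of whitespace."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def duplicate_groups(pages):
    """pages maps URL -> body text; returns sorted lists of URLs sharing text."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl data, for illustration only.
crawled = {
    "http://www.example.com/cars": "Ford Escort for sale",
    "http://www.example.com/cars?sort=price": "Ford  Escort\nfor sale",
    "http://www.example.com/vans": "Ford Transit for sale",
}
print(duplicate_groups(crawled))
```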

How to address on-site and off-site duplicate content
You have a whole armoury of potential tools, including:

Robots.txt exclusion
Robots meta tag
Canonical tag
Webmaster URL exclusion
Password protection
(301 redirects)

(File a DMCA against serial content thieves?)

A lot of well-meaning people give bad advice, though




Google Engineers Can’t Agree
Adam Lasnik – “Deftly Dealing with Duplicate Content” (2006)

  Probably the authoritative guide to duplicate content:

  • What is duplicate content?

  • What isn't duplicate content?

  • Why does Google care about duplicate content?

  • What does Google do about it?

  • How can Webmasters proactively address duplicate content issues?

Deftly Dealing with... - Our advice/experience

Robots.txt

Routinely ignored by Google, probably because of malware

User-agent: *
Allow: /the-good-stuff/
Disallow: /the-malware/

In our experience, robots.txt is ignored unless combined with an emergency
Webmaster Tools URL removal (3 months)
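
Before blaming the search engine, it is worth checking what the rules themselves say. A minimal sketch using Python's standard-library parser on the example rules from this slide is given below. The helper name is_crawlable and the test URLs are illustrative. This parser is not Google's, so it only shows how the file reads to a compliant crawler, not how any search engine actually behaves. Note also that robots.txt rules govern crawling; a blocked URL can still be indexed from links pointing at it.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this slide. The rule lines follow the User-agent
# line directly, with no blank line in between.
ROBOTS_TXT = """\
User-agent: *
Allow: /the-good-stuff/
Disallow: /the-malware/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, user_agent="Googlebot"):
    """True if the parsed rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


for url in (
    "http://www.example.com/the-good-stuff/page.html",
    "http://www.example.com/the-malware/page.html",
):
    print(url, "->", "crawlable" if is_crawlable(url) else "blocked")
```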




Our advice/experience

Canonical tag

Works great for cross-domain duplicate content

Largely ineffective for pagination, e.g. shopping sites

Totally ineffective unless canonical URLs are VERY similar if not identical
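
To audit which canonical URL each page actually declares, the tag can be pulled out of the HTML. A minimal sketch with the standard-library HTML parser follows. The class and function names and the sample markup are illustrative. A page should declare one canonical; more than one entry in the returned list is worth investigating.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonicals.append(attr["href"])


def find_canonical(html):
    """Return the list of canonical URLs declared in the given HTML."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


# Hypothetical page, for illustration only.
page = (
    "<html><head>"
    '<link rel="canonical" href="http://www.example.com/shoes">'
    "</head><body>Shoes, sorted by price</body></html>"
)
print(find_canonical(page))
```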




Our advice/experience

Robots Meta Tag

Noindex, Follow - 100% obeyed by Google, and passes PageRank too

Very effective for pagination, e.g. shopping sites

Works well for tracking links too (www.example.com/?affid=123456)

Doesn’t work when robots.txt blocks the page (the crawler never sees the tag)
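
The conflict between a robots meta tag and a blocking robots.txt can be checked mechanically: a crawler that is not allowed to fetch a page cannot read a noindex tag inside it. A minimal sketch follows. The function name noindex_status, its return strings, and the sample page and rules are all illustrative, and the standard-library robots.txt parser stands in for a real crawler's.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives from <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def noindex_status(url, html, robots_txt, user_agent="Googlebot"):
    """Report whether a noindex tag exists and whether a crawler could see it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    meta = RobotsMetaFinder()
    meta.feed(html)
    if "noindex" not in meta.directives:
        return "no noindex tag"
    if not rules.can_fetch(user_agent, url):
        return "noindex hidden by robots.txt"
    return "noindex visible to crawler"


# Hypothetical paginated shop page, for illustration only.
PAGE = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
URL = "http://www.example.com/shoes?page=2"
print(noindex_status(URL, PAGE, "User-agent: *\nDisallow: /shoes\n"))
print(noindex_status(URL, PAGE, "User-agent: *\nDisallow:\n"))
```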




Our advice/experience

Password Protect/htaccess 403 Forbidden

Works great for staging sites

Stubs - a problem, in that it generates Webmaster Tools errors

Our feeling: best avoided on your main domain




Extreme Techniques to Avoid Dupe Content
Make all your backend .exe with htaccess

Summary

 Duplicate content is a minefield!

 Filters usually apply, penalties are very rare

 You have the answer in your own hands

 Stay on top of your site’s health – especially internal duplicate content

Thank you for your attention!

Thanks to:
Anton Groeneveldt
Carla dos Santos
