New perspectives on duplicate content
Alexis Sanders
Senior SEO Manager at Merkle
Omi Sido
Technical SEO at Canon Europe
#OnCrawlinOrbit
Why should SEOs care about duplicate content?
#OnCrawlinOrbit
There is no manual penalty for duplicate content.
Source: October 2015 Google Hangout
Source: 10 Things I Hate About You
the website you don’t want to be
What a user sees vs. what a bot sees:
“Umm, I think I like the white shirt better…”
Source: Introduction to Information Retrieval (ch. 19)
“by some estimates, as many as 40% of the pages on the web are duplicates of other pages”
1. Indexing Challenges
2. Lower Link Impact
3. Internal Competition
4. Poor Crawl Bandwidth
Common sources of duplication:
Content:
• Repetitive pages
• Doorway pages
• Inventory control
• Syndicated content
• PR releases
• Republishing
• Plagiarism
• Non-unique copy
• Localized content
• Thin content

Technical (many of these are simply URL variants of one page; see the sketch below):
• Staging sites
• HTTP vs. HTTPS
• Subdomains
• URL cases
• File extensions
• Trailing slashes
• Index pages
• Parameters
• Pagination
• Mobile configuration
• Internal site search
• Facets
• Sorts
• Image-only pages
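A minimal normalization sketch in Python for the URL-variant sources above, assuming the canonical form is lowercase HTTPS with no trailing slash, index file, or tracking parameters; every rule here is a per-site policy choice, not a universal standard:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical tracking/session parameters to strip; adjust per site.
STRIP_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def normalize(url: str) -> str:
    """Collapse common technical URL variants into one canonical form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = "https"                                   # HTTP vs. HTTPS
    netloc = netloc.lower()                            # host casing
    path = path.lower()                                # URL cases
    if path.endswith(("/index.html", "/index.php")):   # index pages
        path = path[: path.rfind("/")]
    path = path.rstrip("/") or "/"                     # trailing slashes
    query = urlencode(sorted((k, v) for k, v in parse_qsl(query)
                             if k not in STRIP_PARAMS))  # parameters
    return urlunsplit((scheme, netloc, path, query, ""))

print(normalize("HTTP://Example.com/Shoes/index.html?utm_source=x"))
# -> https://example.com/shoes
```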
How can SEOs find and identify duplicate content?
#OnCrawlinOrbit
1. Know your user journey.
2. Create a strong hierarchical URL taxonomy.
3. Prioritize the duplicate content issues that are affecting performance.
4. If the pages are 100% duplicates, consolidate with a 301 redirect (see the sketch after this list).
5. Leverage appropriate signaling.
6. Strategically consolidate, create, delete, and optimize.
7. If your content is stolen: request a canonical tag, or file a DMCA request.
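For step 4, a minimal sketch that verifies duplicates 301 to their canonical target, using the requests library (the URL map and example.com are placeholders; this is not the speakers' tooling):

```python
import requests

# Hypothetical duplicate -> canonical map.
REDIRECT_MAP = {
    "http://example.com/shoes/": "https://example.com/shoes",
    "https://example.com/SHOES": "https://example.com/shoes",
}

for src, target in REDIRECT_MAP.items():
    r = requests.get(src, allow_redirects=True, timeout=10)
    first_hop = r.history[0].status_code if r.history else None
    ok = first_hop == 301 and r.url == target
    print(f"{src} -> {r.url} (first hop: {first_hop}) {'OK' if ok else 'CHECK'}")
```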
oncrawl > Duplicate Content
oncrawl > Duplicate Content > By Group
Google:
• Direct quotes in Google
• Searching via site: queries
• site:
• site: + inurl:
• intitle:
• filetype: (for file extensions)
(Example queries below.)
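For illustration, the operators above in assumed example queries (example.com and the quoted text are placeholders):

```
"an exact sentence lifted from your page" -site:example.com
site:example.com intitle:"product title"
site:example.com inurl:sessionid
site:example.com filetype:pdf
```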
GSC > Coverage > Duplicate …
Plagiarism Tools:
• Quetext
• Noplag
• PaperRater
• Grammarly
• CopyScape

Keyword Density
Resolving duplicate content:
a memorable case
#OnCrawlinOrbit
+64.2% in sessions Y/Y from Google organic, from adding unique content.
+28.7% in sessions Y/Y from Bing organic, from moving from serving only an H1 to server-side rendering (SSR) the full UX.
HTTP vs. HTTPS: an accidental HTTP canonical caused an estimated 5-10% loss. Once fixed, clicks returned to the normal range on HTTPS within 3-4 days.
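A spot-check like the following would catch this class of issue early. A minimal sketch using requests and BeautifulSoup (URLs are placeholders; a real audit would run over a full crawl list):

```python
import requests
from bs4 import BeautifulSoup

def canonical_of(url: str):
    """Return the href of the page's <link rel="canonical">, if any."""
    html = requests.get(url, timeout=10).text
    link = BeautifulSoup(html, "html.parser").find("link", rel="canonical")
    return link.get("href") if link else None

for url in ("https://example.com/", "https://example.com/shoes"):
    canon = canonical_of(url)
    if canon and canon.startswith("http://"):
        print(f"WARNING: {url} canonicalizes to HTTP: {canon}")
```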
What is new for duplicate content
in the past year and a half?
#OnCrawlinOrbit
What will duplicate content management
look like in the future?
#OnCrawlinOrbit
Alexis’ hopes for the future:
• Less technically driven duplicate content (as CMSs wise up)
• More automation (unit testing and external testing)
• Automatic detection of highly similar pages and page types for writers and content managers (see the sketch after this list)
• Google continuing to improve its existing systems and detection
• Perhaps an alert system to escalate cases where Google is not using the right canonical
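On detecting highly similar pages: a minimal word-shingle Jaccard sketch (k=5 and the 0.8 threshold are assumptions to tune, not cited standards; production systems typically use MinHash or simhash at scale):

```python
def shingles(text: str, k: int = 5) -> set:
    """Overlapping k-word windows of the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

page_a = "red trail running shoes with waterproof lining and grippy soles"
page_b = "red trail running shoes with waterproof lining and durable soles"
score = jaccard(page_a, page_b)
print(f"{score:.2f}", "near-duplicate candidates" if score > 0.8 else "distinct")
```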
Do you have a favorite technical trick?
#OnCrawlinOrbit
Alexis’ tech SEO tricks:
• An EC2 remote computer instance
• Checking the mobile-first testing tool
• Switching the user agent to Googlebot (see the sketch below)
• Using TechnicalSEO.com’s robots.txt tool
• Screaming Frog’s log file analyzer
• Made with Love’s htaccess checker
• Using Google Data Studio to report on changes (syncing Sheets with updates, filtering each page by relevant updates)
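Minimal sketches of two of these tricks, assuming the requests library and example.com as a placeholder. The UA string is Google's published desktop Googlebot token, and Python's stdlib robotparser is only a rough local stand-in for Google's own robots.txt parser:

```python
import requests
from urllib import robotparser

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Trick: compare what the site serves a browser vs. Googlebot.
for label, ua in (("browser", "Mozilla/5.0"), ("googlebot", GOOGLEBOT_UA)):
    r = requests.get("https://example.com/", headers={"User-Agent": ua}, timeout=10)
    print(f"{label}: {r.status_code}, {len(r.content)} bytes")

# Trick: rough local robots.txt check (not Google's exact parser).
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("Googlebot", "https://example.com/some-page"))
```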
Do you have a least favorite technical SEO question?
#OnCrawlinOrbit
Do you have a favorite Googlebot?
#OnCrawlinOrbit
Alexis: I like the idea that Googlebot is tired and overworked (from crawling 130 trillion URLs).
Do you have a favorite planet?
#OnCrawlinOrbit
Launching the best SEO tips into space
Next up on June 27th from Bordeaux, France
FULL AGENDA AT WWW.ONCRAWL.COM/SEOINORBIT

Editor's Notes

  • #4 October 2015 Google Hangout
  • #5 Joey: [holding up headshots] “Which one do you like better?” Bianca: “Umm, I think I like the white shirt better.” Joey: “Yeah, it’s-it’s more…” Bianca: “Pensive?” Joey: “Damn, I was going for thoughtful.”
• #9 Roll in like Star Wars? https://www.shapechef.com/blog/star-wars-intro-crawl-in-powerpoint-2013
• #19 I have two clients that see a lot of this, one in real estate and one in health. A lot of it is trying to add unique content and improve information architecture.
  • #20 From Google
• #22 Incorrect canonicals from ~2/1/19; fixed 4/18/19.
• #23 Same issues as in the past: finding ways to make content unique, especially when a site is massive; machine-generated content.
• #26 Alexis: check the mobile-first testing tool, switch the user agent to Googlebot, use technicalseo.com’s robots.txt tool, Screaming Frog’s log analyzer, and an EC2 remote computer. I’ll try to think of something better.
  • #28 Robots.txt
  • #29 Alexis – the newest, evergreen one.
  • #31 Alexis: Jupiter