getting_rid_of_duplicate_content_iss-priyank_garg.ppt

  1. Content Duplication
     Priyank Garg, Yahoo! Web Search, November 13, 2008
  2. Content Duplication - Outline
     - Where Yahoo! Search eliminates duplication
     - Why should you, the webmaster, care?
     - Reasons to preserve dupes
     - Sources of duplication
     - The abusive fringe
     - What should you do?
  3. Why Do Search Engines Care About Duplication?
     Several concerns make duplicates a problem for search engines:
     - Lack of diversity in results
       - Same abstract, similar URLs on different hosts/domains
       - Landing pages have identical content
       - Looks bad and wastes screen real estate for the user
     - Increased likelihood of user disappointment
     - Wasted resources (crawl and index capacity) in most cases
     - Finally, we would like to show original content only for its originator(!)
       - We want to drive traffic to content creators, not, um, flattering imitators
  4. Where does Y! Search eliminate dupes?
     A: At every point in the pipeline, but as much as possible at query time
     - Crawl-time filtering
       - Less likely to extract links from known duplicate pages
       - Less likely to crawl new docs from duplicative sites
     - Index-time filtering
       - Less representation from dupes when choosing crawled pages to put in the index
     - Query-time dupe elimination
       - Limits on URLs per host per SRP, plus domain restrictions
       - Filtering of similar documents
     - Duplication doesn't have to be exact
       - Approximate page-level dupes, site mirrors
  5. Why should you, the webmaster, care?
     - For each site, we allocate certain crawl and index resources based on many factors:
       - site importance
       - content quality and uniqueness
       - overall unique additional value to search users
     - Let's say you have a recipe site with 15k good recipes:
       http://recipe-site.com/yourbasicporkchop.html
       http://recipe-site.com/lambvindaloo.html
       http://recipe-site.com/sausagegumbo.html
       http://recipe-site.com/spicyvegansurprise.html
       http://recipe-site.com/whateversinthefridgeplusoregano.html
       [..]
  6. Why should you, the webmaster, care?
     - But unfortunately, Slurp finds URLs like these, and they consume your entire crawl quota:
       http://recipe-site.com/yourbasicporkchop.html?sessid=aba89s
       http://recipe-site.com/yourbasicporkchop.html?sessid=acc90x
       http://recipe-site.com/yourbasicporkchop.html?sessid=aff23f
       http://recipe-site.com/yourbasicporkchop.html?sessid=ccr33a
       [....]
     - The upshot may be that only one page of (unique) content survives, though we would have taken more
     - Note that all the duplicate pages would probably get filtered later in the pipeline, wasting your chances to get referrals
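The collapse described above can be sketched in a few lines: canonicalize each crawled URL by stripping the session parameter and count how many distinct documents remain. This is an illustrative sketch, not Slurp's actual logic; the `sessid` parameter name and the URLs are the hypothetical examples from the slide.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def canonicalize(url, strip_params=("sessid",)):
    """Drop session-style query parameters so duplicate URLs collapse."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in strip_params]
    return urlunparse(parts._replace(query=urlencode(kept)))

crawled = [
    "http://recipe-site.com/yourbasicporkchop.html?sessid=aba89s",
    "http://recipe-site.com/yourbasicporkchop.html?sessid=acc90x",
    "http://recipe-site.com/yourbasicporkchop.html?sessid=aff23f",
    "http://recipe-site.com/yourbasicporkchop.html?sessid=ccr33a",
]
unique = {canonicalize(u) for u in crawled}
print(len(crawled), "crawled URLs,", len(unique), "unique document")
```

Four crawl slots spent, one page of content gained: exactly the waste the slide describes.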
  7. Why keep dupes at all?
     Why would search engines want duplicative documents in their index?
     - Site-restricted queries
       - If you're looking for a news story on picayune-gazette.com, you don't care if another copy is on herald-tribune.com
     - Regional preference
       - We rank results slightly differently for UK users than for U.S. users
       - UK users would rather see a bbc.co.uk result than a usatoday.com result
     - Redundancy
       - Copy #1 might suddenly go 404
       - We don't need a million copies of the Unix man pages, but ... more than one.
  8. Legitimate reasons to duplicate
     - Alternate document formats
       - Present the same content in HTML, Word, PDF....
     - Legitimate syndication
       - For example, many newspaper sites carry wire-service stories
     - Multiple language/regional markets
       - Note that different languages won't even detect as dupes
  9. Accidental "duplication"
     - Session IDs in URLs
       - Remember, to engines a URL is a URL is a URL....
       - Two URLs referring to the same doc look like dupes
       - We can sort this out, but it may inhibit crawling
       - Embedding session IDs in non-dynamic URLs doesn't change the fundamental problem:
         http://yoursite/yourpage/sessid489/a.html is still a dupe of http://yoursite/yourpage/sessid524/a.html
     - Soft 404s
       - "Not found" error pages should return a 404 HTTP status code when crawled.
       - If they don't, we can crawl many copies of the same "not found" page
     - Not considered abusive, but it can still hamper our ability to display your content.
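To make the soft-404 point concrete: the two responses below differ only in the status line, but the first tells the crawler it has found a real page while the second tells it the page does not exist. The path is a made-up example.

```
GET /no-such-recipe.html HTTP/1.1
Host: recipe-site.com

HTTP/1.1 200 OK          <- soft 404: "not found" content behind a success code
...page saying "Sorry, recipe not found"...

HTTP/1.1 404 Not Found   <- proper 404: the crawler can skip and forget this URL
```

With the 200 response, every bad link on the site yields another crawlable copy of the same error page.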
  10. Dodgy duplication
      - Replicating content across multiple domains unnecessarily
      - "Aggregation" of content found elsewhere on the web
        - Ownership questions?
        - Is there value added in the aggregation?
        - Search engines are themselves aggregators, but shouldn't necessarily point to other aggregations or search results pages
      - Identical content repeated with minimal value added
        - How much of the page is duplicated? Is what is new worth anything?
        - May be handled by dupe detection algorithms (if you're OK with that)
        - Particularly an issue with regionally targeted content
  11. Dodgy duplication, cont.
      When repeated elements dominate, approximate dupes may be (appropriately) filtered out.
      - Real estate advice for FLORIDA homeowners:
        Buy low, sell high! Don't leave your home on the market for too long. Consider being your own agent! Price your home to move.
      - Real estate advice for TENNESSEE homeowners:
        Buy low, sell high! Don't leave your home on the market for too long. Consider being your own agent! Price your home to move.
      - Real estate advice for MONTANA homeowners:
        Buy low, sell high! Don't leave your home on the market for too long. Consider being your own agent! Price your home to move.
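Near-duplicate filtering of this kind is commonly done with shingling: take the set of overlapping word n-grams from each page and measure their Jaccard overlap. A minimal sketch of the generic technique (not Yahoo!'s actual algorithm) applied to two of the state pages above:

```python
def shingles(text, n=3):
    """Set of overlapping word n-grams ('shingles') for a page's text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Overlap between two shingle sets: 1.0 means identical."""
    return len(a & b) / len(a | b)

boiler = ("Buy low, sell high! Don't leave your home on the market for "
          "too long. Consider being your own agent! Price your home to move.")
florida = "Real estate advice for FLORIDA homeowners: " + boiler
montana = "Real estate advice for MONTANA homeowners: " + boiler

sim = jaccard(shingles(florida), shingles(montana))
print(f"similarity: {sim:.2f}")  # high, since only one word differs
```

Because the shared boilerplate dominates, the similarity score is high and the pages are likely to be collapsed into one result.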
  12. The abusive fringe
      - Scraper spammers
        - Other people's content + their ads, in bulk
      - Weaving/stitching
        - Mix-and-match content (at the phrase, sentence, paragraph, or section level) from different sources
        - Often an attempt to defeat duplicate detection
      - Bulk cross-domain duplication
        - Often an attempt to get around hosts-per-SRP limits
      - Bulk duplication with small changes
        - Often an attempt to defeat duplicate detection
      - All of the above are outside our content guidelines, and may lead to unanticipated results for publishers.
  13. What should you do?
      - Avoid bulk duplication of underlying documents
        - If the variations are small, does the search engine need all versions?
        - Use robots.txt to hide parts of the site that are duplicates (say, print versions of pages)
        - Use 301s to redirect dupes to the original
      - Avoid accidental proliferation of many URLs for the same documents
        - Session IDs, soft 404s, etc.
        - Not abusive by our guidelines, but they impair effective crawling
      - Avoid duplication of sites across many domains
      - When importing content from elsewhere, ask:
        - Do you own it (or have rights to it)?
        - Are you adding value, or just duplicating?
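The robots.txt and 301 advice can be combined for the common print-version case: block the duplicate directory from crawling and permanently redirect any print URL a visitor or engine already holds back to the canonical page. The paths below are a hypothetical sketch (Apache mod_alias in an .htaccess file).

```
# robots.txt - keep the duplicate print versions out of the crawl
User-agent: *
Disallow: /print/

# .htaccess (Apache mod_alias) - 301 a duplicate URL to the original
Redirect permanent /print/lambvindaloo.html http://recipe-site.com/lambvindaloo.html
```

The 301 (permanent redirect) is what tells engines which URL is canonical and consolidates the duplicate's inbound links onto the original.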
  14. Tools from Yahoo!
      - Yahoo! Slurp supports wildcards in robots.txt: http://www.ysearchblog.com/archives/000372.html
        - Makes it easy to mark out areas of sites not to be crawled and indexed
      - Site Explorer lets you delete a URL or an entire path from the index for authenticated sites:
        http://www.ysearchblog.com/archives/000400.html
      - Use the robots-nocontent tag on non-relevant parts of a page:
        http://www.ysearchblog.com/archives/000444.html
        - Can be used to mark out boilerplate content
        - Or syndicated content that may be useful in context for the user but not for search engines
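Wildcard support means session-ID and format duplicates can be excluded with patterns instead of listing every path. A hypothetical robots.txt using `*` (match any characters) and `$` (anchor to end of URL):

```
User-agent: Slurp
# any URL carrying a sessid query parameter
Disallow: /*?sessid=
Disallow: /*&sessid=
# any PDF duplicate of an HTML page
Disallow: /*.pdf$
```

The parameter name and file extension here are illustrative; the pattern syntax is the one described in the linked Yahoo! Search Blog post.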
  15. Tools from Yahoo! (contd.)
      - Dynamic URL Rewriting in Site Explorer
        http://www.ysearchblog.com/archives/000479.html
        - Ability to indicate a parameter to remove from URLs across the site
        - More efficient crawl, with fewer duplicate URLs
        - Better site coverage, as fewer resources are wasted on duplicates
        - More unique content discovered for the same crawl
        - Fewer risks of crawler traps
        - Cleaner URLs, easier for users to read and more likely to be clicked
        - Better ranking due to reduced link-juice fragmentation
        - Some sites have had 5m URLs cleaned with a single rule!
  16. Dynamic URL Rewriting
      - Log in to Site Explorer
      - Manage a Site (here http://example.com)
  17. Dynamic URL Rewriting - 2
      - Select a parameter and action
  18. Dynamic URL Rewriting - 3
      - Confirm the action
  19. Dynamic URL Rewriting - 4
      - Input another parameter and action
  20. Dynamic URL Rewriting - 5
      - If validation fails, override if you're sure
  21. Dynamic URL Rewriting - 6
      - You can see the active parameter actions
  22. Dynamic URL Rewriting - 7
      - Actions Tab
        - Available along with the Delete actions
        - Can be undone if needed
        - More details available in Help
  23. Why should I use this?
      - Fewer duplicate URLs are crawled
      - Better and deeper site coverage due to freed-up crawl quota
      - More unique content discovered
      - Fewer chances of crawler traps
      - Cleaner and easier-to-read display URLs
      - Better aggregation of link juice to your sites
  24. Check out
      - Site Explorer from Yahoo! Search: http://siteexplorer.search.yahoo.com
      - Yahoo! Search Blog: http://ysearchblog.com
  25. Search Information and Contacts
      - Site Explorer: http://siteexplorer.search.yahoo.com/
      - Search Happenings and People: http://www.ysearchblog.com/
      - Search Information, Guidelines and FAQ: http://help.yahoo.com/help/us/ysearch/
      - Search content guidelines: http://help.yahoo.com/help/us/ysearch/basics/basics-18.html
      - To report spam: http://add.yahoo.com/fast/help/us/ysearch/cgi_reportsearchspam
      - To check on a site, or to get a site reviewed: http://add.yahoo.com/fast/help/us/ysearch/cgi_urlstatus
      - Search Support: http://add.yahoo.com/fast/help/us/ysearch/cgi_feedback
