Inside Google's Search Algorythm! (by Google Researchers)

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    Favorites, Groups & Events

    Inside Google's Search Algorythm! (by Google Researchers) - Presentation Transcript

    1. Sitemaps : Above and Beyond the Crawl of Duty Sitemaps! Sitemaps! Uri Schonfeld (Google and UCLA) Narayanan Shivakumar (Google) Copyright Uri Schonfeld, shuri.org April 2009
    2. What are we going to talk about?
      • The sitemaps protocol:
        • Not introduced in this paper
        • Friendly web servers publishing URL lists
      • Popular and growing in popularity
        • First large scale study over real data :
          • How it is used by users
          • Its Impact
        • First look at how it can be used by search engines
        • Lots of future work to get excited over
      • Let’s start with:
        • Underlying problem that sitemaps addresses
      Copyright Uri Schonfeld, shuri.org April 2009
    3. Dream of the Perfect Crawl
        • Users Have High Expectations :
          • Coverage : Every page should be findable
          • Freshness : Latest event, viral video,...
          • Deep Web : ajax, flash, silverlight,....
        • Search Engines Dream of the perfect crawl:
          • Everything the users want
          • … but efficient:
            • No 404s
            • No duplicates
        • Sitemaps to the rescue ...
      Copyright Uri Schonfeld, shuri.org April 2009
    4. Sitemaps
        • Basic idea: The web server
          • Puts a URL list, a sitemaps file , on its site
          • Includes new and changed content
          • Lets the search engines know
        • The URL list may also include:
            • URLs
            • Last Modification Time
            • Expected Change Frequency
            • Priority
        • Let the search engine know:
          • " Ping " search engines that their sitemaps  file has changed
          • Alternatively include sitemaps in robots.txt file (April 2007)
      Copyright Uri Schonfeld, shuri.org April 2009
    5. Sitemaps: This is how it looks
      • <?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?> <urlset xmlns=   &quot;http://www.sitemaps.org/schemas/sitemap/0.9&quot;>    <url>       <loc>http://www.example.com/</loc>       <lastmod>2005-01-01</lastmod>       <changefreq>monthly</changefreq>
      •       <priority>0.8</priority>    </url>    <url>        ...    <url> </urlset>
      Copyright Uri Schonfeld, shuri.org April 2009
    6. Related Work
        • 1999: &quot;Santa Fe Convention&quot;
          • Lead to OAI-PMH
          • &quot;...e-print servers to expose metadata for the papers it held&quot;
          • Coalition for Networked Information, Digital Library Federation, Open Archives Initiative (OAI), Herbert Van de Sompel, Carl Lagoze
        • 2000: Crawler Friendly Web Servers: Brandman, Cho, Garcia-Molina and Shivakumar
          • Export list of URLs and changed content
        • 2005/6: Sitemaps:
          • Introduced in 2005 by Google
          • 2006 Microsoft, Yahoo and Google announced joint support
      Copyright Uri Schonfeld, shuri.org April 2009
    7. Our Main Contributions
        • First Study of Sitemaps over real world data :
          • How it is used
          • It’s impact
        • Define metrics to evaluate Sitemaps feeds.
        • Explore:
          • The challenges of using Sitemaps together with Discovery Crawl
          • Define a preliminary algorithm combining the two crawls.
      Copyright Uri Schonfeld, shuri.org April 2009
    8. Inside Google
        • Sitemaps & Discovery
        • Sitemaps :
          • Sitemaps are fetched:
            • After they are pinged.
            • Several frequencies.
          • Sitemaps discovered URLs are fed to the crawling pipeline.
          • Some sources are fed directly for instant crawling .  
        • Discovery : 
          • New URLs and URLs of changed content are fed back to the pipeline 
        • Pipeline  
      Copyright Uri Schonfeld, shuri.org April 2009
    9. How Sitemaps Is Used?
        • Approximately 35M websites publish Sitemaps, and give us metadata for several billions of URLs .
        • Metadata:
          • 61% include a priority field .
          • 58% of URLs include a lastmodification date
          • 7% include a change frequency field
        • Formats Breakdown: 
          • XML Sitemap 76.76 
          • Url List 3.42 
          • Atom 1.61 
          • RSS 0.11
          • Unknown 17.51 
        • Robots.txt announced April 2007
      Copyright Uri Schonfeld, shuri.org April 2009
    10. Sitemaps Case Studies Copyright Uri Schonfeld, shuri.org April 2009
    11. Sitemaps Use Case Studies
        • Looked at three different sites:
          • Amazon: Large.
          • CNN: Dynamic.
          • Pubmedcentral.nih.gov: Archival.
        • Amazon :
          • Huge .
          • Service Oriented Architecture :
            • Hard to list valid URLs, when content changes
            •  Research Opportunity: Auto Generation of Sitemaps
          • 20M URLs published in:
            • 10,000 sitemaps files.
            • Each file: 20,000-50,000 URLs.
            • Log based.
          • Efficiency : URLs crawled vs unique pages
            • Discovery 63%, Sitemaps 86%.
      Copyright Uri Schonfeld, shuri.org April 2009
    12. Case Study: CNN
      • Very Dynamic :
        • Many new URLs added daily
      • Sitemaps:
        • News : 200-400 URLs
        • Weekly:2500-3000 URLs
        • Monthly:5000-10000 URLs
        • The lists don't overlap but complete 
        • Additional SitemapsIndex of hub pages
      Copyright Uri Schonfeld, shuri.org April 2009
    13. Case Study Pubmedcentral.nih.gov
        • Archival domain:
          • Add and hardly change.
          • Oldest journal published in1809.
        • Thus, can be exhaustive. 
        • Sitemap files:
          • 50+ sitemaps files.
          • 30,000 URLs in each.
          • Last modification inaccurate (unlike CNN and Amazon).
      Copyright Uri Schonfeld, shuri.org April 2009
    14. Pubmedcentral.nih.gov (cont’)
        • URL break down
          • Discovery and Sitemaps 3 million
          • Sitemaps only 1.7 million
          • 1 million due to duplicates
        • Manually examined 3000 sample URLs from the missing ~300,000
          • 8% errors 
          • 10% redirects
          • 11% other duplicate content
          • 51% judgment call needed (should crawl or not)
      Copyright Uri Schonfeld, shuri.org April 2009
    15. Pubmedcentral
      •  
      Copyright Uri Schonfeld, shuri.org April 2009
    16. CNN: New URLs Seen Over Time
      •  
      Copyright Uri Schonfeld, shuri.org April 2009
    17. Evaluating Sitemaps Copyright Uri Schonfeld, shuri.org April 2009
    18. Evaluating Sitemaps
        • Coverage and Freshness 
        • How should we judge usefulness ? 
        •   How far does a URL get in our pipeline :
          • Seen
          • Crawled
          • Unique
          • Indexed
          • Results
          • Clicked
        • UniqueCoverage = UniqueSitemaps(D) / Unique(D)
        • IndexCoverage = IndexedSitemaps(D) / Indexed(D)
        • PageRankCoverage = RankMassSitemaps(D) / RankMass(D)
      Copyright Uri Schonfeld, shuri.org April 2009
    19. Coverage Copyright Uri Schonfeld, shuri.org April 2009
    20. Coverage vs UniqueCoverage
      •  
      Copyright Uri Schonfeld, shuri.org April 2009
    21. UniqueCoverage vs Domain Size
      •  
        • 46% domains have above 50% UniqueCoverage 
        • 12% domains have 90% UniqueCoverage.
      Copyright Uri Schonfeld, shuri.org April 2009
    22. While PageRank Coverage…
      •  
      Copyright Uri Schonfeld, shuri.org April 2009
    23. Bang for the Buck…
      •  
      Copyright Uri Schonfeld, shuri.org April 2009
    24. Pings and Freshness
      • First Seen by Sitemaps
      •  
        • Ping: 12.7%
        • Non-Ping: 80.3%
      • First Seen by Discovery
      •  
        • Ping: 1.5%
        • Non-Ping: 5.5%
        • 14.2% Discovered through pings .
        • But who saw first is independent .
        • Doesn't reflect the potential.
        •  Research Opportunity: Detect and ping policy
        • Of URLs seen by both Sitemaps and Discovery.
          • 78% Seen first by Sitemaps
          • 22% Seen first by Discovery
      Copyright Uri Schonfeld, shuri.org April 2009
    25. Doing Both : Sitemaps and Discovery
        • New URLs and Refresh: we’ll talk new URLs . 
        • You can't fetch it all  per site quota .
        • What to fetch?
        • Crawl uses some ranking.
        • What should ranking for Sitemaps URLs?
        • How to balance between them ?
      Copyright Uri Schonfeld, shuri.org April 2009
    26. Ranking URLs in Sitemaps
        • Priority :
          • Full autority to the webmaster.
          • Is not available all the time.
        • PageRank :
          •   Provenly effective.
          • Not available for the truly new pages .
          • Webmasters don't have a Say at all.
        • PriorityRank :
          • Modify graph to take both into account 
          • Add sitemaps as a page implicitly linked to from the root.
          • Links from Sitemaps are weighted by priority if available 
          • Calculate PageRank over this modified graph.
          • Hybrid of the two previous methods .
      Copyright Uri Schonfeld, shuri.org April 2009
    27. Balancing the Crawl:  Algorithm Simplified
        • for epoch in 0..infinity do
        • kD = kS = 1/2 
          • Fetch:
            • Top kD * Quota from Discovery
            • Top kS * Quota from Sitemaps
          • Measure derivative of the utility (IndexCoverage)
          • Adjust kC and KS
      Copyright Uri Schonfeld, shuri.org April 2009
    28. Conclusion and Future Work
        • Large scale study, real data
        • You cannot stop Discovery… yet.
        • Presented metrics for freshness and coverage.
        • Sitemaps evaluated for coverage and freshness.
        • Presented Algorithm to combine Sitemaps & Discovery
        • To Be Done
          • Good news: tons of future work
          • Duplicates not solved on web-server side either .
          • Better Pings.
          • Ranking Sitemaps URLs can be a challenge.
      Copyright Uri Schonfeld, shuri.org April 2009
    29. Acks
      • We wish to thank many Googlers!
      • thank...
      • Dennis Geels, Ori Gershony, Laramie, Madhu, Thomal, Alkis, Peter Dickman, Arup, Charlie, Nish, Rosemary, Ralph, Nikhil.
      Copyright Uri Schonfeld, shuri.org April 2009
    30. The End Thank You! Copyright Uri Schonfeld, shuri.org April 2009

    + Mark FeldmanMark Feldman, 7 months ago

    custom

    587 views, 0 favs, 0 embeds more stats

    Comprehensive coverage of the public web is crucial more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 587
      • 587 on SlideShare
      • 0 from embeds
    • Comments 0
    • Favorites 0
    • Downloads 10
    Most viewed embeds

    more

    All embeds

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories