Inside Google's Search Algorithm! (by Google Researchers) - Presentation Transcript
Sitemaps: Above and Beyond the Crawl of Duty
Sitemaps! Sitemaps!
Uri Schonfeld (Google and UCLA)
Narayanan Shivakumar (Google)
Copyright Uri Schonfeld, shuri.org April 2009
What are we going to talk about?
The sitemaps protocol:
Not introduced in this paper
Friendly web servers publishing URL lists
Popular and growing in popularity
First large-scale study over real data:
How it is used by users
Its Impact
First look at how it can be used by search engines
Lots of future work to get excited over
Let’s start with:
The underlying problem that Sitemaps addresses
Dream of the Perfect Crawl
Users have high expectations:
Coverage: Every page should be findable
Freshness: Latest event, viral video, ...
Deep Web: Ajax, Flash, Silverlight, ...
Search Engines Dream of the perfect crawl:
Everything the users want
… but efficient:
No 404s
No duplicates
Sitemaps to the rescue ...
Sitemaps
Basic idea: The web server
Puts a URL list, a sitemaps file, on its site
Includes new and changed content
Lets the search engines know
The sitemaps file may include, for each entry:
URLs
Last Modification Time
Expected Change Frequency
Priority
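A sitemaps file carrying the fields listed above could be produced along these lines. This is a minimal sketch, not the presenters' tooling: the element names (loc, lastmod, changefreq, priority) and the namespace come from the public Sitemaps protocol, while the function name build_sitemap, the example.com URLs, and the sample values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the public Sitemaps protocol (version 0.9).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Return sitemap XML as a string for a list of entry dicts.

    Each entry needs a "loc" key (the URL). The keys "lastmod",
    "changefreq" and "priority" are optional hints and are only
    written when present.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        for field in ("lastmod", "changefreq", "priority"):
            if field in entry:
                ET.SubElement(url, field).text = str(entry[field])
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical site content, for illustration only.
sitemap_xml = build_sitemap([
    {"loc": "https://example.com/", "lastmod": "2009-04-01",
     "changefreq": "daily", "priority": 0.8},
    {"loc": "https://example.com/about"},
])
print(sitemap_xml)
```

Only the URL is required per entry; the other three fields are hints that a search engine is free to weigh or ignore.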
Let the search engines know:
"Ping" the search engines to tell them that the sitemaps file has changed
Alternatively, reference the sitemaps file in the robots.txt file (April 2007)
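The two notification routes above can be sketched as follows. The "Sitemap:" line is the robots.txt directive defined by the Sitemaps protocol, and the ping URL follows the protocol's generic "/ping?sitemap=" pattern with the sitemap location URL-encoded. The helper names and the searchengine.example host are hypothetical; each search engine publishes its own ping endpoint, if it offers one at all, so the base URL here is an assumption.

```python
from urllib.parse import quote

def robots_txt_line(sitemap_url):
    """Return the robots.txt directive that advertises a sitemaps file."""
    return f"Sitemap: {sitemap_url}"

def build_ping_url(engine_base, sitemap_url):
    """Return a ping URL of the form <engine_base>/ping?sitemap=<encoded URL>.

    engine_base is a placeholder: check the search engine's own
    documentation for its real endpoint before using this.
    """
    return f"{engine_base}/ping?sitemap={quote(sitemap_url, safe='')}"

# Hypothetical site and search engine, for illustration only.
sitemap_url = "https://example.com/sitemap.xml"
print(robots_txt_line(sitemap_url))
print(build_ping_url("https://searchengine.example", sitemap_url))
```

The robots.txt route needs no request from the site: a crawler picks up the sitemaps location the next time it fetches robots.txt.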
Comprehensive coverage of the public web is crucial to web search engines. Search engines use crawlers to retrieve pages and then discover new ones by extracting the pages’ outgoing links. However, the set of pages reachable from the publicly linked web is estimated to be significantly smaller than the invisible web [5], the set of documents that have no incoming links and can only be retrieved through web applications and web forms. The Sitemaps protocol is a fast-growing web protocol supported jointly by major search engines to help content creators and search engines unlock this hidden data by making it available to search engines. In this paper, we perform a detailed study of how “classic” discovery crawling compares with Sitemaps, in key measures such as coverage and freshness over key representative websites as well as over billions of URLs seen at Google. We observe that Sitemaps and discovery crawling complement each other very well, and offer different tradeoffs.
Categories and Subject Descriptors: H.3.3: Information Search and Retrieval.
General Terms: Experimentation, Algorithms.
Keywords: search engines, crawling, sitemaps, metrics, quality.