Web Search Engine Metrics
for Measuring User
Satisfaction
[Section 5 of 7: Discovery]
Ali Dasdan, eBay
Kostas Tsioutsiouliklis, Yahoo!
Emre Velipasaoglu, Yahoo!
With contributions from Prasad Kantamneni, Yahoo!
27 Apr 2010
(Update in Aug 2015: The authors work in different companies now.)
3. © Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.
Disclaimers
• This talk presents the opinions of the
authors. It does not necessarily reflect
the views of our employers.
• This talk does not imply that our employers
use these metrics; if they do, they may not
use them in the way described in this talk.
• The examples are just that – examples.
Please do not generalize them to the
level of comparing search engines.
Lives of many URLs
[Figure: timelines of several URLs along a TIME axis, with the events BORN, DISCOVERED, NOW, and EXPIRED marked; intervals are labeled AGE and LATENCY.]
How to measure discovery and
latency
• Consider a sample of new pages on the Web
– Feeds at regular intervals
– Each sample monitored for a period (e.g., 15 days)
• User view
– Discovery: Measure how many of these new pages are in
the search results
• using the coverage ratio formula
– Latency: Measure how long it took for these new pages to
appear in the search results
• variants are the ‘Time-To-First-* (TTF*)’ metrics, e.g.,
Time-To-First-Click and Time-To-First-View
• System view
– Discovery: Measure how many of these new pages are in a
catalog
– Latency: Measure how long it took for these new pages to
appear in a catalog
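The two measurements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coverage ratio is taken to be the fraction of sampled new pages that were found during the monitoring period (the formula itself is not given in this section), and the record layout and the names `coverage_ratio` and `latencies` are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional, Tuple

# Hypothetical sample record: (birth time, time first found, or None if never found).
Sample = Tuple[datetime, Optional[datetime]]


def coverage_ratio(samples: List[Sample]) -> float:
    """Fraction of sampled new pages that were found (assumed definition)."""
    if not samples:
        return 0.0
    found = sum(1 for _, seen in samples if seen is not None)
    return found / len(samples)


def latencies(samples: List[Sample]) -> List[timedelta]:
    """Latency (found time minus birth time) for the pages that were found."""
    return [seen - born for born, seen in samples if seen is not None]


# Made-up data: four new pages born at the same time, one never found.
t0 = datetime(2010, 4, 1)
samples = [
    (t0, t0 + timedelta(hours=2)),
    (t0, t0 + timedelta(hours=10)),
    (t0, t0 + timedelta(days=3)),
    (t0, None),  # not found within the monitoring period
]

ratio = coverage_ratio(samples)
median_latency = median(latencies(samples))
print(ratio, median_latency)
```

The same functions apply to both views: for the user view, "found" means present in the search results; for the system view, it means present in a catalog.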
Discovery profile of a search
engine component: Overview
[Figure: discovery profile, aggregated over many URLs, per search engine component. Annotations: time to reach a certain coverage percentage; convergence; no expiration yet; content expired; other behaviors.]
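One way to read a discovery profile is as coverage plotted against the time elapsed since each URL was born. The sketch below is an assumption about how such a curve and the "time to reach a certain coverage percentage" could be computed; it ignores content expiration, and the function names and the hours-based latency list are hypothetical.

```python
import math
from typing import List, Optional

# Hypothetical input: discovery latency in hours per sampled URL,
# or None if the URL was never discovered while it was monitored.
Latencies = List[Optional[float]]


def coverage_at(latencies_h: Latencies, t_hours: float) -> float:
    """Fraction of sampled URLs discovered within t_hours of their birth."""
    if not latencies_h:
        return 0.0
    hit = sum(1 for lat in latencies_h if lat is not None and lat <= t_hours)
    return hit / len(latencies_h)


def time_to_coverage(latencies_h: Latencies, target: float) -> Optional[float]:
    """Smallest elapsed time at which coverage reaches target, else None."""
    found = sorted(lat for lat in latencies_h if lat is not None)
    needed = math.ceil(target * len(latencies_h))  # URLs required to hit target
    if needed == 0:
        return 0.0
    if needed > len(found):
        return None  # the profile converges below the target
    return found[needed - 1]


# Made-up data: five URLs, one never discovered.
sample = [1.0, 4.0, 12.0, 48.0, None]
profile = [(t, coverage_at(sample, t)) for t in (1, 6, 24, 72)]
print(profile)
print(time_to_coverage(sample, 0.5))
```

In this made-up sample the curve levels off at 80% coverage, so a 90% target is never reached and `time_to_coverage` returns `None`.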
Discovery profiles and monitoring:
Examples
[Figure: two panels, "Profiles" and "Monitoring of profile parameters".]
Latency profiles of a search engine
component: Overview
[Figure: latency profile, aggregated over many URLs, per search engine component. Annotations: desired skewness direction; close to zero for crawlers.]
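A latency profile is a distribution, so its shape can be summarized by a few parameters. The sketch below computes the median and the moment-based sample skewness of a list of latencies. The choice of this particular skewness estimator and the function name are assumptions; the surviving slide text does not say which direction of skew is the desired one.

```python
from statistics import median
from typing import List


def skewness(values: List[float]) -> float:
    """Moment-based sample skewness: m3 / m2**1.5 (0.0 if there is no spread)."""
    n = len(values)
    if n == 0:
        raise ValueError("skewness needs at least one value")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


# Made-up latencies in hours: most URLs found quickly, a few found late.
latencies_h = [1.0, 1.0, 2.0, 2.0, 3.0, 24.0, 72.0]
print(median(latencies_h), skewness(latencies_h))
```

A positive value means the long tail lies on the high-latency side, with most of the mass near low latencies; a symmetric sample gives zero.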
Latency profiles and monitoring:
Examples
[Figure: two panels, "Profiles" and "Monitoring of profile parameters".]
Further issues to consider
• How to discover samples to measure
discovery and latency
• How to beat crawlers to acquire
samples
• Discovery of top-level pages
• Discovery of deep links
• Discovery of hidden web content
• How to balance discovery against
other objectives
Key problems
• Predict content changes on the Web
• Discover new content almost
instantaneously
• Reduce latency per search engine
component and overall
Reference review on discovery
metrics
• Cho, Garcia-Molina, & Page (1998)
– discuss how to order URL accesses based on importance
scores
• importance: PageRank (best), link count, similarity to query in
anchor text or URL string, attributes of URL string
• Dasgupta et al. (2007)
– formulate the problem of discoverability (discovering new content
from the fewest known pages) and propose
approximation algorithms
• Kim and Kang (2007)
– compare the top three search engines for discovery (called
“timeliness”), freshness, and latency
• Lewandowski (2008)
– compares the top three search engines for freshness and latency
• Dasdan and Drome (2009)
– propose discovery metrics along the lines discussed in this
section
• Olston and Najork (2010)
– give a detailed survey of web crawling, including how crawlers
discover URLs
– discuss how to optimize for both coverage and freshness in a
web crawler
References
• J. Cho, H. Garcia-Molina, and L. Page (1998), Efficient Crawling
Through URL Ordering, Computer Networks and ISDN Systems,
30(1-7):161-172.
• A. Dasdan and C. Drome (2009), Discovery coverage: Measuring how
fast content is discovered by search engines, submitted.
• A. Dasgupta, A. Ghosh, R. Kumar, C. Olston, S. Pandey, and A.
Tomkins (2007), The discoverability of the Web, WWW’07.
• J. Dean (2009), Challenges in building large-scale information retrieval
systems, WSDM’09.
• N. Eiron, K.S. McCurley, and J.A. Tomlin (2004), Ranking the Web
frontier, WWW’04.
• C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine
index fresh: Risk and optimality in estimating refresh rates for web
pages, INTERFACE’08.
• Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of
search engines with webpage monitoring results, WISE’07.
• D. Lewandowski (2008), A three-year study on the freshness of Web
search engine databases, to appear in J. Info. Syst., 2008.
• C. Olston and M. Najork (2010), Web crawling, Chapter in Foundations
and Trends in Information Retrieval, 4(3):175-246.