Web Search Engine Metrics for Measuring User Satisfaction
[Section 5 of 7: Discovery]
Ali Dasdan, eBay
Kostas Tsioutsiouliklis, Yahoo!
Emre Velipasaoglu, Yahoo!
With contributions from Prasad Kantamneni, Yahoo!
27 Apr 2010
(Update in Aug 2015: The authors work in different companies now.)

Tutorial @ 19th International World Wide Web Conference
http://www2010.org/
April 26-30, 2010
© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.
Disclaimers
•  This talk presents the opinions of the authors. It does not necessarily reflect the views of our employers.
•  This talk does not imply that our employers use these metrics; if they do, they may not use them in the way described in this talk.
•  The examples are just that – examples. Please do not generalize them to the level of comparing search engines.

Discovery and Latency Metrics
Section 5/7 of WWW’10 Tutorial on Web Search Engine Metrics
by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Example on discovery: Page was born ~30 minutes before
Example on discovery: URL of page was not found
Example on discovery: But content existed under different URLs
Example on discovery: URL was also found after ~1 hr

Life of a URL
[Figure: timeline along the TIME axis marking the events BORN, DISCOVERED, NOW, and EXPIRED, with the intervals LATENCY and AGE.]

Lives of many URLs
[Figure: the same timeline drawn for several URLs, each with its own LATENCY interval.]

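The URL timeline can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes that latency is the time from BORN to DISCOVERED and that age is the time from BORN to NOW, which is one plausible reading of the figure labels. The class name `UrlLife` and its fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class UrlLife:
    """Lifecycle events of one URL (hypothetical structure for illustration)."""
    born: datetime
    discovered: Optional[datetime] = None  # None: not discovered yet
    expired: Optional[datetime] = None     # None: content has not expired

    def latency(self) -> Optional[timedelta]:
        # Assumption: latency = DISCOVERED - BORN.
        if self.discovered is None:
            return None
        return self.discovered - self.born

    def age(self, now: datetime) -> timedelta:
        # Assumption: age = NOW - BORN.
        return now - self.born


now = datetime(2010, 4, 27, 12, 0)
page = UrlLife(born=datetime(2010, 4, 27, 10, 0),
               discovered=datetime(2010, 4, 27, 11, 0))
print(page.latency(), page.age(now))
```

A URL that has not been discovered yet has no latency value, which is why `latency()` returns `None` in that case rather than a number.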
How to measure discovery and latency
•  Consider a sample of new pages on the Web
  –  Feeds at regular intervals
  –  Each sample monitored for a period (e.g., 15 days)
•  User view
  –  Discovery: Measure how many of these new pages are in the search results
    •  using the coverage ratio formula
  –  Latency: Measure how long it took to get these new pages into the search results
    •  variants such as ‘Time-To-First-* (TTF*)’ metrics, e.g., Time-To-First-Click and Time-To-First-View
•  System view
  –  Discovery: Measure how many of these new pages are in a catalog
  –  Latency: Measure how long it took to get these new pages into a catalog

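The measurement procedure on this slide can be sketched as follows. The sketch assumes that the coverage ratio is the fraction of sampled new pages that were found, and that latency is the time from a page's birth to its first appearance; the function names and the `born` / `first_seen` mappings are hypothetical. The same code covers the user view (first seen in search results) and the system view (first seen in a catalog), depending on where `first_seen` comes from.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List, Optional


def discovery_coverage(first_seen: Dict[str, Optional[datetime]]) -> float:
    """Fraction of sampled URLs that were found (assumed coverage ratio)."""
    if not first_seen:
        return 0.0
    found = sum(1 for t in first_seen.values() if t is not None)
    return found / len(first_seen)


def latencies_hours(born: Dict[str, datetime],
                    first_seen: Dict[str, Optional[datetime]]) -> List[float]:
    """Latency in hours for each sampled URL that was found."""
    result = []
    for url, t_born in born.items():
        t_seen = first_seen.get(url)
        if t_seen is not None:
            result.append((t_seen - t_born).total_seconds() / 3600.0)
    return result


# Hypothetical sample: four new pages born at the same time.
t0 = datetime(2010, 4, 27, 10, 0)
born = {u: t0 for u in ("u1", "u2", "u3", "u4")}
first_seen = {
    "u1": t0 + timedelta(hours=1),
    "u2": t0 + timedelta(hours=2),
    "u3": t0 + timedelta(hours=6),
    "u4": None,  # never found during the monitoring period
}

coverage = discovery_coverage(first_seen)
lat = latencies_hours(born, first_seen)
print(coverage, median(lat))
```

Pages that are never found within the monitoring period count against coverage but contribute no latency value, so the two metrics should be read together.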
Discovery profile of a search engine component: Overview
[Figure: discovery profile computed over many URLs, per search engine component. Annotations: time to reach a certain coverage percentage; convergence; no expiration yet; content expired; other behaviors.]

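One way to read a discovery profile is as coverage plotted against time since birth. The sketch below is an assumption-laden illustration of the "time to reach a certain coverage percentage" annotation: it takes per-URL latencies in hours (`None` for URLs never discovered), ignores content expiration, and uses hypothetical function names.

```python
from typing import List, Optional


def coverage_at(latencies_h: List[Optional[float]], t: float) -> float:
    """Fraction of sampled URLs discovered within t hours of birth."""
    found = sum(1 for x in latencies_h if x is not None and x <= t)
    return found / len(latencies_h)


def time_to_coverage(latencies_h: List[Optional[float]],
                     target: float) -> Optional[float]:
    """Smallest time at which coverage reaches `target`, or None if never."""
    n = len(latencies_h)
    known = sorted(x for x in latencies_h if x is not None)
    for count, x in enumerate(known, start=1):
        if count / n >= target:
            return x
    return None


# Hypothetical latencies for five sampled URLs; one was never discovered.
sample = [1.0, 2.0, 6.0, 30.0, None]
print(coverage_at(sample, 6.0))
print(time_to_coverage(sample, 0.5), time_to_coverage(sample, 0.9))
```

Because one URL is never discovered, coverage converges to 0.8 in this sample and a 90% target is never reached, which is the kind of convergence level a profile makes visible.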
Discovery profiles and monitoring: Examples
[Figure: two panels, "Profiles" and "Monitoring of profile parameters".]

Latency profiles of a search engine component: Overview
[Figure: latency profile computed over many URLs, per search engine component. Annotations: desired skewness direction; close to zero for crawlers.]

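A latency profile summarizes the distribution of latencies over many URLs, and the figure refers to its skewness. As a rough sketch, the following computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second) for a list of latencies. The choice of this particular skewness estimator is an assumption; the slide does not say which one the authors used, and the sample values are hypothetical.

```python
from statistics import fmean, median
from typing import List


def skewness(xs: List[float]) -> float:
    """Moment-based skewness: m3 / m2**1.5 (0.0 if all values are equal)."""
    n = len(xs)
    mu = fmean(xs)
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


# Hypothetical latencies in hours: most pages found quickly, one found late.
latencies_h = [1.0, 1.0, 1.0, 2.0, 10.0]
print(fmean(latencies_h), median(latencies_h), skewness(latencies_h))
```

A positive value indicates that most latencies sit near the low end with a tail of slow discoveries; tracking this number alongside the mean and median is one way to monitor profile parameters over time.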
Latency profiles and monitoring: Examples
[Figure: two panels, "Profiles" and "Monitoring of profile parameters".]

Further issues to consider
•  How to discover samples to measure discovery and latency
•  How to beat crawlers to acquire samples
•  Discovery of top-level pages
•  Discovery of deep links
•  Discovery of hidden web content
•  How to balance discovery against other objectives

Key problems
•  Predict content changes on the Web
•  Discover new content almost instantaneously
•  Reduce latency per search engine component and overall

Reference review on discovery metrics
•  Cho, Garcia-Molina, & Page (1998)
  –  discusses how to order URL accesses based on importance scores
    •  importance: PageRank (best), link count, similarity to query in anchor text or URL string, attributes of URL string
•  Dasgupta et al. (2007)
  –  formulates the problem of discoverability (discover new content from the fewest number of known pages) and proposes approximation algorithms
•  Kim and Kang (2007)
  –  compares top three search engines for discovery (called “timeliness”), freshness, and latency
•  Lewandowski (2008)
  –  compares top three search engines for freshness and latency
•  Dasdan and Drome (2009)
  –  proposes discovery metrics along the lines discussed in this section
•  Olston and Najork (2010)
  –  gives a detailed survey of web crawling, including how crawlers discover URLs
  –  discusses how to optimize for both coverage and freshness in a web crawler

References
•  J. Cho, H. Garcia-Molina, and L. Page (1998), Efficient Crawling Through URL Ordering, Computer Networks and ISDN Systems, 30(1-7):161-172.
•  A. Dasdan and C. Drome (2009), Discovery coverage: Measuring how fast content is discovered by search engines, submitted.
•  A. Dasgupta, A. Ghosh, R. Kumar, C. Olston, S. Pandey, and A. Tomkins (2007), The discoverability of the Web, WWW’07.
•  J. Dean (2009), Challenges in building large-scale information retrieval systems, WSDM’09.
•  N. Eiron, K.S. McCurley, and J.A. Tomlin (2004), Ranking the Web frontier, WWW’04.
•  C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine index fresh: Risk and optimality in estimating refresh rates for web pages, INTERFACE’08.
•  Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of search engines with webpage monitoring results, WISE’07.
•  D. Lewandowski (2008), A three-year study on the freshness of Web search engine databases, to appear in J. Info. Syst., 2008.
•  C. Olston and M. Najork (2010), Web crawling, Chapter in Foundations and Trends in Information Retrieval, 4(3):175-246.
