Measuring the quality of web search engines

  1. Measuring the quality of web search engines
     Prof. Dr. Dirk Lewandowski, University of Applied Sciences Hamburg
     dirk.lewandowski@haw-hamburg.de
     Tartu University, 14 September 2009
  2. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. "Universal Search"; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  3. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. "Universal Search"; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  4. Search engine market: Germany 2009 (Webhits, 2009)
  5. Search engine market: Estonia 2007 (Global Search Report 2007)
  6. Why measure the quality of web search engines?
     • Search engines are the main access point to web content.
     • One player dominates the worldwide market.
     • Open questions
       – How good are search engines' results?
       – Do we need alternatives to the "big three" ("big two"? "big one"?)?
       – How good are alternative search engines at delivering an alternative view on web content?
       – How good must a new search engine be to compete?
  7. A framework for measuring search engine quality
     • Index quality
       – Size of the database, coverage of the web
       – Coverage of certain areas (countries, languages)
       – Index overlap
       – Index freshness
     • Quality of the results
       – Retrieval effectiveness
       – User satisfaction
       – Results overlap
     • Quality of the search features
       – Features offered
       – Operational reliability
     • Search engine usability and user guidance
     (Lewandowski & Höchstötter, 2007)
  8. A framework for measuring search engine quality
     • Index quality
       – Size of the database, coverage of the web
       – Coverage of certain areas (countries, languages)
       – Index overlap
       – Index freshness
     • Quality of the results
       – Retrieval effectiveness
       – User satisfaction
       – Results overlap
     • Quality of the search features
       – Features offered
       – Operational reliability
     • Search engine usability and user guidance
     (Lewandowski & Höchstötter, 2007)
  9. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. "Universal Search"; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  10. Users use relatively few cognitive resources in web searching.
      • Queries
        – Average length: 1.7 words (German-language queries; English-language queries are slightly longer)
        – Approx. 50 percent of queries consist of just one word
      • Search engine results pages (SERPs)
        – 80 percent of users view no more than the first results page (10 results)
        – Users normally view only the first few results ("above the fold")
        – Users view no more than about five results per session
        – Session length is less than 15 minutes
      • Users are usually satisfied with the results given.
  11. Results selection (top 11 results) (Granka et al. 2004)
  12. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. "Universal Search"; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  13. Standard design for retrieval effectiveness tests
      • Select queries (at least 50), e.g. from log files or from user studies
      • Select some (major) search engines
      • Consider only the top results (use a cut-off)
      • Anonymise the search engines, randomise results positions
      • Let users judge the results
      • Calculate precision scores: the share of relevant results among all results retrieved up to the corresponding position (see the sketch below)
      • Calculate/assume recall scores: the share of relevant results shown by a certain search engine among all relevant results within the database
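To make the precision calculation above concrete, here is a minimal Python sketch that computes precision at each cut-off position from binary relevance judgements. The function name, data layout, and example judgements are illustrative only and are not taken from the studies cited in this deck.

```python
# Minimal sketch: precision at cut-off k from binary relevance judgements.
# The judgement list is ordered by result position (True = judged relevant).

def precision_at_k(judgements: list[bool], k: int) -> float:
    """Share of relevant results among the top k retrieved results."""
    top_k = judgements[:k]
    if not top_k:
        return 0.0
    return sum(top_k) / len(top_k)

# Example: judged top-5 results of one (anonymised) engine for one query
judged = [True, True, False, True, False]
for k in range(1, len(judged) + 1):
    print(f"P@{k} = {precision_at_k(judged, k):.2f}")
```

Averaging these per-query scores over all test queries gives the precision values plotted at each cut-off in graphs like the one on the next slide.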
  14. Recall-precision graph (top 20 results) (Lewandowski 2008)
  15. Standard design for retrieval effectiveness tests
      • Problematic assumptions
        – Model of the "dedicated searcher" (willing to select one result after the other and work through an extensive list of results)
        – Users are assumed to want both high precision and high recall.
      • These studies do not consider
        – how many documents a user is willing to view / how many are sufficient to answer the query
        – how popular the queries used in the evaluation are
        – graded relevance judgements (relevance scales)
        – different relevance judgements by different jurors
        – different query types
        – results descriptions
        – users' typical results selection behaviour
        – the visibility of different elements in the results lists (through their presentation)
        – users' preference for a certain search engine
        – the diversity of the results set / the top results
        – ...
  16. Results selection: simple results list
  17. Universal Search
  18. Universal Search: results page combining ads, news results, organic results, image results, video results, and further organic results
  19. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. "Universal Search"; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  20. Results descriptions: sources include the META description, the Yahoo Directory, and the Open Directory
  21. Results descriptions: keywords in context (KWIC)
  22. Results selection: simple results list
  23. Results selection with descriptions
  24. Ratio of relevant results vs. relevant descriptions (top 20 results)
  25. Recall-precision graph (top 20 descriptions)
  26. Precision of descriptions vs. precision of results (Google)
  27. Recall-precision graph (top 20; DRprec = relevant descriptions leading to relevant results)
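The exact definition behind a score like DRprec is not spelled out in this transcript, so the following is only a rough sketch of one plausible reading: for each ranked position we record whether the description was judged relevant and whether the underlying result was judged relevant, then compute description precision, result precision, and the share of positions where a relevant description actually leads to a relevant result. The paired-judgement format and the normalisation over all positions are assumptions.

```python
# Sketch: precision of descriptions vs. precision of results, plus a
# DRprec-style value (share of positions whose relevant description leads to
# a relevant result). Data layout and normalisation are assumed, not official.

def description_metrics(pairs: list[tuple[bool, bool]]) -> dict[str, float]:
    """pairs[i] = (description_relevant, result_relevant) at position i + 1."""
    n = len(pairs)
    return {
        "description_precision": sum(d for d, _ in pairs) / n,
        "result_precision": sum(r for _, r in pairs) / n,
        "dr_precision": sum(1 for d, r in pairs if d and r) / n,
    }

# Example: judged top-4 positions of one engine for one query
pairs = [(True, True), (True, False), (False, True), (True, True)]
print(description_metrics(pairs))
```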
  28. Search engines deal with different query types. Query types (Broder, 2002):
      • Informational
        – Looking for information on a certain topic
        – The user wants to view a few relevant pages
      • Navigational
        – Looking for a (known) homepage
        – The user wants to navigate to this homepage; only one relevant result
      • Transactional
        – Looking for a website on which to complete a transaction
        – One or more relevant results
        – The transaction can be purchasing a product, downloading a file, etc.
  29. Search engines deal with different query types. Query types (Broder, 2002):
      • Informational
        – Looking for information on a certain topic
        – The user wants to view a few relevant pages
      • Navigational
        – Looking for a (known) homepage
        – The user wants to navigate to this homepage; only one relevant result
      • Transactional
        – Looking for a website on which to complete a transaction
        – One or more relevant results
        – The transaction can be purchasing a product, downloading a file, etc.
  30. Percentage of unanswered queries ("navigational fail") (Lewandowski 2009)
  31. Successfully answered queries at results position n (Lewandowski 2009)
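As a small illustration of the navigational measures on the last two slides, the sketch below computes the share of unanswered queries and the share answered at or before position n, assuming each navigational query has at most one correct target whose rank is recorded (None if the target did not appear). Names and example data are invented.

```python
# Sketch: navigational-query success measures, assuming one correct target
# per query. target_positions holds the rank of the correct homepage, or
# None when the engine did not return it at all ("navigational fail").

def success_at_n(target_positions: list[int | None], n: int) -> float:
    """Share of queries whose correct target appears at position n or better."""
    hits = sum(1 for pos in target_positions if pos is not None and pos <= n)
    return hits / len(target_positions)

positions = [1, 3, None, 1, 2]  # five example navigational queries
print(f"Unanswered: {positions.count(None) / len(positions):.0%}")
print(f"Success@1:  {success_at_n(positions, 1):.0%}")
print(f"Success@3:  {success_at_n(positions, 3):.0%}")
```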
  32. Results for navigational vs. informational queries
      • Studies should consider informational as well as navigational queries.
      • Queries should be weighted according to their frequency (see the sketch below).
      • When more than 40% of queries are navigational, new search engines should put significant effort into answering these queries sufficiently.
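One simple way to realise the frequency weighting recommended above is a weighted average of per-query scores, so that popular (often navigational) queries count proportionally more. The sketch below assumes per-query scores and frequency counts are already available; all names and numbers are invented for illustration, and simple weighted averaging is only one possible choice.

```python
# Sketch: frequency-weighted averaging of per-query quality scores.
# Queries with higher log-file frequency contribute proportionally more.

def weighted_mean_score(scores: dict[str, float],
                        frequencies: dict[str, int]) -> float:
    total = sum(frequencies[q] for q in scores)
    return sum(scores[q] * frequencies[q] for q in scores) / total

scores = {"ebay": 1.0, "climate change": 0.6, "haw hamburg": 0.8}
frequencies = {"ebay": 5000, "climate change": 120, "haw hamburg": 300}

print(f"Unweighted mean:         {sum(scores.values()) / len(scores):.2f}")
print(f"Frequency-weighted mean: {weighted_mean_score(scores, frequencies):.2f}")
```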
  33. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. "Universal Search"; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  34. Addressing major problems with retrieval effectiveness tests
      • We use both navigational and informational queries.
        – There is no suitable framework for transactional queries, though.
      • We use query frequency data from the T-Online database.
        – The database consists of approx. 400 million queries from 2007 onwards.
        – We can use time series analysis.
      • We classify queries according to query type and topic.
        – We conducted a study on query classification based on 50,000 queries from T-Online log files to gain a better understanding of user intents. Data collection was "crowdsourced" to Humangrid GmbH.
  35. Addressing major problems with retrieval effectiveness tests
      • We consider all elements on the first results page.
        – Organic results, ads, shortcuts
        – We will use clickthrough data from T-Online to measure the "importance" of certain results.
      • Each result will be judged by several jurors.
        – Juror groups: students, professors, retired persons, librarians, school children, others.
        – Additional judgements by "general users" are collected in cooperation with Humangrid GmbH.
      • Results will be graded on a relevance scale (see the sketch after this slide).
        – Both the results and their descriptions will be judged.
      • We will classify all organic results according to
        – document type (e.g., encyclopaedia, blog, forum, news)
        – date
        – degree of commercial intent
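One possible way to combine graded judgements from several jurors, as mentioned in the relevance-scale bullet above, is simple averaging per result and then over the results page. The 0-4 scale, the averaging, and the example numbers below are assumptions for illustration, not the project's actual aggregation method.

```python
# Sketch: averaging graded relevance judgements (0-4 scale assumed) from
# several jurors, first per result and then over a results page.

from statistics import mean

def aggregate_judgements(judgements_per_result: list[list[int]]) -> list[float]:
    """Mean of the jurors' graded judgements for each ranked result."""
    return [mean(jurors) for jurors in judgements_per_result]

# Three ranked results, each judged by three jurors (e.g., student,
# librarian, retired person) on a 0-4 scale
judged = [[4, 3, 4], [2, 1, 2], [0, 1, 0]]
per_result = aggregate_judgements(judged)
print("Per-result scores:", per_result)
print("Page score:", round(mean(per_result), 2))
```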
  36. Addressing major problems with retrieval effectiveness tests
      • We will count ads on results pages.
        – Do search engines prefer pages carrying ads from the engine's own ad system?
      • We will ask users additional questions.
        – Users will also judge the results set of each individual search engine as a whole.
        – Users will rank the search engines based on the result sets.
        – Users will indicate where they would have stopped viewing further results.
        – Users will produce their own individual relevance-ranked list by card-sorting the complete results set from all search engines.
      • We will use printed screenshots of the results.
        – This makes the study "mobile".
        – This is especially important when considering certain user groups (e.g., elderly people).
  37. State of current work
      • First wave of data collection starting in October.
      • Proposal for additional project funding sent to the DFG (German Research Foundation).
      • Project on user intents derived from search queries is near completion.
      • Continuing collaboration with Deutsche Telekom / T-Online.
  38. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. "Universal Search"; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  39. Conclusion
      • Measuring search engine quality is a complex task.
      • Retrieval effectiveness is a major aspect of search engine quality evaluation.
      • Established evaluation frameworks are not sufficient for the web context.
  40. Thank you for your attention.
      Prof. Dr. Dirk Lewandowski
      Hamburg University of Applied Sciences
      Department Information
      Berliner Tor 5
      D-20099 Hamburg
      Germany
      www.bui.haw-hamburg.de/lewandowski.html
      E-Mail: dirk.lewandowski@haw-hamburg.de
