Measuring the quality of web search engines

  1. Measuring the quality of web search engines. Prof. Dr. Dirk Lewandowski, University of Applied Sciences Hamburg, dirk.lewandowski@haw-hamburg.de. Tartu University, 14 September 2009.
  2. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. “Universal Search”; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  3. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. “Universal Search”; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  4. Search engine market: Germany 2009 (Webhits, 2009)
  5. Search engine market: Estonia 2007 (Global Search Report, 2007)
  6. Why measure the quality of web search engines?
     • Search engines are the main access point to web content.
     • One player is dominating the worldwide market.
     • Open questions:
       – How good are search engines’ results?
       – Do we need alternatives to the “big three” (“big two”? “big one”?)?
       – How good are alternative search engines at delivering an alternative view on web content?
       – How good must a new search engine be to compete?
  7. A framework for measuring search engine quality
     • Index quality
       – Size of the database, coverage of the web
       – Coverage of certain areas (countries, languages)
       – Index overlap
       – Index freshness
     • Quality of the results
       – Retrieval effectiveness
       – User satisfaction
       – Results overlap
     • Quality of the search features
       – Features offered
       – Operational reliability
     • Search engine usability and user guidance
     (Lewandowski & Höchstötter, 2007)
  8. A framework for measuring search engine quality
     • Index quality
       – Size of the database, coverage of the web
       – Coverage of certain areas (countries, languages)
       – Index overlap
       – Index freshness
     • Quality of the results
       – Retrieval effectiveness
       – User satisfaction
       – Results overlap
     • Quality of the search features
       – Features offered
       – Operational reliability
     • Search engine usability and user guidance
     (Lewandowski & Höchstötter, 2007)
  9. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. “Universal Search”; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  10. Users use relatively few cognitive resources in web searching.
      • Queries
        – Average length: 1.7 words (German-language queries; English-language queries are slightly longer)
        – Approx. 50 percent of queries consist of just one word
      • Search engine results pages (SERPs)
        – 80 percent of users view no more than the first results page (10 results)
        – Users normally view only the first few results (“above the fold”)
        – Users view no more than five results per session
        – Session length is less than 15 minutes
      • Users are usually satisfied with the results given.
  11. Results selection (top 11 results) (Granka et al., 2004)
  12. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. “Universal Search”; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  13. Standard design for retrieval effectiveness tests
      • Select (at least 50) queries (from log files, from user studies, etc.)
      • Select some (major) search engines
      • Consider only the top results (use a cut-off)
      • Anonymise the search engines and randomise the results positions
      • Let users judge the results
      • Calculate precision scores: the ratio of relevant results to all results retrieved up to the corresponding cut-off position
      • Calculate/assume recall scores: the ratio of relevant results shown by a certain search engine to all relevant results within the database
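To make the precision measure on slide 13 concrete, here is a minimal sketch of precision at a cut-off, averaged over queries per engine. It is an illustration only, not the tooling used in the studies discussed here; the engine names, queries, and relevance judgments are invented.

```python
from statistics import mean
from typing import Dict, List

# Invented example data: judgments[engine][query] is a list of booleans, one per
# result position, True meaning a juror judged that result relevant. In the studies
# described on this slide there would be at least 50 queries and several jurors.
judgments: Dict[str, Dict[str, List[bool]]] = {
    "engine_A": {"q1": [True, True, False, True], "q2": [False, True, True, False]},
    "engine_B": {"q1": [True, False, False, False], "q2": [True, True, False, True]},
}

def precision_at_k(relevance: List[bool], k: int) -> float:
    """Ratio of relevant results among the top k retrieved results."""
    top = relevance[:k]
    return sum(top) / len(top) if top else 0.0

def mean_precision_at_k(engine_runs: Dict[str, List[bool]], k: int) -> float:
    """Macro-average of precision@k over all queries for one engine."""
    return mean(precision_at_k(rel, k) for rel in engine_runs.values())

for engine, runs in judgments.items():
    print(engine, round(mean_precision_at_k(runs, 3), 2))
```

Recall, by contrast, can only be estimated on the web: the set of all relevant documents is unknown, so it is typically approximated against a pool of known relevant results, for example the union of relevant results found by all engines under test.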
  14. Recall-precision graph (top 20 results) (Lewandowski, 2008)
  15. Standard design for retrieval effectiveness tests
      • Problematic assumptions
        – The model of the “dedicated searcher”, who is willing to select one result after the other and to work through an extensive list of results
        – The assumption that the user wants both high precision and high recall
      • These studies do not consider
        – how many documents a user is willing to view / how many are sufficient to answer the query
        – how popular the queries used in the evaluation are
        – graded relevance judgements (relevance scales)
        – different relevance judgements by different jurors
        – different query types
        – results descriptions
        – users’ typical results selection behaviour
        – the visibility of different elements in the results lists (through their presentation)
        – users’ preference for a certain search engine
        – the diversity of the results set / the top results
        – ...
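One gap named on slide 15, the lack of graded relevance judgements, is commonly closed with gain-based measures such as nDCG. The slide does not prescribe a specific measure, so the sketch below is only one hedged possibility, using an invented 0–3 grading scale and invented grades.

```python
import math
from typing import List

# Invented graded judgments for one result list: 0 = irrelevant ... 3 = highly relevant.
grades: List[int] = [3, 2, 0, 1, 2, 0, 3, 0, 1, 0]

def dcg(grades: List[int], k: int) -> float:
    """Discounted cumulative gain at cut-off k: lower-ranked results count less."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(grades[:k]))

def ndcg(grades: List[int], k: int) -> float:
    """DCG normalised by the ideal (descending) ordering of the same grades."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

print(round(ndcg(grades, 10), 3))
```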
  16. Results selection (simple)
  17. Universal Search
  18. Universal Search: annotated results page showing ads, news results, organic results, image results, video results, and further organic results (contd.)
  19. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. “Universal Search”; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  20. Results descriptions: META description, Yahoo Directory, Open Directory
  21. Results descriptions: keywords in context (KWIC)
  22. Results selection (simple)
  23. Results selection with descriptions
  24. Ratio of relevant results vs. relevant descriptions (top 20 results)
  25. Recall-precision graph (top 20 descriptions)
  26. Precision of descriptions vs. precision of results (Google)
  27. Recall-precision graph (top 20; DRprec = relevant descriptions leading to relevant results)
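Slides 24 to 27 compare judgments of the result descriptions with judgments of the results themselves, and slide 27 defines DRprec as relevant descriptions leading to relevant results. The sketch below computes these ratios for one query under one plausible reading of that definition; the data and the data structures are invented for illustration.

```python
from typing import List, NamedTuple

class Judged(NamedTuple):
    description_relevant: bool  # juror judged the result description (snippet) relevant
    result_relevant: bool       # juror judged the result document itself relevant

# Invented judgments for the top results of a single query.
results: List[Judged] = [
    Judged(True, True), Judged(True, False), Judged(False, True),
    Judged(True, True), Judged(False, False),
]

def result_precision(res: List[Judged], k: int) -> float:
    """Share of the top-k results judged relevant."""
    top = res[:k]
    return sum(r.result_relevant for r in top) / len(top)

def description_precision(res: List[Judged], k: int) -> float:
    """Share of the top-k descriptions judged relevant."""
    top = res[:k]
    return sum(r.description_relevant for r in top) / len(top)

def dr_precision(res: List[Judged], k: int) -> float:
    """One reading of DRprec: share of top-k positions whose description is relevant AND leads to a relevant result."""
    top = res[:k]
    return sum(r.description_relevant and r.result_relevant for r in top) / len(top)

print(result_precision(results, 5), description_precision(results, 5), dr_precision(results, 5))
```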
  28. Search engines deal with different query types. Query types (Broder, 2002):
      • Informational
        – Looking for information on a certain topic
        – The user wants to view a few relevant pages
      • Navigational
        – Looking for a (known) homepage
        – The user wants to navigate to this homepage; there is only one relevant result
      • Transactional
        – Looking for a website on which to complete a transaction
        – One or more relevant results
        – The transaction can be purchasing a product, downloading a file, etc.
  29. Search engines deal with different query types. Query types (Broder, 2002):
      • Informational
        – Looking for information on a certain topic
        – The user wants to view a few relevant pages
      • Navigational
        – Looking for a (known) homepage
        – The user wants to navigate to this homepage; there is only one relevant result
      • Transactional
        – Looking for a website on which to complete a transaction
        – One or more relevant results
        – The transaction can be purchasing a product, downloading a file, etc.
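Evaluating by query type requires that each test query carry a Broder-style label. In the work described here the labelling was done by people; the keyword heuristics below are purely hypothetical and only illustrate what such a classification step produces.

```python
import re
from typing import List

# Hypothetical cue lists; they are assumptions for illustration,
# not the coding scheme used in the studies cited on this slide.
NAVIGATIONAL_PATTERN = re.compile(r"(\.(com|de|ee|org|net)$)|(^www\.)", re.IGNORECASE)
TRANSACTIONAL_CUES = {"download", "buy", "order", "cheap", "ticket"}

def classify_query(query: str) -> str:
    """Assign a rough Broder (2002) label: navigational, transactional, or informational."""
    q = query.strip().lower()
    if NAVIGATIONAL_PATTERN.search(q):
        return "navigational"
    if any(word in TRANSACTIONAL_CUES for word in q.split()):
        return "transactional"
    return "informational"

queries: List[str] = ["www.ut.ee", "buy concert ticket tallinn", "history of tartu university"]
for q in queries:
    print(q, "->", classify_query(q))
```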
  30. Percentage of unanswered queries (“navigational fail”) (Lewandowski, 2009)
  31. Successfully answered queries at results position n (Lewandowski, 2009)
  32. Results for navigational vs. informational queries
      • Studies should consider informational as well as navigational queries.
      • Queries should be weighted according to their frequency.
      • When more than 40% of queries are navigational, new search engines should put significant effort into answering these queries sufficiently.
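Two of these recommendations, position-based success for navigational queries (slide 31) and frequency weighting, can be made concrete as below. Everything in this sketch, the queries, frequencies, and hit positions, is invented for illustration.

```python
from typing import Dict, List

# hit_position is the 1-based rank at which the engine returned the correct
# homepage, or None for a "navigational fail"; frequency stands in for the
# query-log count. All values are invented.
navigational_runs: List[Dict] = [
    {"query": "example university homepage", "frequency": 1200, "hit_position": 1},
    {"query": "example city library",        "frequency": 300,  "hit_position": 4},
    {"query": "example small shop",          "frequency": 50,   "hit_position": None},
]

def success_at_n(runs: List[Dict], n: int) -> float:
    """Unweighted share of navigational queries answered within the top n results."""
    hits = [r for r in runs if r["hit_position"] is not None and r["hit_position"] <= n]
    return len(hits) / len(runs)

def weighted_success_at_n(runs: List[Dict], n: int) -> float:
    """Same measure, but each query counts in proportion to its log frequency."""
    total = sum(r["frequency"] for r in runs)
    answered = sum(r["frequency"] for r in runs
                   if r["hit_position"] is not None and r["hit_position"] <= n)
    return answered / total

print(success_at_n(navigational_runs, 1), round(weighted_success_at_n(navigational_runs, 1), 2))
```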
  33. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. “Universal Search”; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  34. Addressing major problems with retrieval effectiveness tests
      • We use navigational as well as informational queries.
        – There is, however, no suitable framework for transactional queries.
      • We use query frequency data from the T-Online database.
        – The database consists of approx. 400 million queries from 2007 onwards.
        – We can use time series analysis.
      • We classify queries according to query type and topic.
        – We conducted a study on query classification based on 50,000 queries from T-Online log files to gain a better understanding of user intents; data collection was “crowdsourced” to Humangrid GmbH.
  35. Addressing major problems with retrieval effectiveness tests
      • We consider all elements on the first results page.
        – Organic results, ads, shortcuts
        – We will use clickthrough data from T-Online to measure the “importance” of certain results.
      • Each result will be judged by several jurors.
        – Juror groups: students, professors, retired persons, librarians, school children, others.
        – Additional judgements by “general users” are collected in cooperation with Humangrid GmbH.
      • Results will be graded on a relevance scale.
        – Both the results and their descriptions will be judged.
      • We will classify all organic results according to
        – document type (e.g., encyclopaedia, blog, forum, news)
        – date
        – degree of commercial intent
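Slide 35 combines several ingredients: several jurors per result, a graded relevance scale, and clickthrough data as an importance weight. The sketch below shows one plausible way to fold these into a single score per query; the weighting scheme, the 0–4 scale, and all numbers are assumptions, not the project's actual method.

```python
from statistics import mean
from typing import Dict, List

# Invented data for one query: each result carries grades from several jurors
# (a hypothetical 0-4 scale) and its share of clicks taken from log data.
results: List[Dict] = [
    {"url": "http://example.org/a", "juror_grades": [4, 3, 4], "click_share": 0.42},
    {"url": "http://example.org/b", "juror_grades": [2, 1, 2], "click_share": 0.18},
    {"url": "http://example.org/c", "juror_grades": [3, 3, 2], "click_share": 0.05},
]

def consensus_grade(juror_grades: List[int]) -> float:
    """Average the graded judgements of all jurors for one result."""
    return mean(juror_grades)

def click_weighted_score(res: List[Dict]) -> float:
    """Weight each result's consensus grade by how often users actually click it."""
    total = sum(r["click_share"] for r in res)
    return sum(consensus_grade(r["juror_grades"]) * r["click_share"] for r in res) / total

print(round(click_weighted_score(results), 2))
```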
  36. Addressing major problems with retrieval effectiveness tests
      • We will count ads on results pages.
        – Do search engines prefer pages carrying ads from the engine’s own ad system?
      • We will ask users additional questions.
        – Users will also judge the results set of each individual search engine as a whole.
        – Users will rank the search engines based on the result sets.
        – Users will indicate where they would have stopped viewing further results.
        – Users will provide their own relevance-ranked list by card-sorting the complete results set from all search engines.
      • We will use printed screenshots of the results.
        – This makes the study “mobile”, which is especially important when considering certain user groups (e.g., elderly people).
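The card-sorting step on slide 36 yields, for every participant, a complete relevance-ranked list of the pooled results. The slide does not say how such lists will be analysed; one option, sketched here purely as an assumption, is to compare a participant's ordering with an engine's ordering via Kendall's tau rank correlation.

```python
from itertools import combinations
from typing import List

# Invented example: the engine's ordering of four pooled results versus the order
# one participant produced by card-sorting the same results.
engine_ranking: List[str] = ["result_a", "result_b", "result_c", "result_d"]
user_card_sort: List[str] = ["result_b", "result_a", "result_c", "result_d"]

def kendall_tau(order_a: List[str], order_b: List[str]) -> float:
    """Rank correlation between two orderings of the same items (-1 ... 1)."""
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        agreement = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0

print(round(kendall_tau(engine_ranking, user_card_sort), 2))
```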
  37. State of current work
      • First wave of data collection starting in October.
      • Proposal for additional project funding sent to the DFG (German Research Foundation).
      • Project on user intents from search queries near completion.
      • Continuing collaboration with Deutsche Telekom / T-Online.
  38. Agenda: Introduction; A few words about user behaviour; Standard retrieval effectiveness tests vs. “Universal Search”; Selected results: results descriptions, navigational queries; Towards an integrated test framework; Conclusions
  39. Conclusion
      • Measuring search engine quality is a complex task.
      • Retrieval effectiveness is a major aspect of search engine quality evaluation.
      • Established evaluation frameworks are not sufficient for the web context.
  40. Thank you for your attention. Prof. Dr. Dirk Lewandowski, Hamburg University of Applied Sciences, Department Information, Berliner Tor 5, D-20099 Hamburg, Germany. www.bui.haw-hamburg.de/lewandowski.html, e-mail: dirk.lewandowski@haw-hamburg.de
