Evaluation in Information Retrieval

An overview of evaluation methods in the information retrieval field.


  1. 1. Evaluation in Information Retrieval Ruihua Song Web Search and Mining Group Email: rsong@microsoft.com
  2. 2. Overview • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper
  3. 3. How to evaluate? • How well does the system meet the user's information need? – System evaluation: how good are the document rankings? – User-based evaluation: how satisfied is the user?
  4.–13. [Slides 4–13: figures reproduced from Ellen Voorhees, The TREC Conference: An Introduction]
  14. 14. Evaluation Challenges On The Web • Collection is dynamic – 10–20% of URLs change every month • Queries are time sensitive – Topics are hot, then they are not • Spam methods evolve – Algorithms evaluated against last month's web may not work today • But we have a lot of users… you can use clicks as supervision SIGIR'05 keynote given by Amit Singhal from Google
  15. 15. Overview • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper
  16. 16. Ellen Voorhees, The TREC Conference: An Introduction
  17. 17. P-R curve • Precision and recall • Precision-recall curve • Average precision-recall curve
  18. 18. P-R curve (cont.) • For a query, R is the set of relevant documents in the collection, A is the answer set (result list) returned by the system, and Ra = R ∩ A is the set of relevant documents that were retrieved
  19. 19. P-R curve (cont.) • Recall is the fraction of the relevant documents that has been retrieved: recall = |Ra| / |R| • Precision is the fraction of the retrieved documents that is relevant: precision = |Ra| / |A|
  20. 20. P-R curve (cont.) • E.g. – For some query, |Total Docs| = 200, |R| = 20 – r: relevant – n: non-relevant – Ranked list: d123 (r), d84 (n), d5 (n), d87 (r), d80 (r), d59 (n), d90 (r), d8 (n), d89 (r), d55 (r), … – At rank 10, recall = 6/20, precision = 6/10
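
A minimal Python sketch of precision and recall at a rank cutoff, using the example above; the relevance labels and |R| = 20 come from the slide, while the function and variable names are illustrative.

    def precision_recall_at_k(rels, num_relevant, k):
        """rels: list of 0/1 relevance labels for the ranked list.
        num_relevant: total number of relevant documents (|R|).
        k: rank cutoff."""
        retrieved_relevant = sum(rels[:k])          # |Ra| within the top k
        precision = retrieved_relevant / k          # |Ra| / |A|
        recall = retrieved_relevant / num_relevant  # |Ra| / |R|
        return precision, recall

    # Example from the slide: r n n r r n r n r r, |R| = 20
    rels = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]
    print(precision_recall_at_k(rels, 20, 10))  # (0.6, 0.3)
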
  21. 21. Individual query P-R curve
  22. 22. P-R curve (cont.)
  23. 23. MAP • Mean Average Precision • Defined as the mean of the precision values obtained after each relevant document is retrieved, using zero as the precision for relevant documents that are not retrieved.
  24. 24. MAP (cont.) • E.g. – |Total Docs| = 200, |R| = 20 – The whole result list consists of the following 10 docs – r: relevant – n: non-relevant – Ranked list: d123 (r), d84 (n), d5 (n), d87 (r), d80 (r), d59 (n), d90 (r), d8 (n), d89 (r), d55 (r) – AP = (1 + 2/4 + 3/5 + 4/7 + 5/9 + 6/10) / 20, since the 14 relevant documents that are not retrieved contribute zero precision
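
A rough sketch of average precision and MAP following the definition on the previous slide (divide by |R|, with unretrieved relevant documents contributing zero); the function names are illustrative.

    def average_precision(rels, num_relevant):
        """rels: 0/1 relevance labels in rank order.
        num_relevant: total number of relevant documents for the query (|R|).
        Relevant documents that are never retrieved contribute zero."""
        hits, precision_sum = 0, 0.0
        for rank, rel in enumerate(rels, start=1):
            if rel:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / num_relevant

    def mean_average_precision(runs):
        """runs: list of (rels, num_relevant) pairs, one per query."""
        return sum(average_precision(r, n) for r, n in runs) / len(runs)

    # Example from the slide: AP = (1 + 2/4 + 3/5 + 4/7 + 5/9 + 6/10) / 20
    print(average_precision([1, 0, 0, 1, 1, 0, 1, 0, 1, 1], 20))  # ≈ 0.191
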
  25. 25. Precision at 10 • P@10 is the fraction of the top 10 documents in the ranked list returned for a topic that are relevant (the number of relevant documents in the top 10, divided by 10) • E.g. – there are 3 relevant documents in the top 10 documents – P@10 = 3/10 = 0.3
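
A small sketch of P@k; the relevance labels below are made up to match the example (3 relevant documents in the top 10).

    def precision_at_k(rels, k=10):
        """Fraction of the top-k ranked documents that are relevant."""
        return sum(rels[:k]) / k

    print(precision_at_k([0, 1, 0, 0, 1, 0, 0, 1, 0, 0], 10))  # 0.3
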
  26. 26. Mean Reciprocal Rank • The reciprocal rank for a topic is the reciprocal of the rank at which the first relevant document appears in the returned list; MRR is the mean of this value over all topics • E.g. – the first relevant document is ranked No. 4 – reciprocal rank = 1/4 = 0.25
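
A minimal sketch of reciprocal rank and its mean over topics; the relevance labels in the example are made up so that the first relevant document sits at rank 4.

    def reciprocal_rank(rels):
        """1 / rank of the first relevant document, or 0 if none is relevant."""
        for rank, rel in enumerate(rels, start=1):
            if rel:
                return 1.0 / rank
        return 0.0

    def mean_reciprocal_rank(runs):
        """runs: list of 0/1 relevance-label lists, one per topic."""
        return sum(reciprocal_rank(r) for r in runs) / len(runs)

    print(reciprocal_rank([0, 0, 0, 1, 1, 0]))  # 0.25: first relevant doc at rank 4
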
  27. 27. bpref • bpref stands for binary preference • Considers only judged docs in the result list • The basic idea is to count the number of times judged non-relevant docs are retrieved before judged relevant docs
  28. 28. bpref (cont.)
  29. 29. bpref (cont.) • E.g. – |Total Docs| = 200, |R| = 20 – r: judged relevant – n: judged non-relevant – u: not judged, unknown whether relevant or not – Ranked list: d123 (r), d84 (n), d5 (n), d87 (u), d80 (r), d59 (n), d90 (r), d8 (u), d89 (u), d55 (r), …
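
A rough sketch of bpref following the Buckley & Voorhees (2004) formulation cited in the references below: bpref = (1/R) Σ_r (1 − |n ranked above r| / R), where r ranges over judged relevant retrieved documents and n over the first R judged non-relevant documents; unjudged documents are ignored. The function name and label encoding are illustrative.

    def bpref(run, num_relevant):
        """run: labels in rank order: 1 = judged relevant,
        0 = judged non-relevant, None = unjudged.
        num_relevant: number of judged relevant documents (R)."""
        nonrel_seen = 0
        score = 0.0
        for label in run:
            if label is None:          # unjudged: ignored entirely
                continue
            if label == 1:
                score += 1.0 - min(nonrel_seen, num_relevant) / num_relevant
            else:
                nonrel_seen += 1
        return score / num_relevant

    # Example ranking from the slide (r n n u r n r u u r), R = 20
    run = [1, 0, 0, None, 1, 0, 1, None, None, 1]
    print(bpref(run, 20))  # 0.18
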
  30. 30. References • Baeza-Yates, R. & Ribeiro-Neto, B. Modern Information Retrieval. Addison Wesley, 1999, pp. 73–96 • Buckley, C. & Voorhees, E.M. Retrieval Evaluation with Incomplete Information. Proceedings of SIGIR 2004
  31. 31. NDCG • Two assumptions about a ranked result list – Highly relevant documents are more valuable than marginally relevant ones – The lower a relevant document is ranked, the less valuable it is for the user
  32. 32. NDCG (cont.) • Graded judgment -> gain vector • Cumulated Gain
  33. 33. NDCG (cont.) • Discounted CG • Discounting function
  34. 34. NDCG (cont.) • Ideal (D)CG vector
  35. 35. NDCG (cont.)
  36. 36. NDCG (cont.) • Normalized (D)CG
  37. 37. NDCG (cont.)
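
The formulas on slides 32–37 were shown as figures; the sketch below follows the standard Järvelin & Kekäläinen definitions cited on slide 39: graded judgments form a gain vector, DCG discounts the gain at rank i ≥ b by log_b(i), and NDCG divides the DCG by the DCG of the ideal (re-sorted) gain vector. The log base b = 2 and the example gain values are illustrative choices, not from the slides.

    import math

    def dcg(gains, b=2):
        """Discounted cumulated gain: gains at ranks >= b are divided by log_b(rank)."""
        total = 0.0
        for i, g in enumerate(gains, start=1):
            total += g if i < b else g / math.log(i, b)
        return total

    def ndcg(gains, b=2):
        """Normalize by the DCG of the ideal ranking (gains sorted in decreasing order)."""
        ideal = dcg(sorted(gains, reverse=True), b)
        return dcg(gains, b) / ideal if ideal > 0 else 0.0

    # Graded judgments (e.g. 0-3) for a ranked list, purely illustrative numbers
    print(round(ndcg([3, 2, 3, 0, 1, 2]), 3))  # ≈ 0.93
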
  38. 38. NDCG (cont.) • Pros – Graded, more precise than binary P-R – Reflects user behavior better (e.g. user persistence) – CG and DCG graphs are intuitive to interpret • Cons – Disagreements in rating – How to set the parameters
  39. 39. Reference • Järvelin, K. & Kekäläinen, J. Cumulated Gain-based Evaluation of IR Techniques. ACM Transactions on Information Systems, 2002, 20, 422–446
  40. 40. Overview • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper
  41. 41. Significance Test • Significance Test – Why is it necessary? – The t-test is commonly chosen in IR experiments • Paired • Two-tailed / One-tailed
  42. 42. Is the difference significant? • Two systems with almost the same score distributions p(score) – Is Green < Yellow? • Is the difference significant, or just caused by chance?
  43. 43. T-Test • Comparing a sample mean with the population mean – To judge whether a set of observed measurements is close to its population mean: is the difference just the error between a sample and the population it was drawn from, or has it exceeded the range allowed by sampling error, indicating a significant difference? • Comparing sample means for paired data – Sometimes we do not know the population mean and the data come in pairs. We can first look at the difference within each pair, compute the mean difference as the sample mean, and then compare it with the hypothesized population mean to see whether the difference is significant. Medical Theory, Chapter 7, excerpted from www.37c.com.cn
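
A minimal sketch of the paired, two-tailed t-test as it is typically applied to per-topic scores from two retrieval runs; the per-topic values below are made-up numbers, and scipy's ttest_rel is one standard way to compute it.

    from scipy import stats

    # Hypothetical per-topic average precision for two systems over the same topics
    system_a = [0.42, 0.31, 0.55, 0.28, 0.60, 0.47, 0.39, 0.52]
    system_b = [0.45, 0.30, 0.61, 0.35, 0.58, 0.53, 0.44, 0.57]

    # Paired (same topics), two-tailed by default
    t_stat, p_value = stats.ttest_rel(system_a, system_b)
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
    # Reject the null hypothesis of equal means if p < 0.05 (conventional threshold)
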
  44.–49. [Slides 44–49: T-Test (cont.), worked examples from Medical Theory, Chapter 7, excerpted from www.37c.com.cn]
  50. 50. Overview • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper – T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay, Accurately Interpreting Clickthrough Data as Implicit Feedback, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), 2005.
  51. 51. First Author
  52. 52. Introduction • The user study differs in at least two respects from previous work – It provides detailed insight into the users' decision-making process through the use of eyetracking – It evaluates relative preference signals derived from user behavior • Clicking decisions are biased in at least two ways: trust bias and quality bias • Clicks have to be interpreted relative to the order of presentation and relative to the other abstracts
  53. 53. User Study • The studies were designed not only to record and evaluate user actions, but also to give insight into the decision process that led the user to the action • This is achieved by recording users' eye movements with an eye tracker
  54. 54. Questions Used
  55. 55. Two Phases of the Study • Phase I – 34 participants – Start search with a Google query, search for answers • Phase II – Investigate how users react to manipulations of search results – Same instructions as Phase I – Each subject assigned to one of three experimental conditions • Normal • Swapped • Reversed
  56. 56. Explicit Relevance Judgments • Collected explicit relevance judgments for all queries and results pages – Phase I • Randomized the order of abstracts and asked judges to (weakly) order the abstracts – Phase II • The set for judging includes more: abstracts and Web pages • Inter-judge agreement – Phase I: 89.5% – Phase II: abstracts 82.5%, pages 86.4%
  57. 57. Eyetracking • Fixations – 200–300 milliseconds – Used in this paper • Saccades – 40–50 milliseconds • Pupil dilation
  58. 58. Analysis of User Behavior • Which links do users view and click? • Do users scan links from top to bottom? • Which links do users evaluate before clicking?
  59. 59. Which links do users view and click? • Almost equal frequency of 1st and 2nd link, but more clicks on 1st link • Once the user has started scrolling, rank appears to become less of an influence
  60. 60. Do users scan links from top to bottom? • Big gap before viewing 3rd ranked abstract • Users scan viewable results thoroughly before scrolling
  61. 61. Which links do users evaluate before clicking? • Abstracts closer above the clicked link are more likely to be viewed • Abstract right below a link is viewed roughly 50% of the time
  62. 62. Analysis of Implicit Feedback • Does relevance influence user decisions? • Are clicks absolute relevance judgments? • Are clicks relative relevance judgments?
  63. 63. Does relevance influence user decisions? • Yes • Use the “reversed” condition – It deliberately decreases the quality of the retrieval function and the relevance of highly ranked abstracts • Users react in two ways – They view lower ranked links more frequently and scan significantly more abstracts – Subjects are much less likely to click on the first link, and more likely to click on a lower ranked link
  64. 64. Are clicks absolute relevance judgments? • Interpretation is problematic • Trust Bias – The abstract ranked first receives more clicks than the second, either because • the first link is more relevant (not influenced by order of presentation), or • users prefer the first link due to some level of trust in the search engine (influenced by order of presentation)
  65. 65. Trust Bias • Hypothesis that users are not influenced by presentation order can be rejected • Users have substantial trust in search engine’s ability to estimate relevance
  66. 66. Quality Bias • Quality of the ranking influences the user's clicking behavior – If relevance of retrieved results decreases, users click on abstracts that are on average less relevant – Confirmed by the “reversed” condition
  67. 67. Are clicks relative relevance judgments? • An accurate interpretation of clicks needs to take the two biases into consideration, but they are difficult to measure explicitly – The user's trust in the quality of the search engine – The quality of the retrieval function itself • How about interpreting clicks as pairwise preference statements? • An example; a sketch of one such strategy follows
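
A rough sketch of turning clicks into pairwise preferences in the spirit of the paper's “Click > Skip Above” strategy: a clicked result is preferred over every result ranked above it that was skipped. The session data, result names, and function name here are hypothetical, not the paper's implementation.

    def click_skip_above(ranked_urls, clicked):
        """ranked_urls: results in presentation order.
        clicked: set of results the user clicked.
        Returns (preferred, over) pairs: clicked doc > skipped doc ranked above it."""
        preferences = []
        for i, url in enumerate(ranked_urls):
            if url in clicked:
                for above in ranked_urls[:i]:
                    if above not in clicked:
                        preferences.append((url, above))
        return preferences

    # Hypothetical session: user clicks results 3 and 5 of a five-result page
    ranking = ["l1", "l2", "l3", "l4", "l5"]
    print(click_skip_above(ranking, {"l3", "l5"}))
    # [('l3', 'l1'), ('l3', 'l2'), ('l5', 'l1'), ('l5', 'l2'), ('l5', 'l4')]
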
  68. 68. In the example, Comments: • Takes trust and quality bias into consideration • Substantially and significantly better than random • Close in accuracy to inter-judge agreement
  69. 69. Experimental Results
  70. 70. In the example, Comments: • Slightly more accurate than Strategy 1 • Not a significant difference in Phase II
  71. 71. Experimental Results
  72. 72. In the example, Comments: • Accuracy worse than Strategy 1 • Ranking quality has an effect on the accuracy
  73. 73. Experimental Results
  74. 74. In the example, Rel(l5) > Rel(l4) Comments: • No significant differences compared to Strategy 1
  75. 75. Experimental Results
  76. 76. In the example, Rel(l1) > Rel(l2), Rel(l3) > Rel(l4), Rel(l5) > Rel(l6) Comments: • Highly accurate in the “normal” condition • Misleading – Aligned preferences are probably less valuable for learning – Would score well even if the user behaved randomly • Less accurate than Strategy 1 in the “reversed” condition
  77. 77. Experimental Results
  78. 78. Conclusion • Users' clicking decisions are influenced by trust bias and quality bias, so it is difficult to interpret clicks as absolute feedback • Strategies for generating relative relevance feedback signals are proposed and shown to correspond well with explicit judgments • While the implicit relevance signals are less consistent with the explicit judgments than the explicit judgments are among each other, the difference is encouragingly small
  79. 79. Summary • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper – T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay, Accurately Interpreting Clickthrough Data as Implicit Feedback, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), 2005.
  80. 80. Thanks!
