Evaluation in Information Retrieval

Evaluation methods in the IR field

Transcript

  • 1. Evaluation in Information Retrieval Ruihua Song Web Search and Mining Group Email: rsong@microsoft.com
  • 2. Overview • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper
  • 3. How to evaluate? • How well does the system meet the information need? - System evaluation: how good are the document rankings? - User-based evaluation: how satisfied is the user?
  • 4-13. Ellen Voorhees, The TREC Conference: An Introduction (slides 4-13 are reproduced from this tutorial)
  • 14. Evaluation Challenges On The Web • Collection is dynamic - 10-20% of URLs change every month • Queries are time sensitive - Topics are hot, then they are not • Spam methods evolve - Algorithms evaluated against last month's web may not work today • But we have a lot of users… you can use clicks as supervision SIGIR'05 keynote given by Amit Singhal from Google
  • 15. Overview • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper
  • 16. Ellen Voorhees, The TREC Conference: An Introduction
  • 17. P-R curve • Precision and recall • Precision-recall curve • Average precision-recall curve
  • 18. P-R curve (cont.) • For a query there is a result list (answer set): A is the answer set of retrieved documents, R is the set of relevant documents, and Ra is the set of relevant documents that were retrieved (the intersection of R and A)
  • 19. P-R curve (cont.) • Recall is the fraction of the relevant documents that has been retrieved: recall = |Ra| / |R| • Precision is the fraction of the retrieved documents that is relevant: precision = |Ra| / |A|
  • 20. P-R curve (cont.) • E.g. - For some query, |Total Docs| = 200, |R| = 20 - r: relevant - n: non-relevant - Ranked list: d123 (r), d84 (n), d5 (n), d87 (r), d80 (r), d59 (n), d90 (r), d8 (n), d89 (r), d55 (r), ... - At rank 10, recall = 6/20, precision = 6/10
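A minimal sketch of these two measures in Python, matching the example above (only the r/n pattern and |R| = 20 matter, not the document IDs):

```python
def precision_recall_at_k(rel_labels, total_relevant, k):
    """Precision and recall after the top-k results.

    rel_labels: list of booleans, True if the result at that rank is relevant.
    total_relevant: |R|, the number of relevant documents for the query.
    """
    retrieved_relevant = sum(rel_labels[:k])      # |Ra| within the top k
    precision = retrieved_relevant / k            # |Ra| / |A|
    recall = retrieved_relevant / total_relevant  # |Ra| / |R|
    return precision, recall

# Labels of the ranked list from slide 20: r n n r r n r n r r
labels = [True, False, False, True, True, False, True, False, True, True]
print(precision_recall_at_k(labels, total_relevant=20, k=10))  # (0.6, 0.3)
```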
  • 21. Individual query P-R curve
  • 22. P-R curve (cont.)
  • 23. MAP • Mean Average Precision • Defined as the mean of the precision values obtained after each relevant document is retrieved, using zero as the precision for relevant documents that are not retrieved.
  • 24. MAP (cont.) • E.g. - |Total Docs| = 200, |R| = 20 - The whole result list consists of the following 10 docs - r: relevant - n: non-relevant - d123 (r), d84 (n), d5 (n), d87 (r), d80 (r), d59 (n), d90 (r), d8 (n), d89 (r), d55 (r) - AP for this query = (1 + 2/4 + 3/5 + 4/7 + 5/9 + 6/10) / 20 ≈ 0.19 (dividing by |R| = 20, since the 14 relevant documents that are never retrieved contribute zero precision; MAP is the mean of this value over all queries)
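A sketch of per-query average precision under that definition (dividing by |R| implements "zero precision for relevant documents that are not retrieved"); MAP is then just the mean of this value over all queries:

```python
def average_precision(rel_labels, total_relevant):
    """Average precision for one query.

    Precision is accumulated at each rank that holds a relevant document;
    relevant documents that are never retrieved contribute zero, so the
    sum is divided by total_relevant (|R|).
    """
    hits, precision_sum = 0, 0.0
    for rank, is_rel in enumerate(rel_labels, start=1):
        if is_rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant

labels = [True, False, False, True, True, False, True, False, True, True]
print(average_precision(labels, total_relevant=20))  # ~0.19
```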
  • 25. Precision at 10 • P@10 is the proportion of relevant documents among the top 10 documents in the ranked list returned for a topic • E.g. - there are 3 relevant documents in the top 10 - P@10 = 3/10 = 0.3
  • 26. Mean Reciprocal Rank • MRR is the reciprocal of the rank of the first relevant document in the ranked list returned for a topic, averaged over topics • E.g. - the first relevant document is ranked at No. 4 - MRR = 1/4 = 0.25
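Both measures are one-liners. A sketch matching the two examples above; the label pattern here is made up, chosen only so that 3 of the top 10 are relevant and the first relevant result sits at rank 4:

```python
def precision_at_10(rel_labels):
    """Fraction of the top-10 results that are relevant."""
    return sum(rel_labels[:10]) / 10

def reciprocal_rank(rel_labels):
    """1 / rank of the first relevant result, or 0 if none is relevant.
    MRR is the mean of this value over all topics."""
    for rank, is_rel in enumerate(rel_labels, start=1):
        if is_rel:
            return 1 / rank
    return 0.0

# Hypothetical label pattern consistent with the slides' examples
labels = [False, False, False, True, False, True, False, False, True, False]
print(precision_at_10(labels))   # 0.3
print(reciprocal_rank(labels))   # 0.25
```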
  • 27. bpref • bpref stands for Binary Preference • Considers only judged docs in the result list • The basic idea is to count the number of times judged non-relevant docs are retrieved before judged relevant docs
  • 28. bpref (cont.)
  • 29. bpref (cont.) • E.g. - |Total Docs| = 200, |R| = 20 - r: judged relevant - n: judged non-relevant - u: not judged, unknown whether relevant or not - Ranked list: d123 (r), d84 (n), d5 (n), d87 (u), d80 (r), d59 (n), d90 (r), d8 (u), d89 (u), d55 (r), ...
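A sketch of bpref following the Buckley & Voorhees (2004) formulation cited in the references below; the exact handling of the penalty cap varies between versions of the measure, so treat this as one common variant:

```python
def bpref(judgments, num_relevant):
    """bpref for one query.

    judgments: ranked list of labels 'r' (judged relevant), 'n' (judged
               non-relevant) or 'u' (unjudged); unjudged docs are ignored.
    num_relevant: R, the number of judged relevant documents for the query.
    """
    nonrel_seen = 0
    score = 0.0
    for label in judgments:
        if label == 'n':
            nonrel_seen += 1
        elif label == 'r':
            # Penalty: judged non-relevant docs retrieved before this one,
            # capped at R so each term stays in [0, 1].
            score += 1.0 - min(nonrel_seen, num_relevant) / num_relevant
    return score / num_relevant

# Prefix from slide 29: r n n u r n r u u r
print(bpref(list("rnnurnruur"), num_relevant=20))  # 0.18 for this prefix
```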
  • 30. References • Baeza-Yates, R. & Ribeiro-Neto, B., Modern Information Retrieval, Addison Wesley, 1999, pp. 73-96 • Buckley, C. & Voorhees, E.M., Retrieval Evaluation with Incomplete Information, Proceedings of SIGIR 2004
  • 31. NDCG • Two assumptions about the ranked result list - Highly relevant documents are more valuable - The greater the ranked position of a relevant document, the less valuable it is for the user
  • 32. NDCG (cont.) • Graded judgment -> gain vector • Cumulated Gain
  • 33. NDCG (cont.) • Discounted CG • Discounting function
  • 34. NDCG (cont.) • Ideal (D)CG vector
  • 35. NDCG (cont.)
  • 36. NDCG (cont.) • Normalized (D)CG
  • 37. NDCG (cont.)
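A sketch of DCG and NDCG following the Järvelin & Kekäläinen formulation referenced below, with the discount base b left as a parameter (ranks below b are not discounted); the graded gain vector in the example is hypothetical:

```python
import math

def dcg(gains, b=2):
    """Discounted cumulated gain: gains at ranks below b are kept as-is;
    from rank b onward the gain is divided by log_b(rank)."""
    total = 0.0
    for rank, g in enumerate(gains, start=1):
        total += g if rank < b else g / math.log(rank, b)
    return total

def ndcg(gains, b=2):
    """DCG normalized by the DCG of the ideal (descending-gain) ranking."""
    ideal_dcg = dcg(sorted(gains, reverse=True), b)
    return dcg(gains, b) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded judgments (0-3) for a ranked list of 6 documents
print(ndcg([3, 2, 3, 0, 1, 2]))  # ~0.93
```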
  • 38. NDCG (cont.) • Pros - Graded, more precise than R-P - Reflects more of user behavior (e.g. user persistence) - CG and DCG graphs are intuitive to interpret • Cons - Disagreements in rating - How to set the parameters
  • 39. Reference • Järvelin, K. & Kekäläinen, J., Cumulated Gain-based Evaluation of IR Techniques, ACM Transactions on Information Systems, 2002, 20, 422-446
  • 40. Overview • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper
  • 41. Significance Test • Significance Test - Why is it necessary? - The t-test is chosen in IR experiments • Paired • Two-tailed / One-tailed
  • 42. Is the difference significant? • Two almost identical systems: Green < Yellow? • Is the difference significant, or just caused by chance? [Figure: overlapping score distributions p(score) for the two systems]
  • 43. T-Test • Comparing a sample mean with a population mean - To judge whether a set of observed measurements is close to its population mean: is the difference merely the sampling error between a sample and its population, or does it exceed the allowable range of sampling error, indicating a significant difference? • Comparing sample means of paired data - Sometimes we do not know the population mean and the data are paired. We can first examine the difference within each pair, compute the mean difference as the sample statistic, and compare it with the hypothesized population mean difference to see whether the difference is significant. (Medical Theory, Chapter 7; excerpted from www.37c.com.cn)
  • 44-49. T-Test (cont.) (Medical Theory, Chapter 7; excerpted from www.37c.com.cn)
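A minimal sketch of the paired, two-tailed t-test as it is typically applied to two systems' per-query scores; the score values here are made up for illustration:

```python
from scipy import stats

# Hypothetical per-query AP scores of two systems over the same topics
system_a = [0.31, 0.45, 0.12, 0.66, 0.29, 0.51, 0.40, 0.23, 0.58, 0.37]
system_b = [0.35, 0.47, 0.15, 0.60, 0.34, 0.55, 0.46, 0.25, 0.61, 0.42]

# Paired t-test on the per-topic differences; two-tailed by default
result = stats.ttest_rel(system_b, system_a)
print(result.statistic, result.pvalue)

# A common convention: call the difference significant if p < 0.05
print("significant" if result.pvalue < 0.05 else "not significant")
```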
  • 50. Overview • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper - T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay, Accurately Interpreting Clickthrough Data as Implicit Feedback, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), 2005.
  • 51. First Author
  • 52. Introduction • The user study differs in at least two respects from previous work - It provides detailed insight into the users' decision-making process through the use of eyetracking - It evaluates relative preference signals derived from user behavior • Clicking decisions are biased in at least two ways: trust bias and quality bias • Clicks have to be interpreted relative to the order of presentation and relative to the other abstracts
  • 53. User Study • The studies were designed not only to record and evaluate user actions, but also to give insight into the decision process that led the user to the action • This is achieved by recording users' eye movements with an eye tracker
  • 54. Questions Used
  • 55. Two Phases of the Study • Phase I - 34 participants - Start search with a Google query, search for answers • Phase II - Investigate how users react to manipulations of search results - Same instructions as Phase I - Each subject assigned to one of three experimental conditions • Normal • Swapped • Reversed
  • 56. Explicit Relevance Judgments • Collected explicit relevance judgments for all queries and results pages - Phase I • Randomized the order of abstracts and asked judges to (weakly) order the abstracts - Phase II • The set for judging includes more • Abstracts and Web pages • Inter-judge agreement - Phase I: 89.5% - Phase II: abstracts 82.5%, pages 86.4%
  • 57. Eyetracking • Fixations - 200-300 milliseconds - Used in this paper • Saccades - 40-50 milliseconds • Pupil dilation
  • 58. Analysis of User Behavior • Which links do users view and click? • Do users scan links from top to bottom? • Which links do users evaluate before clicking?
  • 59. Which links do users view and click? • Almost equal frequency of 1st and 2nd link, but more clicks on 1st link • Once the user has started scrolling, rank appears to become less of an influence
  • 60. Do users scan links from top to bottom? • Big gap before viewing 3rd ranked abstract • Users scan viewable results thoroughly before scrolling
  • 61. Which links do users evaluate before clicking? • Abstracts closer above the clicked link are more likely to be viewed • Abstract right below a link is viewed roughly 50% of the time
  • 62. Analysis of Implicit Feedback • Does relevance influence user decisions? • Are clicks absolute relevance judgments? • Are clicks relative relevance judgments?
  • 63. Does relevance influence user decisions? • Yes • Use the “reversed” condition - Controllably decreases the quality of the retrieval function and the relevance of highly ranked abstracts • Users react in two ways - View lower ranked links more frequently, scan significantly more abstracts - Subjects are much less likely to click on the first link, more likely to click on a lower ranked link
  • 64. Are clicks absolute relevance judgments? • Interpretation is problematic • Trust Bias - Abstract ranked first receives more clicks than the second • First link is more relevant (not influenced by order of presentation), or • Users prefer the first link due to some level of trust in the search engine (influenced by order of presentation)
  • 65. Trust Bias • Hypothesis that users are not influenced by presentation order can be rejected • Users have substantial trust in search engine’s ability to estimate relevance
  • 66. Quality Bias • Quality of the ranking influences the user’s clicking behavior - If relevance of retrieved results decreases, users click on abstracts that are on average less relevant - Confirmed by the “reversed” condition
  • 67. Are clicks relative relevance judgments? • An accurate interpretation of clicks needs to take two biases into consideration, but they are difficult to measure explicitly - The user’s trust in the quality of the search engine - The quality of the retrieval function itself • How about interpreting clicks as pairwise preference statements? • An example
  • 68. In the example, Comments: • Takes trust and quality bias into consideration • Substantially and significantly better than random • Close in accuracy to inter judge agreement
  • 69. Experimental Results
  • 70. In the example, Comments: • Slightly more accurate than Strategy 1 • Not a significant difference in Phase II
  • 71. Experimental Results
  • 72. In the example, Comments: • Accuracy worse than Strategy 1 • Ranking quality has an effect on the accuracy
  • 73. Experimental Results
  • 74. In the example, Rel(l5) > Rel(l4) Comments: • No significant differences compared to Strategy 1
  • 75. Experimental Results
  • 76. In the example, Rel(l1) > Rel(l2), Rel(l3) > Rel(l4), Rel(l5) > Rel(l6) Comments: • Highly accurate in the “normal” condition • Misleading - Aligned preferences are probably less valuable for learning - Better results even if the user behaves randomly • Less accurate than Strategy 1 in the “reversed” condition
  • 77. Experimental Results
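A sketch of how the first of these strategies (Click > Skip Above in Joachims et al.: a clicked result is preferred over every skipped result ranked above it) can be turned into preference pairs; the click-log format and the link names l1-l5 are illustrative assumptions:

```python
def click_skip_above(ranked_urls, clicked_ranks):
    """Generate pairwise preferences: each clicked result is preferred
    over every higher-ranked result that was not clicked ('skipped')."""
    clicked = set(clicked_ranks)
    prefs = []
    for c in sorted(clicked):
        for above in range(1, c):
            if above not in clicked:
                # rel(clicked doc) > rel(skipped doc ranked above it)
                prefs.append((ranked_urls[c - 1], ranked_urls[above - 1]))
    return prefs

# Hypothetical ranking l1..l5 with clicks on ranks 3 and 5
prefs = click_skip_above(["l1", "l2", "l3", "l4", "l5"], clicked_ranks=[3, 5])
print(prefs)  # [('l3', 'l1'), ('l3', 'l2'), ('l5', 'l1'), ('l5', 'l2'), ('l5', 'l4')]
```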
  • 78. Conclusion • Users’ clicking decisions are influenced by trust bias and quality bias, so it is difficult to interpret clicks as absolute feedback • Strategies for generating relative relevance feedback signals are proposed and shown to correspond well with explicit judgments • While the implicit relevance signals are less consistent with explicit judgments than the explicit judgments are with each other, the difference is encouragingly small
  • 79. Summary • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper - T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay, Accurately Interpreting Clickthrough Data as Implicit Feedback, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), 2005.
  • 80. Thanks!