Evaluation in Information Retrieval
(Book chapter from C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval)
Dishant Ailawadi
INF384H / CS395T: Concepts of Information Retrieval (and Web Search), Fall 2011
Outline
● Why Evaluation?
● Standard Test Collections
● Precision and Recall
● Mean Average Precision
● Kappa Statistic
● R-Precision
● Summary
Why Evaluation?
● There are many retrieval models, algorithms, and systems. Which one is the best?
● Measure the effect of adding new features.
● How far down the ranked list will a user need to look to find some or all relevant documents?
● Difficulty: relevance is not binary but continuous. How do we decide whether a document is relevant?
Standard Test Collections
A standard test collection consists of three things:
1. A document collection.
2. A set of queries on this collection.
3. A set of relevance judgments on those queries.
Each document in the test collection is given a binary classification (relevant or nonrelevant) for each query. This decision is referred to as the gold standard or ground truth judgment of relevance.
Standard Test Collections
● Cranfield: built in the 1950s in the UK. Too small to be used nowadays.
● TREC (Text Retrieval Conference):
– Early TRECs each had 50 information needs; TRECs 6-8 provide 150 information needs over more than 500 thousand articles.
– More recent work uses GOV2, a collection of 25 million web pages that is available for research.
● NTCIR: East Asian language and cross-language IR systems.
● Cross Language Evaluation Forum (CLEF).
● Reuters-21578: the collection most used for text classification.
Evaluation Measures

                 Relevant               Nonrelevant
Retrieved        True positives (tp)    False positives (fp)
Not retrieved    False negatives (fn)   True negatives (tn)

recall = (number of relevant documents retrieved) / (total number of relevant documents) = tp / (tp + fn)
precision = (number of relevant documents retrieved) / (total number of documents retrieved) = tp / (tp + fp)
accuracy (how many correct selections?) = (tp + tn) / (tp + fp + fn + tn)
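As a minimal sketch of the three measures above, the following Python computes recall, precision, and accuracy from the four contingency counts. The function names and the counts in the usage lines are illustrative assumptions, not values from the slides.

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    retrieved = tp + fp
    return tp / retrieved if retrieved else 0.0


def recall(tp, fn):
    """Fraction of relevant documents that were retrieved."""
    relevant = tp + fn
    return tp / relevant if relevant else 0.0


def accuracy(tp, fp, fn, tn):
    """Fraction of all retrieve / do-not-retrieve decisions that were correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts for one query over a 1000-document collection.
tp, fp, fn, tn = 20, 10, 30, 940
print(f"precision = {precision(tp, fp):.3f}")
print(f"recall    = {recall(tp, fn):.3f}")
print(f"accuracy  = {accuracy(tp, fp, fn, tn):.3f}")
```

With these hypothetical counts, accuracy comes out high even though more than half of the relevant documents are missed, because the true negatives dominate. This is one reason precision and recall are usually preferred for retrieval.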
An Example
Let total # of relevant docs = 6. Check each new recall point:

n    doc #   relevant   recall         precision
1    588     x          R=1/6=0.167    P=1/1=1
2    589     x          R=2/6=0.333    P=2/2=1
3    576
4    590     x          R=3/6=0.5      P=3/4=0.75
5    986
6    592     x          R=4/6=0.667    P=4/6=0.667
7    984
8    988
9    578
10   985
11   103
12   591
13   772     x          R=5/6=0.833    P=5/13=0.38
14   990

Missing one relevant document. Never reach 100% recall.
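The recall points in this example can be reproduced with a short sketch. The function name `recall_precision_points` is an assumption made for illustration; the relevance marks and the total of 6 relevant documents are taken from the example.

```python
def recall_precision_points(relevance, total_relevant):
    """Return (rank, recall, precision) at each rank where a relevant doc appears.

    relevance: list of 0/1 flags, one per ranked result, best rank first.
    total_relevant: number of relevant documents in the whole collection.
    """
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((rank, hits / total_relevant, hits / rank))
    return points


# Relevance marks for ranks 1..14 in the example (x at ranks 1, 2, 4, 6, 13).
ranking = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]

for rank, r, p in recall_precision_points(ranking, total_relevant=6):
    print(f"rank {rank:2d}: R={r:.3f}  P={p:.3f}")
```

Because only five of the six relevant documents are ever retrieved, the last recall value stays below 1.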
Combining Precision & Recall
F-Measure: weighted harmonic mean (HM) of precision and recall.
Value of β controls the tradeoff:
● β = 1: weight precision and recall equally.
● β > 1: weight recall more.
● β < 1: weight precision more.
For β = 1:
F = 2PR / (P + R) = 2 / (1/R + 1/P)
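A minimal sketch of the F-measure with an explicit β follows. It assumes the standard weighted harmonic mean form F = (1 + β²)PR / (β²P + R), which reduces to 2PR / (P + R) when β = 1. The function name and the sample values of P and R are illustrative assumptions.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    Assumes the standard form F = (1 + beta^2) * P * R / (beta^2 * P + R).
    beta > 1 weights recall more; beta < 1 weights precision more.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values: precision is higher than recall here.
p, r = 0.75, 0.5
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_measure(p, r, beta):.3f}")
```

Since recall is the weaker of the two values in this sample, the score drops as β grows and recall receives more weight.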
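Precision-Recall Curve
Interpolated precision: used to get a smooth curve. The interpolated precision at recall level r is the highest precision found at any recall level r' ≥ r.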
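Interpolation can be sketched as follows, assuming the usual definition in which the interpolated precision at recall level r is the maximum precision observed at any recall r' ≥ r. The (recall, precision) points come from the ranked-list example earlier in the deck; the function name and the choice of 11 evenly spaced recall levels are assumptions for illustration.

```python
def interpolated_precision(points, recall_level):
    """Highest precision at any observed recall >= recall_level (0.0 if none).

    points: list of (recall, precision) pairs measured on a ranked list.
    """
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates, default=0.0)


# (recall, precision) at each relevant document in the example ranking.
points = [(1 / 6, 1.0), (2 / 6, 1.0), (3 / 6, 0.75), (4 / 6, 4 / 6), (5 / 6, 5 / 13)]

# Assumed recall levels 0.0, 0.1, ..., 1.0.
levels = [i / 10 for i in range(11)]
curve = [interpolated_precision(points, level) for level in levels]

for level, p in zip(levels, curve):
    print(f"recall {level:.1f}: interpolated precision {p:.3f}")
```

The resulting curve never increases as recall grows, which removes the sawtooth shape of the raw precision-recall points.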
Single-Figure Measures
Mean Average Precision (MAP): the mean of the average precision values over all queries.
Example: Average Precision = (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
Normalized Discounted Cumulative Gain (NDCG): for nonbinary notions of relevance.
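The average precision example can be checked with a small sketch. The first ranking is the one from the earlier example (6 relevant documents, one never retrieved, contributing 0). The second query and both function names are hypothetical, added only to show how MAP averages over queries.

```python
def average_precision(relevance, total_relevant):
    """Mean of the precision values at each relevant document.

    Relevant documents that are never retrieved contribute a precision of 0,
    so the sum is divided by total_relevant, not by the number of hits.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (relevance flags, total_relevant) pairs, one per query."""
    return sum(average_precision(rel, n) for rel, n in runs) / len(runs)


# Query 1: the example ranking (relevant at ranks 1, 2, 4, 6, 13; 6 relevant in total).
query1 = ([1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], 6)
# Query 2: a hypothetical second query with 2 relevant documents.
query2 = ([0, 1, 0, 1], 2)

print(f"AP(query 1) = {average_precision(*query1):.3f}")
print(f"MAP         = {mean_average_precision([query1, query2]):.3f}")
```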
Assessing Relevance
● Pooling: to obtain a subset of the collection related to a query
– Use a set of search engines/algorithms.
– The top-k results (k is between 20 and 50 in TREC) are merged into a pool, and duplicates are removed.
– Present the documents in a random order to analysts for relevance judgments.
● Kappa statistic: if we have multiple judges on one information need, how consistent are those judges?
kappa = (P(A) - P(E)) / (1 - P(E))
– P(A) is the proportion of the times that the judges agreed.
– P(E) is the proportion of the times they would be expected to agree by chance.
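The pooling steps can be sketched roughly as below. The function name `build_pool`, the fixed random seed, and the three short result lists are all hypothetical; the sketch only shows the merge, de-duplicate, and shuffle steps, not any particular TREC procedure.

```python
import random


def build_pool(runs, k, seed=0):
    """Merge the top-k results of several systems into one judging pool.

    runs: list of ranked result lists (document ids), one per system.
    k: pool depth taken from each run.
    seed: fixed seed (an assumption) so the shuffled order is reproducible.
    """
    pool = set()
    for ranked in runs:
        pool.update(ranked[:k])       # the set removes duplicates across systems
    docs = sorted(pool)               # deterministic order before shuffling
    random.Random(seed).shuffle(docs) # random presentation order for the judges
    return docs


# Hypothetical rankings from three systems for the same query.
runs = [
    [588, 589, 576],
    [589, 590, 588],
    [986, 588, 592],
]
pool = build_pool(runs, k=2)
print(len(pool), "documents to judge")
```

Documents outside the pool are never shown to the judges, so in practice they are treated as nonrelevant.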
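Example: Kappa Statistic

                        Judge 2 relevance
                        Yes     No      Total
Judge 1      Yes        300     20      320
relevance    No         10      70      80
             Total      310     90      400

Observed proportion of the times the judges agreed: P(A) = (300 + 70) / 400 = 0.925
Pooled marginals: P(nonrelevant) = (80 + 90) / (400 + 400) = 0.2125, P(relevant) = (320 + 310) / (400 + 400) = 0.7875
Probability that the two judges agreed by chance (max value = 1, min = 0.5): P(E) = 0.2125^2 + 0.7875^2 = 0.665
Kappa statistic: kappa = (0.925 - 0.665) / (1 - 0.665) = 0.776
A kappa value between 0.67 and 0.8 is taken as fair agreement, but a value below 0.67 is seen as data providing a dubious basis for evaluation.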
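The kappa computation for this table can be sketched as follows, assuming pooled marginals as named on the slide (both judges' totals are combined when estimating the chance of a "yes" or "no"). The function name and argument names are assumptions for illustration.

```python
def kappa_pooled(yes_yes, yes_no, no_yes, no_no):
    """Kappa for two judges and binary relevance, using pooled marginals.

    Arguments are the four cells of the agreement table, with
    judge 1's decision named first and judge 2's decision second.
    """
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n              # P(A): observed agreement

    judge1_yes = yes_yes + yes_no
    judge2_yes = yes_yes + no_yes
    p_yes = (judge1_yes + judge2_yes) / (2 * n)  # pooled P(relevant)
    p_no = 1 - p_yes                             # pooled P(nonrelevant)
    p_chance = p_yes ** 2 + p_no ** 2            # P(E): agreement by chance

    if p_chance == 1:
        return 1.0  # degenerate case: every judgment falls in one class
    return (p_agree - p_chance) / (1 - p_chance)


# Cells from the example table: 300, 20, 10, 70.
print(f"kappa = {kappa_pooled(300, 20, 10, 70):.3f}")
```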
Evaluation
R-Precision: precision at the R-th position in the ranking, where R = # of relevant docs = 7.
R-Precision = 4/7 = 0.571

n    doc #   relevant
1    588     x
2    589     x
3    576
4    590     x
5    986
6    592     x
7    984
8    988
9    578
10   985
11   103
12   591
13   772     x
14   990

A/B Test: precisely one change between the current and previous system. We evaluate the effect of that change on the system.
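R-precision for this ranking can be sketched in a few lines. The function name is an assumption; the relevance marks and R = 7 are taken from the slide.

```python
def r_precision(relevance, total_relevant):
    """Precision over the top R results, where R = total_relevant."""
    if total_relevant == 0:
        return 0.0
    top_r = relevance[:total_relevant]   # only the first R ranks are examined
    return sum(top_r) / total_relevant


# Relevance marks for ranks 1..14 (x at ranks 1, 2, 4, 6, 13).
ranking = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(f"R-Precision = {r_precision(ranking, total_relevant=7):.3f}")
```

The relevant document at rank 13 does not affect the score, because it falls below rank R.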
Summary
● F-Measure: combines precision and recall.
● Recall-precision graph: conveys more information than a single-number measure.
● Mean average precision: single-number value, popular measure.
● Normalized Discounted Cumulative Gain (NDCG): single-number summary for each rank level, emphasizing top-ranked documents; relevance judgments are only needed to a specific rank depth (e.g., 10).
● Kappa measure: judgment reliability.
● R-Precision: only the top R documents need to be examined, where R is the number of relevant documents.