Evaluation in Information Retrieval

(Book chapter from C.D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval)

Dishant Ailawadi
INF384H / CS395T: Concepts of Information Retrieval (and Web Search), Fall 2011




                                         
Outline

● Why Evaluation?
● Standard Test Collections
● Precision and Recall
● Mean Average Precision
● Kappa Statistic
● R-Precision
● Summary




                           
Why Evaluation?


● There are many retrieval models, algorithms, and systems. Which one is the best?
● Measure the effect of adding new features.
● How far down the ranked list will a user need to look to find some or all relevant documents?
● Difficulty: relevance is not binary but continuous. How do we decide whether a document is relevant?



                                  
Standard Test Collections
A standard test collection consists of three things:
1. A document collection.
2. A set of queries on this collection.
3. A set of relevance judgments on those queries.

Each document in the test collection is given a binary classification (relevant or non-relevant) for each query. This decision is referred to as the gold standard or ground truth judgment of relevance.




                                  
Standard Test Collections

● Cranfield: built in the 1950s in the UK. Too small to be used nowadays.
● TREC (Text REtrieval Conference)
    ● Early TREC had 50 information needs; TREC 6-8 provide 150 information needs over more than 500 thousand articles.
    ● More recent work uses the 25 million pages of GOV2, which is now available for research.
● NTCIR: East Asian language and cross-language IR systems.
● Cross Language Evaluation Forum (CLEF)
● Reuters-21578: the collection most used for text classification.



                                           
Evaluation Measures
                   Relevant                Non-relevant
Retrieved          True positives (tp)     False positives (fp)
Not retrieved      False negatives (fn)    True negatives (tn)

recall    = (number of relevant documents retrieved) / (total number of relevant documents) = tp / (tp + fn)

precision = (number of relevant documents retrieved) / (total number of documents retrieved) = tp / (tp + fp)

accuracy  = (tp + tn) / (tp + fp + fn + tn)    (How many correct selections?)
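The three measures above can be computed directly from the four cells of the contingency table. The following is a minimal sketch; the `Confusion` class and the example counts are made up for illustration and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # relevant and retrieved
    fp: int  # non-relevant but retrieved
    fn: int  # relevant but not retrieved
    tn: int  # non-relevant and not retrieved

    def precision(self) -> float:
        retrieved = self.tp + self.fp
        return self.tp / retrieved if retrieved else 0.0

    def recall(self) -> float:
        relevant = self.tp + self.fn
        return self.tp / relevant if relevant else 0.0

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0


# Hypothetical counts: 10,000 documents, 50 relevant, 40 retrieved, 20 of those relevant.
c = Confusion(tp=20, fp=20, fn=30, tn=9930)
print(c.precision(), c.recall(), c.accuracy())
```

With these made-up counts, accuracy is very high even though the system finds fewer than half of the relevant documents, because almost all documents are non-relevant. This is why precision and recall are usually preferred over accuracy in retrieval evaluation.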
                                     
An Example
Let total number of relevant docs = 6. Check each new recall point:

 n   doc #   relevant   recall (R)      precision (P)
 1   588     x          1/6 = 0.167     1/1 = 1
 2   589     x          2/6 = 0.333     2/2 = 1
 3   576
 4   590     x          3/6 = 0.5       3/4 = 0.75
 5   986
 6   592     x          4/6 = 0.667     4/6 = 0.667
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772     x          5/6 = 0.833     5/13 = 0.38
14   990

One relevant document is missing from the ranking, so recall never reaches 100%.
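The recall and precision values in this example can be recomputed by walking down the ranked list and recording both measures at every rank that holds a relevant document. This is a small sketch; the function name `recall_precision_points` is an illustrative choice, and the flags encode the "x" marks from the example.

```python
def recall_precision_points(relevant_flags, total_relevant):
    """Return (rank, recall, precision) at every rank where a relevant doc appears."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            points.append((rank, hits / total_relevant, hits / rank))
    return points


# Relevance of the 14 ranked documents in the example (True = marked "x").
flags = [True, True, False, True, False, True, False,
         False, False, False, False, False, True, False]

for rank, r, p in recall_precision_points(flags, total_relevant=6):
    print(f"rank {rank:2d}: R={r:.3f}  P={p:.3f}")
```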

                                 
Combining Precision & Recall
F-Measure: weighted harmonic mean (HM) of precision and recall.

    F = (β² + 1)PR / (β²P + R)

The value of β controls the trade-off:
● β = 1: Weight precision and recall equally.
● β > 1: Weight recall more.
● β < 1: Weight precision more.

For β = 1:

    F = 2PR / (P + R) = 2 / (1/R + 1/P)
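To see how β shifts the balance, the weighted harmonic mean can be written as a short function. This is a sketch using the standard form F = (β² + 1)PR / (β²P + R); the function name and the sample values of P and R are illustrative assumptions.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0.0:
        return 0.0  # both measures are zero (or beta and recall are zero)
    return (b2 + 1) * precision * recall / denom


p, r = 0.75, 0.5  # hypothetical precision and recall
print(f_measure(p, r))            # beta = 1: balanced
print(f_measure(p, r, beta=2.0))  # beta > 1: pulled toward recall
print(f_measure(p, r, beta=0.5))  # beta < 1: pulled toward precision
```

Because precision is higher than recall in this made-up case, the β = 2 score is the lowest of the three and the β = 0.5 score is the highest.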

                                   
Precision-Recall curve




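Interpolated precision: used to get a smooth curve. The interpolated precision at recall level r is the highest precision observed at any recall level r' ≥ r.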

                                  
11-point Interpolated Average Precision

Recall   Interpolated precision
0.0      1.00
0.1      0.67
0.2      0.63
0.3      0.55
0.4      0.45
0.5      0.41
0.6      0.36
0.7      0.29
0.8      0.13
0.9      0.10
1.0      0.08
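For a single query, the eleven interpolated values can be derived from the observed (recall, precision) points by taking, at each recall level, the highest precision at that level or beyond. The sketch below does this for the 14-document example ranking used earlier in the deck (6 relevant documents), so it does not reproduce the table above, whose underlying data is not given in the slides. Function names are illustrative.

```python
def interpolated_precision(points, recall_level):
    """Highest precision at any observed recall >= recall_level (0.0 if none)."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates, default=0.0)


def eleven_point(points):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    return [interpolated_precision(points, level / 10) for level in range(11)]


# (recall, precision) points of the single-query example ranking (6 relevant docs).
points = [(1 / 6, 1.0), (2 / 6, 1.0), (3 / 6, 0.75), (4 / 6, 4 / 6), (5 / 6, 5 / 13)]

for level, p in zip(range(11), eleven_point(points)):
    print(f"{level / 10:.1f}  {p:.2f}")
```

Recall levels 0.9 and 1.0 get precision 0 here because the ranking never retrieves the sixth relevant document. Averaging these eleven values per level across many queries gives a table of the kind shown above.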

                         
Single Figure Measures

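Mean Average Precision (MAP): the mean of the average precision over all queries.
Example (average precision for a single query): (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633. The final 0 stands for the one relevant document that is never retrieved.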
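Average precision for one query, and its mean over several queries, can be sketched as follows. The function names are illustrative, and the flags again encode the 14-document example ranking with 6 relevant documents. Because the code uses unrounded precision values, it yields roughly 0.634 rather than the 0.633 obtained from the rounded terms above.

```python
def average_precision(relevant_flags, total_relevant):
    """Sum of precision at each relevant rank, divided by the number of relevant docs.

    Relevant documents that are never retrieved contribute a precision of 0.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


def mean_average_precision(queries):
    """queries: non-empty list of (relevant_flags, total_relevant) pairs, one per query."""
    return sum(average_precision(f, n) for f, n in queries) / len(queries)


flags = [True, True, False, True, False, True, False,
         False, False, False, False, False, True, False]
print(round(average_precision(flags, total_relevant=6), 3))
```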



Normalized Discounted Cumulative Gain (NDCG): for non-binary notions of relevance.
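The slides do not give a formula for NDCG, and several variants exist. The sketch below assumes one common form: gain 2^rel - 1, discount log2(rank + 1), and normalization by the same sum over the ideal (descending) ordering of the judged documents. The graded judgments are hypothetical.

```python
import math


def dcg(gains, k):
    """Discounted cumulative gain over the top-k graded relevance values."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(gains[:k], start=1))


def ndcg(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments (0 = not relevant ... 3 = highly relevant) in rank order.
print(ndcg([3, 2, 3, 0, 1, 2], k=6))
print(ndcg([3, 3, 2, 2, 1, 0], k=6))  # already ideally ordered
```

Note the simplifying assumption that the ideal ordering is built only from the documents in the list; a full evaluation would build it from all judged documents for the query.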



                              
Assessing Relevance
● Pooling: used to obtain a subset of the collection related to the query.
    – Use a set of search engines/algorithms.
    – The top-k results (k is between 20 and 50 in TREC) are merged into a pool, and duplicates are removed.
    – Present the documents in a random order to analysts for relevance judgments.
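The pooling steps above can be sketched in a few lines: take the top k from each system's ranked list, merge and de-duplicate, then shuffle before handing the pool to assessors. The function name `build_pool`, the `seed` parameter, and the document ids are illustrative assumptions, not part of any TREC tooling.

```python
import random


def build_pool(runs, k, seed=None):
    """Merge the top-k results of several runs, drop duplicates, shuffle."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:k])
    docs = sorted(pool)  # fixed order first, so a given seed gives a repeatable shuffle
    random.Random(seed).shuffle(docs)
    return docs


# Hypothetical ranked lists of document ids from three systems.
runs = [
    [588, 589, 576, 590],
    [589, 986, 588, 592],
    [772, 588, 984, 590],
]
print(build_pool(runs, k=3, seed=0))
```

Shuffling hides both the rank and the originating system from the assessor, so judgments are less likely to be biased by either.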


● Kappa Statistic: if we have multiple judges on one information need, how consistent are those judges?
    kappa = (P(A) - P(E)) / (1 - P(E))
    – P(A) is the proportion of the times that the judges agreed
    – P(E) is the proportion of the times they would be expected to agree by chance

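Example: Kappa Statistic

                       Judge 2 Relevance
                       Yes     No      Total
Judge 1      Yes       300     20      320
Relevance    No        10      70      80
             Total     310     90      400

Observed proportion of the times the judges agreed:
    P(A) = (300 + 70)/400 = 370/400 = 0.925

Pooled marginals:
    P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
    P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875

Probability that the two judges agreed by chance (max value = 1, min = 0.5):
    P(E) = P(nonrelevant)² + P(relevant)² = 0.2125² + 0.7875² = 0.665

Kappa statistic:
    kappa = (P(A) - P(E)) / (1 - P(E)) = (0.925 - 0.665)/(1 - 0.665) = 0.776

A kappa value between 0.67 and 0.8 is seen as fair agreement, but a value below 0.67 is seen as data providing a dubious basis for evaluation.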
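The kappa calculation for the judge table can be checked with a short function. This is a sketch for two judges and binary relevance using pooled marginals; the function and argument names are illustrative.

```python
def kappa(yes_yes, yes_no, no_yes, no_no):
    """Kappa for two judges and binary relevance, using pooled marginals.

    yes_no means judge 1 said yes and judge 2 said no.
    """
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n
    # Pool both judges' decisions: 2n judgments in total.
    p_rel = ((yes_yes + yes_no) + (yes_yes + no_yes)) / (2 * n)
    p_nonrel = 1 - p_rel
    p_chance = p_rel ** 2 + p_nonrel ** 2
    return (p_agree - p_chance) / (1 - p_chance)


# Cell counts from the judge table: 300, 20, 10, 70.
print(round(kappa(300, 20, 10, 70), 3))
```

For the table's counts this gives a kappa of about 0.776, which falls in the "fair agreement" band. The function does not handle the degenerate case where every judgment is the same label, since chance agreement is then 1 and kappa is undefined.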
Evaluation

R-Precision: precision of the top R documents, where R = number of relevant docs.
    R = 7
    R-Precision = 4/7 = 0.571

 n   doc #   relevant
 1   588     x
 2   589     x
 3   576
 4   590     x
 5   986
 6   592     x
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772     x
14   990

A/B Test: precisely one change between the current and previous system. We evaluate the effect of that change on the system.
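R-precision for the ranking above can be computed by counting the relevant documents among the first R results. A minimal sketch, with an illustrative function name and R = 7 taken from the slide:

```python
def r_precision(relevant_flags, total_relevant):
    """Precision of the top-R results, where R is the number of relevant docs."""
    if total_relevant == 0:
        return 0.0
    top = relevant_flags[:total_relevant]
    return sum(top) / total_relevant


# Relevance of the 14 ranked documents (True = marked "x").
flags = [True, True, False, True, False, True, False,
         False, False, False, False, False, True, False]
print(round(r_precision(flags, total_relevant=7), 3))
```

Four of the first seven documents are relevant, so the result matches the 4/7 on the slide.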




                               
Summary
● F-Measure: combines precision and recall.
● Recall-precision graph: conveys more information than a single-number measure.
● Mean average precision: single-number value, popular measure.
● Normalized Discounted Cumulative Gain (NDCG): single-number summary for each rank level, emphasizing top-ranked documents; relevance judgments are only needed to a specific rank depth (e.g., 10).
● Kappa measure: judgment reliability.
● R-Precision: only need to examine the top R documents, where R is the number of relevant documents.




                                 
THANK YOU!




         
