IR Evaluation using Rank-Biased Precision

How IR systems (search engines) are evaluated, in particular under the TREC methodology. The common measure of Mean Average Precision is discussed and compared to the newly proposed (Moffat and Zobel 2008) Rank-Biased Precision.

For more discussion, see: http://alteregozi.com/2009/01/18/evaluating-search-engines-relevance/

IR Evaluation using Rank-Biased Precision

  1. Alistair Moffat and Justin Zobel, "Rank-Biased Precision for Measurement of Retrieval Effectiveness", TOIS vol. 27, no. 1, 2008. Ofer Egozi, LARA group, Technion
  2. Introduction to IR Evaluation
     Mean Average Precision
     Rank-Biased Precision
     Analysis of RBP
  3. Task: given query q, output ranked list of documents
     ◦ Find probability that document d is relevant for q
  4. Task: given query q, output ranked list of documents
     ◦ Find probability that document d is relevant for q
     Evaluation is difficult
     ◦ No (per-query) test data
     ◦ Queries vary tremendously
     ◦ Relevance is a vague (human) concept
  5. Precision / recall
     Precision: |alg ∩ rel| / |alg|
     Recall: |alg ∩ rel| / |rel|
     [Venn diagram: collection D, retrieved set alg(q,D), relevant set rel(q,D)]
     ◦ Precision and recall usually conflict
     ◦ Single measures proposed (P@X, RR, AP…)
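To make the set definitions concrete, here is a minimal Python sketch; the document IDs and the sets `retrieved` / `relevant` are made-up examples, not from the slides:

```python
# Set-based precision and recall for one query (hypothetical document IDs).
retrieved = {"d1", "d2", "d3", "d4"}    # alg(q, D): documents returned for query q
relevant = {"d2", "d4", "d7"}           # rel(q, D): documents judged relevant for q

hits = retrieved & relevant             # |alg ∩ rel|
precision = len(hits) / len(retrieved)  # 2/4 = 0.50
recall = len(hits) / len(relevant)      # 2/3 ≈ 0.67
print(precision, recall)
```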
  6. Relevance requires human judgment
     ◦ Exhaustive judging is not scalable
     ◦ TREC uses pooling
     ◦ Shown to miss a significant portion of relevant documents…
     ◦ … but shown to support cross-system comparison well
     ◦ Bias against novel approaches
  7. In the real world, what does recall measure?
     ◦ Recall is important only with "perfect" knowledge
     ◦ If I got one result, and there is another I don't know of, am I half-satisfied?...
     ◦ …yes, for specific needs (legal, patent search)
     ◦ "Boiling temperature of lead"
  8. In the real world, what does recall measure?
     ◦ Recall is important only with "perfect" knowledge
     ◦ If I got one result, and there is another I don't know of, am I half-satisfied?...
     ◦ …yes, for specific needs (legal, patent search)
     ◦ "Boiling temperature of lead"
     Precision is more user-oriented
     ◦ P@10 measures real user satisfaction
     ◦ Still, P@10=0.3 can mean the first three or the last three…
  9. Calculated as AP = (1/R) · Σ P@k, summed over the ranks k at which relevant documents are found
     ◦ Intuitively: sum all P@X where a relevant document is found, divide by the total number of relevant documents to normalize for summing across queries
     Example: $$---$----$-----$---
 10. Calculated as AP = (1/R) · Σ P@k, summed over the ranks k at which relevant documents are found
     ◦ Intuitively: sum all P@X where a relevant document is found, divide by the total number of relevant documents to normalize for summing across queries
     Example: $$---$----$-----$---
     Consider: $$---$----$-----$$$$
     ◦ AP is down to 0.5234, despite P@20 increasing
     ◦ Finding more relevant documents can harm AP performance!
     ◦ Similar problems if some are initially unjudged
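A quick way to check the example is to compute AP directly from the '$'/'-' strings. The sketch below assumes R is simply the number of '$' marks in each ranking (i.e., every relevant document appears in the list); under that assumption the first ranking scores about 0.63 and the extended one drops to about 0.53, which is the effect the slide describes:

```python
def average_precision(ranking: str) -> float:
    """AP over a '$'/'-' string: average of P@k at every rank k that holds a relevant doc."""
    rel_seen, precisions = 0, []
    for k, mark in enumerate(ranking, start=1):
        if mark == "$":
            rel_seen += 1
            precisions.append(rel_seen / k)  # P@k at this relevant document
    return sum(precisions) / rel_seen if rel_seen else 0.0

print(average_precision("$$---$----$-----$---"))   # ≈ 0.63
print(average_precision("$$---$----$-----$$$$"))   # ≈ 0.53, lower despite the higher P@20
```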
 11. Methodological problem of instability
     ◦ Results may depend on the extent of judging
     ◦ More judging can be destabilizing (i.e., error margins don't shrink as uncertainty is reduced)
 12. Complex abstraction of user satisfaction
     ◦ "Every time a relevant document is encountered, the user pauses, asks 'Over the documents I have seen so far, on average how satisfied am I?' and writes a number on a piece of paper. Finally, when the user has examined every document in the collection — because this is the only way to be sure that all of the relevant ones have been seen — the user computes the average of the values they have written."
     How can R be truly calculated?
     Think of evaluating a Google query…
 13. Complex abstraction of user satisfaction
     ◦ "Every time a relevant document is encountered, the user pauses, asks 'Over the documents I have seen so far, on average how satisfied am I?' and writes a number on a piece of paper. Finally, when the user has examined every document in the collection — because this is the only way to be sure that all of the relevant ones have been seen — the user computes the average of the values they have written."
     How can R be truly calculated?
     Think of evaluating a Google query…
     Still, MAP is highly popular and useful:
     ◦ Validated in numerous TREC studies
     ◦ Shown to be stable and robust across query sets (for deep enough pools)
 14. Induced by a user model
 15. Induced by a user model
     ◦ Each document at rank i is observed with probability p^(i-1)
     ◦ Expected #docs seen: Σ p^(i-1) = 1/(1-p)
     ◦ Total expected utility (r_i = known relevance of the document at rank i): Σ r_i · p^(i-1)
     ◦ RBP = expected utility rate = utility/effort = (1-p) · Σ r_i · p^(i-1)
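The formula can be computed directly from a '$'/'-' ranking. A minimal sketch, assuming binary relevance (r_i is 1 for '$' and 0 for '-') and reusing the example string from the AP slides:

```python
def rbp(ranking: str, p: float) -> float:
    """Rank-biased precision: (1 - p) * sum of p**(rank - 1) over the relevant ('$') ranks."""
    # enumerate() starts at 0, so p ** i is exactly p^(rank - 1) for 1-based ranks.
    return (1 - p) * sum(p ** i for i, mark in enumerate(ranking) if mark == "$")

for p in (0.5, 0.8, 0.95):
    print(p, round(rbp("$$---$----$-----$---", p), 4))
```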
 16. Values of p reflect user behaviors
     ◦ p=0.95: persistent user (60% chance of reaching the 2nd page)
     ◦ p=0.5: impatient user (0.1% chance of reaching the 2nd page)
 17. Values of p reflect user behaviors
     ◦ p=0.95: persistent user (60% chance of reaching the 2nd page)
     ◦ p=0.5: impatient user (0.1% chance of reaching the 2nd page)
     ◦ p=0: "I'm feeling lucky" (identical to P@1)
 18. Values of p reflect user behaviors
     ◦ p=0.95: persistent user (60% chance of reaching the 2nd page)
     ◦ p=0.5: impatient user (0.1% chance of reaching the 2nd page)
     ◦ p=0: "I'm feeling lucky" (identical to P@1)
     Values of p control the contribution of each relevant document
     ◦ But always positive!
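The second-page percentages follow directly from the user model: the chance of ever viewing rank 11 or beyond is p^10. A quick check:

```python
# Probability that a user who continues past each result with probability p
# ever looks beyond rank 10 (i.e., reaches the second results page): p ** 10.
for p in (0.95, 0.5, 0.0):
    print(f"p = {p}: chance of reaching 2nd page = {p ** 10:.4%}")
```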
 19. Uncertainty: how many relevant documents are there?
     (further down the ranking, or even unjudged at the current depth)
     The RBP value is therefore inherently a lower bound
 20. Uncertainty: how many relevant documents are there?
     (further down the ranking, or even unjudged at the current depth)
     The RBP value is therefore inherently a lower bound
     The residual uncertainty is easy to calculate: assume the unjudged documents are relevant…
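A sketch of how such an error bound might be computed; the function name and the judgment notation ('$' = judged relevant, '-' = judged not relevant, '?' = unjudged) are assumptions for illustration, not the paper's own code:

```python
def rbp_with_residual(judgments, p):
    """Return (base, residual): base counts only judged-relevant docs; residual is the
    extra RBP if every unjudged doc and the whole unseen tail turned out relevant."""
    base = residual = 0.0
    for i, mark in enumerate(judgments):  # i = rank - 1
        weight = (1 - p) * p ** i
        if mark == "$":
            base += weight
        elif mark == "?":
            residual += weight
    residual += p ** len(judgments)       # maximum weight of all ranks beyond depth d
    return base, residual

base, res = rbp_with_residual("$$-?-$-??-", p=0.8)
print(f"RBP lies in [{base:.3f}, {base + res:.3f}]")
```

The tail term p ** d is the total weight of everything beyond the evaluation depth, so the interval always contains the true RBP no matter how far judging is later extended.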
 21. [Experimental results: similarity (correlation) between measures; detected significance in evaluated systems' rankings]
 22. RBP has significant advantages:
     ◦ Based on a solid and supported user model
     ◦ Real-life: no unknown factors (R, |D|)
     ◦ Error bounds for uncertainty
     ◦ Statistical significance as good as the other measures
     But also:
     ◦ Absolute values, not relative to query difficulty
     ◦ A choice of p must be made
