
- 1. Alistair Moffat and Justin Zobel, "Rank-Biased Precision for Measurement of Retrieval Effectiveness", ACM TOIS, vol. 27, no. 1, 2008. Ofer Egozi, LARA group, Technion
- 2. Outline:
  ◦ Introduction to IR Evaluation
  ◦ Mean Average Precision
  ◦ Rank-Biased Precision
  ◦ Analysis of RBP
- 3. Task: given query q, output a ranked list of documents
  ◦ Find the probability that document d is relevant for q
- 4. Evaluation is difficult
  ◦ No per-query test data
  ◦ Queries vary tremendously
  ◦ Relevance is a vague (human) concept
- 5. Precision / recall
  ◦ Precision = |alg ∩ rel| / |alg|
  ◦ Recall = |alg ∩ rel| / |rel|
  ◦ (alg(q,D) = documents the system returns for query q over collection D; rel(q,D) = documents relevant to q)
  ◦ Precision and recall usually conflict
  ◦ Single measures proposed (P@X, RR, AP, …)
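As a concrete reference, a minimal Python sketch of these measures (the helper names and toy data are illustrative, not from the paper; `retrieved`/`relevant` are sets of document ids):

```python
def precision(retrieved, relevant):
    # |alg ∩ rel| / |alg|: fraction of returned documents that are relevant
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    # |alg ∩ rel| / |rel|: fraction of relevant documents that were returned
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision_at(ranking, relevant, k):
    # P@k: precision computed over the top-k positions of a ranked list
    return len(set(ranking[:k]) & relevant) / k

ranking = list(range(20))       # a toy 20-document ranking
relevant = {0, 1, 5, 10, 16}    # 5 documents judged relevant
print(precision_at(ranking, relevant, 10))          # P@10 = 0.3
print(recall(set(ranking[:10]), relevant))          # recall at depth 10 = 0.6
```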
- 6. Relevance requires human judgment
  ◦ Exhaustive judging is not scalable
  ◦ TREC uses pooling
  ◦ Shown to miss a significant portion of the relevant documents…
  ◦ …but shown to compare systems against each other well
  ◦ Bias against novel approaches
- 7. In the real world, what does recall measure?
  ◦ Recall is important only with "perfect" knowledge of the relevant set
  ◦ If I got one result, and there is another I don't know of, am I half-satisfied?…
  ◦ …yes, for specific needs (e.g., a legal or patent search session)
  ◦ "Boiling temperature of lead": one good answer satisfies the need
- 8. Precision is more user-oriented
  ◦ P@10 measures real user satisfaction
  ◦ Still, P@10 = 0.3 can mean the first three positions or the last three…
- 9. Average Precision (AP), calculated as AP = (1/R) · Σ P@k over every rank k holding a relevant document
  ◦ Intuitively: sum P@k at each rank where a relevant document is found, then divide by the total number of relevant documents R to normalize for averaging across queries
  ◦ Example ('$' = relevant, '-' = not relevant): $$---$----$-----$--- gives AP ≈ 0.6316
- 10. Now consider: $$---$----$-----$$$$
  ◦ AP drops to ≈0.5324, despite P@20 increasing
  ◦ Finding more relevant documents can harm AP performance!
  ◦ Similar problems arise if some documents are initially unjudged
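A quick Python check of this effect (a sketch; the strings encode the two example rankings, '$' = relevant):

```python
def average_precision(ranking):
    # AP = (1/R) * sum of P@k at every rank k that holds a relevant document
    ap, found = 0.0, 0
    for k, ch in enumerate(ranking, start=1):
        if ch == "$":
            found += 1
            ap += found / k   # P@k at this relevant rank
    return ap / found if found else 0.0

print(round(average_precision("$$---$----$-----$---"), 4))   # 0.6316
print(round(average_precision("$$---$----$-----$$$$"), 4))   # 0.5324, lower!
```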
- 11. Methodological problem of instability
  ◦ Results may depend on the extent of judging
  ◦ More judging can be destabilizing: error margins don't shrink as uncertainty is reduced
- 12. Complex abstraction of user satisfaction
  ◦ "Every time a relevant document is encountered, the user pauses, asks 'Over the documents I have seen so far, on average how satisfied am I?' and writes a number on a piece of paper. Finally, when the user has examined every document in the collection — because this is the only way to be sure that all of the relevant ones have been seen — the user computes the average of the values they have written."
  ◦ How can R be truly calculated? Think of evaluating a Google query…
- 13. Still, MAP is highly popular and useful:
  ◦ Validated in numerous TREC studies
  ◦ Shown to be stable and robust across query sets (for deep enough pools)
- 14. RBP is induced by a user model
- 15. The model:
  ◦ The user examines the document at rank i with probability p^(i-1), i.e., continues from each rank to the next with probability p
  ◦ Expected number of documents seen: Σ_{i≥1} p^(i-1) = 1/(1-p)
  ◦ Total expected utility (r_i = known relevance of the document at rank i): Σ_{i≥1} r_i · p^(i-1)
  ◦ RBP = expected utility rate = utility/effort = (1-p) · Σ_{i≥1} r_i · p^(i-1)
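A minimal sketch of this formula in Python, assuming binary relevance judgments over the ranked list:

```python
def rbp(relevances, p=0.95):
    # RBP = (1 - p) * sum_i r_i * p^(i-1); enumerate() supplies i-1 from 0
    return (1 - p) * sum(r * p**i for i, r in enumerate(relevances))

# The second example ranking from the AP slides ('$' = relevant)
run = [1 if c == "$" else 0 for c in "$$---$----$-----$$$$"]
print(round(rbp(run, p=0.5), 4))   # impatient user: early ranks dominate
print(round(rbp(run, p=0.95), 4))  # persistent user: deep ranks still matter
```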
- 16. Values of p reflect user behaviors (see the quick check below)
  ◦ p = 0.95: persistent user (60% chance of viewing a 2nd page)
  ◦ p = 0.5: impatient user (0.1% chance of a 2nd page)
- 17. p = 0: "I'm feeling lucky" (identical to P@1)
- 18. Values of p also control the contribution of each relevant document
  ◦ But the contribution is always positive!
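The page-two percentages follow directly from the model, assuming ten results per page: the probability of reaching rank k is p^(k-1), so the chance of seeing a second page is p^10:

```python
# P(user reaches rank 11) = p^10, assuming 10 results per page
for p in (0.95, 0.5):
    print(f"p = {p}: chance of a 2nd page = {p**10:.4f}")  # 0.5987, 0.0010
```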
- 19. Uncertainty: how many relevant documents are there? (further down the ranking, or even within the current judged depth)
  ◦ The RBP value is inherently a lower bound
- 20. Residual uncertainty is easy to calculate: assume every unjudged document is relevant…
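A sketch of both numbers, assuming judgments stop at depth d: the unjudged geometric tail can contribute at most (1-p) · Σ_{i>d} p^(i-1) = p^d, which is the residual.

```python
def rbp_bounds(relevances, p=0.95):
    # Lower bound: RBP over the judged prefix (unjudged docs assumed irrelevant).
    # Residual: the most the unjudged tail could add = p^d for judged depth d.
    d = len(relevances)
    lower = (1 - p) * sum(r * p**i for i, r in enumerate(relevances))
    return lower, p**d

judged = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]   # judgments down to depth 10
low, res = rbp_bounds(judged, p=0.95)
print(f"true RBP lies in [{low:.4f}, {low + res:.4f}]")
```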
- 21. Analysis of RBP:
  ◦ Similarity (correlation) between measures
  ◦ Detected significance in the evaluated systems' rankings
- 22. RBP has significant advantages:
  ◦ Based on a solid and supported user model
  ◦ Computable in real life, with no unknown factors (R, |D|)
  ◦ Error bounds for uncertainty
  ◦ Statistical significance as good as the other measures
  But also:
  ◦ Absolute values, not relative to query difficulty
  ◦ A choice of p must be made