- 1. SIGIR 2013, Dublin, Ireland · July 30th. Picture by Philip Milne. On the Measurement of Test Collection Reliability. @julian_urbano, University Carlos III of Madrid; Mónica Marrero, University Carlos III of Madrid; Diego Martín, Technical University of Madrid
- 2. Gratefully supported by Student Travel Grant
- 3. Is System A More Effective than System B? [Figure: Δeffectiveness axis from -1 to 1, with 0 and $d$ marked]
- 4. Is System A More Effective than System B? Get a test collection and evaluate. Measure the average difference $d$ and conclude which one is better.
- 5. Samples: test collections are samples from a larger, possibly infinite, population of documents, queries and assessors. $d$ is only an estimate. How reliable is our conclusion?
- 6. Reliability vs. Cost: building reliable collections is easy… Just use more documents, more queries, more assessors …but it is prohibitively expensive. Our best bet is to increase query set size.
- 7. Data-based approach: 1. Randomly split the query set. 2. Compute indicators of reliability based on those two subsets. 3. Extrapolate to larger query sets. …with some variations: Voorhees'98, Zobel'98, Buckley & Voorhees'00, Voorhees & Buckley'02, Sanderson & Zobel'05, Sakai'07, Voorhees'09
- 8. Data-based Reliability: indicators based on results with two collections. Kendall $\tau$ correlation: stability of the ranking of systems. $\tau_{AP}$ correlation: adds a top-heaviness component. Absolute sensitivity: minimum absolute $d$ s.t. swaps <5%. Relative sensitivity: minimum relative $d$ s.t. swaps <5%.
- 9. Data-based Reliability: indicators based on results with two collections. Power ratio: statistically significant results. Minor conflict ratio: statistically non-significant swap. Major conflict ratio: statistically significant swap. RMSE: differences in $d$.
- 10. Generalizability Theory: directly addresses variability of scores. G-study: estimate variance components from previous, representative data. D-study: estimate reliability based on the estimated variance components.
- 11. G-study: $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$. Estimated using Analysis of Variance, from previous data, usually an existing test collection.
- 12. G-study: $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$. Estimated using Analysis of Variance, from previous data, usually an existing test collection. $\sigma_s^2$: system differences, our goal!
- 13. G-study: $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$. Estimated using Analysis of Variance, from previous data, usually an existing test collection. $\sigma_s^2$: system differences, our goal! $\sigma_q^2$: query difficulty.
- 14. G-study: $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$. Estimated using Analysis of Variance, from previous data, usually an existing test collection. $\sigma_s^2$: system differences, our goal! $\sigma_q^2$: query difficulty. $\sigma_{s:q}^2$: some systems better for some queries.
- 15. D-study. Relative stability: $E\rho^2 = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_{s:q}^2 / n_q'}$. Absolute stability: $\Phi = \frac{\sigma_s^2}{\sigma_s^2 + (\sigma_q^2 + \sigma_{s:q}^2) / n_q'}$. Easy to estimate how many queries we need for a certain stability level.
- 16. Generalizability Theory: proposed by Bodoff'07. Kanoulas & Aslam'09 derive optimal gain & discount in nDCG. TREC Million Query Track: ≈80 queries sufficient for stable rankings, ≈130 queries for stable absolute scores.
- 17. In this Paper / Talk: How sensitive is the D-study to the initial data used in the G-study? How to interpret G-theory in practice, why $E\rho^2 = 0.95$ and $\Phi = 0.95$? From the above two, review the reliability of >40 TREC test collections.
- 18. variability of G-theory indicators of reliability
- 19. Data: 43 TREC collections from TREC-3 to TREC 2011. 12 tasks across 10 tracks: Ad Hoc, Web, Novelty, Genomics, Robust, Terabyte, Enterprise, Million Query, Medical and Microblog.
- 20. Experiment: vary the number of queries in the G-study from $n_q = \ldots$ to the full set. Use all runs available. Run D-study: compute $E\rho^2$ and $\Phi$, and compute $n_q'$ to reach 0.95 stability. 200 random trials.
- 21. Variability due to queries
- 22. Variability due to queries: we may get $E\rho^2 = 0.9$ or $E\rho^2 = 0.3$, depending on what 10 queries we use.
- 23. Experiment (II): the same, but vary the number of systems from $n_s = \ldots$ to the full set. Use all queries available. 200 random trials.
- 24. Variability due to systems
- 25. Variability due to systems: we may get $E\rho^2 = 0.9$ or $E\rho^2 = 0.5$, depending on what 20 systems we use.
- 26. Results: G-Theory is very sensitive to initial data. Need about 50 queries and 50 systems for differences in $E\rho^2$ and $\Phi$ below 0.1. The number of queries for $E\rho^2 = 0.95$ may change by orders of magnitude. Microblog2011 (all 184 systems and 30 queries): need 63 to 133 queries. Medical2011 (all 34 queries and 40 systems): need 109 to 566 queries.
- 27. Use Confidence Intervals. Bodoff'08: confidence intervals in G-study. But what about the D-study? Feldt'65 and Arteaga et al.'82. Work reasonably well even when assumptions are violated (Brennan'01).
- 28. Example
- 29. Example
- 30. Example Account for variability in initial data
- 31. Example Required number of queries to reach the lower end of the interval
- 32. Summary in TREC (that is, the 43 collections we study here). $E\rho^2$: mean=0.88, sd=0.1; 95% conf. intervals are 0.1 long. $\Phi$: mean=0.74, sd=0.2; 95% conf. intervals are 0.19 long.
- 33. interpretation of G-Theory indicators of reliability
- 34. Experiment: split the query set in 2 subsets, from $n_q = \ldots$ to full set / 2. Use all runs available. Run D-study: compute $E\rho^2$ and $\Phi$ and map onto $\tau$, sensitivity, power, conflicts, etc. 50 random trials; >28,000 datapoints.
- 35. Example: $E\rho^2 \to \tau$. *All mappings in the paper
- 36. Example: $E\rho^2 \to \tau$. $E\rho^2 = 0.95 \to \tau \approx 0.85$. *All mappings in the paper
- 37. Example: $E\rho^2 \to \tau$. $\tau = 0.9 \to E\rho^2 \approx 0.97$. *All mappings in the paper
- 38. Example: $E\rho^2 \to \tau$. Million Query 2007, Million Query 2008. *All mappings in the paper
- 39. Future Predictions: allows us to make more informed decisions within a collection. What about a new collection? Fit a single model for each mapping, with 90% and 95% prediction intervals. Assess whether a larger collection is really worth the effort.
- 40. Example: $E\rho^2 \to \tau$. *All mappings in the paper
- 41. Example: $E\rho^2 \to \tau$. Current collection. *All mappings in the paper
- 42. Example: $E\rho^2 \to \tau$. Current collection, target. *All mappings in the paper
- 43. Example: $\Phi \to$ … sensitivity
- 44. Example: $\Phi \to$ … sensitivity
- 45. review of TREC collections
- 46. Outline: estimate $E\rho^2$ and $\Phi$, with 95% confidence intervals, using the full query set. Map onto $\tau$, sensitivity, power, conflicts, etc. Results within task offer a historical perspective since 1994.
- 47. Example: Ad Hoc 3-8. $E\rho^2 \in [\ldots, \ldots] \to \tau \in [\ldots, \ldots]$; minor conflicts $\in [\ldots, \ldots]$%; major conflicts $\in [\ldots, \ldots]$%. Queries to get $E\rho^2 = 0.95$: […, …]. Queries to get $\Phi = 0.95$: […, …]. 50 queries were used. *All collections and mappings in the paper
- 48. Example: Web Ad Hoc. TREC-8 to TREC-2001 (WT2g and WT10g): $E\rho^2 \in [\ldots, \ldots] \to \tau \in [\ldots, \ldots]$; queries to get $E\rho^2 = 0.95$: […, …]. TREC-2009 to TREC-2011 (ClueWeb09): $E\rho^2 \in [\ldots, \ldots] \to \tau \in [\ldots, \ldots]$; queries to get $E\rho^2 = 0.95$: […, …]. 50 queries were used.
- 49. Historical Trend Decreasing within and across tracks?
- 50. Historical Trend Systems getting better for specific problems?
- 51. Historical Trend Increasing task-specificity in queries?
- 52. summing up
- 53. Generalizability Theory: regarded as a more appropriate, easy to use and powerful tool to assess test collection reliability. Very sensitive to the initial data used to estimate variance components. Almost impossible to interpret in practical terms.
- 54. Sensitivity of G-Theory: about 50 queries and 50 systems are needed for robust estimates. Caution if building a new collection. Can always use confidence intervals.
- 55. Interpretation of G-Theory: empirical mapping onto traditional indicators of reliability like $\tau$ correlation. $\tau = 0.9 \to E\rho^2 \approx 0.97$. $E\rho^2 = 0.95 \to \tau \approx 0.85$.
- 56. Historical Reliability in TREC: on average, $E\rho^2 = 0.88 \to \tau \approx \ldots$. Some collections clearly unreliable: Web Distillation 2003, Genomics 2005, Terabyte 2006, Enterprise 2008, Medical 2011 and Web Ad Hoc 2011. 50 queries not enough for stable rankings; about 200 are needed.
- 57. Implications: fixing a minimum number of queries across tracks is unrealistic, not even across editions of the same task. Need to analyze on a case-by-case basis, while building the collections.
- 58. to be continued…
- 59. Future Work: study assessor effect. Study document-collection effect. Better models to map G-Theory onto data-based indicators: we fitted theoretically correct(-ish) models, but in practice theory does not hold. Methods to reliably measure reliability while building the collection.
- 60. Source Code Online: code for R stats software. G-study and D-study; required number of queries; map onto data-based indicators; confidence intervals. …in two simple steps
- 61. G-Theory too sensitive to initial data: questionable with small collections; compute confidence intervals. Need $E\rho^2 \approx 0.97$ for $\tau = 0.9$. 50 queries not enough for stable rankings. Fixing a minimum number of queries across tasks is unrealistic: need to analyze on a case-by-case basis.
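
The G-study and D-study described in slides 10 to 15 can be sketched in a few lines. This is not the authors' R code mentioned on slide 60; it is a minimal Python illustration under stated assumptions: a fully crossed systems × queries design with one effectiveness score per pair, the system-query interaction confounded with residual error, and negative variance estimates truncated to zero. The function names (`g_study`, `d_study`, `required_queries`) and the toy score matrix are hypothetical.

```python
import math
import numpy as np


def g_study(scores):
    """Estimate (var_s, var_q, var_sq) from a systems x queries score matrix.

    Assumes a fully crossed design with one score per (system, query) pair,
    so the interaction term is confounded with residual error.
    """
    x = np.asarray(scores, dtype=float)
    n_s, n_q = x.shape
    grand = x.mean()
    sys_means = x.mean(axis=1)   # one mean per system (row)
    qry_means = x.mean(axis=0)   # one mean per query (column)

    # Mean squares from a two-way ANOVA without replication.
    ms_s = n_q * np.sum((sys_means - grand) ** 2) / (n_s - 1)
    ms_q = n_s * np.sum((qry_means - grand) ** 2) / (n_q - 1)
    resid = x - sys_means[:, None] - qry_means[None, :] + grand
    ms_e = np.sum(resid ** 2) / ((n_s - 1) * (n_q - 1))

    # Expected-mean-square solutions; negative estimates are set to zero.
    var_s = max((ms_s - ms_e) / n_q, 0.0)
    var_q = max((ms_q - ms_e) / n_s, 0.0)
    var_sq = float(ms_e)
    return float(var_s), float(var_q), var_sq


def d_study(var_s, var_q, var_sq, n_q):
    """Return (E rho^2, Phi) for a hypothetical collection with n_q queries."""
    e_rho2 = var_s / (var_s + var_sq / n_q)
    phi = var_s / (var_s + (var_q + var_sq) / n_q)
    return e_rho2, phi


def required_queries(var_s, var_q, var_sq, target=0.95):
    """Smallest n_q that reaches `target` for E rho^2 and for Phi."""
    if var_s <= 0:
        raise ValueError("system variance must be positive")
    # Solve var_s / (var_s + err / n) >= target for n.
    def solve(err):
        return max(1, math.ceil(target * err / (var_s * (1 - target))))
    return solve(var_sq), solve(var_q + var_sq)


if __name__ == "__main__":
    # Toy scores: 3 systems (rows) x 4 queries (columns), made up for illustration.
    toy = [[0.30, 0.10, 0.50, 0.20],
           [0.35, 0.20, 0.45, 0.30],
           [0.50, 0.25, 0.70, 0.30]]
    components = g_study(toy)
    print("variance components:", components)
    print("E rho^2, Phi at 50 queries:", d_study(*components, n_q=50))
    print("queries for 0.95:", required_queries(*components, target=0.95))
```

Re-running `g_study` on random subsets of rows or columns and then calling `d_study` is one way to reproduce, in spirit, the sensitivity experiments of slides 20 to 25: the spread of the resulting $E\rho^2$ and $\Phi$ values shows how much the D-study depends on the initial data.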
