On the Measurement of Test Collection Reliability


The reliability of a test collection is proportional to the number of queries it contains. But building a collection with many queries is expensive, so researchers have to find a balance between reliability and cost. Previous work on the measurement of test collection reliability relied on data-based approaches that contemplated random what-if scenarios and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative founded on analysis of variance that provides reliability indicators based on statistical theory. However, these reliability indicators are hard to interpret in practice because they do not correspond to well-known indicators like Kendall tau correlation. We empirically established these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators and show that they are extremely dependent on the sample of systems and queries used, so much so that the required number of queries to achieve a certain level of reliability can vary by orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool to measure test collection reliability. Reflecting upon all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.



  1. SIGIR 2013 · Dublin, Ireland · July 30th. Picture by Philip Milne. On the Measurement of Test Collection Reliability. @julian_urbano, University Carlos III of Madrid; Mónica Marrero, University Carlos III of Madrid; Diego Martín, Technical University of Madrid
  2. Gratefully supported by Student Travel Grant
  3. Is System A More Effective than System B? [figure: distribution of Δeffectiveness from −1 to 1, with decision point d₀]
  4. Is System A More Effective than System B? Get a test collection and evaluate. Measure the average difference d and conclude which one is better.
  5. Samples. Test collections are samples from a larger, possibly infinite, population: documents, queries and assessors. d is only an estimate. How reliable is our conclusion?
  6. Reliability vs. Cost. Building reliable collections is easy… just use more documents, more queries, more assessors… but it is prohibitively expensive. Our best bet is to increase query set size.
  7. Data-based approach: 1. Randomly split query set. 2. Compute indicators of reliability based on those two subsets. 3. Extrapolate to larger query sets… with some variations. Voorhees’98, Zobel’98, Buckley & Voorhees’00, Voorhees & Buckley’02, Sanderson & Zobel’05, Sakai’07, Voorhees’09
  8. Data-based Reliability Indicators, based on results with two collections. Kendall τ correlation: stability of the ranking of systems. τ_AP correlation: adds a top-heaviness component. Absolute sensitivity: minimum absolute d s.t. swaps <5%. Relative sensitivity: minimum relative d s.t. swaps <5%.
  9. Data-based Reliability Indicators, based on results with two collections. Power ratio: statistically significant results. Minor conflict ratio: statistically non-significant swap. Major conflict ratio: statistically significant swap. RMSE: differences in d.
  10. Generalizability Theory: directly address the variability of scores. G-study: estimate variance components from previous, representative data. D-study: estimate reliability based on the estimated variance components.
  11. G-study: σ² = σ²_s + σ²_q + σ²_s:q. Estimated using Analysis of Variance, from previous data, usually an existing test collection.
  12. G-study: σ² = σ²_s + σ²_q + σ²_s:q, where σ²_s is system differences, our goal!
  13. G-study: σ² = σ²_s + σ²_q + σ²_s:q, where σ²_q is query difficulty.
  14. G-study: σ² = σ²_s + σ²_q + σ²_s:q, where σ²_s:q means some systems are better for some queries.
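The G-study above boils down to a two-way ANOVA on a systems × queries matrix of effectiveness scores. A minimal Python sketch using the standard expected-mean-squares equations for a fully crossed design (this is not the authors' released R code; all names are illustrative):

```python
from statistics import mean

def g_study(scores):
    """Estimate variance components (system, query, interaction/residual)
    from a fully crossed n_s x n_q matrix of effectiveness scores."""
    ns, nq = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    sys_means = [mean(row) for row in scores]
    qry_means = [mean(scores[s][q] for s in range(ns)) for q in range(nq)]
    # Mean squares for the two-way crossed design, one observation per cell
    ms_s = nq * sum((m - grand) ** 2 for m in sys_means) / (ns - 1)
    ms_q = ns * sum((m - grand) ** 2 for m in qry_means) / (nq - 1)
    ms_res = sum((scores[s][q] - sys_means[s] - qry_means[q] + grand) ** 2
                 for s in range(ns) for q in range(nq)) / ((ns - 1) * (nq - 1))
    # Expected mean squares -> variance components (negative estimates clamped to 0)
    var_sq = ms_res
    var_s = max((ms_s - ms_res) / nq, 0.0)
    var_q = max((ms_q - ms_res) / ns, 0.0)
    return var_s, var_q, var_sq

# Toy 2x2 example: rows are systems, columns are queries
var_s, var_q, var_sq = g_study([[1, 3], [2, 6]])
```

In practice the input would be, e.g., per-query AP scores of all runs on an existing collection.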
  15. D-study. Relative stability: Eρ² = σ²_s / (σ²_s + σ²_s:q / n′_q). Absolute stability: Φ = σ²_s / (σ²_s + (σ²_q + σ²_s:q) / n′_q). Easy to estimate how many queries we need for a certain stability level.
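Both stability coefficients, and the number of queries n′_q needed to reach a target level, follow directly from the estimated variance components. A hedged Python sketch (illustrative names, not the paper's implementation):

```python
import math

def d_study(var_s, var_q, var_sq, nq):
    """Relative (Erho2) and absolute (Phi) stability for a collection with nq queries."""
    erho2 = var_s / (var_s + var_sq / nq)
    phi = var_s / (var_s + (var_q + var_sq) / nq)
    return erho2, phi

def queries_needed(var_s, var_q, var_sq, target=0.95, absolute=False):
    """Smallest n'_q such that Erho2 (or Phi, if absolute=True) reaches the target."""
    noise = (var_q + var_sq) if absolute else var_sq
    return math.ceil(noise * target / (var_s * (1 - target)))

# With toy components var_s=1.5, var_q=4.0, var_sq=1.0:
erho2, phi = d_study(1.5, 4.0, 1.0, nq=2)   # 0.75, 0.375
nq_rel = queries_needed(1.5, 4.0, 1.0)      # 13 queries for Erho2 = 0.95
```

`queries_needed` just solves Eρ² ≥ target for n′_q, which is the "easy to estimate" step the slide refers to.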
  16. Generalizability Theory. Proposed by Bodoff’07. Kanoulas & Aslam’09 derive optimal gain & discount in nDCG. TREC Million Query Track: ≈80 queries sufficient for stable rankings, ≈130 queries for stable absolute scores.
  17. In this Paper / Talk. How sensitive is the D-study to the initial data used in the G-study? How to interpret G-Theory in practice: why Eρ² = 0.95 and Φ = 0.95? From the above two, review the reliability of >40 TREC test collections.
  18. variability of G-theory indicators of reliability
  19. Data: 43 TREC collections from TREC-3 to TREC 2011. 12 tasks across 10 tracks: Ad Hoc, Web, Novelty, Genomics, Robust, Terabyte, Enterprise, Million Query, Medical and Microblog.
  20. Experiment: vary the number of queries in the G-study from n_q = 5 to the full set. Use all runs available. Run the D-study: compute Eρ² and Φ, and compute n′_q to reach 0.95 stability. 200 random trials.
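The resampling loop behind this experiment can be sketched as follows, with synthetic scores standing in for real TREC runs (all names and effect sizes here are illustrative assumptions, not the paper's data):

```python
import random
from statistics import mean

def erho2_from_matrix(scores):
    """G-study + D-study on a systems x queries matrix; Erho2 at its own query-set size."""
    ns, nq = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    sys_m = [mean(row) for row in scores]
    qry_m = [mean(scores[s][q] for s in range(ns)) for q in range(nq)]
    ms_s = nq * sum((m - grand) ** 2 for m in sys_m) / (ns - 1)
    ms_res = sum((scores[s][q] - sys_m[s] - qry_m[q] + grand) ** 2
                 for s in range(ns) for q in range(nq)) / ((ns - 1) * (nq - 1))
    var_s = max((ms_s - ms_res) / nq, 0.0)
    return var_s / (var_s + ms_res / nq) if var_s > 0 else 0.0

rng = random.Random(42)
# Synthetic collection: 20 systems x 50 queries with system, query and residual effects
sys_eff = [rng.gauss(0, 0.05) for _ in range(20)]
qry_eff = [rng.gauss(0, 0.10) for _ in range(50)]
scores = [[0.3 + s + q + rng.gauss(0, 0.08) for q in qry_eff] for s in sys_eff]

# 200 trials: run the G-study on a random subset of 10 queries, record Erho2
trials = []
for _ in range(200):
    qs = rng.sample(range(50), 10)
    trials.append(erho2_from_matrix([[row[q] for q in qs] for row in scores]))
print(min(trials), max(trials))  # the spread shows the sensitivity to the query sample
```

The min–max spread over the 200 trials is exactly the kind of variability the next two slides visualize.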
  21. Variability due to queries
  22. Variability due to queries. We may get Eρ² = 0.9 or Eρ² = 0.3, depending on which 10 queries we use.
  23. Experiment (II): the same, but vary the number of systems from n_s = 5 to the full set. Use all queries available. 200 random trials.
  24. Variability due to systems
  25. Variability due to systems. We may get Eρ² = 0.9 or Eρ² = 0.5, depending on which 20 systems we use.
  26. Results: G-Theory is very sensitive to the initial data. We need about 50 queries and 50 systems for differences in Eρ² and Φ below 0.1. The number of queries for Eρ² = 0.95 may change by orders of magnitude. Microblog 2011 (all 184 systems and 30 queries): need 63 to 133 queries. Medical 2011 (all 34 queries and 40 systems): need 109 to 566 queries.
  27. Use Confidence Intervals. Bodoff’08: confidence intervals in the G-study. But what about the D-study? Feldt’65 and Arteaga et al.’82: work reasonably well even when assumptions are violated (Brennan’01).
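The Feldt’65 and Arteaga et al.’82 intervals are analytic. As a rough stand-in, one can also obtain a nonparametric interval by bootstrapping over queries; the sketch below uses a percentile bootstrap (an assumption of this example, not the method evaluated in the paper):

```python
import random
from statistics import mean

def erho2(scores):
    """Erho2 of a systems x queries matrix at its own query-set size (ANOVA-based)."""
    ns, nq = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    sys_m = [mean(row) for row in scores]
    qry_m = [mean(scores[s][q] for s in range(ns)) for q in range(nq)]
    ms_s = nq * sum((m - grand) ** 2 for m in sys_m) / (ns - 1)
    ms_res = sum((scores[s][q] - sys_m[s] - qry_m[q] + grand) ** 2
                 for s in range(ns) for q in range(nq)) / ((ns - 1) * (nq - 1))
    var_s = max((ms_s - ms_res) / nq, 0.0)
    return var_s / (var_s + ms_res / nq) if var_s > 0 else 0.0

def bootstrap_ci(scores, level=0.95, b=1000, seed=0):
    """Percentile bootstrap CI for Erho2, resampling queries with replacement."""
    rng = random.Random(seed)
    nq = len(scores[0])
    stats = sorted(
        erho2([[row[q] for q in rng.choices(range(nq), k=nq)] for row in scores])
        for _ in range(b))
    lo = stats[int((1 - level) / 2 * b)]
    hi = stats[int((1 + level) / 2 * b) - 1]
    return lo, hi

# Tiny 3-system x 4-query example
matrix = [[0.10, 0.30, 0.20, 0.40],
          [0.20, 0.50, 0.40, 0.60],
          [0.05, 0.20, 0.10, 0.30]]
lo, hi = bootstrap_ci(matrix)
```

With so few queries the interval is wide, which is precisely the point of reporting intervals instead of a single Eρ².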
  28. Example
  29. Example
  30. Example: account for variability in the initial data
  31. Example: required number of queries to reach the lower end of the interval
  32. Summary in TREC, that is, the 43 collections we study here. Eρ²: mean = 0.88, sd = 0.1; 95% conf. intervals are 0.1 long. Φ: mean = 0.74, sd = 0.2; 95% conf. intervals are 0.19 long.
  33. interpretation of G-Theory indicators of reliability
  34. Experiment: split the query set in 2 subsets, from n_q = 10 to full set / 2. Use all runs available. Run the D-study: compute Eρ² and Φ and map onto τ, sensitivity, power, conflicts, etc. 50 random trials, >28,000 datapoints.
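The split-half step can be illustrated with a stdlib-only sketch: rank systems by mean score on each half of the query set, then correlate the two rankings with Kendall τ (names are illustrative):

```python
import random
from statistics import mean

def kendall_tau(x, y):
    """Kendall tau-a between two score vectors (no tie correction)."""
    n = len(x)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

def split_half_tau(scores, rng):
    """Randomly split queries in two halves and correlate the two system rankings."""
    nq = len(scores[0])
    qs = list(range(nq))
    rng.shuffle(qs)
    half_a, half_b = qs[: nq // 2], qs[nq // 2:]
    means_a = [mean(row[q] for q in half_a) for row in scores]
    means_b = [mean(row[q] for q in half_b) for row in scores]
    return kendall_tau(means_a, means_b)
```

Pairing each trial's τ with the Eρ² computed on the same subset yields the datapoints behind the mappings shown next.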
  35. Example: Eρ² → τ. *All mappings in the paper
  36. Example: Eρ² → τ. Eρ² = 0.95 → τ ≈ 0.85. *All mappings in the paper
  37. Example: Eρ² → τ. τ = 0.9 → Eρ² ≈ 0.97. *All mappings in the paper
  38. Example: Eρ² → τ. Million Query 2007, Million Query 2008. *All mappings in the paper
  39. Future Predictions. Allows us to make more informed decisions within a collection. What about a new collection? Fit a single model for each mapping, with 90% and 95% prediction intervals. Assess whether a larger collection is really worth the effort.
  40. Example: Eρ² → τ. *All mappings in the paper
  41. Example: Eρ² → τ. Current collection. *All mappings in the paper
  42. Example: Eρ² → τ. Current collection, target. *All mappings in the paper
  43. Example: Φ → rel. sensitivity
  44. Example: Φ → rel. sensitivity
  45. review of TREC collections
  46. Outline: estimate Eρ² and Φ, with 95% confidence intervals and the full query set. Map onto τ, sensitivity, power, conflicts, etc. Results within each task offer a historical perspective since 1994.
  47. Example: Ad Hoc 3–8. Eρ² ∈ [0.86, 0.93] → τ ∈ [0.65, 0.82]. Minor conflicts ∈ [0.6, 8.2]%. Major conflicts ∈ [0.02, 1.38]%. Queries to get Eρ² = 0.95: [37, 133]. Queries to get Φ = 0.95: [116, 999]. 50 queries were used. *All collections and mappings in the paper
  48. Example: Web Ad Hoc. TREC-8 to TREC-2001: WT2g and WT10g. Eρ² ∈ [0.86, 0.93] → τ ∈ [0.65, 0.82]. Queries to get Eρ² = 0.95: [40, 220]. TREC-2009 to TREC-2011: ClueWeb09. Eρ² ∈ [0.80, 0.83] → τ ∈ [0.53, 0.59]. Queries to get Eρ² = 0.95: [107, 438]. 50 queries were used.
  49. Historical Trend: decreasing within and across tracks?
  50. Historical Trend: systems getting better for specific problems?
  51. Historical Trend: increasing task-specificity in queries?
  52. summing up
  53. Generalizability Theory. Regarded as a more appropriate, easy-to-use and powerful tool to assess test collection reliability. Very sensitive to the initial data used to estimate variance components. Almost impossible to interpret in practical terms.
  54. Sensitivity of G-Theory. About 50 queries and 50 systems are needed for robust estimates. Caution if building a new collection. Can always use confidence intervals.
  55. Interpretation of G-Theory. Empirical mapping onto traditional indicators of reliability like τ correlation. τ = 0.9 → Eρ² ≈ 0.97. Eρ² = 0.95 → τ ≈ 0.85.
  56. Historical Reliability in TREC. On average, Eρ² = 0.88 → τ ≈ 0.7. Some collections are clearly unreliable: Web Distillation 2003, Genomics 2005, Terabyte 2006, Enterprise 2008, Medical 2011 and Web Ad Hoc 2011. 50 queries are not enough for stable rankings; about 200 are needed.
  57. Implications. Fixing a minimum number of queries across tracks is unrealistic, not even across editions of the same task. We need to analyze on a case-by-case basis, while building the collections.
  58. to be continued…
  59. Future Work. Study the assessor effect. Study the document-collection effect. Better models to map G-Theory onto data-based indicators: we fitted theoretically correct(-ish) models, but in practice the theory does not hold. Methods to reliably measure reliability while building the collection.
  60. Source Code Online. Code for the R stats software: G-study and D-study, required number of queries, mapping onto data-based indicators, confidence intervals… in two simple steps.
  61. G-Theory is too sensitive to initial data: questionable with small collections; compute confidence intervals. We need Eρ² ≈ 0.97 for τ = 0.9. 50 queries are not enough for stable rankings. Fixing a minimum number of queries across tasks is unrealistic; we need to analyze on a case-by-case basis.

Γ—