The reliability of a test collection is proportional to the number of queries it contains, but building a collection with many queries is expensive, so researchers must find a balance between reliability and cost. Previous work on measuring test collection reliability relied on data-based approaches that contemplated random "what if" scenarios and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative, founded on analysis of variance, that provides reliability indicators based on statistical theory. However, these indicators are hard to interpret in practice because they do not correspond to well-known indicators like Kendall tau correlation. We empirically establish these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators and show that they are extremely dependent on the sample of systems and queries used, so much so that the number of queries required to achieve a certain level of reliability can vary by orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool to measure test collection reliability. Reflecting upon all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.
On the Measurement of Test Collection Reliability
1. SIGIR 2013
Dublin, Ireland · July 30th
2. On the Measurement of Test Collection Reliability
@julian_urbano · University Carlos III of Madrid
Mónica Marrero · University Carlos III of Madrid
Diego Martín · Technical University of Madrid
3. Is System A More Effective
than System B?
[Plot: distribution of Δeffectiveness between systems, from -1 to 1, with the mean difference 𝑑 marked]
4. Is System A More Effective
than System B?
Get a test collection and evaluate
Measure the average difference 𝒅
and conclude which one is better
5. Samples
Test collections are samples from a
larger, possibly infinite, population
Documents, queries and assessors
𝒅 is only an estimate
How reliable is our conclusion?
6. Reliability vs. Cost
Building reliable collections is easy…
Just use more documents, more queries,
more assessors
…but it is prohibitively expensive
Our best bet is to increase query set size
7. Data-based approach
1. Randomly split query set
2. Compute indicators of reliability based on those two subsets
3. Extrapolate to larger query sets
…with some variations
Voorhees’98, Zobel’98, Buckley & Voorhees’00,
Voorhees & Buckley’02, Sanderson & Zobel’05,
Sakai’07, Voorhees’09
8. Data-based Reliability Indicators
based on results with two collections
Kendall 𝝉 correlation
stability of the ranking of systems
τ_AP correlation
adds a top-heaviness component
Absolute sensitivity
minimum absolute 𝒅 s.t. swaps <5%
Relative sensitivity
minimum relative 𝒅 s.t. swaps <5%
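As a concrete illustration of the first indicator above, a minimal Python sketch of Kendall τ between the system rankings produced by two (sub)collections. This is a plain pairwise-count implementation that assumes no ties; in practice one would typically use a library routine such as scipy.stats.kendalltau:

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall tau correlation between the rankings of the same systems
    induced by two score lists (one per collection). Assumes no ties."""
    n = len(scores_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant if both collections order systems i, j the same way
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Identical rankings give τ = 1, fully reversed rankings give τ = -1.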
9. Data-based Reliability Indicators
based on results with two collections
Power ratio
statistically significant results
Minor conflict ratio
statistically non-significant swap
Major conflict ratio
statistically significant swap
RMSE
differences in 𝒅
10. Generalizability Theory
Directly address variability of scores
G-study
Estimate variance components
from previous, representative, data
D-study
Estimate reliability based on
estimated variance components
11. G-study
σ² = σ²_s + σ²_q + σ²_s:q
Estimated using Analysis of Variance
From previous data, usually an existing test collection
12. G-study
σ² = σ²_s + σ²_q + σ²_s:q
σ²_s: system differences, our goal!
σ²_q: query difficulty
σ²_s:q: some systems better for some queries
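The variance components above can be estimated from a systems × queries matrix of effectiveness scores. The paper's published code is for R; the following is just a Python sketch of the standard two-way random-effects ANOVA estimators (one observation per cell):

```python
def variance_components(scores):
    """Estimate G-study variance components from a systems x queries
    matrix of scores, via two-way ANOVA with random effects."""
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_s * n_q)
    s_means = [sum(row) / n_q for row in scores]
    q_means = [sum(scores[i][j] for i in range(n_s)) / n_s for j in range(n_q)]
    # Mean squares for the system and query effects
    ms_s = n_q * sum((m - grand) ** 2 for m in s_means) / (n_s - 1)
    ms_q = n_s * sum((m - grand) ** 2 for m in q_means) / (n_q - 1)
    # Interaction (s:q) mean square from the residuals
    ss_int = sum((scores[i][j] - s_means[i] - q_means[j] + grand) ** 2
                 for i in range(n_s) for j in range(n_q))
    var_int = ss_int / ((n_s - 1) * (n_q - 1))   # sigma^2_s:q
    var_s = max(0.0, (ms_s - var_int) / n_q)     # sigma^2_s, clipped at 0
    var_q = max(0.0, (ms_q - var_int) / n_s)     # sigma^2_q, clipped at 0
    return var_s, var_q, var_int
```

Negative estimates are clipped to zero, a common convention when a mean square falls below the interaction term.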
15. D-study
Relative stability: Eρ² = σ²_s / (σ²_s + σ²_s:q / n′_q)
Absolute stability: Φ = σ²_s / (σ²_s + (σ²_q + σ²_s:q) / n′_q)
Easy to estimate how many queries we need for a certain stability level
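A sketch of the D-study computations, assuming variance components estimated beforehand. The helper queries_needed (a hypothetical name, not from the paper) simply inverts the two formulas above to find the smallest n′_q reaching a target stability level:

```python
import math

def d_study(var_s, var_q, var_sq, n_q):
    """Relative (E rho^2) and absolute (Phi) stability for a
    hypothetical collection with n_q queries."""
    erho2 = var_s / (var_s + var_sq / n_q)
    phi = var_s / (var_s + (var_q + var_sq) / n_q)
    return erho2, phi

def queries_needed(var_s, var_q, var_sq, target=0.95):
    """Smallest n_q such that E rho^2 (resp. Phi) reaches the target,
    obtained by solving the stability formulas for n_q."""
    n_rel = math.ceil(var_sq * target / (var_s * (1 - target)))
    n_abs = math.ceil((var_q + var_sq) * target / (var_s * (1 - target)))
    return n_rel, n_abs
```

Note that Φ ≤ Eρ² for the same n′_q, so absolute stability always requires at least as many queries as relative stability.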
16. Generalizability Theory
Proposed by Bodoff’07
Kanoulas & Aslam’09
derive optimal gain & discount in nDCG
TREC Million Query Track
≈80 queries sufficient for stable rankings
≈130 queries for stable absolute scores
17. In this Paper / Talk
How sensitive is the D-study to the
initial data used in the G-study?
How to interpret G-Theory in practice:
what do Eρ² = 0.95 and Φ = 0.95 mean?
From the above two, review the
reliability of >40 TREC test collections
19. Data
43 TREC collections
from TREC-3 to TREC 2011
12 tasks across 10 tracks
Ad Hoc, Web, Novelty, Genomics,
Robust, Terabyte, Enterprise, Million
Query, Medical and Microblog
20. Experiment
Vary number of queries in G-study
from n_q = 5 to the full set
Use all runs available
Run D-study
Compute Eρ² and Φ
Compute n′_q to reach 0.95 stability
200 random trials
25. Variability due to systems
We may get Eρ² = 0.9 or Eρ² = 0.5,
depending on which 20 systems we use
26. Results
G-Theory is very sensitive to initial data
Need about 50 queries and 50 systems for
differences in Eρ² and Φ below 0.1
Number of queries for Eρ² = 0.95
may change by orders of magnitude
Microblog2011 (all 184 systems and 30 queries):
need 63 to 133 queries
Medical2011 (all 34 queries and 40 systems):
need 109 to 566 queries
27. Use Confidence Intervals
Bodoff’08
Confidence intervals in G-study
But what about the D-study?
Feldt’65 and Arteaga et al.’82
Work reasonably well even when
assumptions are violated (Brennan'01)
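The paper relies on the analytic intervals of Feldt'65 and Arteaga et al.'82. As a rough, self-contained way to make the idea of interval estimates concrete, here is a percentile-bootstrap sketch that resamples queries with replacement; this resampling approach is an assumption for illustration, not the authors' method:

```python
import random

def erho2(scores):
    """E rho^2 of a systems x queries matrix at its own query set size,
    via the two-way ANOVA variance components of the G-study."""
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_s * n_q)
    s_means = [sum(r) / n_q for r in scores]
    q_means = [sum(scores[i][j] for i in range(n_s)) / n_s for j in range(n_q)]
    ss_int = sum((scores[i][j] - s_means[i] - q_means[j] + grand) ** 2
                 for i in range(n_s) for j in range(n_q))
    var_int = ss_int / ((n_s - 1) * (n_q - 1))
    ms_s = n_q * sum((m - grand) ** 2 for m in s_means) / (n_s - 1)
    var_s = max(0.0, (ms_s - var_int) / n_q)
    if var_s == 0:
        return 0.0
    return var_s / (var_s + var_int / n_q)

def bootstrap_ci(scores, trials=1000, alpha=0.05, seed=7):
    """Percentile-bootstrap interval for E rho^2, resampling query
    columns with replacement (not the paper's analytic intervals)."""
    rng = random.Random(seed)
    n_q = len(scores[0])
    stats = []
    for _ in range(trials):
        cols = [rng.randrange(n_q) for _ in range(n_q)]
        stats.append(erho2([[row[c] for c in cols] for row in scores]))
    stats.sort()
    lo = stats[int((alpha / 2) * trials)]
    hi = stats[min(trials - 1, int((1 - alpha / 2) * trials))]
    return lo, hi
```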
32. Summary in TREC
that is, the 43 collections we study here
Eρ²: mean = 0.88, sd = 0.1
95% conf. intervals are 0.1 long
Φ: mean = 0.74, sd = 0.2
95% conf. intervals are 0.19 long
34. Experiment
Split query set in 2 subsets
from n_q = 10 to half the full set
Use all runs available
Run D-study
Compute Eρ² and Φ and map onto τ,
sensitivity, power, conflicts, etc.
50 random trials
>28,000 datapoints
36. Example: Eρ² → τ
Eρ² = 0.95 → τ ≈ 0.85
*All mappings in the paper
37. Example: Eρ² → τ
τ = 0.9 → Eρ² ≈ 0.97
*All mappings in the paper
38. Example: Eρ² → τ
[Plot: Eρ² → τ mapping for Million Query 2007 and Million Query 2008]
*All mappings in the paper
39. Future Predictions
Allows us to make more informed
decisions within a collection
What about a new collection?
Fit a single model for each mapping
with 90% and 95% prediction intervals
Assess whether a larger collection
is really worth the effort
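A minimal sketch of such a mapping model: ordinary least squares with a normal-approximation prediction interval. The z-based interval is an assumption for illustration (exact intervals would use the t distribution, as in standard regression prediction), and any (Eρ², τ) data points fed to it would be hypothetical:

```python
import math
from statistics import NormalDist

def fit_with_pi(xs, ys, x_new, level=0.95):
    """Fit y = a + b*x by least squares and return the prediction at
    x_new with a normal-approximation prediction interval."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    # Residual standard error with n - 2 degrees of freedom
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(r * r for r in resid) / (n - 2))
    # z quantile instead of t: a large-sample approximation
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * s * math.sqrt(1 + 1 / n + (x_new - mx) ** 2 / sxx)
    pred = a + b * x_new
    return pred, pred - half, pred + half
```

With such a model fitted on one collection's (Eρ², τ) pairs, one can predict the τ a new, larger collection would likely achieve before paying for it.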
46. Outline
Estimate Eρ² and Φ, with 95%
confidence intervals, and the full query set
Map onto 𝝉, sensitivity, power,
conflicts, etc.
Results within task offer historical
perspective since 1994
47. Example: Ad Hoc 3-8
Eρ² ∈ [0.86, 0.93] → τ ∈ [0.65, 0.81]
minor conflicts ∈ [0.6, 8.2]%
major conflicts ∈ [0.02, 1.38]%
Queries to get Eρ² = 0.95: [37, 233]
Queries to get Φ = 0.95: [116, 999]
50 queries were used
*All collections and mappings in the paper
48. Example: Web Ad Hoc
TREC-8 to TREC-2001: WT2g and WT10g
Eρ² ∈ [0.86, 0.93] → τ ∈ [0.65, 0.81]
Queries to get Eρ² = 0.95: [40, 220]
TREC-2009 to TREC-2011: ClueWeb09
Eρ² ∈ [0.80, 0.83] → τ ∈ [0.53, 0.59]
Queries to get Eρ² = 0.95: [107, 438]
50 queries were used
53. Generalizability Theory
Regarded as a more appropriate,
easy-to-use and powerful tool
to assess test collection reliability
Very sensitive to the initial data
used to estimate variance components
Almost impossible to interpret
in practical terms
54. Sensitivity of G-Theory
About 50 queries and 50 systems
are needed for robust estimates
Caution if building a new collection
Can always use confidence intervals
55. Interpretation of G-Theory
Empirical mapping onto traditional
indicators of reliability like 𝝉 correlation
τ = 0.9 → Eρ² ≈ 0.97
Eρ² = 0.95 → τ ≈ 0.85
56. Historical Reliability in TREC
On average, Eρ² = 0.88 → τ ≈ 0.7
Some collections clearly unreliable
Web Distillation 2003, Genomics 2005, Terabyte 2006,
Enterprise 2008, Medical 2011 and Web Ad Hoc 2011
50 queries not enough for stable
rankings, about 200 are needed
57. Implications
Fixing a minimum number of queries
across tracks is unrealistic
Not even across editions of the same task
Need to analyze on a case-by-case
basis, while building the collections
59. Future Work
Study assessor effect
Study document-collection effect
Better models to map G-Theory
onto data-based indicators
We fitted theoretically correct(-ish) models,
but in practice theory does not hold
Methods to reliably measure reliability
while building the collection
60. Source Code Online
Code for the R statistical software
G-study and D-study
Required number of queries
Map onto data-based indicators
Confidence intervals
…in two simple steps
61. G-Theory too sensitive to initial data
Questionable with small collections
Compute confidence intervals
Need Eρ² ≈ 0.97 for τ = 0.9
50 queries not enough for stable rankings
Fixing a minimum number of
queries across tasks is unrealistic
Need to analyze on a case-by-case basis