The reliability of a test collection is proportional to the number of queries it contains, but building a collection with many queries is expensive, so researchers must find a balance between reliability and cost. Previous work on measuring test collection reliability relied on data-based approaches that contemplated random "what if" scenarios and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative, founded on analysis of variance, that provides reliability indicators based on statistical theory. However, these indicators are hard to interpret in practice because they do not correspond to well-known indicators like Kendall tau correlation. We empirically establish these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators and show that they are extremely dependent on the sample of systems and queries used, so much so that the number of queries required to achieve a certain level of reliability can vary by orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool to measure test collection reliability. Reflecting upon all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.
On the Measurement of Test Collection Reliability
1. SIGIR 2013
Dublin, Ireland · July 30th
2. On the Measurement of Test Collection Reliability
@julian_urbano · University Carlos III of Madrid
Mónica Marrero · University Carlos III of Madrid
Diego Martín · Technical University of Madrid
3. Is System A More Effective
than System B?
[Plot: distribution of Δeffectiveness between systems, from -1 to 1, with the mean difference 𝑑 marked]
4. Is System A More Effective
than System B?
Get a test collection and evaluate
Measure the average difference 𝒅
and conclude which one is better
5. Samples
Test collections are samples from a
larger, possibly infinite, population
Documents, queries and assessors
𝒅 is only an estimate
How reliable is our conclusion?
6. Reliability vs. Cost
Building reliable collections is easy…
Just use more documents, more queries,
more assessors
…but it is prohibitively expensive
Our best bet is to increase query set size
7. Data-based approach
1. Randomly split query set
2. Compute indicators of reliability based on those two subsets
3. Extrapolate to larger query sets
…with some variations
Voorhees’98, Zobel’98, Buckley & Voorhees’00,
Voorhees & Buckley’02, Sanderson & Zobel’05,
Sakai’07, Voorhees’09
8. Data-based Reliability Indicators
based on results with two collections
Kendall 𝝉 correlation
stability of the ranking of systems
τ_AP correlation
adds a top-heaviness component
Absolute sensitivity
minimum absolute 𝒅 s.t. swaps <5%
Relative sensitivity
minimum relative 𝒅 s.t. swaps <5%
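As a concrete illustration of the first indicator above, a minimal Python sketch of Kendall τ between the system rankings produced by two (sub)collections. This is a plain pairwise-count implementation that assumes no ties; in practice one would typically use a library routine such as scipy.stats.kendalltau:

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall tau correlation between the rankings of the same systems
    induced by two score lists (one per collection). Assumes no ties."""
    n = len(scores_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant if both collections order systems i, j the same way
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Identical rankings give τ = 1, fully reversed rankings give τ = -1.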
9. Data-based Reliability Indicators
based on results with two collections
Power ratio
statistically significant results
Minor conflict ratio
statistically non-significant swap
Major conflict ratio
statistically significant swap
RMSE
differences in 𝒅
10. Generalizability Theory
Directly address variability of scores
G-study
Estimate variance components
from previous, representative, data
D-study
Estimate reliability based on
estimated variance components
11. G-study
σ² = σ²_s + σ²_q + σ²_s:q
Estimated using Analysis of Variance
From previous data, usually an existing test collection
12. G-study
σ² = σ²_s + σ²_q + σ²_s:q
σ²_s: system differences, our goal!
σ²_q: query difficulty
σ²_s:q: some systems better for some queries
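The variance components above can be estimated from a systems × queries matrix of effectiveness scores. The paper's published code is for R; the following is just a Python sketch of the standard two-way random-effects ANOVA estimators (one observation per cell):

```python
def variance_components(scores):
    """Estimate G-study variance components from a systems x queries
    matrix of scores, via two-way ANOVA with random effects."""
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_s * n_q)
    s_means = [sum(row) / n_q for row in scores]
    q_means = [sum(scores[i][j] for i in range(n_s)) / n_s for j in range(n_q)]
    # Mean squares for the system and query effects
    ms_s = n_q * sum((m - grand) ** 2 for m in s_means) / (n_s - 1)
    ms_q = n_s * sum((m - grand) ** 2 for m in q_means) / (n_q - 1)
    # Interaction (s:q) mean square from the residuals
    ss_int = sum((scores[i][j] - s_means[i] - q_means[j] + grand) ** 2
                 for i in range(n_s) for j in range(n_q))
    var_int = ss_int / ((n_s - 1) * (n_q - 1))   # sigma^2_s:q
    var_s = max(0.0, (ms_s - var_int) / n_q)     # sigma^2_s, clipped at 0
    var_q = max(0.0, (ms_q - var_int) / n_s)     # sigma^2_q, clipped at 0
    return var_s, var_q, var_int
```

Negative estimates are clipped to zero, a common convention when a mean square falls below the interaction term.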
15. D-study
Relative stability: Eρ² = σ²_s / (σ²_s + σ²_s:q / n′_q)
Absolute stability: Φ = σ²_s / (σ²_s + (σ²_q + σ²_s:q) / n′_q)
Easy to estimate how many queries we need for a certain stability level
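A sketch of the D-study computations, assuming variance components estimated beforehand. The helper queries_needed (a hypothetical name, not from the paper) simply inverts the two formulas above to find the smallest n′_q reaching a target stability level:

```python
import math

def d_study(var_s, var_q, var_sq, n_q):
    """Relative (E rho^2) and absolute (Phi) stability for a
    hypothetical collection with n_q queries."""
    erho2 = var_s / (var_s + var_sq / n_q)
    phi = var_s / (var_s + (var_q + var_sq) / n_q)
    return erho2, phi

def queries_needed(var_s, var_q, var_sq, target=0.95):
    """Smallest n_q such that E rho^2 (resp. Phi) reaches the target,
    obtained by solving the stability formulas for n_q."""
    n_rel = math.ceil(var_sq * target / (var_s * (1 - target)))
    n_abs = math.ceil((var_q + var_sq) * target / (var_s * (1 - target)))
    return n_rel, n_abs
```

Note that Φ ≤ Eρ² for the same n′_q, so absolute stability always requires at least as many queries as relative stability.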
16. Generalizability Theory
Proposed by Bodoff’07
Kanoulas & Aslam’09
derive optimal gain & discount in nDCG
TREC Million Query Track
≈80 queries sufficient for stable rankings
≈130 queries for stable absolute scores
17. In this Paper / Talk
How sensitive is the D-study to the
initial data used in the G-study?
How to interpret G-Theory in practice:
what do Eρ² = 0.95 and Φ = 0.95 mean?
From the above two, review the
reliability of >40 TREC test collections
19. Data
43 TREC collections
from TREC-3 to TREC 2011
12 tasks across 10 tracks
Ad Hoc, Web, Novelty, Genomics,
Robust, Terabyte, Enterprise, Million
Query, Medical and Microblog
20. Experiment
Vary number of queries in G-study
from n_q = 5 to the full set
Use all runs available
Run D-study
Compute Eρ² and Φ
Compute n′_q to reach 0.95 stability
200 random trials
25. Variability due to systems
We may get Eρ² = 0.9 or Eρ² = 0.5,
depending on which 20 systems we use
26. Results
G-Theory is very sensitive to initial data
Need about 50 queries and 50 systems for
differences in Eρ² and Φ below 0.1
Number of queries for Eρ² = 0.95
may change by orders of magnitude
Microblog2011 (all 184 systems and 30 queries):
need 63 to 133 queries
Medical2011 (all 34 queries and 40 systems):
need 109 to 566 queries
27. Use Confidence Intervals
Bodoff’08
Confidence intervals in G-study
But what about the D-study?
Feldt’65 and Arteaga et al.’82
Work reasonably well even when
assumptions are violated (Brennan'01)
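The paper relies on the analytic intervals of Feldt'65 and Arteaga et al.'82. As a rough, self-contained way to make the idea of interval estimates concrete, here is a percentile-bootstrap sketch that resamples queries with replacement; this resampling approach is an assumption for illustration, not the authors' method:

```python
import random

def erho2(scores):
    """E rho^2 of a systems x queries matrix at its own query set size,
    via the two-way ANOVA variance components of the G-study."""
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_s * n_q)
    s_means = [sum(r) / n_q for r in scores]
    q_means = [sum(scores[i][j] for i in range(n_s)) / n_s for j in range(n_q)]
    ss_int = sum((scores[i][j] - s_means[i] - q_means[j] + grand) ** 2
                 for i in range(n_s) for j in range(n_q))
    var_int = ss_int / ((n_s - 1) * (n_q - 1))
    ms_s = n_q * sum((m - grand) ** 2 for m in s_means) / (n_s - 1)
    var_s = max(0.0, (ms_s - var_int) / n_q)
    if var_s == 0:
        return 0.0
    return var_s / (var_s + var_int / n_q)

def bootstrap_ci(scores, trials=1000, alpha=0.05, seed=7):
    """Percentile-bootstrap interval for E rho^2, resampling query
    columns with replacement (not the paper's analytic intervals)."""
    rng = random.Random(seed)
    n_q = len(scores[0])
    stats = []
    for _ in range(trials):
        cols = [rng.randrange(n_q) for _ in range(n_q)]
        stats.append(erho2([[row[c] for c in cols] for row in scores]))
    stats.sort()
    lo = stats[int((alpha / 2) * trials)]
    hi = stats[min(trials - 1, int((1 - alpha / 2) * trials))]
    return lo, hi
```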
32. Summary in TREC
that is, the 43 collections we study here
Eρ²: mean = 0.88, sd = 0.1
95% conf. intervals are 0.1 long
Φ: mean = 0.74, sd = 0.2
95% conf. intervals are 0.19 long
34. Experiment
Split query set in 2 subsets
from n_q = 10 to half the full set
Use all runs available
Run D-study
Compute Eρ² and Φ and map onto τ,
sensitivity, power, conflicts, etc.
50 random trials
>28,000 datapoints
36. Example: Eρ² → τ
Eρ² = 0.95 → τ ≈ 0.85
*All mappings in the paper
37. Example: Eρ² → τ
τ = 0.9 → Eρ² ≈ 0.97
*All mappings in the paper
38. Example: Eρ² → τ
[Plot: Eρ² → τ mapping for Million Query 2007 and Million Query 2008]
*All mappings in the paper
39. Future Predictions
Allows us to make more informed
decisions within a collection
What about a new collection?
Fit a single model for each mapping
with 90% and 95% prediction intervals
Assess whether a larger collection
is really worth the effort
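A minimal sketch of such a mapping model: ordinary least squares with a normal-approximation prediction interval. The z-based interval is an assumption for illustration (exact intervals would use the t distribution, as in standard regression prediction), and any (Eρ², τ) data points fed to it would be hypothetical:

```python
import math
from statistics import NormalDist

def fit_with_pi(xs, ys, x_new, level=0.95):
    """Fit y = a + b*x by least squares and return the prediction at
    x_new with a normal-approximation prediction interval."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    # Residual standard error with n - 2 degrees of freedom
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(r * r for r in resid) / (n - 2))
    # z quantile instead of t: a large-sample approximation
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * s * math.sqrt(1 + 1 / n + (x_new - mx) ** 2 / sxx)
    pred = a + b * x_new
    return pred, pred - half, pred + half
```

With such a model fitted on one collection's (Eρ², τ) pairs, one can predict the τ a new, larger collection would likely achieve before paying for it.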
46. Outline
Estimate Eρ² and Φ, with 95%
confidence intervals, and the full query set
Map onto 𝝉, sensitivity, power,
conflicts, etc.
Results within task offer historical
perspective since 1994
47. Example: Ad Hoc 3-8
Eρ² ∈ [0.86, 0.93] → τ ∈ [0.65, 0.81]
minor conflicts ∈ [0.6, 8.2]%
major conflicts ∈ [0.02, 1.38]%
Queries to get Eρ² = 0.95: [37, 233]
Queries to get Φ = 0.95: [116, 999]
50 queries were used
*All collections and mappings in the paper
48. Example: Web Ad Hoc
TREC-8 to TREC-2001: WT2g and WT10g
Eρ² ∈ [0.86, 0.93] → τ ∈ [0.65, 0.81]
Queries to get Eρ² = 0.95: [40, 220]
TREC-2009 to TREC-2011: ClueWeb09
Eρ² ∈ [0.80, 0.83] → τ ∈ [0.53, 0.59]
Queries to get Eρ² = 0.95: [107, 438]
50 queries were used
53. Generalizability Theory
Regarded as a more appropriate,
easy-to-use and powerful tool
to assess test collection reliability
Very sensitive to the initial data
used to estimate variance components
Almost impossible to interpret
in practical terms
54. Sensitivity of G-Theory
About 50 queries and 50 systems
are needed for robust estimates
Caution if building a new collection
Can always use confidence intervals
55. Interpretation of G-Theory
Empirical mapping onto traditional
indicators of reliability like 𝝉 correlation
τ = 0.9 → Eρ² ≈ 0.97
Eρ² = 0.95 → τ ≈ 0.85
56. Historical Reliability in TREC
On average, Eρ² = 0.88 → τ ≈ 0.7
Some collections clearly unreliable
Web Distillation 2003, Genomics 2005, Terabyte 2006,
Enterprise 2008, Medical 2011 and Web Ad Hoc 2011
50 queries not enough for stable
rankings, about 200 are needed
57. Implications
Fixing a minimum number of queries
across tracks is unrealistic
Not even across editions of the same task
Need to analyze on a case-by-case
basis, while building the collections
59. Future Work
Study assessor effect
Study document-collection effect
Better models to map G-Theory
onto data-based indicators
We fitted theoretically correct(-ish) models,
but in practice theory does not hold
Methods to reliably measure reliability
while building the collection
60. Source Code Online
Code for the R statistical software
G-study and D-study
Required number of queries
Map onto data-based indicators
Confidence intervals
…in two simple steps
61. G-Theory too sensitive to initial data
Questionable with small collections
Compute confidence intervals
Need Eρ² ≈ 0.97 for τ = 0.9
50 queries not enough for stable rankings
Fixing a minimum number of
queries across tasks is unrealistic
Need to analyze on a case-by-case basis