The Kendall tau and AP rank correlation coefficients have become mainstream in Information Retrieval research for comparing the rankings of systems produced by two different evaluation conditions, such as different effectiveness measures or pool depths. However, in this paper we focus on the expected rank correlation between the mean scores observed with a test collection and the true, unobservable means under the same conditions. In particular, we propose statistical estimators of tau and AP correlations following both parametric and non-parametric approaches, and with special emphasis on small topic sets. Through large scale simulation with TREC data, we study the error and bias of the estimators. In general, such estimates of expected correlation with the true ranking may accompany the results reported from an evaluation experiment, as an easy to understand figure of reliability. All the results in this paper are fully reproducible with data and code available online.
Formation of low mass protostars and their circumstellar disks
Toward Estimating the Rank Correlation between the Test Collection Results and the True System Performance
1. • Kendall's τ and AP correla�on are successful at comparing two given rankigns
• What about the correla�on between the observed and the true ranking?
• Useful as a single, well-understood, figure of the reliability of an experiment
• Contrary to sensi�vity or sta�s�cal significance, it gives an idea of global
similarity with the truth, not just about individual pairs of systems (eg. t-test)
or about a swap somewhere in the ranking (eg. ANOVA)
Toward Es�ma�ng the Rank Correla�on
between the Test Collec�on Results
and the True System Performance
Julián Urbano and Mónica Marrero
fully reproducible:
data and code
available online
SIGIR 2016
Pisa, July 19th
Evalua�on
0.0 0.2 0.4 0.6 0.8 1.0
0.01.02.0
Population of Topics
Effectiveness
Density
Sample of Topics
Effectiveness
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
04812
Test Collec�on Real World
S5 > S12 > S6 > S2 > S1 > S4...
Future Work
• Be�er es�mators of discordance
• Interval es�mators
• Fully Bayesian approach
• Consider other sources of
variability besides topics, such as
systems or documents
Results: Error of es�mators
0.020.040.060.080.10
tau − adhoc6
topic set size
Error
10
20
30
40
50
60
70
80
90
100
ML
MSQD
RES
KD
SH(w/o)
SH(w)
tau − adhoc7
topic set size
Error
10
20
30
40
50
60
70
80
90
100
0.020.040.060.080.10
tau − adhoc8
topic set size
Error
10
20
30
40
50
60
70
80
90
100
0.020.040.060.080.10
tauAP − adhoc8
topic set size
Error
10
20
30
40
50
60
70
80
90
100
0.020.040.060.080.10
tauAP − adhoc7
topic set size
Error
10
20
30
40
50
60
70
80
90
100
0.020.040.060.080.10
tauAP − adhoc6
topic set size
Error
10
20
30
40
50
60
70
80
90
100
0.020.040.060.080.10
• Split-half es�mators perform very poorly
• About 0.035 error with 50 topics
• All proposals near the same, but MSQD be�er with small samples
Results: Bias of es�mators
tau − adhoc6
topic set size
Bias
10
20
30
40
50
60
70
80
90
100
ML
MSQD
RES
KD
SH(w/o)
SH(w)
0.000.040.08
tau − adhoc7
topic set size
Bias
10
20
30
40
50
60
70
80
90
100
0.000.040.08
0.000.040.08
tau − adhoc8
topic set size
Bias
10
20
30
40
50
60
70
80
90
100
0.000.040.08
tauAP − adhoc6
topic set size
Bias
10
20
30
40
50
60
70
80
90
100
0.000.040.08
tauAP − adhoc7
topic set size
Bias
10
20
30
40
50
60
70
80
90
100
0.000.040.08
tauAP − adhoc8
topic set size
Bias
10
20
30
40
50
60
70
80
90
100
• Split-half es�mators are clearly biased
• Correla�ons generally overes�mated
• MSQD much be�er with small collec�ons, KD slightly be�er otherwise
• We need to know the true scores in order to evaluate the es�mators!
• Stochas�c simula�on from a previous collec�on Y: maintains distribu�ons
and correla�ons, and prefixes vector of true mean scores E[Xs]=μs :=Ys
• From TREC 6, 7 & 8, simulate 3x1000 collec�ons of n=10, 20,...,100 topics
• Split-half baselines w/ and w/o replacement: y=a·ebx
, 2000 replicates
S1 > S2 > S3 > S4 > S5 > S6...
Expected Correla�on with the True Ranking
bias correction
rank of Xi within
the sample