
Toward Estimating the Rank Correlation between the Test Collection Results and the True System Performance


The Kendall tau and AP rank correlation coefficients have become mainstream in Information Retrieval research for comparing the rankings of systems produced by two different evaluation conditions, such as different effectiveness measures or pool depths. In this paper, however, we focus on the expected rank correlation between the mean scores observed with a test collection and the true, unobservable means under the same conditions. In particular, we propose statistical estimators of the tau and AP correlations following both parametric and non-parametric approaches, with special emphasis on small topic sets. Through large-scale simulation with TREC data, we study the error and bias of the estimators. In general, such estimates of the expected correlation with the true ranking may accompany the results reported from an evaluation experiment as an easy-to-understand figure of reliability. All the results in this paper are fully reproducible with data and code available online.
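To make the two coefficients concrete, here is a minimal Python sketch of Kendall's tau and the AP rank correlation computed from per-system mean scores. It is an illustration under stated assumptions, not the authors' code: the function names `kendall_tau` and `tau_ap` and the toy scores are hypothetical, ties are assumed absent, and `tau_ap` follows the commonly used definition in which the second argument plays the role of the reference ranking.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two score lists over the same systems (no ties assumed)."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)  # +1 if concordant, -1 if discordant
    return s / (n * (n - 1) / 2)


def tau_ap(observed, reference):
    """AP rank correlation of the observed ranking against a reference ranking.

    For each system below the top of the observed ranking, take the fraction
    of systems ranked above it that the reference also ranks above it, then
    average these fractions and rescale the result to [-1, 1].
    """
    n = len(observed)
    order = sorted(range(n), key=lambda s: observed[s], reverse=True)
    total = 0.0
    for i in range(1, n):
        cur = order[i]
        correct = sum(1 for above in order[:i] if reference[above] > reference[cur])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0


# Toy mean scores for four systems (made up for illustration).
reference = [0.9, 0.8, 0.7, 0.6]
swap_top = [0.8, 0.9, 0.7, 0.6]     # the two best systems are swapped
swap_bottom = [0.9, 0.8, 0.6, 0.7]  # the two worst systems are swapped

print(kendall_tau(swap_top, reference), kendall_tau(swap_bottom, reference))
print(tau_ap(swap_top, reference), tau_ap(swap_bottom, reference))
```

In this toy case both swaps yield the same Kendall tau, while the AP correlation is lower for the swap at the top of the ranking, which is the usual motivation for reporting both.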

Published in: Science


  1. Toward Estimating the Rank Correlation between the Test Collection Results and the True System Performance
     Julián Urbano and Mónica Marrero
     SIGIR 2016, Pisa, July 19th
     Fully reproducible: data and code available online

     • Kendall's τ and AP correlation are successful at comparing two given rankings
     • What about the correlation between the observed and the true ranking?
     • Useful as a single, well-understood figure of the reliability of an experiment
     • Contrary to sensitivity or statistical significance, it gives an idea of global similarity with the truth, not just about individual pairs of systems (e.g. t-test) or about a swap somewhere in the ranking (e.g. ANOVA)

     [Figure: "Population of Topics" (Effectiveness vs. Density) and "Sample of Topics" (Effectiveness vs. Frequency), labeled "Test Collection" and "Real World". Example rankings: S5 > S12 > S6 > S2 > S1 > S4... and S1 > S2 > S3 > S4 > S5 > S6...]

     Expected Correlation with the True Ranking
     [Equations not recoverable from the transcript; surviving annotations: "bias correction", "rank of Xi within the sample"]

     Evaluation
     • We need to know the true scores in order to evaluate the estimators!
     • Stochastic simulation from a previous collection Y: maintains distributions and correlations, and fixes in advance the vector of true mean scores E[X_s] = μ_s := Y_s
     • From TREC 6, 7 & 8, simulate 3x1000 collections of n = 10, 20, ..., 100 topics
     • Split-half baselines w/ and w/o replacement: y = a·e^(bx), 2000 replicates

     Results: Error of estimators
     [Figure: six panels (tau and tauAP for adhoc6, adhoc7 and adhoc8) plotting Error against topic set size (10 to 100) for ML, MSQD, RES, KD, SH(w/o) and SH(w)]
     • Split-half estimators perform very poorly
     • About 0.035 error with 50 topics
     • All proposals near the same, but MSQD better with small samples

     Results: Bias of estimators
     [Figure: six panels (tau and tauAP for adhoc6, adhoc7 and adhoc8) plotting Bias against topic set size (10 to 100) for ML, MSQD, RES, KD, SH(w/o) and SH(w)]
     • Split-half estimators are clearly biased
     • Correlations generally overestimated
     • MSQD much better with small collections, KD slightly better otherwise

     Future Work
     • Better estimators of discordance
     • Interval estimators
     • Fully Bayesian approach
     • Consider other sources of variability besides topics, such as systems or documents
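The quantity being estimated, the expected correlation between the observed and the true ranking, can be approximated by Monte Carlo whenever the true means are known, which is why the evaluation relies on simulation. The sketch below is a deliberately simplified illustration and not the simulation method of the poster: it assumes independent, normally distributed per-topic scores with a common standard deviation, whereas the poster's simulation preserves the distributions and correlations of an existing collection. The function names, the true means and the parameter values are all hypothetical.

```python
import random
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two score lists over the same systems (no ties assumed)."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)  # +1 if concordant, -1 if discordant
    return s / (n * (n - 1) / 2)


def expected_tau(true_means, n_topics, sd=0.2, trials=200, seed=0):
    """Monte Carlo estimate of E[tau] between observed and true rankings.

    ASSUMPTION: per-topic scores are independent draws from Normal(mu_s, sd);
    real effectiveness scores are bounded and correlated across systems.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Simulate one test collection: the mean score of each system over n_topics.
        observed = [
            sum(rng.gauss(mu, sd) for _ in range(n_topics)) / n_topics
            for mu in true_means
        ]
        total += kendall_tau(observed, true_means)
    return total / trials


# Hypothetical true mean scores of five systems.
true_means = [0.20, 0.25, 0.30, 0.35, 0.40]
for n in (10, 50, 100):
    print(n, round(expected_tau(true_means, n), 3))
```

Under these assumptions the average correlation with the true ranking grows with the number of topics. An estimator such as those compared above would have to approximate this quantity from a single observed collection, without access to `true_means`.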
