A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation
Julián Urbano, Mónica Marrero and Diego Martín

Previous research has suggested the permutation test as the theoretically optimal statistical significance test for IR evaluation, and advocated for the discontinuation of the Wilcoxon and sign tests. We present a large-scale study comprising nearly 60 million system comparisons showing that in practice the bootstrap, t-test and Wilcoxon test outperform the permutation test under different optimality criteria. We also show that actual error rates seem to be lower than the theoretically expected 5%, further confirming that we may actually be underestimating significance.
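The abstract refers to five paired significance tests computed over per-topic effectiveness scores. As a rough illustration only (not the authors' code), the sketch below computes two-sided p-values for all five tests from the per-topic scores of two systems: SciPy's ttest_rel, wilcoxon and binomtest cover the t-test, Wilcoxon and sign tests, while the permutation and bootstrap tests are approximated with simple Monte Carlo resampling. The function name and the resampling defaults are illustrative choices.

import numpy as np
from scipy import stats

def significance_tests(a, b, n_resamples=10_000, seed=0):
    """Two-sided p-values of five paired tests for per-topic scores a and b."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d = a - b                                   # per-topic score differences
    rng = np.random.default_rng(seed)

    # Student's t-test and Wilcoxon signed-rank test on the paired scores
    p_t = stats.ttest_rel(a, b).pvalue
    p_w = stats.wilcoxon(a, b).pvalue

    # Sign test: binomial test on the number of positive (non-zero) differences
    nz = d[d != 0]
    p_sign = stats.binomtest(int((nz > 0).sum()), n=len(nz), p=0.5).pvalue

    # Permutation (randomization) test: randomly flip the sign of each difference
    flips = rng.choice([-1.0, 1.0], size=(n_resamples, len(d)))
    p_perm = np.mean(np.abs((flips * d).mean(axis=1)) >= abs(d.mean()))

    # Bootstrap test (shift method): resample the differences centered at zero
    centered = d - d.mean()
    idx = rng.integers(0, len(d), size=(n_resamples, len(d)))
    p_boot = np.mean(np.abs(centered[idx].mean(axis=1)) >= abs(d.mean()))

    return {"t-test": p_t, "wilcoxon": p_w, "sign": p_sign,
            "permutation": p_perm, "bootstrap": p_boot}

For example, significance_tests(ap_a, ap_b)["permutation"] would give the approximate permutation-test p-value for two vectors of average-precision scores over the same topics.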

A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation
Julián Urbano, Mónica Marrero and Diego Martín
Department of Computer Science · University Carlos III of Madrid

The problem: is system A more effective than system B?
The drill: evaluate with a test collection and run a statistical significance test
The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation?
The reason: test assumptions are violated, so which one is optimal in practice?
Three criteria: power (maximize # of significants), safety (minimize # of errors), exactness (keep errors at α)

Data and Methods
· TREC Robust 2004: 100 topics from Ad Hoc 7 and 8
  o 110 runs, 5995 pairs of systems
· Randomly split topics into T1 and T2, as if they were two collections
  o Evaluate all runs and compute p-values
  o Compare p-values from T1 with p-values from T2
  o 1000 trials, 12M p-values per test, 60M in total
· Interpret pairs of p-values for different α levels, as in the table below (a sketch of one trial is given after the poster text)

              T2: A ≻ B          T2: A ≺ B          T2: A ≻≻ B         T2: A ≺≺ B
T1: A ≻ B     Non-significance   Non-significance   Non-significance   Non-significance
T1: A ≻≻ B    Lack of power      Minor error        Success            Major error

Previous Work
Zobel '98, Sanderson & Zobel '05, Cormack & Lynam '06
· Wilcoxon more powerful than t-test, but more errors
Smucker et al. '07, '09
· bootstrap test overly powerful, though similar to t-test and permutation
· Wilcoxon and sign unreliable, should use permutation

[Figures: non-significance, success, lack-of-power, minor-error, major-error and global-error rates as a function of the significance level α, one curve per test (t-test, permutation, bootstrap, Wilcoxon, sign).]

Take-Home Messages
· Power: bootstrap test gives more significant results
· Safety: t-test gives fewer errors
· Exactness: Wilcoxon test best tracks the nominal level
· The permutation test is not optimal in practice
· Error rates seem lower than expected; focus on power

Dublin, Ireland · 30th July 2013 · Supported by ACM SIGIR Student Travel Grant
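As an illustration of the split-half protocol in the Data and Methods box, the following sketch runs a single trial for one test: it randomly splits the topic set into T1 and T2, computes a p-value for every pair of runs on each half, and classifies each pair of outcomes according to the table above. It assumes the hypothetical significance_tests() helper from the earlier sketch and a score layout scores[run][topic]; it is not the authors' code.

import itertools
import numpy as np

def one_trial(scores, topics, test="t-test", alpha=0.05, seed=0):
    """Classify every pair of runs for one random T1/T2 split of the topics."""
    rng = np.random.default_rng(seed)
    topics = [topics[i] for i in rng.permutation(len(topics))]
    t1, t2 = topics[:len(topics) // 2], topics[len(topics) // 2:]

    counts = {"non-significance": 0, "lack of power": 0,
              "minor error": 0, "success": 0, "major error": 0}

    for run_a, run_b in itertools.combinations(scores, 2):
        halves = []
        for half in (t1, t2):
            a = [scores[run_a][t] for t in half]
            b = [scores[run_b][t] for t in half]
            p = significance_tests(a, b)[test]            # p-value on this half
            halves.append((p < alpha, np.mean(a) > np.mean(b)))

        (sig1, a_wins1), (sig2, a_wins2) = halves
        if not sig1:
            counts["non-significance"] += 1   # T1 found no significant difference
        elif sig2 and a_wins1 == a_wins2:
            counts["success"] += 1            # both significant, same direction
        elif sig2:
            counts["major error"] += 1        # both significant, opposite directions
        elif a_wins1 == a_wins2:
            counts["lack of power"] += 1      # T2 not significant, same direction
        else:
            counts["minor error"] += 1        # T2 not significant, opposite direction
    return counts

Repeating such trials over many random splits and several α levels, and normalizing the counts by the number of significant comparisons, yields rates analogous to those plotted in the figures.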
