SlideShare a Scribd company logo
A Comparison of the Optimality 
of Statistical Significance Tests 
for Information Retrieval Evaluation 
Julián Urbano, Mónica Marrero and Diego Martín 
Department of Computer Science · University Carlos III of Madrid 
The problem: is system A more effective than system B? 
The drill: evaluate with a test collection and run a statistical significance test 
The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation? 
The reason: test assumptions are violated, so which one is optimal in practice? 
Three criteria: power (maximize # of significants), safety (minimize # of errors), exact (keep errors at α) 
Data and Methods 
· TREC Robust 2004: 100 topics from Ad Hoc 7 and 8 
o 110 runs, 5995 pairs of systems 
· Randomly split topics in T1 and T2, as if two collections 
o Evaluate all runs and compute p-values 
o Compare p-values from T1 with p-values from T2 
o 1000 trials, 12M p-values per test, 60M in total 
· Interpret pairs of p-values for different α levels 
T2 
A ≻B A ≺B A ≻≻B A ≺≺B 
T1 
A ≻B Non-significance 
A≻≻B 
Lack of 
power 
Minor 
error 
Success 
Major 
error 
Non-significance rate 
t-test 
permutation 
bootstrap 
Wilcoxon 
sign 
.001 .005 .01 .05 .1 
Significance level a 
Non-significants / Total 
0.3 0.35 0.4 0.45 0.5 0.6 
Previous Work 
Zobel’98, Sanderson & Zobel’05, Cormack & Lynam’06 
· Wilcoxon more powerful than t-test, but more errors 
Smucker et al. ‘07, ‘09 
· bootstrap test overly powerful, though similar to t-test 
and permutation 
· Wilcoxon and sign unreliable, should use permutation 
· Power: bootstrap test gives more significant results 
· Safety: t-test gives fewer errors 
· Exactness: Wilcoxon test best tracks the nominal level 
· The permutation test is not optimal in practice 
· Error rates seem lower than expected; focus on power 
Success rate 
.001 .005 .01 .05 .1 
Significance level a 
Successes / Total significants 
0.76 0.78 0.80 0.82 0.84 0.86 
Take-Home Messages 
Lack of power rate 
.001 .005 .01 .05 .1 
Significance level a 
Lacks of power / Total significants 
0.12 0.14 0.16 0.18 0.20 
Minor error rate 
t-test 
permutation 
bootstrap 
Wilcoxon 
sign 
y=x 
.001 .005 .01 .05 .1 
Significance level a 
Minor errors / Total significants 
0.001 0.002 0.005 0.010 0.020 
Major error rate 
.001 .005 .01 .05 .1 
Significance level a 
Major errors / Total significants 
5e-07 5e-06 5e-05 5e-04 
Global error rate 
.0001 .0005.001 .005 .01 .05 .1 .5 
Significance level a 
Minor and Major errors / Total significants 
5e-04 2e-03 5e-03 2e-02 5e-02 
Dublin, Ireland · 30th July 2013 Supported by ACM SIGIR Student Travel Grant

More Related Content

Similar to A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation

Cross validation
Cross validationCross validation
Cross validation
RidhaAfrawe
 
multi criteria decision making
multi criteria decision makingmulti criteria decision making
multi criteria decision making
Shankha Goswami
 
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Julián Urbano
 
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
TESCO - The Eastern Specialty Company
 
Complete Site Testing
Complete Site TestingComplete Site Testing
ML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptx
belay41
 
Burden Testing, Theory, and Practice
Burden Testing, Theory, and PracticeBurden Testing, Theory, and Practice
Burden Testing, Theory, and Practice
TESCO - The Eastern Specialty Company
 
Process Capability: Steps 1 to 3
Process Capability: Steps 1 to 3Process Capability: Steps 1 to 3
Process Capability: Steps 1 to 3
Matt Hansen
 
Instrument Transformer Testing
Instrument Transformer TestingInstrument Transformer Testing
Instrument Transformer Testing
TESCO - The Eastern Specialty Company
 
Statistical process control ppt @ bec doms
Statistical process control ppt @ bec domsStatistical process control ppt @ bec doms
Statistical process control ppt @ bec doms
Babasab Patil
 
Hph7300week14winter2009narr
Hph7300week14winter2009narrHph7300week14winter2009narr
Hph7300week14winter2009narrSarah
 
Facility Location
Facility Location Facility Location
Facility Location
Joshua Miranda
 
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ ykblckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
SMayankSharma
 
Complete Site Testing
Complete Site TestingComplete Site Testing
Chap 9 A Process Capability & Spc Hk
Chap 9 A Process Capability & Spc HkChap 9 A Process Capability & Spc Hk
Chap 9 A Process Capability & Spc Hkajithsrc
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Thomas Ploetz
 
Fault simulation – application and methods
Fault simulation – application and methodsFault simulation – application and methods
Fault simulation – application and methodsSubash John
 
SE 09 (test design techs).pptx
SE 09 (test design techs).pptxSE 09 (test design techs).pptx
SE 09 (test design techs).pptx
ZohairMughal1
 
Stochastic Process
Stochastic ProcessStochastic Process
Stochastic Processknksmart
 

Similar to A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation (20)

Cross validation
Cross validationCross validation
Cross validation
 
multi criteria decision making
multi criteria decision makingmulti criteria decision making
multi criteria decision making
 
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
 
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
 
Complete Site Testing
Complete Site TestingComplete Site Testing
Complete Site Testing
 
ML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptx
 
Burden Testing, Theory, and Practice
Burden Testing, Theory, and PracticeBurden Testing, Theory, and Practice
Burden Testing, Theory, and Practice
 
Process Capability: Steps 1 to 3
Process Capability: Steps 1 to 3Process Capability: Steps 1 to 3
Process Capability: Steps 1 to 3
 
Instrument Transformer Testing
Instrument Transformer TestingInstrument Transformer Testing
Instrument Transformer Testing
 
SE%200-Testing%20(2).pptx
SE%200-Testing%20(2).pptxSE%200-Testing%20(2).pptx
SE%200-Testing%20(2).pptx
 
Statistical process control ppt @ bec doms
Statistical process control ppt @ bec domsStatistical process control ppt @ bec doms
Statistical process control ppt @ bec doms
 
Hph7300week14winter2009narr
Hph7300week14winter2009narrHph7300week14winter2009narr
Hph7300week14winter2009narr
 
Facility Location
Facility Location Facility Location
Facility Location
 
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ ykblckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
 
Complete Site Testing
Complete Site TestingComplete Site Testing
Complete Site Testing
 
Chap 9 A Process Capability & Spc Hk
Chap 9 A Process Capability & Spc HkChap 9 A Process Capability & Spc Hk
Chap 9 A Process Capability & Spc Hk
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
 
Fault simulation – application and methods
Fault simulation – application and methodsFault simulation – application and methods
Fault simulation – application and methods
 
SE 09 (test design techs).pptx
SE 09 (test design techs).pptxSE 09 (test design techs).pptx
SE 09 (test design techs).pptx
 
Stochastic Process
Stochastic ProcessStochastic Process
Stochastic Process
 

More from Julián Urbano

Your PhD and You
Your PhD and YouYour PhD and You
Your PhD and You
Julián Urbano
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
Julián Urbano
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP Correlation
Julián Urbano
 
A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR Evaluation
Julián Urbano
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured Documents
Julián Urbano
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
Julián Urbano
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
Julián Urbano
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Julián Urbano
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)
Julián Urbano
 
Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music Similarity
Julián Urbano
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalJulián Urbano
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
Julián Urbano
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...Julián Urbano
 
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Julián Urbano
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
Julián Urbano
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Julián Urbano
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Julián Urbano
 
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Julián Urbano
 
Improving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsImproving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered Lists
Julián Urbano
 
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksCrowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Julián Urbano
 

More from Julián Urbano (20)

Your PhD and You
Your PhD and YouYour PhD and You
Your PhD and You
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP Correlation
 
A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR Evaluation
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured Documents
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)
 
Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music Similarity
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...
 
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
 
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
 
Improving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsImproving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered Lists
 
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksCrowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
 

Recently uploaded

In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
AlaminAfendy1
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
silvermistyshot
 
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
RASHMI M G
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 

Recently uploaded (20)

In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 

A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation

  • 1. A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation Julián Urbano, Mónica Marrero and Diego Martín Department of Computer Science · University Carlos III of Madrid The problem: is system A more effective than system B? The drill: evaluate with a test collection and run a statistical significance test The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation? The reason: test assumptions are violated, so which one is optimal in practice? Three criteria: power (maximize # of significants), safety (minimize # of errors), exact (keep errors at α) Data and Methods · TREC Robust 2004: 100 topics from Ad Hoc 7 and 8 o 110 runs, 5995 pairs of systems · Randomly split topics in T1 and T2, as if two collections o Evaluate all runs and compute p-values o Compare p-values from T1 with p-values from T2 o 1000 trials, 12M p-values per test, 60M in total · Interpret pairs of p-values for different α levels T2 A ≻B A ≺B A ≻≻B A ≺≺B T1 A ≻B Non-significance A≻≻B Lack of power Minor error Success Major error Non-significance rate t-test permutation bootstrap Wilcoxon sign .001 .005 .01 .05 .1 Significance level a Non-significants / Total 0.3 0.35 0.4 0.45 0.5 0.6 Previous Work Zobel’98, Sanderson & Zobel’05, Cormack & Lynam’06 · Wilcoxon more powerful than t-test, but more errors Smucker et al. ‘07, ‘09 · bootstrap test overly powerful, though similar to t-test and permutation · Wilcoxon and sign unreliable, should use permutation · Power: bootstrap test gives more significant results · Safety: t-test gives fewer errors · Exactness: Wilcoxon test best tracks the nominal level · The permutation test is not optimal in practice · Error rates seem lower than expected; focus on power Success rate .001 .005 .01 .05 .1 Significance level a Successes / Total significants 0.76 0.78 0.80 0.82 0.84 0.86 Take-Home Messages Lack of power rate .001 .005 .01 .05 .1 Significance level a Lacks of power / Total significants 0.12 0.14 0.16 0.18 0.20 Minor error rate t-test permutation bootstrap Wilcoxon sign y=x .001 .005 .01 .05 .1 Significance level a Minor errors / Total significants 0.001 0.002 0.005 0.010 0.020 Major error rate .001 .005 .01 .05 .1 Significance level a Major errors / Total significants 5e-07 5e-06 5e-05 5e-04 Global error rate .0001 .0005.001 .005 .01 .05 .1 .5 Significance level a Minor and Major errors / Total significants 5e-04 2e-03 5e-03 2e-02 5e-02 Dublin, Ireland · 30th July 2013 Supported by ACM SIGIR Student Travel Grant