A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation
Julián Urbano, Mónica Marrero and Diego Martín 
Department of Computer Science · University Carlos III of Madrid 
The problem: is system A more effective than system B? 
The drill: evaluate with a test collection and run a statistical significance test 
The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation? 
The reason: test assumptions are violated, so which one is optimal in practice? 
Three criteria: power (maximize # of significant results), safety (minimize # of errors), exactness (keep the error rate at the nominal level α)
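As a point of reference for the tests named above, here is a minimal sketch of two of them applied to paired per-topic scores: a two-sided sign test (exact binomial) and a two-sided paired permutation test approximated by random sign flips. The function names, the number of resamples, the seed and the add-one correction are assumptions made for illustration; the poster does not state how its tests were implemented.

```python
import random
from math import comb


def sign_test(a, b):
    """Two-sided exact sign test on paired scores; tied topics are dropped."""
    pos = sum(x > y for x, y in zip(a, b))
    neg = sum(x < y for x, y in zip(a, b))
    n = pos + neg
    if n == 0:
        return 1.0  # every topic is tied: no evidence either way
    k = min(pos, neg)
    # P(X <= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided paired permutation test, approximated by Monte Carlo sign flips."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the labels A and B are exchangeable within
        # each topic, so each difference keeps or flips its sign with prob. 1/2.
        s = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(s) >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction (an assumption) keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

Both functions take two equally long lists of per-topic effectiveness scores (for example, average precision of systems A and B on the same topics) and return a p-value.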
Data and Methods 
· TREC Robust 2004: 100 topics from Ad Hoc 7 and 8 
o 110 runs, 5995 pairs of systems 
· Randomly split topics into T1 and T2, as if two collections 
o Evaluate all runs and compute p-values 
o Compare p-values from T1 with p-values from T2 
o 1000 trials, 12M p-values per test, 60M in total 
· Interpret pairs of p-values for different α levels 
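The split-half procedure listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the `scores` layout (run → topic → score), the equal-size split, the function names and the seed are all hypothetical, and `test` is any callable that maps two paired score lists to a p-value. The constants at the end only check that the poster's counts are consistent with two p-values (one per half) for each pair and trial.

```python
import random
from itertools import combinations
from math import comb


def split_topics(topic_ids, rng):
    """Randomly split the topics into two disjoint halves T1 and T2."""
    ids = list(topic_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def split_half_pvalues(scores, test, n_trials, seed=0):
    """Yield (trial, run_a, run_b, delta1, p1, delta2, p2) for every pair of runs.

    scores: dict mapping run name -> dict mapping topic id -> score (assumed layout)
    test:   callable (scores_a, scores_b) -> p-value
    delta:  mean score of run_a minus mean score of run_b on that half
    """
    rng = random.Random(seed)
    runs = sorted(scores)
    topics = sorted(next(iter(scores.values())))
    for trial in range(n_trials):
        t1, t2 = split_topics(topics, rng)
        for run_a, run_b in combinations(runs, 2):
            row = [trial, run_a, run_b]
            for subset in (t1, t2):
                a = [scores[run_a][t] for t in subset]
                b = [scores[run_b][t] for t in subset]
                delta = sum(a) / len(a) - sum(b) / len(b)
                row += [delta, test(a, b)]
            yield tuple(row)


# Consistency check of the counts on the poster.
N_PAIRS = comb(110, 2)                    # 5995 pairs of systems
P_VALUES_PER_TEST = 1000 * N_PAIRS * 2    # 11,990,000, i.e. about 12M
P_VALUES_TOTAL = 5 * P_VALUES_PER_TEST    # 59,950,000, i.e. about 60M
```

Each yielded row carries what is needed to interpret the pair of p-values at any α level afterwards, so the expensive testing step runs only once per trial.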
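Outcome for a pair of p-values (rows: result in T1; columns: result in T2; ≻ better but not significant, ≻≻ significantly better):
              T2: A ≻ B        A ≺ B          A ≻≻ B     A ≺≺ B
T1: A ≻ B     Non-significance (whatever the result in T2)
T1: A ≻≻ B    Lack of power    Minor error    Success    Major error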
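The outcome table above can be read as a small decision rule. The sketch below is one way to write it down; the function name, the use of `p <= alpha` as the significance threshold, and the handling of a zero difference (treated as a disagreement in direction) are assumptions, since the poster does not specify them.

```python
def classify(delta1, p1, delta2, p2, alpha):
    """Classify a pair of results (T1, T2) for systems A and B at level alpha.

    delta1, delta2: mean(A) - mean(B) observed on T1 and on T2
    p1, p2:         p-values of the same test on T1 and on T2
    """
    if p1 > alpha:
        # T1 is not significant: nothing is concluded, whatever T2 says.
        return "non-significance"
    same_direction = delta1 * delta2 > 0  # assumption: a zero delta counts as disagreement
    significant_in_t2 = p2 <= alpha
    if same_direction:
        return "success" if significant_in_t2 else "lack of power"
    return "major error" if significant_in_t2 else "minor error"
```

For example, `classify(0.10, 0.01, -0.05, 0.20, alpha=0.05)` returns `"minor error"`: T1 finds A significantly better, while T2 points the other way without reaching significance.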
[Figure: Non-significance rate (Non-significants / Total) vs. significance level α, for the t-test, permutation, bootstrap, Wilcoxon and sign tests]
Previous Work 
Zobel ’98, Sanderson & Zobel ’05, Cormack & Lynam ’06 
· Wilcoxon more powerful than t-test, but more errors 
Smucker et al. ’07, ’09 
· bootstrap test overly powerful, though similar to t-test and permutation 
· Wilcoxon and sign unreliable, should use permutation 
Take-Home Messages 
· Power: bootstrap test gives more significant results 
· Safety: t-test gives fewer errors 
· Exactness: Wilcoxon test best tracks the nominal level 
· The permutation test is not optimal in practice 
· Error rates seem lower than expected; focus on power 
[Figure: Success rate (Successes / Total significants) vs. significance level α]
[Figure: Lack of power rate (Lacks of power / Total significants) vs. significance level α]
[Figure: Minor error rate (Minor errors / Total significants) vs. significance level α, for the t-test, permutation, bootstrap, Wilcoxon and sign tests, with reference line y = x]
[Figure: Major error rate (Major errors / Total significants) vs. significance level α]
[Figure: Global error rate (Minor and Major errors / Total significants) vs. significance level α]
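The rates plotted above are simple ratios of outcome counts. A minimal sketch, assuming the outcomes are available as a list of labels (the label strings and the function name are hypothetical): the non-significance rate is taken over all comparisons, and every other rate over the comparisons that were significant in T1, as the axis labels indicate.

```python
from collections import Counter


def outcome_rates(outcomes):
    """Compute the poster's rates from an iterable of outcome labels."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    significants = total - counts["non-significance"]

    def per_significant(n):
        # Rates other than non-significance are relative to significant results.
        return n / significants if significants else float("nan")

    return {
        "non-significance": counts["non-significance"] / total if total else float("nan"),
        "success": per_significant(counts["success"]),
        "lack of power": per_significant(counts["lack of power"]),
        "minor error": per_significant(counts["minor error"]),
        "major error": per_significant(counts["major error"]),
        "global error": per_significant(counts["minor error"] + counts["major error"]),
    }
```

Comparing the "global error" value against α at each level is one way to read the exactness criterion: an exact test would keep that rate close to the nominal level.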
Dublin, Ireland · 30th July 2013 · Supported by ACM SIGIR Student Travel Grant
