On Comparing Classifiers:
Pitfalls to Avoid and a
Recommended Approach
(cited by 581)
Author: Steven L. Salzberg
Presented by: Mehmet Ali Abbasoğlu &
Mustafa İlker Saraç
10.04.2014
Contents
1. Motivation
2. Comparing Algorithms
3. Definitions
4. Problems
5. Recommended Approach
6. Conclusion
Motivation
● Be careful about comparative studies of classification
and other algorithms.
○ It is easy to reach statistically invalid conclusions.
● How to choose which algorithm to use for a new
problem?
● Using brute force one can easily find a phenomenon or
pattern that looks impressive.
○ REALLY?
Motivation
● You have lots of data
○ Choose one from UCI repository
● You have many classification methods to compare
But,
● Any differences in classification accuracy that reach
statistical significance should be reported as important?
○ Think again!
Comparing Algorithms
● Many new algorithms have problems according to a
survey conducted by Prechelt.
○ 29% not evaluated on a real problem
○ Only 8% compared to more than one alternative on real
data
● A survey by Flexer on experimental neural network
papers in leading journals
○ Only 3 out of 43 used a separate data set for tuning
parameters.
Comparing Algorithms
● Drawbacks of reporting results on a well studied data
set, e.g. a data set from UCI repository
○ It is hard to improve results
○ Prone to statistical accidents
○ They are fine to see initial results for your new
algorithm
● It seems easy to change a known algorithm a little and then
use comparisons to report improved results.
○ High risk of statistical invalidity
○ Better to apply new algorithms
Definitions
● Statistical significance
○ In statistics, a result is considered significant not because
it is important or meaningful, but because it has been
predicted as unlikely to have occurred by chance alone.
● t-test
○ Used to determine whether two sets of data are
significantly different from each other
● p-value
○ Probability of obtaining a result at least as extreme as the
one observed, assuming the null hypothesis is true.
● null hypothesis
○ The default position: there is no real difference between the things being compared
Problem 1 :
Small repository of datasets
● It is difficult to produce major new results using well-
studied and widely shared data.
● Suppose 100 people are studying the effect of
algorithms A and B
● If A and B are in fact equally good, about 5 are expected to get
results statistically significant at p <= 0.05
● Clearly these results are due to chance.
○ The ones who get significant results will publish
○ While others will simply move on to other experiments.
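The "5 out of 100" figure can be checked with a small simulation. This is only a sketch under a stated assumption: when there is no real difference between A and B, each researcher's p-value is modeled as a uniform draw on [0, 1] (true for a continuous test statistic under the null hypothesis). The function names are illustrative and not from the paper.

```python
import random


def count_chance_findings(n_researchers=100, alpha=0.05, rng=None):
    """Count researchers who see p <= alpha although A and B are equally good.

    Assumption: under a true null hypothesis each p-value is uniform on [0, 1].
    """
    if rng is None:
        rng = random.Random()
    return sum(rng.random() <= alpha for _ in range(n_researchers))


def average_chance_findings(n_trials=2000, seed=0):
    """Average number of spurious 'significant' results over many repetitions."""
    rng = random.Random(seed)
    total = sum(count_chance_findings(rng=rng) for _ in range(n_trials))
    return total / n_trials


if __name__ == "__main__":
    # The average should be close to 100 * 0.05 = 5.
    print(average_chance_findings())
```

If only the researchers who see p <= 0.05 publish, the literature ends up recording a difference that does not exist.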
Problem 2 :
Statistical validity
● Statistics offers many tests that are designed to measure
the significance of any difference
● These tests are not designed with computational
experiments in mind.
● For example
○ 14 different variations of classifier algorithms
○ 11 different datasets
○ 154 comparisons, 154 chances to be significant
○ Expected number of results significant by chance alone is 154*0.05 = 7.7
○ multiplicity effect
Problem 2 :
Statistical validity
● Let the significance level for each experiment be α
● Chance of making the right conclusion in one experiment
is (1 - α)
● Assuming experiments are independent of one another,
the chance of getting n experiments correct is (1 - α)^n
● Chance of making at least one incorrect conclusion is 1 - (1 - α)^n
● Substituting α = 0.05 and n = 154
● Chance of making an incorrect conclusion is 0.9996
● To obtain results significant at the 0.05 level with 154 tests
1 - (1 - α)^n < 0.05
α < 0.0003
● This adjustment is known as the Bonferroni adjustment.
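The arithmetic on this slide can be reproduced in a few lines. The sketch below assumes independent tests, as the slide does; the function names are illustrative. It also prints the simpler α/n rule that is commonly used as the Bonferroni correction, which gives nearly the same threshold.

```python
def familywise_error(alpha, n):
    """Chance of at least one false 'significant' result in n independent tests."""
    return 1 - (1 - alpha) ** n


def per_test_alpha(target, n):
    """Per-test level that keeps the overall error chance at `target`.

    Solves 1 - (1 - alpha)^n = target for alpha.
    """
    return 1 - (1 - target) ** (1 / n)


if __name__ == "__main__":
    n = 14 * 11                                   # 154 comparisons
    print(round(familywise_error(0.05, n), 4))    # 0.9996
    print(per_test_alpha(0.05, n))                # about 0.00033
    print(0.05 / n)                               # about 0.00032 (alpha / n rule)
```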
Problem 3 :
Experiments are not independent
● The t-test assumes that the test sets for
each algorithm are independent.
● Generally two algorithms are compared on
the same data set
○ Obviously the test sets are not independent.
Problem 4 :
Only considers overall accuracy
● Comparison must consider four numbers when a common
test set is used for comparing two algorithms
○ A got right and B got wrong ( A > B )
○ B got right and A got wrong ( B > A )
○ Both algorithms got right
○ Both algorithms got wrong
● If only two algorithms are compared
○ Throw out ties
○ Compare A > B vs B > A
● If more than two algorithms are compared
○ Use “Analysis of Variance” (ANOVA)
○ Bonferroni adjustment for multiple tests
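For the two-algorithm case, one way to "throw out ties and compare A > B vs B > A" is an exact two-sided binomial (sign) test on the disagreements only. The slide does not name a specific test, so the choice here is an assumption, and the function name and counts are purely illustrative.

```python
from math import comb


def sign_test_p_value(a_wins, b_wins):
    """Two-sided exact binomial test on the examples where A and B disagree.

    a_wins: examples A got right and B got wrong.
    b_wins: examples B got right and A got wrong.
    Examples where both were right or both were wrong (ties) are ignored.
    Null hypothesis: each disagreement favors A or B with probability 0.5.
    """
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    k = min(a_wins, b_wins)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts: A wins 15 disagreements, B wins 5.
    print(sign_test_p_value(15, 5))
```

Because only the disagreements enter the test, two classifiers with the same overall accuracy on a shared test set can still be told apart, and two with different accuracies may not be.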
Problem 5 :
Repeated tuning
● Researchers tune their algorithms repeatedly to perform
optimally on a data set.
● Whenever tuning takes place, every adjustment should
really be considered as a separate experiment.
○ For example, if 10 tuning experiments were
attempted, then the significance threshold should be 0.005 instead of
0.05.
● When one uses an algorithm that has been used before,
the algorithm may already have been tuned on public
databases.
Problem 5 :
Repeated tuning
● Recommended approach:
○ Reserve a portion of the training set as a tuning set
○ Repeatedly test the algorithm and adjust parameters on tuning
set.
○ Measure accuracy on the test data.
Problem 6 :
Generalizing results
● Common methodological approach
○ pick several datasets from UCI repository
○ perform a series of experiments
■ measuring classification accuracy
■ learning rates
● It is not valid to make general statements about other
datasets.
○ The repository is not an unbiased sample of classification
problems.
● Someone can write an algorithm that works very well on
some of the known datasets
○ Anyone familiar with the data may be biased.
A Recommended Approach
1. Choose other algorithms to include in the comparison.
2. Choose a benchmark data set.
3. Divide the data set into k subsets for cross validation
○ Typically k = 10
○ For small data sets, choose a larger k.
A Recommended Approach
4. Run cross-validation
○ For each of the k subsets Dk of the data set D, create a training
set T = D - Dk
○ Divide T into two subsets: T1 (training) and T2 (tuning)
○ Once parameters are optimized on T2, re-run training on all of T
○ Measure accuracy on the held-out subset Dk
○ Overall accuracy is averaged across all k partitions.
5. Compare algorithms
● In case of multiple data sets, Bonferroni adjustment
should be applied.
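Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the paper's code: the 2/3 to 1/3 split of T into T1 and T2, the 1-D nearest-neighbor learner, the toy data, and all function names are assumptions made for the example.

```python
import random


def fit_knn(train, n_neighbors):
    """Toy 1-D k-nearest-neighbor learner; train is a list of (x, label)."""
    def predict(x):
        nearest = sorted(train, key=lambda r: abs(r[0] - x))[:n_neighbors]
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > len(nearest) else 0
    return predict


def accuracy(model, examples):
    return sum(model(x) == label for x, label in examples) / len(examples)


def cross_validate(data, k, candidates, fit, score, seed=0):
    """k-fold cross-validation with a tuning set carved out of each training set."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    results = []
    for i, held_out in enumerate(folds):
        # T = D minus the held-out subset
        T = [r for j, fold in enumerate(folds) if j != i for r in fold]
        cut = (2 * len(T)) // 3          # assumed split: 2/3 training, 1/3 tuning
        T1, T2 = T[:cut], T[cut:]
        # Tune on T2 only; the held-out subset is never consulted here.
        best = max(candidates, key=lambda p: score(fit(T1, p), T2))
        # Re-train on all of T with the chosen parameter, then test once.
        results.append(score(fit(T, best), held_out))
    return sum(results) / k


if __name__ == "__main__":
    # Toy data: label is 1 when x >= 30.
    data = [(x, 1 if x >= 30 else 0) for x in range(60)]
    print(cross_validate(data, k=10, candidates=[1, 3, 5],
                         fit=fit_knn, score=accuracy))
```

The point of the structure is that parameter choice never sees the held-out subset, so the averaged accuracy is not inflated by tuning.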
Conclusion
● Authors do not mean to discourage empirical
comparisons
● They try to provide suggestions to avoid pitfalls
● They suggest that
○ Statistical tools should be used carefully.
○ Every detail of the experiment should be reported.
Thank you!

CS550 Presentation - On comparing classifiers by Salzberg

  • 1. On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach (cited by 581) Author: Steven L. Salzberg Presented by: Mehmet Ali Abbasoğlu & Mustafa İlker Saraç 10.04.2014
  • 2. Contents 1. Motivation 2. Comparing Algorithms 3. Definitions 4. Problems 5. Recommended Approach 6. Conclusion
  • 3. Motivation ● Be careful about comparative studies of classification and other algorithms. ○ It is easy to reach statistically invalid conclusions. ● How do you choose which algorithm to use for a new problem? ● Using brute force, one can easily find a phenomenon or pattern that looks impressive. ○ REALLY?
  • 4. Motivation ● You have lots of data ○ Choose one from the UCI repository ● You have many classification methods to compare But, ● Should any difference in classification accuracy that reaches statistical significance be reported as important? ○ Think again!
  • 5. Comparing Algorithms ● Many new algorithms have evaluation problems, according to a survey conducted by Prechelt. ○ 29% were not evaluated on a real problem ○ Only 8% were compared to more than one alternative on real data ● A survey by Flexer on experimental neural network papers in leading journals ○ Only 3 out of 43 used a separate data set for tuning parameters.
  • 6. Comparing Algorithms ● Drawbacks of reporting results on a well-studied data set, e.g. a data set from the UCI repository ○ It is hard to improve on existing results ○ Results are prone to statistical accidents ○ Such data sets are fine for getting initial results for a new algorithm ● It seems easy to change a known algorithm a little and then use comparisons to report improved results. ○ High risk of statistical invalidity ○ Better to apply new algorithms
  • 7. Definitions ● Statistical significance ○ In statistics, a result is considered significant not because it is important or meaningful, but because it is judged unlikely to have occurred by chance alone. ● t-test ○ Used to determine whether the means of two sets of data are significantly different from each other ● p-value ○ The probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed. ● null hypothesis ○ The default position, e.g. that there is no real difference between the two algorithms being compared
  • 8. Problem 1 : Small repository of datasets ● It is difficult to produce major new results using well-studied and widely shared data. ● Suppose 100 people are studying the effect of algorithms A and B ● Even if A and B do not differ, about 5 of them (100 × 0.05) are expected to get results statistically significant at p <= 0.05 ● Clearly such results are due to chance. ○ The ones who get significant results will publish ○ While the others will simply move on to other experiments.
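The "100 researchers" argument on this slide can be checked with a few lines of arithmetic and a small simulation. The sketch below is illustrative only: it assumes that A and B are truly identical and that, under that null hypothesis, each researcher's p-value is uniformly distributed on [0, 1] and independent of the others. The names `ALPHA`, `N_RESEARCHERS`, and `simulate_significant_count` are made up for this example.

```python
import random

ALPHA = 0.05          # significance level each researcher uses
N_RESEARCHERS = 100   # independent groups comparing A and B

# Expected number of "significant" results when A and B do not differ.
expected_false_positives = N_RESEARCHERS * ALPHA

# Probability that at least one group sees a significant result by chance.
p_at_least_one = 1 - (1 - ALPHA) ** N_RESEARCHERS


def simulate_significant_count(rng):
    """Count groups whose p-value falls below ALPHA under the null.

    Assumption: under the null hypothesis each p-value is uniform on [0, 1].
    """
    return sum(rng.random() < ALPHA for _ in range(N_RESEARCHERS))


rng = random.Random(0)
counts = [simulate_significant_count(rng) for _ in range(2000)]
mean_count = sum(counts) / len(counts)

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {p_at_least_one:.4f}")
print(f"simulated mean count: {mean_count:.2f}")
```

The simulated mean should sit close to 5, and the probability that at least one group finds a "significant" difference is above 0.99, which is why publication of only the significant results is misleading.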
  • 9. Problem 2 : Statistical validity ● Statistics offers many tests that are designed to measure the significance of a difference ● These tests were not designed with computational experiments in mind. ● For example ○ 14 different variations of classifier algorithms ○ 11 different datasets ○ 154 combinations, so 154 chances to be significant ○ The expected number of "significant" results at the 0.05 level by chance alone is 154 × 0.05 = 7.7 ○ This is the multiplicity effect
  • 10. Problem 2 : Statistical validity ● Let the significance level for each experiment be α ● The chance of reaching the right conclusion in one experiment is $(1 - \alpha)$ ● Assuming the experiments are independent of one another, the chance of getting all n experiments right is $(1 - \alpha)^n$ ● The chance of at least one incorrect conclusion is $1 - (1 - \alpha)^n$ ● Substituting α = 0.05 and n = 154, the chance of at least one incorrect conclusion is 0.9996 ● To obtain results significant at the 0.05 level with 154 tests, require $1 - (1 - \alpha)^n < 0.05$, which gives α < 0.0003 ● This adjustment is known as the Bonferroni adjustment.
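The arithmetic on the two "Statistical validity" slides can be reproduced directly. This is a minimal sketch assuming independent tests, as the slide does; the function names are invented for illustration. Solving $1 - (1 - \alpha)^n$ exactly for α is usually called the Šidák correction, while the simpler α / n is the usual Bonferroni form. The two agree closely when α is small.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def per_test_alpha_exact(target, n_tests):
    """Per-test level that keeps the familywise error at `target` (independent tests)."""
    return 1 - (1 - target) ** (1 / n_tests)


def per_test_alpha_bonferroni(target, n_tests):
    """Bonferroni approximation: divide the target level by the number of tests."""
    return target / n_tests


n_tests = 14 * 11  # 14 algorithm variants on 11 data sets = 154 comparisons

expected_by_chance = n_tests * 0.05                 # about 7.7
fwer = familywise_error(0.05, n_tests)              # about 0.9996
exact = per_test_alpha_exact(0.05, n_tests)         # about 0.00033
bonf = per_test_alpha_bonferroni(0.05, n_tests)     # about 0.00032

print(f"expected significant by chance: {expected_by_chance:.1f}")
print(f"familywise error at alpha=0.05: {fwer:.4f}")
print(f"per-test alpha (exact): {exact:.6f}")
print(f"per-test alpha (Bonferroni): {bonf:.6f}")
```

The same helper covers the repeated-tuning case: ten tuning experiments at a target level of 0.05 give a per-test level of 0.05 / 10 = 0.005.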
  • 11. Problem 3 : Experiments are not independent ● The t-test assumes that the test sets for each algorithm are independent. ● Generally two algorithms are compared on the same data set ○ Obviously the test sets are not independent.
  • 12. Problem 4 : Only considers overall accuracy ● A comparison must consider four numbers when a common test set is used for comparing two algorithms ○ A got right and B got wrong ( A > B ) ○ B got right and A got wrong ( B > A ) ○ Both algorithms got right ○ Both algorithms got wrong ● If only two algorithms are compared ○ Throw out ties ○ Compare A > B vs B > A ● If more than two algorithms are compared ○ Use “Analysis of Variance” (ANOVA) ○ Apply a Bonferroni adjustment for multiple tests
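One way to carry out the two-algorithm comparison described on this slide is an exact two-sided binomial (sign) test on the examples where the algorithms disagree. The slide does not name a specific test, so the choice here is an assumption, and the function names and toy labels are hypothetical.

```python
from math import comb


def discordant_counts(y_true, pred_a, pred_b):
    """Return (A right and B wrong, B right and A wrong); ties are thrown out."""
    a_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if b == t and a != t)
    return a_only, b_only


def sign_test_p_value(a_only, b_only):
    """Exact two-sided binomial test with success probability 0.5.

    Null hypothesis: on a disagreement, A and B are equally likely to be right.
    """
    n = a_only + b_only
    if n == 0:
        return 1.0  # the algorithms never disagree, so there is no evidence
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy example: five test cases with made-up labels and predictions.
y_true = [1, 1, 0, 0, 1]
pred_a = [1, 1, 0, 1, 1]
pred_b = [1, 0, 0, 0, 0]

a_only, b_only = discordant_counts(y_true, pred_a, pred_b)
print(a_only, b_only)            # 2 1
print(sign_test_p_value(8, 2))   # 0.109375
```

With 8 wins for A against 2 for B the p-value is about 0.11, so even a lopsided split on ten disagreements is not significant at the 0.05 level.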
  • 13. Problem 5 : Repeated tuning ● Researchers tune their algorithms repeatedly to perform optimally on a data set. ● Whenever tuning takes place, every adjustment should really be considered a separate experiment. ○ For example, if 10 tuning experiments were attempted, then the significance level should be 0.005 (0.05 / 10) instead of 0.05. ● When one uses an algorithm that has been used before, the algorithm may already have been tuned on public databases.
  • 14. Problem 5 : Repeated tuning ● Recommended approach: ○ Reserve a portion of the training set as a tuning set ○ Repeatedly test the algorithm and adjust its parameters on the tuning set. ○ Measure accuracy on the test data.
  • 15. Problem 6 : Generalizing results ● Common methodological approach ○ pick several datasets from the UCI repository ○ perform a series of experiments ■ measuring classification accuracy ■ learning rates ● It is not valid to make general statements about other datasets from such results. ○ The repository is not an unbiased sample of classification problems. ● Someone can write an algorithm that works very well on some of the known datasets ○ Anyone familiar with the data may be biased.
  • 16. A Recommended Approach 1. Choose other algorithms to include in the comparison. 2. Choose a benchmark data set. 3. Divide the data set into k subsets for cross-validation ○ Typically k = 10 ○ For small data sets, choose a larger k.
  • 17. A Recommended Approach 4. Run cross-validation ○ For each of the k subsets of the data set D, create a training set T = D minus that subset ○ Divide T into two subsets: T1 (training) and T2 (tuning) ○ Once the parameters are optimized, re-run training on all of T ○ Measure accuracy on the held-out subset ○ Overall accuracy is averaged across all k partitions. 5. Compare algorithms ● In the case of multiple data sets, a Bonferroni adjustment should be applied.
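Steps 3 and 4 of the recommended approach can be sketched in plain Python. This is an illustration under several assumptions, not the procedure from the paper verbatim: the learner is a made-up one-dimensional threshold classifier with a single tunable `margin`, the data are synthetic, and the 80/20 split of T into T1 and T2 is an arbitrary choice. All function names are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_threshold(X, y, margin):
    """Toy learner: threshold at the midpoint of the class means, shifted by margin."""
    xs0 = [x for x, t in zip(X, y) if t == 0]
    xs1 = [x for x, t in zip(X, y) if t == 1]
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2 + margin
    return lambda x: int(x > threshold)


def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)


def cross_validate(X, y, k=10, margins=(-0.5, 0.0, 0.5), seed=0):
    folds = k_fold_indices(len(X), k, seed)

    def take(ids):
        return [X[j] for j in ids], [y[j] for j in ids]

    scores = []
    for i, test_idx in enumerate(folds):
        # T = D minus the held-out subset.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        # Split T into T1 (training) and T2 (tuning); 80/20 is an assumption.
        cut = int(0.8 * len(train_idx))
        X1, y1 = take(train_idx[:cut])
        X2, y2 = take(train_idx[cut:])
        # Tune the parameter on T2 only; the held-out subset is never touched.
        best = max(margins, key=lambda m: accuracy(fit_threshold(X1, y1, m), X2, y2))
        # Re-train on all of T with the chosen parameter, then test once.
        model = fit_threshold(*take(train_idx), best)
        scores.append(accuracy(model, *take(test_idx)))
    return sum(scores) / len(scores)


# Synthetic data: two overlapping Gaussian classes (an assumption for the demo).
rng = random.Random(42)
X = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(2, 1) for _ in range(100)]
y = [0] * 100 + [1] * 100

mean_accuracy = cross_validate(X, y, k=10)
print(f"10-fold accuracy: {mean_accuracy:.3f}")
```

The point of the structure is that parameter selection only ever sees T2, so the accuracy measured on each held-out subset is not inflated by tuning.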
  • 18. Conclusion ● The author does not mean to discourage empirical comparisons ● The paper provides suggestions for avoiding pitfalls ● It suggests that ○ Statistical tools should be used carefully. ○ Every detail of the experiment should be reported.