CS550 Presentation - On Comparing Classifiers by Salzberg
A presentation of the paper written by Steven L. Salzberg on comparing classifiers.

Published in: Data & Analytics



  • 1. On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach (cited by 581) Author: Steven L. Salzberg Presented by: Mehmet Ali Abbasoğlu & Mustafa İlker Saraç 10.04.2014
  • 2. Contents 1. Motivation 2. Comparing Algorithms 3. Definitions 4. Problems 5. Recommended Approach 6. Conclusion
  • 3. Motivation ● Be careful about comparative studies of classification and other algorithms. ○ It is easy to arrive at statistically invalid conclusions. ● How to choose which algorithm to use for a new problem? ● Using brute force, one can easily find a phenomenon or pattern that looks impressive. ○ REALLY?
  • 4. Motivation ● You have lots of data ○ Choose a data set from the UCI repository ● You have many classification methods to compare But, ● Should any difference in classification accuracy that reaches statistical significance be reported as important? ○ Think again!
  • 5. Comparing Algorithms ● Many new algorithms have problems, according to a survey conducted by Prechelt. ○ 29% were not evaluated on a real problem ○ Only 8% were compared to more than one alternative on real data ● A survey by Flexer on experimental neural network papers in leading journals ○ Only 3 out of 43 used a separate data set for tuning parameters.
  • 6. Comparing Algorithms ● Drawbacks of reporting results on a well-studied data set, e.g. a data set from the UCI repository ○ It is hard to improve on existing results ○ Prone to statistical accidents ○ They are fine for seeing initial results for your new algorithm ● It seems easy to change known algorithms a little and then use comparisons to report improved results. ○ High risk of statistical invalidity ○ Better to apply genuinely new algorithms
  • 7. Definitions ● Statistical significance ○ In statistics, a result is considered significant not because it is important or meaningful, but because it is judged unlikely to have occurred by chance alone. ● t-test ○ Used to determine whether two sets of data are significantly different from each other ● p-value ○ The probability of observing a result at least as extreme as the actual one, assuming the null hypothesis is true. ● null hypothesis ○ The default position, e.g. that there is no real difference between the algorithms being compared
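As a concrete illustration of the t-test definition above, a paired t-test on per-fold accuracies can be sketched in plain Python (the accuracy numbers below are invented purely for illustration):

```python
import math
import statistics

def paired_t_statistic(xs, ys):
    """Paired t-test statistic for two matched samples.

    Tests the null hypothesis that the mean per-pair difference is zero.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)        # sample standard deviation (n - 1)
    return mean / (sd / math.sqrt(n))   # t with n - 1 degrees of freedom

# Hypothetical per-fold accuracies of algorithms A and B on the same 10 folds
acc_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
acc_b = [0.78, 0.77, 0.80, 0.79, 0.81, 0.76, 0.80, 0.82, 0.78, 0.79]

t = paired_t_statistic(acc_a, acc_b)
# Compare |t| against the critical value for 9 degrees of freedom
# (about 2.262 at the 0.05 level, two-sided) to decide significance.
```

Note that this is the textbook formula, not anything specific to the paper; as the later slides argue, applying such a test naively to cross-validation folds has its own pitfalls.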
  • 8. Problem 1 : Small repository of datasets ● It is difficult to produce major new results using well-studied and widely shared data. ● Suppose 100 people are studying the effect of algorithms A and B ● Even if there is no real difference, about 5 of them can be expected to get results statistically significant at p <= 0.05 ● Clearly such results are due to chance alone. ○ The ones who get significant results will publish ○ While the others will simply move on to other experiments.
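The 100-researchers argument can be checked with a small simulation (illustrative only; it assumes each study independently yields a uniformly distributed p-value when no real effect exists):

```python
import random

random.seed(0)

def avg_false_positives(num_studies=100, alpha=0.05, trials=10000):
    """Average number of studies reaching p <= alpha purely by chance,
    assuming no real difference exists (p-values uniform on [0, 1])."""
    total = 0
    for _ in range(trials):
        total += sum(1 for _ in range(num_studies) if random.random() <= alpha)
    return total / trials

# With 100 independent studies of a non-existent effect, on average
# about 100 * 0.05 = 5 appear "significant" at the 0.05 level.
avg = avg_false_positives()
```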
  • 9. Problem 2 : Statistical validity ● Statistics offers many tests that are designed to measure the significance of any difference ● These tests were not designed with computational experiments in mind. ● For example ○ 14 different variations of classifier algorithms ○ 11 different datasets ○ 154 comparisons, hence 154 chances to reach significance ○ The expected number of spuriously significant results is 154 * 0.05 = 7.7 ○ the multiplicity effect
  • 10. Problem 2 : Statistical validity ● Let the significance level for each test be α ● The chance of drawing the right conclusion from one experiment is (1 - α) ● Assuming the experiments are independent of one another, the chance of getting all n experiments correct is (1 - α)^n ● The chance of making at least one incorrect conclusion is 1 - (1 - α)^n ● Substituting α = 0.05 and n = 154, the chance of making an incorrect conclusion is 0.9996 ● To obtain results significant at the 0.05 level with 154 tests: 1 - (1 - α)^n < 0.05, so α < 0.0003 ● This adjustment is known as the Bonferroni adjustment.
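The arithmetic on this slide can be verified directly; a minimal sketch:

```python
# Family-wise error rate for n independent tests, each at level alpha
alpha, n = 0.05, 154

# Chance of at least one spuriously "significant" result: roughly 0.9996,
# i.e. the experimenter is almost certain to "find" something.
p_at_least_one = 1 - (1 - alpha) ** n

# Per-test level alpha' needed to keep the family-wise rate below 0.05:
# 1 - (1 - alpha') ** n < 0.05  =>  alpha' < 1 - 0.95 ** (1 / n)
alpha_adjusted = 1 - (1 - alpha) ** (1 / n)  # roughly 0.0003
```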
  • 11. Problem 3 : Experiments are not independent ● The t-test assumes that the test sets for each algorithm are independent. ● Generally two algorithms are compared on the same data set ○ Obviously the test sets are not independent.
  • 12. Problem 4 : Only considers overall accuracy ● The comparison must consider 4 numbers when a common test set is used for comparing two algorithms ○ A got it right and B got it wrong ( A > B ) ○ B got it right and A got it wrong ( B > A ) ○ Both algorithms got it right ○ Both algorithms got it wrong ● If only two algorithms are compared ○ Throw out ties ○ Compare A > B vs B > A ● If more than two algorithms are compared ○ Use “Analysis of Variance” (ANOVA) ○ Bonferroni adjustment for multiple tests
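For the two-algorithm case described above (discard ties, compare A > B wins against B > A wins), one standard way to assess the split is a binomial sign test; a sketch with made-up counts:

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided sign test on the cases where exactly one algorithm was
    right (ties discarded): how surprising is this win split if the two
    algorithms were equivalent, i.e. each win a fair coin flip?"""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided, in one direction
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # two-sided, capped at 1

# Hypothetical outcome on a shared test set: A alone right 30 times,
# B alone right 12 times (both-right and both-wrong cases are ties).
p = sign_test_p_value(30, 12)
```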
  • 13. Problem 5 : Repeated tuning ● Researchers tune their algorithms repeatedly to perform optimally on a data set. ● Whenever tuning takes place, every adjustment should really be considered as a separate experiment. ○ For example if 10 tuning experiments were attempted, then p-value should be 0.005 instead of 0.05. ● When one uses an algorithm that has been used before, the algorithm may already have been tuned on public databases.
  • 14. Problem 5 : Repeated tuning ● Recommended approach: ○ Reserve a portion of the training set as a tuning set ○ Repeatedly test the algorithm and adjust parameters on tuning set. ○ Measure accuracy on the test data.
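The tuning-set idea above can be sketched as follows (the 20% tuning fraction is an arbitrary choice for illustration, not from the paper):

```python
import random

def split_with_tuning_set(data, tune_fraction=0.2, seed=42):
    """Reserve part of the training data for parameter tuning, so the
    test set is never touched while parameters are being adjusted."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * tune_fraction)
    tuning_set, training_set = shuffled[:cut], shuffled[cut:]
    return training_set, tuning_set

train, tune = split_with_tuning_set(list(range(100)))
# Adjust parameters using only `tune`; measure final accuracy on the
# held-out test data only once, after tuning is finished.
```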
  • 15. Problem 6 : Generalizing results ● Common methodological approach ○ pick several datasets from the UCI repository ○ perform a series of experiments ■ measuring classification accuracy ■ learning rates ● It is not valid to make general statements about other datasets. ○ The repository is not an unbiased sample of classification problems. ● Someone can write an algorithm that works very well on some of the known datasets ○ Anyone familiar with the data may be biased.
  • 16. A Recommended Approach 1. Choose other algorithms to include in the comparison. 2. Choose a benchmark data set. 3. Divide the data set into k subsets for cross validation ○ Typically k = 10 ○ For small data sets, choose a larger k.
  • 17. A Recommended Approach 4. Run cross-validation ○ For each of the k subsets of the data set D, create a training set T = D minus that subset ○ Divide T into two subsets: T1 (training) and T2 (tuning) ○ Once parameters are optimized, re-run training on the full set T ○ Measure accuracy on the held-out subset ○ Overall accuracy is averaged across all k partitions. 5. Compare algorithms ● In case of multiple data sets, the Bonferroni adjustment should be applied.
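The cross-validation loop in step 4 can be sketched as a partition generator (the learner-specific steps are left as comments, since they depend on the algorithm under test):

```python
def k_fold_partitions(data, k=10):
    """Yield (training, held_out) pairs for k-fold cross-validation:
    for each of the k disjoint folds, the training set is everything
    except that fold (T = D minus the held-out subset)."""
    folds = [data[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, held_out

# Sketch of the full loop from step 4:
# for training, held_out in k_fold_partitions(dataset):
#     1. split `training` into T1 (training) and T2 (tuning)
#     2. tune parameters on T2, then retrain on all of `training`
#     3. record accuracy on `held_out`
# Overall accuracy is the average over the k partitions.
```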
  • 18. Conclusion ● The author does not mean to discourage empirical comparisons ● He tries to provide suggestions to avoid pitfalls ● He suggests that ○ Statistical tools should be used carefully. ○ Every detail of the experiment should be reported.
  • 19. Thank you!