1.
On Comparing Classifiers:
Pitfalls to Avoid and a
Recommended Approach
(cited by 581)
Author: Steven L. Salzberg
Presented by: Mehmet Ali Abbasoğlu &
Mustafa İlker Saraç
10.04.2014
3.
Motivation
● Be careful with comparative studies of classification
and other algorithms.
○ It is easy to draw statistically invalid conclusions.
● How to choose which algorithm to use for a new
problem?
● Using brute force, one can easily find a phenomenon or
pattern that looks impressive.
○ REALLY?
4.
Motivation
● You have lots of data
○ Choose one from the UCI repository
● You have many classification methods to compare
But,
● Should any difference in classification accuracy that
reaches statistical significance be reported as important?
○ Think again!
5.
Comparing Algorithms
● According to a survey conducted by Prechelt, many
papers introducing new algorithms have evaluation problems:
○ 29% were not evaluated on any real problem
○ Only 8% were compared to more than one alternative on real
data
● A survey by Flexer of experimental neural network
papers in leading journals found that:
○ Only 3 out of 43 used a separate data set for tuning
parameters.
6.
Comparing Algorithms
● Drawbacks of reporting results on a well-studied data
set, e.g. a data set from the UCI repository:
○ It is hard to improve on existing results
○ Prone to statistical accidents
○ Such data sets are fine for initial results on a new
algorithm
● It seems easy to change a known algorithm slightly and
then use comparisons to report improved results.
○ High risk of statistical invalidity
○ Better to apply genuinely new algorithms
7.
Definitions
● Statistical significance
○ In statistics, a result is considered significant not because
it is important or meaningful, but because it is unlikely
to have occurred by chance alone.
● t-test
○ Used to determine whether the means of two sets of data
differ significantly from each other.
● p-value
○ The probability of obtaining results at least as extreme as
those observed, assuming the null hypothesis is true.
● null hypothesis
○ The default position, e.g. that there is no difference
between the two algorithms being compared.
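To make the t-test and p-value definitions concrete, here is a minimal sketch using SciPy's paired t-test; the per-fold accuracy numbers are invented purely for illustration.

```python
# Minimal sketch (not from the slides): a paired t-test on two
# classifiers' per-fold accuracies. All numbers are invented.
from scipy import stats

acc_a = [0.81, 0.79, 0.84, 0.80, 0.83]  # algorithm A, one value per fold
acc_b = [0.78, 0.80, 0.79, 0.77, 0.81]  # algorithm B, same folds

# Null hypothesis: the two algorithms have the same mean accuracy.
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A small p-value means the observed difference would be unlikely under
# the null hypothesis -- it does not mean the difference is important.
```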
8.
Problem 1 :
Small repository of datasets
● It is difficult to produce major new results using
well-studied and widely shared data.
● Suppose 100 people are studying the difference between
algorithms A and B, which in fact perform identically.
● On average, 5 of them will get results statistically
significant at p <= 0.05 purely by chance.
● Clearly these results are due to chance.
○ The ones who get significant results will publish,
○ while the others simply move on to other experiments.
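A hypothetical simulation of this scenario (the distributions and fold counts are assumptions chosen only to illustrate the point):

```python
# 100 "research groups" each compare two algorithms that are in fact
# identical; on average about 5 of the 100 t-tests will come out
# significant at p <= 0.05 by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
significant = 0
for _ in range(100):
    # Both algorithms draw per-fold accuracies from the same distribution.
    acc_a = rng.normal(0.80, 0.02, size=10)
    acc_b = rng.normal(0.80, 0.02, size=10)
    _, p = stats.ttest_rel(acc_a, acc_b)
    if p <= 0.05:
        significant += 1
print(f"{significant} of 100 studies reached p <= 0.05 by chance")
```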
9.
Problem 2 :
Statistical validity
● Statistics offers many tests designed to measure
the significance of any difference.
● These tests were not designed with computational
experiments in mind.
● For example:
○ 14 different variations of classifier algorithms
○ 11 different datasets
○ 154 comparisons, hence 154 chances to reach significance
○ Expected number of spuriously "significant" results: 154 * 0.05 = 7.7
○ This is the multiplicity effect.
10.
Problem 2 :
Statistical validity
● Let the significance level for each test be α
● The chance of drawing the right conclusion in one experiment
is (1 - α)
● Assuming the experiments are independent of one another, the
chance of getting all n experiments correct is (1 - α)^n
● The chance of at least one incorrect conclusion is therefore
1 - (1 - α)^n
● Substituting α = 0.05 and n = 154,
● the chance of making an incorrect conclusion is 0.9996
● To obtain results significant at the 0.05 level with 154 tests:
1 - (1 - α)^n < 0.05
α < 0.0003
● This adjustment is known as the Bonferroni adjustment.
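The slide's arithmetic, worked through in a few lines (only the figures α = 0.05 and n = 154 come from the slides):

```python
# Family-wise error for n independent tests at per-test level alpha.
alpha, n = 0.05, 154

p_any_false = 1 - (1 - alpha) ** n        # chance of >= 1 wrong conclusion
print(f"P(at least one incorrect conclusion) = {p_any_false:.4f}")  # 0.9996

# Per-test level needed to keep the family-wise error below 0.05:
exact = 1 - (1 - 0.05) ** (1 / n)         # solves 1 - (1 - a)^n = 0.05
bonferroni = 0.05 / n                     # the simpler Bonferroni bound
print(f"exact: {exact:.5f}  Bonferroni: {bonferroni:.5f}")  # both ~0.0003
```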
11.
Problem 3 :
Experiments are not independent
● The t-test assumes that the test sets for
each algorithm are independent.
● In practice, two algorithms are usually compared on
the same data set.
○ Obviously, the test sets are then not independent.
12.
Problem 4 :
Only considers overall accuracy
● The comparison must consider four numbers when a common
test set is used to compare two algorithms:
○ A got right and B got wrong ( A > B )
○ B got right and A got wrong ( B > A )
○ Both algorithms got it right
○ Both algorithms got it wrong
● If only two algorithms are compared:
○ Throw out the ties
○ Compare the A > B count against the B > A count,
e.g. with a sign (binomial) test, as sketched below
● If more than two algorithms are compared:
○ Use "Analysis of Variance" (ANOVA)
○ Apply the Bonferroni adjustment for the multiple tests
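A sketch of the two-algorithm case; the counts are hypothetical, and the binomial (sign) test shown is one standard way to run the A > B vs. B > A comparison:

```python
# Throw out the ties, then test whether A beats B more often than chance.
from scipy.stats import binomtest

a_right_b_wrong = 30   # hypothetical count: A correct where B was wrong
b_right_a_wrong = 18   # hypothetical count: B correct where A was wrong
# Cases where both were right or both were wrong are discarded.

n_disagreements = a_right_b_wrong + b_right_a_wrong
# Null hypothesis: on the examples where the algorithms disagree, each
# is equally likely to be the correct one (p = 0.5).
result = binomtest(a_right_b_wrong, n_disagreements, p=0.5)
print(f"p = {result.pvalue:.4f}")
```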
13.
Problem 5 :
Repeated tuning
● Researchers tune their algorithms repeatedly to make them
perform optimally on a data set.
● Whenever tuning takes place, every adjustment should
really be considered a separate experiment.
○ For example, if 10 tuning experiments were
attempted, the significance threshold should be 0.005
instead of 0.05.
● When one uses an algorithm that has been used before,
the algorithm may already have been tuned on public
databases.
14.
Problem 5 :
Repeated tuning
● Recommended approach (sketched below):
○ Reserve a portion of the training set as a tuning set.
○ Repeatedly test the algorithm and adjust its parameters on the
tuning set.
○ Only afterwards measure accuracy on the test data.
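A minimal sketch of this split using scikit-learn; the data set, classifier, and parameter grid are illustrative assumptions, not from the paper:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Reserve part of the *training* data as a tuning set; the test set is
# never touched while parameters are being adjusted.
X_fit, X_tune, y_fit, y_tune = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

best_depth, best_acc = None, -1.0
for depth in [2, 4, 8, 16]:               # repeated tuning experiments
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    acc = clf.fit(X_fit, y_fit).score(X_tune, y_tune)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Only the final, frozen model ever sees the test data.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
print("test accuracy:", final.fit(X_train, y_train).score(X_test, y_test))
```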
15.
Problem 6 :
Generalizing results
● A common methodological approach:
○ pick several datasets from the UCI repository
○ perform a series of experiments
■ measuring classification accuracy
■ measuring learning rates
● It is not valid to make general statements about other
datasets.
○ The repository is not an unbiased sample of classification
problems.
● Someone familiar with the known datasets can write an
algorithm that works very well on some of them.
○ Anyone familiar with the data may be biased, even unintentionally.
16.
A Recommended Approach
1. Choose other algorithms to include in the comparison.
2. Choose a benchmark data set.
3. Divide the data set into k subsets for cross-validation.
○ Typically k = 10
○ For small data sets, choose a larger k.
17.
A Recommended Approach
4. Run cross-validation (sketched after step 5):
○ For each of the k subsets k_i of the data set D, create a training
set T = D - k_i
○ Divide T into two subsets: T1 (training) and T2 (tuning)
○ Once parameters are optimized, re-run training on the full set T
○ Measure accuracy on the held-out subset k_i
○ Overall accuracy is averaged across all k partitions.
5. Compare the algorithms.
● In the case of multiple data sets, the Bonferroni adjustment
should be applied.
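Putting steps 3 and 4 together, a sketch of the full loop; the classifier, parameter grid, and data set are again assumptions chosen for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
fold_acc = []

for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(X):
    # T = D - k_i : everything except the held-out subset k_i
    X_T, y_T = X[train_idx], y[train_idx]
    # Divide T into T1 (training) and T2 (tuning)
    X_t1, X_t2, y_t1, y_t2 = train_test_split(
        X_T, y_T, test_size=0.25, random_state=0)
    # Optimize the parameter on T2 only
    best_depth = max(
        [2, 4, 8, 16],
        key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
                      .fit(X_t1, y_t1).score(X_t2, y_t2))
    # Re-run training on the full T, then measure accuracy on k_i
    clf = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
    fold_acc.append(clf.fit(X_T, y_T).score(X[test_idx], y[test_idx]))

print(f"overall accuracy: {np.mean(fold_acc):.3f}")
```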
18.
Conclusion
● The author does not mean to discourage empirical
comparisons
● He instead offers suggestions for avoiding the pitfalls
● He suggests that:
○ Statistical tools should be used carefully.
○ Every detail of the experiment should be reported.