Predicting best classifier using properties of data sets

Predicting the Best Classiﬁer using Properties of
Datasets
Abhishek Vijayvargia
Supervised by: Prof. Harish Karnick
Department of Computer Science & Engineering
IIT Kanpur
June 24, 2015

Introduction and Background Data Properties Regression and Significance Testing Results Conclusion and Future Work
Outline
1 Introduction and Background
2 Data Properties
3 Regression and Significance Testing
4 Results
5 Conclusion and Future Work
Abhishek Vijayvargia Predicting the Best Classifier using Properties of Datasets 2/ 47

Outline
2 Data Properties
4 Results

Introduction
Classification techniques have application in different domains
Datasets contains mixture of nominal,integer,real and text attributes
Datasets have different properties
Classification algorithms performs differently on these datasets

Introduction
No single best algorithm exists(NFL)

Introduction
Cross validation is used to ﬁnd good algorithm

Introduction
Cross validation is used to ﬁnd good algorithm (time consuming)

Introduction
Cross validation is used to ﬁnd good algorithm (time consuming)
Takes more time with large datasets and algorithms

Introduction
Meta-Learning
Knowledge of datasets and performances of algorithms are stored

Introduction
Meta-Learning
Predict performance of algorithms

Introduction
Meta-Learning
Generate ranking

Introduction
Meta-Learning
Generate ranking
Top-k algorithms can be chosen

Introduction
Meta-Learning
Generate ranking
Top-k algorithms can be chosen
Problem Statement
Predict an optimal learning algorithm or nearly optimal learning
algorithms using a ranking paradigm in terms of performance for a
new data set by using the properties of the data set.

Related Work
Characterization of Classiﬁcation Algorithms [2]
Dataset characteristics
Simple Measures
Statistical Measures
Information Theoretic Measures

Related Work
Simple Measures
Used four types of models

Related Work
Simple Measures
Used four types of models
Partial Learning Curve [4]
Full learning curve from partial learning curve
Fraction of instances used (10%)
Predict best algorithm from two algorithms

Related Work
Meta-Analysis [3]
Meta Features
Simple, Statistical and Information Theoretic Measures
Model Based Measures
Landmarks
Classiﬁcation for algorithm selection
Synthetic dataset used

Related Work
Meta-Analysis [3]
Meta Features
Simple, Statistical and Information Theoretic Measures
Model Based Measures
Landmarks
Classiﬁcation for algorithm selection
Synthetic dataset used
Automatic Classiﬁer Selection for Non-Experts [5]
Meta Features
Predicted Accuracy by regression

Motivation
Empirical Comparison of Supervised Learning Algorithm [1]
Best methods performs poorly
Poor methods performs exceptionally well on some problems

Motivation
Empirical Comparison of Supervised Learning Algorithm [1]
Best methods performs poorly
Poor methods performs exceptionally well on some problems
Motivation to generate ranking of algorithms

Outline
2 Data Properties
4 Results

Histogram of Standard Deviation
Creating Histogram
K standard deviation values (numerical attributes)
H histogram bins
2 histogram per dataset (binary class)
Bins from range [0,0.5] are taken (Data range [0,1])

Histogram of Standard Deviation
Table 1: Standard Deviation Data
Class 0 0.228 0.215 0.2 0.187 0.135 0.15 0.116 0.154
Class 1 0.366 0.223 0.204 0.171 0.179 0.162 0.164 0.178
Table 2: Histogram
Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 0 2 3 3 0 0 0 0 0
Class 1 Histogram 0 0 0 5 2 0 0 1 0 0

Comparing Histograms
1-Norm Distance Based Comparison
Two datasets histograms compared on the basis of 1-Norm distance
Two pairwise comparison (one for each class) between datasets
Minimum distance score of two pairwise comparison is taken
Order datasets in increasing distance

Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 0 2 3 3 0 0 0 0 0
Class 1 Histogram 0 0 0 5 2 0 0 1 0 0
Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 1 0 5 2 0 0 0 0 0
Class 1 Histogram 0 1 1 1 2 3 0 0 0 0

Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 0 2 3 3 0 0 0 0 0
Class 1 Histogram 0 0 0 5 2 0 0 1 0 0
Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 1 0 5 2 0 0 0 0 0
Class 1 Histogram 0 1 1 1 2 3 0 0 0 0
Score 1 : Comparing Class 0 of Dataset-1 with Class 0 of Dataset-2 and Class 1 of Dataset-1 with Class 1 of
Dataset-2.

Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 0 2 3 3 0 0 0 0 0
Class 1 Histogram 0 0 0 5 2 0 0 1 0 0
Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 1 0 5 2 0 0 0 0 0
Class 1 Histogram 0 1 1 1 2 3 0 0 0 0
Dataset-2.
Score − 1 = (|0 − 0| + |0 − 1| + |2 − 0| + |3 − 5| + |3 − 2| + |0 − 0| + |0 − 0| + |0 − 0| + |0 − 0| + |0 − 0|)
+ (|0 − 0| + |0 − 1| + |0 − 1| + |5 − 1| + |2 − 2| + |0 − 3| + |0 − 0| + |1 − 0| + |0 − 0| + |0 − 0|)
= 6 + 10 = 16

Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 0 2 3 3 0 0 0 0 0
Class 1 Histogram 0 0 0 5 2 0 0 1 0 0
Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 1 0 5 2 0 0 0 0 0
Class 1 Histogram 0 1 1 1 2 3 0 0 0 0
Dataset-2.
Score − 1 = (|0 − 0| + |0 − 1| + |2 − 0| + |3 − 5| + |3 − 2| + |0 − 0| + |0 − 0| + |0 − 0| + |0 − 0| + |0 − 0|)
+ (|0 − 0| + |0 − 1| + |0 − 1| + |5 − 1| + |2 − 2| + |0 − 3| + |0 − 0| + |1 − 0| + |0 − 0| + |0 − 0|)
= 6 + 10 = 16
Dataset-2.

Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 0 2 3 3 0 0 0 0 0
Class 1 Histogram 0 0 0 5 2 0 0 1 0 0
Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 1 0 5 2 0 0 0 0 0
Class 1 Histogram 0 1 1 1 2 3 0 0 0 0
Dataset-2.
Score − 1 = (|0 − 0| + |0 − 1| + |2 − 0| + |3 − 5| + |3 − 2| + |0 − 0| + |0 − 0| + |0 − 0| + |0 − 0| + |0 − 0|)
+ (|0 − 0| + |0 − 1| + |0 − 1| + |5 − 1| + |2 − 2| + |0 − 3| + |0 − 0| + |1 − 0| + |0 − 0| + |0 − 0|)
= 6 + 10 = 16
Dataset-2.
Score − 2 = (|0 − 0| + |0 − 1| + |2 − 1| + |3 − 1| + |3 − 2| + |0 − 3| + |0 − 0| + |0 − 0| + |0 − 0| + |0 − 0|)
+ (|0 − 0| + |0 − 1| + |0 − 0| + |5 − 5| + |2 − 2| + |0 − 0| + |0 − 0| + |1 − 0| + |0 − 0| + |0 − 0|)
= 8 + 2 = 10

Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 0 2 3 3 0 0 0 0 0
Class 1 Histogram 0 0 0 5 2 0 0 1 0 0
Range .00-.05 .05-.10 0.10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50
Class 0 Histogram 0 1 0 5 2 0 0 0 0 0
Class 1 Histogram 0 1 1 1 2 3 0 0 0 0
Dataset-2.
Score − 1 = (|0 − 0| + |0 − 1| + |2 − 0| + |3 − 5| + |3 − 2| + |0 − 0| + |0 − 0| + |0 − 0| + |0 − 0| + |0 − 0|)
+ (|0 − 0| + |0 − 1| + |0 − 1| + |5 − 1| + |2 − 2| + |0 − 3| + |0 − 0| + |1 − 0| + |0 − 0| + |0 − 0|)
= 6 + 10 = 16
Dataset-2.
Score − 2 = (|0 − 0| + |0 − 1| + |2 − 1| + |3 − 1| + |3 − 2| + |0 − 3| + |0 − 0| + |0 − 0| + |0 − 0| + |0 − 0|)
+ (|0 − 0| + |0 − 1| + |0 − 0| + |5 − 5| + |2 − 2| + |0 − 0| + |0 − 0| + |1 − 0| + |0 − 0| + |0 − 0|)
= 8 + 2 = 10
Distance Score = min(Score-1,Score-2) = min(16,10) = 10

Kolmogorov Smirnov Test Based Comparison
Take value of histogram as a sample

Calculate proportion of each value in two samples

Calculate cumulative proportion of each sample

Calculate D statistics

Two pairwise comparison

Two pairwise comparison
Minimum distance score

Table 3: Komogorov Smirnov test
Bin Rage Histogram-1 Histogram-2 Proportion-1 Proportion-2 Cum.Pro.-1 Cum. Pro.-2 Diﬀerence
.00-.05 0 0 0 0 0 0 0
.05-.10 1 0 0.125 0 0.125 0 0.125
.10-.15 0 2 0 0.25 0.125 0.25 0.125
.15-.20 5 3 0.625 0.375 0.75 0.625 0.125
.20-.25 2 3 0.25 0.375 1 1 0
.25-.30 0 0 0 0 1 1 0
.30-.35 0 0 0 0 1 1 0
.35-.40 0 0 0 0 1 1 0
.40-.45 0 0 0 0 1 1 0
.45-.50 0 0 0 0 1 1 0

.00-.05 0 0 0 0 0 0 0
.05-.10 1 0 0.125 0 0.125 0 0.125
.10-.15 0 2 0 0.25 0.125 0.25 0.125
.15-.20 5 3 0.625 0.375 0.75 0.625 0.125
.20-.25 2 3 0.25 0.375 1 1 0
.25-.30 0 0 0 0 1 1 0
.30-.35 0 0 0 0 1 1 0
.35-.40 0 0 0 0 1 1 0
.40-.45 0 0 0 0 1 1 0
.45-.50 0 0 0 0 1 1 0
Two datasets can be compared in two ways

.00-.05 0 0 0 0 0 0 0
.05-.10 1 0 0.125 0 0.125 0 0.125
.10-.15 0 2 0 0.25 0.125 0.25 0.125
.15-.20 5 3 0.625 0.375 0.75 0.625 0.125
.20-.25 2 3 0.25 0.375 1 1 0
.25-.30 0 0 0 0 1 1 0
.30-.35 0 0 0 0 1 1 0
.35-.40 0 0 0 0 1 1 0
.40-.45 0 0 0 0 1 1 0
.45-.50 0 0 0 0 1 1 0
Total 2 D values for each comparison

.00-.05 0 0 0 0 0 0 0
.05-.10 1 0 0.125 0 0.125 0 0.125
.10-.15 0 2 0 0.25 0.125 0.25 0.125
.15-.20 5 3 0.625 0.375 0.75 0.625 0.125
.20-.25 2 3 0.25 0.375 1 1 0
.25-.30 0 0 0 0 1 1 0
.30-.35 0 0 0 0 1 1 0
.35-.40 0 0 0 0 1 1 0
.40-.45 0 0 0 0 1 1 0
.45-.50 0 0 0 0 1 1 0
Total 2 D values for each comparison
Sum both values and consider minimum D score mapping

Datasets Properties
CAV
Separate dataset based on it’s class

Datasets Properties
CAV
Apply KMeans Clustering

Datasets Properties
CAV
Cluster Properties

Datasets Properties
CAV
Cluster Properties
Number of instances in each cluster
Cluster Value = Ck = ¯x∈cluster dist(¯x, centroid)
Cluster Centroid

Datasets Properties
CAV
Cluster Properties
Number of instances in each cluster
Cluster Value = Ck = ¯x∈cluster dist(¯x, centroid)
Cluster Centroid
Moments of Data
Variance
Skewness
Kurtosis

Datasets Properties
Mixture Of Gaussians
Dataset may have overlapping clusters(non-circular shape)
For each attributes k diﬀerent Gaussian
Model ﬁt by Maximum likelihood of observed data
mean and variance of each Gaussian are stored

Datasets Properties
Mixture Of Gaussians
Dataset may have overlapping clusters(non-circular shape)
For each attributes k diﬀerent Gaussian
Model ﬁt by Maximum likelihood of observed data
mean and variance of each Gaussian are stored
Multivariate Gaussian Model
N(x; µ, ) = 1
(2π)d/2 | |
−1/2
e(−1
2 (x−µ) −1
(x−µ)T
)
µ is d-length row vector
σ is d × d matrix
Singular value decomposition of covariance matrix
Values from diagonal matrix and mean vector are stored

Outline
2 Data Properties
4 Results

Regression to Predict Performance Measures
Regression Analysis
Property vector populated by meta properties
Entire vector or sub-vector can be used to predict performance
measures
Regression model is given as Y = f (X, α)
Y is dependent variable (performance measure)
X is independent variables (property vector)
α is a vector of unknown parameters

Figure 1: Training

Figure 2: Testing

Statistical Signiﬁcance Testing
Comparison of predicted sequence with random sequence
Actual sequence is taken as baseline
Probability of at least one algorithm of top-k present in actual top k:
Probability = 1 − (none of algorithm present in topk)
= 1 −
n − k
n
×
n − k − 1
n − 1
× . . . ×
n − 2k + 1
n − k + 1
Expected Number of algorithm from random sequence present in
actual top k:
Expected Value = 1 ×
k
1
n−k
k−1
n
k
+ 2 ×
k
2
n−k
k−2
n
k
+ . . . +
+ . . . k ×
k
1−k
n−k
k−k
n
k

Statistical Significance Testing
Test 1
At least one algorithm of given top-k sequence is present in actual
top-k sequence
Statistical significance test between
Predicted rank-actual rank matches
Random rank-actual rank matches
Exclude prediction methods where the difference is not statistically
significant
Test 2
Number of algorithms of given top-k sequence is present in actual
top-k sequence
Statistical significance test between
Predicted rank-actual rank matches
Random rank-actual rank matches

Autoencoder
Learns to construct it’s own input
Increasing dimension of data by autoencoder
Decreasing dimension of data by autoencoder
Stacked autoencoder

Outline
2 Data Properties
4 Results

Data and Algorithm Set
Real world datasets
44 binary datasets from UCI, tunedIT, keel, delve
Synthetic datasets
484 datasets are generated using univariate and multivariate
distribution
DA1 : 13 classification algorithms and 44 real datasets
DA2 : 70 classification algorithms and 44 real datasets
DA2 : 70 classification algorithms and 484 synthetic datasets

Data Cleaning
Steps for Data Cleaning
Nominal to binary 0/1 attribute
PCA with maximum attributes as 8
Normalization has been done on these reduced sets of attributes
using (value−min)
(max−min) .
Class attributes are renamed to 0 and 1

Predicting Ranking from Histogram Bin
Table 4: Predicting Ranking from histogram bins using 1-Norm distance on DA1
Count Random Probability 1-Norm distance Conﬁdence for alternative hypothesis
1 0.076923077 0.090909091 0.5583927
2 0.294871795 0.295454545 0.44665546
3 0.58041958 0.659090909 0.8166996
4 0.823776224 0.795454545 0.2378282
5 0.956487956 1 0.8587793
6 0.995920746 1 0.164608
7 1 1 0
Table 5: Predicting Ranking from histogram bins using Kolmogorov Smirnov on
DA1
Count Random Probability Kolmogorov smirnov test Conﬁdence for alternative hypothesis
1 0.076923077 0.090909091 0.7518056
2 0.294871795 0.340909091 0.6987076
3 0.58041958 0.636363636 0.7234472
4 0.823776224 0.863636364 0.6779081
5 0.956487956 0.977272727 0.5761086
6 0.995920746 1 0.164608
7 1 1 0

Table 6: Predicting Ranking from histogram bins using DA2
Count Random Probability 1-Norm distance Conﬁdence for alter. hypo. Kolmogorov smirnov test Conﬁdence for alter. hypo.
1 0.014285714 0 0 0 0
2 0.056728778 0.022727273 0.07656138 0.022727273 0.07656138
3 0.124862989 0.068181818 0.07501977 0.090909091 0.1837721
4 0.213955796 0.159090909 0.1402761 0.159090909 0.1402761
5 0.317534624 0.340909091 0.5753865 0.272727273 0.213988
6 0.428182857 0.454545455 0.5822902 0.363636364 0.1543545
7 0.538469854 0.522727273 0.3581275 0.454545455 0.1025831
8 0.641846095 0.659090909 0.5264127 0.681818182 0.6489032
9 0.733341187 0.818181818 0.8664682 0.795454545 0.77316
10 0.809949161 0.863636364 0.756462 0.863636364 0.756462
11 0.870659846 0.909090909 0.688802 0.886363636 0.5115497
12 0.916175987 0.954545455 0.7251068 0.977272727 0.8932743
13 0.948419826 0.977272727 0.6699302 0.977272727 0.6699302
14 0.969963161 1 0.7386451 1 0.7386451
15 0.983506557 1 0.5189398 1 0.5189398

Table 7: Predicting Ranking using from histogram bins DA3
Count Random Probability 1-Norm distance Conﬁdence for alter. hypo. Kolmogorov smirnov test Conﬁdence for alter. hypo.
1 0.014285714 0 0 0 0
2 0.056728778 0 0 0 0
3 0.124862989 0.008264463 0 0.010330579 0
4 0.213955796 0.068181818 0 0.07231405 0
5 0.317534624 0.150826446 0 0.150826446 0
6 0.428182857 0.268595041 5.64E-14 0.27892562 3.98E-12
7 0.538469854 0.400826446 2.69E-10 0.404958678 1.49E-09
8 0.641846095 0.541322314 1.44E-06 0.530991736 2.25E-07
9 0.733341187 0.683884298 0.0066866 0.681818182 0.00504363
10 0.809949161 0.780991736 0.04829307 0.783057851 0.04829307
11 0.870659846 0.863636364 0.2945404 0.863636364 0.2945404
12 0.916175987 0.919421488 0.5609581 0.919421488 0.5609581
13 0.948419826 0.960743802 0.8715195 0.962809917 0.8715195
14 0.969963161 0.975206612 0.6958514 0.977272727 0.6958514
15 0.983506557 0.981404959 0.2801928 0.983471074 0.2801928

Predicting Ranking by Data Property Vector
Steps
These properties are considered:
Histogram of standard deviation of each class.
Cluster Analysis Vector(CVV).
Moments of data.
Mixture of Gaussian on each attribute.
Mean and vector of diagonal entries from Sigma matrix obtained by
singular value decomposition (SVD) of the covariance matrix of the
multivariate Gaussian model of the dataset.

Steps
Moments of data.
Property vector as independent variable and accuracy as dependent
variable in regression

Steps
Moments of data.
One regression model for each classiﬁer

Steps
Moments of data.
One regression model for each classiﬁer
Result compared with random sequence

Table 8: Test-1 using Data Characteristics forDA1
Value of k
Random
Probability
Gaussian
Processes
p-value for
null hypo.
IBk
p-value for
null hypo.
1 0.076923077 0.136363636 0.2481944 0.25 0.000392335
2 0.294871795 0.522727273 0.001277486 0.454545455 0.01802491
3 0.58041958 0.795454545 0.006170564 0.704545455 0.06290122
Table 9: Test-2 using Data Characteristics for DA1
Count
Random
Probability
Gaussian
Processes
p-value for
null hypo.
IBk
p-value for
null hypo.
1 0.076923077 0.136363636 0.2481944 0.25 0.000392335
2 0.153846154 0.284090909 0.002996455 0.25 0.0128041
3 0.230769231 0.386363636 0.000394481 0.363636364 0.000394481
4 0.307692308 0.443181818 0.000343617 0.403409091 0.0159287
5 0.384615385 0.518181818 0.00032955 0.440909091 0.08601391
6 0.461538462 0.549242424 0.003806131 0.53030303 0.02680756

Count
Random
Probability
Gaussian
Processes
p-value for
null hypo.
IBk
p-value for
null hypo.
1 0.014285714 0.045454545 0.469059 0.068181818 0.130488
2 0.056728778 0.227272727 0.000702291 0.295454545 4.23E-06
3 0.124862989 0.318181818 0.002140653 0.295454545 0.0063842
4 0.213955796 0.431818182 0.000962345 0.409090909 0.006957731
5 0.317534624 0.590909091 0.000168674 0.477272727 0.03942881
6 0.428182857 0.636363636 0.01014022 0.613636364 0.01014022
Count
Random
Probability
Gaussian
Processes
p-value for
null hypo.
IBk
p-value for
null hypo.
5 0.071428571 0.195454545 4.80E-08 0.25 1.92E-16
10 0.142857143 0.281818182 1.03E-12 0.327272727 8.45E-21
15 0.214285714 0.363636364 1.32E-18 0.366666667 1.32E-18
20 0.285714286 0.454545455 1.86E-26 0.422727273 3.23E-15
25 0.357142857 0.512727273 2.45E-22 0.470909091 1.99E-11
30 0.428571429 0.575 1.61E-24 0.543939394 3.85E-12
35 0.5 0.646753247 3.16E-27 0.607792208 5.06E-13

Count
Random
Probability
Gaussian
Processes
p-value for
null hypo.
IBk
p-value for
null hypo.
1 0.014285714 0.148760331 5.27E-49 0.237603306 2.57E-101
2 0.056728778 0.384297521 1.92E-101 0.48553719 1.66E-154
3 0.124862989 0.520661157 7.63E-97 0.657024793 8.85E-163
4 0.213955796 0.597107438 1.49E-73 0.766528926 4.31E-148
5 0.317534624 0.665289256 1.05E-54 0.830578512 3.11E-119
6 0.428182857 0.710743802 3.06E-36 0.873966942 5.45E-92
Count
Random
Probability
Gaussian
Processes
p-value for
null hypo.
IBk
p-value for
null hypo.
5 0.071428571 0.321487603 1.43E-284 0.395041322 0
10 0.142857143 0.40661157 0 0.480991736 0
15 0.214285714 0.478236915 0 0.534710744 0
20 0.285714286 0.517768595 0 0.577582645 0
25 0.357142857 0.573636364 0 0.622479339 0
30 0.428571429 0.631060606 0 0.666804408 0
35 0.5 0.687839433 0 0.717532468 0

Figure 3: Test-1 on DA1

Figure 7: Test-1 on DA3 Figure 8: Test-2 on DA3

Increasing Dimension of Data (Autoencoder)

Decreasing Dimension of Data (Autoencoder)

Stacked Autoencoder

Comparison with Previous Techniques
Test-1 : Difference between Accuracy
True accuracy and predicted accuracy of a algorithm on each test
dataset is compared
Average of absolute difference of all datasets and for all algorithms
is taken
Result of best regression technique is reported
Table 14: Difference Between Accuracy
Simple Statistical
Info
Theoretic
Model
Based
Landmark DCT ALL MDDF
Difference 0.0890 0.0915 0.0648 0.0859 0.0422 0.0747 0.0525 0.0426

Comparison with Previous Techniques
Test-2 : Rank Correlation
Spearman’s rank correlation coefficient between these two rankings
is calculated
Value of this coefficient is averaged over all data sets
Higher the value of this coefficient, the higher the correlation
between the actual rank and predicted rank
Table 15: Rank Correlation
Simple Statistical
Info
Theoretic
Model
Based
Landmark DCT ALL MDDF
Average
Value
0.464 0.444 0.488 0.431 0.495 0.459 0.488 0.520

Outline
2 Data Properties
4 Results

Conclusion
Methods to generate ranking of binary classiﬁcation algorithms
without running them
Intrinsic properties of data
Ranking of classiﬁers predicted via regression
Autoencoder used for more predictive analysis
Our approach give better result than previous techniques

Future Work
Multi-class classiﬁcation
Datasets can be grouped together based on domain knowledge
Other performance measures like precision, recall and f-measure can
be used

Thank you!

References I
[1] R. Caruana and A. Niculescu-Mizil.
An empirical comparison of supervised learning algorithms.
In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages
161–168, New York, NY, USA, 2006. ACM.
[2] J. Gama and P. Brazdil.
Characterization of classification algorithms.
In Proceedings of the 7th Portuguese Conference on Artificial Intelligence: Progress in
Artificial Intelligence, EPIA ’95, pages 189–200, London, UK, UK, 1995. Springer-Verlag.
[3] C. Kpf, C. Taylor, and J. Keller.
Meta-analysis: From data characterisation for meta-learning to meta-regression.
In Proceedings of the PKDD-00 Workshop on Data Mining, Decision Support,Meta-Learning
and ILP.
[4] R. Leite and P. Brazdil.
An iterative process for building learning curves and predicting relative performance of
classifiers.
In J. Neves, M. Santos, and J. Machado, editors, Progress in Artificial Intelligence, volume
4874 of Lecture Notes in Computer Science, pages 87–98. Springer Berlin Heidelberg, 2007.
[5] M. Reif, F. Shafait, M. Goldstein, T. Breuel, and A. Dengel.
Automatic classifier selection for non-experts.
Pattern Analysis and Applications, 17(1):83–96, 2014.

Predicting best classifier using properties of data sets

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (7)

Similar to Predicting best classifier using properties of data sets

Similar to Predicting best classifier using properties of data sets (20)

Recently uploaded

Recently uploaded (20)

Predicting best classifier using properties of data sets