Predicting the Best Classifier using Properties of Datasets
Abhishek Vijayvargia
Supervised by: Prof. Harish Karnick
Department of Computer Science & Engineering
IIT Kanpur
June 24, 2015
Introduction and Background · Data Properties · Regression and Significance Testing · Results · Conclusion and Future Work
Outline
1 Introduction and Background
2 Data Properties
3 Regression and Significance Testing
4 Results
5 Conclusion and Future Work
Abhishek Vijayvargia · Predicting the Best Classifier using Properties of Datasets
Introduction
Classification techniques have applications in different domains
Datasets contain a mixture of nominal, integer, real and text attributes
Datasets have different properties
Classification algorithms perform differently on these datasets
No single best algorithm exists (No Free Lunch)
Cross-validation is used to find a good algorithm (time consuming)
It takes even more time with large datasets and many algorithms
Meta-Learning
Knowledge of datasets and the performance of algorithms is stored
Predict the performance of algorithms
Generate a ranking
The top-k algorithms can be chosen
Problem Statement
Predict an optimal, or nearly optimal, learning algorithm in terms of performance for a new dataset, using a ranking paradigm and the properties of the dataset.
Related Work
Characterization of Classification Algorithms [2]
Dataset characteristics
Simple Measures
Statistical Measures
Information Theoretic Measures
Used four types of models
Partial Learning Curve [4]
Full learning curve predicted from a partial learning curve
Fraction of instances used (10%)
Predict the better of two algorithms
Meta-Analysis [3]
Meta Features
Simple, Statistical and Information Theoretic Measures
Model Based Measures
Landmarks
Classification for algorithm selection
Synthetic datasets used
Automatic Classifier Selection for Non-Experts [5]
Meta Features
Accuracy predicted by regression
Motivation
Empirical Comparison of Supervised Learning Algorithms [1]
The best methods perform poorly on some problems
Poor methods perform exceptionally well on some problems
Motivation to generate a ranking of algorithms
Histogram of Standard Deviation
Creating the Histogram
K standard deviation values (one per numerical attribute)
H histogram bins
2 histograms per dataset (binary class)
Bins over the range [0, 0.5] are taken (the data are normalized to [0, 1])
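The construction above can be sketched as follows; the bin count and value range follow the slide, while the function name and toy shapes are illustrative:

```python
import numpy as np

def std_histograms(X, y, n_bins=10, value_range=(0.0, 0.5)):
    """Build one histogram of per-attribute standard deviations per class.

    X : (n_samples, n_features) array of attributes normalized to [0, 1]
    y : (n_samples,) array of binary class labels (0/1)
    Returns {class_label: histogram of the K std values over H bins}.
    """
    hists = {}
    for label in np.unique(y):
        stds = X[y == label].std(axis=0)   # K values, one per attribute
        hist, _ = np.histogram(stds, bins=n_bins, range=value_range)
        hists[label] = hist
    return hists
```

For a binary dataset this yields exactly the two 10-bin histograms shown in Table 2.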
Table 1: Standard Deviation Data

Class 0: 0.228  0.215  0.200  0.187  0.135  0.150  0.116  0.154
Class 1: 0.366  0.223  0.204  0.171  0.179  0.162  0.164  0.178

Table 2: Histogram

Range:    .00-.05  .05-.10  .10-.15  .15-.20  .20-.25  .25-.30  .30-.35  .35-.40  .40-.45  .45-.50
Class 0:     0        0        2        3        3        0        0        0        0        0
Class 1:     0        0        0        5        2        0        0        1        0        0
Comparing Histograms
1-Norm Distance Based Comparison
The histograms of two datasets are compared using the 1-norm distance
Two pairwise comparisons (one per class pairing) between the datasets
The minimum distance score of the two pairwise comparisons is taken
Datasets are ordered by increasing distance
Dataset-1:
Range:    .00-.05  .05-.10  .10-.15  .15-.20  .20-.25  .25-.30  .30-.35  .35-.40  .40-.45  .45-.50
Class 0:     0        0        2        3        3        0        0        0        0        0
Class 1:     0        0        0        5        2        0        0        1        0        0

Dataset-2:
Range:    .00-.05  .05-.10  .10-.15  .15-.20  .20-.25  .25-.30  .30-.35  .35-.40  .40-.45  .45-.50
Class 0:     0        1        0        5        2        0        0        0        0        0
Class 1:     0        1        1        1        2        3        0        0        0        0
Score 1: Compare Class 0 of Dataset-1 with Class 0 of Dataset-2, and Class 1 of Dataset-1 with Class 1 of Dataset-2.

Score-1 = (|0−0| + |0−1| + |2−0| + |3−5| + |3−2| + |0−0| + |0−0| + |0−0| + |0−0| + |0−0|)
        + (|0−0| + |0−1| + |0−1| + |5−1| + |2−2| + |0−3| + |0−0| + |1−0| + |0−0| + |0−0|)
        = 6 + 10 = 16

Score 2: Compare Class 0 of Dataset-1 with Class 1 of Dataset-2, and Class 1 of Dataset-1 with Class 0 of Dataset-2.

Score-2 = (|0−0| + |0−1| + |2−1| + |3−1| + |3−2| + |0−3| + |0−0| + |0−0| + |0−0| + |0−0|)
        + (|0−0| + |0−1| + |0−0| + |5−5| + |2−2| + |0−0| + |0−0| + |1−0| + |0−0| + |0−0|)
        = 8 + 2 = 10

Distance Score = min(Score-1, Score-2) = min(16, 10) = 10
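The full comparison can be sketched in a few lines; the dictionary layout (class label to histogram) is an assumed convention, and the arrays reproduce the worked example above:

```python
import numpy as np

def histogram_distance(d1, d2):
    """Minimum 1-norm distance between two binary-class datasets'
    std-dev histograms, over both possible class pairings."""
    direct = np.abs(d1[0] - d2[0]).sum() + np.abs(d1[1] - d2[1]).sum()
    crossed = np.abs(d1[0] - d2[1]).sum() + np.abs(d1[1] - d2[0]).sum()
    return min(direct, crossed)

# The worked example from the slides:
ds1 = {0: np.array([0, 0, 2, 3, 3, 0, 0, 0, 0, 0]),
       1: np.array([0, 0, 0, 5, 2, 0, 0, 1, 0, 0])}
ds2 = {0: np.array([0, 1, 0, 5, 2, 0, 0, 0, 0, 0]),
       1: np.array([0, 1, 1, 1, 2, 3, 0, 0, 0, 0])}
print(histogram_distance(ds1, ds2))  # 10
```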
Kolmogorov-Smirnov Test Based Comparison
Take the values of each histogram as a sample
Calculate the proportion of each value in the two samples
Calculate the cumulative proportion of each sample
Calculate the D statistic
Two pairwise comparisons
Minimum distance score
Table 3: Kolmogorov-Smirnov test

Bin Range  Histogram-1  Histogram-2  Proportion-1  Proportion-2  Cum. Prop.-1  Cum. Prop.-2  Difference
.00-.05    0            0            0             0             0             0             0
.05-.10    1            0            0.125         0             0.125         0             0.125
.10-.15    0            2            0             0.25          0.125         0.25          0.125
.15-.20    5            3            0.625         0.375         0.75          0.625         0.125
.20-.25    2            3            0.25          0.375         1             1             0
.25-.30    0            0            0             0             1             1             0
.30-.35    0            0            0             0             1             1             0
.35-.40    0            0            0             0             1             1             0
.40-.45    0            0            0             0             1             1             0
.45-.50    0            0            0             0             1             1             0

Two datasets can be compared in two ways
This gives 2 D values per comparison
Sum both D values and take the class mapping with the minimum total D score
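The D statistic of Table 3 can be sketched as follows; the function name is illustrative, and the arrays are the two histograms from the table:

```python
import numpy as np

def ks_distance(h1, h2):
    """D statistic between two histograms, treating bin counts as samples.

    Proportions are cumulated per histogram; D is the maximum absolute
    difference between the two cumulative-proportion curves."""
    p1 = h1 / h1.sum()
    p2 = h2 / h2.sum()
    return np.abs(np.cumsum(p1) - np.cumsum(p2)).max()

h1 = np.array([0, 1, 0, 5, 2, 0, 0, 0, 0, 0])   # Histogram-1 from Table 3
h2 = np.array([0, 0, 2, 3, 3, 0, 0, 0, 0, 0])   # Histogram-2 from Table 3
print(ks_distance(h1, h2))  # 0.125, as in the Difference column
```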
Dataset Properties
Cluster Analysis Vector (CAV)
Separate the dataset based on its class
Apply K-Means clustering
Cluster Properties
Number of instances in each cluster
Cluster Value: C_k = Σ_{x̄ ∈ cluster k} dist(x̄, centroid_k)
Cluster Centroid
Moments of Data
Variance
Skewness
Kurtosis
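The CAV construction can be sketched with a minimal Lloyd iteration; k, the iteration count, and the function name are assumptions, not values from the thesis:

```python
import numpy as np

def cluster_analysis_vector(X, k=2, n_iter=20, seed=0):
    """Run K-Means on one class's instances and collect, per cluster:
    instance count, cluster value C_k (sum of distances to the centroid),
    and the centroid itself."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    props = []
    for j in range(k):
        members = X[labels == j]
        value = np.linalg.norm(members - centroids[j], axis=1).sum()  # C_k
        props.append((len(members), value, centroids[j]))
    return props
```

Running this once per class, and flattening the per-cluster triples, yields the CAV portion of the property vector.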
Mixture of Gaussians
Datasets may have overlapping clusters (non-circular shapes)
For each attribute, k different Gaussians are fit
The model is fit by maximum likelihood of the observed data
The mean and variance of each Gaussian are stored
Multivariate Gaussian Model

N(x; μ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2) (x − μ) Σ^(−1) (x − μ)^T)

μ is a d-length row vector
Σ is a d × d covariance matrix
Singular value decomposition of the covariance matrix
The values from the diagonal matrix and the mean vector are stored
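The multivariate-Gaussian property can be sketched as below; the function name is illustrative. For a symmetric positive semi-definite covariance matrix, the singular values returned by the SVD are its eigenvalues:

```python
import numpy as np

def multivariate_gaussian_property(X):
    """Fit a single multivariate Gaussian to X (rows are instances) and
    return the meta-property: the mean vector and the diagonal of the
    SVD of the covariance matrix (its singular values)."""
    mu = X.mean(axis=0)               # d-length mean vector
    sigma = np.cov(X, rowvar=False)   # d x d covariance matrix
    _, s, _ = np.linalg.svd(sigma)    # Sigma = U diag(s) V^T
    return mu, s                      # stored in the property vector
```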
Regression to Predict Performance Measures
Regression Analysis
The property vector is populated by the meta-properties
The entire vector, or a sub-vector, can be used to predict performance measures
The regression model is given as Y = f(X, α)
Y is the dependent variable (performance measure)
X is the independent variable (property vector)
α is a vector of unknown parameters
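The model Y = f(X, α) can be sketched with ordinary least squares; the linear form of f is an assumption for illustration (the framework allows any regressor):

```python
import numpy as np

def fit_linear(X, Y):
    """Least-squares estimate of alpha for Y = X' alpha, where X' is X
    with an added intercept column."""
    Xb = np.hstack([np.ones((len(X), 1)), X])      # add intercept
    alpha, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return alpha

def predict(X, alpha):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return Xb @ alpha
```

In training, X holds one property vector per dataset and Y the observed performance measure; in testing, a new dataset's property vector is fed through the fitted model.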
Figure 1: Training
Figure 2: Testing
Statistical Significance Testing
The predicted sequence is compared with a random sequence
The actual sequence is taken as the baseline
Probability that at least one algorithm of a random top-k is present in the actual top-k:

P = 1 − P(none of the k algorithms is present in the actual top-k)
  = 1 − ((n−k)/n) × ((n−k−1)/(n−1)) × ... × ((n−2k+1)/(n−k+1))

Expected number of algorithms from a random sequence present in the actual top-k:

E = Σ_{i=1}^{k} i × C(k, i) × C(n−k, k−i) / C(n, k)

where C(a, b) is the binomial coefficient.
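Both quantities above are sketched below with exact rational arithmetic; the overlap distribution is hypergeometric, so E reduces to k²/n. For DA1 (n = 13 algorithms), `prob_at_least_one(13, 1)` gives 1/13 ≈ 0.0769, matching the Random Probability column of Table 4:

```python
from fractions import Fraction
from math import comb

def prob_at_least_one(n, k):
    """P(a random top-k shares at least one algorithm with the actual top-k)."""
    none = Fraction(comb(n - k, k), comb(n, k))   # all k picks avoid the top-k
    return 1 - none

def expected_matches(n, k):
    """Expected overlap between a random top-k and the actual top-k."""
    return sum(Fraction(i * comb(k, i) * comb(n - k, k - i), comb(n, k))
               for i in range(1, k + 1))

# Sanity check against the hypergeometric mean k^2 / n:
print(expected_matches(13, 3) == Fraction(9, 13))   # True
```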
Test 1
At least one algorithm of the predicted top-k sequence is present in the actual top-k sequence
Statistical significance test between:
Predicted-rank / actual-rank matches
Random-rank / actual-rank matches
Prediction methods where the difference is not statistically significant are excluded
Test 2
Number of algorithms of the predicted top-k sequence present in the actual top-k sequence
Statistical significance test between:
Predicted-rank / actual-rank matches
Random-rank / actual-rank matches
Autoencoder
Learns to reconstruct its own input
Increasing the dimension of the data with an autoencoder
Decreasing the dimension of the data with an autoencoder
Stacked autoencoder
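A minimal single-hidden-layer autoencoder can be sketched in plain numpy; the architecture (tanh encoder, linear decoder) and all hyperparameters are illustrative choices, not from the thesis. Setting hidden > n_features gives the dimension-increasing variant, hidden < n_features the dimension-decreasing one:

```python
import numpy as np

def train_autoencoder(X, hidden=2, lr=0.1, epochs=500, seed=0):
    """Train a one-hidden-layer autoencoder to reconstruct X by
    full-batch gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)            # encode
        R = H @ W2 + b2                     # decode (reconstruction)
        err = R - X
        gW2 = H.T @ err / n; gb2 = err.mean(axis=0)
        dH = (err @ W2.T) * (1 - H ** 2)    # backprop through tanh
        gW1 = X.T @ dH / n; gb1 = dH.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

def reconstruction_error(X, params):
    W1, b1, W2, b2 = params
    R = np.tanh(X @ W1 + b1) @ W2 + b2
    return ((R - X) ** 2).mean()
```

Stacking amounts to training a second autoencoder on the hidden activations H of the first.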
Data and Algorithm Set
Real-world datasets
44 binary datasets from UCI, TunedIT, KEEL and Delve
Synthetic datasets
484 datasets generated using univariate and multivariate distributions
DA1: 13 classification algorithms and 44 real datasets
DA2: 70 classification algorithms and 44 real datasets
DA3: 70 classification algorithms and 484 synthetic datasets
Data Cleaning
Steps for Data Cleaning
Nominal attributes converted to binary 0/1 attributes
PCA with the maximum number of attributes set to 8
Normalization of the reduced set of attributes using (value − min) / (max − min)
Class attributes renamed to 0 and 1
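The PCA-then-normalize portion of the pipeline can be sketched as follows; PCA is done via SVD of the centered data, and nominal encoding is assumed to happen upstream:

```python
import numpy as np

def clean(X, max_components=8):
    """PCA down to at most 8 attributes, then min-max normalization of
    each reduced attribute to [0, 1]."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:max_components].T                  # PCA projection
    zmin, zmax = Z.min(axis=0), Z.max(axis=0)
    rng_ = np.where(zmax > zmin, zmax - zmin, 1.0)  # guard constant columns
    return (Z - zmin) / rng_                        # (value - min) / (max - min)
```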
Predicting Ranking from Histogram Bin
Table 4: Predicting ranking from histogram bins using 1-norm distance on DA1

Count  Random Probability  1-Norm Distance  Confidence for Alternative Hypothesis
1      0.076923077         0.090909091      0.5583927
2      0.294871795         0.295454545      0.44665546
3      0.58041958          0.659090909      0.8166996
4      0.823776224         0.795454545      0.2378282
5      0.956487956         1                0.8587793
6      0.995920746         1                0.164608
7      1                   1                0

Table 5: Predicting ranking from histogram bins using the Kolmogorov-Smirnov test on DA1

Count  Random Probability  Kolmogorov-Smirnov Test  Confidence for Alternative Hypothesis
1      0.076923077         0.090909091              0.7518056
2      0.294871795         0.340909091              0.6987076
3      0.58041958          0.636363636              0.7234472
4      0.823776224         0.863636364              0.6779081
5      0.956487956         0.977272727              0.5761086
6      0.995920746         1                        0.164608
7      1                   1                        0
Table 6: Predicting ranking from histogram bins using DA2

Count  Random Probability  1-Norm Distance  Conf. Alt. Hyp.  Kolmogorov-Smirnov  Conf. Alt. Hyp.
1      0.014285714         0                0                0                   0
2      0.056728778         0.022727273      0.07656138       0.022727273         0.07656138
3      0.124862989         0.068181818      0.07501977       0.090909091         0.1837721
4      0.213955796         0.159090909      0.1402761        0.159090909         0.1402761
5      0.317534624         0.340909091      0.5753865        0.272727273         0.213988
6      0.428182857         0.454545455      0.5822902        0.363636364         0.1543545
7      0.538469854         0.522727273      0.3581275        0.454545455         0.1025831
8      0.641846095         0.659090909      0.5264127        0.681818182         0.6489032
9      0.733341187         0.818181818      0.8664682        0.795454545         0.77316
10     0.809949161         0.863636364      0.756462         0.863636364         0.756462
11     0.870659846         0.909090909      0.688802         0.886363636         0.5115497
12     0.916175987         0.954545455      0.7251068        0.977272727         0.8932743
13     0.948419826         0.977272727      0.6699302        0.977272727         0.6699302
14     0.969963161         1                0.7386451        1                   0.7386451
15     0.983506557         1                0.5189398        1                   0.5189398
Table 7: Predicting ranking from histogram bins using DA3

Count  Random Probability  1-Norm Distance  Conf. Alt. Hyp.  Kolmogorov-Smirnov  Conf. Alt. Hyp.
1      0.014285714         0                0                0                   0
2      0.056728778         0                0                0                   0
3      0.124862989         0.008264463      0                0.010330579         0
4      0.213955796         0.068181818      0                0.07231405          0
5      0.317534624         0.150826446      0                0.150826446         0
6      0.428182857         0.268595041      5.64E-14         0.27892562          3.98E-12
7      0.538469854         0.400826446      2.69E-10         0.404958678         1.49E-09
8      0.641846095         0.541322314      1.44E-06         0.530991736         2.25E-07
9      0.733341187         0.683884298      0.0066866        0.681818182         0.00504363
10     0.809949161         0.780991736      0.04829307       0.783057851         0.04829307
11     0.870659846         0.863636364      0.2945404        0.863636364         0.2945404
12     0.916175987         0.919421488      0.5609581        0.919421488         0.5609581
13     0.948419826         0.960743802      0.8715195        0.962809917         0.8715195
14     0.969963161         0.975206612      0.6958514        0.977272727         0.6958514
15     0.983506557         0.981404959      0.2801928        0.983471074         0.2801928
Predicting Ranking by Data Property Vector
Steps
These properties are considered:
Histogram of the standard deviation of each class.
Cluster Analysis Vector (CVV).
Moments of the data.
Mixture of Gaussians on each attribute.
Mean and vector of diagonal entries from the Sigma matrix obtained by singular value decomposition (SVD) of the covariance matrix of the multivariate Gaussian model of the dataset.
The property vector is the independent variable and accuracy the dependent variable in regression
One regression model for each classifier
Results compared with a random sequence
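The steps above can be sketched in miniature as follows. All data here is hypothetical, and a single scalar property stands in for the full property vector (the real scheme fits a multivariate regression per classifier):

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single scalar property."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical training data: one scalar property per training dataset
# and each classifier's measured accuracy on it.
prop = [0.1, 0.4, 0.5, 0.9]
accuracy = {
    "GaussianProcesses": [0.60, 0.72, 0.75, 0.90],
    "IBk":               [0.80, 0.70, 0.68, 0.55],
}

# One regression model per classifier, as described above.
models = {clf: fit_line(prop, accs) for clf, accs in accuracy.items()}

def predicted_ranking(p):
    """Rank classifiers by predicted accuracy on a new dataset,
    without running any of them."""
    preds = {clf: slope * p + icept for clf, (slope, icept) in models.items()}
    return sorted(preds, key=preds.get, reverse=True)

ranking = predicted_ranking(0.8)   # -> ["GaussianProcesses", "IBk"]
```

The point of the sketch is the structure, not the regressor: any regression technique can be swapped in for `fit_line`, with one fitted model per classifier.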
Predicting Ranking by Data Property Vector
Table 8: Test-1 using Data Characteristics for DA1
Value of k Random Probability Gaussian Processes p-value for null hypo. IBk p-value for null hypo.
1 0.076923077 0.136363636 0.2481944 0.25 0.000392335
2 0.294871795 0.522727273 0.001277486 0.454545455 0.01802491
3 0.58041958 0.795454545 0.006170564 0.704545455 0.06290122
Table 9: Test-2 using Data Characteristics for DA1
Count Random Probability Gaussian Processes p-value for null hypo. IBk p-value for null hypo.
1 0.076923077 0.136363636 0.2481944 0.25 0.000392335
2 0.153846154 0.284090909 0.002996455 0.25 0.0128041
3 0.230769231 0.386363636 0.000394481 0.363636364 0.000394481
4 0.307692308 0.443181818 0.000343617 0.403409091 0.0159287
5 0.384615385 0.518181818 0.00032955 0.440909091 0.08601391
6 0.461538462 0.549242424 0.003806131 0.53030303 0.02680756
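The p-value columns above can be approximated with an exact one-sided binomial test of the observed success rate against the random-baseline probability; the slides do not state the exact test used, so the sketch below, with hypothetical counts, is only one plausible construction:

```python
from math import comb

def binom_pvalue_upper(successes, trials, p0):
    """Exact one-sided p-value for H0: success probability equals p0,
    against the alternative that the predictor beats random ordering."""
    return sum(comb(trials, k) * p0 ** k * (1 - p0) ** (trials - k)
               for k in range(successes, trials + 1))

# Hypothetical tally: the predicted best classifier landed in the true
# top ranks on 23 of 44 test datasets, where random ordering would
# succeed with probability ~0.295.
p_null = binom_pvalue_upper(23, 44, 0.294871795)
```

A small p-value here rejects the null hypothesis that the property-vector predictor is no better than a random ranking.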
Predicting Ranking by Data Property Vector
Table 10: Test-1 using Data Characteristics for DA2
Count Random Probability Gaussian Processes p-value for null hypo. IBk p-value for null hypo.
1 0.014285714 0.045454545 0.469059 0.068181818 0.130488
2 0.056728778 0.227272727 0.000702291 0.295454545 4.23E-06
3 0.124862989 0.318181818 0.002140653 0.295454545 0.0063842
4 0.213955796 0.431818182 0.000962345 0.409090909 0.006957731
5 0.317534624 0.590909091 0.000168674 0.477272727 0.03942881
6 0.428182857 0.636363636 0.01014022 0.613636364 0.01014022
Table 11: Test-2 using Data Characteristics for DA2
Count Random Probability Gaussian Processes p-value for null hypo. IBk p-value for null hypo.
5 0.071428571 0.195454545 4.80E-08 0.25 1.92E-16
10 0.142857143 0.281818182 1.03E-12 0.327272727 8.45E-21
15 0.214285714 0.363636364 1.32E-18 0.366666667 1.32E-18
20 0.285714286 0.454545455 1.86E-26 0.422727273 3.23E-15
25 0.357142857 0.512727273 2.45E-22 0.470909091 1.99E-11
30 0.428571429 0.575 1.61E-24 0.543939394 3.85E-12
35 0.5 0.646753247 3.16E-27 0.607792208 5.06E-13
Predicting Ranking by Data Property Vector
Table 12: Test-1 using Data Characteristics for DA3
Count Random Probability Gaussian Processes p-value for null hypo. IBk p-value for null hypo.
1 0.014285714 0.148760331 5.27E-49 0.237603306 2.57E-101
2 0.056728778 0.384297521 1.92E-101 0.48553719 1.66E-154
3 0.124862989 0.520661157 7.63E-97 0.657024793 8.85E-163
4 0.213955796 0.597107438 1.49E-73 0.766528926 4.31E-148
5 0.317534624 0.665289256 1.05E-54 0.830578512 3.11E-119
6 0.428182857 0.710743802 3.06E-36 0.873966942 5.45E-92
Table 13: Test-2 using Data Characteristics for DA3
Count Random Probability Gaussian Processes p-value for null hypo. IBk p-value for null hypo.
5 0.071428571 0.321487603 1.43E-284 0.395041322 0
10 0.142857143 0.40661157 0 0.480991736 0
15 0.214285714 0.478236915 0 0.534710744 0
20 0.285714286 0.517768595 0 0.577582645 0
25 0.357142857 0.573636364 0 0.622479339 0
30 0.428571429 0.631060606 0 0.666804408 0
35 0.5 0.687839433 0 0.717532468 0
Figure 3: Test-1 on DA1
Figure 4: Test-1 on DA2
Figure 5: Test-2 on DA1
Figure 6: Test-2 on DA2
Figure 7: Test-1 on DA3
Figure 8: Test-2 on DA3
Increasing Dimension of Data (Autoencoder)
Figure 9: Test-1 on DA2
Figure 10: Test-2 on DA2
Decreasing Dimension of Data (Autoencoder)
Figure 11: Test-1 on DA2
Figure 12: Test-2 on DA2
Stacked Autoencoder
Figure 13: Test-1 on DA2
Figure 14: Test-2 on DA2
Comparison with Previous Techniques
Test-1: Difference between Accuracy
The true accuracy and predicted accuracy of an algorithm on each test dataset are compared
The absolute differences are averaged over all datasets and all algorithms
The result of the best regression technique is reported
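The averaging step above amounts to a mean absolute error over the dataset-by-algorithm grid; a minimal sketch with hypothetical accuracies:

```python
def accuracy_gap(true_acc, pred_acc):
    """Absolute accuracy difference averaged over all datasets and algorithms."""
    total, count = 0.0, 0
    for clf in true_acc:
        for t, p in zip(true_acc[clf], pred_acc[clf]):
            total += abs(t - p)
            count += 1
    return total / count

# Hypothetical true vs. regression-predicted accuracies on two test datasets.
true_acc = {"GaussianProcesses": [0.90, 0.70], "IBk": [0.80, 0.75]}
pred_acc = {"GaussianProcesses": [0.85, 0.78], "IBk": [0.82, 0.70]}
gap = accuracy_gap(true_acc, pred_acc)   # comparable to the scores in Table 14
```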
Table 14: Difference Between Accuracy
Simple Statistical Info-Theoretic Model-Based Landmark DCT ALL MDDF
Difference 0.0890 0.0915 0.0648 0.0859 0.0422 0.0747 0.0525 0.0426
Comparison with Previous Techniques
Test-2: Rank Correlation
Spearman's rank correlation coefficient between the actual and predicted rankings is calculated
The value of this coefficient is averaged over all datasets
The higher the value of this coefficient, the stronger the agreement between the actual and predicted ranks
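Spearman's coefficient for one dataset can be sketched as follows (hypothetical accuracies for five classifiers; ties ignored for simplicity):

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation via rho = 1 - 6*sum(d^2)/(n*(n^2-1));
    assumes no ties between values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical actual vs. predicted accuracies for five classifiers
# on one test dataset.
actual    = [0.91, 0.85, 0.78, 0.70, 0.66]
predicted = [0.88, 0.80, 0.81, 0.72, 0.60]
rho = spearman_rho(actual, predicted)   # -> 0.9
```

Averaging this value over all test datasets yields the per-method scores reported in Table 15.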
Table 15: Rank Correlation
Simple Statistical Info-Theoretic Model-Based Landmark DCT ALL MDDF
Average Value 0.464 0.444 0.488 0.431 0.495 0.459 0.488 0.520
Outline
1 Introduction and Background
2 Data Properties
3 Regression and Signiļ¬cance Testing
4 Results
5 Conclusion and Future Work
Conclusion
Methods to generate a ranking of binary classification algorithms without running them
Based on intrinsic properties of the data
Ranking of classifiers predicted via regression
Autoencoders used for further predictive analysis
Our approach gives better results than previous techniques
Future Work
Multi-class classification
Datasets can be grouped together based on domain knowledge
Other performance measures such as precision, recall, and F-measure can be used
Thank you!
References I
[1] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 161-168, New York, NY, USA, 2006. ACM.
[2] J. Gama and P. Brazdil. Characterization of classification algorithms. In Proceedings of the 7th Portuguese Conference on Artificial Intelligence: Progress in Artificial Intelligence, EPIA '95, pages 189-200, London, UK, 1995. Springer-Verlag.
[3] C. Köpf, C. Taylor, and J. Keller. Meta-analysis: From data characterisation for meta-learning to meta-regression. In Proceedings of the PKDD-00 Workshop on Data Mining, Decision Support, Meta-Learning and ILP, 2000.
[4] R. Leite and P. Brazdil. An iterative process for building learning curves and predicting relative performance of classifiers. In J. Neves, M. Santos, and J. Machado, editors, Progress in Artificial Intelligence, volume 4874 of Lecture Notes in Computer Science, pages 87-98. Springer Berlin Heidelberg, 2007.
[5] M. Reif, F. Shafait, M. Goldstein, T. Breuel, and A. Dengel. Automatic classifier selection for non-experts. Pattern Analysis and Applications, 17(1):83-96, 2014.
Walaa Eldin Moustafa
Ā 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
Ā 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
Ā 
äø€ęƔäø€åŽŸē‰ˆ(UCSFę–‡å‡­čƁ书)ę—§é‡‘å±±åˆ†ę ”ęƕäøščƁ如何办ē†
äø€ęƔäø€åŽŸē‰ˆ(UCSFę–‡å‡­čƁ书)ę—§é‡‘å±±åˆ†ę ”ęƕäøščƁ如何办ē†äø€ęƔäø€åŽŸē‰ˆ(UCSFę–‡å‡­čƁ书)ę—§é‡‘å±±åˆ†ę ”ęƕäøščƁ如何办ē†
äø€ęƔäø€åŽŸē‰ˆ(UCSFę–‡å‡­čƁ书)ę—§é‡‘å±±åˆ†ę ”ęƕäøščƁ如何办ē†
nuttdpt
Ā 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
MƔrton Kodok
Ā 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
Ā 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
Ā 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
yuvarajkumar334
Ā 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
Ā 

Recently uploaded (20)

äø€ęƔäø€åŽŸē‰ˆ(UMNę–‡å‡­čƁ书)ę˜Žå°¼č‹č¾¾å¤§å­¦ęƕäøščƁ如何办ē†
äø€ęƔäø€åŽŸē‰ˆ(UMNę–‡å‡­čƁ书)ę˜Žå°¼č‹č¾¾å¤§å­¦ęƕäøščƁ如何办ē†äø€ęƔäø€åŽŸē‰ˆ(UMNę–‡å‡­čƁ书)ę˜Žå°¼č‹č¾¾å¤§å­¦ęƕäøščƁ如何办ē†
äø€ęƔäø€åŽŸē‰ˆ(UMNę–‡å‡­čƁ书)ę˜Žå°¼č‹č¾¾å¤§å­¦ęƕäøščƁ如何办ē†
Ā 
äø€ęƔäø€åŽŸē‰ˆ(CUęƕäøščƁ)協尔é”æ大学ęƕäøščƁ如何办ē†
äø€ęƔäø€åŽŸē‰ˆ(CUęƕäøščƁ)協尔é”æ大学ęƕäøščƁ如何办ē†äø€ęƔäø€åŽŸē‰ˆ(CUęƕäøščƁ)協尔é”æ大学ęƕäøščƁ如何办ē†
äø€ęƔäø€åŽŸē‰ˆ(CUęƕäøščƁ)協尔é”æ大学ęƕäøščƁ如何办ē†
Ā 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Ā 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
Ā 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Ā 
äø€ęƔäø€åŽŸē‰ˆ(harvardęƕäøščƁ书ļ¼‰å“ˆä½›å¤§å­¦ęƕäøščƁ如何办ē†
äø€ęƔäø€åŽŸē‰ˆ(harvardęƕäøščƁ书ļ¼‰å“ˆä½›å¤§å­¦ęƕäøščƁ如何办ē†äø€ęƔäø€åŽŸē‰ˆ(harvardęƕäøščƁ书ļ¼‰å“ˆä½›å¤§å­¦ęƕäøščƁ如何办ē†
äø€ęƔäø€åŽŸē‰ˆ(harvardęƕäøščƁ书ļ¼‰å“ˆä½›å¤§å­¦ęƕäøščƁ如何办ē†
Ā 
äø€ęƔäø€åŽŸē‰ˆ(UOęƕäøščƁ)ęø„å¤Ŗ华大学ęƕäøščƁ如何办ē†
äø€ęƔäø€åŽŸē‰ˆ(UOęƕäøščƁ)ęø„å¤Ŗ华大学ęƕäøščƁ如何办ē†äø€ęƔäø€åŽŸē‰ˆ(UOęƕäøščƁ)ęø„å¤Ŗ华大学ęƕäøščƁ如何办ē†
äø€ęƔäø€åŽŸē‰ˆ(UOęƕäøščƁ)ęø„å¤Ŗ华大学ęƕäøščƁ如何办ē†
Ā 
Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
Ā 
原ē‰ˆäø€ęƔäø€åˆ©å…¹č“å…‹ē‰¹å¤§å­¦ęƕäøščƁ(LeedsBeckettęƕäøščƁ书)如何办ē†
原ē‰ˆäø€ęƔäø€åˆ©å…¹č“å…‹ē‰¹å¤§å­¦ęƕäøščƁ(LeedsBeckettęƕäøščƁ书)如何办ē†åŽŸē‰ˆäø€ęƔäø€åˆ©å…¹č“å…‹ē‰¹å¤§å­¦ęƕäøščƁ(LeedsBeckettęƕäøščƁ书)如何办ē†
原ē‰ˆäø€ęƔäø€åˆ©å…¹č“å…‹ē‰¹å¤§å­¦ęƕäøščƁ(LeedsBeckettęƕäøščƁ书)如何办ē†
Ā 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Ā 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Ā 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Ā 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
Ā 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Ā 
äø€ęƔäø€åŽŸē‰ˆ(UCSFę–‡å‡­čƁ书)ę—§é‡‘å±±åˆ†ę ”ęƕäøščƁ如何办ē†
äø€ęƔäø€åŽŸē‰ˆ(UCSFę–‡å‡­čƁ书)ę—§é‡‘å±±åˆ†ę ”ęƕäøščƁ如何办ē†äø€ęƔäø€åŽŸē‰ˆ(UCSFę–‡å‡­čƁ书)ę—§é‡‘å±±åˆ†ę ”ęƕäøščƁ如何办ē†
äø€ęƔäø€åŽŸē‰ˆ(UCSFę–‡å‡­čƁ书)ę—§é‡‘å±±åˆ†ę ”ęƕäøščƁ如何办ē†
Ā 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Ā 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Ā 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
Ā 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Ā 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Ā 

Predicting best classifier using properties of data sets

  • 1. Predicting the Best Classifier using Properties of Datasets Abhishek Vijayvargia Supervised by: Prof. Harish Karnick Department of Computer Science & Engineering IIT Kanpur June 24, 2015
  • 2. Outline 1 Introduction and Background 2 Data Properties 3 Regression and Significance Testing 4 Results 5 Conclusion and Future Work
  • 8. Introduction Classification techniques have applications in different domains Datasets contain a mixture of nominal, integer, real and text attributes Datasets have different properties Classification algorithms perform differently on these datasets No single best algorithm exists (No Free Lunch) Cross-validation is used to find a good algorithm (time consuming) Takes even more time with large datasets and many candidate algorithms
  • 13. Introduction Meta-Learning Knowledge of datasets and the performance of algorithms is stored Predict the performance of algorithms Generate a ranking Top-k algorithms can be chosen Problem Statement Predict an optimal learning algorithm, or nearly optimal learning algorithms via a ranking paradigm, in terms of performance for a new dataset by using the properties of the dataset.
  • 16. Related Work Characterization of Classification Algorithms [2] Dataset characteristics: Simple Measures, Statistical Measures, Information Theoretic Measures Used four types of models Partial Learning Curve [4] Full learning curve predicted from a partial learning curve Fraction of instances used (10%) Predict the better of two algorithms
  • 18. Related Work Meta-Analysis [3] Meta Features: Simple, Statistical and Information Theoretic Measures; Model Based Measures; Landmarks Classification for algorithm selection Synthetic datasets used Automatic Classifier Selection for Non-Experts [5] Meta Features Accuracy predicted by regression
  • 20. Motivation Empirical Comparison of Supervised Learning Algorithms [1] The best methods perform poorly on some problems Poor methods perform exceptionally well on some problems Motivation to generate a ranking of algorithms
  • 22. Histogram of Standard Deviation Creating the Histogram K standard deviation values (one per numerical attribute) H histogram bins 2 histograms per dataset (binary class) Bins covering the range [0, 0.5] are taken (data is scaled to [0, 1])
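The histogram construction above can be sketched as follows. This is a minimal illustrative sketch (the function name is mine), assuming attributes are scaled to [0, 1] so that their standard deviations lie in [0, 0.5]:

```python
import numpy as np

def std_histogram(stds, n_bins=10, value_range=(0.0, 0.5)):
    """Histogram of per-attribute standard deviations for one class.

    `stds` would typically be X[y == c].std(axis=0) for class c, where X
    holds the class's instances scaled to [0, 1]. Returns counts over
    `n_bins` equal-width bins spanning `value_range`.
    """
    counts, _ = np.histogram(stds, bins=n_bins, range=value_range)
    return counts

# Class-1 standard deviations from Table 1 reproduce the Class-1 row of Table 2
class1 = std_histogram([0.366, 0.223, 0.204, 0.171, 0.179, 0.162, 0.164, 0.178])
```

With two classes, calling this once per class yields the two histograms stored per dataset.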
  • 23. Histogram of Standard Deviation Table 1: Standard Deviations. Class 0: 0.228 0.215 0.2 0.187 0.135 0.15 0.116 0.154; Class 1: 0.366 0.223 0.204 0.171 0.179 0.162 0.164 0.178. Table 2: Histogram. Range: .00-.05 .05-.10 .10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50; Class 0 Histogram: 0 0 2 3 3 0 0 0 0 0; Class 1 Histogram: 0 0 0 5 2 0 0 1 0 0
  • 24. Comparing Histograms 1-Norm Distance Based Comparison Two datasets' histograms are compared on the basis of 1-norm distance Two pairwise comparisons (one per class matching) between datasets The minimum distance score of the two pairwise comparisons is taken Datasets are ordered by increasing distance
  • 30. Comparing Histograms (bins as in Table 2) Dataset-1: Class 0 Histogram: 0 0 2 3 3 0 0 0 0 0; Class 1 Histogram: 0 0 0 5 2 0 0 1 0 0. Dataset-2: Class 0 Histogram: 0 1 0 5 2 0 0 0 0 0; Class 1 Histogram: 0 1 1 1 2 3 0 0 0 0. Score 1: comparing Class 0 of Dataset-1 with Class 0 of Dataset-2 and Class 1 of Dataset-1 with Class 1 of Dataset-2. Score-1 = (|0-0| + |0-1| + |2-0| + |3-5| + |3-2| + |0-0| + |0-0| + |0-0| + |0-0| + |0-0|) + (|0-0| + |0-1| + |0-1| + |5-1| + |2-2| + |0-3| + |0-0| + |1-0| + |0-0| + |0-0|) = 6 + 10 = 16. Score 2: comparing Class 0 of Dataset-1 with Class 1 of Dataset-2 and Class 1 of Dataset-1 with Class 0 of Dataset-2. Score-2 = (|0-0| + |0-1| + |2-1| + |3-1| + |3-2| + |0-3| + |0-0| + |0-0| + |0-0| + |0-0|) + (|0-0| + |0-1| + |0-0| + |5-5| + |2-2| + |0-0| + |0-0| + |1-0| + |0-0| + |0-0|) = 8 + 2 = 10. Distance Score = min(Score-1, Score-2) = min(16, 10) = 10
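The worked comparison above can be coded directly. A minimal sketch (the function name is mine), assuming binary-class histogram pairs as on the slide:

```python
import numpy as np

def histogram_distance(h1, h2):
    """1-norm distance between two binary-class histogram pairs.

    h1, h2: dicts {0: counts, 1: counts}. Classes are matched both ways
    (0-0/1-1 and 0-1/1-0) and the smaller total distance is returned,
    so the score does not depend on which class happens to be labelled 0.
    """
    def l1(a, b):
        return int(np.abs(np.asarray(a) - np.asarray(b)).sum())
    score_1 = l1(h1[0], h2[0]) + l1(h1[1], h2[1])   # direct class matching
    score_2 = l1(h1[0], h2[1]) + l1(h1[1], h2[0])   # swapped class matching
    return min(score_1, score_2)
```

Applied to the two example datasets on this slide it returns min(16, 10) = 10.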
  • 36. Comparing Histograms Kolmogorov-Smirnov Test Based Comparison Take the values of a histogram as a sample Calculate the proportion of each value in the two samples Calculate the cumulative proportion of each sample Calculate the D statistic Two pairwise comparisons Minimum distance score
  • 40. Comparing Histograms Table 3: Kolmogorov-Smirnov test. Bin Range: .00-.05 .05-.10 .10-.15 .15-.20 .20-.25 .25-.30 .30-.35 .35-.40 .40-.45 .45-.50; Histogram-1: 0 1 0 5 2 0 0 0 0 0; Histogram-2: 0 0 2 3 3 0 0 0 0 0; Proportion-1: 0 0.125 0 0.625 0.25 0 0 0 0 0; Proportion-2: 0 0 0.25 0.375 0.375 0 0 0 0 0; Cum. Pro.-1: 0 0.125 0.125 0.75 1 1 1 1 1 1; Cum. Pro.-2: 0 0 0.25 0.625 1 1 1 1 1 1; Difference: 0 0.125 0.125 0.125 0 0 0 0 0 0. Two datasets can be compared in two ways Total of 2 D values for each comparison Sum both values and take the class mapping with the minimum D score
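The D statistic in Table 3 follows the steps listed two slides earlier: normalise counts to proportions, accumulate them into empirical CDFs, and take the maximum absolute difference. A minimal sketch (the function name is mine):

```python
import numpy as np

def ks_statistic(h1, h2):
    """Kolmogorov-Smirnov D statistic between two histograms.

    Each histogram is treated as a sample: counts are normalised to
    proportions, accumulated into cumulative proportions (empirical CDFs),
    and D is the maximum absolute difference between the two CDFs.
    """
    c1 = np.cumsum(h1) / np.sum(h1)   # cumulative proportions, sample 1
    c2 = np.cumsum(h2) / np.sum(h2)   # cumulative proportions, sample 2
    return float(np.max(np.abs(c1 - c2)))
```

On the two histograms of Table 3 this gives D = 0.125, matching the Difference row.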
  • 45. Dataset Properties CAV Separate the dataset based on its class Apply K-Means clustering Cluster Properties: Number of instances in each cluster; Cluster Value C_k = Ī£_{x∈cluster} dist(x, centroid); Cluster Centroid Moments of Data: Variance, Skewness, Kurtosis
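The per-cluster properties above can be sketched as follows, given cluster assignments from any K-Means run (function and key names are mine; the clustering step itself is assumed done):

```python
import numpy as np

def cluster_value(points, centroid):
    """C_k: sum of Euclidean distances from each cluster point to the centroid."""
    points = np.asarray(points, dtype=float)
    return float(np.linalg.norm(points - centroid, axis=1).sum())

def cav_properties(X, labels):
    """Per-cluster meta-properties: size, cluster value, centroid.

    `labels` are cluster assignments for the rows of X, e.g. produced by
    running K-Means separately on each class's instances.
    """
    props = []
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        props.append({"size": len(members),
                      "value": cluster_value(members, centroid),
                      "centroid": centroid})
    return props
```

The sizes, values, and centroids collected this way populate the dataset's property vector.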
Dataset Properties: Mixture of Gaussians

- A dataset may have overlapping clusters (non-circular shapes)
- For each attribute, k different Gaussians are fit by maximum likelihood on the observed data; the mean and variance of each Gaussian are stored
- Multivariate Gaussian model:

  N(x; μ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2)(x − μ) Σ^(−1) (x − μ)^T)

  - μ is a d-length row vector; Σ is a d × d covariance matrix
- Singular value decomposition of the covariance matrix; the values from the diagonal matrix and the mean vector are stored
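The two Gaussian-based property sets above might be computed as in this sketch; the component count k and the feature ordering are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gaussian_meta_features(X, k=2, seed=0):
    """Sketch: per-attribute k-component Gaussian mixtures (means and
    variances stored), plus the mean vector and the singular values of
    the covariance matrix of a multivariate Gaussian fit to the dataset."""
    feats = []
    for j in range(X.shape[1]):
        gm = GaussianMixture(n_components=k, random_state=seed).fit(X[:, j:j+1])
        feats.extend(gm.means_.ravel())          # k means per attribute
        feats.extend(gm.covariances_.ravel())    # k variances per attribute
    mu = X.mean(axis=0)                          # mean vector of the MV Gaussian
    cov = np.cov(X, rowvar=False)
    s = np.linalg.svd(cov, compute_uv=False)     # diagonal of Sigma from SVD
    feats.extend(mu)
    feats.extend(s)
    return np.array(feats)
```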
Regression to Predict Performance Measures

Regression Analysis
- The property vector is populated by meta-properties
- The entire vector, or a sub-vector, can be used to predict performance measures
- The regression model is Y = f(X, α), where
  - Y is the dependent variable (the performance measure)
  - X is the independent variable (the property vector)
  - α is a vector of unknown parameters
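The training and prediction setup might look like the following sketch; the random-forest regressor, the classifier names, and all data here are hypothetical placeholders, since the thesis compares several regression techniques:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical setup: meta_X holds one property vector per dataset,
# acc[clf] holds the measured accuracy of classifier clf on each dataset.
rng = np.random.default_rng(0)
meta_X = rng.random((44, 20))                    # 44 datasets, 20 meta-properties
acc = {"GaussianProcesses": rng.random(44), "IBk": rng.random(44)}

# One regression model per classifier: Y = f(X, alpha)
models = {clf: RandomForestRegressor(random_state=0).fit(meta_X, y)
          for clf, y in acc.items()}

# For a new dataset, predict each classifier's accuracy and rank them
new_props = rng.random((1, 20))
preds = {clf: m.predict(new_props)[0] for clf, m in models.items()}
ranking = sorted(preds, key=preds.get, reverse=True)
```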
Regression to Predict Performance Measures

Figure 1: Training
Regression to Predict Performance Measures

Figure 2: Testing
Statistical Significance Testing

- The predicted sequence is compared with a random sequence; the actual sequence is taken as the baseline
- Probability that at least one algorithm of a random top-k is present in the actual top-k:

  P = 1 − P(no algorithm present in the top-k)
    = 1 − [(n−k)/n] × [(n−k−1)/(n−1)] × … × [(n−2k+1)/(n−k+1)]

- Expected number of algorithms from a random sequence present in the actual top-k (the hypergeometric expectation):

  E = Σ_{i=1..k} i × C(k, i) × C(n−k, k−i) / C(n, k)
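Both quantities can be checked numerically; for n = 13 algorithms, `prob_at_least_one` reproduces the random-probability column of Table 8:

```python
from math import comb

def prob_at_least_one(n, k):
    """P(a random top-k shares at least one algorithm with the actual top-k)."""
    p_none = comb(n - k, k) / comb(n, k)   # equals the telescoping product above
    return 1 - p_none

def expected_matches(n, k):
    """Expected overlap between a random top-k and the actual top-k
    (hypergeometric expectation)."""
    return sum(i * comb(k, i) * comb(n - k, k - i) for i in range(1, k + 1)) / comb(n, k)

print(prob_at_least_one(13, 1))   # 0.0769..., Table 8's k = 1 random probability
print(prob_at_least_one(13, 3))   # 0.5804..., Table 8's k = 3 random probability
```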
Statistical Significance Testing

Test 1
- Whether at least one algorithm of the given top-k sequence is present in the actual top-k sequence
- A statistical significance test is run between
  - predicted-rank/actual-rank matches and
  - random-rank/actual-rank matches
- Prediction methods where the difference is not statistically significant are excluded

Test 2
- The number of algorithms of the given top-k sequence present in the actual top-k sequence
- The same statistical significance test is run between predicted-rank/actual-rank matches and random-rank/actual-rank matches
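Both tests reduce to a top-k overlap count between rankings; a minimal sketch with hypothetical algorithm names:

```python
def topk_overlap(predicted, actual, k):
    """Number of algorithms shared by the predicted and actual top-k.
    Test 2 uses this count directly; Test 1 asks whether it is >= 1."""
    return len(set(predicted[:k]) & set(actual[:k]))

# Hypothetical rankings over six algorithms, best first
predicted = ["J48", "IBk", "SVM", "NB", "RF", "GP"]
actual    = ["IBk", "RF", "J48", "SVM", "NB", "GP"]
print(topk_overlap(predicted, actual, 3))       # 2 (J48 and IBk are shared)
print(topk_overlap(predicted, actual, 3) >= 1)  # True: Test 1 succeeds at k = 3
```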
Autoencoder

- Learns to reconstruct its own input
- Used to increase the dimension of the data
- Used to decrease the dimension of the data
- Stacked autoencoder
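A minimal sketch of the idea, not the thesis's implementation: a network trained to reproduce its input, whose hidden layer gives a lower- (or, with more units, higher-) dimensional code; a stacked autoencoder chains several such encoders:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 20))                     # 200 property vectors, 20 features

ae = MLPRegressor(hidden_layer_sizes=(8,),    # 8-unit bottleneck: decreasing dimension
                  activation="logistic", max_iter=2000, random_state=0)
ae.fit(X, X)                                  # autoencoder: the target equals the input

# The code is the hidden activation: sigmoid(X @ W1 + b1)
code = 1.0 / (1.0 + np.exp(-(X @ ae.coefs_[0] + ae.intercepts_[0])))
print(code.shape)                             # (200, 8)
```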
Data and Algorithm Set

- Real-world datasets: 44 binary datasets from UCI, tunedIT, keel, and delve
- Synthetic datasets: 484 datasets generated using univariate and multivariate distributions
- DA1: 13 classification algorithms and 44 real datasets
- DA2: 70 classification algorithms and 44 real datasets
- DA3: 70 classification algorithms and 484 synthetic datasets
Data Cleaning

Steps for data cleaning:
- Nominal attributes are converted to binary 0/1 attributes
- PCA is applied with the maximum number of attributes set to 8
- The reduced attributes are normalized using (value − min)/(max − min)
- Class attributes are renamed to 0 and 1
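The steps above can be sketched as one small pipeline; the column conventions (a `class` target column, nominal columns as strings) are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def clean(df, target="class"):
    """Sketch of the cleaning steps: one-hot encode nominals, PCA to at
    most 8 components, min-max normalize, map class labels to 0/1."""
    y = pd.factorize(df[target])[0]                    # class attributes -> 0 and 1
    X = pd.get_dummies(df.drop(columns=[target]))      # nominal -> binary 0/1
    X = PCA(n_components=min(8, X.shape[1])).fit_transform(X)
    mn, mx = X.min(axis=0), X.max(axis=0)
    X = (X - mn) / np.where(mx > mn, mx - mn, 1)       # (value - min)/(max - min)
    return X, y
```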
Predicting Ranking from Histogram Bins

Table 4: Predicting ranking from histogram bins using 1-norm distance on DA1

| Count | Random Probability | 1-Norm distance | Confidence for alternative hypothesis |
|---|---|---|---|
| 1 | 0.076923077 | 0.090909091 | 0.5583927 |
| 2 | 0.294871795 | 0.295454545 | 0.44665546 |
| 3 | 0.58041958 | 0.659090909 | 0.8166996 |
| 4 | 0.823776224 | 0.795454545 | 0.2378282 |
| 5 | 0.956487956 | 1 | 0.8587793 |
| 6 | 0.995920746 | 1 | 0.164608 |
| 7 | 1 | 1 | 0 |

Table 5: Predicting ranking from histogram bins using Kolmogorov–Smirnov on DA1

| Count | Random Probability | Kolmogorov–Smirnov test | Confidence for alternative hypothesis |
|---|---|---|---|
| 1 | 0.076923077 | 0.090909091 | 0.7518056 |
| 2 | 0.294871795 | 0.340909091 | 0.6987076 |
| 3 | 0.58041958 | 0.636363636 | 0.7234472 |
| 4 | 0.823776224 | 0.863636364 | 0.6779081 |
| 5 | 0.956487956 | 0.977272727 | 0.5761086 |
| 6 | 0.995920746 | 1 | 0.164608 |
| 7 | 1 | 1 | 0 |
Predicting Ranking from Histogram Bins

Table 6: Predicting ranking from histogram bins on DA2

| Count | Random Probability | 1-Norm distance | Conf. for alter. hypo. | Kolmogorov–Smirnov test | Conf. for alter. hypo. |
|---|---|---|---|---|---|
| 1 | 0.014285714 | 0 | 0 | 0 | 0 |
| 2 | 0.056728778 | 0.022727273 | 0.07656138 | 0.022727273 | 0.07656138 |
| 3 | 0.124862989 | 0.068181818 | 0.07501977 | 0.090909091 | 0.1837721 |
| 4 | 0.213955796 | 0.159090909 | 0.1402761 | 0.159090909 | 0.1402761 |
| 5 | 0.317534624 | 0.340909091 | 0.5753865 | 0.272727273 | 0.213988 |
| 6 | 0.428182857 | 0.454545455 | 0.5822902 | 0.363636364 | 0.1543545 |
| 7 | 0.538469854 | 0.522727273 | 0.3581275 | 0.454545455 | 0.1025831 |
| 8 | 0.641846095 | 0.659090909 | 0.5264127 | 0.681818182 | 0.6489032 |
| 9 | 0.733341187 | 0.818181818 | 0.8664682 | 0.795454545 | 0.77316 |
| 10 | 0.809949161 | 0.863636364 | 0.756462 | 0.863636364 | 0.756462 |
| 11 | 0.870659846 | 0.909090909 | 0.688802 | 0.886363636 | 0.5115497 |
| 12 | 0.916175987 | 0.954545455 | 0.7251068 | 0.977272727 | 0.8932743 |
| 13 | 0.948419826 | 0.977272727 | 0.6699302 | 0.977272727 | 0.6699302 |
| 14 | 0.969963161 | 1 | 0.7386451 | 1 | 0.7386451 |
| 15 | 0.983506557 | 1 | 0.5189398 | 1 | 0.5189398 |
Predicting Ranking from Histogram Bins

Table 7: Predicting ranking from histogram bins on DA3

| Count | Random Probability | 1-Norm distance | Conf. for alter. hypo. | Kolmogorov–Smirnov test | Conf. for alter. hypo. |
|---|---|---|---|---|---|
| 1 | 0.014285714 | 0 | 0 | 0 | 0 |
| 2 | 0.056728778 | 0 | 0 | 0 | 0 |
| 3 | 0.124862989 | 0.008264463 | 0 | 0.010330579 | 0 |
| 4 | 0.213955796 | 0.068181818 | 0 | 0.07231405 | 0 |
| 5 | 0.317534624 | 0.150826446 | 0 | 0.150826446 | 0 |
| 6 | 0.428182857 | 0.268595041 | 5.64E-14 | 0.27892562 | 3.98E-12 |
| 7 | 0.538469854 | 0.400826446 | 2.69E-10 | 0.404958678 | 1.49E-09 |
| 8 | 0.641846095 | 0.541322314 | 1.44E-06 | 0.530991736 | 2.25E-07 |
| 9 | 0.733341187 | 0.683884298 | 0.0066866 | 0.681818182 | 0.00504363 |
| 10 | 0.809949161 | 0.780991736 | 0.04829307 | 0.783057851 | 0.04829307 |
| 11 | 0.870659846 | 0.863636364 | 0.2945404 | 0.863636364 | 0.2945404 |
| 12 | 0.916175987 | 0.919421488 | 0.5609581 | 0.919421488 | 0.5609581 |
| 13 | 0.948419826 | 0.960743802 | 0.8715195 | 0.962809917 | 0.8715195 |
| 14 | 0.969963161 | 0.975206612 | 0.6958514 | 0.977272727 | 0.6958514 |
| 15 | 0.983506557 | 0.981404959 | 0.2801928 | 0.983471074 | 0.2801928 |
Predicting Ranking by Data Property Vector

Steps — these properties are considered:
- Histogram of the standard deviation of each class
- Cluster Analysis Vector (CAV)
- Moments of the data
- Mixture of Gaussians on each attribute
- Mean vector and the diagonal entries of the Sigma matrix obtained by singular value decomposition (SVD) of the covariance matrix of the multivariate Gaussian model of the dataset

Then:
- The property vector is the independent variable and accuracy the dependent variable in regression
- One regression model is built for each classifier
- Results are compared with a random sequence
Predicting Ranking by Data Property Vector

Table 8: Test-1 using data characteristics for DA1

| Value of k | Random Probability | Gaussian Processes | p-value for null hypo. | IBk | p-value for null hypo. |
|---|---|---|---|---|---|
| 1 | 0.076923077 | 0.136363636 | 0.2481944 | 0.25 | 0.000392335 |
| 2 | 0.294871795 | 0.522727273 | 0.001277486 | 0.454545455 | 0.01802491 |
| 3 | 0.58041958 | 0.795454545 | 0.006170564 | 0.704545455 | 0.06290122 |

Table 9: Test-2 using data characteristics for DA1

| Count | Random Probability | Gaussian Processes | p-value for null hypo. | IBk | p-value for null hypo. |
|---|---|---|---|---|---|
| 1 | 0.076923077 | 0.136363636 | 0.2481944 | 0.25 | 0.000392335 |
| 2 | 0.153846154 | 0.284090909 | 0.002996455 | 0.25 | 0.0128041 |
| 3 | 0.230769231 | 0.386363636 | 0.000394481 | 0.363636364 | 0.000394481 |
| 4 | 0.307692308 | 0.443181818 | 0.000343617 | 0.403409091 | 0.0159287 |
| 5 | 0.384615385 | 0.518181818 | 0.00032955 | 0.440909091 | 0.08601391 |
| 6 | 0.461538462 | 0.549242424 | 0.003806131 | 0.53030303 | 0.02680756 |
Predicting Ranking by Data Property Vector

Table 10: Test-1 using data characteristics for DA2

| Count | Random Probability | Gaussian Processes | p-value for null hypo. | IBk | p-value for null hypo. |
|---|---|---|---|---|---|
| 1 | 0.014285714 | 0.045454545 | 0.469059 | 0.068181818 | 0.130488 |
| 2 | 0.056728778 | 0.227272727 | 0.000702291 | 0.295454545 | 4.23E-06 |
| 3 | 0.124862989 | 0.318181818 | 0.002140653 | 0.295454545 | 0.0063842 |
| 4 | 0.213955796 | 0.431818182 | 0.000962345 | 0.409090909 | 0.006957731 |
| 5 | 0.317534624 | 0.590909091 | 0.000168674 | 0.477272727 | 0.03942881 |
| 6 | 0.428182857 | 0.636363636 | 0.01014022 | 0.613636364 | 0.01014022 |

Table 11: Test-2 using data characteristics for DA2

| Count | Random Probability | Gaussian Processes | p-value for null hypo. | IBk | p-value for null hypo. |
|---|---|---|---|---|---|
| 5 | 0.071428571 | 0.195454545 | 4.80E-08 | 0.25 | 1.92E-16 |
| 10 | 0.142857143 | 0.281818182 | 1.03E-12 | 0.327272727 | 8.45E-21 |
| 15 | 0.214285714 | 0.363636364 | 1.32E-18 | 0.366666667 | 1.32E-18 |
| 20 | 0.285714286 | 0.454545455 | 1.86E-26 | 0.422727273 | 3.23E-15 |
| 25 | 0.357142857 | 0.512727273 | 2.45E-22 | 0.470909091 | 1.99E-11 |
| 30 | 0.428571429 | 0.575 | 1.61E-24 | 0.543939394 | 3.85E-12 |
| 35 | 0.5 | 0.646753247 | 3.16E-27 | 0.607792208 | 5.06E-13 |
Predicting Ranking by Data Property Vector

Table 12: Test-1 using data characteristics for DA3

| Count | Random Probability | Gaussian Processes | p-value for null hypo. | IBk | p-value for null hypo. |
|---|---|---|---|---|---|
| 1 | 0.014285714 | 0.148760331 | 5.27E-49 | 0.237603306 | 2.57E-101 |
| 2 | 0.056728778 | 0.384297521 | 1.92E-101 | 0.48553719 | 1.66E-154 |
| 3 | 0.124862989 | 0.520661157 | 7.63E-97 | 0.657024793 | 8.85E-163 |
| 4 | 0.213955796 | 0.597107438 | 1.49E-73 | 0.766528926 | 4.31E-148 |
| 5 | 0.317534624 | 0.665289256 | 1.05E-54 | 0.830578512 | 3.11E-119 |
| 6 | 0.428182857 | 0.710743802 | 3.06E-36 | 0.873966942 | 5.45E-92 |

Table 13: Test-2 using data characteristics for DA3

| Count | Random Probability | Gaussian Processes | p-value for null hypo. | IBk | p-value for null hypo. |
|---|---|---|---|---|---|
| 5 | 0.071428571 | 0.321487603 | 1.43E-284 | 0.395041322 | 0 |
| 10 | 0.142857143 | 0.40661157 | 0 | 0.480991736 | 0 |
| 15 | 0.214285714 | 0.478236915 | 0 | 0.534710744 | 0 |
| 20 | 0.285714286 | 0.517768595 | 0 | 0.577582645 | 0 |
| 25 | 0.357142857 | 0.573636364 | 0 | 0.622479339 | 0 |
| 30 | 0.428571429 | 0.631060606 | 0 | 0.666804408 | 0 |
| 35 | 0.5 | 0.687839433 | 0 | 0.717532468 | 0 |
Figure 3: Test-1 on DA1
Figure 4: Test-1 on DA2
Figure 5: Test-2 on DA1
Figure 6: Test-2 on DA2
Figure 7: Test-1 on DA3
Figure 8: Test-2 on DA3
Increasing Dimension of Data (Autoencoder)

Figure 9: Test-1 on DA2
Figure 10: Test-2 on DA2
Decreasing Dimension of Data (Autoencoder)

Figure 11: Test-1 on DA2
Figure 12: Test-2 on DA2
Stacked Autoencoder

Figure 13: Test-1 on DA2
Figure 14: Test-2 on DA2
Comparison with Previous Techniques

Test 1: Difference between accuracies
- The true accuracy and the predicted accuracy of an algorithm on each test dataset are compared
- The average of the absolute differences over all datasets and all algorithms is taken
- The result of the best regression technique is reported

Table 14: Difference between accuracies

| | Simple | Statistical | Info Theoretic | Model Based | Landmark | DCT | ALL | MDDF |
|---|---|---|---|---|---|---|---|---|
| Difference | 0.0890 | 0.0915 | 0.0648 | 0.0859 | 0.0422 | 0.0747 | 0.0525 | 0.0426 |
Comparison with Previous Techniques

Test 2: Rank correlation
- Spearman's rank correlation coefficient between the actual and predicted rankings is calculated
- The value of this coefficient is averaged over all datasets
- The higher the coefficient, the higher the correlation between the actual rank and the predicted rank

Table 15: Rank correlation

| | Simple | Statistical | Info Theoretic | Model Based | Landmark | DCT | ALL | MDDF |
|---|---|---|---|---|---|---|---|---|
| Average Value | 0.464 | 0.444 | 0.488 | 0.431 | 0.495 | 0.459 | 0.488 | 0.520 |
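Spearman's coefficient on a pair of rankings can be computed directly; a minimal sketch with hypothetical ranks for five classifiers:

```python
from scipy.stats import spearmanr

# Hypothetical example: actual vs. predicted ranks of five classifiers
actual_rank    = [1, 2, 3, 4, 5]
predicted_rank = [2, 1, 3, 5, 4]
rho, _ = spearmanr(actual_rank, predicted_rank)
print(rho)  # 0.8 = 1 - 6*sum(d^2)/(n(n^2-1)) with sum(d^2) = 4, n = 5
```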
Conclusion

- Methods to generate a ranking of binary classification algorithms without running them
- Based on intrinsic properties of the data
- The ranking of classifiers is predicted via regression
- Autoencoders were used for further predictive analysis
- Our approach gives better results than previous techniques
Future Work

- Multi-class classification
- Datasets can be grouped together based on domain knowledge
- Other performance measures such as precision, recall, and F-measure can be used
Thank you!