INFO-F-422
STATISTICAL FOUNDATION
OF MACHINE LEARNING
OTTO GROUP PRODUCT
CLASSIFICATION CHALLENGE
Fiscarelli Antonio Maria
INTRODUCTION
The aim of this project is to implement and assess some feature selection methods
and supervised learning algorithms. The most accurate will be selected and used for
the Otto Group Classification Challenge.
The objective is to build a predictive model that is able to distinguish between 9
different product categories.
DATA
The Otto Group provides:
- a training set in CSV (comma-separated values) format that includes 61788 labeled samples with 93 features, together with the corresponding class label.
- a test set in CSV (comma-separated values) format that includes 144368 unlabeled samples with 93 features.
All variables take non-negative integer values.
The training set is quite sparse (many variables take the value 0).
The following plots describe the distribution of the training data.
As can be noticed, almost 50% of the training samples belong to classes 2 and 6, while classes 1, 4, 5 and 7 are the least populated. This means that the training samples are not equally distributed among the 9 classes.
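A minimal sketch of how the data can be loaded and this distribution inspected; the file name train.csv and the columns id, feat_1..feat_93, target are assumptions based on the standard Kaggle release:

train <- read.csv("train.csv")
train$target <- as.factor(train$target)

# Class counts and relative frequencies
counts <- table(train$target)
print(counts)
print(round(prop.table(counts), 3))

# Sparsity: fraction of zero entries among the 93 feature columns
feat <- as.matrix(train[, grep("^feat_", names(train))])
mean(feat == 0)

# Bar plot of the class distribution
barplot(counts, las = 2, main = "Training samples per class")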
TECHNIQUES
The techniques used are the following:
• Feature selection methods:
◦ Stepwise selection
◦ Principal Components Analysis (PCA)
◦ Bivariate analysis (BA)
• Supervised learning algorithms:
◦ Discriminant analysis
◦ Classification trees and Random Forest
◦ Neural networks and Random Neurons
These techniques will be assessed to find the best subset of variables for a given
model and the best combination of models.
FEATURE SELECTION METHODS
The aim of feature selection is to select, among all variables, those that are most
relevant and best discriminate between the different classes.
The best feature selection method will be selected according to:
• Accuracy of the model using the subset of variables
• Computation time of the model using the subset of variables
The feature selection methods proposed are:
• Stepwise selection
• Principal Component Analysis (PCA)
• Bivariate Analysis (BA)
STEPWISE SELECTION
Stepwise selection belongs to the wrapper methods. It is essentially a search over the
space of variables, where each variable is either included in or excluded from the set
of features to use.
It combines forward and backward selection: the solution is initialized as the empty
set or as the set of all variables, and variables are progressively added or removed
according to a cost function (for example, ME).
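The report uses ME as the cost of this wrapper search; the following is only a simplified illustration of the idea, with forward steps only, a single hold-out split to estimate ME, and MASS::lda as the wrapped classifier (all of these are assumptions, not the exact procedure used, and the loop is slow on the full data):

library(MASS)

features <- grep("^feat_", names(train), value = TRUE)
set.seed(1)
idx <- sample(nrow(train), round(0.7 * nrow(train)))
tr  <- train[idx, ]
va  <- train[-idx, ]

me <- function(vars) {
  # Misclassification error of an LDA model built on the given variables
  fit  <- lda(reformulate(vars, response = "target"), data = tr)
  pred <- predict(fit, va)$class
  mean(pred != va$target)
}

selected <- character(0)
best.err <- Inf
repeat {
  cand <- setdiff(features, selected)
  errs <- sapply(cand, function(v) me(c(selected, v)))
  if (min(errs) >= best.err) break          # stop when no candidate improves the cost
  best.err <- min(errs)
  selected <- c(selected, cand[which.min(errs)])
}
selected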
Applying stepwise selection to the set of variables provided leads to this subset of features:
feat_1 + feat_2 + feat_4 + feat_5 + feat_6 +
feat_7 + feat_8 + feat_9 + feat_10 + feat_11 + feat_12 +
feat_13 + feat_14 + feat_15 + feat_16 + feat_17 + feat_19 +
feat_20 + feat_22 + feat_23 + feat_24 + feat_25 + feat_26 +
feat_28 + feat_29 + feat_30 + feat_33 + feat_35 + feat_36 +
feat_37 + feat_38 + feat_39 + feat_40 + feat_41 + feat_42 +
feat_43 + feat_44 + feat_46 + feat_47 + feat_48 + feat_50 +
feat_51 + feat_53 + feat_54 + feat_55 + feat_56 + feat_57 +
feat_58 + feat_59 + feat_60 + feat_62 + feat_63 + feat_64 +
feat_65 + feat_66 + feat_67 + feat_68 + feat_69 + feat_70 +
feat_71 + feat_72 + feat_73 + feat_74 + feat_75 + feat_76 +
feat_77 + feat_78 + feat_79 + feat_80 + feat_81 + feat_83 +
feat_84 + feat_85 + feat_86 + feat_87 + feat_88 + feat_90 +
feat_91 + feat_92
In total, 73 of the 93 features are selected.
Stepwise selection will not be chosen as the feature selection method, for the following reasons:
• The regression involved is computationally more expensive than PCA and BA.
• The subset of variables is larger than the ones found with PCA and BA, which makes the subsequent learning algorithms computationally more expensive.
PRINCIPAL COMPONENT ANALYSIS
PCA is a statistical procedure that converts a set of often correlated variables into a
set of linearly uncorrelated new variables called principal components.
Principal components are computed as the eigenvectors of the covariance matrix, and
their eigenvalues represent the variance associated with them. Hence the principal
components are sorted by decreasing variance (components with the highest variance first).
Since principal components are uncorrelated, even a small number of them is often
enough to describe the data.
Notice that, with PCA, the variables are projected onto a different space: the 93
original variables are no longer used directly, so no list of selected variables can be
given. Without assessing the model, one can simply take the number of components whose
cumulative explained variance is, for example, 70% or 80% of the total.
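A minimal sketch of this step with base R's prcomp (centering and unit-scaling the features is an assumption):

feat <- train[, grep("^feat_", names(train))]
pca  <- prcomp(feat, center = TRUE, scale. = TRUE)

# Cumulative proportion of the total variance explained by the first k components
cum.var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n.80    <- which(cum.var >= 0.80)[1]        # components needed for 80% of the variance

# Project the data onto the first n.80 principal components
scores <- pca$x[, 1:n.80]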
The following results concern PCA applied to improve the accuracy of the learning algorithms used.
What can be noticed from this analysis is that ME and BER decrease as the number of
components increases, while the LogLoss has a minimum.
The number of components associated with this minimum (23) will be used during the
assessment process for Discriminant Analysis.
What can be noticed from this analysis is that ME, BER and LogLoss all decrease as
the number of components increases: the more components we use, the better.
Since the computation time of Classification Trees is still reasonable with a high
number of components, all components provided by PCA will be used during the
assessment process for Classification Trees.
For Neural Networks, ME and BER decrease as the number of components increases,
while the LogLoss has a minimum.
The number of components associated with this minimum (42) will be taken as the
number of variables to use.
BIVARIATE ANALYSIS
Bivariate analysis is a quantitative statistical procedure that involves the analysis of
two variables in order to see if they are related to each other.
In this case a correlation metric has been used. Variables with high correlation are
strongly related to each other.
The correlation between each variable and the outcome has been computed, and the
variables have been sorted by decreasing correlation (variables most related to the
outcome first).
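A sketch of how such a ranking can be computed; encoding the class label as an integer 1..9 and using the absolute Pearson correlation are assumptions about the exact metric used:

feat <- train[, grep("^feat_", names(train))]
y    <- as.integer(train$target)            # Class_1..Class_9 -> 1..9

corr    <- abs(cor(feat, y))                # correlation of each variable with the outcome
ranking <- order(corr, decreasing = TRUE)   # variable indices, most related first
ranking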
This is the list of variables ranked by importance using bivariate analysis.
14 40 25 15 88 24 36 20 69 8 72 75 41 18 22 38 67 76 90 62 13 2 68 11 66 54 33
9 58 79 55 83 59 57 60 4 80 3 91 92 35 7 26 47 64 44 49 28 71 37 17 19 42 29
43 82 53 46 86 89 61 78 31 23 27 85 87 10 39 50 81 12 48 52 73 1 63 65 74 45 51
77 93 6 5 70 56 34 30 32 21 16 84
The following results concern BA applied to improve the accuracy of the learning algorithms used.
What can be noticed from this analysis is that ME, BER and LogLoss all decrease as
the number of variables increases: the more variables we use, the better.
Since the computation time of Discriminant Analysis is still reasonable with a high
number of variables, all variables provided will be considered.
What can be noticed from this analysis is that ME, BER and LogLoss all decrease as
the number of variables increases: the more variables we use, the better.
Since the computation time of Classification Trees is still reasonable with a high
number of variables, all variables will be considered.
For Neural Networks, ME and BER decrease as the number of variables increases,
while the LogLoss has a minimum.
The number of variables associated with this minimum (25) will be taken as the
number of variables to use.
Summarizing:
• PCA performed better than BA for Discriminant Analysis. The optimal number of variables (23) will be used to assess the model.
• PCA performed better than BA for Classification Trees. Since the more components are used the better, and the computation time of Classification Trees is still reasonable with a high number of components, all components will be used to assess the model.
• PCA performed better than BA for Neural Networks. The optimal number of variables (42) will be used to assess the model. Moreover, Neural Networks are computationally expensive, hence using all the components would be infeasible given the time constraints.
LEARNING ALGORITHMS
The aim of learning algorithms is to learn from training examples and make
predictions on new data. A model is built from the training data in order to make
data-driven predictions on the test data.
The learning algorithms proposed are:
• Discriminant Analysis
• Classification Trees and Random Forest
• Neural networks and Random Neurons
The experiments on these learning algorithms will use the feature selection
method and the number of variables that led to the best results in the “Feature
Selection Methods” section.
To assess the accuracy of each classifier the following cost functions will be used:
• Misclassification error (ME)
• Balanced Error Rate (BER)
• LogLoss
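For reference, here are sketches of the three cost functions, following their standard definitions; truth is a factor of true classes, pred a factor of predicted classes, and prob a matrix of predicted class probabilities with one column per class, in the order of levels(truth):

misclassification.error <- function(truth, pred) {
  # Fraction of wrongly classified samples
  mean(pred != truth)
}

balanced.error.rate <- function(truth, pred) {
  # Mean of the per-class error rates, so rare classes weigh as much as frequent ones
  mean(tapply(pred != truth, truth, mean))
}

log.loss <- function(truth, prob, eps = 1e-15) {
  # Multi-class log loss as used by Kaggle, with probabilities clipped away from 0 and 1
  prob <- pmin(pmax(prob, eps), 1 - eps)
  prob <- prob / rowSums(prob)
  idx  <- cbind(seq_along(truth), as.integer(truth))
  -mean(log(prob[idx]))
}

The clipping in log.loss avoids an infinite penalty when a predicted probability is exactly 0.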
DISCRIMINANT ANALYSIS
Discriminant analysis is a statistical technique that uses a discriminant rule to
partition the sample space into several regions (one for each class), so that a new
sample to be classified is allocated to one of these regions and assigned the
corresponding class.
- For each experiment, 10-fold cross-validation has been used on the training data.
- PCA is used as the feature selection method (see the related section), as it performed
better than BA. The optimal number of variables found (23) will be used.
N.var <- 23
Apply PCA and select N.var components
Split the training set S into 10 disjoint subsets Si
For each subset Si:
  - train the model on S \ Si
  - test the model on Si
Average the errors
For discriminant analysis there is no assessment process to find the best structure,
since there are no structural parameters to tune. Hence the results for this learning
algorithm are the ones presented in the “Feature Selection Methods” section.
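A sketch of one such cross-validated run, assuming MASS::lda as the discriminant-analysis implementation and reusing the PCA scores computed in the PCA sketch above:

library(MASS)

n.var  <- 23
scores <- pca$x[, 1:n.var]                  # PCA scores from the earlier sketch
y      <- train$target
set.seed(1)
folds  <- sample(rep(1:10, length.out = nrow(scores)))

errors <- sapply(1:10, function(k) {
  fit  <- lda(scores[folds != k, ], grouping = y[folds != k])
  pred <- predict(fit, scores[folds == k, ])$class
  mean(pred != y[folds == k])               # misclassification error on the held-out fold
})
mean(errors)                                # cross-validated ME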
CLASSIFICATION TREES AND RANDOM FOREST
Decision trees (classification trees when the outcome is categorical) partition the
input space into mutually exclusive regions, each of which is assigned a specific model.
Each terminal node carries a label indicating the class for its region.
Random Forest is an ensemble technique that combines several non-pruned trees,
which have low bias and high variance, in order to reduce the variance and improve the
model accuracy. Since classification trees are deterministic models (trained on the
same training set, they produce the same result), a bagging technique is needed to
generate a different training subset for each tree.
Structural Identification for Random Forest consists of finding the best number of
trees to use. Hence a search on the space of the number of trees is performed.
- For each experiment, 10-fold cross-validation has been used on the training data.
- A bagging technique has been used to train the different trees of the random forest.
- PCA is used as the feature selection method (see the related section), as it performed
better than BA. All components will be used.
N <- number of training samples
N.var <- 93
Apply PCA and select N.var components
For n.tree in seq(1, 250, by = 10):
  Split the training set S into 10 disjoint subsets Si
  For each subset Si:
    For t in 1:n.tree:
      - draw a sample St of size N / (10 * n.tree) from S \ Si (without replacement)
      - train a tree on St
      - test the tree on Si
    - Average the predicted probabilities of all trees
  Average the errors
Choose the number of trees that leads to the minimum error
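A sketch of a single cross-validation fold for a fixed number of trees, using the randomForest package as a stand-in for the manually bagged trees of the scheme above (the report trains and averages the individual classification trees itself):

library(randomForest)

n.var  <- 93
scores <- pca$x[, 1:n.var]                  # all PCA components
y      <- train$target
set.seed(1)
folds  <- sample(rep(1:10, length.out = nrow(scores)))
k      <- 1                                 # one fold shown for brevity

fit  <- randomForest(x = scores[folds != k, ], y = y[folds != k], ntree = 140)
pred <- predict(fit, scores[folds == k, ], type = "response")
prob <- predict(fit, scores[folds == k, ], type = "prob")

misclassification.error(y[folds == k], pred)
log.loss(y[folds == k], prob)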
The following results concern Random Forest with a varying number of trees.
What can be noticed from this analysis is that ME and BER decrease as the number
of trees increases, while the LogLoss has a minimum.
The number of trees associated with this minimum (140) provides the best accuracy.
NEURAL NETWORKS AND RANDOM NEURONS
Feed-forward artificial neural networks can approximate any function of a given set
of input variables. They are systems of interconnected neurons that, given a set of
weighted inputs, compute an output value.
Parametric identification for neural networks consists of finding the optimal set of
weights. This can be done, for example, using backpropagation: a gradient-based
algorithm that minimizes a cost function. This technique is already implemented in the
R function used to train the neural network.
Structural identification for neural networks consists of finding the best number of
hidden layers and the best number of hidden neurons per layer. In this case only one
hidden layer is used, and a search over the number of hidden neurons is performed
(the number of neurons depends on the number of variables used; a naïve choice is a
number of hidden neurons equal to the mean of the numbers of input and output
neurons).
- For each experiment, 10-fold cross-validation has been used on the training data.
- PCA is used as the feature selection method (see the related section), as it performed
better than BA. The optimal number of variables found (42) will be used.
N <- number of training samples
N.var <- 42
Apply PCA and select N.var variables
For n.h in (N.var * 1/3):(N.var * 2/3):
  Split the training set S into 10 disjoint subsets Si
  For each subset Si:
    - train the NN with n.h hidden neurons on S \ Si
    - test the NN on Si
  Average the errors
Choose the number of hidden neurons that leads to the minimum error
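A sketch of one fold of this search, assuming the nnet package (one hidden layer, softmax output) and the metric functions defined earlier:

library(nnet)

n.var  <- 42
scores <- pca$x[, 1:n.var]
y      <- train$target
set.seed(1)
folds  <- sample(rep(1:10, length.out = nrow(scores)))
k      <- 1                                 # one fold shown for brevity

for (n.h in seq(14, 28, by = 7)) {          # roughly N.var/3 .. 2*N.var/3
  fit  <- nnet(x = scores[folds != k, ], y = class.ind(y[folds != k]),
               size = n.h, softmax = TRUE, MaxNWts = 20000,
               maxit = 200, trace = FALSE)
  prob <- predict(fit, scores[folds == k, ])
  cat("hidden neurons:", n.h, " log loss:", log.loss(y[folds == k], prob), "\n")
}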
What can be noticed is that, once PCA has been applied and the optimal number of
variables is used, the number of hidden neurons within this range does not
significantly affect a single neural network.
Using a very large number of hidden neurons (for example, larger than the number of
variables) leads to overfitting, hence it is not worth doing unless exploited by an
ensemble technique.
Random Neurons is the neural-network equivalent of Random Forest. It is an ensemble
technique that combines several neural networks with low bias and high variance (a
high number of hidden neurons) in order to reduce the variance and improve the model
accuracy.
Neural networks are probabilistic models when the weights are initialized randomly
(they produce different results even when trained on the same training set), hence no
bagging is needed and all the networks can be trained on the same training set.
- For each experiment, 10-fold cross-validation has been used on the training data.
- For this experiment, a number of hidden neurons equal to the number of variables will
be used (a rather large number, in order to produce low-bias, high-variance models that
exploit the ensemble properties).
- PCA is used as the feature selection method (see the related section), as it performed
better than BA. The optimal number of variables found will be used.
N <- number of training samples
N.var <- 42
N.h <- N.var
Apply PCA and select N.var variables
For n.NN in seq(1, 50, by = 10):
  Split the training set S into 10 disjoint subsets Si
  For each subset Si:
    For n in 1:n.NN:
      - train a neural network with N.h hidden neurons on S \ Si
    - test all the neural networks on Si
    - Average the predicted probabilities of all neural networks
  Average the errors
Choose the number of neural networks that leads to the minimum error
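A sketch of the Random Neurons ensemble on a single fold, again assuming nnet: the networks differ only in their random weight initialization, and their predicted probabilities are averaged (illustrative settings; heavy to run on the full data):

library(nnet)

n.var  <- 42
n.h    <- n.var                             # large hidden layer: low bias, high variance
n.NN   <- 50
scores <- pca$x[, 1:n.var]
y      <- train$target
set.seed(1)
folds  <- sample(rep(1:10, length.out = nrow(scores)))
k      <- 1                                 # one fold shown for brevity

probs <- lapply(1:n.NN, function(i) {
  fit <- nnet(x = scores[folds != k, ], y = class.ind(y[folds != k]),
              size = n.h, softmax = TRUE, MaxNWts = 20000,
              maxit = 200, trace = FALSE)
  predict(fit, scores[folds == k, ])
})
avg.prob <- Reduce(`+`, probs) / n.NN       # average the predicted probabilities
log.loss(y[folds == k], avg.prob)

Averaging the predicted probabilities rather than majority-voting keeps the LogLoss well defined for the ensemble.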
What has been found is that Random Neurons increases the model accuracy, hence it
performs better than a single neural network.
CONCLUSIONS
The best combination of models found is composed of:
• PCA
◦ 42 components
• Random Neurons:
◦ 50 neural networks:
▪ 42 hidden neurons each
This model leads to a LogLoss error of 0.47537 on the training data (using cross-validation)
and of 0.58521 on 70% of the test set samples (computed by Kaggle).
Since there is no large difference between the two values, I can say that the model
does not overfit the data.
