INTRODUCTION
The aim of this project is to implement and assess several feature selection methods
and supervised learning algorithms. The most accurate combination will be selected
and used for the Otto Group Product Classification Challenge.
The objective is to build a predictive model able to distinguish between 9 different
product categories.
DATA
The Otto Group provides:
- a training set in CSV (comma-separated values) format that includes 61878 samples
with 93 features each, together with their class labels.
- a test set in CSV (comma-separated values) format that includes 144368 unlabeled
samples with 93 features each.
All variables assume non-negative integer values.
The training set is quite sparse (many variables assume the value 0).
The following plots describe the distribution of the training data.
As can be noticed, almost 50% of the training samples belong to classes 2 and 6,
while classes 1, 4, 5 and 7 are the least populated. This means that training samples
are not equally distributed among the 9 classes.
TECHNIQUES
The techniques used are the following:
• Feature selection methods:
◦ Stepwise selection
◦ Principal Components Analysis (PCA)
◦ Bivariate analysis (BA)
• Supervised learning algorithms:
◦ Discriminant analysis
◦ Classification trees and Random Forest
◦ Neural networks and Random Neurons
These techniques will be assessed to find the best subset of variables for a given
model and the best combination of models.
FEATURE SELECTION METHODS
The aim of feature selection is to select, among all variables, those that are most
relevant and best discriminate the different classes.
The best feature selection method will be selected according to:
• Accuracy of the model using the subset of variables
• Computation time of the model using the subset of variables
The feature selection methods proposed are:
• Stepwise selection
• Principal Component Analysis (PCA)
• Bivariate Analysis (BA)
STEPWISE SELECTION
Stepwise selection belongs to the wrapper methods. It is essentially a search over the
space of variables, where each variable can either be included in or excluded from the
set of features to use.
It combines forward and backward selection: the solution is initialized as the empty
set (forward) or the set of all variables (backward), and variables are progressively
added or removed according to a cost function (for example, the misclassification
error, ME).
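The forward half of this search can be sketched as follows (a toy sketch: `score` is a hypothetical placeholder for fitting a model on a candidate subset and measuring its cost, e.g. ME):

```python
# Toy sketch of forward stepwise selection. `score` stands in for fitting a
# model on a candidate subset and returning a cost to minimize (e.g. ME).
def forward_stepwise(all_features, score):
    selected = []                          # start from the empty set
    best_cost = float("inf")
    improved = True
    while improved:
        improved = False
        for f in all_features:
            if f in selected:
                continue
            cost = score(selected + [f])   # cost of the candidate subset
            if cost < best_cost:
                best_cost, best_feature = cost, f
                improved = True
        if improved:
            selected.append(best_feature)  # keep the most useful feature
    return selected, best_cost

# Toy cost function: only feat_1 and feat_3 reduce the error.
useful = {"feat_1", "feat_3"}
score = lambda subset: 1.0 - 0.4 * len(useful & set(subset))
subset, cost = forward_stepwise(["feat_1", "feat_2", "feat_3"], score)
```

The search stops as soon as no remaining variable lowers the cost; the backward variant is symmetric, starting from the full set and removing variables.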
The application of stepwise selection to the set of variables provided leads to this
subset of features:
feat_1 + feat_2 + feat_4 + feat_5 + feat_6 +
feat_7 + feat_8 + feat_9 + feat_10 + feat_11 + feat_12 +
feat_13 + feat_14 + feat_15 + feat_16 + feat_17 + feat_19 +
feat_20 + feat_22 + feat_23 + feat_24 + feat_25 + feat_26 +
feat_28 + feat_29 + feat_30 + feat_33 + feat_35 + feat_36 +
feat_37 + feat_38 + feat_39 + feat_40 + feat_41 + feat_42 +
feat_43 + feat_44 + feat_46 + feat_47 + feat_48 + feat_50 +
feat_51 + feat_53 + feat_54 + feat_55 + feat_56 + feat_57 +
feat_58 + feat_59 + feat_60 + feat_62 + feat_63 + feat_64 +
feat_65 + feat_66 + feat_67 + feat_68 + feat_69 + feat_70 +
feat_71 + feat_72 + feat_73 + feat_74 + feat_75 + feat_76 +
feat_77 + feat_78 + feat_79 + feat_80 + feat_81 + feat_83 +
feat_84 + feat_85 + feat_86 + feat_87 + feat_88 + feat_90 +
feat_91 + feat_92
In total, 73 out of 93 features are selected.
For the following reasons, stepwise selection will not be chosen as the feature
selection method:
• It is computationally more expensive than PCA and BA.
• The subset of variables is larger than the ones found with PCA and BA, which
leads to computationally more expensive learning algorithms.
PRINCIPAL COMPONENT ANALYSIS
PCA is a statistical procedure that converts a set of (often correlated) variables into a
set of linearly uncorrelated new variables called principal components.
Principal components are computed as the eigenvectors of the covariance matrix,
while the corresponding eigenvalues represent the variance associated with them.
Hence principal components are sorted by decreasing variance (components with the
highest variance first).
Since principal components are uncorrelated, even a small number of them is enough
to describe the data.
Notice that, using PCA, variables are projected onto a different space. This means that
the 93 variables provided are not used any more, hence they cannot be listed. Without
assessing the model, one can take the number of components whose cumulative
explained variance is, for example, 70% or 80% of the total.
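As a concrete sketch of this procedure (on synthetic data; numpy only, all names illustrative):

```python
import numpy as np

# Sketch of PCA via the covariance matrix, keeping enough components to
# reach 80% cumulative explained variance (synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)   # a correlated pair

Xc = X - X.mean(axis=0)                      # center the data
cov = np.cov(Xc, rowvar=False)               # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.80)) + 1    # components covering >= 80%
scores = Xc @ eigvecs[:, :k]                     # data projected on k components
```

The projection `scores` replaces the original variables, which is why the selected features can no longer be listed by name.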
The following results concern PCA applied to improve the accuracy of the learning
algorithms used.
What can be noticed from this analysis is that ME and BER decrease as the number
of components increases, while LOGLOSS has a minimum.
The number of components associated with the minimum value (23) will be used
during the assessment process for Discriminant Analysis.
What can be noticed from this analysis is that ME, BER and LOGLOSS decrease as
the number of components increases. This means that the more components we use,
the better.
Since the computation time of Classification Trees is still reasonable with a high
number of components, all components provided by PCA will be used during the
assessment process for Classification Trees.
What can be noticed from this analysis is that ME and BER decrease as the number
of components increases, while LOGLOSS has a minimum.
The number of components associated with the minimum value (42) will be used as
the number of variables.
BIVARIATE ANALYSIS
Bivariate analysis is a quantitative statistical procedure that involves the analysis of
two variables in order to see if they are related to each other.
In this case a correlation metric has been used. Variables with high correlation are
strongly related to each other.
The correlation between each variable and the outcome has been computed, and
variables have been sorted by decreasing correlation (the variables most related to
the outcome first).
This is the list of variables ranked by importance using bivariate analysis.
14 40 25 15 88 24 36 20 69 8 72 75 41 18 22 38 67 76 90 62 13 2 68 11 66 54 33
9 58 79 55 83 59 57 60 4 80 3 91 92 35 7 26 47 64 44 49 28 71 37 17 19 42 29
43 82 53 46 86 89 61 78 31 23 27 85 87 10 39 50 81 12 48 52 73 1 63 65 74 45 51
77 93 6 5 70 56 34 30 32 21 16 84
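This ranking step can be sketched as follows (synthetic data; note that correlating features with an integer-coded class label, as done here, is a rough heuristic):

```python
import numpy as np

# Sketch of the bivariate ranking: correlate each feature with the outcome
# and sort by decreasing absolute correlation (synthetic data; the outcome
# is coded as an integer class label, as in the report).
rng = np.random.default_rng(1)
y = rng.integers(1, 10, size=300).astype(float)   # classes coded 1..9
X = rng.normal(size=(300, 4))
X[:, 2] += 0.5 * y                                # make feature 3 informative

corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
ranking = np.argsort(corr)[::-1] + 1              # 1-based feature indices
```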
The following results concern BA applied to improve the accuracy of the learning
algorithms used.
What can be noticed from this analysis is that ME, BER and LOGLOSS decrease as
the number of variables increases. This means that the more variables we use, the
better.
Since the computation time of Discriminant Analysis is still reasonable with a high
number of variables, all variables provided will be considered.
What can be noticed from this analysis is that ME, BER and LOGLOSS decrease as
the number of variables increases. This means that the more variables we use, the
better.
Since the computation time of Classification Trees is still reasonable with a high
number of variables, all variables will be used.
What can be noticed from this analysis is that ME and BER decrease as the number
of variables increases, while LOGLOSS has a minimum.
The number of variables associated with the minimum value (25) will be considered
as the number of variables to use.
Summing up:
• PCA performed better than BA for Discriminant Analysis. The optimal
number of variables (23) will be used to assess the model.
• PCA performed better than BA for Classification Trees. Since the more
components are used the better, and the computation time of Classification
Trees is still reasonable with a high number of components, all components
will be used to assess the model.
• PCA performed better than BA for Neural Networks. The optimal number of
variables (42) will be used to assess the model. What's more, Neural Networks
are computationally expensive, hence using all the components would be
unfeasible given the time constraints.
LEARNING ALGORITHMS
The aim of learning algorithms is to learn from training examples and make
predictions on new data. A model is built from the training data in order to make
data-driven predictions on the test data.
The learning algorithms proposed are:
• Discriminant Analysis
• Classification Trees and Random Forest
• Neural networks and Random Neurons
Experiments on these learning algorithms will use the feature selection method and
the number of variables that led to the best result in the “Feature Selection
Methods” section.
To assess the accuracy of each classifier the following cost functions will be used:
• Misclassification error (ME)
• Balanced Error Rate (BER(C))
• LogLoss
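Illustrative implementations of these three cost functions (the report does not give its exact formulas, so these follow the standard definitions):

```python
import numpy as np

# Standard definitions of the three cost functions (illustrative sketch).
def misclassification_error(y_true, y_pred):
    return np.mean(y_true != y_pred)

def balanced_error_rate(y_true, y_pred):
    # Average of the per-class error rates, so rare classes weigh equally.
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] != c) for c in classes]
    return np.mean(per_class)

def log_loss(y_true, proba, eps=1e-15):
    # proba[i, c] = predicted probability that sample i belongs to class c.
    p = np.clip(proba, eps, 1 - eps)
    return -np.mean(np.log(p[np.arange(len(y_true)), y_true]))

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
proba = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]])
me = misclassification_error(y_true, y_pred)    # 1 error out of 4 -> 0.25
ber = balanced_error_rate(y_true, y_pred)       # (0.5 + 0.0) / 2   -> 0.25
ll = log_loss(y_true, proba)
```

BER matters here because, as noted in the Data section, the 9 classes are far from equally populated.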
DISCRIMINANT ANALYSIS
Discriminant analysis is a statistical technique that uses a discriminant rule to
separate the sample space into several sets (one for each class), such that a new
sample to be classified is allocated to one of these sets and hence to the associated
class.
- For each experiment a 10-fold cross-validation has been used on the training data.
- PCA will be used as the feature selection method (see the related section) as it
performed better than BA. The optimal number of variables found (23) will be used.
N.var <- 23
Apply PCA and select N.var variables
Split training set S into 10 disjoint subsets Si
For each subset Si:
    - train the model on S \ Si
    - test the model on Si
Average error
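A minimal runnable version of this cross-validation loop (the `fit`/`predict` pair is a placeholder for the actual discriminant-analysis calls; the toy model below just predicts the majority class):

```python
import numpy as np

# Minimal version of the 10-fold cross-validation loop; `fit` and `predict`
# are placeholders for the actual discriminant-analysis training/testing.
def cross_validate(X, y, fit, predict, k=10, seed=0):
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)   # S without Si
        model = fit(X[train_idx], y[train_idx])
        errors.append(np.mean(predict(model, X[test_idx]) != y[test_idx]))
    return float(np.mean(errors))                          # average error

# Toy model: always predict the majority class of the training fold.
fit = lambda X, y: np.bincount(y).argmax()
predict = lambda model, X: np.full(len(X), model)
X = np.zeros((100, 3))
y = np.array([0] * 70 + [1] * 30)
err = cross_validate(X, y, fit, predict)   # fold errors average to 0.30
```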
For discriminant analysis there is no assessment process to find the best structure,
since no structural parameters are available. Hence the results for this learning
algorithm are the ones presented in the “Feature Selection Methods” section.
CLASSIFICATION TREES AND RANDOM FOREST
Decision trees (classification trees when the outcome is categorical) partition the
input space into mutually exclusive regions, to each of which a specific model is
assigned.
Each terminal node contains a label that indicates the class for the specific region.
Random Forest is an ensemble technique that combines several non-pruned trees
having low bias and high variance in order to reduce the variance and improve the
model accuracy. Since classification trees are deterministic models (trained on the
same training set, they will produce the same result), a bagging technique is needed
to generate a different training subset for each tree.
Structural Identification for Random Forest consists of finding the best number of
trees to use. Hence a search on the space of the number of trees is performed.
- For each experiment a 10-fold cross-validation has been used on the training data.
- A bagging technique has been used to train the different trees of the random forest.
- PCA will be used as the feature selection method (see the related section) as it
performed better than BA. All variables will be used.
N <- number of training samples
N.var <- 93
Apply PCA and select N.var variables
For (n.tree in 1:250 by 10):
    Split training set S into 10 disjoint subsets Si
    For each subset Si:
        For (t in 1:n.tree):
            - generate sample St from S \ Si of size N/(10 * n.tree) (no repetition)
            - train the tree on St
            - test the tree on Si
        - Average predicted probabilities of all trees
    Average error
Choose the number of trees that leads to the minimum error
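The subsample-and-average pattern of the inner loop can be sketched as follows (stub "trees" are used in place of real classification trees, so only the ensemble mechanics are shown):

```python
import numpy as np

# Sketch of the bagging loop: each stub "tree" is trained on a random
# subsample (no repetition) and the ensemble averages predicted probabilities.
rng = np.random.default_rng(0)

def train_stub_tree(y, n_classes):
    # Placeholder "tree": predicts the class frequencies of its subsample.
    probs = np.bincount(y, minlength=n_classes) / len(y)
    return lambda X_new: np.tile(probs, (len(X_new), 1))

def bagged_predict(X_train, y_train, X_test, n_trees, n_classes):
    n = len(y_train)
    probas = []
    for _ in range(n_trees):
        idx = rng.choice(n, size=n // 2, replace=False)  # subsample, no repetition
        tree = train_stub_tree(y_train[idx], n_classes)
        probas.append(tree(X_test))
    return np.mean(probas, axis=0)       # average predicted probabilities

y_train = np.array([0] * 60 + [1] * 40)
X_train = np.zeros((100, 2))
X_test = np.zeros((5, 2))
proba = bagged_predict(X_train, y_train, X_test, n_trees=20, n_classes=2)
```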
The following results concern Random Forest with a variable number of trees.
What can be noticed from this analysis is that ME and BER decrease as the number
of trees increases, while LOGLOSS has a minimum.
The number of trees associated with the minimum (140) provides the best accuracy.
NEURAL NETWORKS AND RANDOM NEURONS
Feed-forward artificial neural networks can estimate any function of certain input
variables. They are systems of interconnected neurons that, given a set of weighted
inputs, can compute an output value.
Parametric identification for neural networks consists of finding the optimal set of
weights. This can be done, for example, using backpropagation: a gradient-based
algorithm that minimizes a cost function. This technique is already implemented in
the R function called to train the neural network.
Structural identification for neural networks consists of finding the best number of
hidden layers and the best number of hidden neurons per layer. In this case only one
hidden layer is used, and a search over the space of the number of hidden neurons is
performed (the number of neurons will depend on the number of variables used; a
naïve choice is a number of hidden neurons equal to the mean of the numbers of
input and output neurons).
- For each experiment a 10-fold cross-validation has been used on the training data.
- PCA will be used as the feature selection method (see the related section) as it
performed better than BA. The optimal number of variables found (42) will be used.
N <- number of training samples
N.var <- 42
Apply PCA and select N.var variables
For (n.h in (N.var * 1/3):(N.var * 2/3)):
    Split training set S into 10 disjoint subsets Si
    For each subset Si:
        - train the NN with n.h hidden neurons on S \ Si
        - test the NN on Si
    Average error
Choose the number of hidden neurons that leads to the minimum error
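The structural search itself is a simple grid search over sizes; a minimal sketch (`cv_error` is a placeholder for the full PCA + 10-fold cross-validation evaluation of one network size):

```python
# Sketch of the structural search: `cv_error` stands in for the full
# PCA + 10-fold cross-validation evaluation of one hidden-layer size.
def best_hidden_size(n_var, cv_error):
    sizes = range(n_var // 3, 2 * n_var // 3 + 1)   # (N.var/3)..(2*N.var/3)
    return min(sizes, key=cv_error)                  # size with minimum error

# Toy error curve with its minimum at 20 hidden neurons.
best = best_hidden_size(42, lambda h: (h - 20) ** 2)
```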
What can be noticed is that, once PCA is applied and the optimal number of
variables is used, the number of hidden neurons within this range does not
significantly affect a single neural network.
Using a very large number of hidden neurons (for example, larger than the number
of variables) leads to overfitting, hence it is not worth it unless exploited by an
ensemble technique.
Random Neurons is the neural-network equivalent of Random Forest. It is an
ensemble technique that combines several neural networks with low bias and high
variance (a high number of hidden neurons) in order to reduce the variance and
improve the model accuracy.
Neural networks are probabilistic models when weights are initialized randomly
(they will produce different results even if trained on the same training set), hence
they can all be trained on the same training set.
- For each experiment a 10-fold cross-validation has been used on the training data.
- For this experiment a number of hidden neurons equal to the number of variables
will be used (it is a fairly large number, in order to produce low-bias, high-variance
models that exploit the ensemble properties).
- PCA will be used as the feature selection method (see the related section) as it
performed better than BA. The optimal number of variables found (42) will be used.
N <- number of training samples
N.var <- 42
N.h <- N.var
Apply PCA and select N.var variables
For (n.NN in 1:50 by 10):
    Split training set S into 10 disjoint subsets Si
    For each subset Si:
        For (n in 1:n.NN):
            - train a neural network with N.h hidden neurons on S \ Si
        - test all the neural networks on Si
        - Average predicted probabilities of all neural networks
    Average error
Choose the number of neural networks that leads to the minimum error
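The variance-reduction effect the ensemble relies on can be illustrated numerically (noisy probability estimates stand in for the outputs of individually initialized networks; no actual network is trained):

```python
import numpy as np

# Illustration of why averaging helps: simulate noisy probability estimates
# from randomly initialized models and compare single-model vs ensemble variance.
rng = np.random.default_rng(0)
true_p = 0.7                                   # "true" class probability
noise = 0.1                                    # spread due to random initialization

single = rng.normal(true_p, noise, size=10000)                        # one model
ensemble = rng.normal(true_p, noise, size=(10000, 50)).mean(axis=1)   # 50 averaged

var_single, var_ensemble = single.var(), ensemble.var()
```

For independent models the variance of the average shrinks roughly by a factor equal to the ensemble size, which is why many high-variance networks can outperform one.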
What has been found is that Random Neurons increases the model accuracy, hence
it performed better than a single neural network.
CONCLUSIONS
The best combination of models found is composed of:
• PCA
◦ 42 components
• Random Neurons:
◦ 50 neural networks:
▪ 42 hidden neurons each
This model leads to a LogLoss error of 0.47537 on the training data (using cross-
validation) and of 0.58521 on the 70% of test set samples scored by Kaggle.
Since there is not a big difference between the two values, I can say that the model
does not overfit the data.