A STUDY ON
THE USE OF DIFFERENT CLASSIFIERS
FOR MACHINE LEARNING
PROJECT REPORT
Submitted by
AMAN SONI
B.TECH PART-II
in
ELECTRONICS ENGINEERING
Indian Institute of Technology (B.H.U.)
Varanasi – 221005
JUNE 2016
ACKNOWLEDGEMENT
The internship opportunity I had at the Indian School of Mines (I.S.M.)
Dhanbad was a great chance for learning and professional development.
I therefore consider myself very lucky to have been given the
opportunity to be a part of it. I am also grateful for the chance to
meet so many wonderful people and professionals who guided me through
this internship period.
I am using this opportunity to express my deepest gratitude and special
thanks to Dr. Haider Banka, who, in spite of being busy with his duties,
took time out to hear me, guide me, and keep me on the correct path.
I express my deepest thanks to Prof. Debjani Mitra for taking part in
useful decisions, giving necessary advice and guidance, and arranging
all the facilities that made life easier. I take this moment to
gratefully acknowledge her contribution.
I perceive this opportunity as a big milestone in my career development.
I will strive to use the skills and knowledge I have gained in the best
possible way, and will continue to work on improving them, in order to
attain my desired career objectives.
Sincerely,
AMAN SONI
I.S.M. DHANBAD
21/06/2016
CERTIFICATE
This is to certify that Mr. Aman Soni, S/O Mr. Rakesh Soni, an
Electronics Engineering undergraduate from Indian Institute of
Technology (B.H.U.) Varanasi, successfully completed a four-week
internship (from 16th May, 2016 to 10th June, 2016) at the Indian
School of Mines (I.S.M.) Dhanbad. During the period of his internship
programme with us, he was found to be punctual, hardworking and
inquisitive.
We wish him every success in life.
_____________________________________
Prof. Debjani Mitra
H.O.D. Electronics Engineering
I.S.M. Dhanbad
INTRODUCTION
Machine learning has become an indispensable tool today. Be it face
recognition on Facebook or movie recommendations on Netflix, machine
learning is now used everywhere. Since a large number of websites rely
on it for various purposes, its accuracy is of prime importance. For
instance, machine learning comes in handy when recommending products to
a user based on his previous purchases. For a company to increase its
sales, it is very important to recommend relevant products to a
particular customer, so that he neither misses out on great products
nor feels continuously annoyed by irrelevant recommendations.
Similarly, machine learning finds application in speech-to-text
conversion in personal assistants such as Siri offered by Apple and
Cortana offered by Microsoft, and a user's liking for a particular
brand depends on how accurate that conversion is.

In this mini-project, I have studied various classifiers and trained
them on various data samples to obtain maximum accuracy. The data
covers both classification and regression problems. In classification
problems, the data is classified into two or more classes, whereas in
regression the outcome is numerical, as in house-price prediction. For
each training set, I trained different classifiers to fit the data and
then measured the accuracy on a separate test dataset. I then varied
the parameters of each classifier and noted their influence on the
accuracy. Finally, I contrasted the different classifiers to determine
which one gives the maximum accuracy on the given dataset.
Short Description Of Various Classifiers Used:
1. Linear Regression
In linear regression, the outcome is a number which can take on
different values. Consider a training set such as housing data: given
the various parameters of a house, such as its size in sq feet, our
task is to predict the price of the house by learning from the given
data. The features of the house will be denoted by the variable x, the
number of training examples by m, and the corresponding unknown
parameters to be learned by the variable theta (θ). The same notation
will be followed for the other classifiers as well. Now consider the
case of house-price prediction where only one feature (say, size) is
given. The hypothesis is then a straight line, and the squared error
cost function measures how far it lies from the training points:

    h_\theta(x) = \theta_0 + \theta_1 x

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

After minimizing the squared error cost function over the parameters
theta, we obtain the model that best fits the training data. New data
can then be fed into the hypothesis to predict the house price.
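As an illustration, the procedure above can be written as a short NumPy
sketch; the housing numbers, learning rate, and iteration count here
are made up purely for the example.

    import numpy as np

    # Toy housing data: size in 1000s of sq feet (x) and price (y); values are illustrative.
    x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
    y = np.array([10.0, 18.0, 27.0, 34.0, 44.0])

    m = len(x)                             # number of training examples
    X = np.column_stack([np.ones(m), x])   # prepend a column of ones for theta_0
    theta = np.zeros(2)                    # parameters [theta_0, theta_1]
    alpha = 0.1                            # learning rate

    for _ in range(1000):                  # batch gradient descent on J(theta)
        theta -= alpha * X.T @ (X @ theta - y) / m

    print(theta)                           # learned parameters
    print(np.array([1.0, 1.8]) @ theta)    # predicted price for a 1800 sq ft house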
2. Logistic Regression
Logistic regression is used to classify samples into two different
classes. The hypothesis for logistic regression is defined as:

    h_\theta(x) = g(\theta^T x)

where the function g is the sigmoid function, defined as:

    g(z) = \frac{1}{1 + e^{-z}}

The cost function in logistic regression is given as:

    J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]

Again, the goal is to minimize the cost function over the unknown
parameters theta. The parameters so obtained best fit the model and
produce the least possible error. New data is then fed into the
hypothesis function, which returns a value between 0 and 1: the
probability that the sample belongs to a particular class. For
multiclass classification we use the one-vs-all method, where each
class is trained against all the other classes and a model is fit for
it. When a new sample is to be tested, it is fed into the hypothesis of
each of these classes, and the class whose hypothesis returns the
maximum value is the predicted class.
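A minimal NumPy sketch of the sigmoid hypothesis and this one-vs-all
prediction rule; the parameter matrix Theta below is a made-up stand-in
(one trained row per class), not values from this project.

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def predict_one_vs_all(Theta, x):
        # Theta: (num_classes, n+1) matrix with one trained parameter row per class.
        x = np.concatenate([[1.0], x])   # prepend 1 for the intercept term
        probs = sigmoid(Theta @ x)       # h_theta(x) for every class at once
        return int(np.argmax(probs))     # class whose hypothesis is maximal

    # Illustrative use with 3 classes and 2 features:
    Theta = np.array([[ 0.1,  2.0, -1.0],
                      [-0.3, -1.5,  0.5],
                      [ 0.0,  0.2,  1.2]])
    print(predict_one_vs_all(Theta, np.array([0.4, -0.7])))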
3. Neural Network Using Backpropagation
Neural networks are used to train complex non-linear hypotheses that
fit the training data more accurately. A typical neural network
consists of an input layer, one or more hidden layers, and an output
layer. The input layer corresponds to the input features. These are
then converted into more complex features, which are represented by
the hidden layers. Finally, the output layer represents the hypothesis.
To determine the theta parameters corresponding to the hidden layers,
the backpropagation algorithm is used: the error is propagated in the
backward direction, just as the activations of the hidden layers are
calculated by forward propagation, but in reverse. For multiclass
classification, the output layer contains as many neurons as there are
classes; it thus represents a vector of zeros and ones, where the index
containing the one gives the class of the sample. Unlike logistic
regression, there is therefore no need for one-vs-all classification
when a neural network is used as the classifier.
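For concreteness, here is a small NumPy sketch of forward propagation
through one hidden layer, producing one output per class as described
above; the weights are random stand-ins rather than trained values.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_classes = 625, 100, 15     # sizes used later in this report

    Theta1 = rng.normal(scale=0.01, size=(n_hidden, n_in + 1))       # input -> hidden
    Theta2 = rng.normal(scale=0.01, size=(n_classes, n_hidden + 1))  # hidden -> output

    x = rng.integers(0, 2, size=n_in).astype(float)      # one 625-pixel binary image

    a1 = np.concatenate([[1.0], x])                      # input layer plus bias unit
    a2 = np.concatenate([[1.0], sigmoid(Theta1 @ a1)])   # hidden layer activations
    h = sigmoid(Theta2 @ a2)                             # output layer, one value per class

    print(int(np.argmax(h)))   # index of the 'one' in the one-hot target vector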
4. Support Vector Machine
A Support Vector Machine works in a manner similar to logistic
regression. For instance, we saw that for regularized logistic
regression our goal was:

    \min_\theta \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log h_\theta(x^{(i)}) - (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

where λ is the regularization parameter and n is the number of
features. Similarly, for the Support Vector Machine our goal will be:

    \min_\theta C \sum_{i=1}^{m} \left[ y^{(i)} \, \text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \, \text{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2

where C is a parameter that behaves like the inverse of λ, and cost1
and cost0 are functions defined almost like the corresponding logistic
regression terms, except that they are piecewise linear: cost1(z) is
zero for z ≥ 1 and grows linearly as z falls below 1, while cost0(z) is
zero for z ≤ −1 and grows linearly as z rises above −1.

Also, while using an SVM we can choose among various kernels, which are
functions that transform the input features before they are used to
train the classifier: the linear kernel, the Gaussian kernel, the
polynomial kernel, and many more. The effect of using these kernels on
the training accuracy has been noted during the project.
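In scikit-learn, for instance, switching kernels is a one-line change;
this sketch only demonstrates the interface on a small synthetic
dataset, and the C and gamma settings are arbitrary.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))                    # synthetic 2-feature samples
    y = (X[:, 0]**2 + X[:, 1]**2 > 1).astype(int)    # non-linear class boundary

    for kernel in ("linear", "rbf", "poly"):         # 'rbf' is the Gaussian kernel
        clf = SVC(kernel=kernel, C=1.0, gamma="scale").fit(X, y)
        print(kernel, clf.score(X, y))               # training accuracy per kernel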
Using Different Classifiers
1. Classification Data
For the classification problem, the data consisted of 15 classes of
hand gestures, with more than 20 images available per class. Each image
belonging to a particular class represents one training example. To
extract the features from an image, the image was first resized to
25 × 25. Each pixel of this 25 × 25 matrix is either 1 or 0,
corresponding to a white or a black pixel, and the matrix was flattened
into a row vector of dimension 1 × 625. Each image therefore
contributes 625 pixel values, so the number of features per training
example is 625. For training the model, the data was split into 225
training examples (15 from each of the 15 classes) and 75
cross-validation examples (5 from each of the 15 classes), with the
features extracted as sketched below.
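A sketch of this feature extraction with Pillow and NumPy; the file
path and the white-on-black convention are assumptions made for
illustration.

    import numpy as np
    from PIL import Image

    def image_to_features(path, size=25):
        # Load the gesture image, convert to grayscale, and resize to size x size.
        img = Image.open(path).convert("L").resize((size, size))
        # Threshold to a binary matrix: 1 for white pixels, 0 for black.
        binary = (np.asarray(img) > 127).astype(int)
        # Flatten the size x size matrix into a 1 x (size*size) feature row vector.
        return binary.reshape(1, -1)

    # features = image_to_features("gestures/class01/sample01.png")  # shape (1, 625)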
A. Using Logistic Regression:
Using logistic regression with the one-vs-all method and 150 iterations
of gradient descent per class, the training accuracy obtained was
87.89%. As the number of features (625) was greater than the number of
training examples (225), increasing the number of training examples
slightly increased the accuracy by reducing the problem of overfitting.
The accuracy also improved with regularization, reaching 89.6% at
λ = 1, again as a consequence of less overfitting. To speed up the
algorithm, the learning rate was initially set to 0.001 and then slowly
increased by 5% per step as the cost function started converging.
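In scikit-learn the same setup might look like the sketch below; note
that its C parameter acts as the inverse of λ, and the pixel data here
is a random stand-in with the shapes used in this experiment.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Stand-in data: 225 training images, 625 binary pixel features, 15 classes.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(225, 625)).astype(float)
    y = np.repeat(np.arange(15), 15)

    # One-vs-all regularized logistic regression; C = 1.0 plays the role of 1/lambda.
    clf = LogisticRegression(C=1.0, max_iter=150, multi_class="ovr")
    clf.fit(X, y)
    print(clf.score(X, y))   # training accuracy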
B. Using Support Vector Machine:
Using a Support Vector Machine, the accuracy obtained with the linear
kernel was 90% and with the Gaussian kernel 82.06%. Again the results
follow, as the number of features here is quite large compared to the
number of training examples, so deriving more complex features from the
given ones leads to overfitting. However, by reducing the parameter C
of the SVM, the results with the Gaussian kernel matched those of the
linear kernel, since reducing C amounts to increasing the
regularization and thereby avoids overfitting. Reducing C with the
linear kernel beyond a certain point, however, started showing a
decrease in accuracy due to underfitting; in this case, using more
training samples didn't help, as the model was underfitting. With C set
to 1, using more training examples slightly increased the accuracy. All
of these cases showed a decline in accuracy when the number of features
was increased from 625 to 2500 (i.e. the image was resized to 50 × 50).
Again the result follows, as the number of features is then very large
in comparison to the number of training examples, so the model is very
prone to overfitting.

We find that, since we are using the linear kernel in the SVM here, the
results are almost the same as those of regularized logistic
regression. However, when the number of features was reduced to 225
(i.e. by resizing the image to 15 × 15), the SVM with the Gaussian
kernel proved a better alternative than logistic regression, as the SVM
can learn more complex features through its kernel function.
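The effect of C described above can be explored with a loop like the
following sketch; the data is again a random stand-in with this
experiment's shapes, and the C values are just sample points.

    import numpy as np
    from sklearn.svm import SVC

    # Stand-in data: 225 training / 75 cross-validation images, 625 features, 15 classes.
    rng = np.random.default_rng(0)
    X_tr = rng.integers(0, 2, size=(225, 625)).astype(float)
    y_tr = np.repeat(np.arange(15), 15)
    X_cv = rng.integers(0, 2, size=(75, 625)).astype(float)
    y_cv = np.repeat(np.arange(15), 5)

    # Smaller C means stronger regularization (C behaves like 1/lambda).
    for kernel in ("linear", "rbf"):                   # 'rbf' = Gaussian kernel
        for C in (0.01, 0.1, 1.0, 10.0):
            clf = SVC(kernel=kernel, C=C, gamma="scale").fit(X_tr, y_tr)
            print(kernel, C, clf.score(X_cv, y_cv))    # cross-validation accuracy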
C. Using Backpropagation:
Using backpropagation, the accuracy obtained was 84.74%. This was with
one hidden layer containing 100 neurons. When the number of neurons in
the hidden layer was reduced to 10, the accuracy decreased to less than
50%. This was a result of underfitting, as the model was unable to
represent the complex relationship between the input and the output
with just 10 neurons in the hidden layer. On the other hand, when the
number of neurons in the hidden layer was increased to 115, the
accuracy became 82.63%, which shows that the model had started
overfitting the data. This can be reduced by introducing a
regularization parameter.
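scikit-learn's MLPClassifier can reproduce this kind of experiment;
below is a sketch with the hidden layer sizes tried above, again on
random stand-in data rather than the actual gesture images.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(225, 625)).astype(float)   # stand-in training images
    y = np.repeat(np.arange(15), 15)                        # 15 gesture classes

    # Compare the hidden layer sizes discussed above: 10 (underfits), 100, and 115.
    for n_hidden in (10, 100, 115):
        net = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=500,
                            random_state=0).fit(X, y)
        print(n_hidden, net.score(X, y))   # training accuracy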
Overall, we find that the SVM gives the best results among these three
classifiers and is also computationally less expensive than
backpropagation. With backpropagation, on the other hand, we have to
experiment with the number of hidden layers and the number of neurons
in each hidden layer to obtain the best possible results, and
increasing the number of hidden layers and neurons makes the
computation quite slow and needs more memory.

Using these facts about the various classifiers, one can move in the
right direction to increase the accuracy and performance of the
classifier, without unnecessarily wasting time gathering more training
data or computing more features when they will not contribute towards
increasing the accuracy.
2. Regression Data:
The regression data is for house-price prediction, with 13 features and
506 available samples. The accuracy/performance of a classifier on a
regression problem is obtained by calculating the squared error on the
test dataset. The following classifiers were used to fit the training
data:
A. Linear Regression:
For training with linear regression, the data was first split into a
training set of 350 examples and a test set of 156 examples. The
features varied a lot in scale: some are binary, taking only 0 or 1 as
their value, while others take relatively large values, such as the
size (in sq feet) of the house. So feature scaling was performed before
training: each feature was reduced by its mean value and then divided
by its range (i.e. the difference between the maximum and minimum value
of that feature). After this, the data was randomly shuffled and
gradient descent was applied to minimize the squared error cost
function. With the learning rate alpha set to 0.001, the squared error
obtained on the test set after 3000 iterations was 20.2396.
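The scaling step might look like this minimal NumPy sketch; the small
matrix X is a made-up stand-in for the 13 housing features.

    import numpy as np

    # Stand-in feature matrix: rows are houses, columns are features of very
    # different scales (e.g. a binary flag and a size in sq feet).
    X = np.array([[0.0, 1200.0],
                  [1.0, 2400.0],
                  [0.0, 1800.0],
                  [1.0, 3000.0]])

    mu = X.mean(axis=0)                     # per-feature mean
    span = X.max(axis=0) - X.min(axis=0)    # per-feature range (max - min)
    X_scaled = (X - mu) / span              # mean-normalized, range-scaled features

    print(X_scaled)   # every column now lies roughly within [-0.5, 0.5]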
The number of training samples was then increased from 350 to 450,
which led to a squared error on the test set of 24.7069. The result
follows: the number of training samples is already far greater than the
number of features, and the model suffers from underfitting (high
bias), so adding more training data cannot be expected to reduce the
test error.
B. Using Support Vector Machine:
Using an SVM with a linear kernel and regularization parameter C equal
to 1 gave a relatively large squared error of 94.7264 on the test
dataset. This follows, as both the use of a linear kernel and the heavy
regularization lead to high bias. However, using a Gaussian kernel with
C set to 60000 gave much better performance, with a squared error on
the test set of only 12.0428. This shows that by reducing the
regularization and using the Gaussian kernel, we were able to create
complex features from the 13 given ones and thus fit a more appropriate
model to the training data. It is also intuitive that beyond a certain
value, increasing C will lead to overfitting, as was evident here: when
C was set to 6000000, the squared error increased slightly to 15.2026.

So, unlike the previous classification problem, where the number of
features was far greater than the number of available training samples,
here the number of features is very small, and we need to add
polynomial feature terms to fit the training data more completely. As
found above, one way is to reduce the regularization parameter and
another is to use a kernel function; apart from these, we could
manually gather more features.
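A sketch of this comparison with scikit-learn's SVR, on synthetic
stand-in data shaped like the housing set (506 samples, 13 scaled
features); the C values mirror those tried above.

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.normal(size=(506, 13))                 # stand-in for the scaled features
    y = X @ rng.normal(size=13) + rng.normal(scale=0.5, size=506)  # synthetic prices

    X_tr, y_tr = X[:350], y[:350]                  # 350 training / 156 test samples
    X_te, y_te = X[350:], y[350:]

    for kernel, C in (("linear", 1.0), ("rbf", 60000.0), ("rbf", 6000000.0)):
        svr = SVR(kernel=kernel, C=C).fit(X_tr, y_tr)
        err = np.mean((svr.predict(X_te) - y_te) ** 2)   # squared error on test set
        print(kernel, C, err)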
C. Using Backpropagation:
As has become evident from the above classifiers, we need to derive
more and more features. Using a single hidden layer with 13 neurons
gave a training error of 23.0929 and a test error of 57.4974, whereas a
single hidden layer with 100 neurons gave a training error of 1.5606
and a test error of 43.7274; the larger network fits the training data
far better, though the gap between its training and test errors shows
that it has begun to overfit.
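The same comparison with scikit-learn's MLPRegressor, again on
synthetic stand-in data with this problem's shapes; the two hidden
layer sizes are the ones reported above.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(506, 13))
    y = X @ rng.normal(size=13) + rng.normal(scale=0.5, size=506)
    X_tr, y_tr, X_te, y_te = X[:350], y[:350], X[350:], y[350:]

    for n_hidden in (13, 100):
        net = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=2000,
                           random_state=0).fit(X_tr, y_tr)
        tr_err = np.mean((net.predict(X_tr) - y_tr) ** 2)
        te_err = np.mean((net.predict(X_te) - y_te) ** 2)
        print(n_hidden, tr_err, te_err)   # training vs. test squared error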
CONCLUSION
From the above analysis of different classifiers, it is clear that
different datasets require different treatment to optimize the accuracy
of the classifier. For training data that suffers from high bias, it
would be quite useless to obtain more training samples, which would
only waste time in gathering or manually creating more data. A better
use of time would be to analyze the current features, derive more
features from the existing ones, or add polynomial features, so as to
fit complex non-linear hypotheses. Conversely, for data suffering from
high variance, obtaining more training samples is a wise choice to
overcome the problem of overfitting. These decisions help in
determining the right way to proceed to increase the accuracy of our
classifier.
REFERENCES
1. UCI Machine Learning Repository, Housing Data Set.
https://archive.ics.uci.edu/ml/datasets/Housing
2. Machine Learning, course offered by Stanford University on the
Coursera platform.
https://www.coursera.org/learn/machine-learning
3. Ian H. Witten, Eibe Frank and Mark A. Hall, Data Mining: Practical
Machine Learning Tools and Techniques.
4. James A. Freeman and David M. Skapura, Neural Networks: Algorithms,
Applications, and Programming Techniques.
5. Chih-Chung Chang and Chih-Jen Lin, LIBSVM: A Library for Support
Vector Machines.
https://www.csie.ntu.edu.tw/~cjlin/libsvm