The document is a final project report on facial expression recognition using machine learning models. It explores a facial expression dataset, performs dimensionality reduction using PCA and LDA, and builds classification models including SVM and a neural network. The models aim to classify images as happy or sad, with the neural network achieving 65.8% accuracy. Further improvements could involve tuning model parameters and using a convolutional neural network.
Data Science with Python
Facial Expression Recognition
Final Project Report - OPIM 5894 - Data Science with Python
Team Brogrammers: Santanu Paul, Sree Inturi, Saurav Gupta, Vibhuti Upadhyay, Sunender Pothula
Nov 30 2017
Table of Contents
Facial Expression Recognition
1. Introduction
1.1 Background – What is Representation Learning?
1.2 Research Objectives
2. Data Description and Exploration
2.1 About the Dataset
2.2 Data Exploration
2.3 Data Preprocessing
3. Dimensionality Reduction
3.1 Curse of Dimensionality
3.2 Principal Component Analysis
Takeaway from the plot:
Visualizing the Eigen Value:
Interactive visualizations of PCA representation
Improvements:
3.3 Linear Discriminant Analysis
4. Modeling
4.1 Support Vector Machine
4.2 Neural Networks
4.3 Conclusion
5. Scope for Improvement
5.1 CNN (Convolutional Neural Network) and Parameter Tuning
Attachments – Python Notebooks and Code
1. Introduction
1.1 Background – What is Representation Learning?
Older machine learning algorithms rely on the input being a set of
features and then learn a classifier, regressor, etc. on top of that. Most of these features are
hand crafted, i.e. designed by humans. Classical examples of features in computer vision
include SIFT, LBP, etc. The problem with these is that they are designed by humans based
on heuristics. Images can be represented using these features and ML algorithms can be
applied on top of that. However, they may not be optimal in terms of the objective
function, i.e., it may be possible to design better features that lead to lower objective
function values. Instead of hand crafting these image representations, we can learn them.
That is known as representation learning. We can have a neural network which takes the
image as an input and outputs a vector, which is the feature representation of the image. This
is the representation learner. It can be followed by another neural network that acts as the
classifier, regressor, etc.
For example: A wheel has a geometric shape, but its image may be complicated by
shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the fender of
the car or an object in the foreground obscuring part of the wheel, and so on. We can try to
manually describe what a wheel should look like and how it can be represented. Say, it should
be circular, be black in color, have treads, etc. But these are all hand-crafted features and
may not generalize to all situations. For example, if you look at the wheel from a different
angle, it might be oval in shape. Or the lighting may cause it to have lighter and darker
patches. These kinds of variations are hard to account for manually. Instead, we can let the
representation learning neural network learn them from data by giving it several positive and
negative examples of a wheel and training it end to end.
1.2 Research Objectives
The major objective of this project is to classify an image using its facial expression.
(1) Image Classification
We present a method for the classification of facial expressions from the analysis of
facial deformations. The classification process is based on a neural network which
classifies an image as “Happy” or “Sad”. Our neural network model extracts an expression
skeleton of facial features. We also demonstrate the efficiency of our classifier by comparing
it with SVM classifiers trained on PCA and LDA representations of the same data.
2. Data Description and Exploration
The data set used in this project is from Challenges in Representation Learning: Facial
Expression Recognition Challenge, and contains 48x48 pixel grayscale images of faces. The
faces have been automatically registered so that the face is more or less centered and occupies
about the same amount of space in each image. The task is to categorize each face, based on
the emotion shown in the facial expression, into one of two categories (3=Happy, 4=Sad).
2.1 About the Dataset
The training set consists of 15,066 examples (Happy: 8,989, Sad: 6,077) and two columns,
"emotion" and "pixels". The "emotion" column contains a numeric code, either 3 or 4, for
the emotion that is present in the image. The "pixels" column contains a string surrounded in
quotes for each image. The contents of this string are space-separated pixel values in row-major
order.
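As a minimal sketch of how one "pixels" entry could be turned into an image array, assuming NumPy is available: the function name `parse_pixels` and the synthetic example string are illustrative and do not come from the project notebooks.

```python
import numpy as np

def parse_pixels(pixel_string, side=48):
    """Convert a space-separated pixel string into a (side, side) uint8 image."""
    values = np.array([int(v) for v in pixel_string.split()], dtype=np.uint8)
    if values.size != side * side:
        raise ValueError(f"expected {side * side} values, got {values.size}")
    # Row-major order: the first `side` values form the first image row.
    return values.reshape(side, side)

# Hypothetical example: a synthetic string standing in for one "pixels" entry.
example = " ".join(str(i % 256) for i in range(48 * 48))
image = parse_pixels(example)
print(image.shape)
```

Flattening the array again with `image.ravel()` gives back the 2304 values in their original order, which is the layout used when each pixel becomes its own column.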
Similarly, the test set used for the leaderboard consists of 3,589 examples and contains only
the "pixels" column; our task is to predict the emotion (Happy or Sad).
There were no missing values in our data set; it was a clean dataset.
Value counts of our data points: our data set is reasonably balanced (about 60% Happy and 40% Sad).
Figure: Screenshot of the data
2.2 Data Exploration
As the pixel information is stored in a single "pixels" column, our first goal is to split that column into multiple fields,
so that we get a rough idea of what a final 48x48 picture looks like.
Let’s see what the emotions look like:
Figure: Happy Face and Sad Face
2.3 Data Preprocessing
Standardization: Standardization is a good practice for many machine learning algorithms.
Although our data is already on the same scale, i.e. pixel values from 0 to 255, we still chose to standardize
our data.
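A minimal sketch of column-wise standardization, assuming z-scoring of each pixel column with NumPy. The notebooks may have used a library scaler instead. The `standardize` helper and the random data are illustrative only.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Scale each column of X to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Guard against constant columns (std == 0), which would divide by zero.
    return (X - mean) / np.where(std < eps, 1.0, std)

rng = np.random.default_rng(0)
# Synthetic stand-in for the pixel matrix: 100 images, 2304 pixel columns.
X = rng.integers(0, 256, size=(100, 2304)).astype(np.float64)
X_std = standardize(X)
```

After this step every column has mean 0 and standard deviation 1, which keeps bright and dark pixel positions from dominating the covariance matrix used later by PCA.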
3. Dimensionality Reduction
3.1 Curse of Dimensionality
This term is often thrown about, especially when PCA and LDA are thrown into the mix.
The phrase refers to how our perfectly good and reliable machine learning methods may
suddenly perform badly when we are dealing with a very high-dimensional space. But what exactly
do these two acronyms do? They are essentially transformation methods used for
dimensionality reduction. If we are able to project our data from a higher-dimensional
space to a lower one while keeping most of the relevant information, that makes life a lot
easier for our learning methods.
In our data, each 48 x 48 pixel image contributes 2304 columns. A model trained
in such a high-dimensional space could perform badly, so it is the perfect time to introduce
dimensionality reduction methods.
3.2 Principal Component Analysis
In a nutshell, PCA is a linear transformation algorithm that seeks to project the original
features of our data onto a smaller set of features (or subspace) while still retaining most of the
information. To do this the algorithm tries to find the most appropriate directions/angles (which
are the principal components) that maximize the variance in the new subspace.
We know that principal components are orthogonal to each other. As such when
generating the covariance matrix in our new subspace, the off-diagonal values of the covariance
matrix will be zero and only the diagonals (or eigenvalues) will be non-zero. It is these diagonal
values that represent the variances of the principal components i.e. the information about the
variability of our features.
This is what our final preprocessed data looks like:
The method is as follows:
1. Standardize the data (already done)
2. Calculate the eigenvectors and eigenvalues of the covariance matrix
3. Create a list of (eigenvalue, eigenvector) tuples
4. Sort the (eigenvalue, eigenvector) pairs from high to low
5. Calculate the explained variance from the eigenvalues
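The five steps above can be sketched with NumPy as follows. This is an illustrative version on a small synthetic matrix, not the notebook code, and the variable names (`eig_pairs`, `var_exp`, `cum_var_exp`) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the pixel matrix (rows = images, columns = features).
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

# Step 1: standardize each column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: eigen-decomposition of the covariance matrix.
cov = np.cov(X_std, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov)  # eigh suits symmetric matrices

# Steps 3-4: pair each eigenvalue with its eigenvector, sort high to low.
eig_pairs = [(eig_vals[i], eig_vecs[:, i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)

# Step 5: individual and cumulative explained variance, in percent.
total = sum(val for val, _ in eig_pairs)
var_exp = [100.0 * val / total for val, _ in eig_pairs]
cum_var_exp = np.cumsum(var_exp)
print(np.round(cum_var_exp, 1))
```

On the real data the same code would operate on a 2304 x 2304 covariance matrix, so the eigen-decomposition is the expensive step.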
Takeaway from the plot:
There are two plots above, a smaller one embedded within the larger plot. The smaller
plot (green and red) shows the distribution of the individual and explained variances across all
features, while the larger plot (golden and black) portrays a zoomed section of the explained
variances only.
As we can see, out of our 2304 features or columns, approximately 90% of the explained
variance can be described by using just over 107 features. So, if we wanted to implement PCA
on this data, extracting the top 107 features would be a very logical choice, as they already account for
the majority of the variance in the data.
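One way to read the component count off the eigenvalues is sketched below, under the assumption that the cutoff is the smallest number of leading components whose cumulative variance share reaches the target. The helper `components_for_variance` and the toy eigenvalues are hypothetical.

```python
import numpy as np

def components_for_variance(eig_vals, target=0.90):
    """Smallest number of leading components whose variance share reaches target."""
    ordered = np.sort(np.asarray(eig_vals, dtype=float))[::-1]
    cumulative = np.cumsum(ordered) / ordered.sum()
    # searchsorted returns the first index where cumulative >= target.
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, ordered.size)

# Toy eigenvalues: cumulative shares are 0.60, 0.85, 0.95 and 1.00.
toy_eig_vals = [6.0, 2.5, 1.0, 0.5]
k = components_for_variance(toy_eig_vals, target=0.90)
print(k)
```

Applied to the eigenvalues of the real covariance matrix, the same function with `target=0.90` and `target=0.95` would give the two component counts discussed in this section.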
Visualizing the Eigen Value:
As alluded to above, the PCA method seeks to obtain the optimal directions (or
eigenvectors) that capture the most variance (spread out the data points the most). Therefore,
it may be informative to visualize these directions and their associated eigenvalues. For the
purposes of this notebook and for speed, we invoke PCA to extract only the top 28 components. Of interest
is that when one compares the first component "Eigenvalue 1" to the 28th component "Eigenvalue
28", it is obvious that more complicated directions or components are being generated in the
search to maximize variance in the new feature subspace.
Interactive visualizations of PCA representation
When it comes to these dimensionality reduction methods, scatter plots are most
commonly used because they allow for convenient visualization of clustering
(if any exists), and this is exactly what we do as we plot the first 2 principal
components. We observed no visible clusters in the first two principal
components.
Improvements:
Looking at the reconstruction of the original image vs the image generated after PCA, it appears
that the reconstructed images are not similar enough to the original ones to discern their expressions
categorically. Facial expressions can be subtle, and a lot more information is needed to detect
them.
Sometimes even the naked eye fails to make out the emotion in a reconstructed image. Hence,
90% of the variance is not enough information. Let's move to 95% variance (259 components).
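To illustrate why keeping more components gives a closer reconstruction, here is a minimal NumPy sketch that projects data onto the top principal directions and maps it back. The function `pca_reconstruct` and the random data are assumptions made for the example. The project itself compared reconstructed face images visually.

```python
import numpy as np

def pca_reconstruct(X, n_components):
    """Project centered X onto its top principal components and map back."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T
    return (Xc @ W) @ W.T + mean

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
for k in (5, 10, 20):
    err = np.mean((X - pca_reconstruct(X, k)) ** 2)
    print(f"{k:2d} components: mean squared reconstruction error = {err:.4f}")
```

The error shrinks as components are added and vanishes once all directions are kept, which is the trade-off behind moving from 90% to 95% explained variance.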
But as we know, PCA is an unsupervised method and is therefore not optimized for
separating different class labels. Separating the classes more accurately is what we try to accomplish with the
very next method, i.e. LDA.
3.3 Linear Discriminant Analysis
LDA, much like PCA, is a linear transformation method commonly used in
dimensionality reduction tasks. However, unlike the latter, which is an unsupervised learning
algorithm, LDA falls into the class of supervised learning methods. Given the available
information about class labels, LDA seeks to maximize the separation between
the different classes by computing the component axes (linear discriminants) that do this.
LDA Implementation from Scratch
The objective of LDA is to preserve the class separation information whilst still reducing
the dimensions of the dataset. As such implementing the method from scratch can roughly be
split into 4 distinct stages as below.
A. Projected Means
Since this method was designed to take class labels into account, we first need to
establish a suitable metric with which to measure the 'distance' or separation between different
classes. Let's assume that we have a set of data points x that belong to one particular class w.
In LDA the first step is to project these points onto a new line, Y, that contains the
class-specific information via the transformation
$$Y = \omega^\intercal x$$
With this, the idea is to find some method that maximizes the separation of these new projected
variables. To do so, we first calculate the projected mean.
B. Scatter Matrices and their solutions: Having introduced our projected means, we now need
to find a function that can represent the difference between the means and then maximize it. As
in linear regression, where the most basic case is to find the line of best fit, we need to find the
equivalent of the variance in this context. This is where we introduce scatter matrices,
where the scatter is the equivalent of the variance.
$$\tilde{S}^{2} = (y - \tilde{\mu})^{2}$$
C. Selecting Optimal Projection Matrices
D. Transforming features onto new subspace
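Stages A to D can be sketched for the two-class case with NumPy. This is an illustrative version of the classic Fisher discriminant, where the projection direction is proportional to the inverse within-class scatter times the difference of class means. It is not the notebook implementation, and the function name and synthetic data are assumptions.

```python
import numpy as np

def fisher_lda_direction(X, y):
    """Two-class Fisher discriminant: w proportional to S_W^{-1} (mu_1 - mu_0)."""
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")
    X0, X1 = X[y == classes[0]], X[y == classes[1]]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices.
    S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    # pinv guards against a singular scatter matrix.
    w = np.linalg.pinv(S_w) @ (mu1 - mu0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(1.5, 1.0, size=(100, 5))])
y = np.array([3] * 100 + [4] * 100)   # 3 = Happy, 4 = Sad, as in the dataset

w = fisher_lda_direction(X, y)
Y = X @ w   # projection Y = w^T x for every sample
print(Y[y == 3].mean(), Y[y == 4].mean())
```

With two classes there is only one discriminant direction, so the projected data is one-dimensional.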
LDA Implementation via Sklearn: We used Sklearn's built-in LDA class and invoke
an LDA model as follows:
The syntax for the LDA implementation is very much like that of PCA: one calls the fit and
transform methods, which fit the LDA model to the data and then apply the LDA
dimensionality reduction to it. However, since LDA is a supervised learning
algorithm, there is a second argument that the user must provide: the class labels,
which in this case are the emotion labels (Happy or Sad) of the images.
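A minimal sketch of such a call, assuming scikit-learn's `LinearDiscriminantAnalysis` class and synthetic data in place of the pixel matrix. The exact arguments used in the notebooks are not shown in this report.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Synthetic stand-in for the standardized pixel matrix.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 20)),
               rng.normal(0.5, 1.0, size=(100, 20))])
y = np.array([3] * 100 + [4] * 100)   # 3 = Happy, 4 = Sad

# With two classes, LDA yields at most n_classes - 1 = 1 discriminant.
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)   # class labels are the second argument
print(X_lda.shape)
```

The single output column corresponds to the first linear discriminant, which is what a downstream classifier would receive as input.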
Interactive visualizations of LDA representation:
From the scatter plot above, we can see that the data points are more clearly clustered when
using LDA than when using PCA. This is an inherent advantage
of having class labels to supervise the method with.
4. Modeling
4.1 Support Vector Machine
SVM can be considered as an extension of the perceptron. Using the perceptron algorithm, we
can minimize misclassification errors. However, in SVMs, our optimization objective is
to maximize the margin between the classes. The margin is defined as the distance between the
separating hyperplane (decision boundary) and the training samples (support vectors) that are
closest to this hyperplane.
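In standard textbook notation, the margin maximization described above can be written for linearly separable data as the following optimization problem. The weight vector w, the bias b and the labels y_i in {-1, +1} are symbols introduced here for illustration and are not taken from the notebooks.

$$\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i\,(w^\intercal x_i + b) \ge 1 \;\; \text{for all } i$$

Under this scaling, the distance from the separating hyperplane to the closest training samples is

$$\frac{1}{\lVert w \rVert}$$

so minimizing the norm of w maximizes the margin. The default scikit-learn `SVC` solves a soft-margin variant of this problem with an RBF kernel, which tolerates some misclassified training samples.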
Input X: 107 components from PCA.
Running an SVM classifier with default parameters on it, we get an accuracy of 62%.
Input X: 259 components from PCA.
Running an SVM classifier with default parameters on it, we get an accuracy of 65%.
Input X: Output from LDA, i.e. LD 1.
Running an SVM classifier with default parameters, we get an accuracy of 66.4%.
The misclassification rate is 33.6%, so we next fit a neural network model to see whether it classifies
with more accuracy.
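The PCA-then-SVM experiments can be sketched with scikit-learn as below. This is an assumed setup on synthetic data: the train/test split, the pipeline layout and the use of `PCA(n_components=0.90)` to keep 90% of the variance are illustrative choices, and the accuracy it prints says nothing about the figures reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the pixel matrix; labels 3 = Happy, 4 = Sad.
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 64)),
               rng.normal(0.4, 1.0, size=(150, 64))])
y = np.array([3] * 150 + [4] * 150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Standardize, keep enough components for 90% of the variance, then a default SVC.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90, svd_solver="full"),
    SVC(),
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"accuracy = {accuracy:.3f}, misclassification rate = {1 - accuracy:.3f}")
```

Swapping the PCA step for an LDA transformer would give the LDA-based variant, and the misclassification rate is simply one minus the accuracy.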
4.2 Neural Networks
A neural network is a computational model that works in a similar way to the neurons in the human brain.
Each neuron takes an input, performs some operations, then passes the output to the following
neuron. As we are done pre-processing and splitting our dataset, we can start implementing our
neural network.
We have designed a simple neural network with one hidden layer, i.e. a vanilla NN with 50
nodes and the hyperbolic tangent activation function.
We have used a simple neural network with one hidden layer of 50 nodes. The learning
rate used is quite low in order to find the optimum solution. Gradient descent combined with the
momentum method is used. The hyperbolic tangent function is applied in the hidden layer, and a
cross-entropy loss function is computed on the softmax output. An accuracy of 65.8% was
achieved.
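The architecture described above can be sketched in plain NumPy as follows. This is a minimal illustration, not the notebook code: the initialization, learning rate, momentum coefficient, epoch count and synthetic data are all assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_vanilla_nn(X, T, hidden=50, lr=0.01, mu=0.9, epochs=300, seed=0):
    """One tanh hidden layer, softmax output, gradient descent with momentum."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = T.shape[1]
    W1 = rng.normal(size=(d, hidden)) / np.sqrt(d)
    b1 = np.zeros(hidden)
    W2 = rng.normal(size=(hidden, k)) / np.sqrt(hidden)
    b2 = np.zeros(k)
    params = [W1, b1, W2, b2]
    velocity = [np.zeros_like(p) for p in params]
    losses = []
    for _ in range(epochs):
        # Forward pass.
        H = np.tanh(X @ W1 + b1)
        P = softmax(H @ W2 + b2)
        losses.append(-np.mean(np.sum(T * np.log(P + 1e-12), axis=1)))
        # Backward pass for the mean cross-entropy loss.
        dZ2 = (P - T) / n
        dH = (dZ2 @ W2.T) * (1.0 - H ** 2)   # derivative of tanh is 1 - tanh^2
        grads = [X.T @ dH, dH.sum(axis=0), H.T @ dZ2, dZ2.sum(axis=0)]
        # Momentum update, applied in place so params stays current.
        for p, v, g in zip(params, velocity, grads):
            v *= mu
            v -= lr * g
            p += v
    return params, losses

def predict(params, X):
    W1, b1, W2, b2 = params
    return softmax(np.tanh(X @ W1 + b1) @ W2 + b2).argmax(axis=1)

rng = np.random.default_rng(1)
# Synthetic, well separated two-class data standing in for the image features.
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 10)),
               rng.normal(1.0, 1.0, size=(100, 10))])
labels = np.array([0] * 100 + [1] * 100)   # 0 = Happy, 1 = Sad
T = np.eye(2)[labels]                      # one-hot targets
params, losses = train_vanilla_nn(X, T)
train_accuracy = np.mean(predict(params, X) == labels)
print(f"final loss = {losses[-1]:.4f}, training accuracy = {train_accuracy:.3f}")
```

The momentum term accumulates past gradients in `velocity`, which is one reason the loss can drop quickly in the first few epochs even with a small learning rate.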
The maximum accuracy is achieved rather quickly in this method using gradient descent and
momentum.
4.3 Conclusion
Our model misclassifies about 33 images out of 100. We looked at the original images
to see which features it is not able to predict correctly. Pictures like the following are what our model is not able to
predict correctly, maybe because of the hair or the eyes, or maybe because of the lighting. As the
image set is very discrete there may be some error there. Because of time constraints we
were not able to run a CNN (Convolutional Neural Network) on the dataset, but that would be our
next step.
Many pictures in our data had watermarks just like this one and were misclassified. The majority
of our training data doesn’t have watermarks, which is another reason the model is not able to classify to
its maximum capacity.
5. Scope for Improvement
5.1 CNN (Convolutional Neural Network) and Parameter Tuning
We were not able to tune the parameters of our neural network model because of the time
crunch, and training on this large dataset took a lot of time. So, going forward, not for the grades
but for our own learning, we will be focusing on TensorFlow and CNNs.
Traditional neural networks that are very good at image classification have many
more parameters and take a lot of time if trained on a CPU. CNNs are faster and are applied heavily
in image and video recognition, recommender systems and natural language processing. CNNs
share weights in convolutional layers, which means that the same filter (weight bank) is used for
each receptive field in the layer; this reduces the memory footprint and improves performance.
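The effect of weight sharing on the parameter count can be illustrated with a short calculation. The convolutional layer below, with 32 filters of size 3x3, is a hypothetical configuration chosen for the comparison and was not part of the project.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus biases of a 2D convolutional layer; independent of image size."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters

n_pixels = 48 * 48                       # 2304 inputs per grayscale image
dense = dense_params(n_pixels, 50)       # hidden layer size used in Section 4.2
conv = conv2d_params(3, 3, 1, 32)        # hypothetical 32 filters of size 3x3
print(dense, conv)
```

The fully connected hidden layer needs 115,250 parameters, while the convolutional layer needs 320 because each filter is reused at every image position.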
Attachments – Python Notebooks and Code
1. Python Project_Image Classification.ipynb
Initial Data exploration and Preprocessing. Dimensionality Reduction by PCA, LDA
2. Python Project_Image Classification2.ipynb
SVM Implementation on top of PCA and LDA (Comparison)
3. Vanilla Neural Network.ipynb
Neural Network Implementation