Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural
Networks
Behzad Hasani and Mohammad H. Mahoor
Department of Electrical and Computer Engineering
University of Denver, Denver, CO
behzad.hasani@du.edu and mmahoor@du.edu
Abstract
Deep Neural Networks (DNNs) have been shown to outperform traditional methods in various visual recognition tasks, including Facial Expression Recognition (FER). In spite of efforts made to improve the accuracy of FER systems using DNNs, existing methods are still not generalizable enough for practical applications. This paper proposes a 3D Convolutional Neural Network method for FER in videos. The new network architecture consists of 3D Inception-ResNet layers followed by an LSTM unit that together extract the spatial relations within facial images as well as the temporal relations between different frames in the video. Facial landmark points are also used as inputs to our network, emphasizing the importance of facial components over facial regions that may not contribute significantly to generating facial expressions. Our proposed method is evaluated on four publicly available databases in subject-independent and cross-database tasks and outperforms state-of-the-art methods.
1. Introduction
Facial expressions are one of the most important non-
verbal channels for expressing internal emotions and in-
tentions. Ekman et al. [13] defined six expressions (viz.
anger, disgust, fear, happiness, sadness, and surprise) as ba-
sic emotional expressions which are universal among hu-
man beings. Automated Facial Expression Recognition
(FER) has been a topic of study for decades. Although
there have been many breakthroughs in developing auto-
matic FER systems, majority of the existing methods either
show undesirable performance in practical applications or
lack generalization due to the controlled condition in which
they are developed [47].
The FER problem becomes even more difficult when we
recognize expressions in videos. Facial expressions have a
dynamic pattern that can be divided into three phases: on-
Figure 1. Proposed method
set, peak and offset, where the onset describes the beginning
of the expression, the peak (aka apex) describes the maxi-
mum intensity of the expression and the offset describes the
moment when the expression vanishes. Most of the time, the entire facial expression event from onset to offset is very quick, which makes the process of expression recognition very challenging [55].
Many methods have been proposed for automated facial expression recognition. Most traditional approaches consider still images independently and ignore the temporal relations of consecutive frames in a sequence, which are essential for recognizing subtle changes in the appearance of facial images, especially in frames transitioning between emotions. Recently, with the help of Deep Neural Networks (DNNs), more promising results have been reported in the field [38, 39]. While traditional approaches use engineered features to train classifiers, DNNs are able to extract more discriminative features, which yield a better interpretation of the texture of the human face in visual data.
One of the problems in FER is that training neural networks is difficult because most of the existing databases have a small number of images or video sequences for certain emotions [39]. Also, most of these databases contain still images that are unrelated to each other (instead of consecutive frames exhibiting the expression from onset to offset), which makes the task of sequential image labeling more difficult.
In this paper, we propose a method which extracts the temporal relations of consecutive frames in a video sequence using 3D convolutional networks and Long Short-Term Memory (LSTM). Furthermore, we extract facial landmarks and incorporate them into the proposed method to emphasize more expressive facial components, which improves the recognition of subtle changes in facial expressions across a sequence (Figure 1). We evaluate our proposed method on four well-known facial expression databases (CK+, MMI, FERA, and DISFA) in order to classify the expressions. Furthermore, we examine the ability of our method to recognize facial expressions in cross-database classification tasks.
The remainder of the paper is organized as follows: Sec-
tion 2 provides an overview of the related work in this field.
Section 3 explains the network proposed in this research.
Experimental results and their analysis are presented in Section 4, and finally the paper is concluded in Section 5.
2. Related work
Traditionally, algorithms for automated facial expression
recognition consist of three main modules, viz. registration,
feature extraction, and classification. A detailed survey of different approaches to each of these steps can be found in [44]. Conventional algorithms for affective computing from faces use engineered features such as Local Binary Patterns (LBP) [47], Histogram of Oriented Gradients (HOG) [8], Local Phase Quantization (LPQ) [59], Histogram of Optical Flow [9], facial landmarks [6, 7], and PCA-based methods [36]. Since the majority of these features are hand-crafted for a specific recognition application, they often lack the generalizability required when there is high variation in lighting, views, resolution, subjects' ethnicity, etc.
One of the effective approaches for achieving better
recognition rates in sequence labeling tasks is to extract
the temporal relations of frames in a sequence. Extract-
ing these temporal relations has been studied using tradi-
tional methods in the past. Examples of these attempts are
Hidden Markov Models [5, 60, 64] (which combine tempo-
ral information and apply segmentation on videos), Spatio-
Temporal Hidden Markov Models (ST-HMM) by coupling
S-HMM and T-HMM [50], Dynamic Bayesian Networks
(DBN) [45, 63] associated with a multi-sensory informa-
tion fusion strategy, Bayesian temporal models [46] to cap-
ture the dynamic facial expression transition, and Condi-
tional Random Fields (CRFs) [19, 20, 25, 48] and their
extensions such as Latent-Dynamic Conditional Random
Fields (LD-CRFs) and Hidden Conditional Random Fields
(HCRFs) [58].
In recent years, “Convolutional Neural Networks”
(CNNs) have become the most popular approach among
researchers in the field. AlexNet [27] is based on the tra-
ditional CNN layered architecture which consists of sev-
eral convolution layers followed by max-pooling layers and
Rectified Linear Units (ReLUs). Szegedy et al. [52] intro-
duced GoogLeNet which is composed of multiple “Incep-
tion” layers. Inception applies several convolutions on the
feature map in different scales. Mollahosseini et al. [38, 39]
have used the Inception layer for the task of facial expres-
sion recognition and achieved state-of-the-art results. Fol-
lowing the success of Inception layers, several variations
of them have been proposed [24, 53]. Moreover, the Inception layer has been combined with the residual unit introduced by He et al. [21], and the resulting architecture has been shown to accelerate the training of Inception networks significantly [51].
One of the major restrictions of ordinary Convolutional Neural Networks is that they only extract spatial relations of the input data and ignore temporal relations when the data form a sequence. To overcome this problem, 3D Convolutional Neural Networks (3D-CNNs) have been proposed. 3D-CNNs slide over the temporal dimension of the input data as well as the spatial dimensions, enabling the network to extract feature maps containing temporal information, which is essential for sequence labeling tasks. Song et al. [49] have used 3D-CNNs for the 3D object
detection task. Molchanov et al. [37] have proposed a re-
current 3D-CNN for dynamic hand gesture recognition and
Fan et al. [15] won the EmotiW 2016 challenge by cascad-
ing 3D-CNNs with LSTMs.
Traditional Recurrent Neural Networks (RNNs) can
learn temporal dynamics by mapping input sequences to
a sequence of hidden states, and also mapping the hidden
states to outputs [12]. Although RNNs have shown promis-
ing performance on various tasks, it is not easy for them
to learn long-term sequences. This is mainly due to the
vanishing/exploding gradients problem [23] which can be
solved by having a memory for remembering and forget-
ting the previous states. LSTMs [23] provide such memory
and can memorize the context information for long peri-
ods of time. LSTM modules have three gates: 1) the input gate (i), 2) the forget gate (f), and 3) the output gate (o), which overwrite, keep, or retrieve the memory cell c, respectively, at time step t. Let $\sigma(x) = (1 + \exp(-x))^{-1}$ be the sigmoid function, let $\phi(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} = 2\sigma(2x) - 1$ be the hyperbolic tangent function, and let $x$, $h$, $c$, $W$, and $b$ be the input, output, cell state, parameter matrix, and parameter vector, respectively. The LSTM updates for time step $t$, given inputs $x_t$, $h_{t-1}$, and $c_{t-1}$, are as follows:

$$
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\\
g_t &= \phi(W_C \cdot [h_{t-1}, x_t] + b_C)\\
C_t &= f_t * C_{t-1} + i_t * g_t\\
h_t &= o_t * \phi(C_t)
\end{aligned}
\tag{1}
$$
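For concreteness, the update rule in Eq. (1) can be written as a few lines of NumPy; the following is a minimal sketch (the `lstm_step` helper and the dictionary-of-weights layout are illustrative assumptions, not part of the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update following Eq. (1).

    W: dict of weight matrices W['f'], W['i'], W['o'], W['C'], each of shape
       (hidden, hidden + input); b: dict of bias vectors of shape (hidden,).
    """
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ z + b['f'])     # forget gate
    i_t = sigmoid(W['i'] @ z + b['i'])     # input gate
    o_t = sigmoid(W['o'] @ z + b['o'])     # output gate
    g_t = np.tanh(W['C'] @ z + b['C'])     # candidate values (phi = tanh)
    c_t = f_t * c_prev + i_t * g_t         # new cell state C_t
    h_t = o_t * np.tanh(c_t)               # new output h_t
    return h_t, c_t
```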
Several works have used LSTMs for the task of sequence
labeling. Byeon et al. [3] proposed an LSTM-based net-
work applying LSTMs in four direction sliding windows
and achieved impressive results. Fan et al. [15] cascaded
2D-CNN with LSTMs and combined the feature map with
3D-CNNs for facial expression recognition. Donahue et
al. [12] proposed Long-term Recurrent Convolutional Net-
work (LRCN) by combining CNNs and LSTMs which is
both spatially and temporally deep and has the flexibility
to be applied to different vision tasks involving sequential
inputs and outputs.
3. Proposed method
While Inception and ResNet have shown remarkable results in FER [20, 52], these methods do not extract the temporal relations of the input data. Therefore, we propose a 3D Inception-ResNet architecture to address this issue. Our proposed method extracts both spatial and temporal features of the sequences in an end-to-end neural network. Another component of our method is the incorporation of facial landmarks, in an automated manner, during the training of the proposed neural network. These facial landmarks help the network pay more attention to the important facial components in the feature maps, which results in more accurate recognition. The final part of our proposed method is an LSTM unit which takes the enhanced feature map resulting from the 3D Inception-ResNet (3DIR) layers as input and extracts the temporal information from it. The LSTM unit is followed by a fully-connected layer with a softmax activation function. In the following, we explain each of the aforementioned units in detail.
3.1. 3D Inception-ResNet (3DIR)
We propose a 3D version of the Inception-ResNet network which is slightly shallower than the original Inception-ResNet network proposed in [51]. This network is the result of investigating several variations of the Inception-ResNet module and achieves better recognition rates than our other attempts on several databases.
Figure 2 shows the structure of our 3D Inception-ResNet
network. The input videos, of size 10 × 299 × 299 × 3 (10 frames, 299 × 299 frame size, and 3 color channels), are first processed by the “stem” layer.

Figure 2. Network architecture. The layers marked “V” and “S” use “valid” and “same” padding, respectively. The size of the output tensor is provided next to each layer.

Afterwards, the stem is followed by 3DIR-A, Reduction-A (which reduces the grid size from 38 × 38 to 18 × 18), 3DIR-B, Reduction-B (which reduces the grid size from 18 × 18 to 8 × 8), 3DIR-C, Average Pooling, Dropout, and a fully-connected layer, respectively. Detailed specifications of each layer are provided in Figure 2. Various filter sizes, paddings, strides, and activations were investigated, and the configuration with the best performance is presented in this paper.
We should mention that all convolution layers (except
the ones that are indicated as “Linear” in Figure 2) are fol-
lowed by a ReLU [27] activation function to avoid the van-
ishing gradient problem.
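To illustrate the general shape of a 3D Inception-ResNet block (parallel Conv3D branches merged by concatenation, a linear 1×1×1 projection, and a residual addition), here is a hedged tf.keras sketch; the branch widths, filter counts, and the toy stem are placeholders and do not reproduce the exact layer specification of Figure 2:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv3d_relu(x, filters, kernel_size):
    """Conv3D followed by ReLU, 'same' padding (placeholder widths)."""
    x = layers.Conv3D(filters, kernel_size, padding='same')(x)
    return layers.Activation('relu')(x)

def inception_resnet_3d_block(x):
    """Simplified 3D Inception-ResNet block: parallel Conv3D branches,
    concatenation, a linear 1x1x1 projection, and a residual addition."""
    branch_0 = conv3d_relu(x, 32, 1)                         # 1x1x1 branch
    branch_1 = conv3d_relu(conv3d_relu(x, 32, 1), 32, 3)     # 1x1x1 -> 3x3x3
    branch_2 = conv3d_relu(conv3d_relu(conv3d_relu(x, 32, 1), 48, 3), 64, 3)
    mixed = layers.Concatenate(axis=-1)([branch_0, branch_1, branch_2])
    # Linear (no activation) 1x1x1 projection back to the input depth.
    up = layers.Conv3D(int(x.shape[-1]), 1, padding='same')(mixed)
    out = layers.Add()([x, up])                              # residual addition
    return layers.Activation('relu')(out)

# Example usage with an assumed input of 10 frames of 299x299 RGB:
inputs = tf.keras.Input(shape=(10, 299, 299, 3))
stem = layers.Conv3D(64, 3, strides=2, padding='valid', activation='relu')(inputs)
block = inception_resnet_3d_block(stem)
```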
3.2. Facial landmarks
As mentioned before, the main reason we use facial landmarks in our network is to differentiate between the importance of the main facial components (such as the eyebrows, lip corners, eyes, etc.) and other parts of the face which are less expressive of facial expressions. As opposed to general object recognition tasks, in FER we have the advantage of extracting facial landmarks and using this information to improve the recognition rate. In a similar approach, Jaiswal et al. [26] proposed incorporating binary masks around different parts of the face in order to encode the shape of different face components. However, in that work the authors perform AU recognition by using a CNN as a feature extractor for training a Bi-directional Long Short-Term Memory network, while in our approach we preserve the temporal order of the frames throughout the network and train the CNN and LSTM simultaneously in an end-to-end network. We incorporate the facial landmarks by replacing the shortcut in the residual unit of the original ResNet with an element-wise multiplication of the facial landmarks and the input tensor of the residual unit (Figures 1 and 2).
In order to extract the facial landmarks, OpenCV face detection is used to obtain the bounding boxes of the faces. A face alignment algorithm based on regressing local binary features [41, 61] was then used to extract 66 facial landmark points. The facial landmark localization technique was trained using the annotations provided by the 300W competition [42, 43].
After detecting and saving the facial landmarks for all of the databases, the facial landmark filters are generated for each sequence automatically during the training phase. Given the facial landmarks for each frame of a sequence, we initially resize all of the images in the sequence to their corresponding filter size in the network. Afterwards, we assign weights to all of the pixels in a frame of a sequence based on their distances to the detected landmarks: the closer a pixel is to a facial landmark, the greater the weight assigned to that pixel. After investigating several distance measures, we concluded that the Manhattan distance with a linear weight function results in a better recognition rate across various databases. The Manhattan distance between two items is the sum of the absolute differences of their corresponding components (in this case, two components). The weight function that we define to assign the weight values to their corresponding features is a simple linear function of the Manhattan distance, defined as follows:
$$
\omega(L, P) = 1 - 0.1 \cdot d_M(L, P) \tag{2}
$$

where $d_M(L, P)$ is the Manhattan distance between the facial landmark $L$ and the pixel $P$.

Figure 3. Sample image from the MMI database: (a) landmarks, (b) generated filter, i.e., the image's corresponding filter in the network. Best viewed in color.

Therefore, places in which facial landmarks are located will have the highest value and their
surrounding pixels will have lower weights proportional
to their distance from the corresponding facial landmark.
In order to avoid overlap between two adjacent facial landmarks, we define a 7 × 7 window around each facial landmark and apply the weight function to these 49 pixels for each landmark separately. Figure 3 shows an example facial image from the MMI database and its corresponding facial landmark filter in the network. We do not incorporate the facial landmarks into the third 3D Inception-ResNet module since the resulting feature map at this stage becomes too small for calculating the facial landmark filter.
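A minimal NumPy sketch of how such a landmark filter could be generated is shown below, assuming the landmark coordinates have already been scaled to the feature-map resolution; the function name and the way overlapping windows are merged are illustrative choices, not the authors' exact implementation:

```python
import numpy as np

def landmark_filter(landmarks, height, width, window=7, step=0.1):
    """Build a weight map with value 1.0 at each landmark, decaying linearly
    with Manhattan distance inside a window x window neighborhood (Eq. 2)."""
    weights = np.zeros((height, width), dtype=np.float32)
    half = window // 2
    for (lx, ly) in landmarks:                       # landmark pixel coordinates
        for dy in range(-half, half + 1):
            for dx in range(-half, half + 1):
                y, x = int(round(ly)) + dy, int(round(lx)) + dx
                if 0 <= y < height and 0 <= x < width:
                    d = abs(dx) + abs(dy)            # Manhattan distance d_M
                    w = max(0.0, 1.0 - step * d)     # Eq. 2: 1 - 0.1 * d_M
                    weights[y, x] = max(weights[y, x], w)  # keep the larger weight
    return weights
```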
Incorporating facial landmarks in our network replaces the shortcut in the original ResNet [22] with the element-wise multiplication of the weight function $\omega$ and the input layer $x_l$ as follows:

$$
\begin{aligned}
y_l &= \omega(L, P) \circ x_l + F(x_l, W_l)\\
x_{l+1} &= f(y_l)
\end{aligned}
\tag{3}
$$

where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th layer, $\circ$ is the Hadamard product, $F$ is a residual function (in our case, the Inception-layer convolutions), and $f$ is an activation function.
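A hedged tf.keras sketch of Eq. (3), in which the residual shortcut is replaced by the Hadamard product of the landmark weight map and the input tensor (names and shapes are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def landmark_residual_unit(x_l, landmark_weights, residual_fn):
    """Eq. (3): y_l = w(L, P) o x_l + F(x_l, W_l);  x_{l+1} = f(y_l).

    x_l:              input feature map, shape (batch, T, H, W, C)
    landmark_weights: landmark weight map broadcastable to x_l,
                      e.g. shape (1, T, H, W, 1)
    residual_fn:      the Inception-style convolutions F(., W_l)
    """
    shortcut = x_l * landmark_weights       # Hadamard product replaces the shortcut
    residual = residual_fn(x_l)             # F(x_l, W_l)
    y_l = shortcut + residual
    return layers.Activation('relu')(y_l)   # f(y_l), ReLU in this sketch
```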
3.3. Long Short-Term Memory unit
As explained earlier, to capture the temporal relations of the feature map resulting from the 3DIR module and to take these relations into account when classifying the sequences in the softmax layer, we use an LSTM unit, as shown in Figure 2. Using an LSTM unit is a natural choice since the feature map resulting from the 3DIR unit retains the time notion of the sequences. Therefore, vectorizing the resulting 3DIR feature map along its sequence dimension provides the required sequenced input for the LSTM unit. Whereas in other still-image LSTM-based methods a vectorized, non-sequenced feature map (which does not contain any time notion) is fed to the LSTM unit, our method preserves the time order of the input sequences and passes this feature map to the LSTM unit. We found that 200 hidden units for the LSTM unit is a reasonable choice for the task of FER (Figure 2).
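A minimal sketch of such an LSTM head, assuming the 3DIR output keeps its sequence (time) axis right after the batch axis; the 200-unit LSTM and the softmax classifier follow Figure 2, while the reshape helper is an illustrative assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers

def lstm_head(feature_map, num_classes):
    """feature_map: (batch, T, H, W, C) output of the 3DIR module.
    Flatten each time step while keeping the temporal order, then classify."""
    t = int(feature_map.shape[1])                  # sequence length
    x = layers.Reshape((t, -1))(feature_map)       # (batch, T, H*W*C)
    x = layers.LSTM(200)(x)                        # 200 hidden units
    return layers.Dense(num_classes, activation='softmax')(x)
```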
The proposed network was implemented using a com-
bination of TensorFlow [1] and TFlearn [10] toolboxes on
NVIDIA Tesla K40 GPUs. In the training phase we used
asynchronous stochastic gradient descent with momentum
of 0.9, weight decay of 0.0001, and learning rate of 0.01.
We used categorical cross entropy as our loss function and
accuracy as our evaluation metric.
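The original implementation used TensorFlow and TFlearn; the following hedged tf.keras sketch applies the stated settings (SGD with momentum 0.9, learning rate 0.01, categorical cross entropy, accuracy metric). How the weight decay of 0.0001 was applied is not specified here, so the L2-regularizer note is an assumption:

```python
import tensorflow as tf

def compile_for_training(model: tf.keras.Model) -> tf.keras.Model:
    """Apply the reported training settings: SGD with momentum 0.9,
    learning rate 0.01, categorical cross entropy, accuracy metric."""
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    # Assumption: the reported weight decay of 1e-4 could be approximated by
    # attaching tf.keras.regularizers.l2(1e-4) to the convolutional kernels.
    return model
```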
4. Experiments and results
In this section, we briefly review the databases used to evaluate our method. We then report the results of our experiments on these databases and compare them with the state of the art.

4.1. Face databases

Since our method is designed mainly for classifying sequences of inputs, databases that contain only independent, unrelated still images of facial expressions, such as MultiPIE [18], SFEW [11], and FER2013 [17], cannot be used to evaluate our method. We evaluate the proposed method on MMI [40], extended CK+ [32], GEMEP-FERA [2], and DISFA [33], which contain videos of annotated facial expressions. In the following, we briefly review the contents of these databases.
MMI: The MMI database [40] contains more than 20 subjects, ranging in age from 19 to 62, with different ethnicities (European, Asian, or South American). In MMI, the subjects' facial expressions start from the neutral state, proceed to the apex of one of the six basic facial expressions, and then return to the neutral state. Subjects were instructed to display 79 series of facial expressions, six of which are the prototypic emotions (angry, disgust, fear, happy, sad, and surprise). We extracted static frames from each sequence, which resulted in 11,500 images. Afterwards, we divided the videos into sequences of ten frames to shape the input tensor for our network.
CK+: The extended Cohn-Kanade database (CK+) [32] contains 593 videos from 123 subjects. However, only 327 sequences from 118 subjects have facial expression labels. Sequences in this database start from the neutral state and end at the apex of one of seven expressions (angry, contempt, disgust, fear, happy, sad, and surprise). CK+ primarily contains frontal face poses only. In order to make the database compatible with our network, we consider the last ten frames of each sequence as an input sequence for our network.
FERA: The GEMEP-FERA database [2] is a subset of the GEMEP corpus, used as the database for the FERA 2011 challenge [56] and developed by the Geneva Emotion Research Group at the University of Geneva. This database contains 87 image sequences of 7 subjects. Each subject shows facial expressions of the emotion categories Anger, Fear, Joy, Relief, and Sadness. Head pose is primarily frontal, with relatively fast movements. Each video is annotated with AUs and holistic expressions. By extracting static frames from the sequences, we obtained around 7,000 images. We divided these emotion videos into sequences of ten frames to shape the input tensor for our network.
DISFA: The Denver Intensity of Spontaneous Facial Actions (DISFA) database [33] is one of the few naturalistic databases that have been FACS coded with AU intensity values. This database consists of 27 subjects. The subjects were asked to watch YouTube videos while their spontaneous facial expressions were recorded. Twelve AUs are coded for each frame, and AU intensities are on a six-point scale between 0 and 5, where 0 denotes the absence of the AU and 5 represents maximum intensity. As DISFA is not emotion-specified coded, we used the EMFACS system [16] to convert the AU FACS codes to seven expressions (angry, disgust, fear, happy, neutral, sad, and surprise), which resulted in around 89,000 images, the majority of which have the neutral expression. As with the other databases, we divided the videos of emotions into sequences of ten frames to shape the input tensor for our network.
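Across all four databases, frames are grouped into ten-frame clips to form the input tensor; a minimal sketch of such chunking (the function name and frame representation are illustrative):

```python
import numpy as np

def make_sequences(frames, seq_len=10):
    """Split a list/array of video frames (each H x W x 3) into
    non-overlapping clips of `seq_len` frames, dropping any remainder."""
    usable = (len(frames) // seq_len) * seq_len
    clips = np.asarray(frames[:usable])
    return clips.reshape(-1, seq_len, *clips.shape[1:])   # (N, 10, H, W, 3)
```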
4.2. Results
As mentioned earlier, after detecting the faces we extract 66 facial landmark points with a face alignment algorithm based on regressing local binary features. Afterwards, we resize the faces to 299 × 299 pixels. One of the reasons we choose a large input image size is that larger images and sequences enable us to build deeper networks and extract more abstract features from the sequences. All of the networks have the same settings (shown in detail in Figure 2) and are trained from scratch for each database separately. We evaluate the accuracy of our proposed method with two different sets of experiments: “subject-independent” and “cross-database” evaluations.
4.2.1 Subject-independent task
In the subject-independent task, each database is split into training and validation sets in a strict subject-independent manner. For all databases, we report the results using 5-fold cross-validation, averaging the recognition rates over the five folds. For each database and each fold, we trained our proposed network entirely from scratch with the aforementioned settings. Table 1 shows the recognition rates achieved on each database in the subject-independent case and compares the results with state-of-the-art methods. In order to assess the impact of incorporating facial landmarks, we also provide the results of our network when the landmark multiplication unit is removed and replaced with a simple shortcut between the input and output of the residual unit. In this case, we randomly select 20 percent of the subjects as the test set and report the results on those subjects. Table 1 also provides the recognition rates of the traditional 2D Inception-ResNet from [20], which contains neither facial landmarks nor the LSTM unit (DISFA is not evaluated in that study).
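A hedged sketch of a strict subject-independent split using scikit-learn's GroupKFold with the subject ID as the grouping key; this mirrors the described protocol but is not the authors' code:

```python
from sklearn.model_selection import GroupKFold

def subject_independent_folds(sequences, labels, subject_ids, n_folds=5):
    """Yield (train_idx, test_idx) pairs so that no subject appears in both
    the training and the test set of a fold (strict subject independence)."""
    gkf = GroupKFold(n_splits=n_folds)
    for train_idx, test_idx in gkf.split(sequences, labels, groups=subject_ids):
        yield train_idx, test_idx
```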
Comparing the recognition rates of the 3D and 2D Inception-ResNets in Table 1 shows that sequential processing of facial expressions considerably enhances the recognition rate. This improvement is more apparent on the MMI and FERA databases. Incorporating landmarks into the network is proposed to emphasize the more important facial changes over time. Since changes in the lips or eyes are much more expressive than changes in other components such as the cheeks, we utilize facial landmarks to enhance these temporal changes in the network flow.
The “3D Inception-ResNet with landmarks” column in Table 1 shows the impact of this enhancement on the different databases. It can be seen that, compared with the other networks, there is a considerable improvement in recognition rates, especially on the FERA and MMI databases. The results on DISFA, however, show higher fluctuations over the folds, which can be partly attributed to the abundance of inactive frames in this database, causing confusion between different expressions. Therefore, the folds that contain more neutral faces show lower recognition rates.

Compared with other state-of-the-art works, our method outperforms the others on the FERA and DISFA databases and achieves comparable results on the CK+ and MMI databases (Table 1). Most of these works use traditional approaches, including hand-crafted features tuned for a specific database, while our network's settings are the same for all databases. Also, due to the limited number of samples in these databases, it is difficult to properly train a deep neural network and avoid overfitting. For these reasons, and in order to gain a better understanding of our proposed method, we also performed cross-database experiments.
Figure 4 shows the confusion matrices of our 3D Inception-ResNet with landmarks on the different databases over the 5 folds. On CK+ (Figure 4a), it can be seen that very high recognition rates are achieved. The recognition rates for happiness, sadness, and surprise are higher than those for the other expressions. The highest confusion occurred between the happiness and contempt expressions, which can be caused by the low number of contempt sequences in this database (only 18 sequences). On MMI (Figure 4b), perfect recognition is achieved for the happy expression. There is a high confusion between the sad and fear expressions as well as between the angry and sad expressions. Considering that MMI is a highly imbalanced dataset, these confusions are reasonable. On FERA (Figure 4c), the highest and lowest recognition rates belong to joy and relief, respectively. The relief category in this database has some similarities with other categories, especially joy, which make the classification difficult even for humans. Despite these challenges, our method performs well on all of the categories and outperforms the state of the art. On DISFA (Figure 4d), we see the highest confusion rate compared with the other databases. As mentioned earlier, this database contains long stretches of inactive frames, which means that the number of neutral sequences is considerably higher than that of the other categories. This imbalanced training data has biased the network toward the neutral category, and therefore we observe a high confusion rate between the neutral expression and the other categories in this database. Despite the low number of angry and sad sequences in this database, our method has been able to achieve satisfying recognition rates in these categories.
4.2.2 Cross-database task
In the cross-database task, each database is entirely used for testing the network while the remaining databases are used to train it. The same network architecture as in the subject-independent task (Figure 2) was used for this task. Table 2 shows the recognition rates achieved on each database in the cross-database case and compares the results with other state-of-the-art methods. It can be seen that our method outperforms the state-of-the-art results on the CK+, FERA, and DISFA databases. On MMI, our method does not show an improvement compared to others (e.g., [62]). However, the authors in [62] trained their classifier only on the CK+ database, while our method uses instances from two additional databases (DISFA and FERA) with completely different settings and subjects, which adds a significant amount of ambiguity to the training phase.
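A short sketch of the leave-one-database-out protocol described above; `train_model` and `evaluate` are hypothetical stand-ins for the actual training and evaluation routines:

```python
def cross_database_evaluation(databases, train_model, evaluate):
    """databases: dict mapping a name to its (sequences, labels) pair.
    For each database, train on all of the others and test on the held-out one."""
    results = {}
    for held_out in databases:
        train_sets = [data for name, data in databases.items() if name != held_out]
        model = train_model(train_sets)
        results[held_out] = evaluate(model, databases[held_out])
    return results
```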
In order to have a fair comparison with other methods, we describe the different settings used by the works mentioned in Table 2. The results provided in [34] are achieved by training the models on one of the CK+, MMI, and FEEDTUM databases and testing on the rest. The result reported in [47] is the best result achieved using different SVM kernels trained on CK+ and tested on the MMI database. In [35], several experiments were performed using four classifiers (SVM, Nearest Mean Classifier, Weighted Template Matching, and K-nearest neighbors); the results reported for CK+ are obtained by training on the MMI and JAFFE databases, while the results reported for MMI are obtained by training on the CK+ database only. As mentioned earlier, in [62] a Multiple Kernel Learning algorithm is used, and the cross-database experiments are trained on CK+ and evaluated on MMI, and vice versa. In [38] a DNN is proposed using the traditional Inception layer; the networks for the cross-database case in this work are tested on either CK+, MultiPIE, MMI, DISFA,
FERA, SFEW, or FER2013 while trained on the rest.

| Database | State-of-the-art methods | 2D Inception-ResNet | 3D Inception-ResNet | 3D Inception-ResNet + landmarks |
|---|---|---|---|---|
| CK+ | 84.1 [34], 84.4 [28], 88.5 [54], 92.0 [29], 93.2 [38], 92.4 [30], 93.6 [62] | 85.77 | 89.50 | 93.21±2.32 |
| MMI | 63.4 [30], 75.12 [31], 74.7 [29], 79.8 [54], 86.7 [47], 78.51 [36] | 55.83 | 67.50 | 77.50±1.76 |
| FERA | 56.1 [30], 55.6 [57], 76.7 [38] | 49.64 | 67.74 | 77.42±3.67 |
| DISFA | 55.0 [38] | - | 51.35 | 58.00±5.77 |

Table 1. Recognition rates (%) in the subject-independent task.

Figure 4. Confusion matrices of 3D Inception-ResNet with landmarks for the subject-independent task: (a) CK+, (b) MMI, (c) FERA, (d) DISFA.

Some of the expressions of these databases are excluded in this
study (such as neutral, relief, and contempt). There are other works that perform their experiments on the action unit recognition task [4, 14, 26], but since a fair comparison between action unit recognition and facial expression recognition is not easily obtainable, we do not include these works in Tables 1 and 2.
Figure 5 shows the confusion matrices of our experiments with 3D Inception-ResNet with landmarks in the cross-database task. For CK+ (Figure 5a), we exclude the contempt sequences in the test phase since the other databases used for training the network do not contain the contempt category. Except for the fear expression (which has very few samples in the other databases), the network is able to correctly recognize the expressions. For MMI (Figure 5b), the highest recognition rate belongs to surprise while the lowest belongs to fear. We also see a high confusion rate in recognizing sadness. On FERA (Figure 5c), we exclude the relief category as the other databases do not contain this emotion. Considering that only half of the training categories exist in the test set, the network shows acceptable performance in correctly recognizing emotions. However, the surprise category causes significant confusion
in all of the categories. On DISFA (Figure 5d), we exclude the neutral category as the other databases do not contain this category.

Figure 5. Confusion matrices of 3D Inception-ResNet with landmarks for the cross-database task: (a) CK+, (b) MMI, (c) FERA, (d) DISFA.

The highest recognition rates belong to the happy and surprise emotions, while the lowest belongs to fear. Compared with the other databases, we see a significant increase in the confusion rate across all of the categories. This can be partly explained by the fact that the emotions in DISFA are “spontaneous” while the emotions in the training databases are “posed”. Based on the aforementioned results, our method provides a comprehensive solution that can generalize well to practical applications.
| Database | State-of-the-art methods | 3D Inception-ResNet + landmarks |
|---|---|---|
| CK+ | 47.1 [34], 56.0 [35], 61.2 [62], 64.2 [38] | 67.52 |
| MMI | 51.4 [34], 50.8 [47], 36.8 [35], 55.6 [38], 66.9 [62] | 54.76 |
| FERA | 39.4 [38] | 41.93 |
| DISFA | 37.7 [38] | 40.51 |

Table 2. Recognition rates (%) in the cross-database task.

5. Conclusion
In this paper, we presented a 3D Deep Neural Network for the task of facial expression recognition in videos. We
proposed the 3D Inception-ResNet (3DIR) network, which extends the well-known 2D Inception-ResNet module for processing image sequences. The additional dimension results in a volume of feature maps and extracts the spatial relations between frames in a sequence. This module is followed by an LSTM unit, which takes these temporal relations into account and uses this information to classify the sequences. In order to differentiate between facial components and other parts of the face, we incorporated facial landmarks into our proposed method. These landmarks are multiplied with the input tensor in the residual module, and this product replaces the shortcut of the traditional residual layer.
We evaluated our proposed method in subject-
independent and cross-database tasks. Four well-known
databases were used to evaluate the method: CK+, MMI,
FERA, and DISFA. Our experiments show that the pro-
posed method outperforms many of the state-of-the-art
methods in both tasks and provides a general solution for
the task of FER.
6. Acknowledgement
This work is partially supported by the NSF grants IIS-
1111568 and CNS-1427872. We gratefully acknowledge
the support from NVIDIA Corporation with the donation of
the Tesla K40 GPUs used for this research.
References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen,
C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al.
Tensorflow: Large-scale machine learning on heterogeneous
distributed systems. arXiv preprint arXiv:1603.04467, 2016.
5
[2] T. Bänziger and K. R. Scherer. Introducing the geneva mul-
timodal emotion portrayal (gemep) corpus. Blueprint for af-
fective computing: A sourcebook, pages 271–294, 2010. 5
[3] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene
labeling with lstm recurrent neural networks. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 3547–3555, 2015. 3
[4] W.-S. Chu, F. De la Torre, and J. F. Cohn. Modeling spatial
and temporal cues for multi-label facial action unit detection.
arXiv preprint arXiv:1608.00911, 2016. 7
[5] I. Cohen, N. Sebe, A. Garg, L. S. Chen, and T. S. Huang.
Facial expression recognition from video sequences: tempo-
ral and static modeling. Computer Vision and image under-
standing, 91(1):160–187, 2003. 2
[6] T. F. Cootes, G. J. Edwards, C. J. Taylor, et al. Active ap-
pearance models. IEEE Transactions on pattern analysis and
machine intelligence, 23(6):681–685, 2001. 2
[7] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Ac-
tive shape models-their training and application. Computer
vision and image understanding, 61(1):38–59, 1995. 2
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for
human detection. In Computer Vision and Pattern Recogni-
tion, 2005. CVPR 2005. IEEE Computer Society Conference
on, volume 1, pages 886–893. IEEE, 2005. 2
[9] N. Dalal, B. Triggs, and C. Schmid. Human detection using
oriented histograms of flow and appearance. In European
conference on computer vision, pages 428–441. Springer,
2006. 2
[10] A. Damien et al. Tflearn. https://github.com/tflearn/tflearn, 2016. 5
[11] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon. Static fa-
cial expression analysis in tough conditions: Data, evalua-
tion protocol and benchmark. In Computer Vision Workshops
(ICCV Workshops), 2011 IEEE International Conference on,
pages 2106–2112. IEEE, 2011. 5
[12] J. Donahue, L. Anne Hendricks, S. Guadarrama,
M. Rohrbach, S. Venugopalan, K. Saenko, and T. Dar-
rell. Long-term recurrent convolutional networks for visual
recognition and description. In Proceedings of the IEEE
conference on computer vision and pattern recognition,
pages 2625–2634, 2015. 2, 3
[13] P. Ekman and W. V. Friesen. Constants across cultures in the
face and emotion. Journal of personality and social psychol-
ogy, 17(2):124, 1971. 1
[14] C. Fabian Benitez-Quiroz, R. Srinivasan, and A. M. Mar-
tinez. Emotionet: An accurate, real-time algorithm for the
automatic annotation of a million facial expressions in the
wild. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 5562–5570, 2016. 7
[15] Y. Fan, X. Lu, D. Li, and Y. Liu. Video-based emotion recog-
nition using cnn-rnn and c3d hybrid networks. In Proceed-
ings of the 18th ACM International Conference on Multi-
modal Interaction, ICMI 2016, pages 445–450, New York,
NY, USA, 2016. ACM. 2, 3
[16] W. V. Friesen and P. Ekman. Emfacs-7: Emotional facial
action coding system. Unpublished manuscript, University
of California at San Francisco, 2(36):1, 1983. 5
[17] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville,
M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-
H. Lee, et al. Challenges in representation learning: A report
on three machine learning contests. In International Con-
ference on Neural Information Processing, pages 117–124.
Springer, 2013. 5
[18] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker.
Multi-pie. Image and Vision Computing, 28(5):807–813,
2010. 5
[19] B. Hasani, M. M. Arzani, M. Fathy, and K. Raahemifar.
Facial expression recognition with discriminatory graphical
models. In 2016 2nd International Conference of Signal Pro-
cessing and Intelligent Systems (ICSPIS), pages 1–7, Dec
2016. 2
[20] B. Hasani and M. H. Mahoor. Spatio-temporal facial expres-
sion recognition using convolutional neural networks and
conditional random fields. arXiv preprint arXiv:1703.06995,
2017. 2, 3, 6
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2016. 2
[22] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in
deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016. 4
[23] S. Hochreiter and J. Schmidhuber. Long short-term memory.
Neural computation, 9(8):1735–1780, 1997. 2
[24] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015. 2
[25] S. Jain, C. Hu, and J. K. Aggarwal. Facial expression
recognition with temporal modeling of shapes. In Computer
Vision Workshops (ICCV Workshops), 2011 IEEE Interna-
tional Conference on, pages 1642–1649. IEEE, 2011. 2
[26] S. Jaiswal and M. Valstar. Deep learning the dynamic ap-
pearance and shape of facial action units. In Applications of
Computer Vision (WACV), 2016 IEEE Winter Conference on,
pages 1–8. IEEE, 2016. 4, 7
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classification with deep convolutional neural networks. In
Advances in neural information processing systems, pages
1097–1105, 2012. 2, 4
[28] S. H. Lee, K. N. K. Plataniotis, and Y. M. Ro. Intra-
class variation reduction using training expression images
for sparse representation based facial expression recognition.
IEEE Transactions on Affective Computing, 5(3):340–351,
2014. 7
[29] M. Liu, S. Li, S. Shan, and X. Chen. Au-aware deep
networks for facial expression recognition. In Automatic
Face and Gesture Recognition (FG), 2013 10th IEEE Inter-
national Conference and Workshops on, pages 1–6. IEEE,
2013. 7
[30] M. Liu, S. Li, S. Shan, R. Wang, and X. Chen. Deeply
learning deformable facial action parts model for dynamic
expression analysis. In Asian Conference on Computer Vi-
sion, pages 143–157. Springer, 2014. 7
[31] M. Liu, S. Shan, R. Wang, and X. Chen. Learning expres-
sionlets on spatio-temporal manifold for dynamic facial ex-
pression recognition. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 1749–
1756, 2014. 7
[32] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and
I. Matthews. The extended cohn-kanade dataset (ck+): A
complete dataset for action unit and emotion-specified ex-
pression. In Computer Vision and Pattern Recognition Work-
shops (CVPRW), 2010 IEEE Computer Society Conference
on, pages 94–101. IEEE, 2010. 5
[33] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F.
Cohn. Disfa: A spontaneous facial action intensity database.
IEEE Transactions on Affective Computing, 4(2):151–160,
2013. 5
[34] C. Mayer, M. Eggers, and B. Radig. Cross-database evalu-
ation for facial expression recognition. Pattern recognition
and image analysis, 24(1):124–132, 2014. 6, 7, 8
[35] Y.-Q. Miao, R. Araujo, and M. S. Kamel. Cross-domain
facial expression recognition using supervised kernel mean
matching. In Machine Learning and Applications (ICMLA),
2012 11th International Conference on, volume 2, pages
326–332. IEEE, 2012. 6, 8
[36] M. Mohammadi, E. Fatemizadeh, and M. H. Mahoor. Pca-
based dictionary building for accurate facial expression
recognition via sparse representation. Journal of Visual
Communication and Image Representation, 25(5):1082–
1092, 2014. 2, 7
[37] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and
J. Kautz. Online detection and classification of dynamic hand
gestures with recurrent 3d convolutional neural network. In
Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 4207–4215, 2016. 2
[38] A. Mollahosseini, D. Chan, and M. H. Mahoor. Going deeper
in facial expression recognition using deep neural networks.
In 2016 IEEE Winter Conference on Applications of Com-
puter Vision (WACV), pages 1–10. IEEE, 2016. 1, 2, 6, 7,
8
[39] A. Mollahosseini, B. Hasani, M. J. Salvador, H. Abdollahi,
D. Chan, and M. H. Mahoor. Facial expression recognition
from world wild web. In The IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR) Workshops,
June 2016. 1, 2
[40] M. Pantic, M. Valstar, R. Rademaker, and L. Maat. Web-
based database for facial expression analysis. In 2005 IEEE
international conference on multimedia and Expo, pages 5–
pp. IEEE, 2005. 5
[41] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000
fps via regressing local binary features. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 1685–1692, 2014. 4
[42] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou,
and M. Pantic. 300 faces in-the-wild challenge: Database
and results. Image and Vision Computing, 47:3–18, 2016. 4
[43] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic.
A semi-automatic methodology for facial landmark annota-
tion. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops, pages 896–903,
2013. 4
[44] E. Sariyanidi, H. Gunes, and A. Cavallaro. Automatic anal-
ysis of facial affect: A survey of registration, representation,
and recognition. IEEE transactions on pattern analysis and
machine intelligence, 37(6):1113–1133, 2015. 2
[45] N. Sebe, M. S. Lew, Y. Sun, I. Cohen, T. Gevers, and T. S.
Huang. Authentic facial expression analysis. Image and Vi-
sion Computing, 25(12):1856–1863, 2007. 2
[46] C. Shan, S. Gong, and P. W. McOwan. Dynamic facial
expression recognition using a bayesian temporal manifold
model. In BMVC, pages 297–306. Citeseer, 2006. 2
[47] C. Shan, S. Gong, and P. W. McOwan. Facial expression
recognition based on local binary patterns: A comprehensive
study. Image and Vision Computing, 27(6):803–816, 2009.
1, 2, 6, 7, 8
[48] C. Sminchisescu, A. Kanaujia, and D. Metaxas. Conditional
models for contextual human motion recognition. Computer
Vision and Image Understanding, 104(2):210–220, 2006. 2
[49] S. Song and J. Xiao. Deep sliding shapes for amodal 3d ob-
ject detection in rgb-d images. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
pages 808–816, 2016. 2
[50] Y. Sun, X. Chen, M. Rosato, and L. Yin. Tracking vertex flow
and model adaptation for three-dimensional spatiotemporal
face analysis. Systems, Man and Cybernetics, Part A: Sys-
tems and Humans, IEEE Transactions on, 40(3):461–474,
2010. 2
[51] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4,
inception-resnet and the impact of residual connections on
learning. arXiv preprint arXiv:1602.07261, 2016. 2, 3
[52] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.
Going deeper with convolutions. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
pages 1–9, 2015. 2, 3
[53] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna.
Rethinking the inception architecture for computer vision.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2016. 2
[54] S. Taheri, Q. Qiu, and R. Chellappa. Structure-preserving
sparse decomposition for facial expression analysis. IEEE
Transactions on Image Processing, 23(8):3590–3603, 2014.
7
[55] Y. Tian, T. Kanade, and J. F. Cohn. Recognizing lower face
action units for facial expression analysis. In Automatic Face
and Gesture Recognition, 2000. Proceedings. Fourth IEEE
International Conference on, pages 484–490. IEEE, 2000. 1
[56] M. F. Valstar, B. Jiang, M. Mehu, M. Pantic, and K. Scherer.
The first facial expression recognition and analysis chal-
lenge. In Automatic Face & Gesture Recognition and Work-
shops (FG 2011), 2011 IEEE International Conference on,
pages 921–926. IEEE, 2011. 5
[57] M. F. Valstar, B. Jiang, M. Mehu, M. Pantic, and K. Scherer.
The first facial expression recognition and analysis chal-
lenge. In Automatic Face & Gesture Recognition and Work-
shops (FG 2011), 2011 IEEE International Conference on,
pages 921–926. IEEE, 2011. 7
[58] S. B. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian,
and T. Darrell. Hidden conditional random fields for ges-
ture recognition. In Computer Vision and Pattern Recogni-
tion, 2006 IEEE Computer Society Conference on, volume 2,
pages 1521–1527. IEEE, 2006. 2
[59] Z. Wang and Z. Ying. Facial expression recognition based on
local phase quantization and sparse representation. In Natu-
ral Computation (ICNC), 2012 Eighth International Confer-
ence on, pages 222–225. IEEE, 2012. 2
[60] M. Yeasin, B. Bullot, and R. Sharma. Recognition of fa-
cial expressions and measurement of levels of interest from
video. Multimedia, IEEE Transactions on, 8(3):500–508,
2006. 2
[61] L. Yu. face-alignment-in-3000fps. https://github.com/yulequan/face-alignment-in-3000fps, 2016. 4
[62] X. Zhang, M. H. Mahoor, and S. M. Mavadati. Facial expres-
sion recognition using lp-norm mkl multiclass-svm.
Machine Vision and Applications, 26(4):467–483, 2015. 6,
7, 8
[63] Y. Zhang and Q. Ji. Active and dynamic information fusion
for facial expression understanding from image sequences.
Pattern Analysis and Machine Intelligence, IEEE Transac-
tions on, 27(5):699–714, 2005. 2
[64] Y. Zhu, L. C. De Silva, and C. C. Ko. Using moment in-
variants and hmm in facial expression recognition. Pattern
Recognition Letters, 23(1):83–91, 2002. 2
World Economic Forum : The Global Risks Report 2024
 
Language Is Not All You Need: Aligning Perception with Language Models
Language Is Not All You Need: Aligning Perception with Language ModelsLanguage Is Not All You Need: Aligning Perception with Language Models
Language Is Not All You Need: Aligning Perception with Language Models
 
Real Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform DomainReal Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform Domain
 
Data and AI reference architecture
Data and AI reference architectureData and AI reference architecture
Data and AI reference architecture
 
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
 
An Artificial Neuron Implemented on an Actual Quantum Processor
An Artificial Neuron Implemented on an Actual Quantum ProcessorAn Artificial Neuron Implemented on an Actual Quantum Processor
An Artificial Neuron Implemented on an Actual Quantum Processor
 
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROSENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
 
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and...
The Malicious Use   of Artificial Intelligence: Forecasting, Prevention,  and...The Malicious Use   of Artificial Intelligence: Forecasting, Prevention,  and...
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and...
 
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
 
Deep learning-approach
Deep learning-approachDeep learning-approach
Deep learning-approach
 
WEF new vision for education
WEF new vision for educationWEF new vision for education
WEF new vision for education
 
El futuro del trabajo perspectivas regionales
El futuro del trabajo perspectivas regionalesEl futuro del trabajo perspectivas regionales
El futuro del trabajo perspectivas regionales
 
ASIA Y EL NUEVO (DES)ORDEN MUNDIAL
ASIA Y EL NUEVO (DES)ORDEN MUNDIALASIA Y EL NUEVO (DES)ORDEN MUNDIAL
ASIA Y EL NUEVO (DES)ORDEN MUNDIAL
 
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood DetectionDeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
 
FOR A MEANINGFUL ARTIFICIAL INTELLIGENCE TOWARDS A FRENCH AND EUROPEAN ST...
FOR A  MEANINGFUL  ARTIFICIAL  INTELLIGENCE TOWARDS A FRENCH  AND EUROPEAN ST...FOR A  MEANINGFUL  ARTIFICIAL  INTELLIGENCE TOWARDS A FRENCH  AND EUROPEAN ST...
FOR A MEANINGFUL ARTIFICIAL INTELLIGENCE TOWARDS A FRENCH AND EUROPEAN ST...
 
When Will AI Exceed Human Performance? Evidence from AI Experts
When Will AI Exceed Human Performance? Evidence from AI ExpertsWhen Will AI Exceed Human Performance? Evidence from AI Experts
When Will AI Exceed Human Performance? Evidence from AI Experts
 
Microsoft AI Platform Whitepaper
Microsoft AI Platform WhitepaperMicrosoft AI Platform Whitepaper
Microsoft AI Platform Whitepaper
 
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Ad...
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Ad...AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Ad...
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Ad...
 
Seven facts noncognitive skills education labor market
Seven facts noncognitive skills education labor marketSeven facts noncognitive skills education labor market
Seven facts noncognitive skills education labor market
 

Recently uploaded

Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 

Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Networks

which are essential for recognizing subtle changes in the appearance of facial images, especially in the transition frames between emotions. Recently, with the help of Deep Neural Networks (DNNs), more promising results have been reported in the field [38, 39]. While traditional approaches use engineered features to train classifiers, DNNs are able to extract more discriminative features, which yields a better interpretation of the texture of the human face in visual data.
One of the problems in FER is that training neural networks is significantly more difficult, as most of the existing databases have a small number of images or video sequences for certain emotions [39]. Also, most of these databases contain still images that are unrelated to each other (instead of consecutive frames exhibiting the expression from onset to offset), which makes the task of sequential image labeling more difficult.

In this paper, we propose a method which extracts the temporal relations of consecutive frames in a video sequence using 3D convolutional networks and Long Short-Term Memory (LSTM). Furthermore, we extract and incorporate facial landmarks in our proposed method to emphasize the more expressive facial components, which improves the recognition of subtle changes in the facial expressions in a sequence (Figure 1). We evaluate our proposed method using four well-known facial expression databases (CK+, MMI, FERA, and DISFA) in order to classify the expressions. Furthermore, we examine the ability of our method to recognize facial expressions in cross-database classification tasks.

The remainder of the paper is organized as follows: Section 2 provides an overview of the related work in this field. Section 3 explains the network proposed in this research. Experimental results and their analysis are presented in Section 4, and finally the paper is concluded in Section 5.

2. Related work

Traditionally, algorithms for automated facial expression recognition consist of three main modules, viz. registration, feature extraction, and classification. A detailed survey of different approaches in each of these steps can be found in [44]. Conventional algorithms for affective computing from faces use engineered features such as Local Binary Patterns (LBP) [47], Histogram of Oriented Gradients (HOG) [8], Local Phase Quantization (LPQ) [59], Histogram of Optical Flow [9], facial landmarks [6, 7], and PCA-based methods [36]. Since the majority of these features are hand-crafted for a specific recognition task, they often lack the generalizability required in cases with high variation in lighting, views, resolution, subjects' ethnicity, etc.

One of the effective approaches for achieving better recognition rates in sequence labeling is to extract the temporal relations of frames in a sequence. Extracting these temporal relations has been studied with traditional methods in the past. Examples of these attempts are Hidden Markov Models [5, 60, 64] (which combine temporal information and apply segmentation on videos), Spatio-Temporal Hidden Markov Models (ST-HMM) coupling S-HMM and T-HMM [50], Dynamic Bayesian Networks (DBN) [45, 63] associated with a multi-sensory information fusion strategy, Bayesian temporal models [46] that capture the dynamic facial expression transition, and Conditional Random Fields (CRFs) [19, 20, 25, 48] together with their extensions such as Latent-Dynamic Conditional Random Fields (LD-CRFs) and Hidden Conditional Random Fields (HCRFs) [58].

In recent years, Convolutional Neural Networks (CNNs) have become the most popular approach among researchers in the field. AlexNet [27] is based on the traditional CNN layered architecture, which consists of several convolution layers followed by max-pooling layers and Rectified Linear Units (ReLUs). Szegedy et al. [52] introduced GoogLeNet, which is composed of multiple "Inception" layers; Inception applies several convolutions to the feature map at different scales.
Mollahosseini et al. [38, 39] have used the Inception layer for the task of facial expression recognition and achieved state-of-the-art results. Following the success of Inception layers, several variations of them have been proposed [24, 53]. Moreover, the Inception layer has been combined with the residual unit introduced by He et al. [21], and the resulting architecture accelerates the training of Inception networks significantly [51].

One of the major restrictions of ordinary Convolutional Neural Networks is that they only extract spatial relations of the input data while ignoring the temporal relations when the data forms a sequence. To overcome this problem, 3D Convolutional Neural Networks (3D-CNNs) have been proposed. 3D-CNNs slide over the temporal dimension of the input data as well as the spatial dimensions, enabling the network to extract feature maps containing temporal information, which is essential for sequence labeling tasks. Song et al. [49] have used 3D-CNNs for the 3D object detection task, Molchanov et al. [37] proposed a recurrent 3D-CNN for dynamic hand gesture recognition, and Fan et al. [15] won the EmotiW 2016 challenge by cascading 3D-CNNs with LSTMs.

Traditional Recurrent Neural Networks (RNNs) can learn temporal dynamics by mapping input sequences to a sequence of hidden states, and also mapping the hidden states to outputs [12]. Although RNNs have shown promising performance on various tasks, it is not easy for them to learn long-term sequences. This is mainly due to the vanishing/exploding gradients problem [23], which can be addressed by having a memory for remembering and forgetting the previous states. LSTMs [23] provide such a memory and can memorize the context information for long periods of time. LSTM modules have three gates: 1) the input gate (i), 2) the forget gate (f), and 3) the output gate (o), which overwrite, keep, or retrieve the memory cell c, respectively, at timestep t. Let $\sigma(x) = (1 + \exp(-x))^{-1}$ be the sigmoid function and $\phi(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} = 2\sigma(2x) - 1$ be the hyperbolic tangent function, and let $x$, $h$, $c$, $W$, and $b$ be the input, output, cell state, parameter matrix, and parameter vector, respectively. The LSTM updates for timestep $t$ given inputs $x_t$, $h_{t-1}$, and $c_{t-1}$ are as follows:

$$
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\\
g_t &= \phi(W_C \cdot [h_{t-1}, x_t] + b_C)\\
C_t &= f_t * C_{t-1} + i_t * g_t\\
h_t &= o_t * \phi(C_t)
\end{aligned}
\qquad (1)
$$
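To make the gate updates concrete, the following is a minimal NumPy sketch of a single LSTM step implementing Eq. (1); the stacked weight layout, the `lstm_step` helper, and the toy dimensions are illustrative choices, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update per Eq. (1). W stacks the four gate matrices: (4*H, H+D)."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])          # forget gate
    i = sigmoid(z[H:2 * H])      # input gate
    o = sigmoid(z[2 * H:3 * H])  # output gate
    g = np.tanh(z[3 * H:4 * H])  # candidate cell state
    c = f * c_prev + i * g       # new cell state C_t
    h = o * np.tanh(c)           # new hidden state h_t
    return h, c

# toy usage: D = 8 input features, H = 4 hidden units, a 10-step sequence
D, H = 8, 4
rng = np.random.default_rng(0)
W, b = rng.standard_normal((4 * H, H + D)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.standard_normal((10, D)):
    h, c = lstm_step(x_t, h, c, W, b)
```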
Several works have used LSTMs for the task of sequence labeling. Byeon et al. [3] proposed an LSTM-based network applying LSTMs in four-direction sliding windows and achieved impressive results. Fan et al. [15] cascaded a 2D-CNN with LSTMs and combined the feature map with 3D-CNNs for facial expression recognition. Donahue et al. [12] proposed the Long-term Recurrent Convolutional Network (LRCN) by combining CNNs and LSTMs, which is both spatially and temporally deep and has the flexibility to be applied to different vision tasks involving sequential inputs and outputs.

3. Proposed method

While Inception and ResNet have shown remarkable results in FER [20, 52], these methods do not extract the temporal relations of the input data. Therefore, we propose a 3D Inception-ResNet architecture to address this issue. Our proposed method extracts both spatial and temporal features of the sequences in an end-to-end neural network. Another component of our method is the incorporation of facial landmarks in an automated manner during training. These facial landmarks help the network pay more attention to the important facial components in the feature maps, which results in more accurate recognition. The final part of our proposed method is an LSTM unit which takes the enhanced feature map resulting from the 3D Inception-ResNet (3DIR) layers as input and extracts the temporal information from it. The LSTM unit is followed by a fully-connected layer with a softmax activation function. In the following, we explain each of these units in detail.

3.1. 3D Inception-ResNet (3DIR)

We propose a 3D version of the Inception-ResNet network which is slightly shallower than the original Inception-ResNet network proposed in [51]. This network is the result of investigating several variations of the Inception-ResNet module and achieves better recognition rates compared to our other attempts on several databases. Figure 2 shows the structure of our 3D Inception-ResNet network.

Figure 2. Network architecture. The "V" and "S" marked layers represent "Valid" and "Same" paddings, respectively. The size of the output tensor is provided next to each layer.

The input videos, with size 10 × 299 × 299 × 3 (10 frames, a 299 × 299 frame size, and 3 color channels), are followed by the "stem" layer. Afterwards, the stem is followed by 3DIR-A, Reduction-A (which reduces the grid size from 38 × 38 to 18 × 18), 3DIR-B, Reduction-B (which reduces the grid size from 18 × 18 to 8 × 8), 3DIR-C, average pooling, dropout, and a fully-connected layer, respectively. The detailed specification of each layer is provided in Figure 2. Various filter sizes, paddings, strides, and activations were investigated, and the configuration with the best performance is presented in this paper.
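As a rough illustration of the 3DIR building block (not the exact 3DIR-A/B/C configurations of Figure 2), the sketch below shows the general pattern in Keras: parallel Conv3D branches, a linear 1 × 1 × 1 projection back to the input depth, and a residual addition. The filter counts, branch layout, and input shape are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_resnet_3d_block(x, branch_filters=32):
    """Simplified 3D Inception-ResNet style block; filter counts are illustrative."""
    # Parallel Inception-style branches operating on (frames, H, W, C) volumes.
    b1 = layers.Conv3D(branch_filters, (1, 1, 1), padding="same", activation="relu")(x)
    b2 = layers.Conv3D(branch_filters, (1, 1, 1), padding="same", activation="relu")(x)
    b2 = layers.Conv3D(branch_filters, (3, 3, 3), padding="same", activation="relu")(b2)
    b3 = layers.Conv3D(branch_filters, (1, 1, 1), padding="same", activation="relu")(x)
    b3 = layers.Conv3D(branch_filters, (3, 3, 3), padding="same", activation="relu")(b3)
    b3 = layers.Conv3D(branch_filters, (3, 3, 3), padding="same", activation="relu")(b3)
    merged = layers.Concatenate()([b1, b2, b3])
    # Linear 1x1x1 convolution back to the input depth, then the residual sum.
    up = layers.Conv3D(x.shape[-1], (1, 1, 1), padding="same", activation=None)(merged)
    out = layers.Add()([x, up])
    return layers.Activation("relu")(out)

# toy usage on a 10-frame, 38x38, 256-channel feature volume
inp = layers.Input(shape=(10, 38, 38, 256))
out = inception_resnet_3d_block(inp)
model = tf.keras.Model(inp, out)
```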
We should mention that all convolution layers (except the ones indicated as "Linear" in Figure 2) are followed by a ReLU [27] activation function to avoid the vanishing gradient problem.

3.2. Facial landmarks

As mentioned before, the main reason we use facial landmarks in our network is to differentiate between the importance of the main facial components (such as the eyebrows, lip corners, eyes, etc.) and other parts of the face which are less expressive of facial expressions. As opposed to the general object recognition task, in FER we have the advantage of extracting facial landmarks and using this information to improve the recognition rate. In a similar approach, Jaiswal et al. [26] proposed incorporating binary masks around different parts of the face in order to encode the shape of different face components. However, that work performs AU recognition by using a CNN as a feature extractor for training a Bi-directional Long Short-Term Memory, whereas in our approach we preserve the temporal order of the frames throughout the network and train the CNN and LSTM simultaneously in an end-to-end network. We incorporate the facial landmarks by replacing the shortcut in the residual unit of the original ResNet with an element-wise multiplication of the facial landmarks and the input tensor of the residual unit (Figures 1 and 2).

In order to extract the facial landmarks, the OpenCV face detector is used to obtain bounding boxes of the faces. A face alignment algorithm via regressing local binary features [41, 61] was then used to extract 66 facial landmark points. The facial landmark localization technique was trained using the annotations provided by the 300-W competition [42, 43].

After detecting and saving the facial landmarks for all of the databases, the facial landmark filters are generated for each sequence automatically during the training phase. Given the facial landmarks for each frame of a sequence, we initially resize all of the images in the sequence to their corresponding filter size in the network. Afterwards, we assign weights to all of the pixels in a frame based on their distances to the detected landmarks: the closer a pixel is to a facial landmark, the greater the weight assigned to that pixel. After investigating several distance measures, we concluded that the Manhattan distance with a linear weight function results in a better recognition rate across databases. The Manhattan distance between two items is the sum of the differences of their corresponding components (in this case two components). The weight function that we defined to assign the weight values is a simple linear function of the Manhattan distance:

$$\omega(L, P) = 1 - 0.1 \cdot d_M(L, P) \qquad (2)$$

where $d_M(L, P)$ is the Manhattan distance between the facial landmark $L$ and pixel $P$. Therefore, the places at which facial landmarks are located will have the highest value, and their surrounding pixels will have lower weights proportional to their distance from the corresponding facial landmark. To avoid overlap between two adjacent facial landmarks, we define a 7 × 7 window around each facial landmark and apply the weight function to these 49 pixels for each landmark separately. Figure 3 shows an example of a facial image from the MMI database and its corresponding facial landmark filter in the network.

Figure 3. Sample image from the MMI database (left) and its corresponding filter in the network (right): (a) landmarks, (b) generated filter. Best viewed in color.
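A minimal sketch of the landmark filter generation described above; the `landmark_filter` helper, the base weight used outside the 7 × 7 windows, and the max rule for overlapping windows are assumptions that the paper does not specify.

```python
import numpy as np

def landmark_filter(landmarks, height, width, window=7, base=0.0):
    """Sketch of the landmark weight map in Eq. (2): weight 1 at a landmark,
    decreasing by 0.1 per unit of Manhattan distance inside a 7x7 window.
    `landmarks` holds (row, col) points already scaled to the feature-map size;
    the `base` value and the max() rule for overlapping windows are assumptions."""
    w = np.full((height, width), base, dtype=np.float32)
    half = window // 2
    for lr, lc in landmarks:
        for r in range(max(0, lr - half), min(height, lr + half + 1)):
            for c in range(max(0, lc - half), min(width, lc + half + 1)):
                d_m = abs(r - lr) + abs(c - lc)          # Manhattan distance to the landmark
                w[r, c] = max(w[r, c], 1.0 - 0.1 * d_m)  # Eq. (2)
    return w

# toy usage: two landmarks on a 38x38 feature map
filt = landmark_filter([(10, 12), (20, 25)], 38, 38)
```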
We do not incorporate the facial landmarks in the third 3D Inception-ResNet module, since the feature map at this stage becomes too small for calculating the facial landmark filter.

Incorporating facial landmarks in our network replaces the shortcut in the original ResNet [22] with the element-wise multiplication of the weight function $\omega$ and the input layer $x_l$ as follows:

$$
\begin{aligned}
y_l &= \omega(L, P) \circ x_l + \mathcal{F}(x_l, W_l)\\
x_{l+1} &= f(y_l)
\end{aligned}
\qquad (3)
$$

where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th layer, $\circ$ denotes the Hadamard product, $\mathcal{F}$ is a residual function (in our case the Inception layer convolutions), and $f$ is an activation function.

3.3. Long Short-Term Memory unit

As explained earlier, to capture the temporal relations in the feature map produced by 3DIR and to take these relations into account when classifying the sequences in the softmax layer, we use an LSTM unit, as shown in Figure 2. Using an LSTM unit is a natural choice here because the feature map produced by the 3DIR unit retains the time dimension of the sequences. Therefore, vectorizing the 3DIR feature map along its sequence dimension provides the sequenced input required by the LSTM unit. While in other still-image LSTM-based methods a vectorized, non-sequenced feature map (which contains no notion of time) is fed to the LSTM unit, our method preserves the temporal order of the input sequences and passes this feature map to the LSTM unit. We found that 200 hidden units for the LSTM unit is a reasonable choice for the task of FER (Figure 2).
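The sketch below illustrates the landmark-modulated shortcut of Eq. (3) and the sequence head of Section 3.3 in Keras; `residual_fn`, the feature-map shape, and the seven-class output are illustrative assumptions rather than the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def modulated_residual(x, landmark_map, residual_fn):
    """Eq. (3): the identity shortcut becomes the element-wise (Hadamard)
    product of the landmark weight map and the block input; `residual_fn`
    stands in for the block's Inception-style convolutions."""
    shortcut = x * landmark_map            # w(L, P) o x_l, broadcast over channels
    y = shortcut + residual_fn(x)          # + F(x_l, W_l)
    return tf.nn.relu(y)                   # x_{l+1} = f(y_l)

# Sequence head (Section 3.3): keep the time axis, flatten each frame's
# spatial features, then an LSTM with 200 hidden units and a softmax layer.
feat = layers.Input(shape=(10, 8, 8, 128))        # illustrative 3DIR output shape
seq = layers.Reshape((10, 8 * 8 * 128))(feat)
h = layers.LSTM(200)(seq)
probs = layers.Dense(7, activation="softmax")(h)  # e.g. seven expression classes
head = tf.keras.Model(feat, probs)
```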
The proposed network was implemented using a combination of the TensorFlow [1] and TFlearn [10] toolboxes on NVIDIA Tesla K40 GPUs. In the training phase, we used asynchronous stochastic gradient descent with a momentum of 0.9, a weight decay of 0.0001, and a learning rate of 0.01. We used categorical cross-entropy as the loss function and accuracy as the evaluation metric.
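A minimal Keras sketch of the reported optimizer and loss settings, with a stand-in model in place of the full 3DIR + LSTM architecture; the weight decay and the asynchronous multi-GPU training are not reproduced in this single-GPU sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in model; the real network is the 3DIR + LSTM architecture of Figure 2.
model = tf.keras.Sequential([
    layers.Input(shape=(10, 64, 64, 3)),        # illustrative, smaller than 10x299x299x3
    layers.Conv3D(8, (3, 3, 3), activation="relu"),
    layers.GlobalAveragePooling3D(),
    layers.Dense(7, activation="softmax"),
])

# Reported settings: SGD with momentum 0.9 and learning rate 0.01,
# categorical cross-entropy loss, accuracy as the metric.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```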
4. Experiments and results

In this section, we briefly review the databases used to evaluate our method. We then report the results of our experiments on these databases and compare them with the state of the art.

4.1. Face databases

Since our method is designed mainly for classifying sequences of inputs, databases that contain only independent, unrelated still images of facial expressions, such as MultiPIE [18], SFEW [11], and FER2013 [17], cannot be used with our method. We evaluate our proposed method on MMI [40], extended CK+ [32], GEMEP-FERA [2], and DISFA [33], which contain videos of annotated facial expressions. In the following, we briefly review the contents of these databases.

MMI: The MMI database [40] contains more than 20 subjects, ranging in age from 19 to 62, with different ethnicities (European, Asian, or South American). In MMI, the subjects' facial expressions start from the neutral state, reach the apex of one of the six basic facial expressions, and then return to the neutral state. Subjects were instructed to display 79 series of facial expressions, six of which are prototypic emotions (angry, disgust, fear, happy, sad, and surprise). We extracted static frames from each sequence, which resulted in 11,500 images. Afterwards, we divided the videos into sequences of ten frames to shape the input tensor for our network.

CK+: The extended Cohn-Kanade database (CK+) [32] contains 593 videos from 123 subjects. However, only 327 sequences from 118 subjects have facial expression labels. Sequences in this database start from the neutral state and end at the apex of one of the basic expressions (angry, contempt, disgust, fear, happy, sad, and surprise). CK+ primarily contains frontal face poses only. In order to make the database compatible with our network, we consider the last ten frames of each sequence as an input sequence for our network.

FERA: The GEMEP-FERA database [2] is a subset of the GEMEP corpus used for the FERA 2011 challenge [56], developed by the Geneva Emotion Research Group at the University of Geneva. This database contains 87 image sequences of 7 subjects. Each subject shows facial expressions of the emotion categories Anger, Fear, Joy, Relief, and Sadness. Head pose is primarily frontal with relatively fast movements. Each video is annotated with AUs and holistic expressions. By extracting static frames from the sequences, we obtained around 7,000 images. We divided these emotion videos into sequences of ten frames to shape the input tensor for our network.

DISFA: The Denver Intensity of Spontaneous Facial Actions (DISFA) database [33] is one of the few naturalistic databases that have been FACS coded with AU intensity values. This database consists of 27 subjects who were asked to watch YouTube videos while their spontaneous facial expressions were recorded. Twelve AUs are coded for each frame, and AU intensities are on a six-point scale between 0 and 5, where 0 denotes the absence of the AU and 5 represents maximum intensity. As DISFA is not emotion-specified coded, we used the EMFACS system [16] to convert the AU FACS codes to seven expressions (angry, disgust, fear, happy, neutral, sad, and surprise), which resulted in around 89,000 images, the majority of which have the neutral expression. As with the other databases, we divided the videos of emotions into sequences of ten frames to shape the input tensor for our network.

4.2. Results

As mentioned earlier, after detecting faces we extract 66 facial landmark points with a face alignment algorithm via regressing local binary features. Afterwards, we resize the faces to 299 × 299 pixels. One of the reasons we chose a large input image size is that larger images and sequences enable us to build deeper networks and extract more abstract features from the sequences. All of the networks have the same settings (shown in detail in Figure 2) and are trained from scratch for each database separately. We evaluate the accuracy of our proposed method with two different sets of experiments: "subject-independent" and "cross-database" evaluations.

4.2.1 Subject-independent task

In the subject-independent task, each database is split into training and validation sets in a strict subject-independent manner. For all databases, we report the results using the 5-fold cross-validation technique and then average the recognition rates over the five folds. For each database and each fold, we trained our proposed network entirely from scratch with the aforementioned settings.
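A sketch of the data preparation and evaluation protocol described above: videos are cut into ten-frame clips, and a subject-independent 5-fold split keeps each subject's clips in a single fold. GroupKFold is used here only as a stand-in for the authors' exact folding procedure, and the array shapes are placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def make_clips(frames, clip_len=10):
    """Split one video (an array of frames) into consecutive ten-frame clips,
    matching the 10-frame input tensor described in Section 4.1."""
    n = len(frames) // clip_len
    return [np.stack(frames[i * clip_len:(i + 1) * clip_len]) for i in range(n)]

example_video = np.zeros((25, 64, 64, 3), dtype=np.float32)   # 25 frames -> 2 clips
assert len(make_clips(example_video)) == 2

# Placeholder arrays standing in for the parsed databases; each clip carries
# a label and the identifier of the subject it came from.
clips = np.zeros((40, 10, 64, 64, 3), dtype=np.float32)       # 40 ten-frame clips
labels = np.zeros(40, dtype=np.int64)
subjects = np.repeat(np.arange(10), 4)                        # 10 subjects, 4 clips each

# Strict subject independence: no subject appears in both train and test folds.
for train_idx, test_idx in GroupKFold(n_splits=5).split(clips, labels, groups=subjects):
    assert not set(subjects[train_idx]) & set(subjects[test_idx])
```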
Table 1 shows the recognition rates achieved on each database in the subject-independent case and compares the results with state-of-the-art methods. In order to assess the impact of incorporating facial landmarks, we also provide the results of our network when the landmark multiplication unit is removed and replaced with a simple shortcut between the input and output of the residual unit; in this case, we randomly select 20 percent of the subjects as the test set and report the results on those subjects. Table 1 also provides the recognition rates of the traditional 2D Inception-ResNet from [20], which contains neither the facial landmarks nor the LSTM unit (DISFA is not examined in that study).

Table 1. Recognition rates (%) in the subject-independent task.

         State-of-the-art methods                                                      2D Inception-ResNet   3D Inception-ResNet   3D Inception-ResNet + landmarks
  CK+    84.1 [34], 84.4 [28], 88.5 [54], 92.0 [29], 93.2 [38], 92.4 [30], 93.6 [62]   85.77                 89.50                 93.21 ± 2.32
  MMI    63.4 [30], 75.12 [31], 74.7 [29], 79.8 [54], 86.7 [47], 78.51 [36]            55.83                 67.50                 77.50 ± 1.76
  FERA   56.1 [30], 55.6 [57], 76.7 [38]                                               49.64                 67.74                 77.42 ± 3.67
  DISFA  55.0 [38]                                                                     -                     51.35                 58.00 ± 5.77

Comparing the recognition rates of the 3D and 2D Inception-ResNets in Table 1 shows that the sequential processing of facial expressions considerably enhances the recognition rate. This improvement is most apparent on the MMI and FERA databases. Incorporating landmarks in the network is proposed to emphasize the more important facial changes over time. Since changes in the lips or eyes are much more expressive than changes in other components such as the cheeks, we utilize facial landmarks to enhance these temporal changes in the network flow. The "3D Inception-ResNet with landmarks" column in Table 1 shows the impact of this enhancement on the different databases: compared with the other networks, there is a considerable improvement in recognition rates, especially on FERA and MMI. The results on DISFA, however, show higher fluctuations over different folds, which can be partly attributed to the abundance of inactive frames in this database, causing confusion between different expressions. Therefore, folds that contain more neutral faces show lower recognition rates.

Compared to other state-of-the-art works, our method outperforms the others on the FERA and DISFA databases while achieving comparable results on CK+ and MMI (Table 1). Most of these works use traditional approaches, including hand-crafted features tuned for a specific database, whereas our network's settings are the same for all databases. Also, due to the limited number of samples in these databases, it is difficult to properly train a deep neural network and avoid overfitting. For these reasons, and in order to better understand our proposed method, we also performed cross-database experiments.

Figure 4 shows the confusion matrices of our 3D Inception-ResNet with landmarks on the different databases over the 5 folds.

Figure 4. Confusion matrices of 3D Inception-ResNet with landmarks for the subject-independent task: (a) CK+, (b) MMI, (c) FERA, (d) DISFA.

On CK+ (Figure 4a), very high recognition rates are achieved. The recognition rates for happiness, sadness, and surprise are higher than those of the other expressions. The highest confusion occurred between the happiness and contempt expressions, which can be attributed to the low number of contempt sequences in this database (only 18 sequences). On MMI (Figure 4b), a perfect recognition rate is achieved for the happy expression. There is high confusion between the sad and fear expressions as well as between the angry and sad expressions; considering that MMI is a highly imbalanced dataset, these confusions are reasonable. On FERA (Figure 4c), the highest and lowest recognition rates belong to joy and relief, respectively. The relief category in this database has some similarities with the other categories, especially joy, which makes the classification difficult even for humans. Despite these challenges, our method performs well on all of the categories and outperforms the state of the art. On DISFA (Figure 4d), we see the highest confusion rate compared with the other databases. As mentioned earlier, this database contains long inactive segments, which means that the number of neutral sequences is considerably higher than that of the other categories.
This imbalanced training data biases the network toward the neutral category, and therefore we observe a high confusion rate between the neutral expression and the other categories in this database. Despite the low number of angry and sad sequences in this database, our method is able to achieve satisfying recognition rates in these categories.

4.2.2 Cross-database task

In the cross-database task, for testing each database, that database is used entirely for testing the network while the remaining databases are used to train it. The same network architecture as in the subject-independent task (Figure 2) was used. Table 2 shows the recognition rates achieved on each database in the cross-database case and compares the results with other state-of-the-art methods.

Table 2. Recognition rates (%) in the cross-database task.

         State-of-the-art methods                                3D Inception-ResNet + landmarks
  CK+    47.1 [34], 56.0 [35], 61.2 [62], 64.2 [38]              67.52
  MMI    51.4 [34], 50.8 [47], 36.8 [35], 55.6 [38], 66.9 [62]   54.76
  FERA   39.4 [38]                                               41.93
  DISFA  37.7 [38]                                               40.51

Our method outperforms the state-of-the-art results on the CK+, FERA, and DISFA databases. On MMI, our method does not show an improvement over others (e.g. [62]). However, the authors in [62] trained their classifier only on the CK+ database, whereas our method uses instances from two additional databases (DISFA and FERA) with completely different settings and subjects, which adds a significant amount of ambiguity in the training phase.

In order to have a fair comparison, we list the different settings used by the works mentioned in Table 2. The results reported in [34] are achieved by training the models on one of the CK+, MMI, and FEEDTUM databases and testing on the rest. The result reported in [47] is the best result achieved using different SVM kernels trained on CK+ and tested on the MMI database. In [35], several experiments were performed using four classifiers (SVM, Nearest Mean Classifier, Weighted Template Matching, and K-nearest neighbors); the reported result for CK+ is trained on the MMI and JAFFE databases, while the reported result for MMI is trained on the CK+ database only. As mentioned earlier, in [62] a Multiple Kernel Learning algorithm is used, and the cross-database experiments are trained on CK+ and evaluated on MMI, and vice versa. In [38], a DNN is proposed using the traditional Inception layer; the networks for the cross-database case in that work are tested on one of CK+, MultiPIE, MMI, DISFA, FERA, SFEW, or FER2013 while trained on the rest.
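For completeness, a sketch of the leave-one-database-out protocol described at the start of this subsection; `train_and_evaluate` and the dummy arrays are placeholders, not the paper's training code.

```python
import numpy as np

def cross_database_runs(datasets, train_and_evaluate):
    """Leave-one-database-out: train on the clips of all databases except the
    held-out one, then evaluate on the held-out database."""
    results = {}
    for held_out in datasets:
        train_names = [name for name in datasets if name != held_out]
        x_tr = np.concatenate([datasets[n][0] for n in train_names])
        y_tr = np.concatenate([datasets[n][1] for n in train_names])
        x_te, y_te = datasets[held_out]
        results[held_out] = train_and_evaluate(x_tr, y_tr, x_te, y_te)
    return results

# toy usage with dummy arrays and a dummy evaluation callback
dummy = {name: (np.zeros((4, 10, 64, 64, 3), dtype=np.float32),
                np.zeros(4, dtype=np.int64))
         for name in ["CK+", "MMI", "FERA", "DISFA"]}
print(cross_database_runs(dummy, lambda x_tr, y_tr, x_te, y_te: 0.0))
```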
Some of the expressions in these databases are excluded in this study (such as neutral, relief, and contempt). There are other works that perform their experiments on the action unit recognition task [4, 14, 26], but since a fair comparison between action unit recognition and facial expression recognition is not easily obtainable, we do not include these works in Tables 1 and 2.

Figure 5 shows the confusion matrices of our experiments with 3D Inception-ResNet with landmarks in the cross-database task.

Figure 5. Confusion matrices of 3D Inception-ResNet with landmarks for the cross-database task: (a) CK+, (b) MMI, (c) FERA, (d) DISFA.

For CK+ (Figure 5a), we exclude the contempt sequences in the test phase, since the other databases used for training do not contain the contempt category. Except for the fear expression (which has very few samples in the other databases), the network is able to correctly recognize the other expressions. For MMI (Figure 5b), the highest recognition rate belongs to surprise while the lowest belongs to fear, and we also see a high confusion rate in recognizing sadness. On FERA (Figure 5c), we exclude the relief category, as the other databases do not contain this emotion. Considering that only half of the training categories exist in the test set, the network shows acceptable performance in correctly recognizing emotions; however, the surprise category causes significant confusion across all of the categories. On DISFA (Figure 5d), we exclude the neutral category, as the other databases do not contain it.
The highest recognition rates belong to the happy and surprise emotions, while the lowest belongs to fear. Compared to the other databases, we see a significant increase in the confusion rate across all of the categories. This can be partly attributed to the fact that the expressions in DISFA are "spontaneous" while the expressions in the training databases are "posed". Based on the aforementioned results, our method provides a comprehensive solution that can generalize well to practical applications.

5. Conclusion

In this paper, we presented a 3D Deep Neural Network for the task of facial expression recognition in videos. We proposed the 3D Inception-ResNet (3DIR) network, which extends the well-known 2D Inception-ResNet module for processing image sequences. The additional dimension results in a volume of feature maps and extracts the spatial relations between frames in a sequence. This module is followed by an LSTM which takes these temporal relations into account and uses this information to classify the sequences. In order to differentiate between facial components and other parts of the face, we incorporated facial landmarks in our proposed method; these landmarks are multiplied with the input tensor of the residual module, replacing the shortcut of the traditional residual layer. We evaluated our proposed method in subject-independent and cross-database tasks on four well-known databases: CK+, MMI, FERA, and DISFA. Our experiments show that the proposed method outperforms many state-of-the-art methods in both tasks and provides a general solution for the task of FER.

6. Acknowledgement

This work is partially supported by NSF grants IIS-1111568 and CNS-1427872. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPUs used for this research.
  • 10. References [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016. 5 [2] T. B¨anziger and K. R. Scherer. Introducing the geneva mul- timodal emotion portrayal (gemep) corpus. Blueprint for af- fective computing: A sourcebook, pages 271–294, 2010. 5 [3] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with lstm recurrent neural networks. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3547–3555, 2015. 3 [4] W.-S. Chu, F. De la Torre, and J. F. Cohn. Modeling spatial and temporal cues for multi-label facial action unit detection. arXiv preprint arXiv:1608.00911, 2016. 7 [5] I. Cohen, N. Sebe, A. Garg, L. S. Chen, and T. S. Huang. Facial expression recognition from video sequences: tempo- ral and static modeling. Computer Vision and image under- standing, 91(1):160–187, 2003. 2 [6] T. F. Cootes, G. J. Edwards, C. J. Taylor, et al. Active ap- pearance models. IEEE Transactions on pattern analysis and machine intelligence, 23(6):681–685, 2001. 2 [7] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Ac- tive shape models-their training and application. Computer vision and image understanding, 61(1):38–59, 1995. 2 [8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recogni- tion, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005. 2 [9] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In European conference on computer vision, pages 428–441. Springer, 2006. 2 [10] A. Damien et al. Tflearn. https://github.com/ tflearn/tflearn, 2016. 5 [11] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon. Static fa- cial expression analysis in tough conditions: Data, evalua- tion protocol and benchmark. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 2106–2112. IEEE, 2011. 5 [12] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Dar- rell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015. 2, 3 [13] P. Ekman and W. V. Friesen. Constants across cultures in the face and emotion. Journal of personality and social psychol- ogy, 17(2):124, 1971. 1 [14] C. Fabian Benitez-Quiroz, R. Srinivasan, and A. M. Mar- tinez. Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5562–5570, 2016. 7 [15] Y. Fan, X. Lu, D. Li, and Y. Liu. Video-based emotion recog- nition using cnn-rnn and c3d hybrid networks. In Proceed- ings of the 18th ACM International Conference on Multi- modal Interaction, ICMI 2016, pages 445–450, New York, NY, USA, 2016. ACM. 2, 3 [16] W. V. Friesen and P. Ekman. Emfacs-7: Emotional facial action coding system. Unpublished manuscript, University of California at San Francisco, 2(36):1, 1983. 5 [17] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.- H. Lee, et al. Challenges in representation learning: A report on three machine learning contests. 
In International Con- ference on Neural Information Processing, pages 117–124. Springer, 2013. 5 [18] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. Image and Vision Computing, 28(5):807–813, 2010. 5 [19] B. Hasani, M. M. Arzani, M. Fathy, and K. Raahemifar. Facial expression recognition with discriminatory graphical models. In 2016 2nd International Conference of Signal Pro- cessing and Intelligent Systems (ICSPIS), pages 1–7, Dec 2016. 2 [20] B. Hasani and M. H. Mahoor. Spatio-temporal facial expres- sion recognition using convolutional neural networks and conditional random fields. arXiv preprint arXiv:1703.06995, 2017. 2, 3, 6 [21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. 2 [22] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. pages 630–645, 2016. 4 [23] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. 2 [24] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 2 [25] S. Jain, C. Hu, and J. K. Aggarwal. Facial expression recognition with temporal modeling of shapes. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE Interna- tional Conference on, pages 1642–1649. IEEE, 2011. 2 [26] S. Jaiswal and M. Valstar. Deep learning the dynamic ap- pearance and shape of facial action units. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pages 1–8. IEEE, 2016. 4, 7 [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 2, 4 [28] S. H. Lee, K. N. K. Plataniotis, and Y. M. Ro. Intra- class variation reduction using training expression images for sparse representation based facial expression recognition. IEEE Transactions on Affective Computing, 5(3):340–351, 2014. 7 [29] M. Liu, S. Li, S. Shan, and X. Chen. Au-aware deep networks for facial expression recognition. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE Inter- national Conference and Workshops on, pages 1–6. IEEE, 2013. 7
[30] M. Liu, S. Li, S. Shan, R. Wang, and X. Chen. Deeply learning deformable facial action parts model for dynamic expression analysis. In Asian Conference on Computer Vision, pages 143–157. Springer, 2014. 7
[31] M. Liu, S. Shan, R. Wang, and X. Chen. Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1749–1756, 2014. 7
[32] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 94–101. IEEE, 2010. 5
[33] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn. Disfa: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing, 4(2):151–160, 2013. 5
[34] C. Mayer, M. Eggers, and B. Radig. Cross-database evaluation for facial expression recognition. Pattern Recognition and Image Analysis, 24(1):124–132, 2014. 6, 7, 8
[35] Y.-Q. Miao, R. Araujo, and M. S. Kamel. Cross-domain facial expression recognition using supervised kernel mean matching. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 2, pages 326–332. IEEE, 2012. 6, 8
[36] M. Mohammadi, E. Fatemizadeh, and M. H. Mahoor. Pca-based dictionary building for accurate facial expression recognition via sparse representation. Journal of Visual Communication and Image Representation, 25(5):1082–1092, 2014. 2, 7
[37] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4207–4215, 2016. 2
[38] A. Mollahosseini, D. Chan, and M. H. Mahoor. Going deeper in facial expression recognition using deep neural networks. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–10. IEEE, 2016. 1, 2, 6, 7, 8
[39] A. Mollahosseini, B. Hasani, M. J. Salvador, H. Abdollahi, D. Chan, and M. H. Mahoor. Facial expression recognition from world wild web. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2016. 1, 2
[40] M. Pantic, M. Valstar, R. Rademaker, and L. Maat. Web-based database for facial expression analysis. In 2005 IEEE International Conference on Multimedia and Expo, 5 pp. IEEE, 2005. 5
[41] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1685–1692, 2014. 4
[42] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: Database and results. Image and Vision Computing, 47:3–18, 2016. 4
[43] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. A semi-automatic methodology for facial landmark annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 896–903, 2013. 4
[44] E. Sariyanidi, H. Gunes, and A. Cavallaro. Automatic analysis of facial affect: A survey of registration, representation, and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(6):1113–1133, 2015. 2
[45] N. Sebe, M. S. Lew, Y. Sun, I. Cohen, T. Gevers, and T. S. Huang. Authentic facial expression analysis. Image and Vision Computing, 25(12):1856–1863, 2007. 2
[46] C. Shan, S. Gong, and P. W. McOwan. Dynamic facial expression recognition using a bayesian temporal manifold model. In BMVC, pages 297–306. Citeseer, 2006. 2
[47] C. Shan, S. Gong, and P. W. McOwan. Facial expression recognition based on local binary patterns: A comprehensive study. Image and Vision Computing, 27(6):803–816, 2009. 1, 2, 6, 7, 8
[48] C. Sminchisescu, A. Kanaujia, and D. Metaxas. Conditional models for contextual human motion recognition. Computer Vision and Image Understanding, 104(2):210–220, 2006. 2
[49] S. Song and J. Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016. 2
[50] Y. Sun, X. Chen, M. Rosato, and L. Yin. Tracking vertex flow and model adaptation for three-dimensional spatiotemporal face analysis. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 40(3):461–474, 2010. 2
[51] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016. 2, 3
[52] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015. 2, 3
[53] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. 2
[54] S. Taheri, Q. Qiu, and R. Chellappa. Structure-preserving sparse decomposition for facial expression analysis. IEEE Transactions on Image Processing, 23(8):3590–3603, 2014. 7
[55] Y. Tian, T. Kanade, and J. F. Cohn. Recognizing lower face action units for facial expression analysis. In Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on, pages 484–490. IEEE, 2000. 1
[56] M. F. Valstar, B. Jiang, M. Mehu, M. Pantic, and K. Scherer. The first facial expression recognition and analysis challenge. In Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 921–926. IEEE, 2011. 5
[57] M. F. Valstar, B. Jiang, M. Mehu, M. Pantic, and K. Scherer. The first facial expression recognition and analysis challenge. In Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 921–926. IEEE, 2011. 7
[58] S. B. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell. Hidden conditional random fields for gesture recognition. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 1521–1527. IEEE, 2006. 2
[59] Z. Wang and Z. Ying. Facial expression recognition based on local phase quantization and sparse representation. In Natural Computation (ICNC), 2012 Eighth International Conference on, pages 222–225. IEEE, 2012. 2
[60] M. Yeasin, B. Bullot, and R. Sharma. Recognition of facial expressions and measurement of levels of interest from video. Multimedia, IEEE Transactions on, 8(3):500–508, 2006. 2
[61] L. Yu. face-alignment-in-3000fps. https://github.com/yulequan/face-alignment-in-3000fps, 2016. 4
[62] X. Zhang, M. H. Mahoor, and S. M. Mavadati. Facial expression recognition using lp-norm MKL multiclass-svm. Machine Vision and Applications, 26(4):467–483, 2015. 6, 7, 8
[63] Y. Zhang and Q. Ji. Active and dynamic information fusion for facial expression understanding from image sequences. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(5):699–714, 2005. 2
[64] Y. Zhu, L. C. De Silva, and C. C. Ko. Using moment invariants and hmm in facial expression recognition. Pattern Recognition Letters, 23(1):83–91, 2002. 2