Building Emotional Machines: Recognizing Image
Emotions through Deep Neural Networks
Hye-Rin Kim, Yeong-Seok Kim, Seon Joo Kim, In-Kwon Lee
Abstract—An image is a very effective tool for conveying emo-
tions. Many researchers have investigated computing image
emotions using various features extracted from images. In this
paper, we focus on two high level features, the object and the
background, and assume that the semantic information of images
is a good cue for predicting emotion. An object is one of the most
important elements that define an image, and we find out through
experiments that there is a high correlation between the object
and the emotion in images. Even with the same object, there may
be slight differences in emotion due to different backgrounds, and
we use the semantic information of the background to improve
the prediction performance. By combining the different levels of
features, we build an emotion based feed forward deep neural
network which produces the emotion values of a given image.
The output emotion values in our framework are continuous
values in the 2-dimensional Valence-Arousal space, which
are more effective than using a small number of emotion categories
in describing emotions. Experiments confirm the effectiveness of
our network in predicting the emotion of images.
I. INTRODUCTION
Images are very powerful tools for conveying moods and
emotions as shown in Figure 1. Through images, people can
express their feelings and communicate with other people.
With the recent development in the deep learning technology,
computers have become better at recognizing objects, faces,
and actions. Computers have also started to write image
captions and answer questions about images. But how about
emotions? Can we teach computers to have similar feelings
as humans do when looking at images? Predicting evoked
emotion from an image is a difficult task and is still in its
early stage.
As the deep learning technology shows remarkable perfor-
mance in various computer vision tasks such as image classification [1]–[4],
segmentation [5], [6], and image processing [7], [8], several studies have
recently been introduced that apply deep learning to emotion prediction [9]–[13].
Those works mostly use convolutional neural networks
(CNNs) [14], which have shown better prediction results for
emotion classification than models that use a shallow network
such as a linear model.
The CNN has had a big impact especially in image
classification, and it is an effective network model for learning
filters that capture the shapes that repeatedly appear in images.
However, we argue that the learning process for the image
emotion prediction should be different from that of image
classification. This is because some images with different
appearances can have the same emotions, and some images with
similar appearances can have different emotions. For example,
an image that includes a person riding a bike and an image
that includes a person surfing in the ocean may give the same
feelings, even though they look different. From this point of
view, the performance may be limited if a CNN is applied to an
emotion prediction system.

Fig. 1. Images with different emotions.
In addition, one of the main issues in the emotion recog-
nition is the affective gap. The affective gap is the lack of
coincidence between the measurable signal properties, com-
monly referred to as features, and the expected affective state
in which the user is brought by perceiving the signal [15].
To narrow this affective gap, several works proposed emotion
classification systems based on high-level features from psychology
and art theory, such as harmony, movement, and the rule of
thirds [16]–[18]. While those features help to improve
emotion recognition, a better set of features is still necessary.
Distinguishing the effective features among the various candidates
is also important. As another example, images with objects such as
guns or sharks arouse fear, while images with babies or flowers lead
to more happiness. It can be speculated that certain objects will affect
the determination of emotions. Based on this observation, we
assume that the main object appearing in the image plays an
important role in determining the emotion. The idea is that
object categories can be good cues for emotions. Through
experiments, we show that objects appearing in images are
related to emotion values, and we use objects as one of
the features of our model. Moreover, even if images include
the same object, the emotion can vary depending on the
background. We therefore also use the semantic information of the
background as a feature to improve prediction performance.

Fig. 2. Two-dimensional emotion model (Valence and Arousal).
Predicting emotions from images is a complex task that is
quite different from object detection or image classification.
In this paper, we combine high-level features, namely the
object and the background information extracted from deep
network models pre-trained for image classification and
segmentation, with low-level features such as color statistics
to obtain a set of features for emotion prediction. Using this
feature set, we design a feed-forward neural network (FFNN) [19]
which produces the emotion value of a given image.
In previous works on emotion prediction, emotions
are categorized into a small number of classes, such as happy,
awe, sad, and fear. In comparison, we opt for the dimensional
model for expressing the emotions [20], which is widely used
in the field of psychology. Specifically, the dimensional model
consists of two parameters, Valence and Arousal (Figure 2).
Valence represents pleasure on a scale from 1
(negative) to 9 (positive). Arousal is the level of excitement,
which also ranges from 1 (calm) to 9 (excited). Using this
model, any emotion can be represented by using these two
values in 2D space. As this dimensional model can express
more emotions compared to using a discrete set of emotion
categories, we train our emotion prediction system based on
the V-A model.
The contributions of our work are summarized as follows:
• We propose an image emotion recognition system which
outputs valence-arousal values for expressing emotion. As
far as we know, this is the first system to build a deep
learning model for the dimensional (V-A) emotion model.
• We build a new image emotion dataset through
crowdsourcing. A feed-forward neural network is used
to learn emotion features from this database.
• We propose a novel idea of using a pre-trained CNN to
relate the main object and background of the image to
the emotional feature.
• We show the effectiveness of the object for estimating the
emotion of an image, using correlations between the emotional
values of words and images.
II. RELATED WORK
The emotion of an image can be evoked by various factors.
To figure out significant features for the emotion prediction
problem, many researchers have considered various types of
features, from color statistics to art-theoretic and psychological
features. Machajdik et al. [17] introduced an affective
image classification system using psychology and art theory
based features such as Itten’s color contrast and rule of thirds.
Zhao et al. [18] proposed to extract principles of art features
for emotion classification. Similarly, Lu et al. [16] computed
shape-based features in natural images. As a high-level
concept for representing the sentiment of an image, adjective-
noun pairs were introduced by Borth et al. in [21].
Despite the rise of deep learning studies, relatively few
studies have attempted to address the emotion prediction of
images using the deep network. With the data set from Borth
et al. [21], Chen et al. [22] classified the adjective-noun pairs
using a CNN and achieved more accurate classification
performance than Borth et al. [21].
Several studies utilized a model pre-trained for image
classification and transferred the learned parameters. By
changing the number of outputs to match the number of labels
in their dataset, the classifier can be trained. For
example, for binary classification with a positive and a
negative label, the number of outputs would be two. Some
researchers [9]–[13] used AlexNet’s structure [1] with pre-trained
weights to train their sentiment prediction frameworks
with different numbers of outputs. Binary classification, which
gives either a positive or a negative label, was considered
in [11], [13], [23], [24]. Peng et al. [10] and You et al. [12]
trained the classifiers for seven and eight classes, respectively.
In most studies, an emotion classification model was created
through transfer learning using a pre-trained model for image
classification. In general, fine-tuned models achieve better emotion
classification performance than previous studies with shallow
networks. In this paper, instead of transfer learning, we use a
feed-forward neural network with low-level and high-level
features. Our method is compared with transfer learning methods
using the same training and test datasets.
Compared to the previous CNN-based emotion recognition
systems, which are limited to outputting only a small number of
emotional states, we propose to use the valence-arousal model
to represent emotions. By using these two parameters, which lie
on a continuous space, we can represent emotions much better
than previous works. To enable the learning of the V-A
values with a deep network, we build a large set of data with various
emotions and obtain the V-A labels through a crowdsourcing
survey.

Fig. 3. Image representation of the Self-Assessment Manikin (SAM) [25]
used for V-A value assignment in the user study.
III. IMAGE EMOTION DATASET
In order to build the dataset with emotion values (valence
and arousal), we collected a large set of images from two
sources. We first searched for emotional images using
the 22 keywords from [26] in Flickr [27]. Those words include
what we call basic emotions (happy, angry, afraid, sad), as well
as less prototypical emotions (gloomy, bored) and affective
states (sleepy, serene).
Using the keywords, we first collected over twenty thousand
images. Since the goal is to collect emotional images, we man-
ually eliminated non-emotional images from the collection.
Each candidate image in the dataset was assessed by three
human subjects, and only the images that were determined to
be emotional by more than two subjects were included in the
database. The words used for the search are listed along with
the number of images in the dataset: afraid (113), alarmed
(14), angry (217), annoyed (179), distressed (150), frustrated
(77), tense (8), aroused (4), delighted (28), excited (734), glad
(315), happy (102), astonished (13), at ease (16), content (656),
satisfied (33), serene (1917), pleased (42), depressed (179),
bored (312), tired (942), gloomy (793). As a result, the total
number of images from this source is 6,844.
Second, we collected more images from [12] which include
eight types of emotion categories (Amusement, Awe, Content-
ment, Excitement, Anger, Disgust, Fear, Sad). We took 3,236
images out of the 23,308 images, selecting a balanced number of
images from the eight classes for inclusion in our database.
As a result, a total of 10,766 images was acquired for our dataset.
Next, we use Amazon Mechanical Turk (AMT) to assign V-A emotion values to
all images in our database. Given an image, a worker rates the
emotion values for each image using the representation of the
V-A scale, Self-Assessment Manikin (SAM) [25] (Figure 3).
In each question, two images are shown to the worker. One
is the image whose emotion value is to be measured, and the
other is the image of the previous question with the emotion value
recorded by the worker.

Fig. 4. The histogram of valence and arousal values from the collected images.

By showing the previous image, the
worker can measure a value for the given image compared to
the value selected in the previous problem. It also alleviates the
difficulty of having to choose absolute numbers. Each image is
presented to the worker in random order and is evaluated only once
by the same worker. To ensure the quality of the answers, we
only allow the workers with 95% approval rate to participate in
our Human Intelligence Task (HIT). We also limit the number
of questions for each worker to 200 to keep them focused
throughout the test.
A total of 1,339 workers were recruited to assign the
emotion values of the 10,766 images in the dataset. The
evaluation time per image varied from 10 seconds to a minute
depending on the worker. Each image is evaluated by at least
five workers, and the average of the acquired values is assigned
as the emotion value of the image. Figure 4 shows the
distribution of the V-A values of the images. In the case of
valence, there are relatively more positive images than negative
images. Intuitively, people tend to share positive images rather
than negative ones, and this
tendency is well represented in this distribution. In the case
of Arousal, various emotion values were obtained except for
the extreme values. Figure 5 shows some of the images in our
database.
We compare the emotion distribution of our collected dataset
with that of the IAPS dataset, an emotion dataset introduced
in [28]. The IAPS consists of 1,182 images, each of which has
valence, arousal, and dominance values. Although the numbers of
images differ, our data are spread more widely than the IAPS
data in the V-A emotion space (Figure 6).
We provide additional analysis to validate our new dataset.
We split the Valence-Arousal space into a 4 × 4 grid: for valence,
most negative (1 to 3), negative (3 to 5), positive (5 to 7), and
most positive (7 to 9); for arousal, most calm (1 to 3), calm
(3 to 5), exciting (5 to 7), and most exciting (7 to 9). We then
place the images in the emotion
the most used words (using the image tags) in each section.
The results are shown in Figure 7. In the negative section
of V-A space, most words have low valence values such as
afraid, angry, annoyed, depressed and gloomy. On the other
hand, the words such as content, serene, excited, and glad are
included in the positive section. In the case of the arousal,
the lowest arousal side (from 1 to 3) consists of the words
such as serene, tired, and gloomy, and the highest arousal side
(from 7 to 9) includes words such as excited, angry, and
distressed. This analysis indicates that the choice of words for
image collection and the user study is appropriate for obtaining
emotion values that are as diverse and well distributed in the
V-A space as possible. The new image emotion dataset can be
used for emotion recognition in two ways: a classifier can be
trained to output either continuous V-A values or discrete
emotion categories using the words in Figure 7.

Fig. 5. Example images in our database. From left to right, the valence value
of the images increases; the images on the left/right have negative/positive
emotions. From bottom to top, the arousal value of the images increases; the
images on the bottom/top have calm/exciting emotions.

Fig. 6. Emotion distribution of our database and the IAPS [28] in the V-A
emotion space.
IV. FEATURES
We extract various types of features for emotion
prediction, including color, local, object, and semantic features.
A. Color features
Color is the most basic and powerful element to express
emotions and can be effectively used by artists to induce emo-
tional effects. Many studies have been conducted to change
the color of an image as a means to change the emotion
of the image [29]–[32]. Color is not an element that can
directly resolve the affective gap, since it can be viewed as
a low-dimensional feature, but color is still a crucial factor in
emotion recognition. We extract the mean values of RGB and
HSV color spaces as basic color characteristics. We also
calculate the HSV histogram and extract the index and value
of the bin with the largest count. Using a concept similar to
a color histogram, we also compute how much of each of
the 11 basic colors exists in the image [33].

Fig. 7. The most used words in each section.

According to [20], saturation and brightness can have a direct
influence on pleasure, arousal, and dominance. Using the saturation and brightness
values, Valdez and Mehrabian [34] introduced formulas to
measure the value of pleasure, arousal, and dominance through
experiments. The formula for computing the values is as
follows:
Pleasure = 0.69Y + 0.22S (1)
Arousal = 0.31Y + 0.60S (2)
Dominance = 0.76Y + 0.32S (3)
where Y and S denote the brightness and saturation values,
respectively. We measure these three values from the image
and use them as features.
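As a rough illustration of this step, the sketch below computes the mean RGB/HSV values, the dominant HSV histogram bin, and the three values of Eqs. (1)-(3), assuming OpenCV and NumPy; treating the mean HSV value channel as the brightness Y is our assumption, the function and parameter names are illustrative, and the 11-basic-color feature of [33] is omitted.

```python
# A minimal sketch of the color features, assuming an 8-bit BGR image
# loaded with OpenCV; `color_features` and `hist_bins` are illustrative names.
import cv2
import numpy as np

def color_features(img_bgr, hist_bins=8):
    img_hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)

    # Mean RGB and HSV values (normalized to [0, 1]).
    mean_rgb = img_bgr[..., ::-1].reshape(-1, 3).mean(axis=0) / 255.0
    mean_hsv = img_hsv.reshape(-1, 3).mean(axis=0) / 255.0

    # HSV histogram; keep the index and value of the largest bin.
    hist = cv2.calcHist([img_hsv], [0, 1, 2], None,
                        [hist_bins] * 3, [0, 180, 0, 256, 0, 256]).flatten()
    hist /= hist.sum()
    peak_idx, peak_val = float(hist.argmax()), float(hist.max())

    # Eqs. (1)-(3), with S = mean saturation and Y = mean brightness
    # (here the HSV value channel, which is an assumption on our part).
    S = img_hsv[..., 1].mean() / 255.0
    Y = img_hsv[..., 2].mean() / 255.0
    pad = [0.69 * Y + 0.22 * S,   # pleasure
           0.31 * Y + 0.60 * S,   # arousal
           0.76 * Y + 0.32 * S]   # dominance

    return np.concatenate([mean_rgb, mean_hsv, [peak_idx, peak_val], pad])
```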
B. Local features
We exploit two kinds of local features used in [21]: a
512-dimensional GIST descriptor that is effective in describing
scenes, and a 59-dimensional local binary pattern (LBP) descriptor
that is effective in describing textures.

Fig. 8. Correlation between image emotion and object emotion. Image
emotion values are taken from the IAPS dataset [28] and object emotion
values are taken from the word emotion dictionary [35].
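For the texture part, a 59-bin uniform LBP histogram can be computed with scikit-image as sketched below; the GIST descriptor comes from an external implementation and is not shown. The parameter choices (P = 8, R = 1) are assumptions.

```python
# Hedged sketch of the 59-D LBP descriptor, assuming a grayscale image
# as a NumPy array; P = 8, R = 1 are assumed parameters.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_descriptor(gray, P=8, R=1):
    # 'nri_uniform' produces P * (P - 1) + 3 = 59 distinct codes for P = 8.
    codes = local_binary_pattern(gray, P, R, method="nri_uniform")
    n_bins = P * (P - 1) + 3
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist / hist.sum()    # normalized 59-D histogram
```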
C. Object features
Identifying the emotion of the image from low-level fea-
tures, such as color statistics and texture-related features, is
difficult for a human subject. Many researchers have stated the need
for high-level features at the affective level and designed
various types of features for emotion prediction. In our
study, we assume that the object is one of the most important
factors contributing to the emotion of an image. We conducted
an experiment to prove that the object in images has relevance
to the emotion elicited from the images. Each image in the
IAPS dataset, an emotion dataset introduced in [28], includes
a tag representing the primary object (e.g., baby, snake, and
shark) and the V-A emotion values. In order to replace the
object tag with the emotion level representation, we adopt the
word emotion dictionary [35] with the valence and the arousal
values for each word. We convert each tag to V-A values
by searching for the tag in the word emotion dictionary. By
measuring the Pearson correlation coefficient, we can easily
understand the relationship between the object and the image
emotion. Figure 8 shows the results of the experiment. As
shown, the correlation between the object emotion and the
image emotion is high, which indicates that the object
affects the emotion of the image. In particular, the correlation
for valence is higher than that for arousal, which means that
valence is more strongly affected by the object than arousal is.
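The correlation test can be reproduced along the lines of the sketch below, assuming the IAPS annotations and the word emotion lexicon [35] are available as CSV files; the file and column names are hypothetical.

```python
# Illustrative sketch of the tag-to-emotion correlation test.
import pandas as pd
from scipy.stats import pearsonr

iaps = pd.read_csv("iaps_annotations.csv")      # columns: tag, valence, arousal
lexicon = pd.read_csv("warriner_lexicon.csv")   # columns: word, valence, arousal

# Replace each object tag with the V-A values of the matching lexicon word.
merged = iaps.merge(lexicon, left_on="tag", right_on="word",
                    suffixes=("_img", "_word"))

r_val, _ = pearsonr(merged["valence_word"], merged["valence_img"])
r_aro, _ = pearsonr(merged["arousal_word"], merged["arousal_img"])
print(f"valence r = {r_val:.2f}, arousal r = {r_aro:.2f}")
```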
Based on this observation, we add object-based features
to our system for predicting emotions. In recent years, many
studies have used CNN models of various structures trained on
the ImageNet dataset [36] for image classification. We use
three of the most popular models (AlexNet, VGG16, and
ResNet) to extract object features from our image datasets
and experiment with the effects of features extracted from
each model on emotion prediction results. Our object feature
is the result of the final output layer, and it represents the
probabilities of 1000 object categories.
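A sketch of this object feature extraction is shown below for VGG16 with TensorFlow/Keras; ResNet can be used the same way via the corresponding Keras application, while AlexNet is not bundled with Keras and would need a separate implementation. This is an approximation of the described pipeline, not the authors' code.

```python
# Hedged sketch: 1000-D object feature from an ImageNet-pretrained VGG16.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

model = VGG16(weights="imagenet")          # final layer: 1000-way softmax

def object_feature(img_path):
    img = tf.keras.preprocessing.image.load_img(img_path, target_size=(224, 224))
    x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis]
    x = preprocess_input(x)
    return model.predict(x)[0]             # probabilities over 1000 ImageNet classes
```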
D. Semantic features
As a high-level feature with a concept similar to the object, we
consider additional semantic information that describes the
background of the image. What the object is in the image is
important, but what the background is made up of is equally
important. For example, if the main object of an image is a
person, the emotion may differ depending on whether
the background is a city with many buildings or nature
such as a mountain or the sea. The proportion of sky, sea, or
buildings in the background can also affect the emotion. To
use the semantic information of the background as a feature,
we perform scene parsing on all images. Wu et al. [37]
proposed a semantic segmentation method based on a deep
network, which classifies each pixel of an image into one of
150 semantic categories. Given a semantic map, we find out
which of the 150 semantic categories each pixel in the image
belongs to. As a result, a 150-dimensional vector is obtained
and used as one of the input features.
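Assuming the scene-parsing network of [37] yields a per-pixel label map over the 150 categories, one plausible way to turn it into the 150-dimensional feature is a normalized pixel-count histogram, as in the sketch below (our reading of the description, which also matches the mention of background ratios).

```python
# Hedged sketch of the 150-D semantic feature from a per-pixel label map.
import numpy as np

def semantic_feature(label_map, num_classes=150):
    # label_map: H x W array of integer class indices in [0, num_classes).
    hist = np.bincount(label_map.ravel(), minlength=num_classes).astype(float)
    return hist / hist.sum()               # fraction of pixels per category
```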
These object and semantic-based high level features are
combined with low level features such as color and local
features to learn our networks for emotion prediction.
V. LEARNING EMOTION MODEL
In this section, we introduce the details of our emotion
prediction framework. The overall architecture of our frame-
work is shown in Figure 9. Our model is a fully connected
feedforward neural network. In general, a neural network
consists of an input layer, an output layer, and one or more
hidden layers. Normally, when there are two or more hidden
layers, the network is called a deep network. Each layer is made
up of multiple neurons, and the edges that connect neurons
in adjacent layers have weights. These weights, together with the
biases of the neurons (except those in the input layer), are
learned during the training phase.
Our network F(X, θ), consisting of an input layer, three hidden
layers, and an output layer, is defined as follows:

F(X, θ) = f^4 ∘ g^3 ∘ f^3 ∘ g^2 ∘ f^2 ∘ g^1 ∘ f^1(X),    (4)

where X is the input feature vector, θ is the set of parameters
including the weights w and biases b, and f^4 produces the final
output of our neural network (the valence or arousal value).
Specifically, given an input vector X^l = [x^l_1, ..., x^l_n]^T in layer
l, the preactivation value p^{l+1}_j for neuron j of layer l+1 is
obtained through the function f^{l+1}:

p^{l+1}_j = f^{l+1}(x^l) = Σ_{i=1}^{n} w^{l+1}_{ij} x^l_i + b^{l+1}_j,    (5)
where w^{l+1}_{ij} is the connection weight from x^l_i in layer l
to neuron j in layer l+1, b^{l+1}_j is the bias of neuron j in layer
l+1, and n is the number of neurons in layer l. Then, the
output value x^{l+1}_j of layer l+1 (which also serves as an input
to the next layer) is obtained through the function g^{l+1}, a
nonlinear activation function in layer l+1:

x^{l+1}_j = g^{l+1}(p^{l+1}_j).    (6)
Note that the rectified linear unit (ReLU) [1], max(0, x), is used as
the nonlinear activation function throughout the entire network.
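As a toy rendering of Eqs. (5) and (6), a single layer's forward computation in NumPy looks like the following, with W holding the weights w^{l+1}_{ij} arranged as an (output × input) matrix; this is only an illustration of the formulas, not the training code.

```python
# One fully connected layer followed by ReLU, mirroring Eqs. (5) and (6).
import numpy as np

def layer_forward(x, W, b):
    p = W @ x + b                # Eq. (5): p_j = sum_i w_ij * x_i + b_j
    return np.maximum(0.0, p)    # Eq. (6) with g = ReLU
```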
The number of neurons in each layer is given in Table I. We set
the loss function of our network as

L = Σ_{k=1}^{K} (f^4_k − T_k)^2,

where f^4_k is the output value predicted by our model for a given
image, T_k is the ground truth emotion value, and K is the number
of images.
Fig. 9. The overall architecture of the proposed emotion prediction model.
TABLE I
THE NUMBER OF NEURONS IN EACH LAYER.

                  Input    H1     H2     H3    Output
Num of neurons    1588     3000   1000   1000    1
In the training phase, the network weights θ are updated by
backpropagating the gradients through all layers. By minimizing
the loss function, we optimize the weights of our network.
We set the learning rate to 0.0001, and the network is trained
using the stochastic gradient descent (SGD) optimizer with a
momentum of 0.9. We set the batch size to 1,000 and train our
model until the error no longer diminishes. All experiments are
implemented using the open source deep learning framework
TensorFlow [38].
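A minimal Keras sketch of this training setup is given below, with the layer sizes of Table I, a single linear output (one network per emotion dimension), squared-error loss, and SGD with momentum 0.9; it is our reconstruction from the text, not the released implementation, and the epoch count is a placeholder.

```python
# Sketch of the emotion FFNN described above (Table I sizes assumed).
import tensorflow as tf

def build_ffnn(input_dim=1588):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(3000, activation="relu"),
        tf.keras.layers.Dense(1000, activation="relu"),
        tf.keras.layers.Dense(1000, activation="relu"),
        tf.keras.layers.Dense(1),          # predicted valence or arousal value
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9),
        loss="mse")
    return model

# model = build_ffnn()
# model.fit(X_train, y_train, batch_size=1000, epochs=200)   # epochs: placeholder
```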
VI. EXPERIMENT
A. Model performance
We first evaluate the performance of our model. The entire
dataset is divided into five groups; four groups are used for
training and the remaining group for testing. Each group is used
as the test set once. In each fold, the number of training images
is 8,600 and the number of test images is 2,166. All folds use
the same network structure.
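The 5-fold protocol could be reproduced roughly as follows, assuming the extracted features X and emotion labels y are NumPy arrays and reusing the build_ffnn sketch from Section V; the epoch count is again a placeholder.

```python
# Hypothetical 5-fold evaluation loop (X, y assumed to be prepared arrays).
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = build_ffnn(input_dim=X.shape[1])          # sketch from Section V
    model.fit(X[train_idx], y[train_idx], batch_size=1000, epochs=200, verbose=0)
    pred = model.predict(X[test_idx]).ravel()
    fold_errors.append(np.mean((pred - y[test_idx]) ** 2))   # per-fold MSE
print("mean test MSE:", np.mean(fold_errors))
```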
The results are presented in Table II. The number in the
first row represents the group number. Column g1 shows the
training and test error when using groups 2 to 5 as training
data and group 1 as test data. The dimension of the input feature is
3,088 as described in Table I, and the output is a valence or arousal
value. To compare the performance of the object features
obtained from the three CNNs, we build various models by
combining the object features with the other features and compare
their performance. Note that ‘A’, ‘V’, and ‘R’ in the
third column represent AlexNet, VGG16, and ResNet,
respectively. Each entry is the mean squared error between the
ground truth emotion value and the value predicted by each model.
TABLE II
RESULTS OF 5-FOLD VALIDATION EXPERIMENTS

                        g1     g2     g3     g4     g5     Avg.
Valence  Training   A   1.37   1.37   1.35   1.37   1.38   1.37
                    V   1.31   1.31   1.30   1.30   1.33   1.31
                    R   1.29   1.30   1.28   1.31   1.30   1.30
                    O   1.47   1.47   1.46   1.48   1.48   1.47
         Test       A   1.72   1.68   1.67   1.69   1.61   1.67
                    V   1.68   1.66   1.63   1.63   1.59   1.64
                    R   1.70   1.66   1.65   1.62   1.60   1.65
                    O   1.80   1.81   1.73   1.73   1.69   1.75
Arousal  Training   A   1.22   1.24   1.20   1.21   1.23   1.22
                    V   1.16   1.21   1.17   1.16   1.16   1.17
                    R   1.18   1.22   1.18   1.18   1.18   1.19
                    O   1.26   1.30   1.25   1.27   1.29   1.27
         Test       A   1.50   1.52   1.50   1.48   1.44   1.49
                    V   1.48   1.49   1.46   1.48   1.44   1.47
                    R   1.49   1.49   1.47   1.47   1.45   1.48
                    O   1.56   1.53   1.54   1.50   1.48   1.52

We also extracted category-level features from [39], rather than
from a CNN-based model, and included them as input features for
emotion prediction. Borth et al. [21] also used this feature to
predict the sentiment of images. Note that the dimension of this
feature is 2,000, so the number of input feature neurons in our
model is changed to 2,588. The rows labeled ‘O’ show the
corresponding prediction results. The results show that the
features from VGG16 achieved the best performance for both the
Valence and Arousal models (valence: 1.64, arousal: 1.47).
Figures 10 to 14 show a qualitative analysis with the emotion
values and accuracy predicted by our model. In Figure 10, the
images are placed so that the predicted values match the emotion
values in the V-A space. Figures 11 and 12 show the results of
the valence model, and Figures 13 and 14 show the results of
the arousal model.
B. Feature performance
We also investigate the effect of each of the proposed features.
First, we combine the color features and the local features into a
single low-level feature. To build a network model for each
individual feature, we use the structure of our model and change
the number of neurons. Our proposed model consists of three hidden
layers with 3000, 1000, and 500 neurons. In the per-feature models,
the number of neurons in each layer follows the ratio between
adjacent layers of our model (Table III, bottom). As a
result, the object feature showed the best performance among the
individual features. For valence, the object feature extracted from
VGG16 gave the lowest error (MSE: 1.92). For arousal, the object
feature extracted from AlexNet gave the lowest error (MSE: 1.61).
The semantic feature also showed lower error than the low-level
features. Our model combining all the features achieved the best
prediction performance, which we attribute to the synergy between
features extracted from various sources and the expressive power
of the deep neural network.
TABLE III
PERFORMANCE OF EACH FEATURE

                          Object
                Low   AlexNet  VGG16  ResNet  Semantic    All
Valence        1.98     1.93    1.92    1.98     1.97    1.64
Arousal        1.66     1.61    1.62    1.68     1.64    1.47
input           438     1000    1000    1000      150    1588
h1              900     2000    2000    2000      300    3000
h2              300      700     700     700      100    1000
h3              150      350     350     350       50     500
output            1        1       1       1        1       1
C. Comparison with CNN
Some studies have trained emotion classification models
using weights pre-trained for image classification [9]–[13].
We compare our emotion prediction model with CNN-based
emotion prediction models generated by transfer learning.
Two CNN structures are used for comparison: AlexNet and
VGG19. We first initialize the weights of AlexNet and VGG19
to the weights learned for image classification. Except for
the final output layer, the convolutional layers and fully
connected layers keep the existing model structure. Since the
output layer of an ImageNet-based CNN has 1,000 neurons, we
change the number of output neurons to one for our purpose
(a valence or arousal value).
In addition, transfer learning can be configured in several
ways. AlexNet has five convolutional layers and three fully
connected layers, and VGG19 has 16 convolutional layers
and three fully connected layers. In transfer learning, we can
choose which layers to freeze and which to train.
We experiment with two conditions: in the first, the
convolutional layers are frozen and only the fully connected
layers are trained (conv-frozen); in the second, all layers are
trained together (conv-train). The training setup for the CNN-based
models is similar to that of our FFNN model
except for the batch size and learning rate. The learning
rate of both the conv-frozen and conv-train networks is
0.5 × 10⁻⁴, and the batch size of both CNN models is 50.
All models, including ours, are trained and tested on the
same dataset splits.
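The two transfer-learning baselines can be sketched in Keras as below for VGG19 (AlexNet is not shipped with Keras and is omitted); replacing the 1000-way softmax with a single output and optionally freezing the convolutional blocks mirrors the conv-frozen / conv-train conditions. The SGD optimizer is assumed to match the FFNN setup, and the code is illustrative rather than the authors' implementation.

```python
# Hedged sketch of the CNN transfer-learning baselines (VGG19 only).
import tensorflow as tf
from tensorflow.keras.applications import VGG19

def build_vgg19_regressor(conv_frozen=True):
    base = VGG19(weights="imagenet")                      # includes the original FC layers
    x = base.layers[-2].output                            # 4096-D 'fc2' activations
    out = tf.keras.layers.Dense(1, name="va_output")(x)   # 1000-way softmax -> 1 value
    model = tf.keras.Model(base.input, out)
    if conv_frozen:
        for layer in model.layers:
            if layer.name.startswith("block"):            # convolution / pooling blocks
                layer.trainable = False
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.5e-4,
                                                    momentum=0.9),
                  loss="mse")
    return model
```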
The results are shown in Table IV. The second column shows the
training range: conv-frozen means that only the fully connected
layers have been trained, and conv-train means that all layers have
been trained.

TABLE IV
COMPARISON WITH OTHER LEARNING METHODS

Emotion   Train range    AlexNet   VGG19   Linear   SVR    Ours
Valence   conv-frozen     2.76     2.64     2.42    1.75   1.64
          conv-train      2.64     2.60
Arousal   conv-frozen     1.95     1.91     2.65    1.54   1.47
          conv-train      1.89     1.87

From the training range
perspective, the error when transfer learning is applied to the
entire network is smaller than when only the fully connected
layers are retrained. This result implies that the low-level
filters, as well as the high-level filters, must be learned to
achieve a better emotion prediction model; in other words, the
convolutional layers and the fully connected layers must be
trained together. On the model side, we can see that the model
based on the VGG19 structure performs better than the one based
on AlexNet. However, both models show considerably higher error
than our proposed emotion-based FFNN.
CNN-based models learn well and predict accurately when images
of the same class share a similar appearance, as in object
detection or image classification. However, as mentioned earlier,
images with different appearances can have the same emotion, and
images with similar appearances can have different emotions. We
also conducted experiments using other machine learning methods
on the same dataset: linear regression and support vector
regression (SVR). Both perform better than the CNN-based models,
but our model still shows the best performance (see Table IV).
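The shallow baselines of Table IV could look like the following with scikit-learn, assuming the same feature matrices; the kernel and regularization settings are guesses, since the paper does not report them.

```python
# Sketch of the linear regression and SVR baselines (X_*, y_* assumed given).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

lin = LinearRegression().fit(X_train, y_train)
svr = SVR(kernel="rbf", C=1.0).fit(X_train, y_train)   # settings are guesses

for name, reg in [("linear", lin), ("SVR", svr)]:
    mse = np.mean((reg.predict(X_test) - y_test) ** 2)
    print(f"{name} test MSE: {mse:.2f}")
```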
VII. CONCLUSION
In this paper, we presented a new emotion recognition
system with a deep learning framework. To reduce the affec-
tive gap, we designed and extracted objects and background
semantic features as high-level features, and showed that
these features are effective for emotion prediction. Both high-
level features and low-level features complement each other
well, which leads to better emotion recognition performance.
As expected, the accuracy of the object recognition has an
impact on the performance of the emotion prediction. The
object features with incorrect recognition may lead to incorrect
emotion prediction results. There is also a problem when the
main object of the image is not included in the existing 1000
classes. However, with the rapid progress in deep learning
with large datasets, both the recognition accuracy and the number
of classes will increase, which in turn will also benefit our
emotion recognition system.
As an interesting future work, one can consider the presence
of a person in the image and the facial expression. A facial
expression is one of the features that can greatly affect
emotion prediction. Even if the overall mood of the image
is dark, a smiling face can slightly mitigate the negative emotion
(Figure 15). In addition, when the face occupies most of
the photograph, the facial expression and the emotion are
directly connected. We will consider enhancing emotion recognition
performance by adding a facial expression recognition framework.

Fig. 10. Prediction results. The values below each image show the prediction
results of the FFNN and the ground truth emotion values, respectively. The
prediction accuracy (V/A) is also shown.
Several studies have demonstrated that biometric data has a
positive effect on emotion recognition [40], [41]. We can also
consider using biometric data or an observer’s facial feature
as additional features. However, deep networks generally
require thousands or tens of thousands of samples, and collecting
such biometric and facial data is not an easy task. We will try to
find a method that improves the performance of the model using
only a small amount of biometric data, which is left as
another future work.
We also built a database for the emotion estimation with
the V-A model and will continue to collect more data. We
expect our dataset will be widely used in the field of affective
computing.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural infor-
mation processing systems, 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 1–9.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” arXiv preprint arXiv:1512.03385, 2015.
[5] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587.
[7] Z. Cheng, Q. Yang, and B. Sheng, “Deep colorization,” in Proceedings
of the IEEE International Conference on Computer Vision, 2015, pp.
415–423.
[8] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolu-
tional network for image super-resolution,” in European Conference on
Computer Vision. Springer, 2014, pp. 184–199.
[9] V. Campos, A. Salvador, X. Giro-i Nieto, and B. Jou, “Diving deep
into sentiment: Understanding fine-tuned cnns for visual sentiment
prediction,” in Proceedings of the 1st International Workshop on Affect
& Sentiment in Multimedia. ACM, 2015, pp. 57–62.
[10] K.-C. Peng, T. Chen, A. Sadovnik, and A. C. Gallagher, “A mixed
bag of emotions: Model, predict, and transfer emotion distributions,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 860–868.
[11] Q. You, J. Luo, H. Jin, and J. Yang, “Robust image sentiment analysis
using progressively trained and domain transferred deep networks,” in
Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Fig. 11. Valence prediction results of the images with low arousal. The values
below each image represent the predicted valence value and the ground truth
value (prediction/ground truth). The prediction accuracy is also shown. From
the top left to the bottom right, the valence value increases. All images in
this example have low arousal values.
Fig. 12. Valence prediction results of the images with high arousal. The values
below each image represent the predicted valence value and the ground truth
value (prediction/ground truth). The prediction accuracy is also shown. From
the top left to the bottom right, the valence value increases. All images in
this example have high arousal values.
Fig. 13. Arousal prediction results of the images with low valence. The values
below each image represent the predicted arousal value and the ground truth
value (prediction/ground truth). The prediction accuracy is also shown. From
the top left to the bottom right, the arousal value increases. All images in
this example have low valence values.
Fig. 14. Arousal prediction results of the images with high valence. The values
below each image represent the predicted arousal value and the ground truth
value (prediction/ground truth). The prediction accuracy is also shown. From
the top left to the bottom right, the arousal value increases. All images in
this example have high valence values.
Fig. 15. Different emotions with different facial expressions.
[12] Q. You, J. Luo, H. Jin, and J. Yang, “Building a large scale dataset for image emotion recognition:
The fine print and the benchmark,” 2016.
[13] C. Xu, S. Cetintas, K.-C. Lee, and L.-J. Li, “Visual sentiment
prediction with deep convolutional neural networks,” arXiv preprint
arXiv:1411.5731, 2014.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, 1998.
[15] A. Hanjalic, “Extracting moods from pictures and sounds: Towards truly
personalized tv,” IEEE Signal Processing Magazine, vol. 23, no. 2, pp.
90–100, 2006.
[16] X. Lu, P. Suryanarayan, R. B. Adams Jr, J. Li, M. G. Newman, and J. Z.
Wang, “On shape and the computability of emotions,” in Proceedings
of the 20th ACM international conference on Multimedia. ACM, 2012,
pp. 229–238.
[17] J. Machajdik and A. Hanbury, “Affective image classification using
features inspired by psychology and art theory,” in Proceedings of the
18th ACM international conference on Multimedia. ACM, 2010, pp.
83–92.
[18] S. Zhao, Y. Gao, X. Jiang, H. Yao, T.-S. Chua, and X. Sun, “Exploring
principles-of-art features for image emotion recognition,” in Proceedings
of the 22nd ACM international conference on Multimedia. ACM, 2014,
pp. 47–56.
[19] S. Haykin, Neural Networks: A Comprehensive Foundation. Prentice
Hall, 1999. [Online]. Available: https://books.google.co.kr/books?id=
3-1HPwAACAAJ
[20] C. E. Osgood, “The nature and measurement of meaning.” Psychological
bulletin, vol. 49, no. 3, p. 197, 1952.
[21] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, “Large-scale
visual sentiment ontology and detectors using adjective noun pairs,” in
Proceedings of the 21st ACM international conference on Multimedia.
ACM, 2013, pp. 223–232.
[22] T. Chen, D. Borth, T. Darrell, and S.-F. Chang, “Deepsentibank: Vi-
sual sentiment concept classification with deep convolutional neural
networks,” arXiv preprint arXiv:1410.8586, 2014.
[23] V. Campos, B. Jou, and X. Giro-i Nieto, “From pixels to sentiment:
Fine-tuning cnns for visual sentiment prediction,” Image and Vision
Computing, 2017.
[24] J. Islam and Y. Zhang, “Visual sentiment analysis for social images
using transfer learning approach,” in Big Data and Cloud Computing
(BDCloud), Social Computing and Networking (SocialCom), Sustainable
Computing and Communications (SustainCom)(BDCloud-SocialCom-
SustainCom), 2016 IEEE International Conferences on. IEEE, 2016,
pp. 124–130.
[25] M. M. Bradley and P. J. Lang, “Measuring emotion: the self-assessment
manikin and the semantic differential,” Journal of behavior therapy and
experimental psychiatry, vol. 25, no. 1, pp. 49–59, 1994.
[26] J. A. Russell, “A circumplex model of affect,” 1980.
[27] Flickr, “http://www.flickr.com.”
[28] P. J. Lang, M. M. Bradley, and B. N. Cuthbert, “International affective
picture system (iaps): Affective ratings of pictures and instruction
manual,” Technical report A-8, 2008.
[29] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley, “Color transfer
between images,” IEEE Computer graphics and applications, no. 5, pp.
34–41, 2001.
[30] G. Csurka, S. Skaff, L. Marchesotti, and C. Saunders, “Learning moods
and emotions from color combinations,” in Proceedings of the Seventh
Indian Conference on Computer Vision, Graphics and Image Processing.
ACM, 2010, pp. 298–305.
[31] L. He, H. Qi, and R. Zaretzki, “Image color transfer to evoke differ-
ent emotions based on color combinations,” Signal, Image and Video
Processing, pp. 1–9, 2014.
[32] H.-R. Kim, H. Kang, and I.-K. Lee, “Image recoloring with valence-
arousal emotion model,” in Computer Graphics Forum, vol. 35, no. 7.
Wiley Online Library, 2016, pp. 209–216.
[33] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus, “Learning
color names for real-world applications,” IEEE Transactions on Image
Processing, vol. 18, no. 7, pp. 1512–1523, 2009.
[34] P. Valdez and A. Mehrabian, “Effects of color on emotions.” Journal of
experimental psychology: General, vol. 123, no. 4, p. 394, 1994.
[35] A. B. Warriner, V. Kuperman, and M. Brysbaert, “Norms of valence,
arousal, and dominance for 13,915 english lemmas,” Behavior research
methods, vol. 45, no. 4, pp. 1191–1207, 2013.
[36] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:
A large-scale hierarchical image database,” in Computer Vision and
Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE,
2009, pp. 248–255.
[37] Z. Wu, C. Shen, and A. v. d. Hengel, “Wider or deeper: Revisiting the
resnet model for visual recognition,” arXiv preprint arXiv:1611.10080,
2016.
[38] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale
machine learning on heterogeneous distributed systems,” arXiv preprint
arXiv:1603.04467, 2016.
[39] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, and S.-F. Chang, “Design-
ing category-level attributes for discriminative visual recognition,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2013, pp. 771–778.
[40] L. Aftanas, N. Reva, A. Varlamov, S. Pavlov, and V. Makhnev, “Analysis
of evoked eeg synchronization and desynchronization in conditions of
emotional activation in humans: temporal and topographic characteris-
tics,” Neuroscience and behavioral physiology, vol. 34, no. 8, pp. 859–
867, 2004.
[41] G. Chanel, J. Kronegg, D. Grandjean, and T. Pun, “Emotion assessment:
Arousal evaluation using EEG’s and peripheral physiological signals,”
Multimedia content representation, classification and security, pp. 530–
537, 2006.

To narrow this affective gap, several works proposed emotion classification systems based on high-level features grounded in psychology and art theory, such as harmony, movement, and the rule of thirds [16]–[18]. While those features help to improve the emotion recognition, a better set of features is still necessary, and distinguishing the effective features among the many candidates is also important.

Similar to the example above, images with objects such as guns or sharks arouse scary feelings, while images with babies or flowers lead to more happiness. It can be speculated that certain objects affect the determination of emotions. Based on this observation, we assume that the main object appearing in the image plays an important role in determining the emotion; the idea is that object categories can be good cues for emotions.
Fig. 2. Two dimensional emotion models (Valence and Arousal).

Through experiments, we show that objects appearing in images are related to emotion values, and we use objects as one of the features of our model. Moreover, even if images include the same object, the emotion can vary depending on the background, so we also use the semantic information of the background as a feature to improve the prediction performance.

Predicting emotions from images is a complex task that is quite different from object detection or image classification. In this paper, we combine high-level features, such as the object and the background information extracted from deep network models pre-trained for image classification and segmentation, with low-level features such as color statistics to obtain a set of features for emotion prediction. Using this feature set, we design a feed-forward deep neural network (FFNN) [19] which produces the emotion value for a given image.

In previous works on emotion prediction, the emotions are categorized into a small number of classes, such as happy, awe, sad, and fear. In comparison, we opt for the dimensional model for expressing emotions [20], which is widely used in the field of psychology. Specifically, the dimensional model consists of two parameters, Valence and Arousal (Figure 2). Valence represents the pleasure on a scale from 1 (negative) to 9 (positive). Arousal is the level of excitement, which also ranges from 1 (calm) to 9 (excited). Using this model, any emotion can be represented by these two values in the 2D space. As the dimensional model can express more emotions than a discrete set of emotion categories, we train our emotion prediction system based on the V-A model.

The contributions of our work are summarized as follows:
• We propose an image emotion recognition system which outputs valence-arousal values for expressing emotion. As far as we know, this is the first system to build a deep learning model for the dimensional emotion model (V-A model).
• We build a new image emotion dataset through crowdsourcing. A feed-forward neural network is used to learn emotion features from this database.
• We propose a novel idea of using a pre-trained CNN to relate the main object and background of the image to the emotional features.
• We show the effectiveness of an object for estimating the emotion of an image using correlations between emotional values of words and images.

II. RELATED WORK

The emotion of an image can be evoked by various factors. To figure out significant features for the emotion prediction problem, many researchers have considered various types of features, from color statistics to art and psychological features. Machajdik et al. [17] introduced an affective image classification system using psychology and art theory based features such as Itten's color contrast and the rule of thirds. Zhao et al. [18] proposed to extract principles-of-art features for emotion classification. Similarly, Lu et al. [16] computed shape based features in natural images. As a high level concept for representing the sentiment of an image, adjective-noun pairs were introduced by Borth et al. in [21].

Despite the rise of deep learning studies, relatively few studies have attempted to address the emotion prediction of images using deep networks.
With the dataset from Borth et al. [21], Chen et al. [22] classified the adjective-noun pairs using a CNN and achieved more accurate classification than Borth et al. [21].

Several studies utilized a model pre-trained for image classification and transferred the learned parameters. By changing the number of outputs to match the number of labels in their dataset, the classifier can be trained; for example, a binary classification with a positive and a negative label would use two outputs. Some researchers [9]–[13] used AlexNet's structure [1] with pre-trained weights to train their sentiment prediction frameworks with different numbers of outputs. Binary classification, which gives either a positive or a negative label, was considered in [11], [13], [23], [24]. Peng et al. [10] and You et al. [12] trained classifiers for seven and eight classes, respectively. In most studies, an emotion classification model was created through transfer learning from a model pre-trained for image classification, and such fine-tuned models generally record better emotion classification performance than previous studies with shallow networks.

In this paper, we use a feed-forward neural network with low-level and high-level features rather than transfer learning. Our method is compared with transfer learning using the same training and test datasets. Compared to the previous CNN based emotion recognition systems, which are limited to a small number of emotional states as output, we propose to use the valence-arousal model to represent emotions.
By using these two parameters, which lie on a continuous space, we can represent emotions much better than the previous works. To enable the learning of the V-A values, we build a large set of images with various emotions and obtain the V-A labels through a crowdsourcing survey.

Fig. 3. Image representation of the Self-Assessment Manikin (SAM) [25] used for V-A value assignment in the user study.

III. IMAGE EMOTION DATASET

In order to build the dataset with emotion values (valence and arousal), we collected a large set of images from two resources. We first searched for emotional images using the 22 keywords from [26] in Flickr [27]. Those words include what we call basic emotions (happy, angry, afraid, sad), as well as less prototypical emotions (gloomy, bored) and affective states (sleepy, serene). Using the keywords, we first collected over twenty thousand images.

Since the goal is to collect emotional images, we manually eliminated non-emotional images from the collection. Each candidate image was assessed by three human subjects, and only the images judged emotional by at least two subjects were included in the database. The words used for the search are listed along with the number of images in the dataset: afraid (113), alarmed (14), angry (217), annoyed (179), distressed (150), frustrated (77), tense (8), aroused (4), delighted (28), excited (734), glad (315), happy (102), astonished (13), at ease (16), content (656), satisfied (33), serene (1917), pleased (42), depressed (179), bored (312), tired (942), gloomy (793). As a result, the total number of images is 6,844.

Second, we collected more images from [12], which includes eight emotion categories (Amusement, Awe, Contentment, Excitement, Anger, Disgust, Fear, Sad). We took 3,236 images out of 23,308 images, selecting an even number of images from the eight classes. As a result, a total of 10,766 images is included in our dataset.

Next, we use Amazon Mechanical Turk (AMT) to assign V-A emotion values to all images in our database. Given an image, a worker rates its emotion values using the Self-Assessment Manikin (SAM) [25] representation of the V-A scale (Figure 3). In each question, two images are shown to the worker: the image whose emotion value is to be measured, and the image from the previous question with the emotion value recorded by the worker.

Fig. 4. The histogram of valence and arousal values of the collected images.

By showing the previous image, the worker can rate the given image relative to the value selected in the previous question, which also alleviates the difficulty of having to choose absolute numbers. Each image is presented to the worker in random order and is evaluated only once by the same worker. To ensure the quality of the answers, we only allow workers with a 95% approval rate to participate in our Human Intelligence Task (HIT). We also limit the number of questions per worker to 200 to keep them focused throughout the test. A total of 1,339 workers were recruited to assign the emotion values of the 10,766 images in the dataset. The evaluation time per image varied from 10 seconds to a minute depending on the worker. Each image is evaluated by at least five workers, and the average of the acquired values is assigned as the emotion value of the image. Figure 4 shows the total distribution of the V-A values of the images.
In the case of Valence, there are relatively more positive images than negative images. As common sense suggests, people usually share positive images rather than negative ones, and this tendency is well represented in the distribution. In the case of Arousal, various emotion values were obtained except for the extreme values. Figure 5 shows some of the images in our database.

We compare the emotion distribution of our dataset with that of the IAPS dataset, an emotion dataset introduced in [28]. The IAPS consists of 1,182 images, each of which has valence, arousal, and dominance values. Although the numbers of images differ, our data is spread more widely than the IAPS data in the V-A emotion space (Figure 6).

We provide an additional analysis to validate the new dataset. We split the Valence-Arousal space into a 4-by-4 grid of sections: most negative (from 1 to 3), negative (from 3 to 5), positive (from 5 to 7), and most positive (from 7 to 9) along the valence axis, and most calm (from 1 to 3), calm (from 3 to 5), exciting (from 5 to 7), and most exciting (from 7 to 9) along the arousal axis. We then place the images in the emotion space according to the obtained emotion values and extract the most used words (from the image tags) in each section. The results are shown in Figure 7. In the negative sections of the V-A space, most words have low valence values, such as afraid, angry, annoyed, depressed, and gloomy. On the other hand, words such as content, serene, excited, and glad fall in the positive sections. In the case of arousal, the lowest arousal side (from 1 to 3) consists of words such as serene, tired, and gloomy, and the highest arousal side (from 7 to 9) includes words such as excited, angry, and distressed. This analysis indicates that the choice of words for image collection and for the user study is appropriate for obtaining emotion values that are as diverse and as well distributed in the space as possible.
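For readers who want to reproduce this grid analysis, a minimal sketch is given below. The `annotations` list of (valence, arousal, tag) tuples is a hypothetical stand-in for the crowdsourced labels, and the bin edges follow the section boundaries stated above.

```python
# Sketch of the 4-by-4 Valence-Arousal grid analysis described above.
from collections import Counter
import numpy as np

def grid_tag_counts(annotations, edges=(1, 3, 5, 7, 9)):
    """Return a 4x4 grid of Counters with the most used tags per V-A section."""
    grid = [[Counter() for _ in range(4)] for _ in range(4)]
    for valence, arousal, tag in annotations:
        # np.digitize maps a value to its 1-based bin index; clip to 1..4
        v_bin = min(max(np.digitize(valence, edges), 1), 4) - 1
        a_bin = min(max(np.digitize(arousal, edges), 1), 4) - 1
        grid[a_bin][v_bin][tag] += 1
    return grid

# Hypothetical annotations; most common tags in the most-positive / most-exciting corner
annotations = [(7.8, 7.2, "excited"), (2.1, 7.5, "angry"), (7.9, 7.9, "excited")]
grid = grid_tag_counts(annotations)
print(grid[3][3].most_common(3))
```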
Fig. 5. Example images in our database. From left to right, the valence value of the image increases; the images on the left/right have negative/positive emotion. From bottom to top, the arousal value of the image increases; the images on the bottom/top have calm/exciting emotion.

Fig. 6. Emotion distribution of our database and the IAPS [28] in the V-A emotion space.

Fig. 7. The most used words in each section.

The new image emotion dataset can be used for emotion recognition in two ways: a classifier can be trained to output either the continuous V-A values or discrete emotion categories using the words in Figure 7.

IV. FEATURES

We extract various types of features for emotion prediction, including color, local, object, and semantic features.

A. Color features

Color is the most basic and powerful element for expressing emotions and can be used effectively by artists to induce emotional effects. Many studies have been conducted on changing the color of an image as a means of changing its emotion [29]–[32]. Color alone cannot resolve the affective gap, since it is a low-dimensional feature, but it is still a crucial factor in emotion recognition. We extract the mean values of the RGB and HSV color spaces as basic color characteristics. We also compute the HSV histogram and extract the label and value of the bin with the largest count. With a concept similar to a color histogram, we also compute how much of each of the 11 basic colors exists in the image [33].

According to [20], saturation and brightness can have a direct impact on pleasure, arousal, and dominance. Using the saturation (S) and brightness (Y) values, Valdez and Mehrabian [34] introduced formulas to measure pleasure, arousal, and dominance through experiments:

Pleasure = 0.69Y + 0.22S    (1)
Arousal = 0.31Y + 0.60S    (2)
Dominance = 0.76Y + 0.32S    (3)

We measure these three values from the image and use them as features.
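A minimal sketch of the color feature extraction described above is given below. The use of OpenCV, the number of histogram bins, and the normalization of brightness Y and saturation S to [0, 1] are assumptions; the paper does not specify these details.

```python
# Sketch: mean RGB/HSV, dominant HSV-histogram bin, and the
# Valdez-Mehrabian pleasure/arousal/dominance values of Eqs. (1)-(3).
import cv2
import numpy as np

def color_features(path, hist_bins=16):
    bgr = cv2.imread(path)                      # OpenCV loads images as BGR
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)

    mean_rgb = bgr[..., ::-1].reshape(-1, 3).mean(axis=0)   # reorder to RGB
    mean_hsv = hsv.reshape(-1, 3).mean(axis=0)

    # Label and value of the largest bin of the hue histogram
    hue_hist = cv2.calcHist([hsv], [0], None, [hist_bins], [0, 180]).ravel()
    peak_bin, peak_val = int(hue_hist.argmax()), float(hue_hist.max())

    # Eqs. (1)-(3): Y = brightness (V channel), S = saturation, both scaled to [0, 1]
    y = hsv[..., 2].mean() / 255.0
    s = hsv[..., 1].mean() / 255.0
    pleasure  = 0.69 * y + 0.22 * s
    arousal   = 0.31 * y + 0.60 * s
    dominance = 0.76 * y + 0.32 * s

    return np.concatenate([mean_rgb, mean_hsv, [peak_bin, peak_val],
                           [pleasure, arousal, dominance]])
```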
B. Local features

We exploit two kinds of local features used in [21]: a 512-dimensional GIST descriptor, which is effective for describing scenes, and a 59-dimensional local binary pattern (LBP) descriptor, which is effective for describing textures.

Fig. 8. Correlation between image emotion and object emotion. Image emotion values are taken from the IAPS dataset [28] and object emotion values are taken from the word emotion dictionary [35].

C. Object features

Identifying the emotion of an image from low-level features, such as color statistics and texture-related features, is difficult even for a human subject. Many researchers have stated the need for features at a higher, affective level and have designed various such features for emotion prediction. In our study, we assume that the object is one of the most important factors contributing to the emotion of an image, and we conducted an experiment to verify that the object in an image is related to the emotion elicited by the image. Each image in the IAPS dataset, an emotion dataset introduced in [28], includes a tag representing the primary object (e.g., baby, snake, shark) and the V-A emotion values. To map each object tag to the emotion space, we adopt the word emotion dictionary [35], which provides valence and arousal values for each word, and convert each tag to V-A values by looking it up in the dictionary. By measuring the Pearson correlation coefficient, we can quantify the relationship between the object and the image emotion. Figure 8 shows the results: the correlation between the object emotion and the image emotion is significantly high, which means the object affects the emotion of the image. In particular, the correlation for valence is higher than for arousal, meaning that valence is more likely than arousal to be affected by the object. Based on this observation, we add object-based features to our system for predicting emotions.

In recent years, many studies have used CNN models of various structures trained on the ImageNet dataset [36] for image classification. We use three of the most popular models (AlexNet, VGG16, and ResNet) to extract object features from our image dataset and examine the effect of the features extracted from each model on the emotion prediction results. Our object feature is the final output layer of the classifier, which represents the probabilities of 1000 object categories.

D. Semantic features

As a high-level feature with a concept similar to the object, we consider semantic information that describes the background of the image. What the object is matters, but what the background is made of is just as important. For example, if the main object of an image is a person, the emotion may differ depending on whether the background is a city with many buildings or nature such as a mountain or the sea. The proportions of sky, sea, or buildings in the background can also affect the emotion. To use the semantic information of the background as a feature, we perform scene parsing on all images. Wu et al. [37] proposed a semantic segmentation method based on a deep network, which classifies each pixel of an image into one of 150 semantic categories. Given the resulting semantic map, we determine which of the 150 categories each pixel belongs to; as a result, a 150-dimensional vector is obtained and used as one of the input features.
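The two high-level features can be assembled as in the following sketch. The 1000-D object probability vector and the 150-category scene-parsing label map are assumed to come from pre-trained networks (e.g., VGG16 on ImageNet and a 150-category scene parser); here they are replaced by random placeholders so the snippet is self-contained.

```python
# Sketch: build the 150-D semantic coverage histogram from a per-pixel label
# map and concatenate it with a 1000-D object probability vector.
import numpy as np

def semantic_histogram(label_map, num_classes=150):
    """Fraction of pixels assigned to each of the 150 semantic categories."""
    counts = np.bincount(label_map.ravel(), minlength=num_classes)[:num_classes]
    return counts / counts.sum()

# Hypothetical outputs of the two pre-trained networks for one image
object_probs = np.random.dirichlet(np.ones(1000))        # softmax over 1000 classes
label_map = np.random.randint(0, 150, size=(224, 224))   # per-pixel category ids

high_level = np.concatenate([object_probs, semantic_histogram(label_map)])
print(high_level.shape)   # (1150,) -> later combined with the low-level features
```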
These object and semantic high-level features are combined with low-level features, such as the color and local features, to train our network for emotion prediction.

V. LEARNING EMOTION MODEL

In this section, we describe the details of our emotion prediction framework; the overall architecture is shown in Figure 9. Our model is a fully connected feed-forward neural network. In general, a neural network consists of an input layer, an output layer, and one or more hidden layers; when there are two or more hidden layers, the network is called a deep network. Each layer is made up of multiple neurons, and the edges connecting neurons in adjacent layers carry weights. The weights and biases are learned during the training phase. Our network F(X, θ), with an input layer, three hidden layers, and an output layer, is

F(X, \theta) = f^{4} \circ g^{3} \circ f^{3} \circ g^{2} \circ f^{2} \circ g^{1} \circ f^{1}(X),    (4)

where X is the input feature vector, θ is the set of parameters consisting of the weights w and biases b, and f^{4} produces the final output of the network (the valence or arousal value). Specifically, given an input vector X^{l} = [x^{l}_{1}, \dots, x^{l}_{n}]^{T} in layer l, the preactivation value p^{l+1}_{j} of neuron j in layer l+1 is obtained through the function f^{l+1}:

p^{l+1}_{j} = f^{l+1}(X^{l}) = \sum_{i=1}^{n} w^{l+1}_{ij} x^{l}_{i} + b^{l+1}_{j},    (5)

where w^{l+1}_{ij} is the weight connecting x^{l}_{i} in layer l to neuron j in layer l+1, b^{l+1}_{j} is the bias of neuron j in layer l+1, and n is the number of neurons in layer l. The activation x^{l+1}_{j} of neuron j in layer l+1, which also serves as input to the next layer, is then obtained through the nonlinear activation function g^{l+1}:

x^{l+1}_{j} = g^{l+1}(p^{l+1}_{j}).    (6)

The rectified linear unit (ReLU) [1], max(0, x), is used as the nonlinear activation function throughout the network. The number of neurons in each layer is given in Table I. We set the loss function of our network to

L = \sum_{k=1}^{K} (f^{4}_{k} - T_{k})^{2},

where f^{4}_{k} and T_{k} are the value predicted by our model and the ground-truth emotion value of the given image I, and K is the number of images.
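A minimal sketch of this feed-forward network, written with tf.keras rather than the low-level TensorFlow API, is shown below. It uses the layer sizes listed in Table I and the training settings stated in the next paragraph (MSE loss, SGD with momentum 0.9, learning rate 0.0001, batch size 1,000); one such network is trained per output dimension (valence or arousal).

```python
# Sketch of the emotion FFNN of Eqs. (4)-(6) with ReLU hidden layers.
import tensorflow as tf

def build_emotion_ffnn(input_dim=1588):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(3000, activation="relu", input_shape=(input_dim,)),  # H1
        tf.keras.layers.Dense(1000, activation="relu"),                            # H2
        tf.keras.layers.Dense(1000, activation="relu"),                            # H3
        tf.keras.layers.Dense(1),                            # valence or arousal value
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9),
        loss="mse",
    )
    return model

# model = build_emotion_ffnn()
# model.fit(train_features, train_valence, batch_size=1000, epochs=100)
```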
Fig. 9. The overall architecture of the proposed emotion prediction model. The scene segmentation output (150-D vector) and the object classification output (1000-D vector) are concatenated with the low-level features and fed to the emotion prediction model, which outputs the V-A values.

TABLE I
THE NUMBER OF NEURONS IN EACH LAYER

                  Input   H1     H2     H3     Output
Num. of neurons   1588    3000   1000   1000   1

In the training phase, the network weights θ are updated by backpropagating the gradients through all layers; by minimizing the loss function, we optimize the weights of the network. We set the learning rate to 0.0001 and train the network using the stochastic gradient descent (SGD) optimization method with a momentum of 0.9. We set the batch size to 1,000 and train the model until the error no longer diminishes. All experiments are implemented using the open source deep learning framework TensorFlow [38].

VI. EXPERIMENT

A. Model performance

We first evaluate the performance of our model. The entire dataset is divided into five groups; four groups are used for training and the remaining group for testing, and each group is used as the test set once. In each fold, the number of training images is 8,600 and the number of test images is 2,166. All folds are trained with the same structure. The results are presented in Table II, where the number in the first row denotes the held-out group: for example, column g1 shows the training and test error when groups 2 to 5 are used as training data and group 1 as test data. The dimension of the input feature is 3088, as described in Table I, and the output is the valence or arousal value.

To compare the performance of the object features obtained from the three CNNs, we build several models by combining each object feature with the other features and compare their performance. In Table II, 'A', 'V', and 'R' denote AlexNet, VGG16, and ResNet, respectively, and each entry is the mean squared error between the ground-truth emotion value and the value predicted by the corresponding model. We also extracted category-level features from [39], which are not CNN based, and included them as input features; Borth et al. [21] also used this feature to predict the sentiment of images. The dimension of this feature is 2000, and the number of input neurons in our model is changed to 2588 accordingly; the rows labeled 'O' show the corresponding results. The results show that the features from VGG16 achieve the best performance for both the valence and the arousal models (valence: 1.64, arousal: 1.47).

TABLE II
RESULTS OF 5-FOLD VALIDATION EXPERIMENTS (MEAN SQUARED ERROR)

                          g1     g2     g3     g4     g5     Avg.
Valence  Training   A     1.37   1.37   1.35   1.37   1.38   1.37
                    V     1.31   1.31   1.30   1.30   1.33   1.31
                    R     1.29   1.30   1.28   1.31   1.30   1.30
                    O     1.47   1.47   1.46   1.48   1.48   1.47
         Test       A     1.72   1.68   1.67   1.69   1.61   1.67
                    V     1.68   1.66   1.63   1.63   1.59   1.64
                    R     1.70   1.66   1.65   1.62   1.60   1.65
                    O     1.80   1.81   1.73   1.73   1.69   1.75
Arousal  Training   A     1.22   1.24   1.20   1.21   1.23   1.22
                    V     1.16   1.21   1.17   1.16   1.16   1.17
                    R     1.18   1.22   1.18   1.18   1.18   1.19
                    O     1.26   1.30   1.25   1.27   1.29   1.27
         Test       A     1.50   1.52   1.50   1.48   1.44   1.49
                    V     1.48   1.49   1.46   1.48   1.44   1.47
                    R     1.49   1.49   1.47   1.47   1.45   1.48
                    O     1.56   1.53   1.54   1.50   1.48   1.52
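The 5-fold protocol of Table II can be sketched as follows. The `build_model` argument is assumed to be a constructor like the FFNN sketch above, and the actual group assignment used in the paper may differ from this random split.

```python
# Sketch of the 5-fold evaluation: train on four groups, report MSE on the fifth.
import numpy as np
from sklearn.model_selection import KFold

def five_fold_mse(features, labels, build_model):
    errors = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                     random_state=0).split(features):
        model = build_model(features.shape[1])
        model.fit(features[train_idx], labels[train_idx],
                  batch_size=1000, epochs=50, verbose=0)
        pred = model.predict(features[test_idx], verbose=0).ravel()
        errors.append(float(np.mean((pred - labels[test_idx]) ** 2)))
    return errors, float(np.mean(errors))

# errors, avg = five_fold_mse(features, valence_labels, build_emotion_ffnn)
```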
Figures 10 to 14 show a qualitative analysis with the emotion values and the accuracy predicted by our model. In Figure 10, the images are placed so that the predicted values match their locations in the V-A space. Figures 11 and 12 show the results of the valence model, and Figures 13 and 14 show the results of the arousal model.

B. Feature performance

We also investigate the effect of each of the proposed features. First, we combine the color features and the local features into a single low-level feature. To construct a network model for each individual feature, we use the structure of our model and change the number of neurons: our proposed model consists of three hidden layers with 3000, 1000, and 500 neurons, and in each per-feature model the number of nodes in each layer keeps the ratio between the numbers of nodes in adjacent layers of our model (Table III, bottom).
As a result, among the individual features, the object feature shows the best emotion prediction performance. For valence, the object feature extracted from VGG16 gives the best result (MSE: 1.92); for arousal, the object feature extracted from AlexNet gives the best result (MSE: 1.61). The semantic feature also shows lower error than the low-level features. Our model combining all the features achieves the best prediction performance, which we attribute to the synergy between features extracted from various sources and the expressive power of the deep neural network.

TABLE III
PERFORMANCE OF EACH FEATURE (MSE, TOP) AND LAYER SIZES OF EACH FEATURE MODEL (BOTTOM)

           Low     Object                          Semantic   All
                   AlexNet   VGG16    ResNet
Valence    1.98    1.93      1.92     1.98         1.97       1.64
Arousal    1.66    1.61      1.62     1.68         1.64       1.47

           Low     Object    Semantic   All
Input      438     1000      150        1588
H1         900     2000      300        3000
H2         300     700       100        1000
H3         150     350       50         500
Output     1       1         1          1

C. Comparison with CNN

Several studies have trained emotion classification models from weights pre-trained for image classification [9]–[13]. We compare our emotion prediction model with CNN-based models obtained by transfer learning. Two CNN structures are used for the comparison: AlexNet and VGG19. We initialize the weights of AlexNet and VGG19 to the weights learned for image classification. Except for the final output layer, the convolutional layers and the fully connected layers keep the existing model structure; since the ImageNet-based CNNs have 1000 output units, we change the number of outputs to 1 for our purpose (valence or arousal).

Transfer learning can also be applied in different ways. AlexNet has five convolutional layers and three fully connected layers, and VGG19 has sixteen convolutional layers and three fully connected layers. In transfer learning, we can choose which layers to freeze and which to train, and we experiment with two conditions: freezing the convolutional layers and training only the fully connected layers (conv-frozen), and training all layers together (conv-train). The training setup for the CNN-based models is similar to that of our FFNN model except for the batch size and learning rate: the learning rate of both the conv-frozen and the conv-train networks is 0.5 × 10^-4, and the batch size of both CNN models is 50. All models, including ours, are trained and tested on the same dataset splits.

The results are shown in Table IV. The second column shows the training range: conv-frozen means that only the fully connected layers were trained, and conv-train means that all layers were trained.

TABLE IV
COMPARISON WITH OTHER LEARNING METHODS (MSE)

Emotion   Train range   AlexNet   VGG19   Linear   SVR    Ours
Valence   conv-frozen   2.76      2.64    2.42     1.75   1.64
          conv-train    2.64      2.60
Arousal   conv-frozen   1.95      1.91    2.65     1.54   1.47
          conv-train    1.89      1.87

From the training-range perspective, the error when the entire network is fine-tuned is smaller than when only the fully connected layers are trained. This implies that the low-level filters, as well as the high-level filters, must be adapted in order to achieve a better emotion prediction model; in other words, the convolutional layers and the fully connected layers must be trained together.
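The two transfer-learning baselines (conv-frozen and conv-train) can be sketched as below. The global pooling layer and the sizes of the new fully connected head are assumptions; only the single-unit regression output and the stated learning rate come from the text.

```python
# Sketch: pre-trained VGG19 with its 1000-way output replaced by one regression
# unit, with convolutional layers either frozen (conv-frozen) or trainable (conv-train).
import tensorflow as tf

def build_vgg19_regressor(freeze_conv=True):
    base = tf.keras.applications.VGG19(weights="imagenet", include_top=False,
                                       input_shape=(224, 224, 3), pooling="avg")
    base.trainable = not freeze_conv          # conv-frozen vs. conv-train
    x = tf.keras.layers.Dense(4096, activation="relu")(base.output)
    x = tf.keras.layers.Dense(4096, activation="relu")(x)
    out = tf.keras.layers.Dense(1)(x)         # single valence or arousal value
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.5e-4,
                                                    momentum=0.9),
                  loss="mse")
    return model

# conv_frozen = build_vgg19_regressor(freeze_conv=True)
# conv_train  = build_vgg19_regressor(freeze_conv=False)
```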
On the model side, the model based on the VGG19 structure performs better than the one based on AlexNet. However, both models show considerably higher error than our proposed emotion-based FFNN. When the training and test data are of the same type with the same classes, as in object detection or image classification, CNN-based models train well and predict accurately; however, as mentioned earlier, images with different shapes can have the same emotions, and images with similar shapes can have different emotions. We also tested other machine learning methods on the same dataset, namely linear regression and support vector regression (SVR). Both perform better than the CNN-based models, but our model still shows the best performance (see Table IV).
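A sketch of the two additional baselines (linear regression and SVR) on the same feature vectors is given below; the SVR hyperparameters are scikit-learn defaults, not values reported in the paper.

```python
# Sketch: linear regression and SVR baselines evaluated with MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

def baseline_mse(train_x, train_y, test_x, test_y):
    results = {}
    for name, model in [("linear", LinearRegression()), ("svr", SVR())]:
        model.fit(train_x, train_y)
        pred = model.predict(test_x)
        results[name] = float(np.mean((pred - test_y) ** 2))
    return results

# print(baseline_mse(train_features, train_valence, test_features, test_valence))
```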
Fig. 10. Prediction results. The values below each image show the prediction results of the FFNN and the ground-truth emotion values, respectively. The prediction accuracy results (V/A) are also shown.

VII. CONCLUSION

In this paper, we presented a new emotion recognition system with a deep learning framework. To reduce the affective gap, we designed and extracted object and background semantic features as high-level features and showed that these features are effective for emotion prediction. High-level and low-level features complement each other well, which leads to better emotion recognition performance.

As expected, the accuracy of the object recognition has an impact on the performance of the emotion prediction: object features with incorrect recognition may lead to incorrect emotion predictions, and there is also a problem when the main object of the image is not included in the existing 1000 classes. However, with the rapid progress in deep learning with large datasets, both the accuracy and the number of classes will increase, which in turn will also help our emotion recognition system.

As interesting future work, one can consider the presence of a person in the image and the facial expression. A facial expression is one of the features that can greatly affect emotion prediction: even if the overall mood of the image is dark, a smiling face can mitigate the negative emotion a little (Figure 15), and when the face occupies most of the photograph, facial expression and emotion are directly connected. We will consider enhancing emotion recognition performance by adding a facial expression recognition framework. Several studies have also demonstrated that biometric data has a positive effect on emotion recognition [40], [41], so we may further consider using biometric data or an observer's facial features as additional inputs. In general, however, deep networks require thousands or tens of thousands of samples, and collecting such biometric and facial data is not an easy task; finding a way to improve the model with a small amount of biometric data is left as another direction for future work. We also built a database for emotion estimation with the V-A model and will continue to collect more data. We expect our dataset to be widely used in the field of affective computing.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[5] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[7] Z. Cheng, Q. Yang, and B. Sheng, "Deep colorization," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 415–423.
[8] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in European Conference on Computer Vision. Springer, 2014, pp. 184–199.
[9] V. Campos, A. Salvador, X. Giro-i Nieto, and B. Jou, "Diving deep into sentiment: Understanding fine-tuned CNNs for visual sentiment prediction," in Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia. ACM, 2015, pp. 57–62.
[10] K.-C. Peng, T. Chen, A. Sadovnik, and A. C. Gallagher, "A mixed bag of emotions: Model, predict, and transfer emotion distributions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 860–868.
[11] Q. You, J. Luo, H. Jin, and J. Yang, "Robust image sentiment analysis using progressively trained and domain transferred deep networks," in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
Fig. 11. Valence prediction results of the images with low arousal. The values below each image represent the predicted valence value and the ground-truth value (prediction/ground truth), along with the prediction accuracy. From the top left to the bottom right, the value of valence increases. All images in this example have low arousal values.

Fig. 12. Valence prediction results of the images with high arousal. The values below each image represent the predicted valence value and the ground-truth value (prediction/ground truth), along with the prediction accuracy. From the top left to the bottom right, the value of valence increases. All images in this example have high arousal values.
Fig. 13. Arousal prediction results of the images with low valence. The values below each image represent the predicted arousal value and the ground-truth value (prediction/ground truth), along with the prediction accuracy. From the top left to the bottom right, the value of arousal increases. All images in this example have low valence values.

Fig. 14. Arousal prediction results of the images with high valence. The values below each image represent the predicted arousal value and the ground-truth value (prediction/ground truth), along with the prediction accuracy. From the top left to the bottom right, the value of arousal increases. All images in this example have high valence values.
Fig. 15. Different emotions with different facial expressions.

[12] Q. You, J. Luo, H. Jin, and J. Yang, "Building a large scale dataset for image emotion recognition: The fine print and the benchmark," 2016.
[13] C. Xu, S. Cetintas, K.-C. Lee, and L.-J. Li, "Visual sentiment prediction with deep convolutional neural networks," arXiv preprint arXiv:1411.5731, 2014.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[15] A. Hanjalic, "Extracting moods from pictures and sounds: Towards truly personalized TV," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 90–100, 2006.
[16] X. Lu, P. Suryanarayan, R. B. Adams Jr, J. Li, M. G. Newman, and J. Z. Wang, "On shape and the computability of emotions," in Proceedings of the 20th ACM International Conference on Multimedia. ACM, 2012, pp. 229–238.
[17] J. Machajdik and A. Hanbury, "Affective image classification using features inspired by psychology and art theory," in Proceedings of the 18th ACM International Conference on Multimedia. ACM, 2010, pp. 83–92.
[18] S. Zhao, Y. Gao, X. Jiang, H. Yao, T.-S. Chua, and X. Sun, "Exploring principles-of-art features for image emotion recognition," in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 47–56.
[19] S. Haykin, Neural Networks: A Comprehensive Foundation. Prentice Hall, 1999. [Online]. Available: https://books.google.co.kr/books?id=3-1HPwAACAAJ
[20] C. E. Osgood, "The nature and measurement of meaning," Psychological Bulletin, vol. 49, no. 3, p. 197, 1952.
[21] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, "Large-scale visual sentiment ontology and detectors using adjective noun pairs," in Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013, pp. 223–232.
[22] T. Chen, D. Borth, T. Darrell, and S.-F. Chang, "Deepsentibank: Visual sentiment concept classification with deep convolutional neural networks," arXiv preprint arXiv:1410.8586, 2014.
[23] V. Campos, B. Jou, and X. Giro-i Nieto, "From pixels to sentiment: Fine-tuning CNNs for visual sentiment prediction," Image and Vision Computing, 2017.
[24] J. Islam and Y. Zhang, "Visual sentiment analysis for social images using transfer learning approach," in 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom). IEEE, 2016, pp. 124–130.
[25] M. M. Bradley and P. J. Lang, "Measuring emotion: the self-assessment manikin and the semantic differential," Journal of Behavior Therapy and Experimental Psychiatry, vol. 25, no. 1, pp. 49–59, 1994.
[26] J. A. Russell, "A circumplex model of affect," 1980.
[27] Flickr, "http://www.flickr.com."
[28] P. J. Lang, M. M. Bradley, and B. N. Cuthbert, "International affective picture system (IAPS): Affective ratings of pictures and instruction manual," Technical Report A-8, 2008.
[29] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley, "Color transfer between images," IEEE Computer Graphics and Applications, no. 5, pp. 34–41, 2001.
[30] G. Csurka, S. Skaff, L. Marchesotti, and C. Saunders, "Learning moods and emotions from color combinations," in Proceedings of the Seventh Indian Conference on Computer Vision, Graphics and Image Processing. ACM, 2010, pp. 298–305.
[31] L. He, H. Qi, and R. Zaretzki, "Image color transfer to evoke different emotions based on color combinations," Signal, Image and Video Processing, pp. 1–9, 2014.
[32] H.-R. Kim, H. Kang, and I.-K. Lee, "Image recoloring with valence-arousal emotion model," in Computer Graphics Forum, vol. 35, no. 7. Wiley Online Library, 2016, pp. 209–216.
[33] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus, "Learning color names for real-world applications," IEEE Transactions on Image Processing, vol. 18, no. 7, pp. 1512–1523, 2009.
[34] P. Valdez and A. Mehrabian, "Effects of color on emotions," Journal of Experimental Psychology: General, vol. 123, no. 4, p. 394, 1994.
[35] A. B. Warriner, V. Kuperman, and M. Brysbaert, "Norms of valence, arousal, and dominance for 13,915 English lemmas," Behavior Research Methods, vol. 45, no. 4, pp. 1191–1207, 2013.
[36] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
[37] Z. Wu, C. Shen, and A. v. d. Hengel, "Wider or deeper: Revisiting the resnet model for visual recognition," arXiv preprint arXiv:1611.10080, 2016.
[38] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[39] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, and S.-F. Chang, "Designing category-level attributes for discriminative visual recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 771–778.
[40] L. Aftanas, N. Reva, A. Varlamov, S. Pavlov, and V. Makhnev, "Analysis of evoked EEG synchronization and desynchronization in conditions of emotional activation in humans: temporal and topographic characteristics," Neuroscience and Behavioral Physiology, vol. 34, no. 8, pp. 859–867, 2004.
[41] G. Chanel, J. Kronegg, D. Grandjean, and T. Pun, "Emotion assessment: Arousal evaluation using EEG's and peripheral physiological signals," Multimedia Content Representation, Classification and Security, pp. 530–537, 2006.