This document describes a deep neural network model for predicting emotions in images based on high-level semantic features like objects and backgrounds. The model was trained on a new image emotion dataset consisting of over 10,000 images labeled with valence and arousal values obtained through crowdsourcing. The model uses features like object detection, background semantics, and color statistics as inputs to a feedforward neural network that predicts continuous valence and arousal values, providing a more nuanced representation of emotions than discrete categories. Experiments showed the model could effectively predict emotions in images.
1. 1
Building Emotional Machines: Recognizing Image
Emotions through Deep Neural Networks
Hye-Rin Kim, Yeong-Seok Kim, Seon Joo Kim, In-Kwon Lee
Abstract—An image is a very effective tool for conveying emo-
tions. Many researchers have investigated in computing the image
emotions by using various features extracted from images. In this
paper, we focus on two high level features, the object and the
background, and assume that the semantic information of images
is a good cue for predicting emotion. An object is one of the most
important elements that define an image, and we find out through
experiments that there is a high correlation between the object
and the emotion in images. Even with the same object, there may
be slight difference in emotion due to different backgrounds, and
we use the semantic information of the background to improve
the prediction performance. By combining the different levels of
features, we build an emotion based feed forward deep neural
network which produces the emotion values of a given image.
The output emotion values in our framework are continuous
values in the 2-dimensional space (Valence and Arousal), which
are more effective than using a few number of emotion categories
in describing emotions. Experiments confirm the effectiveness of
our network in predicting the emotion of images.
I. INTRODUCTION
Images are very powerful tools for conveying moods and
emotions as shown in Figure 1. Through images, people can
express their feelings and communicate with other people.
With the recent development in the deep learning technology,
computers have become better at recognizing objects, faces,
and actions. Computers have also started to write image
captions and answer questions about images. But how about
emotions? Can we teach computers to have similar feeling
as humans do when looking at images? Predicting evoked
emotion from an image is a difficult task and is still in its
early stage.
As the deep learning technology shows remarkable perfor-
mance in various computer vision tasks such as image clas-
sifcation [1]–[4], segmentation [5], [6] and image processing
[7], [8], several studies have been introduced recently that
apply the deep learning for the emotion prediction [9]–[13].
Those works mostly use the convolutional neural network
(CNN) [14], which has shown better prediction results for
the emotion classification compared to the model that uses
a shallow network such as the linear model.
The CNN has had a big impact especially in the image
classification, and it is an effective network model for learning
filters that capture the shapes that repeatedly appear in images.
However, we argue that the learning process for the image
emotion prediction should be different from that of image
classification. This is because some images with different
appearances can have same emotions, and some images with
similar appearances can have different emotions. For example,
an image that includes a person riding a bike and an image
Fig. 1. Images with different emotions.
that includes a person surfing in the ocean may give the same
feelings, even though they look different. From this point of
view, the performance may be limited if CNN is applied to an
emotion prediction system.
In addition, one of the main issues in the emotion recog-
nition is the affective gap. The affective gap is the lack of
coincidence between the measurable signal properties, com-
monly referred to as features, and the expected affective state
in which the user is brought by perceiving the signal [15].
To narrow this affective gap, several works proposed emotion
classification systems based on the psychology and art theory
based high level features such as harmony, movement, rule of
third, etc. [16]–[18]. While those features help to improve the
emotion recognition, a better set of features are still necessary.
Distinguishing the effective features among various features
is also important. As another example, similar to the example
above, images with the objects such as guns or sharks arouse
scary feeling, while images with babies or flowers lead to more
happiness. It can be speculated that certain objects will affect
the determination of emotions. Based on this observation, we
assume that the main object appearing in the image plays an
important role in determining the emotion. The idea is that
object categories can be good cues for emotions. Through
arXiv:1705.07543v2
[cs.CV]
3
Jul
2017
2. 2
1 9
Negative Positive
1
9
Exciting
Calm
Valence
Arousal
Fig. 2. Two dimensional emotion models (Valence and Arousal).
experiments, we show that objects appearing in images are
related to emotion values, and objects are used as one of
the features of our model. Besides, the emotion, even if
images include the same object, can vary depending on the
background. We also use the semantic information of the
background as our features to improve prediction performance.
Predicting emotions from images is a complex task that is
quite different from the object detection or the image classifi-
cation. In this paper, we combine high-level features such as
the object and the background information extracted from a
pre-trained deep network model for the image classification
and the segmentation with low-level features such as color
statistics to obtain a set of features for the emotion prediction.
Using this feature set, we design a feed forward deep network
(FFNN) [19] which produces the emotion value for a given
image.
In previous works for the emotion prediction, the emotions
are categorized by a few number of classes, such as happy,
awe, sad, fear, etc. In comparison, we opt for the dimensional
model for expressing the emotions [20], which is widely used
in the field of psychology. Specifically, the dimensional model
consists of two parameters, Valence and Arousal (Figure 2).
Valence represents the pleasure through the scale from 1
(negative) to 9 (positive). Arousal is the level of excitement,
which also ranges from 1 (calm) to 9 (excited). Using this
model, any emotion can be represented by using these two
values in 2D space. As this dimensional model can express
more emotions compared to using a discrete set of emotion
categories, we train our emotion prediction system based on
the V-A model.
The contributions of our work are summarized as follows:
• We propose an image emotion recognition system which
outputs valence-arousal values for expressing emotion. As
far as we know, this is the first system to build the deep
learning model for the dimensional emotion model (VA
model).
• We build a new image emotion dataset through the
crowdsourcing. The feed forward neural network is used
to learn emotion features from this database.
• We propose a novel idea of using an pre-trained CNN to
relate the main object and background of the image to
the emotional feature.
• We show the effectiveness of an object to estimate an
emotion of an image using correlations between emo-
tional values of words and images.
II. RELATED WORK
Emotion of an image can be evoked by various factors.
To figure out significant features for the emotion prediction
problem, many researchers have considered various types of
the features from color statistics to the art and the psycho-
logical features. Machajdik et al. [17] introduced an affective
image classification system using psychology and art theory
based features such as Itten’s color contrast and rule of thirds.
Zhao et al. [18] proposed to extract principles of art features
for emotion classification. Similarly, Lu et al. [16] computed
the shape based features in natural images. As a high level
concept for representing the sentiment of an image, adjective-
noun pairs were introduced by Borth et al. in [21].
Despite the rise of deep learning studies, relatively few
studies have attempted to address the emotion prediction of
images using the deep network. With the data set from Borth
et al. [21], Chen et al. [22] classified the adjective-noun pairs
using the CNN and achieved better accurate classification
performance than Borth et al. [21].
Several studies utilized the pre-trained model for the im-
age classification and transferred the learned parameter. By
changing the number of outputs to be the same as the number
of labels of their dataset, the classifier can be trained. For
example, for a binary classification with a positive and a
negative label, the number of output would be two. Some
researchers [9]–[13] used AlexNet’s structure [1] with pre-
trained weight to train their sentiment prediction frameworks
with different output numbers. Binary classification which
gives either a positive or a negative label was considered
in [11], [13], [23], [24]. Peng et al. [10] and You et al. [12]
trained the classifiers for seven and eight classes, respectively.
In most studies, an emotion classification model was created
through transfer learning using a pre-trained model for image
classification. In general, fine-tuned models record better emo-
tion classification performance compared to previous studies
with shallow networks. In this paper, we use a feed forward
neural network using low level and high-level features rather
than transfer learning method. Our method is compared with
transfer learning method by using same training and test
dataset.
Compared to the previous CNN based emotion recognition
systems which are limited to output only a few number of
emotional states, we propose to use the valence-arousal model
to represent emotions. By using these two parameters that lie
3. 3
Valence (negative - positive)
Arousal (calm - exciting)
Fig. 3. Image representation of the Self Assessment Manikin (SAM) [25]
used for V-A value assignment in the user study.
on a continuous space, we can represent emotions much better
than the previous works. To enable the learning of the VA
values using the CNN, we build a large set of data with various
emotions and obtain the V-A labels through the crowdsourcing
survey.
III. IMAGE EMOTION DATASET
In order to build the dataset with emotion values (valence
and arousal), we collected a large set of images from two
resources. We first searched for the emotional images using
the 22 keywords from [26] in Flickr [27]. Those words include
what we call basic emotions (happy, angry, afraid, sad), as well
as less prototypical emotions (gloomy, bored) and affective
states (sleepy, serene).
Using the keywords, we first collected over twenty thousand
images. Since the goal is to collect emotional images, we man-
ually eliminated non-emotional images from the collection.
Each candidate image in the dataset was assessed by three
human subjects, and only the images that were determined to
be emotional by more than two subjects were included in the
database. The words used for the search are listed along with
the number of images in the dataset: afraid (113), alarmed
(14), angry (217), annoyed (179), distressed (150), frustrated
(77), tense (8), aroused (4), delighted (28), excited (734), glad
(315), happy (102), astonished (13), at ease (16), content (656),
satisfied (33), serene (1917), pleased (42), depressed (179),
bored (312), tired (942), gloomy(793). As a result, the total
number of image is 6844.
Second, we collected more images from [12] which include
eight types of emotion categories (Amusement, Awe, Content-
ment, Excitement, Anger, Disgust, Fear, Sad). We took 3,236
images out of 23,308 images. An even number of images from
eight classes have been selected and included in our database.
As a result, a total of 10766 images is acquired in our dataset.
Next, we use the AMT to assign V-A emotion values for
all images in our database. Given an image, a worker rates the
emotion values for each image using the representation of the
V-A scale, Self-Assessment Manikin (SAM) [25] (Figure 3).
In each question, two images are shown to the worker. One
is the image whose emotion value is to be measured, and the
other is the image of the previous question with emotion value
Fig. 4. The histogram of valence and arousal value from the collected images.
recorded by the worker. By showing the previous image, the
worker can measure a value for the given image compared to
the value selected in the previous problem. It also alleviates the
difficulty of having to choose absolute numbers. Each image is
represented to the worker randomly and is evaluated only once
by the same worker. To ensure the quality of the answers, we
only allow the workers with 95% approval rate to participate in
our Human Intelligence Task (HIT). We also limit the number
of questions for each worker to 200 to keep them focused
throughout the test.
A total of 1,339 workers were recruited to assign the
emotion values of the 10,766 images in the dataset. The
evaluation time per image varied from 10 seconds to a minute
depending on the worker. Each image is evaluated on at least
five workers, and the average of the acquired values is assigned
to the emotion value of the image. Figure 4 shows the total
distribution of the V-A values of the images. In the case of
Valence, it can be seen that there are relatively more positive
images than negative images. In common sense, people usually
share the positive images, rather than negative images, and this
tendency is well represented in this distribution. In the case
of Arousal, various emotion values were obtained except for
the extreme values. Figure 5 shows some of the images in our
database.
We compare the emotion distribution of the dataset we
collected with the distribution of IAPS dataset which is an
emotion dataset introduced in [28]. The IAPS consists of of
1182 images, each of which contains valence, arousal and
dominance values. Although the number of images is different,
we can see that our data is spread more than the IAPS data
distribution in the V-A emotion space (Figure 6).
We provide additional analysis to validate our new dataset.
We split Valence-Arousal space into 4 by 4 sections: Most
negative (from 1 to 3), negative (from 3 to 5), positive (from
5 to 7), and most positive (from 7 to 9). Most calm (from 1
to 3), calm (from 3 to 5), exciting (from 5 to 7), and most
exciting (from 7 to 9). We then place the images in the emotion
space according to the obtained emotion values and extract
the most used words (using the image tags) in each section.
The results are shown in Figure 7. In the negative section
of V-A space, most words have low valence values such as
afraid, angry, annoyed, depressed and gloomy. On the other
hand, the words such as content, serene, excited, and glad are
included in the positive section. In the case of the arousal,
the lowest arousal side (from 1 to 3) consists of the words
such as serene, tired and gloomy and the highest arousal side
4. 4
Fig. 5. Example images in our database. From left to right side, the valence value of the image increase. The images on the left/right have negative/positive
emotion. From bottom to top, the arousal value of the image increase. The images on the bottom/top have the calm/exciting emotion.
Fig. 6. Emotion distribution of our database and the IAPS [28] in V-A emotion
space.
(from 7 to 9) include the words such as excited, angry and
distressed. This analysis indicates that the use of words for
image collection and user study is appropriate for obtaining
as diverse and well-distributed emotion values as possible in
space. The new image emotion dataset can be used for the
emotion recognition in two ways. A classifier can be trained to
output either the continuous V-A values or discrete categories
of emotions using the words in Figure 7.
IV. FEATURES
Now, we extract a various type of features for emotion
prediction including color, local, object, and semantic features.
A. Color features
Color is the most basic and powerful element to express
emotions and can be effectively used by artists to induce emo-
tional effects. Many studies have been conducted to change
the color of an image as a means to change the emotion
of the image [29]–[32]. Color is not an element that can
directly resolve an affective-gap because it can be viewed as
a low-dimensional feature, but color is still a crucial factor in
emotion recognition. We extract the mean values of RGB and
HSV color space as the basic color characteristics. We also
calculate the HSV histogram and extract the label number and
Fig. 7. The most used words in each section.
values of the bin with the largest value in the histogram. A
concept similar to a color histogram, it calculates how much of
the 11 basic colors exist in the image [33]. According to [20],
saturation and brightness can have direct in on pleasure,
arousal, and dominance. Using the saturation and brightness
values, Valdez and Mehrabian [34] introduced formulas to
measure the value of pleasure, arousal, and dominance through
experiments. The formula for computing the values is as
follows:
Pleasure = 0.69Y + 0.22S (1)
Arousal = 0.31Y + 0.60S (2)
Dominance = 0.76Y + 0.32S (3)
We measure the values for the three elements from the image
and use them as features.
B. Local features
We exploit two kinds of local features used in the [21].
We use a 512-dimensional GIST descriptor that is effective
5. 5
Fig. 8. Correlation between image emotion and object emotion. Image
emotion values are taken from the IAPS data [28] set and object emotion
values are taken from the word emotion dictionary [35].
in detecting scenes and a 59-dimensional local binary pattern
(LBP) descriptor that is effective in detecting textures.
C. Object features
Identifying the emotion of the image from low-level fea-
tures, such as color statistics and texture-related features, is
difficult for a human subject. Many researchers stated the need
for high level of features on the affective level and designed
the various types of features for emotion prediction. In our
study, we assume that the object is one of the most important
factors contributing to the emotion of an image. We conducted
an experiment to prove that the object in images has relevance
to the emotion elicited from the images. Each image in the
IAPS dataset, an emotion dataset introduced in [28], includes
a tag representing the primary object (e.g., baby, snake, and
shark) and the V-A emotion values. In order to replace the
object tag with the emotion level representation, we adopt the
word emotion dictionary [35] with the valence and the arousal
values for each word. We convert each tag to V-A values
by searching for the tag in the word emotion dictionary. By
measuring the Pearson correlation coefficient, we can easily
understand the relationship between the object and the image
emotion. Figure 8 shows the results of the experiment. As you
can see, the correlation between the object emotion and the
image emotion is significantly high, which means the object
affects the emotion of the image. Especially, the correlation
of the valence is higher than the arousal, which means that
valence is more likely to be affected by the object than the
arousal.
Based on this observation, we add object-based features
to our system in predicting emotions. In recent years, many
studies have used CNN models of various structures using
ImageNet dataset [36] for the image classification. We use
three of the most popular models (AlexNet, VGG16, and
ResNet) to extract object features from our image datasets
and experiment with the effects of features extracted from
each model on emotion prediction results. Our object feature
is the result of the final output layer, and it represents the
probabilities of 1000 object categories.
D. Sematic features
As a high-level feature with a similar concept to object, we
consider another semantic information that can describe the
background of the image. It is important what the object is
in the image, but what the background is made up of is as
important. For example, if the main object of an image is a
person, the emotion may be different depending on whether
the background is a city with a lot of buildings or nature
such as a mountain or a sea. Also, the ratio of sky, sea, or
buildings in the background can also affect the emotions. To
use the semantic information of the background as a feature,
we perform scene parsing on all images. Wu et al. [37]
proposed a semantic segmentation method based on a deep
network, which classifies each pixel of an image into one of
150 semantic categories. Given a semantic map, we find out
which of the 150 semantic categories each pixel in the image
belongs to. As a result, a 150-dimensional vector is obtained
and used as one of the input features.
These object and semantic-based high level features are
combined with low level features such as color and local
features to learn our networks for emotion prediction.
V. LEARNING EMOTION MODEL
In this section, we introduce the details of our emotion
prediction framework. The overall architecture of our frame-
work is shown in Figure 9. Our model is a fully connected
feedforward neural network. In general, a neural network
consists of an input layer, an output layer, and one or more
hidden layers. Normally, when there are two or more hidden
layers, the network is called deep network. Each layer is made
up of multiple neurons, and the edges that connect neurons
between adjacent layers have weights. The values of neurons
(except the neurons in input layer) and weights are trained
during a training phase.
Our network F(X, θ) including an input layer, three hidden
layers and an output layer is as follows:
F(X, θ) = f 4
◦ g3
◦ f 3
◦ g2
◦ f 2
◦ g1
◦ f 1
(X), (4)
where X is input feature vector, θ is a set of weights including
weights w and bias b, and f 4 produces the final output of our
neural network (Valence and Arousal value).
Specifically, given an input vector Xl = [xl
i, ..., xl
n]
T
in layer
l, the preactivation value pj for the neuron j of layer l + 1 is
obtained through function f l+1:
pl+1
j = f l+1
(xl
) =
n
Õ
i=1
wl+1
ij xl
i + bl+1
j , (5)
where wl+1
ij is the connection weight connecting xi in layer l
to neuron j in layer l + 1, bl+1
j is the bias of neural j in layer
l + 1, and n is the number of the neuron in layer l. Then, the
output value xj of the layer l (also input vector for layer l +1)
is obtained through function gl which is a nonlinear activation
function in layer l + 1,
xl+1
j = gl+1
(pl+1
j ). (6)
Note that, rectified linear unit(ReLU) [1], max(0, x), is used as
the non-linear activation function throughout the all network.
The number of neurons in each layer is given in the
Table I. We set the loss function of our network as L =
ÍK
k=1 ( f 4
k − Tk)2, where the f 4
k and Tk are the output value
6. 6
Scene segmentation
Object classification
1000 D vector
150 D vector
…
…
…
…
Low-level features
Emotion prediction model
V-A values
1938 D vector
Fig. 9. The overall architecture of the proposed emotion prediction model.
TABLE I
THE NUMBER OF NEURONS IN EACH LAYER.
Input H1 H2 H3 Output
Num of neurons 1588 3000 1000 1000 1
predicted by our model and the ground truth emotion value
of given image I and K is the number of images. In training
phase, network weights θ are updated by backpropagating the
gradients through all layers. By minimizing the cost of the
loss function, we can optimize the weights of our network.
We set learning rate to 0.0001 and the network is trained
by using the stocahstic gradient descent (SGD) optimization
method with momentum of 0.9. We set the batch size to 1,000
and train our model until the error no longer diminishes. All
experiments are implemented by using the open source deep
learning framework Tensorflow [38].
VI. EXPERIMENT
A. Model performance
We first evaluate the performance of our model. The entire
dataset is divided into five groups, four groups are used for
the training phase, and the remaining one group is used for
the test phase. Each group is used as a test phase once. In
each group, the number of training images is 8600, and the
number of test images is 2166. All groups are learned using
the same structure.
The results are presented in Table II. The number in the
first row represents the group number. Column g1 shows the
training and test error when using group 2 to 5 as training
data and group 1 as test data. The dimension of input feature is
3088 described in Tabel I, and the output is valence or arousal
value. To compare the performance of the object features
obtained from the three CNNs, we build various models by
combining object features with other features and compare
their performance. Note that the ‘A’, ‘V’, and ‘R’ in the
third column represent the AlexNet, VGG16 and ResNet,
respectively. The number of each bin represent the mean
square error between ground truth emotion value and predicted
emotion value by each model. We also extracted category-level
TABLE II
RESULTS OF 5-FOLD VALIDATION EXPERIMENTS
g1 g2 g3 g4 g5 Avg.
Valence
Training
A 1.37 1.37 1.35 1.37 1.38 1.37
V 1.31 1.31 1.30 1.30 1.33 1.31
R 1.29 1.30 1.28 1.31 1.30 1.30
O 1.47 1.47 1.46 1.48 1.48 1.47
Test
A 1.72 1.68 1.67 1.69 1.61 1.67
V 1.68 1.66 1.63 1.63 1.59 1.64
R 1.70 1.66 1.65 1.62 1.60 1.65
O 1.80 1.81 1.73 1.73 1.69 1.75
Arousal
Training
A 1.22 1.24 1.20 1.21 1.23 1.22
V 1.16 1.21 1.17 1.16 1.16 1.17
R 1.18 1.22 1.18 1.18 1.18 1.19
O 1.26 1.30 1.25 1.27 1.29 1.27
Test
A 1.50 1.52 1.50 1.48 1.44 1.49
V 1.48 1.49 1.46 1.48 1.44 1.47
R 1.49 1.49 1.47 1.47 1.45 1.48
O 1.56 1.53 1.54 1.50 1.48 1.52
features from [39], not from CNN-based model, and included
them as input features in emotion prediction. Borth et al. [21]
also used this feature to predict the sentiment of images. Note
that, the dimension of this feature is 2000, and the number of
input feature neurons in our model is changed to 2588. The
row with ’O’ represents the prediction result. The results show
that the features from VGG16 achieved the best performance
in both the ’Valence’ and ’Arousal’ models (valence: 1.64,
arousal: 1.47).
Figures 10 to 14 show the qualitative analysis with emotion
values and accuracy which is predicted by our model. In
Figure 10, the images were placed so that the predicted values
matches the emotion values in the VA space. Figure 11 and
12 shows the results of the valence model. Figure 13 and 14
shows the results of the arousal model.
B. Feature performance
We also investigate the effects of the various features we
proposed. First, we combine the color feature and the local
features into a low level feature. As a method for constructing
a network model using each feature, we use the structure of
our model and change the number of neurons. Our proposed
model consists of 3 hidden layers with 3000, 1000, and 500
7. 7
neurons. In the model for feature learning, the number of nodes
in each layer is based on the ratio of the number of nodes in
two adjacent layers of our model (Table III (bottom)). As a
result, the object feature among the three features showed the
best result in emotion prediction. In Valence, the object feature
extracted from VGG16 had the best result (mse:1.92). In
arousal, the object feature extracted from AlexNet showed the
best result (mse:1.61). Semantic features also showed lower
error than low-level features. Our model combining all the
features resulted in the best prediction performance, which is
the synergy effect of extracted features from various sources
and deep neural network, which is a powerful expression
power.
TABLE III
PERFORMANCE OF EACH FEATURE
Low
Object
semantic All
Alexnet VGG16 ResNet
Valence 1.98 1.93 1.92 1.98 1.97 1.64
Arousal 1.66 1.61 1.62 1.68 1.64 1.47
input 438 1000 1000 1000 150 1588
h1 900 2000 300 3000
h2 300 700 100 1000
h3 150 350 350 350 50 500
output 1
C. Comparison with CNN
Some studies have learned emotion classification models
using pre-trained weights learned for image classification [9]–
[13]. We compare our emotion prediction model with CNN-
based emotion prediction model generated by transfer learning.
Two CNN structures are used for comparison; AlexNet and
Vgg19. We first initialize the weights of AlexNet and VGG19
to the weights learned for the image classification. Except for
the final output layer, the other convolution layer and the fully-
connected layer use the existing model structure. Since the
number of the output layer of CNN model based on ImageNet
is 1000, we change the number of output layers to 1 for our
purpose (Valence and Arousal).
Besides, various results can be obtained in transfer learn-
ing. AlexNet has five convolutional layers and three fully
connected layers, and VGG19 has 16 convolutional layers
and three fully connected layers. In transfer learning, we can
determine which layer to freeze and which layer to train.
We experiment with two conditions. The first is that the
convolutional layer is frozen, only the fully connected layer
is learned (conv-frozen), and the second is learning all layers
together (conv-train). The learning environment for the CNN-
based model is almost similar to that of our FFNN model
except for the batch size and learning rate. The learning
rates of both conv-frozen network and conv-train network are
0.5 × 10−4, and the batch size of both CNN models is 50.
When we train the CNN models, including our model, we use
same training and test dataset.
The results are shown in the Table IV. The second column
shows the training range, conv-frozen means that only the
fully connected layer has been learned, and conv-train means
TABLE IV
COMPARISION WITH OTHER LEARNING METHOD
Emotion Train range AlexNet VGG19 Linear SVR Ours
Valence
conv-frozen 2.76 2.64
2.42 1.75 1.64
conv-train 2.64 2.60
Arousal
conv-frozen 1.95 1.91
2.65 1.54 1.47
conv-train 1.89 1.87
that all layers have been learned. From the training range
perspective, it can be seen that the error of the transfer
learning of the entire network is smaller than that of the fully
connected layer transfer learning. This result implies that the
filter in low-level, as well as the filters in high-level, must
be learned in order to achieve a better performance emotion
prediction model. In other words, the convolutional layer and
the fully connected layer must be learned together. On the
model side, we can see that the performance of the model
learned by using the structure of Vgg19 Network is better
than AlexNet. However, the results of both models are much
more error-prone than our proposed emotion based FFNN.
If CNN-based models have the same type of data with the
same class, such as object detection or image classification,
the learning is well done and the prediction performance is
excellent. However, as mentioned earlier, images with different
shapes can have the same emotions, and images with similar
shapes can have different emotions. We also conducted the
test using other machine learning methods with same dataset.
Linear regression and Support vector regression method were
used. Compared to the CNN-based model, the performance of
both models is better, but the results of our model still show
the best performance (See Table IV).
VII. CONCLUSION
In this paper, we presented a new emotion recognition
system with a deep learning framework. To reduce the affec-
tive gap, we designed and extracted objects and background
semantic features as high-level features, and showed that
these features are effective for emotion prediction. Both high-
level features and low-level features complement each other
well, which leads to better emotion recognition performance.
As expected, the accuracy of the object recognition has an
impact on the performance of the emotion prediction. The
object features with incorrect recognition may lead to incorrect
emotion prediction results. There is also a problem when the
main object of the image is not included in the existing 1000
classes. However, with the rapid progress in the deep learning
technology with large dataset, the accuracy as the number of
classes will be increased, which in turn will also help our
emotion recognition system.
As an interesting future work, one can consider the presence
of a person in the image and the facial expression. A facial
expression is one of the features that can greatly affect the
emotion prediction. Even if the overall mood of the image
is dark, smiling face can mitigate negative emotion a little
(Figure 15). In addition, when the face occupies most of
the part in the photograph, facial expression and emotion are
directly connected. We will consider enhancing emotion recog-
8. 8
V: 3.17/2.50
A: 6.30/7.00
V: 4.55/3.83
A: 5.00/5.17
V: 4.89/4.57
A: 5.51/5.57
V: 4.87/4.08
A: 5.56/5.53
V: 3.91/4.40
A: 4.42/4.20
V: 3.46/3.80
A: 4.33/4.80
V: 4.28/3.80
A: 4.78/4.00
V: 3.97/4.67
A: 4.56/5.33
V: 3.54/4.00
A: 4.86/4.40
V: 4.80/4.29
A: 4.24/3.70
V: 3.84/4.60
A: 3.47/3.20
V: 5.68/4.60
A: 5.77/6.00
V: 4.14/3.00
A: 3.54/3.17
V: 7.37/7.50
A: 7.05/7.33
V: 6.43/6.89
A: 6.07/5.33
V: 7.08/6.80
A: 5.81/5.20
V: 7.41/8.00
A: 6.98/6.80
V: 7.36/6.60
A: 6.80/6.80
V: 6.87/7.33
A: 6.51/7.00
V: 6.65/7.00
A: 6.23/6.80
V: 6.06/6.60
A: 5.36/6.00
V: 7.25/7.29
A: 4.70/4.14
V: 7.63/8.00
A: 3.89/4.00
V: 7.27/7.17
A: 3.78/4.00
V: 5.84/5.80
A: 3.32/3.40
V: 6.65/6.60
A: 5.32/5.20
V: 3.63/2.86
A: 4.59/5.57
91.6% / 91.3% 90.1% / 99.6% 86.5% / 97.1% 96.5% / 92.4% 94.3% / 93.9% 92.6% / 97.8% 98.4% / 96.5%
90.4% / 87.8% 94.3% / 94.3% 91.0% / 97.9% 96.0% / 99.3% 94.3% / 90.8% 90.5% / 100% 95.6% / 92.9%
95.8% / 94.1% 91.3% / 90.4% 93.6% / 93.3% 99.5% / 99.0% 99.4% / 98.5% 93.3% / 92.0% 99.5% / 93.0%
90.5% / 96.6% 93.9% / 97.3% 94.0% / 90.3% 85.8% / 95.4% 98.8% / 97.3% 95.4% / 98.6%
Fig. 10. Prediction results. The values below each image show the prediction results of FFNN and ground truth emotion values, respectively. The prediction
accuracy results (V/A) are also shown.
nition performance by adding a facial expression recognition
framework.
Several studies have demonstrated that biometric data has a
positive effect on emotion recognition [40], [41]. We can also
consider using biometric data or an observer’s facial feature
as additional features. However, in general, deep networks
require thousands or tens of thousands of data, and collecting
these biometric and facial data is not easy task. We will try to
find the method to improve the performance of the model by
considering a small amount of biometric data, which is left as
another future work.
We also built a database for the emotion estimation with
the V-A model and will continue to collect more data. We
expect our dataset will be widely used in the field of affective
computing.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural infor-
mation processing systems, 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 1–9.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” arXiv preprint arXiv:1512.03385, 2015.
[5] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587.
[7] Z. Cheng, Q. Yang, and B. Sheng, “Deep colorization,” in Proceedings
of the IEEE International Conference on Computer Vision, 2015, pp.
415–423.
[8] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolu-
tional network for image super-resolution,” in European Conference on
Computer Vision. Springer, 2014, pp. 184–199.
[9] V. Campos, A. Salvador, X. Giro-i Nieto, and B. Jou, “Diving deep
into sentiment: Understanding fine-tuned cnns for visual sentiment
prediction,” in Proceedings of the 1st International Workshop on Affect
& Sentiment in Multimedia. ACM, 2015, pp. 57–62.
[10] K.-C. Peng, T. Chen, A. Sadovnik, and A. C. Gallagher, “A mixed
bag of emotions: Model, predict, and transfer emotion distributions,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 860–868.
[11] Q. You, J. Luo, H. Jin, and J. Yang, “Robust image sentiment analysis
9. 9
3.96/3.89 (99.1%)
3.88/3.20 (91.5%) 3.95/3.25 (91.3%)
4.97/4.60 (95.4%)
4.60/4.40 (97.5%)
4.46/4.2 (96.8%)
4.19/3.67 (93.5%)
4.90/5.10 (97.5%)
4.10/4.71 (92.4%)
4.33/3.83 (93.8%) 4.61/4.60 (99.9%)
5.12/5.13 (99.9%) 5.18/5 (97.8%)
5.11/5.33 (97.3%)
5.75/5.80 (99.4%)
6.56/6.80 (97.0%)
6.36/6.00 (95.5%)
6.77/6.80 (99.6%) 7.28/7.57 (96.4%)
5.79/5.83 (99.5%)
5.68/5.60 (99.0%)
4.20/4 (97.5%)
5.90/6.00 (98.8%)
7.65/7.88 (97.1%) 7.67/8.00 (95.9%)
7.23/7.00 (97.1%)
Fig. 11. Valence prediction results of the images with low arousal. The values below each image represent the predicted valence value and ground truth value
(prediction/ground truth). The prediction accuracy result is also shown. From the top left to the bottom right, the value of valence increases. All images in
this example have low arousal values.
3.22/3.00 (97.3%) 3.98/3.59 (95.1%)
3.83/3.60 (97.1%)
3.78/3.40 (95.3%) 3.84/3.80 (99.5%) 3.88/3.60 (96.5%) 4.51/4.50 (99.9%)
5.77/6.00 (97.1%)
5.43/5.17 (96.8%)
4.90/4.80 (98.8%) 5.19/4.80 (95.1%)
7.63/8.00 (95.4%)
7.24/7.29 (99.4%) 7.33/7.00 (95.9%)
6.94/7.00 (99.3%)
6.52/6.70 (97.8%)
4.99/4.80 (97.6%)
4.15/3.80 (95.6%)
6.50/6.89 (95.1%)
7.70/8.00 (96.3%)
7.46/7.63 (97.9%)
7.42/8.13 (91.1%) 7.51/7.88 (95.4%)
4.91/4.40 (93.6%)
7.19/7.40 (97.4%)
7.42/7.00 (94.8%)
7.18/7.83 (91.9%)
7.02/6.60 (94.8%)
Fig. 12. Valence prediction results of the images with high arousal. The values below each image represent the predicted valence value and ground truth
value (prediction/ground truth). The prediction accuracy result is also shown. From the top left to the bottom right, the value of valence increases. All images
in this example have high arousal values.
10. 10
3.98/3.89 (98.9%) 3.46/3.20 (96.8%) 3.88/3.83 (99.4%) 3.87/3.67 (97.5%)
4.67/4.33 (95.8%) 4.75/4.80 (99.4%)
4.75/4.60(98.1%)
4.60/4.40 (97.5%)
4.47/4.60 (98.4%) 4.49/4.40 (98.9%)
4.65/4.50 (98.1%) 4.70/5.20 (93.8%)
4.17/4.00 (97.9%)
5.05/5.60(93.1%)
5.34/5.60 (96.8%) 5.40/5.20 (97.5%)
5.24/5.80 (93.0%) 5.43/6.00 (92.9%) 6.23/6.33 (98.8%)
5.29/5.60 (96.1%)
4.19/4.50 (96.1%) 4.40/4.60 (97.5%)
4.14/4.00 (98.3%)
4.50/4.00 (93.8%)
5.69/ 5.85 (98.0%)
5.16/4.75 (94.9%)
Fig. 13. Arousal prediction results of the images with low valence. The values below each image represent the predicted arousal value and ground truth value
(prediction/ground truth). The prediction accuracy result is also shown. From the top left to the bottom right, the value of arousal increases. All images in
this example have low valence values.
3.70/4.20 (93.8%) 3.74/3.40 (95.8%) 3.88/3.50 (95.3%) 3.21/3.00 (97.4%) 3.33/3.60 (96.6%) 3.85/3.80 (99.4%)
3.97/4.50 (93.4%) 3.61/3.67 (99.3%) 3.99/4.00 (99.9%) 3.42/3.20 (97.3%) 3.88/4.00 (98.5%) 3.96/4.17 (97.4%)
4.70/4.88 (97.8%) 5.00/5.00 (100.0%) 5.80/6.20 (95.0%) 6.33/6.33 (100.0%) 6.44/6.40 (99.5%) 5.76/6.00 (97.0%)
5.87/6.25 (95.3%) 5.82/6.20 (95.3%) 5.68/6.00 (96.0%) 6.13/6.30 (97.9%) 6.65/6.83 (97.8%) 7.22/7.40 (97.8%)
6.95/6.80 (98.1%)
Fig. 14. Arousal prediction results of the images with high valence. The values below each image represent the predicted arousal value and ground truth
value (prediction/ground truth). The prediction accuracy result is also shown. From the top left to the bottom right, the value of arousal increases. All images
in this example have high valence values.
11. 11
Fig. 15. Different emotions with different facial expressions.
using progressively trained and domain transferred deep networks,” in
Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[12] ——, “Building a large scale dataset for image emotion recognition:
The fine print and the benchmark,” 2016.
[13] C. Xu, S. Cetintas, K.-C. Lee, and L.-J. Li, “Visual sentiment
prediction with deep convolutional neural networks,” arXiv preprint
arXiv:1411.5731, 2014.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, 1998.
[15] A. Hanjalic, “Extracting moods from pictures and sounds: Towards truly
personalized tv,” IEEE Signal Processing Magazine, vol. 23, no. 2, pp.
90–100, 2006.
[16] X. Lu, P. Suryanarayan, R. B. Adams Jr, J. Li, M. G. Newman, and J. Z.
Wang, “On shape and the computability of emotions,” in Proceedings
of the 20th ACM international conference on Multimedia. ACM, 2012,
pp. 229–238.
[17] J. Machajdik and A. Hanbury, “Affective image classification using
features inspired by psychology and art theory,” in Proceedings of the
18th ACM international conference on Multimedia. ACM, 2010, pp.
83–92.
[18] S. Zhao, Y. Gao, X. Jiang, H. Yao, T.-S. Chua, and X. Sun, “Exploring
principles-of-art features for image emotion recognition,” in Proceedings
of the 22nd ACM international conference on Multimedia. ACM, 2014,
pp. 47–56.
[19] S. Haykin, Neural Networks: A Comprehensive Foundation. Prentice
Hall, 1999. [Online]. Available: https://books.google.co.kr/books?id=
3-1HPwAACAAJ
[20] C. E. Osgood, “The nature and measurement of meaning.” Psychological
bulletin, vol. 49, no. 3, p. 197, 1952.
[21] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, “Large-scale
visual sentiment ontology and detectors using adjective noun pairs,” in
Proceedings of the 21st ACM international conference on Multimedia.
ACM, 2013, pp. 223–232.
[22] T. Chen, D. Borth, T. Darrell, and S.-F. Chang, “Deepsentibank: Vi-
sual sentiment concept classification with deep convolutional neural
networks,” arXiv preprint arXiv:1410.8586, 2014.
[23] V. Campos, B. Jou, and X. Giro-i Nieto, “From pixels to sentiment:
Fine-tuning cnns for visual sentiment prediction,” Image and Vision
Computing, 2017.
[24] J. Islam and Y. Zhang, “Visual sentiment analysis for social images
using transfer learning approach,” in Big Data and Cloud Computing
(BDCloud), Social Computing and Networking (SocialCom), Sustainable
Computing and Communications (SustainCom)(BDCloud-SocialCom-
SustainCom), 2016 IEEE International Conferences on. IEEE, 2016,
pp. 124–130.
[25] M. M. Bradley and P. J. Lang, “Measuring emotion: the self-assessment
manikin and the semantic differential,” Journal of behavior therapy and
experimental psychiatry, vol. 25, no. 1, pp. 49–59, 1994.
[26] J. A. Russell, “A circumplex model of affect,” 1980.
[27] Flickr, “http://www.flickr.com.”
[28] P. J. Lang, M. M. Bradley, and B. N. Cuthbert, “International affective
picture system (iaps): Affective ratings of pictures and instruction
manual,” Technical report A-8, 2008.
[29] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley, “Color transfer
between images,” IEEE Computer graphics and applications, no. 5, pp.
34–41, 2001.
[30] G. Csurka, S. Skaff, L. Marchesotti, and C. Saunders, “Learning moods
and emotions from color combinations,” in Proceedings of the Seventh
Indian Conference on Computer Vision, Graphics and Image Processing.
ACM, 2010, pp. 298–305.
[31] L. He, H. Qi, and R. Zaretzki, “Image color transfer to evoke differ-
ent emotions based on color combinations,” Signal, Image and Video
Processing, pp. 1–9, 2014.
[32] H.-R. Kim, H. Kang, and I.-K. Lee, “Image recoloring with valence-
arousal emotion model,” in Computer Graphics Forum, vol. 35, no. 7.
Wiley Online Library, 2016, pp. 209–216.
[33] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus, “Learning
color names for real-world applications,” IEEE Transactions on Image
Processing, vol. 18, no. 7, pp. 1512–1523, 2009.
[34] P. Valdez and A. Mehrabian, “Effects of color on emotions.” Journal of
experimental psychology: General, vol. 123, no. 4, p. 394, 1994.
[35] A. B. Warriner, V. Kuperman, and M. Brysbaert, “Norms of valence,
arousal, and dominance for 13,915 english lemmas,” Behavior research
methods, vol. 45, no. 4, pp. 1191–1207, 2013.
[36] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:
A large-scale hierarchical image database,” in Computer Vision and
Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE,
2009, pp. 248–255.
[37] Z. Wu, C. Shen, and A. v. d. Hengel, “Wider or deeper: Revisiting the
resnet model for visual recognition,” arXiv preprint arXiv:1611.10080,
2016.
[38] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale
machine learning on heterogeneous distributed systems,” arXiv preprint
arXiv:1603.04467, 2016.
[39] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, and S.-F. Chang, “Design-
ing category-level attributes for discriminative visual recognition,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2013, pp. 771–778.
[40] L. Aftanas, N. Reva, A. Varlamov, S. Pavlov, and V. Makhnev, “Analysis
of evoked eeg synchronization and desynchronization in conditions of
emotional activation in humans: temporal and topographic characteris-
tics,” Neuroscience and behavioral physiology, vol. 34, no. 8, pp. 859–
867, 2004.
[41] G. Chanel, J. Kronegg, D. Grandjean, and T. Pun, “Emotion assessment:
Arousal evaluation using eegâĂŹs and peripheral physiological signals,”
Multimedia content representation, classification and security, pp. 530–
537, 2006.