Emotion Detection from Speech to Enrich Multimedia Content
Feng Yu1*, Eric Chang2, Ying-Qing Xu2, Heung-Yeung Shum2
1Dept. of Computer Science and Technology, Tsinghua Univ., Beijing 100084, P.R.C.
yufeng99@stumail.tsinghua.edu.cn
2Microsoft Research China, 3/F Beijing Sigma Center, Beijing 100080, P.R.C.
{echang,yqxu,hshum}@microsoft.com
* Visiting Microsoft Research China from the Department of Computer Science and Technology, Tsinghua University, Beijing, China
Abstract. This paper describes an experimental study on the detection of
emotion from speech. As computer-based characters such as avatars and virtual
chat faces become more common, the use of emotion to drive the expression of
the virtual characters becomes more important. This study utilizes a corpus
containing emotional speech with 721 short utterances expressing four
emotions: anger, happiness, sadness, and the neutral (unemotional) state, which
were captured manually from movies and teleplays. We introduce a new
concept to evaluate emotions in speech. Emotions are so complex that most
speech sentences cannot be precisely assigned to a particular emotion category;
however, most emotional states nevertheless can be described as a mixture of
multiple emotions. Based on this concept we have trained SVMs (support
vector machines) to recognize utterances within these four categories and
developed an agent that can recognize and express emotions.
1 Introduction
Nowadays, with the proliferation of the Internet and multimedia, many kinds of
multimedia equipment are available. Even ordinary users can easily record or
download video and audio data by themselves. Can we determine the contents of
such multimedia data quickly with the computer's help? The ability to detect the
expressed emotion in each given utterance, and to render corresponding facial expressions, would
help improve the naturalness of a computer-human interface.
Certainly, emotion is an important factor in communication, and people express
emotions not only verbally but also by non-verbal means. Non-verbal means consist
of body gestures, facial expressions, modifications of prosodic parameters, and
changes in the spectral energy distribution [12]. Often, people can evaluate human
emotion from the speaker’s voice alone since intonations of a person’s speech can
reveal emotions. Simultaneously, facial expressions also vary with emotions. There is
a great deal of mutual information between vocal and facial expressions. Our own
research concentrates on how to form a correspondence between emotional speech
and expressions in a facial image sequence. We already have a controllable cartoon
facial model that can generate various facial images based on different emotional state
inputs [14]. This system could be especially important in situations where speech is
the primary mode of interaction with the machine.
How can facial animation be produced using audio to drive a facial control model?
Speech-driven facial animation is an effective technique for user interfaces and has
been an active research topic over the past twenty years. Various audio-visual
mapping models have been proposed for facial animation [1-3]. However, these
methods only synchronize facial motions with speech and can rarely animate facial
expressions automatically. In addition, the complexity of audio-visual mapping
relations makes the synthesis process language-dependent and less effective.
In the computer speech community, much attention has been given to “what was
said” and “who said it”, and the associated tasks of speech recognition and speaker
identification, whereas “how it was said” has received relatively little attention. Most
importantly in our application, we need an effective tool by which we can easily tell
“how it was said” for each utterance.
Previous research on emotions, both in psychology and in speech, tells us that we can
find information associated with emotions from a combination of prosodic, tonal and
spectral information; speaking rate and stress distribution also provide some clues
about emotions [6, 7, 10, 12]. Prosodic features are multi-functional. They not only
express emotions but also serve a variety of other functions, such as word and
sentence stress or syntactic segmentation. The role of prosodic information within the
communication of emotions has been studied extensively in psychology and psycho-
linguistics. More importantly, fundamental frequency and intensity in particular vary
considerably across speakers and have to be normalized properly [12].
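For illustration, a minimal sketch of one such normalization follows. It z-scores each speaker's voiced F0 values so that differences in individual pitch range do not dominate later features; the per-speaker z-score is an assumption on our part, not the specific normalization used in [12].

import numpy as np

def znorm_f0_per_speaker(f0_by_speaker):
    """Z-score each speaker's voiced F0 frames (unvoiced frames stay 0).
    f0_by_speaker: dict mapping speaker id -> array of per-frame F0 in Hz,
    with 0 marking unvoiced frames (assumed input format)."""
    normalized = {}
    for speaker, f0 in f0_by_speaker.items():
        f0 = np.asarray(f0, dtype=float)
        voiced = f0 > 0
        mean, std = f0[voiced].mean(), f0[voiced].std() + 1e-8
        out = np.zeros_like(f0)
        out[voiced] = (f0[voiced] - mean) / std  # speaker-relative pitch
        normalized[speaker] = out
    return normalized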
What kinds of features might carry more information about the emotional meaning
of each utterance? Because of the diversity of languages and the different roles and
significance of features in different languages, they cannot be treated equally [13]. It is
hard to determine which features carry more information, and how to combine these
features to obtain a better recognition rate.
Research in automatic detection of expressed emotion is quite limited. Recent
research in this area mostly focuses on classification, in other words, it mostly
aims at ascertaining the emotion of each utterance. This, however, is insufficient for
our applications. To describe the degree, mixture, and variety of emotions in speech
more realistically and naturally, we present a novel criterion. Based on this criterion,
emotion information contained in utterances can be evaluated well.
We assume that there is an emotion space corresponding to our existing facial
control model. In [11] Pereira described his research on dimensions of emotional
meaning in speech, but our emotion space is quite different from his. The
facial control model contains sets of emotional facial templates of different degrees
drawn by an artist. Within this emotion space the special category “neutral” lies at the
origin; the other categories are associated with the axis directions of this space. With this
assumption, we can relate our cartoon facial control model to emotions. We also
would like to determine the location in this emotion space that corresponds to a given
emotional utterance, unlike other methods that simply give a classification result.
This part of the investigation is confined to information within the utterance.
Various classification algorithms have been used in recent studies on emotion
recognition in speech, such as Nearest Neighbor, NN (Neural Network), MLB
(Maximum-Likelihood Bayes), KR (Kernel Regression), GMM (Gaussian Mixture
Model), and HMM (Hidden Markov Model) [5, 6, 9, 12]. As appropriate for our
implementation, we choose the SVM as our classification algorithm.
In our investigation, we have captured a corpus containing emotional speech from
movies and teleplays, with over 2000 utterances from several different speakers.
Since we model only four kinds of basic emotions—“neutral”, “anger”, “happiness”
and “sadness”—we obtain good recognition accuracy. A total of 721 of the most
characteristic short utterances in these four emotional categories were selected from
the corpus.
2 Experimental Study
Because in practice only the emotions “neutral”, “anger”, “happiness” and “sadness”
lead to good recognition accuracy, we deal just with these four representative
categories in our application even though this small set of emotions does not provide
enough range to describe all types of emotions. Furthermore, some utterances can
hardly be evaluated as one particular emotion. Nevertheless, we can still find
utterances that can be classified solely as one kind of emotion, which we call pure emotional
utterances.
We construct an emotion space in which the special category “neutral” is at the
origin, and the other categories are measured along the axes; all pure emotions
correspond to the points lying directly on an axis (or if we relax the restrictions,
near an axis); the distance from these points to the origin denotes the degree of
these emotional utterances. When the coordinates of a point have more than one
nonzero value, the utterance contains more than one kind of emotion and cannot be
ascribed to any single emotion category.
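As a small illustration (in Python; the axis ordering and the near-zero threshold are our own choices, not values from the paper), a point in this assumed emotion space can be interpreted as follows:

import numpy as np

# Assumed axis ordering for the emotion space; "neutral" is the region
# around the origin rather than an axis of its own.
AXES = ("anger", "happiness", "sadness")

def describe_point(point, neutral_radius=0.05):
    """Interpret a point in the emotion space: the distance from the origin
    gives the degree, and the set of nonzero coordinates gives the mixture."""
    p = np.asarray(point, dtype=float)
    degree = float(np.linalg.norm(p))
    active = [axis for axis, value in zip(AXES, p) if value > 0]
    if degree < neutral_radius:
        return "neutral", degree
    if len(active) == 1:
        return "pure " + active[0], degree
    return "mixture of " + ", ".join(active), degree

# Example: an utterance that is mostly angry with some sadness.
print(describe_point([0.8, 0.0, 0.3]))   # ('mixture of anger, sadness', ~0.85)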
We further consider utterances whose emotion type is undoubtedly “neutral” as
corresponding to the region closely surrounding the origin of the emotion space. For
each of the other three categories, taking “anger” as an example, utterances which are
undoubtedly “anger” have a strong correspondence with the anger axis.
Since people cannot express emotions to an infinite degree, we assume that each
axis has an upper limit based on extraordinarily emotional utterances from the real
world. So we choose extraordinary utterances for each emotion as our training data.
Since people cannot measure the degree of emotions precisely, we simply choose
utterances that are determined to portray a given emotion by almost 100% of the
subjects to find the origin and the upper limits of the three axes.
Our approach is considerably different from those of other researchers. Other
methods can only perform classification to tell which emotional category an utterance
belongs to. Our method can handle more complicated problems, such as utterances
that contain multiple emotions and the degrees of each emotion.
2.1 Corpus of Emotional Data
We need an extensive amount of training data to accurately estimate statistical
models. So speech segments from Chinese teleplays are chosen as our corpus. By
using teleplays (one film is still not long enough to satisfy our requirement), we were
able to collect a large supply of emotional speech samples in a short amount of time.
Previous experiments also indicate that the emotions in acted speech can be
consistently decoded by humans and automatic systems [6], which provided further
motivation for their use.
The teleplay files were obtained from Video CDs, with audio data extracted at
a sampling rate of 16 kHz and a resolution of 16 bits per sample. We employed
three students to capture and segment these speech data files.
A total of more than 2000 utterances were captured, segmented and pre-tagged
from the teleplays. The chosen utterances are all preceded and followed by silence
with no background music or any other kinds of background noise. The expressed
emotion within an utterance has to be constant.
All of these utterances need to be subjectively tagged as one of the four classes.
Only pure emotional utterances are usable in accurately forming statistical models.
One of these students and a researcher tagged these utterances. They listened to and
tagged all of the utterances several times. Each time, if the tag of an utterance differed
from its previous designation, this utterance was removed from our corpus. The initial
tags were those that the three students pre-tagged for all of the over 2000 utterances.
Each tagging session was separated by several days.
After tagging several times, only 721 utterances remained. The numbers of
waveforms which belong to each emotion category are shown in Table 1.
Table 1. Data sets
Anger   Happiness   Neutral   Sadness
215     136         242       128
All data files are 16 kHz, 16-bit waveforms.
2.2 Feature Extraction
Previous research has shown some statistics of the pitch (fundamental frequency F0)
to be the main vocal cue for emotion recognition. Also, the first and second formants,
vocal energy, frequency spectral features and speaking rate contribute to vocal
emotion signaling [6].
In our study, the evaluation features of voice are mainly extracted from pitch, and the
features that we derive from pitch are sufficient for most of our needs. Our approach
to choosing and extracting features mainly follows the method of [6].
First, we obtained the pitch sequence using an internally developed pitch extractor
[4]. Then we smoothed the pitch contour using smoothing cubic splines. The resulting
approximation of the pitch is smooth and continuous, and it enables us to measure
features of the pitch: the pitch derivative, pitch slopes, and the behavior of their
minima and maxima over time.
We have measured a total of sixteen features, grouped under the headings below:
• Statistics related to rhythm: Speaking rate, Average length between voiced
regions, Number of maxima / Number of (minima + maxima), Number of
upslopes / Number of slopes;
• Statistics on the smoothed pitch signal: Min, Max, Median, Standard deviation;
• Statistics on the derivative of the smoothed pitch: Min, Max, Median,
Standard deviation;
• Statistics over the individual voiced parts: Mean min, Mean max;
• Statistics over the individual slopes: Mean positive derivative, Mean negative
derivative.
All these features are calculated only in the valid region which begins at the first
non-zero pitch point and ends at the last non-zero pitch point of each utterance.
The features in the first group are related to rhythm. Rhythm is represented by the
shape of a pitch contour. We assume the inverse of the average length of the voiced
part of an utterance denotes the speaking rate; the average length between voiced
regions can denote pauses in an utterance.
The features in the second and third groups are general features of the pitch signal
and its derivative.
In each individual voiced part we can easily find minima and maxima. We choose
the mean of the minima and the mean of the maxima as our fourth group of features.
We can compute the derivative of each individual slope. If the slope is an upslope,
the derivative is positive; otherwise the derivative is negative. The mean of these
positive derivatives and mean of negative derivatives are our features in the fifth
group.
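A rough sketch of this feature extraction is given below. It assumes the pitch extractor delivers one F0 value per frame (0 for unvoiced frames) at a hypothetical frame rate, uses SciPy's smoothing cubic spline in place of the paper's smoothing routine, and computes only a representative subset of the sixteen features; the remaining per-segment and per-slope statistics follow the same pattern.

import numpy as np
from scipy.interpolate import UnivariateSpline

def pitch_features(f0, frame_rate=100.0):
    """Illustrative subset of the pitch features described above.
    f0: per-frame pitch in Hz, 0 for unvoiced frames (assumed input format).
    frame_rate: frames per second (hypothetical value)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    # Valid region: from the first to the last non-zero pitch point.
    first, last = np.argmax(voiced), len(f0) - 1 - np.argmax(voiced[::-1])
    f0, voiced = f0[first:last + 1], voiced[first:last + 1]
    t = np.arange(len(f0)) / frame_rate

    # Smooth the pitch contour with a cubic smoothing spline over voiced frames.
    spline = UnivariateSpline(t[voiced], f0[voiced], k=3)
    pitch = spline(t)
    dpitch = spline.derivative()(t)

    # Voiced segments and the unvoiced gaps between them.
    bounds = np.flatnonzero(np.diff(voiced.astype(int))) + 1
    runs = np.split(np.arange(len(voiced)), bounds)
    seg_lens = [len(r) / frame_rate for r in runs if voiced[r[0]]]
    gap_lens = [len(r) / frame_rate for r in runs if not voiced[r[0]]] or [0.0]

    # Local extrema of the smoothed contour.
    s = np.sign(np.diff(pitch))
    n_max = int(np.sum((s[:-1] > 0) & (s[1:] < 0)))
    n_min = int(np.sum((s[:-1] < 0) & (s[1:] > 0)))
    n_up = int(np.sum(dpitch > 0))  # rising frames, as a proxy for upslopes

    return {
        # Group 1: rhythm-related statistics.
        "speaking_rate": float(1.0 / np.mean(seg_lens)),
        "avg_gap": float(np.mean(gap_lens)),
        "maxima_ratio": n_max / max(n_max + n_min, 1),
        "upslope_ratio": n_up / len(dpitch),
        # Group 2: statistics of the smoothed pitch.
        "pitch_min": float(pitch.min()), "pitch_max": float(pitch.max()),
        "pitch_median": float(np.median(pitch)), "pitch_std": float(pitch.std()),
        # Group 3: statistics of the pitch derivative (groups 4 and 5,
        # the per-segment and per-slope means, follow the same pattern).
        "dpitch_min": float(dpitch.min()), "dpitch_max": float(dpitch.max()),
        "dpitch_median": float(np.median(dpitch)), "dpitch_std": float(dpitch.std()),
    }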
2.3 Performance of Emotion Evaluator
The classification algorithms used in this area of research are mostly based on K-
nearest-neighbors (KNN) or neural networks (NN). Considering our application, we
need not only classification results, but also the proportion of each emotion that an
utterance contains. After some experimentation, we chose the support vector machine (SVM) as
our evaluation algorithm [8], because of its high speed and because each SVM can give an
evaluation for its emotion category. From the training data, we can find the origin and
the three axes.
Because different features are extracted from audio data in different ways and the
relationships among these features are complex, we chose a Gaussian kernel
$K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / (2\sigma^2)}$ to be our SVM kernel function.
For each emotional state, a model is learned to separate its type of utterances from
others. We refer to SVMs trained in this way as 1-v-r SVMs (short for one-versus-
rest). The scheme we adopted learns a separate SVM model for each category, each of
which distinguishes its kind of emotion from the others.
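A minimal sketch of this one-versus-rest training is shown below. It uses scikit-learn's SVC with an RBF kernel rather than the toolkit of [8]; the gamma parameter plays the role of 1/(2σ²) in the kernel above, and all parameter values and helper names are illustrative.

import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["anger", "happiness", "neutral", "sadness"]

def train_one_vs_rest(X, y, gamma=0.1, C=1.0):
    """Train one RBF-kernel SVM per emotion category (1-v-r scheme).
    X: (n_utterances, 16) feature matrix; y: list of emotion labels."""
    y = np.asarray(y)
    models = {}
    for emotion in EMOTIONS:
        target = (y == emotion).astype(int)   # this emotion vs. the rest
        models[emotion] = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, target)
    return models

def evaluate(models, x):
    """Signed SVM outputs f_i(x) for one utterance's 16-dim feature vector."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    return {e: float(m.decision_function(x)[0]) for e, m in models.items()}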
Our preliminary experimental results indicate that we can obtain satisfactory
results only when there are at least 200 different utterances in each emotion category.
Since each SVM deals with just a two-class problem and the performance of an
SVM classifier depends only on these two classes, the decision boundary will tend to favor
the class that contains more data. To avoid this kind of skewing, we balance the
training data set of the SVM. Taking “anger” as an example, we choose about 150
utterances in the “anger” state and also choose about 150 utterances from other
emotion categories, with approximately the same number chosen from each of the
other categories. In this way, the results are much better than those learned from
imbalanced training data sets. Note that the training data can be replicated to balance
the data sets.
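The balancing step might look roughly like the following sketch; the target of about 150 utterances per side follows the description above, while the sampling helper and replication policy are our own illustration.

import numpy as np

def balanced_training_set(X, y, target_emotion, n_per_side=150, seed=0):
    """Pick ~n_per_side utterances of the target emotion and ~n_per_side drawn
    evenly from the other categories, replicating samples of a category that
    is too small (as noted above, training data can be replicated)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    categories = [c for c in np.unique(y) if c != target_emotion]
    per_other = n_per_side // len(categories)

    def pick(indices, n):
        return rng.choice(indices, n, replace=len(indices) < n)

    chosen = [pick(np.flatnonzero(y == target_emotion), n_per_side)]
    chosen += [pick(np.flatnonzero(y == c), per_other) for c in categories]
    idx = np.concatenate(chosen)
    return X[idx], (y[idx] == target_emotion).astype(int)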
The training data and performances for each SVM are shown in Table 2. The
remainder of the data set not used during learning for each individual SVM is used
as testing data for that SVM.
Table 2. SVM training data sets and performance
Category     One    Rest   Accuracy on test set
Anger        162    147    77.16%
Happiness    102     94    65.64%
Neutral      194    193    83.73%
Sadness       96     96    70.59%
("One" and "Rest" give the numbers of training utterances of the target emotion and of the other emotions, respectively.)
Given an emotional utterance, we compute its feature vector, evaluate it with each
SVM, and collect the evaluations; together these form the emotional evaluation of the utterance.
the utterance. If only one evaluation is greater than 0, ( ( ) 0
x
fi
, 3
0 ≤
≤ i , ( ) 0
x
f j
,
j
i ≠ ), we label this utterance as this particular kind of emotional utterance; if more
than one evaluation is greater than 0, ( ( ) 0
x
fi
, ( ) 0
x
f j
, 3
,
0 ≤
≤ j
i , j
i ≠ ), we
label the emotion of this utterance as a mixture of several kinds of emotions, each
proportional to the emotion’s SVM evaluation. If all evaluations are less than 0,
( ( ) 0
x
fi
, 3
..
0
=
i ), we can say the emotion of this utterance is undefined in our
system.
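Putting the four evaluations together, the labelling rule just described could be sketched as follows; normalizing the positive evaluations into proportions is our reading of "each proportional to the emotion's SVM evaluation", and treating "neutral" like the other three outputs in the mixture is an assumption on our part.

def label_utterance(scores):
    """Apply the rule above to a dict of SVM evaluations {emotion: f_i(x)}.
    Returns None when every evaluation is negative (emotion undefined),
    otherwise a dict of emotion proportions (a single key for a pure emotion,
    several keys for a mixture)."""
    positive = {emotion: s for emotion, s in scores.items() if s > 0}
    if not positive:
        return None                    # undefined in our system
    total = sum(positive.values())
    return {emotion: s / total for emotion, s in positive.items()}

# Example with hypothetical evaluations: a mixture of anger and sadness.
print(label_utterance({"anger": 1.2, "happiness": -0.4,
                       "neutral": -0.9, "sadness": 0.3}))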
2.4 Comparison
We have also compared the effectiveness of the SVM classifier to the K-nearest
neighbor classifier and the neural network classifier. One can observe that the SVM
classifier compares favorably to the other two types of classifiers.
Table 3. Comparison of NN, KNN and SVM
             Accuracy (%)
Method   Anger   Happiness   Neutral   Sadness
NN       40.00   27.78       62.68     35.71
KNN      42.86   39.28       89.29     32.14
SVM      77.16   65.64       83.73     70.59
Remark: In each category there are 100 learning utterances, and all remaining
utterances are used for testing.
3 Conclusions and Discussion
Compared with KNN, training an SVM model gives a good classifier without needing
much training time. Even if we do not know the exact relationships between the
features, we still can obtain good results. After we produce the SVM model from
training data sets, these training data sets are no longer needed since the SVM model
contains all the useful information. So classification does not take much time, and
can almost be applied within real-time rendering. The KNN rule relies on a distance
metric to perform classification, so it is expected that changing this metric would yield
different and possibly better results. Intuitively, one should weight each feature
according to how well it correlates with the correct classification. But in our
investigation, these features are not independent of each other. The performance
landscape in this metric space is quite rugged and optimization is likely expensive.
The SVM can handle this problem well: we need not know the relationships within each
feature pair or the dimensionality of each feature.
Compared with NNs, training an SVM model requires much less time than training
an NN classifier, and SVMs are much more robust than NNs. In our application, the
corpus comes from movies and teleplays. There are many speakers with various
backgrounds. In these kinds of instances, NNs do not work well.
The most important reason why we chose SVMs is that SVMs give a magnitude
for recognition. We need this magnitude for synthesizing expressions with different
degrees. For our future work, we plan to study the effectiveness of our current
approach on data from different languages and cultures.
References
1. Brand, M.: “Voice Puppetry”, Proceedings of the SIGGRAPH, 21-28, 1999.
2. Cassell, J., Bickmore, T., Campbell, L., Chang, K., Vilhjálmsson, H., and Yan, H.:
“Requirements for an architecture for embodied conversational characters”, Proceedings of
Computer Animation and Simulation, 109-120, 1999.
3. Cassell, J., Pelachaud, C., Badler, N.I., Steedman, M., Achorn, B., Beckett, T., Douville,
B., Prevost, S. and Stone, M.: “Animated conversation: rule-based generation of facial display,
gesture and spoken intonation for multiple conversational agents”, Proceedings of the
SIGGRAPH, 28(4): 413-420, 1994.
4. Chang, E., Zhou, J.-L., Di, S., Huang, C., and Lee., K.-F.: Large vocabulary Mandarin
speech recognition with different approaches in modeling tones, International Conference on
Spoken Language Processing, 2000.
5. Roy, D., and Pentland, A.: “Automatic spoken affect analysis and classification”, in
Proceedings of the Second International Conference on Automatic Face and Gesture
Recognition, pp. 363-367, 1996.
6. Dellaert, F., Polzin, T., and Waibel, A.: “Recognizing Emotion in Speech”, Proceedings of
the ICSLP, 1996.
7. Erickson, D., Abramson, A., Maekawa, K., and Kaburagi, T.: “Articulatory Characteristics
of Emotional Utterances in Spoken English” , Proceedings of the ICSLP, 2000.
8. Joachims, T.: Making Large-Scale SVM Training Practical. In: Schölkopf, B., Burges, C.J.C., and Smola, A.J. (eds.): Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.
9. Kang, B.-S., Han C.-H., Lee, S.-T., Youn, D.-H., and Lee, C.-Y.: “Speaker Dependent
Emotion Recognition using Speech Signals” , Proceedings of the ICSLP, 2000.
10. Paeschke, A., and Sendlmeier, W. F.: “Prosodic Characteristics of Emotional Speech:
Measurements of Fundamental Frequency Movements”, Proceedings of the ISCA-Workshop
on Speech and Emotion, 2000.
11. Pereira, C.: “Dimensions of Emotional Meaning in Speech”, Proceedings of the ISCA-
Workshop on Speech and Emotion, 2000.
12. Polzin, T., and Waibel, A.: “Emotion-Sensitive Human-Computer Interfaces”,
Proceedings of the ISCA-Workshop on Speech and Emotion, 2000.
13. Scherer, K.R.: “A Cross-Cultural Investigation of Emotion Inferences from Voice and
Speech: Implications for Speech”, Proceedings of the ICSLP, 2000.
14. Li, Y., Yu, F., Xu, Y.-Q., Chang, E., and Shum, H.-Y.: “Speech-Driven Cartoon
Animation with Emotions”, to appear in ACM Multimedia 2001.