Emotion Detection from Speech to Enrich Multimedia Content
Feng Yu1*, Eric Chang2, Ying-Qing Xu2, Heung-Yeung Shum2
1Dept. of Computer Science and Technology, Tsinghua Univ., Beijing 100084, P.R.C.
yufeng99@stumail.tsinghua.edu.cn
2Microsoft Research China, 3/F Beijing Sigma Center, Beijing 100080, P.R.C.
{echang, yqxu, hshum}@microsoft.com
* Visiting Microsoft Research China from the Department of Computer Science and Technology, Tsinghua University, Beijing, China.
Abstract. This paper describes an experimental study on the detection of
emotion from speech. As computer-based characters such as avatars and virtual
chat faces become more common, the use of emotion to drive the expression of
the virtual characters becomes more important. This study utilizes a corpus
containing emotional speech with 721 short utterances expressing four
emotions: anger, happiness, sadness, and the neutral (unemotional) state, which
were captured manually from movies and teleplays. We introduce a new
concept for evaluating emotions in speech. Emotions are so complex that most
spoken sentences cannot be precisely assigned to a single emotion category;
most emotional states can, however, be described as a mixture of
multiple emotions. Based on this concept we have trained SVMs (support
vector machines) to recognize utterances within these four categories and
developed an agent that can recognize and express emotions.
1 Introduction
Nowadays, with the proliferation of the Internet and multimedia, many kinds of
multimedia equipment are available, and even casual users can easily record or
download video and audio data on their own. Can we determine the content of
this multimedia data expeditiously with the computer's help? The ability to detect
expressed emotion and to render a matching facial expression for each given utterance
would help improve the naturalness of a computer-human interface.
Emotion is certainly an important factor in communication, and people express
emotions not only verbally but also by non-verbal means: body gestures, facial
expressions, modifications of prosodic parameters, and changes in the spectral
energy distribution [12]. Often, people can evaluate human
emotion from the speaker’s voice alone since intonations of a person’s speech can
reveal emotions. Simultaneously, facial expressions also vary with emotions. There is
a great deal of mutual information between vocal and facial expressions. Our own
research concentrates on how to form a correspondence between emotional speech
and expressions in a facial image sequence. We already have a controllable cartoon
facial model that can generate various facial images based on different emotional state
inputs [14]. This system could be especially important in situations where speech is
the primary mode of interaction with the machine.
How can facial animation be produced using audio to drive a facial control model?
Speech-driven facial animation is an effective technique for user interfaces and has
been an active research topic over the past twenty years. Various audio-visual
mapping models have been proposed for facial animation [1-3]. However, these
methods only synchronize facial motions with speech and can rarely animate facial
expressions automatically. In addition, the complexity of audio-visual mapping
relations makes the synthesis process language-dependent and less effective.
In the computer speech community, much attention has been given to “what was
said” and “who said it”, and the associated tasks of speech recognition and speaker
identification, whereas "how it was said" has received relatively little attention. Most
important for our application, we need an effective tool with which we can easily tell
"how it was said" for each utterance.
Previous research on emotions in both psychology and speech tells us that we can
find information associated with emotions from a combination of prosodic, tonal and
spectral information; speaking rate and stress distribution also provide some clues
about emotions [6, 7, 10, 12]. Prosodic features are multi-functional. They not only
express emotions but also serve a variety of other functions as well, such as word and
sentence stress or syntactic segmentation. The role of prosodic information within the
communication of emotions has been studied extensively in psychology and psycho-
linguistics. Note, however, that fundamental frequency and intensity in particular vary
considerably across speakers and have to be normalized properly [12].
What kinds of features might carry more information about the emotional meaning
of each utterance? Because of the diversity of languages and the different roles and
significance of features in different languages, features cannot be treated equally [13]. It is
hard to determine which features carry more information and how to combine them
to achieve a better recognition rate.
Research in automatic detection of expressed emotion is quite limited. Recent
research in this area mostly focuses on classification, in other words, on
ascertaining the emotion category of each utterance. This, however, is insufficient for
our applications. To describe the degree, composition, and variety of emotions in speech
more realistically and naturally, we present a novel criterion by which the
emotion information contained in utterances can be evaluated well.
We assume that there is an emotion space corresponding to our existing facial
control model. In [11] Pereira described his research on dimensions of emotional
meaning in speech, but our emotion space differs substantially from his formulation. The
facial control model contains sets of emotional facial templates of different degrees
drawn by an artist. Within this emotion space the special category "neutral" lies at the
origin, and the other categories are associated with the axis directions. With this
assumption we put our cartoon facial control model in correspondence with emotions. We also
aim to determine the location in this emotion space of a given
emotional utterance, unlike other methods that simply give a classification result.
This part of the investigation is confined to information within the utterance.
Various classification algorithms have been used in recent studies about emotions
in speech recognition, such as Nearest Neighbor, NN (Neural Network), MLB
(Maximum-Likelihood Bayes), KR (Kernel Regression), GMM (Gaussian Mixture
Model), and HMM (Hidden Markov Model) [5, 6, 9, 12]. For our implementation,
we chose the SVM as our classification algorithm.
In our investigation, we have captured a corpus containing emotional speech from
movies and teleplays, with over 2000 utterances from several different speakers.
Since we model only four kinds of basic emotions—“neutral”, “anger”, “happiness”
and “sadness”—we obtain good recognition accuracy. A total of 721 of the most
characteristic short utterances in these four emotional categories were selected from
the corpus.
2 Experimental Study
Because in practice only the emotions "neutral", "anger", "happiness" and "sadness"
lead to good recognition accuracy, we deal only with these four representative
categories in our application, even though this small set does not provide
enough range to describe all types of emotions. Furthermore, some utterances can
hardly be evaluated as one particular emotion. Nevertheless, we can still find utterances
that can be classified as solely one kind of emotion, which we call pure emotional
utterances.
We construct an emotion space in which the special category "neutral" is at the
origin and the other categories are measured along the axes. All pure emotions
correspond to points lying directly on an axis (or, if we relax the restriction,
near an axis); the distance from such a point to the origin denotes the degree of
the emotional utterance. When the coordinates of a point have more than one
nonzero value, the utterance contains more than one kind of emotion and cannot be
ascribed to any single emotion category.
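As an illustration of this emotion space (not code from the paper), an utterance's emotional state can be stored as a three-component vector of non-negative coordinates along the anger, happiness, and sadness axes, with the neutral state at the origin; the names, the neutral radius, and the mixture-proportion rule below are assumptions made for the sketch.

```python
import numpy as np

AXES = ("anger", "happiness", "sadness")   # "neutral" is the origin of the space

def describe_emotion(point, neutral_radius=0.1):
    """Interpret a point in the assumed emotion space.

    point          : length-3 array of non-negative coordinates along AXES
    neutral_radius : assumed radius around the origin treated as "neutral"
    """
    point = np.asarray(point, dtype=float)
    if np.linalg.norm(point) < neutral_radius:
        return "neutral"
    active = [(name, c) for name, c in zip(AXES, point) if c > 0]
    if len(active) == 1:                         # a "pure" emotional utterance
        name, degree = active[0]
        return f"pure {name}, degree {degree:.2f}"
    # Otherwise a mixture: weight each emotion by its coordinate.
    total = sum(c for _, c in active)
    return "mixture: " + ", ".join(f"{name} {c / total:.0%}" for name, c in active)
```

For example, describe_emotion([0.8, 0.0, 0.0]) yields a pure anger reading, while describe_emotion([0.5, 0.3, 0.0]) is reported as a mixture of anger and happiness.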
We further consider utterances whose emotion type is undoubtedly “neutral” as
corresponding to the region closely surrounding the origin of the emotion space. For
each of the other three categories (take "anger" for example), utterances that are
undoubtedly "anger" correspond strongly to that category's axis.
Since people cannot express emotions to an infinite degree, we assume that each
axis has an upper limit based on extraordinarily emotional utterances from the real
world. So we choose extraordinary utterances for each emotion as our training data.
Since people cannot measure the degree of emotions precisely, we simply choose
utterances that are determined to portray a given emotion by almost 100% of the
subjects to find the origin and the upper limits of the three axes.
Our approach is considerably different from those of other researchers. Other
methods can only perform classification to tell which emotional category an utterance
belongs to. Our method can handle more complicated problems, such as utterances
that contain multiple emotions and the degrees of each emotion.
2.1 Corpus of Emotional Data
We need an extensive amount of training data to estimate statistical models
accurately, so we chose speech segments from Chinese teleplays as our corpus. By
using teleplays (a single film is not long enough to satisfy our requirement), we were
able to collect a large supply of emotional speech samples in a short amount of time.
Previous experiments also indicate that emotions in acted speech can be
consistently decoded by humans and by automatic systems [6], which provided further
motivation for their use.
The teleplay files were obtained from Video CDs, with audio data extracted at a
sampling rate of 16 kHz and a resolution of 16 bits per sample. We employed
three students to capture and segment the speech data.
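For completeness, a small sketch (not from the paper) of reading one of these 16 kHz, 16-bit PCM waveforms with the Python standard library; the file name is hypothetical and the audio is assumed to be mono.

```python
import wave
import numpy as np

def load_utterance(path):
    """Read a 16 kHz, 16-bit PCM mono WAV file into a float array in [-1, 1]."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000 and w.getsampwidth() == 2
        pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return pcm.astype(np.float32) / 32768.0

samples = load_utterance("anger_0001.wav")   # hypothetical file name
```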
A total of more than 2000 utterances were captured, segmented and pre-tagged
from the teleplays. The chosen utterances are all preceded and followed by silence
with no background music or any other kind of background noise. The expressed
emotion within an utterance has to be constant.
All of these utterances need to be subjectively tagged as one of the four classes.
Only pure emotional utterances are usable in accurately forming statistical models.
One of these students and a researcher then tagged the utterances. They listened to
and tagged all of the utterances several times; each time, if the tag of an utterance
differed from its previous designation, the utterance was removed from our corpus.
The initial tags were those that the three students had assigned when pre-tagging
the more than 2000 utterances. Tagging sessions were separated by several days.
After several rounds of tagging, only 721 utterances remained. The numbers of
waveforms belonging to each emotion category are shown in Table 1.
Table 1. Data sets
Anger    Happiness    Neutral    Sadness
215      136          242        128
All data files are 16 kHz, 16-bit waveforms.
2.2 Feature Extraction
Previous research has shown some statistics of the pitch (fundamental frequency F0)
to be the main vocal cue for emotion recognition. The first and second formants,
vocal energy, frequency spectral features, and speaking rate also contribute to vocal
emotion signaling [6].
In our study, the evaluation features of the voice are mainly extracted from pitch, and
the features we derive from pitch are sufficient for most of our needs. Our feature
selection and extraction largely follow the method of [6].
First, we obtained the pitch sequence using an internally developed pitch extractor
[4]. Then we smoothed the pitch contour using smoothing cubic splines. The resulting
approximation of the pitch is smooth and continuous, and it enables us to measure
features of the pitch: the pitch derivative, pitch slopes, and the behavior of their
minima and maxima over time.
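A minimal sketch of this smoothing step, assuming the per-frame pitch contour is already available (the internal pitch extractor of [4] is not public); scipy's UnivariateSpline stands in for the smoothing cubic splines, and the 10 ms frame step is an assumption.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def smooth_pitch(f0, frame_step=0.01, smoothing=None):
    """Fit a smoothing cubic spline to the voiced (non-zero) pitch samples.

    f0 : per-frame pitch values in Hz, with 0 marking unvoiced frames.
    Returns frame times, the smoothed pitch, its derivative, and a voiced
    mask, all restricted to the valid region (first to last non-zero frame).
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = np.flatnonzero(f0 > 0)
    if voiced.size < 4:
        raise ValueError("not enough voiced frames for a cubic spline fit")
    spline = UnivariateSpline(voiced * frame_step, f0[voiced], k=3, s=smoothing)

    # Evaluate the smooth, continuous approximation over the valid region.
    frames = np.arange(voiced[0], voiced[-1] + 1)
    t = frames * frame_step
    return t, spline(t), spline.derivative()(t), f0[frames] > 0
```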
We have measured a total of sixteen features, grouped under the headings below:
• Statistics related to rhythm: Speaking rate, Average length between voiced
regions, Number of maxima / Number of (minima + maxima), Number of
upslopes / Number of slopes;
• Statistics on the smoothed pitch signal: Min, Max, Median, Standard deviation;
• Statistics on the derivative of the smoothed pitch: Min, Max, Median,
Standard deviation;
• Statistics over the individual voiced parts: Mean min, Mean max;
• Statistics over the individual slopes: Mean positive derivative, Mean negative
derivative.
All these features are calculated only in the valid region, which begins at the first
non-zero pitch point and ends at the last non-zero pitch point of each utterance.
The features in the first group are related to rhythm. Rhythm is represented by the
shape of a pitch contour. We assume the inverse of the average length of the voiced
part of an utterance denotes the speaking rate; the average length between voiced
regions can denote pauses in an utterance.
The features in the second and third groups are general features of the pitch signal
and its derivative.
In each individual voiced part we can easily find minima and maxima. We choose
the mean of the minima and the mean of the maxima as our fourth group features.
We can compute the derivative of each individual slope. If the slope is an upslope,
the derivative is positive; otherwise the derivative is negative. The mean of these
positive derivatives and mean of negative derivatives are our features in the fifth
group.
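To make these definitions concrete, the sketch below computes the sixteen statistics from a smoothed pitch contour and its derivative (for instance the output of the smoothing sketch above). The paper does not give exact formulas, so details such as the speaking-rate units and the approximation of the upslope count by the fraction of rising frames are assumptions.

```python
import numpy as np

def pitch_features(t, f0_smooth, df0, voiced_mask):
    """Sixteen pitch-based statistics over the valid region of one utterance.

    t, f0_smooth, df0 : frame times (s), smoothed pitch (Hz), and its derivative
    voiced_mask       : boolean mask marking voiced frames in the valid region
    Assumes the region starts with a voiced frame (it begins at the first
    non-zero pitch point), so there is at least one voiced run.
    """
    step = t[1] - t[0]

    # Split the region into alternating runs of voiced / unvoiced frames.
    edges = np.flatnonzero(np.diff(voiced_mask.astype(int))) + 1
    runs = np.split(np.arange(len(t)), edges)
    voiced_runs = [r for r in runs if voiced_mask[r[0]]]
    pauses = [len(r) * step for r in runs if not voiced_mask[r[0]]]

    # Local extrema of the smoothed contour.
    maxima = np.flatnonzero((f0_smooth[1:-1] > f0_smooth[:-2]) &
                            (f0_smooth[1:-1] > f0_smooth[2:])) + 1
    minima = np.flatnonzero((f0_smooth[1:-1] < f0_smooth[:-2]) &
                            (f0_smooth[1:-1] < f0_smooth[2:])) + 1

    return {
        # 1. rhythm-related statistics
        "speaking_rate": 1.0 / np.mean([len(r) * step for r in voiced_runs]),
        "mean_pause": np.mean(pauses) if pauses else 0.0,
        "maxima_ratio": len(maxima) / max(len(maxima) + len(minima), 1),
        "upslope_ratio": float(np.mean(df0 > 0)),   # assumed proxy for upslopes/slopes
        # 2. smoothed pitch signal
        "f0_min": f0_smooth.min(), "f0_max": f0_smooth.max(),
        "f0_median": np.median(f0_smooth), "f0_std": f0_smooth.std(),
        # 3. derivative of the smoothed pitch
        "df0_min": df0.min(), "df0_max": df0.max(),
        "df0_median": np.median(df0), "df0_std": df0.std(),
        # 4. extrema over the individual voiced parts
        "mean_min": np.mean([f0_smooth[r].min() for r in voiced_runs]),
        "mean_max": np.mean([f0_smooth[r].max() for r in voiced_runs]),
        # 5. mean positive / negative slope derivatives
        "mean_pos_slope": df0[df0 > 0].mean() if (df0 > 0).any() else 0.0,
        "mean_neg_slope": df0[df0 < 0].mean() if (df0 < 0).any() else 0.0,
    }
```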
2.3 Performance of Emotion Evaluator
The classification algorithms used in this research section are mostly based on K-
nearest-neighbors (KNN) or neural networks (NN). Considering our application, we
need not only classification results, but also proportions of each emotion an utterance
contains. After some experimentation, we chose the support vector machine (SVM) as
our evaluation algorithm [8], because of its high speed and each SVM can give an
evaluation to each emotion category. From training data, we can find the origin and
the three axes.
Because different features are extracted from the audio data in different ways and the
relationships among these features are complex, we chose a Gaussian kernel,
K(x_i, x_j) = exp(−‖x_i − x_j‖² / 2σ²), as our SVM kernel function.
For each emotional state, a model is learned to separate utterances of that type from
all others. We refer to SVMs trained in this way as 1-v-r SVMs (short for one-versus-
rest); the adopted scheme thus learns a separate SVM model for each category, each
distinguishing its emotion from the rest.
Our preliminary experimental results indicate that we obtain satisfactory
results only when there are at least 200 different utterances in each emotion category.
Although each SVM deals with only a two-class problem, and the performance of an
SVM classifier depends on just these two classes, the decision boundary tends to favor
the class that contains more data. To avoid this kind of skewing, we balance the
training data set of each SVM. Taking "anger" as an example, we choose about 150
utterances in the "anger" state and about 150 utterances from the other
emotion categories, with approximately the same number chosen from each of the
other categories. In this way, the results are much better than those learned from
imbalanced training data sets. Note that the training data can be replicated to balance
the data sets.
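A hedged sketch of this balanced one-versus-rest training scheme with a Gaussian (RBF) kernel, using scikit-learn's SVC in place of the SVM implementation of [8]; the 150-utterance budget mirrors the split described above, but the sampling details are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

CATEGORIES = ("anger", "happiness", "neutral", "sadness")

def train_one_vs_rest(features, labels, per_class=150, seed=0):
    """Train one RBF-kernel SVM per emotion category on a balanced subset.

    features : (n_utterances, 16) array of pitch statistics
    labels   : array of category names, one per utterance
    Returns a dict mapping category -> fitted SVC.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    models = {}
    for cat in CATEGORIES:
        pos = np.flatnonzero(labels == cat)
        pos = rng.choice(pos, size=min(per_class, len(pos)), replace=False)

        # Draw roughly the same number of "rest" examples, spread evenly over
        # the other categories; replicate (sample with replacement) if needed.
        others = [c for c in CATEGORIES if c != cat]
        rest = []
        for other in others:
            idx = np.flatnonzero(labels == other)
            take = per_class // len(others)
            rest.append(rng.choice(idx, size=take, replace=len(idx) < take))

        sel = np.concatenate([pos, *rest])
        X, y = features[sel], (labels[sel] == cat).astype(int)
        model = SVC(kernel="rbf", gamma="scale")   # Gaussian kernel, as above
        model.fit(X, y)
        models[cat] = model
    return models
```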
The training data and performance of each SVM are shown in Table 2. For each
individual SVM, the remainder of the data set not used during learning is used
as its testing data.
Table 2. SVM training data sets and performance
Category     One    Rest    Accuracy on test set
Anger        162    147     77.16%
Happiness    102     94     65.64%
Neutral      194    193     83.73%
Sadness       96     96     70.59%
We compute the given emotional utterance's feature vector, evaluate it with each
SVM, and collect the resulting evaluations; together these form the emotional
evaluation of the utterance. If only one evaluation is greater than 0 (f_i(x) > 0 for
exactly one i, 0 ≤ i ≤ 3, and f_j(x) ≤ 0 for all j ≠ i), we label the utterance as that
particular kind of emotional utterance. If more than one evaluation is greater than 0
(f_i(x) > 0 and f_j(x) > 0 for some 0 ≤ i, j ≤ 3, i ≠ j), we label the emotion of the
utterance as a mixture of several kinds of emotions, each proportional to that
emotion's SVM evaluation. If all evaluations are less than 0 (f_i(x) < 0 for i = 0..3),
we say the emotion of the utterance is undefined in our system.
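Written out, the decision rule above is a simple combination of the four per-category SVM evaluations; in this sketch, SVC.decision_function plays the role of f_i(x), and normalizing the positive scores into proportions is our assumed reading of "proportional to the emotion's SVM evaluation".

```python
def evaluate_emotion(models, x):
    """Combine the four SVM evaluations for one 16-dimensional feature vector x.

    Returns ("pure", category, score), ("mixture", {category: proportion}),
    or ("undefined", None) when every evaluation is negative.
    """
    scores = {cat: float(m.decision_function([x])[0]) for cat, m in models.items()}
    positive = {cat: s for cat, s in scores.items() if s > 0}
    if not positive:
        return ("undefined", None)
    if len(positive) == 1:
        cat, score = next(iter(positive.items()))
        return ("pure", cat, score)
    total = sum(positive.values())
    return ("mixture", {cat: s / total for cat, s in positive.items()})
```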
2.4 Comparison
We have also compared the effectiveness of the SVM classifier to the K-nearest
neighbor classifier and the neural network classifier. One can observe that the SVM
classifier compares favorably to the other two types of classifiers.
Table 3. Comparison of NN, KNN and SVM (accuracy, %)
Method    Anger    Happiness    Neutral    Sadness
NN        40.00    27.78        62.68      35.71
KNN       42.86    39.28        89.29      32.14
SVM       77.16    65.64        83.73      70.59
Remark: In each category there are 100 learning utterances, and all remaining
utterances are used for testing.
3 Conclusions and Discussion
Compared with KNN, training an SVM model gives a good classifier without needing
much training time. Even if we do not know the exact relationships among the
features, we can still obtain good results. After we produce the SVM model from the
training data sets, those data sets are no longer needed, since the SVM model
contains all the useful information; classification therefore takes little time and can
almost be applied within real-time rendering. The KNN rule relies on a distance
metric to perform classification, so changing this metric can be expected to yield
different and possibly better results. Intuitively, one should weight each feature
according to how well it correlates with the correct classification. In our
investigation, however, the features are not independent of one another; the performance
landscape in this metric space is quite rugged, and optimization is likely expensive.
The SVM handles this problem well: we need not know the relationships between
feature pairs or the dimensionality of each feature.
Compared with NNs, training an SVM model requires much less time than training
an NN classifier, and SVMs are much more robust than NNs. In our application, the
corpus comes from movies and teleplays, with many speakers from various
backgrounds; in such instances, NNs do not work well.
The most important reason we chose SVMs is that they give a magnitude along with
each recognition decision, which we need for synthesizing expressions of different
degrees. For our future work, we plan to study the effectiveness of our current
approach on data from different languages and cultures.
References
1. Brand, M.: “Voice Puppetry”, Proceedings of the SIGGRAPH, 21-28, 1999.
2. Cassell, J., Bickmore, T., Campbell, L., Chang, K., Vilhjálmsson, H., and Yan, H.:
“Requirements for an architecture for embodied conversational characters”, Proceedings of
Computer Animation and Simulation, 109-120, 1999.
3. Cassell, J., Pelachaud, C., Badler, N.I., Steedman, M., Achorn, B., Beckett, T., Douville,
B., Prevost, S. and Stone, M.: “Animated conversation: rule-based generation of facial display,
gesture and spoken intonation for multiple conversational agents”, Proceedings of the
SIGGRAPH, 28(4): 413-420, 1994.
4. Chang, E., Zhou, J.-L., Di, S., Huang, C., and Lee, K.-F.: "Large Vocabulary Mandarin
Speech Recognition with Different Approaches in Modeling Tones", Proceedings of the
International Conference on Spoken Language Processing, 2000.
5. Roy, D., and Pentland, A.: “Automatic spoken affect analysis and classification”, in
Proceedings of the Second International Conference on Automatic Face and Gesture
Recognition, pp. 363-367, 1996.
6. Dellaert, F., Polzin, T., and Waibel, A.: “Recognizing Emotion in Speech”, Proceedings of
the ICSLP, 1996.
7. Erickson, D., Abramson, A., Maekawa, K., and Kaburagi, T.: “Articulatory Characteristics
of Emotional Utterances in Spoken English", Proceedings of the ICSLP, 2000.
8. Joachims, T.: "Making Large-Scale SVM Training Practical", in Schölkopf, B., Burges, C.,
and Smola, A. (eds.): Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.
9. Kang, B.-S., Han, C.-H., Lee, S.-T., Youn, D.-H., and Lee, C.-Y.: "Speaker Dependent
Emotion Recognition using Speech Signals", Proceedings of the ICSLP, 2000.
10. Paeschke, A., and Sendlmeier, W. F.: “Prosodic Characteristics of Emotional Speech:
Measurements of Fundamental Frequency Movements”, Proceedings of the ISCA-Workshop
on Speech and Emotion, 2000.
11. Pereira, C.: “Dimensions of Emotional Meaning in Speech”, Proceedings of the ISCA-
Workshop on Speech and Emotion, 2000.
12. Polzin, T., and Waibel, A.: “Emotion-Sensitive Human-Computer Interfaces”,
Proceedings of the ISCA-Workshop on Speech and Emotion, 2000.
13. Scherer, K.R.: “A Cross-Cultural Investigation of Emotion Inferences from Voice and
Speech: Implications for Speech”, Proceedings of the ICSLP, 2000.
14. Li, Y., Yu, F., Xu, Y.-Q., Chang, E., and Shum, H.-Y.: "Speech-Driven Cartoon
Animation with Emotions", to appear in ACM Multimedia 2001.