This document summarizes an academic paper on audio-visual emotion recognition. It discusses proposed methods for visual and audio emotion detection that extract features such as horizontal and vertical cross-correlation from segmented eye and mouth regions of video frames, and perceptual linear predictive coefficients and mel-frequency cepstral coefficients from audio frames. Classification is performed using k-nearest neighbors. The results show an audio-visual emotion recognition rate of 96.67%, visual emotion recognition rate of 87.6%, and speech emotion recognition rate of 69.83%. Future work directions are also proposed, such as increasing wavelet decomposition levels and extracting features from other facial regions and pitch cycles.
2. Why Emotion Recognition?
1. Enhancing naturalness in human-machine interaction (for example: on-board car driving systems, autonomous call center services, interactive movies, storytelling, e-tutoring, autonomous psychological therapy, etc.)
2. Speech-to-speech translation systems
3. Medical disorder diagnosis
4. Indexing and retrieving audio/video files based on emotions, and so on
3. [Block diagram of the overall system: video frames yield VCCR and HCCR features; audio frames yield PLPC, MFCC, and wavelet packet PLPC/MFCC features; both are combined into a final feature vector, which feeds classification of train/test subjects to produce a decision.]
4. Proposed Method for Visual Emotion Detection
[Block diagram: video → frames → RGB-to-gray conversion → silent frame removal → mouth and eye region segmentation → vertical and horizontal CCR per region → final feature → classification of train/test subjects → decision.]
5. Pre-processing of video frames
1. Each video frame is converted from RGB scale to gray scale and median filtered.
Why? The RGB scale isn't needed, as only geometrical information is extracted. Median filtering removes noisy pixels from the image.
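A minimal sketch of this step in Python with OpenCV; the 3×3 median kernel is an assumption, since the slide does not specify a kernel size:

```python
import cv2

def preprocess_frame(frame_bgr):
    """Convert a color video frame to gray scale and median-filter it."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)  # drop color: only geometry is used
    return cv2.medianBlur(gray, 3)  # 3x3 median filter removes noisy (salt-and-pepper) pixels
```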
6. Removing silent frames
1. Short-time energy is utilized to remove frames where no speech is given by the subject (see the sketch below).
Why? This keeps only the frames where speech is provided and emotion is expressed, reducing redundant frames and increasing accuracy to a great extent.
Why not in audio? It would remove the Vowel Onset Points (VOP), which carry vital information about specific emotions.
[Waveform illustration: alternating silent regions and speech regions along the audio signal.]
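A minimal sketch of energy-based silent frame removal, assuming per-frame short-time energy aligned with the video frames; the relative threshold rule is an assumption:

```python
import numpy as np

def short_time_energy(x, frame_len, hop):
    """Short-time energy of a 1-D signal: sum of squares per frame."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(x[i*hop : i*hop+frame_len] ** 2) for i in range(n_frames)])

def remove_silent_frames(frames, frame_energy, rel_threshold=0.05):
    """Keep only video frames whose aligned audio has significant energy.

    frames: video frames; frame_energy: short-time energy per frame
    (same length as `frames`). The threshold rule is an assumption.
    """
    threshold = rel_threshold * np.max(frame_energy)
    return [f for f, e in zip(frames, frame_energy) if e > threshold]
```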
7. Eye and mouth segmentation by Viola-Jones
1. The Viola-Jones algorithm is applied to each frame to select only the mouth and eye regions of the face (see the sketch below).
Why? Emotion is primarily expressed through different shapes of the mouth and eyes and the relative location between the eyes and eyebrows.
[Illustration: original gray scale image with the extracted eye and mouth regions.]
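A minimal sketch using OpenCV's Haar cascade implementation of Viola-Jones; the exact cascade files are assumptions (OpenCV's stock smile cascade stands in here for a mouth detector):

```python
import cv2

eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")
mouth_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_smile.xml")

def segment_regions(gray_frame):
    """Return cropped eye and mouth regions detected by Viola-Jones cascades."""
    eyes = eye_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=5)
    mouths = mouth_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=10)
    crop = lambda boxes: [gray_frame[y:y+h, x:x+w] for (x, y, w, h) in boxes]
    return crop(eyes), crop(mouths)
```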
8. Extracting Vertical and Horizontal Cross Correlation
1. Cross Correlation features, Vertical Cross Correlation (between each two columns) and Horizontal Cross Correlation (between each two rows), are extracted from the segmented regions.
Why? Cross Correlation features give detailed geometrical shape information at minimal computational cost.
Cross correlation between two sequences $x(n)$ and $y(n)$ of length $N$:
$$R_{xy}(m) = \sum_{n=0}^{N-m-1} x(m+n)\, y^{*}(n)$$
Vertical Cross Correlation takes $x(n)$ and $y(n)$ as two columns of the segmented region; Horizontal Cross Correlation takes them as two rows.
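A minimal sketch of this cross correlation for real-valued pixel sequences; which column or row pairs to correlate is left to the caller, since the slide only says "between each two columns/rows":

```python
import numpy as np

def cross_corr(x, y):
    """R_xy(m) = sum_{n=0}^{N-m-1} x(m+n) * y(n), for real-valued sequences."""
    N = len(x)
    return np.array([np.dot(x[m:], y[:N - m]) for m in range(N)])

# Vertical CCR between two columns of a segmented region (2-D array):
#   v = cross_corr(region[:, 0].astype(float), region[:, 1].astype(float))
# Horizontal CCR uses rows instead:
#   h = cross_corr(region[0].astype(float), region[1].astype(float))
```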
9-10. Cross Correlation Features from the Eye and Mouth Regions
- Horizontal Cross Correlation - Eye: the left eye is considered due to symmetry.
- Horizontal Cross Correlation - Mouth: the left half of the mouth region is considered due to symmetry.
- Vertical Cross Correlation - Eye: the left eye is considered due to symmetry.
- Vertical Cross Correlation - Mouth: the left half of the mouth region is considered due to symmetry.
Complete visual feature from one video file (length 234): HCCR Eye (57), HCCR Mouth (65), VCCR Eye (44), VCCR Mouth (68).
11. Proposed Method for Emotion Detection from Speech
[Block diagram: audio → pre-emphasis → framing and windowing → PLPC, MFCC, and wavelet packet PLPC/MFCC extraction → mean over frames → final feature → classification of train/test subjects → decision.]
Proposed Features:
1. Perceptual Linear Predictive Coefficients (PLPC)
2. Mel Frequency Cepstral Coefficients (MFCC)
3. PLPC and MFCC of Wavelet Packet Coefficients
12. Channel Selection and Endpoint Detection
1. Left, right, and mono channels are selected for feature extraction.
2. Short-time energy is calculated for thresholding the frames at the start and end of the audio signal.
Why? Each channel gives slightly different accuracies. Silent regions at the start and end lower the accuracy.
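A minimal sketch of channel selection and energy-based endpoint trimming; the file name and the relative threshold are assumptions:

```python
import numpy as np
from scipy.io import wavfile

def select_channel(stereo, channel="mono"):
    """Pick the left, right, or mono (average) channel from a stereo signal."""
    if channel == "left":
        return stereo[:, 0]
    if channel == "right":
        return stereo[:, 1]
    return stereo.mean(axis=1)  # mono: average of both channels

def trim_endpoints(x, frame_len, hop, rel_threshold=0.02):
    """Drop low-energy frames at the start and end of the signal."""
    xf = x.astype(float)
    energy = np.array([np.sum(xf[i*hop : i*hop+frame_len] ** 2)
                       for i in range(1 + (len(xf) - frame_len) // hop)])
    active = np.nonzero(energy > rel_threshold * energy.max())[0]
    return x[active[0] * hop : active[-1] * hop + frame_len]

# Usage (hypothetical file):
#   fs, stereo = wavfile.read("clip.wav")
#   x = trim_endpoints(select_channel(stereo, "left"), frame_len=400, hop=160)
```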
13. Pre-Emphasis Filtering
1. The audio signal is passed through a first-order high-pass filter with pre-emphasis coefficient 0.9785.
Why? To emphasize the information of the formants and remove the impact of the excitation source.
$$H(z) = 1 - a z^{-1}, \qquad a = 0.9785$$
H(z) = Pre-Emphasis Filter, E(z) = Excitation Source, V(z) = Vocal Tract Filter, S(z) = Sound Signal.
[Diagram: source-filter model of speech production, with the pre-emphasis filter H(z) compensating the spectral tilt of the excitation source.]
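A minimal sketch of the pre-emphasis filter with the stated coefficient:

```python
from scipy.signal import lfilter

def pre_emphasis(x, a=0.9785):
    """Apply H(z) = 1 - a*z^{-1}, i.e., y[n] = x[n] - a*x[n-1]."""
    return lfilter([1.0, -a], [1.0], x)
```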
14. Framing and windowing
1. The audio signal is framed into 25 ms windows, with 10 ms overlaps, and windowed by a Hamming window.
Why? Within a 10-25 ms duration, speech can be considered quasi-stationary. Hamming windowing prevents the Gibbs phenomenon.
[Illustration: block processing of the audio signal into overlapping 25 ms frames with 10 ms overlap.]
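A minimal sketch of framing and Hamming windowing; "10 ms overlap" is read here as a 15 ms hop, which is an assumption, since the phrase is sometimes used to mean a 10 ms hop:

```python
import numpy as np

def frame_signal(x, fs, win_ms=25, overlap_ms=10):
    """Split x into overlapping frames and apply a Hamming window to each."""
    frame_len = int(fs * win_ms / 1000)
    hop = frame_len - int(fs * overlap_ms / 1000)  # 15 ms step for 10 ms overlap
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i*hop : i*hop+frame_len] * window for i in range(n_frames)])
```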
15. Perceptual Linear Predictive Coefficient (PLPC) Extraction
1. 12th-order (length 13) PLPCs are extracted from each 25 ms frame by applying a Bark filter bank.
Why? The Bark filter bank, or Bark frequency scale, represents how the human ear perceives frequency ranges, which is useful for extracting emotion-related information.
Bark frequency conversion formula:
$$\mathrm{Bark} = 13 \tan^{-1}\!\left(\frac{0.76 f}{1000}\right) + 3.5 \tan^{-1}\!\left(\frac{f^2}{7500^2}\right)$$
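A minimal sketch of the Bark conversion above (full PLPC extraction adds critical-band integration, equal-loudness pre-emphasis, and linear prediction, which are omitted here):

```python
import numpy as np

def hz_to_bark(f):
    """Bark = 13*atan(0.76*f/1000) + 3.5*atan((f/7500)**2)."""
    return 13.0 * np.arctan(0.76 * f / 1000.0) + 3.5 * np.arctan((f / 7500.0) ** 2)
```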
16. Mel-Frequency Cepstral Coefficient (MFCC) Extraction
1. 26 MFCCs are extracted from each 25 ms frame, using 13 filters of the Mel frequency band.
Why? The Mel scale represents a different model of the human ear's frequency perception mechanism. Both the Bark and Mel scales are utilized here to better capture emotion-related information.
Mel frequency conversion formula:
$$\mathrm{mel} = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$
[Illustration: Mel filter banks.]
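A minimal sketch of the Mel conversion and of placing 13 filters evenly on the Mel scale; the edge placement is an assumption:

```python
import numpy as np

def hz_to_mel(f):
    """mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(fs, n_filters=13):
    """Edge/center frequencies of triangular filters spaced evenly in Mel."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    return mel_to_hz(mels)
```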
17. Temporal Smoothing
1. Temporal smoothing, or averaging filtering, of length 3 (taking into account the previous and the following frame) is applied to each frame's features.
Why? This removes sudden changes in features due to noisy speech samples.
$$x_{\mathrm{sma}}(n) = \frac{1}{W} \sum_{i=-(W-1)/2}^{(W-1)/2} x(n+i)$$
$x_{\mathrm{sma}}(n)$ = smoothed feature vector, $x(n)$ = feature vector, $W = 3$ = smoothing window.
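A minimal sketch of this moving average over frames with W = 3; the edge handling (repeating boundary frames) is an assumption:

```python
from scipy.ndimage import uniform_filter1d

def temporal_smooth(features, W=3):
    """Moving average of length W along the frame axis.

    features: (n_frames, n_coeffs) array of per-frame features.
    """
    # mode='nearest' repeats the boundary frames; edge handling is an assumption
    return uniform_filter1d(features, size=W, axis=0, mode="nearest")
```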
18. Applying Statistical Functionals
1. The mean of all frames' features is taken.
Why? The effect of all frames and their temporal evolution is taken into account, as emotions are expressed over long durations, not just within 25 ms frames.
Statistical functional (mean):
$$\mathrm{Mean} = \frac{1}{N} \sum_{n=0}^{N-1} x(n)$$
$x(n)$ = feature vector, $N$ = feature length.
19. Wavelet Packet decomposition
1. 3-level wavelet packet decomposition is performed using both Coiflet and Daubechies filters, after down-sampling to 16 kHz (see the sketch below).
Why? Wavelet packet decomposition presents another scale of perceptual frequency range. Down-sampling reduces computational cost.
[Diagram: wavelet packet tree decomposition from node (0, 0) at level 0 down to level 3; the bold-faced nodes' coefficients are used.]
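A minimal sketch with PyWavelets; the node selection assumes the four octave bands suggested by the level-3 filter bank figure (the slide only marks the chosen nodes in bold), and `coif5`/`db10` match the filters named in the results slides:

```python
import pywt
from scipy.signal import resample_poly

def wavelet_packet_bands(x, fs, wavelet="coif5", level=3):
    """Down-sample to 16 kHz and return the selected wavelet packet coefficients."""
    if fs != 16000:
        x = resample_poly(x, 16000, fs)  # down-sample to 16 kHz (fs assumed integer)
    wp = pywt.WaveletPacket(data=x, wavelet=wavelet, maxlevel=level)
    # Assumed selection: octave bands (3,0), (3,1), (2,1), (1,1)
    return [wp[n].data for n in ("aaa", "aad", "ad", "d")]
```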
20. PLPC and MFCC of wavelet coefficients
1. PLPCs and MFCCs are extracted from each of the four wavelet coefficient sets using the method described above (see the sketch below).
Why? Combining three different perceptual frequency scales provides information about emotions in greater and finer detail.
[Diagram: wavelet packet filter banks for level 3, amplitude vs. frequency, with band edges at $f_n/8$, $f_n/4$, $f_n/2$, and $f_n$.]
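A minimal sketch of extracting both coefficient types per band; `extract_plpc` and `extract_mfcc` are hypothetical helpers standing in for the PLPC and MFCC steps on the earlier slides:

```python
import numpy as np

def wavelet_plpc_mfcc(bands, fs=16000):
    """Concatenate 13 PLPCs + 26 MFCCs from each wavelet coefficient set.

    `extract_plpc` / `extract_mfcc` are hypothetical helpers implementing
    the PLPC and MFCC extraction described on the earlier slides.
    """
    feats = []
    for band in bands:  # four sets -> 4 * (13 + 26) = 156 values
        feats.append(extract_plpc(band, fs, order=12))     # length 13
        feats.append(extract_mfcc(band, fs, n_coeffs=26))  # length 26
    return np.concatenate(feats)
```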
21. Complete Audio Feature from one video file (length 195)
13 PLPC features + 26 MFCC features + 156 wavelet packet PLPC and MFCC features.
- 13 Perceptual Linear Predictive Coefficients: 12th-order filter coefficients using the Bark frequency scale.
- 26 Mel-Frequency Cepstral Coefficients: 13 Mel filter banks and a type-2 Discrete Cosine Transform are used.
- 156 wavelet packet PLPC and MFCC: 13 PLPCs and 26 MFCCs from each of the 4 wavelet coefficient sets.
22. Complete Audio-Visual Feature from one video file (length 429)
- Audio feature, length 195: 13 PLPC, 26 MFCC, 156 WPD PLPC and WPD MFCC.
- Video feature, length 234: 57 HCCR (Eye), 65 HCCR (Mouth), 44 VCCR (Eye), 68 VCCR (Mouth).
23. Emotion Recognition results from Speech Features
[Bar chart, audio channel: left. Per-emotion accuracy (%) for Angry, Disgust, Fear, Happiness, Sadness, and Surprise, comparing PLPC + MFCC against PLPC + MFCC + WPC (coif5).]
24. Emotion Recognition results from Visual Features
[Bar chart. Per-emotion accuracy (%) for Angry, Disgust, Fear, Happiness, Sadness, and Surprise, comparing VCCR, HCCR, and VCCR + HCCR.]
25. Emotion Recognition results from Audio-Visual Features
[Bar chart, audio channel: left. Per-emotion accuracy (%) for Angry, Disgust, Fear, Happiness, Sadness, and Surprise, comparing Audio (wavelet coif5), Audio (wavelet db10), Video, and Audio + Video.]
26. Summary of Results
- Audio-Visual Emotion Recognition: 96.67%. Combined features of both audio and images. Classifier: KNN.
- Visual Emotion Recognition: 87.6%. Features: Horizontal and Vertical Cross Correlation. Classifier: Ensemble KNN.
- Speech Emotion Recognition: 69.83%. Features: PLPC, MFCC, and WPD PLPC-MFCC. Classifier: KNN.
29. Future Work
1. Wavelet decomposition level increase: effects of the wavelet decomposition level and wavelet decomposition filters on accuracy.
2. Nasolabial region, cheek region, and other regions: visual features from other significant facial regions and their effects on accuracy.
3. VOP (Vowel Onset Point) speech features: instead of using the entire speech, only vowel onset point features and their effects on recognition capability.
4. Speech features from pitch cycles: instead of block processing, features can be extracted from the pitch cycles of speech.