IEEE ICASSP 2020
In this work, we explore the impact of the visual modality, in addition to speech and text, for improving the accuracy of an emotion detection system. Traditional approaches tackle this task by fusing the knowledge from the various modalities independently before performing emotion classification. In contrast, we tackle the problem by introducing an attention mechanism to combine the information. We first apply a neural network to obtain hidden representations of the modalities. Then, an attention mechanism is defined to select and aggregate important parts of the video data by conditioning on the audio and text data. The attention mechanism is then applied again to attend to important parts of the speech and textual data, conditioning on the other modalities. Experiments are performed on the standard IEMOCAP dataset using all three modalities (audio, text, and video). The results show a significant improvement of 3.65% in weighted accuracy over the baseline system.
Related Work: Single-modality (acoustic)
Automatic Speech Emotion Recognition Using Recurrent Neural Networks with Local Attention, Mirsamadi et al., ICASSP-17
RNN-based model with an attention mechanism
Achieves up to 63.5% WA on the IEMOCAP dataset (4-class)
Related Work: Multi-modality (acoustic, text)
Deep Neural Networks for Emotion Recognition Combining Audio and Transcripts, Cho et al., INTERSPEECH-18
Combines acoustic information with conversation transcripts
Achieves up to 64.9% WA on the IEMOCAP dataset (4-class)
[Architecture diagram: acoustic system uses an LSTM with temporal mean pooling (frame size 20 ms with 10 ms overlap); transcripts are modeled with a multi-resolution CNN; the outputs are combined with an SVM]
Attentive Modality Hopping (AMH)
Aggregating Visual Information
Context: textual and acoustic modalities
Result: $\mathbf{H}^{V}_{1}$

The audio, text, and video encoders produce hidden states $\mathbf{h}^{A}_{1}, \dots, \mathbf{h}^{A}_{t}$, $\mathbf{h}^{T}_{1}, \dots, \mathbf{h}^{T}_{t}$, and $\mathbf{h}^{V}_{1}, \dots, \mathbf{h}^{V}_{t}$.
Attention weights $a_{i}$ over the video states are computed from the context $f(\mathbf{h}^{A}_{\mathrm{last}}, \mathbf{h}^{T}_{\mathrm{last}})$, and the visual information is aggregated as
$\mathbf{H}^{V}_{1} = \sum_{i} a_{i}\, \mathbf{h}^{V}_{i}$
Final representation: $\mathbf{H}_{\mathrm{hop1}} = f(\mathbf{h}^{A}_{\mathrm{last}}, \mathbf{h}^{T}_{\mathrm{last}}, \mathbf{H}^{V}_{1})$
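As a concrete illustration, the sketch below implements one attention hop in NumPy. It assumes $f$ is a concatenation followed by a linear projection and that the attention scores are dot products passed through a softmax; the exact fusion and scoring functions used in the paper may differ, and all names here are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse(vectors, W):
    # Assumed form of f(.): concatenate the context vectors, then project.
    return W @ np.concatenate(vectors)

def attention_hop(context_vecs, target_states, W):
    """Aggregate target_states (t x d) by attending with the fused context."""
    context = fuse(context_vecs, W)        # context vector, shape (d,)
    scores = target_states @ context       # dot-product attention scores, shape (t,)
    weights = softmax(scores)              # attention weights a_i
    return weights @ target_states         # H = sum_i a_i * h_i, shape (d,)
```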
Attentive Modality Hopping (AMH)
Aggregating Acoustic Information
Context: textual and aggregated-visual modalities
Result: $\mathbf{H}^{A}_{1}$

Attention weights $a_{i}$ over the audio states are computed from the context $f(\mathbf{h}^{T}_{\mathrm{last}}, \mathbf{H}^{V}_{1})$, and the acoustic information is aggregated as
$\mathbf{H}^{A}_{1} = \sum_{i} a_{i}\, \mathbf{h}^{A}_{i}$
Final representation: $\mathbf{H}_{\mathrm{hop2}} = f(\mathbf{H}^{A}_{1}, \mathbf{h}^{T}_{\mathrm{last}}, \mathbf{H}^{V}_{1})$
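Under the same assumptions, hop 2 reuses the `attention_hop` sketch above, now conditioning on the last text state and the aggregated visual vector from hop 1 (variable names are illustrative):

```python
# Hop 1: aggregate the video states conditioned on audio and text.
H_V_1 = attention_hop([h_A_last, h_T_last], video_states, W_hop1)

# Hop 2: aggregate the audio states conditioned on text and H_V_1.
H_A_1 = attention_hop([h_T_last, H_V_1], audio_states, W_hop2)
```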
Attentive Modality Hopping (AMH)
Aggregating Textual Information
Context: aggregated-acoustic and aggregated-visual modalities
Result: $\mathbf{H}^{T}_{1}$

Attention weights $a_{i}$ over the text states are computed from the context $f(\mathbf{H}^{A}_{1}, \mathbf{H}^{V}_{1})$, and the textual information is aggregated as
$\mathbf{H}^{T}_{1} = \sum_{i} a_{i}\, \mathbf{h}^{T}_{i}$
Final representation: $\mathbf{H}_{\mathrm{hop3}} = f(\mathbf{H}^{A}_{1}, \mathbf{H}^{T}_{1}, \mathbf{H}^{V}_{1})$
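Completing the sketch, hop 3 aggregates the text states conditioned on the two aggregated vectors, and the hop-3 representation feeds a classifier over the emotion classes. The softmax classification head shown here is an assumption for illustration, not the paper's exact architecture.

```python
# Hop 3: aggregate the text states conditioned on H_A_1 and H_V_1.
H_T_1 = attention_hop([H_A_1, H_V_1], text_states, W_hop3)

# Final representation H_hop3 = f(H_A_1, H_T_1, H_V_1), then an assumed softmax head.
H_hop3 = fuse([H_A_1, H_T_1, H_V_1], W_final)
logits = W_cls @ H_hop3 + b_cls            # one logit per emotion class
probs = softmax(logits)
```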
Dataset
Interactive Emotional Dyadic Motion Capture (IEMOCAP)
Five sessions of utterances between two speakers (one male and one female)
10 unique speakers participated in total
Dataset Split
7 classes, 7,847 utterances:
(1,103 angry, 1,041 excited, 595 happy, 1,084 sad, 1,849 frustrated, 107 surprise, and 1,708 neutral)
10-fold cross-validation
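A minimal sketch of the evaluation protocol, assuming plain utterance-level 10-fold cross-validation with scikit-learn; whether the folds are built by utterance, session, or speaker is not specified on this slide, so this is only an approximation.

```python
import numpy as np
from sklearn.model_selection import KFold

# 7,847 utterances over 7 classes (angry, excited, happy, sad,
# frustrated, surprise, neutral).
utterance_ids = np.arange(7847)

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kfold.split(utterance_ids)):
    # Train on train_idx and report weighted accuracy on test_idx.
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```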
Implementation Details
Acoustic data
MFCC features (using Kaldi)
Frame size 25 ms at a rate of 10 ms with a Hamming window
Concatenated with its first- and second-order derivatives → 120 dims
Maximum number of steps: 1,000 (10.0 s, mean + 2 std)
Prosodic features (using openSMILE)
35 dims
Appended to the MFCC features
(an approximate feature-extraction sketch follows below)
Textual data
Ground-truth transcripts from the IEMOCAP dataset
ASR-processed transcripts* (WER 5.53%)
*Google Cloud Speech API
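The slides use Kaldi and openSMILE; the sketch below only approximates the acoustic front end with librosa (assuming 40 MFCCs so that static + first- + second-order derivatives give 120 dimensions, 25 ms Hamming windows at a 10 ms rate, and a 1,000-frame cap). The 35-dimensional openSMILE prosodic features would be appended separately.

```python
import numpy as np
import librosa

def acoustic_features(wav_path, sr=16000, max_steps=1000):
    y, _ = librosa.load(wav_path, sr=sr)
    # 40 MFCCs with a 25 ms Hamming window at a 10 ms frame rate.
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=40,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), window="hamming",
    )
    # Append first- and second-order derivatives -> 120 dims per frame.
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)]).T
    return feats[:max_steps]   # cap at 1,000 frames (10.0 s)
```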
Error Analysis
Confusion matrix
The model frequently misclassifies emotions as the neutral class
(consistent with previously reported findings)*
(a small confusion-matrix sketch follows the citations below)
*Yoon et al. (2019), "Speech emotion recognition using multi-hop attention mechanism."
*Neumann et al. (2017), "Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech."
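A small sketch of the analysis above, assuming gold and predicted labels are available as string arrays; it row-normalizes the confusion matrix and reports how often each emotion is predicted as neutral.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["angry", "excited", "happy", "sad",
          "frustrated", "surprise", "neutral"]

def neutral_confusion(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)
    rates = cm / cm.sum(axis=1, keepdims=True)      # row-normalized confusion
    neutral_col = LABELS.index("neutral")
    for label, row in zip(LABELS, rates):
        print(f"{label:>10s} -> predicted neutral: {row[neutral_col]:.2%}")
    return cm
```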
Conclusion
We study how to recognize speech emotion using multimodal information
Propose an attentive modality-hopping mechanism that combines the acoustic, textual, and visual modalities for the speech emotion recognition task
Show that the proposed model outperforms the best baseline system
Test with ASR-processed transcripts and show the reliability of the proposed system in the practical scenario where ground-truth transcripts are not available