ATTENTIVE MODALITY HOPPING MECHANISM
FOR SPEECH EMOTION RECOGNITION
Seunghyun Yoon¹, Hwanhee Lee¹, Subhadeep Dey², Kyomin Jung¹
Index
• Problem to Solve
• Related Works
• Proposed Model: Attentive Modality Hopping
• Implementation Details
• Empirical Results
• Conclusion
Research Problem

Speech Emotion Recognition
• Exploiting the impact of the visual modality in addition to speech and text
Related Work: Single modality (acoustic)

"Using Regional Saliency for Speech Emotion Recognition," Aldeneh et al., ICASSP 2017
• CNN-based model
• Achieves up to 60.7% WA on the IEMOCAP dataset (4-class)
Related Work: Single modality (acoustic)

"Automatic Speech Emotion Recognition Using Recurrent Neural Networks with Local Attention," Mirsamadi et al., ICASSP 2017
• RNN-based model with an attention mechanism
• Achieves up to 63.5% WA on the IEMOCAP dataset (4-class)
Related Work: Multi modality (acoustic, text)

"Deep Neural Networks for Emotion Recognition Combining Audio and Transcripts," Cho et al., INTERSPEECH 2018
• Combines acoustic information and conversation transcripts
• Achieves up to 64.9% WA on the IEMOCAP dataset (4-class)

(Figure: acoustic system using an LSTM with temporal mean pooling, frame size of 20 ms with 10 ms overlap; a multi-resolution CNN for transcripts; and an SVM)
Related Work: Multi modality (acoustic, text)

"Multimodal Speech Emotion Recognition Using Audio and Text," Yoon et al., SLT 2018
• RNN-based model, trained end-to-end
• Achieves up to 71.8% WA on the IEMOCAP dataset (4-class)
Related Work: Multi modality (acoustic, text)

"Speech Emotion Recognition Using Multi-hop Attention Mechanism," Yoon et al., ICASSP 2019
• Bi-RNN-based model with attention pooling
• Achieves up to 76.5% WA on the IEMOCAP dataset (4-class)
Encoding Single Modality: Recurrent Encoder

A recurrent encoder is applied to each of the three modalities:
• Speech (MFCC + prosody)
• Text (word-level embedding)
• Visual (ResNet-101)

  h_t = f_θ(h_{t-1}, x_t)

where x_t is the input feature at step t (e.g., the audio feature) and p is the prosodic feature vector.

(Figure: recurrent encoder unrolled over the input sequence)
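A minimal sketch of one such modality encoder, assuming a PyTorch GRU cell and the dimensions given on the implementation slides; the names and the assumed word-embedding size (300) are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Recurrent encoder for one modality: h_t = f_theta(h_{t-1}, x_t)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, time, input_dim)
        states, last = self.rnn(x)       # states: (batch, time, hidden)
        return states, last.squeeze(0)   # last hidden state: (batch, hidden)

# One encoder per modality; dimensions follow the implementation slides.
audio_enc = ModalityEncoder(input_dim=120 + 35, hidden_dim=200)  # MFCC+deltas, prosody appended
text_enc  = ModalityEncoder(input_dim=300, hidden_dim=200)       # word embeddings (dim assumed)
video_enc = ModalityEncoder(input_dim=2048, hidden_dim=128)      # ResNet-101 features

x_audio = torch.randn(4, 750, 155)                # (batch, max_step, feature_dim)
audio_states, audio_last = audio_enc(x_audio)
```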
Attention over Modality

Motivated by human behavior: contextual understanding emerges from an iterative process.

(Figure: attention flowing among the acoustic, textual, and visual modalities)
Attentive Modality Hopping (AMH)

Aggregating Visual Information
• Context: textual and acoustic modalities
• Result: H_1^V

The attention weight a_i for each video hidden state h_i^V is computed from the context f(h_last^A, h_last^T), where h_last^A and h_last^T are the last hidden states of the audio and text encoders:

  H_1^V = Σ_i a_i · h_i^V

  H_hop1 = f(h_last^A, h_last^T, H_1^V)   ← final representation

(Figure: audio, text, and video recurrent encoders; the attention weights a_i are applied to the video hidden states)
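A minimal sketch of this first hop, assuming equal encoder dimensions, dot-product attention scoring, and concatenation (plus a linear projection) for f(·); the paper's exact scoring and fusion functions may differ, and all names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(context, states):
    """H = sum_i a_i * h_i, with attention weights a_i = softmax_i(h_i . context)."""
    # context: (batch, dim), states: (batch, time, dim)
    scores = torch.bmm(states, context.unsqueeze(-1)).squeeze(-1)  # (batch, time)
    a = F.softmax(scores, dim=-1)                                  # attention weights a_i
    return torch.bmm(a.unsqueeze(1), states).squeeze(1)            # (batch, dim)

# f(.) is assumed to be concatenation followed by a linear projection.
dim = 128
fuse = nn.Linear(2 * dim, dim)

def hop1_visual(audio_last, text_last, video_states):
    context = fuse(torch.cat([audio_last, text_last], dim=-1))   # f(h_last^A, h_last^T)
    H1_V = attend(context, video_states)                         # H_1^V = sum_i a_i h_i^V
    H_hop1 = torch.cat([audio_last, text_last, H1_V], dim=-1)    # hop-1 representation
    return H1_V, H_hop1

# Example shapes: batch of 4, 25 video steps, dim-128 hidden states everywhere.
a_last, t_last = torch.randn(4, dim), torch.randn(4, dim)
v_states = torch.randn(4, 25, dim)
H1_V, H_hop1 = hop1_visual(a_last, t_last, v_states)
```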
Attentive Modality Hopping (AMH)

Aggregating Acoustic Information
• Context: textual and aggregated-visual modalities
• Result: H_1^A

The attention weight a_i for each audio hidden state h_i^A is computed from the context f(h_last^T, H_1^V):

  H_1^A = Σ_i a_i · h_i^A

  H_hop2 = f(H_1^A, h_last^T, H_1^V)   ← final representation

(Figure: the attention weights a_i are applied to the audio hidden states, conditioned on the text encoder output and H_1^V)
Attentive Modality Hopping (AMH)

Aggregating Textual Information
• Context: aggregated-acoustic and aggregated-visual modalities
• Result: H_1^T

The attention weight a_i for each text hidden state h_i^T is computed from the context f(H_1^A, H_1^V):

  H_1^T = Σ_i a_i · h_i^T

  H_hop3 = f(H_1^A, H_1^T, H_1^V)   ← final representation

(Figure: the attention weights a_i are applied to the text hidden states, conditioned on H_1^A and H_1^V)
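Putting the three steps together, one cycle re-aggregates the visual, acoustic, and then textual sequences, and further hops repeat the cycle with the updated summaries as context. A sketch under the same assumptions as above (dot-product attention, concatenation + linear projection for f(·); hypothetical names):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(context, states):
    """Sum_i a_i * h_i with a_i = softmax_i(h_i . context) (dot-product scoring assumed)."""
    scores = torch.bmm(states, context.unsqueeze(-1)).squeeze(-1)  # (batch, time)
    a = F.softmax(scores, dim=-1)
    return torch.bmm(a.unsqueeze(1), states).squeeze(1)            # (batch, dim)

class AttentiveModalityHopping(nn.Module):
    """Sketch of the three hopping steps over pre-encoded modality sequences."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)   # f(x, y): concat + projection (assumption)

    def forward(self, a_states, t_states, v_states, a_last, t_last):
        H_A, H_T = a_last, t_last
        # hop 1: aggregate video, conditioned on audio + text
        H_V = attend(self.fuse(torch.cat([H_A, H_T], -1)), v_states)
        # hop 2: aggregate audio, conditioned on text + aggregated video
        H_A = attend(self.fuse(torch.cat([H_T, H_V], -1)), a_states)
        # hop 3: aggregate text, conditioned on aggregated audio + aggregated video
        H_T = attend(self.fuse(torch.cat([H_A, H_V], -1)), t_states)
        # further hops would repeat the same V -> A -> T cycle with the updated summaries
        return torch.cat([H_A, H_T, H_V], dim=-1)   # final representation for the classifier
```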
Attention over Modality

Iterative Process (hop-1 through hop-6)

(Figures: at each hop, attention moves among the acoustic, textual, and visual modalities, repeating the aggregation cycle described above)
Optimization

• Objective: classification
• Compute the distribution of predicted class probabilities
• Cross-entropy loss
• Adam optimizer* (learning rate 1e-3)

*Kingma et al. (2014), "Adam: A method for stochastic optimization."
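A minimal training-step sketch matching these choices (cross-entropy over the 7 classes, Adam with learning rate 1e-3); `classifier`, `final_repr`, and the hidden size are illustrative stand-ins for the full encoder + AMH + classifier stack.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 7
classifier = nn.Linear(3 * 128, NUM_CLASSES)   # maps the final AMH representation to class logits
criterion = nn.CrossEntropyLoss()              # cross-entropy over the predicted distribution
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)  # in practice, all model parameters

def train_step(final_repr, labels):
    """final_repr: (batch, 3*dim) from the AMH module; labels: (batch,) emotion class indices."""
    optimizer.zero_grad()
    loss = criterion(classifier(final_repr), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```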
Dataset

Interactive Emotional Dyadic Motion Capture (IEMOCAP)
• Five sessions of utterances between two speakers (one male and one female)
• 10 unique speakers participated in total

Dataset split
• 7-class setup, 7,847 utterances:
  (1,103 angry, 1,041 excited, 595 happy, 1,084 sad, 1,849 frustrated, 107 surprise, and 1,708 neutral)
• 10-fold cross-validation
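A sketch of how such a 10-fold split could be set up with scikit-learn's KFold; the slide does not specify the exact fold construction (e.g., whether folds are speaker-disjoint), so this is illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

utterance_ids = np.arange(7847)            # 7,847 utterances in the 7-class setup
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(utterance_ids)):
    # train on train_idx, evaluate on test_idx; scores are averaged over the 10 folds
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test utterances")
```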
Implementation Details

Acoustic data
• MFCC features (using Kaldi)
  - frame size of 25 ms at a rate of 10 ms, with a Hamming window
  - concatenated with its first- and second-order derivatives → 120 dims
  - maximum step: 1,000 frames (10.0 s, mean + 2 std)
• Prosodic features (using OpenSMILE)
  - 35 dims, appended to the MFCC features

Textual data
• Ground-truth transcripts from the IEMOCAP dataset
• ASR-processed transcripts* (WER 5.53%)

*Google Cloud Speech API
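The slides use Kaldi and OpenSMILE; purely as an illustration, a roughly equivalent 120-dim MFCC(+delta) extraction can be sketched with librosa (40 coefficients × 3 is assumed to reach the 120 dims; parameters follow librosa's API, not the authors' Kaldi recipe).

```python
import librosa
import numpy as np

def mfcc_features(wav_path, sr=16000):
    """40 MFCCs per 25 ms frame at a 10 ms hop, plus first/second derivatives -> 120 dims."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=40,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
        window="hamming",
    )
    d1 = librosa.feature.delta(mfcc, order=1)           # first-order derivatives
    d2 = librosa.feature.delta(mfcc, order=2)           # second-order derivatives
    feats = np.concatenate([mfcc, d1, d2], axis=0).T    # (frames, 120)
    return feats[:1000]                                  # cap at the 1,000-frame maximum step
```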
Implementation Details

Visual data
(Figure: example of visual data from IEMOCAP)

① Split each video frame into two sub-frames
② Crop the center of each sub-frame with a 224×224 window (focus on the actor, remove the background)
③ Extract features using a pretrained ResNet-101* → 2,048 dims
  - frame rate of 3 frames per second
  - maximum step: 32 frames (10.6 s)

*He et al. (2016), "Deep residual learning for image recognition."
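A sketch of the per-frame feature extraction with torchvision (center crop to 224×224, then the penultimate-layer activations of a pretrained ResNet-101); the split into per-speaker sub-frames is assumed to happen before this step, and the preprocessing constants are the standard ImageNet values, not necessarily the authors'.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained ResNet-101 with the classification head removed -> 2,048-dim features.
resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.CenterCrop(224),                      # crop the 224x224 center (focus on the actor)
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_feature(frame: Image.Image) -> torch.Tensor:
    x = preprocess(frame).unsqueeze(0)      # (1, 3, 224, 224)
    return feature_extractor(x).flatten(1)  # (1, 2048)

# Frames are sampled at 3 fps and truncated to at most 32 steps (10.6 s).
```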
Implementation Details

Hyperparameters (optimized on the development set)

                    Audio   Text   Video
  max step            750    128     25
  number of layers      1      1      1
  hidden dim           200    200    128
  dropout ratio        0.7    0.3    0.7

Training
• 10 experiments per fold
• Report the average and standard deviation of the results
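The same settings, collected into a single configuration for reference (a plain Python dict; illustrative only, not the authors' configuration file):

```python
HPARAMS = {
    "audio": {"max_step": 750, "num_layers": 1, "hidden_dim": 200, "dropout": 0.7},
    "text":  {"max_step": 128, "num_layers": 1, "hidden_dim": 200, "dropout": 0.3},
    "video": {"max_step": 25,  "num_layers": 1, "hidden_dim": 128, "dropout": 0.7},
    "learning_rate": 1e-3,
    "runs_per_fold": 10,   # report mean and standard deviation over 10 runs per fold
}
```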
Experimental Results

Single-modality experiment
• The textual modality-based model shows strong performance
Experimental Results

Bi-modality experiment
• Using the textual and visual modalities together shows strong performance
Experimental Results

Tri-modality experiment
• AMH outperforms MDRE by a relative 3.65% (0.602 → 0.624)
Experimental Results

Performance with the ASR-processed transcripts
• AMH-ASR degrades by a relative 2.08% compared to AMH (0.624 → 0.611)
Experimental Results

Performance with the ASR-processed transcripts
• AMH-ASR still outperforms MDRE by a relative 1.49% (0.602 → 0.611)
Experimental Results

Performance with the number of hops
• The iterative hopping process increases model performance
Error Analysis

Confusion matrix
• The model frequently misclassifies emotions as the neutral class
  (consistent with previously reported findings)*

*Yoon et al. (2019), "Speech emotion recognition using multi-hop attention mechanism."
*Neumann et al. (2017), "Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech."
Error Analysis

Confusion matrix
• The excited and happy classes are hard to distinguish
  (these two classes overlap even in human evaluations)*

*Busso et al. (2008), "IEMOCAP: Interactive emotional dyadic motion capture database."
Error Analysis

Confusion matrix
• The model misclassifies angry as frustrated at a rate of 38.89%, but frustrated as angry at only 4.56%
Error Analysis

Confusion matrix
• Lowest performance on the surprise class (28.57%), likely due to its small amount of data (107 samples)
Conclusion

We study how to recognize speech emotion using multimodal information.
• We propose an attentive modality-hopping mechanism that combines the acoustic, textual, and visual modalities for the speech emotion recognition task.
• We show that the proposed model outperforms the best baseline system.
• We test with ASR-processed transcripts and show the reliability of the proposed system in the practical scenario where ground-truth transcripts are not available.
Thank you
code, data, contact ☺ → http://david-yoon.github.io