2. MULTIMODAL (MULTI-INPUT) EMOTION RECOGNITION
Reasons:
o Richer information: Cues from different modalities can augment or complement each other, and hence lead to more sophisticated inference algorithms.
o Robustness to sensor noise: Information from different modalities captured through sensors can be corrupted by signal noise, be missing altogether when the particular modality is not expressed, or be impossible to capture due to occlusion, sensor artifacts, etc. We call such modalities ineffectual. Ineffectual modalities are especially prevalent in in-the-wild datasets.
5. CHALLENGES
Challenges:
o Deciding which modalities should be combined, and how
o Lack of agreement on the most efficient mechanism for combining (fusing) multiple modalities
7. LATE FUSION
Gunes et al. (2007): Multimodal emotion recognition from expressive faces and body gestures
Lee et al. (2018): Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data
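Both works combine per-modality decisions rather than raw features. As an illustration of the late-fusion idea itself (a minimal sketch, not either paper's exact pipeline), one can average per-modality class probabilities at the decision level:

```python
import numpy as np

# Minimal late-fusion sketch (illustrative only): each unimodal classifier
# produces class probabilities independently; the final prediction combines
# them at the decision level with a (possibly weighted) average.

def late_fusion(probs_per_modality, weights=None):
    """probs_per_modality: list of (num_classes,) probability vectors."""
    probs = np.stack(probs_per_modality)          # (num_modalities, num_classes)
    if weights is None:
        weights = np.ones(len(probs)) / len(probs)
    fused = np.average(probs, axis=0, weights=weights)
    return fused.argmax(), fused

# Example: face and speech classifiers over 6 emotion classes
face   = np.array([0.10, 0.60, 0.10, 0.10, 0.05, 0.05])
speech = np.array([0.20, 0.50, 0.10, 0.10, 0.05, 0.05])
label, fused = late_fusion([face, speech])
```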
8. RELATED WORK
Comparison of multimodal methods:

Dataset   | Method                                     | Modalities                     | F1 score | MA
----------|--------------------------------------------|--------------------------------|----------|-------
IEMOCAP   | Kim et al. (2013): Deep Belief Network     | Motion capture and audio-video | 72.8%    |
IEMOCAP   | Yoon et al. (2019): Multi-hop attention    | Text and speech                | 77.6%    |
IEMOCAP   | Majumdar et al. (2018)                     | Text, audio and video          | 76.5%    |
CMU-MOSEI | Zadeh et al. (2018): Dynamic Fusion Graph  | Language, vision and acoustic  | 76.3%    |
CMU-MOSEI | Lee et al. (2018)                          | Text and speech                | 89%      | 84.08%
CMU-MOSEI | Sahay et al. (2018): Tensor Fusion Network | Text and audio                 | 66.8%    |
10. MODALITIES CHECK
Purpose: filter out ineffectual data to improve accuracy on real-world data
Use Canonical Correlation Analysis (CCA) to compute the correlation score, ρ, of every pair of input modalities:
compute the correlation score for each pair {f_i, f_j}, then check it against an empirically chosen threshold (τ).
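A minimal sketch of this check (assumed implementation, not the authors' code), computing a CCA-based ρ for a pair of modality feature matrices and comparing it to τ:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hedged sketch of the modalities check: rho is the correlation of the first
# pair of canonical variates produced by CCA; pairs with rho below tau are
# flagged as ineffectual.

def modality_pair_score(F_i, F_j, n_components=1):
    """F_i, F_j: (num_samples, feature_dim) matrices for two modalities."""
    cca = CCA(n_components=n_components)
    U, V = cca.fit_transform(F_i, F_j)            # projected canonical variates
    rho = np.corrcoef(U[:, 0], V[:, 0])[0, 1]     # correlation score
    return rho

tau = 0.1                                         # empirically chosen threshold
F_text, F_audio = np.random.rand(100, 30), np.random.rand(100, 12)
rho = modality_pair_score(F_text, F_audio)
effectual = rho > tau                             # passes the modality check?
```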
11. REGENERATING PROXY FEATURE VECTORS
Purpose: reduce the noise of each input by regenerating proxy feature vectors for the ineffectual (missing) modalities
Find v_j = argmin_j d(v_j, f), where f is an effectual feature vector and d(·,·) is any distance metric
Compute constants a_i ∈ R by solving the following linear system:
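The linear system itself appeared as an image on the slide. A plausible reconstruction (an assumption, following the M3ER-style procedure this section appears to describe): express the effectual feature as a linear combination of its nearest training neighbors, then apply the same coefficients to the corresponding vectors of the ineffectual modality.

```python
import numpy as np

# Hedged sketch of proxy-feature regeneration (assumed reconstruction of the
# slide's elided linear system; not the authors' code).

def regenerate_proxy(f_eff, V_eff, V_ineff, k=3):
    """
    f_eff   : (d_e,) effectual feature vector at test time
    V_eff   : (N, d_e) effectual-modality features from training data
    V_ineff : (N, d_i) corresponding ineffectual-modality features
    """
    # 1) k nearest neighbors of f_eff (Euclidean here; any metric d works)
    dists = np.linalg.norm(V_eff - f_eff, axis=1)
    idx = np.argsort(dists)[:k]

    # 2) constants a_i from the linear system  f_eff ≈ sum_i a_i * v_i
    A = V_eff[idx].T                        # (d_e, k)
    a, *_ = np.linalg.lstsq(A, f_eff, rcond=None)

    # 3) proxy vector: same coefficients applied to the ineffectual modality
    return V_ineff[idx].T @ a               # (d_i,)

V_text  = np.random.rand(100, 300)          # training text features (effectual)
V_audio = np.random.rand(100, 74)           # training audio features (missing at test)
proxy_audio = regenerate_proxy(np.random.rand(300), V_text, V_audio)
```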
12. MULTIPLICATIVE MODALITY FUSION
Idea: explicitly suppress the weaker (less expressive) modalities, which indirectly boosts the stronger (more expressive) modalities
The loss for the i-th modality:
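The formula itself was an image on the slide. A reconstruction based on the multiplicative-combination loss of Liu et al. (2018), which this fusion scheme follows (assumed, not copied from the source), with M modalities, true class y, and hyperparameter β ≥ 0:

```latex
% p_i^y: modality i's predicted probability for the true class y.
% The product term down-weights modality i's loss when the other
% modalities already classify the sample confidently.
\[
  L_i \;=\; -\Big(\prod_{j \neq i} \big(1 - p_j^{\,y}\big)\Big)^{\beta/(M-1)} \log p_i^{\,y}
\]
```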
13. MODALITY COMBINATION
Requirements:
o Be able to process sophisticated, data-driven in-the-wild datasets (CMU-MOSEI, YouTube, ...), which contain noise, occlusion, ...
o Increase reliability
Proposed combination (see the sketch below):
o Use single-hidden-layer LSTMs, one per modality, each with output dimension 32.
o Then use multiplicative fusion to combine the three 32-dimensional feature vectors.
o This feature vector is concatenated with the final value of the memory variable, and the resulting 160-dimensional feature vector is passed through a 64-dimensional fully connected layer followed by a 6-dimensional fully connected layer to generate the network outputs.
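A minimal PyTorch sketch of this combination network. The 32/160/64/6 dimensions follow the slide; the input feature sizes and the exact layout of the 160-dimensional concatenation (three per-modality vectors + fused vector + memory state) are assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of the proposed combination network (assumed input dims and
# concatenation layout; not the authors' exact implementation).

class MultiplicativeFusionNet(nn.Module):
    def __init__(self, dims=(300, 74, 35), hidden=32, num_classes=6):
        super().__init__()
        # one single-hidden-layer LSTM per modality, each with 32-dim output
        self.lstms = nn.ModuleList([nn.LSTM(d, hidden, batch_first=True) for d in dims])
        self.fc = nn.Sequential(nn.Linear(160, 64), nn.ReLU(), nn.Linear(64, num_classes))

    def forward(self, inputs):                     # inputs: list of (B, T, d_m)
        feats, mems = [], []
        for lstm, x in zip(self.lstms, inputs):
            _, (h, c) = lstm(x)
            feats.append(h[-1])                    # (B, 32) last hidden state
            mems.append(c[-1])                     # (B, 32) final memory (cell) state
        # multiplicative fusion: element-wise product of the three 32-dim vectors
        fused = feats[0] * feats[1] * feats[2]     # (B, 32)
        mem = torch.stack(mems).mean(0)            # (B, 32), assumed aggregation
        # assumed layout: 3 x 32 per-modality + 32 fused + 32 memory = 160 dims
        z = torch.cat(feats + [fused, mem], dim=1) # (B, 160)
        return self.fc(z)                          # (B, 6) emotion logits

net = MultiplicativeFusionNet()
logits = net([torch.randn(4, 20, 300), torch.randn(4, 20, 74), torch.randn(4, 20, 35)])
```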
14. EXPERIMENTS
Feature extraction:
Text (f_t): pre-trained 300-dimensional GloVe word embeddings
Audio: the COVAREP software (Degottex et al., 2014) is used to extract acoustic features, including 12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmenting features, glottal source parameters, peak-slope parameters, and maxima dispersion quotients
Video: a combination of face embeddings obtained from state-of-the-art facial recognition models, facial action units, and facial landmarks is used for CMU-MOSEI
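A minimal sketch of the text-feature step (the file name is an assumption; the 300-dim vectors are the standard Stanford NLP GloVe release):

```python
import numpy as np

# Minimal sketch of the GloVe text features (assumed file path: the standard
# glove.6B.300d release; not the authors' exact preprocessing).

def load_glove(path="glove.6B.300d.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split(" ")
            vectors[word] = np.asarray(vals, dtype=np.float32)
    return vectors

glove = load_glove()
# f_t: one 300-dim embedding per token of the utterance transcript
f_t = np.stack([glove.get(w, np.zeros(300, dtype=np.float32))
                for w in "i am feeling great today".split()])
```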
16. LIMITATIONS
• Often confuses certain class labels with each other
• Human perception of emotion at a given instant is itself not absolutely precise
• Adding context to emotion recognition may be worth considering
Classification pipeline of the proposed method. Once the visual and audio features are extracted, we construct a radial basis function (RBF) kernel from each descriptor. We then use MKL to optimally combine the feature kernels for input to the SVM classifier.
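A minimal sketch of this kernel-combination idea (illustrative: fixed convex weights stand in for a full multiple-kernel-learning solver, which would typically be tuned, e.g. by cross-validation):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Hedged sketch of the RBF-kernel + MKL + SVM pipeline (illustrative only;
# an MKL solver would learn the kernel weights rather than fix them).

X_vis, X_aud = np.random.rand(60, 128), np.random.rand(60, 74)  # toy descriptors
y = np.random.randint(0, 2, 60)

K_vis = rbf_kernel(X_vis)                   # one RBF kernel per descriptor
K_aud = rbf_kernel(X_aud)
w = [0.6, 0.4]                              # kernel weights (MKL would learn these)
K = w[0] * K_vis + w[1] * K_aud             # optimally combined kernel

clf = SVC(kernel="precomputed").fit(K, y)   # SVM on the combined kernel
pred = clf.predict(K)                       # test data would use K(test, train)
```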
A direct way to learn the relationship between these two feature vectors would be a shallow model, i.e., a simple concatenation of the two feature vectors. However, since the correlations between feature vectors from speech and text are highly non-linear, it is difficult for a shallow model to learn proper multimodal representations. We therefore use trainable attention mechanisms to learn the non-linear correlations between these feature vectors. Attention mechanisms also help retain information in the time domain by forming a temporal embedding between the two feature vectors.
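A minimal sketch of such a trainable cross-modal attention layer between text and speech feature sequences (the shapes and projection sizes are assumptions, not Lee et al.'s exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal cross-modal attention sketch: each text step attends over all
# speech steps, forming a temporal embedding between the two sequences.

class CrossModalAttention(nn.Module):
    def __init__(self, d_text=128, d_speech=128, d_attn=64):
        super().__init__()
        self.q = nn.Linear(d_text, d_attn)      # queries from text steps
        self.k = nn.Linear(d_speech, d_attn)    # keys from speech steps
        self.v = nn.Linear(d_speech, d_attn)    # values from speech steps

    def forward(self, text, speech):            # (B, T_t, d_text), (B, T_s, d_speech)
        q, k, v = self.q(text), self.k(speech), self.v(speech)
        scores = q @ k.transpose(1, 2) / k.size(-1) ** 0.5   # (B, T_t, T_s)
        attn = F.softmax(scores, dim=-1)        # non-linear correlation weights
        return attn @ v                         # (B, T_t, d_attn) temporal embedding

emb = CrossModalAttention()(torch.randn(2, 12, 128), torch.randn(2, 40, 128))
```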
2: Using the cross-validation method to integrate