2. MULTIMODAL (MULTI-INPUT) EMOTION RECOGNITION
Reasons:
o Richer information: Cues from different modalities can augment or complement each other, and hence lead to more sophisticated inference algorithms.
o Robustness to sensor noise: Information from different modalities captured through sensors can be corrupted by signal noise, be missing altogether when the particular modality is not expressed, or be impossible to capture due to occlusion, sensor artifacts, etc. We call such modalities ineffectual. Ineffectual modalities are especially prevalent in in-the-wild datasets.
5. CHALLENGES
Challenges:
o Deciding which modalities should be combined, and how
o Lack of agreement on the most efficient mechanism for combining (fusing) multiple modalities
7. LATE FUSION
Gunes et al. (2007): Multimodal emotion recognition from expressive faces and body gestures
Lee et al. (2018): Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data
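Both works combine per-modality decisions rather than raw features. As an illustration of the late-fusion idea itself (a minimal sketch, not either paper's exact pipeline), one can average per-modality class probabilities at the decision level:

```python
import numpy as np

# Minimal late-fusion sketch (illustrative only): each unimodal classifier
# produces class probabilities independently; the final prediction combines
# them at the decision level with a (possibly weighted) average.

def late_fusion(probs_per_modality, weights=None):
    """probs_per_modality: list of (num_classes,) probability vectors."""
    probs = np.stack(probs_per_modality)          # (num_modalities, num_classes)
    if weights is None:
        weights = np.ones(len(probs)) / len(probs)
    fused = np.average(probs, axis=0, weights=weights)
    return fused.argmax(), fused

# Example: face and speech classifiers over 6 emotion classes
face   = np.array([0.10, 0.60, 0.10, 0.10, 0.05, 0.05])
speech = np.array([0.20, 0.50, 0.10, 0.10, 0.05, 0.05])
label, fused = late_fusion([face, speech])
```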
8. RELATED WORK
Comparison of multimodal methods:

Dataset   | Method                                     | Modalities                     | F1 score | MA
----------|--------------------------------------------|--------------------------------|----------|-------
IEMOCAP   | Kim et al. (2013): Deep Belief Network     | Motion capture and audio-video | 72.8%    |
IEMOCAP   | Yoon et al. (2019): Multi-hop attention    | Text and speech                | 77.6%    |
IEMOCAP   | Majumdar et al. (2018)                     | Text, audio and video          | 76.5%    |
CMU-MOSEI | Zadeh et al. (2018): Dynamic Fusion Graph  | Language, vision and acoustic  | 76.3%    |
CMU-MOSEI | Lee et al. (2018)                          | Text and speech                | 89%      | 84.08%
CMU-MOSEI | Sahay et al. (2018): Tensor Fusion Network | Text and audio                 | 66.8%    |
10. MODALITIES CHECK
Purpose: filter out ineffectual data to improve accuracy on real-world data
Use Canonical Correlation Analysis (CCA) to compute the correlation score, ρ, of every pair of input modalities:
compute the correlation score for each pair {f_i, f_j}, then check it against an empirically chosen threshold (τ).
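A minimal sketch of this check (assumed implementation, not the authors' code), computing a CCA-based ρ for a pair of modality feature matrices and comparing it to τ:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hedged sketch of the modalities check: rho is the correlation of the first
# pair of canonical variates produced by CCA; pairs with rho below tau are
# flagged as ineffectual.

def modality_pair_score(F_i, F_j, n_components=1):
    """F_i, F_j: (num_samples, feature_dim) matrices for two modalities."""
    cca = CCA(n_components=n_components)
    U, V = cca.fit_transform(F_i, F_j)            # projected canonical variates
    rho = np.corrcoef(U[:, 0], V[:, 0])[0, 1]     # correlation score
    return rho

tau = 0.1                                         # empirically chosen threshold
F_text, F_audio = np.random.rand(100, 30), np.random.rand(100, 12)
rho = modality_pair_score(F_text, F_audio)
effectual = rho > tau                             # passes the modality check?
```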
11. REGENERATING PROXY FEATURE VECTORS
Purpose: reduce the noise of each input by regenerating proxy feature vectors for the ineffectual (missing) modalities
Find v_j = argmin_j d(v_j, f), where f is an effectual feature vector and d(·,·) is any distance metric
Compute constants a_i ∈ R by solving the following linear system:
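The linear system itself appeared as an image on the slide. A plausible reconstruction (an assumption, following the M3ER-style procedure this section appears to describe): express the effectual feature as a linear combination of its nearest training neighbors, then apply the same coefficients to the corresponding vectors of the ineffectual modality.

```python
import numpy as np

# Hedged sketch of proxy-feature regeneration (assumed reconstruction of the
# slide's elided linear system; not the authors' code).

def regenerate_proxy(f_eff, V_eff, V_ineff, k=3):
    """
    f_eff   : (d_e,) effectual feature vector at test time
    V_eff   : (N, d_e) effectual-modality features from training data
    V_ineff : (N, d_i) corresponding ineffectual-modality features
    """
    # 1) k nearest neighbors of f_eff (Euclidean here; any metric d works)
    dists = np.linalg.norm(V_eff - f_eff, axis=1)
    idx = np.argsort(dists)[:k]

    # 2) constants a_i from the linear system  f_eff ≈ sum_i a_i * v_i
    A = V_eff[idx].T                        # (d_e, k)
    a, *_ = np.linalg.lstsq(A, f_eff, rcond=None)

    # 3) proxy vector: same coefficients applied to the ineffectual modality
    return V_ineff[idx].T @ a               # (d_i,)

V_text  = np.random.rand(100, 300)          # training text features (effectual)
V_audio = np.random.rand(100, 74)           # training audio features (missing at test)
proxy_audio = regenerate_proxy(np.random.rand(300), V_text, V_audio)
```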
12. MULTIPLICATIVE MODALITY FUSION
Idea: explicitly suppress the weaker (less expressive) modalities, which indirectly boosts the stronger (more expressive) modalities
The loss for the i-th modality:
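The formula itself was an image on the slide. A reconstruction based on the multiplicative-combination loss of Liu et al. (2018), which this fusion scheme follows (assumed, not copied from the source), with M modalities, true class y, and hyperparameter β ≥ 0:

```latex
% p_i^y: modality i's predicted probability for the true class y.
% The product term down-weights modality i's loss when the other
% modalities already classify the sample confidently.
\[
  L_i \;=\; -\Big(\prod_{j \neq i} \big(1 - p_j^{\,y}\big)\Big)^{\beta/(M-1)} \log p_i^{\,y}
\]
```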
13. MODALITY COMBINATION
Requirements:
o Be able to process sophisticated, data-driven in-the-wild datasets (CMU-MOSEI, YouTube, ...), which contain noise, occlusion, ...
o Increase reliability
Proposed combination (see the sketch below):
o Use single-hidden-layer LSTMs, one per modality, each with output dimension 32.
o Then use multiplicative fusion to combine the three 32-dimensional feature vectors.
o This feature vector is concatenated with the final value of the memory variable, and the resulting 160-dimensional feature vector is passed through a 64-dimensional fully connected layer followed by a 6-dimensional fully connected layer to generate the network outputs.
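A minimal PyTorch sketch of this combination network. The 32/160/64/6 dimensions follow the slide; the input feature sizes and the exact layout of the 160-dimensional concatenation (three per-modality vectors + fused vector + memory state) are assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of the proposed combination network (assumed input dims and
# concatenation layout; not the authors' exact implementation).

class MultiplicativeFusionNet(nn.Module):
    def __init__(self, dims=(300, 74, 35), hidden=32, num_classes=6):
        super().__init__()
        # one single-hidden-layer LSTM per modality, each with 32-dim output
        self.lstms = nn.ModuleList([nn.LSTM(d, hidden, batch_first=True) for d in dims])
        self.fc = nn.Sequential(nn.Linear(160, 64), nn.ReLU(), nn.Linear(64, num_classes))

    def forward(self, inputs):                     # inputs: list of (B, T, d_m)
        feats, mems = [], []
        for lstm, x in zip(self.lstms, inputs):
            _, (h, c) = lstm(x)
            feats.append(h[-1])                    # (B, 32) last hidden state
            mems.append(c[-1])                     # (B, 32) final memory (cell) state
        # multiplicative fusion: element-wise product of the three 32-dim vectors
        fused = feats[0] * feats[1] * feats[2]     # (B, 32)
        mem = torch.stack(mems).mean(0)            # (B, 32), assumed aggregation
        # assumed layout: 3 x 32 per-modality + 32 fused + 32 memory = 160 dims
        z = torch.cat(feats + [fused, mem], dim=1) # (B, 160)
        return self.fc(z)                          # (B, 6) emotion logits

net = MultiplicativeFusionNet()
logits = net([torch.randn(4, 20, 300), torch.randn(4, 20, 74), torch.randn(4, 20, 35)])
```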
14. EXPERIMENTS
Feature extraction:
Text (f_t): pre-trained 300-dimensional GloVe word embeddings
Audio: the COVAREP software (Degottex et al., 2014) is used to extract acoustic features, including 12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmenting features, glottal source parameters, peak-slope parameters, and maxima dispersion quotients
Video: a combination of face embeddings obtained from state-of-the-art facial recognition models, facial action units, and facial landmarks is used for CMU-MOSEI
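A minimal sketch of the text-feature step (the file name is an assumption; the 300-dim vectors are the standard Stanford NLP GloVe release):

```python
import numpy as np

# Minimal sketch of the GloVe text features (assumed file path: the standard
# glove.6B.300d release; not the authors' exact preprocessing).

def load_glove(path="glove.6B.300d.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split(" ")
            vectors[word] = np.asarray(vals, dtype=np.float32)
    return vectors

glove = load_glove()
# f_t: one 300-dim embedding per token of the utterance transcript
f_t = np.stack([glove.get(w, np.zeros(300, dtype=np.float32))
                for w in "i am feeling great today".split()])
```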
16. LIMITATIONS
• Often confuses certain class labels with each other
• Human perception of emotion at a given instant is itself not absolutely precise
• Adding context to emotion recognition may be worth considering
Classification pipeline of the proposed method. Once the visual and audio features are extracted, we construct a radial basis function (RBF) kernel from each descriptor. We then use MKL to optimally combine the feature kernels for input to the SVM classifier.
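A minimal sketch of this kernel-combination idea (illustrative: fixed convex weights stand in for a full multiple-kernel-learning solver, which would typically be tuned, e.g. by cross-validation):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Hedged sketch of the RBF-kernel + MKL + SVM pipeline (illustrative only;
# an MKL solver would learn the kernel weights rather than fix them).

X_vis, X_aud = np.random.rand(60, 128), np.random.rand(60, 74)  # toy descriptors
y = np.random.randint(0, 2, 60)

K_vis = rbf_kernel(X_vis)                   # one RBF kernel per descriptor
K_aud = rbf_kernel(X_aud)
w = [0.6, 0.4]                              # kernel weights (MKL would learn these)
K = w[0] * K_vis + w[1] * K_aud             # optimally combined kernel

clf = SVC(kernel="precomputed").fit(K, y)   # SVM on the combined kernel
pred = clf.predict(K)                       # test data would use K(test, train)
```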
A direct way to learn the relationship between these two feature vectors would be a shallow model, i.e., a simple concatenation of the two feature vectors. However, since the correlations between feature vectors from speech and text are highly non-linear, it is difficult for a shallow model to learn proper multimodal representations. We therefore use trainable attention mechanisms to learn the non-linear correlations between these feature vectors. Attention mechanisms also help retain information in the time domain by forming a temporal embedding between the two feature vectors.
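A minimal sketch of such a trainable cross-modal attention layer between text and speech feature sequences (the shapes and projection sizes are assumptions, not Lee et al.'s exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal cross-modal attention sketch: each text step attends over all
# speech steps, forming a temporal embedding between the two sequences.

class CrossModalAttention(nn.Module):
    def __init__(self, d_text=128, d_speech=128, d_attn=64):
        super().__init__()
        self.q = nn.Linear(d_text, d_attn)      # queries from text steps
        self.k = nn.Linear(d_speech, d_attn)    # keys from speech steps
        self.v = nn.Linear(d_speech, d_attn)    # values from speech steps

    def forward(self, text, speech):            # (B, T_t, d_text), (B, T_s, d_speech)
        q, k, v = self.q(text), self.k(speech), self.v(speech)
        scores = q @ k.transpose(1, 2) / k.size(-1) ** 0.5   # (B, T_t, T_s)
        attn = F.softmax(scores, dim=-1)        # non-linear correlation weights
        return attn @ v                         # (B, T_t, d_attn) temporal embedding

emb = CrossModalAttention()(torch.randn(2, 12, 128), torch.randn(2, 40, 128))
```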
2: Using the cross-validation method to integrate