1. Research and development of models for speech emotion recognition
Bagus Tris Atmaja, PhD
Lecturer, Engineering Physics, ITS
Postdoctoral Researcher, AIRC, AIST
Version: 10/13/22
2. Outline
● Introduction
● General approach for research
● Selected current research topics (problems & solutions):
– General speech emotion recognition
– Multimodal information fusion: video, audio, text
– Multitask learning
● Future research & prospects
● Conclusions
3. Self-Introduction
● 2005 – 2009: Undergraduate, Eng. Physics, ITS
● 2010 – 2012: Master, Eng. Physics, ITS
● 2011 – 2012: Research student, Kumamoto Univ.
● 2012 – 2014: Engineer, Shimizu Seisakusho, Mie
● 2014 – now: Lecturer, ITS
● 2017 – 2018: Research student, JAIST
● 2018 – 2021: PhD, JAIST
● 2021 – now: Postdoctoral researcher, AIST
4. Research Theme: General Approach
Problem-based and data-driven research, e.g., mental-state monitoring, satisfaction evaluation, abnormal sound detection.
5. Selected Research Topic: Speech Emotion Recognition (SER)
[Figure: typical SER workflow: dataset → pre-processing/post-processing → feature selection → model → categorical emotion via a loss function. Open questions: Do acoustic embeddings correlate to emotion? Which model is suited for SER? Which loss function? Are linguistic features useful?]
Problems in previous SER research:
- Most research was conducted on categorical emotion.
- For categorical SER, the overall performance is not satisfactory (<70% accuracy).
- Most SER research was conducted on English.
- Is there any other information that can be obtained from speech in addition to emotion?
6. Why dimensional SER?
● Most previous studies were conducted on categorical emotion → dimensional SER.
● Why dimensional SER? Because biological categories such as discrete emotion labels have no essence due to high variability; continuous scores are a more universal evaluation.
● For dimensional SER, the performance on valence is lower than on the other dimensions → add linguistic information (in addition to acoustics).
7. Ideas for dimensional SER
● For dimensional SER, the performance on valence is lower than on the other dimensions → add linguistic information (in addition to acoustics):
– Evaluate different word embeddings
– Adjust the loss function for each emotion dimension (valence, arousal, dominance)
– Evaluate early and late fusion
– Extract emotion-correlated speech embeddings
[Figure: dimensional emotion regression pipeline, with open questions on the loss function, linguistic features, and whether acoustic embeddings correlate to emotion]
8. How to improve low valence performance?
Add linguistic information, since valence is similar to sentiment, e.g., “Your service is bad” (negative sentiment, low valence).
The goal is to obtain linguistic scores from transcribed text (e.g., with GloVe) that correlate with valence (V), arousal (A), and dominance (D); the scores shown on the slide come from a dictionary.
[Figure: acoustic cues (e.g., prosody) and linguistic cues both contribute to emotion prediction]
B. T. Atmaja and M. Akagi, “Improving Valence Prediction in Dimensional Speech Emotion Recognition Using Linguistic Information,” in Oriental COCOSDA, 2020, pp. 166–171. [GitHub: https://github.com/bagustris/dser_with_text]
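As an illustration of the linguistic branch, here is a minimal sketch that mean-pools GloVe word vectors of a transcript into a fixed-size feature vector for a valence regressor. The file name (glove.6B.300d.txt) and the 300-dim size are assumptions, not necessarily the paper's exact setup.

```python
# Minimal sketch: turn a transcript into a fixed-size linguistic feature
# vector by mean-pooling GloVe word vectors (file name/dim are assumptions).
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Read GloVe text format: one 'word v1 v2 ... vN' entry per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def transcript_embedding(text, vectors, dim=300):
    """Mean-pool the vectors of in-vocabulary words; zeros if none match."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim, dtype=np.float32)
    return np.mean([vectors[w] for w in words], axis=0)

# glove = load_glove()
# feat = transcript_embedding("your service is bad", glove)  # 300-dim vector
```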
9. How about the result?
[Figure: bar chart of relative improvement (%) for HSF+WE, HSF+Word2Vec, HSF+FastText, HSF+GloVe, and HSF+BERT, comparing valence STL, valence MTL, and averaged CCC MTL]
10. Is there any way to optimize acoustic+linguistic fusion for SER?
Yes, one way is to adjust the loss function for each emotion dimension. The multitask CCC loss weights each dimension's (1 − CCC) term with factors α and β (plus a third factor for the remaining dimension), in three variants: no parameters (equal weights), 2 parameters, and 3 parameters. The best reported setting is α = 0.7, β = 0.2, with CCC = 0.51.
B. T. Atmaja and M. Akagi, “Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning,” APSIPA Trans. Signal Inf. Process., vol. 9, p. e17, May 2020. [GitHub: https://github.com/bagustris/dimensional-ser]
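A minimal sketch of such a weighted multitask CCC loss, assuming PyTorch and the dimension order (valence, arousal, dominance); the paper's own implementation may differ.

```python
# Minimal sketch of a weighted multitask CCC loss in PyTorch; the dimension
# order (V, A, D) and variable names are assumptions.
import torch

def ccc(pred, gold):
    """Concordance correlation coefficient between two 1-D tensors."""
    pm, gm = pred.mean(), gold.mean()
    pv, gv = pred.var(unbiased=False), gold.var(unbiased=False)
    cov = ((pred - pm) * (gold - gm)).mean()
    return 2 * cov / (pv + gv + (pm - gm) ** 2)

def mtl_ccc_loss(preds, golds, alpha=0.7, beta=0.2):
    """Two-parameter variant: the third weight is gamma = 1 - alpha - beta."""
    gamma = 1.0 - alpha - beta
    losses = [1 - ccc(preds[:, i], golds[:, i]) for i in range(3)]
    return alpha * losses[0] + beta * losses[1] + gamma * losses[2]
```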
11. Is concatenation the only method to fuse acoustic and linguistic information?
No; early (feature) fusion and late (decision) fusion can also be applied.
pAA: python audio analysis (34 acoustic features); sil: silence.
B. T. Atmaja and M. Akagi, “Two-stage dimensional emotion recognition by fusing predictions of acoustic and text networks using SVM,” Speech Commun., vol. 126, pp. 9–21, Feb. 2021. [GitHub: https://github.com/bagustris/two-stage-ser]
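A sketch contrasting the two fusion styles; the SVM meta-learner echoes the cited two-stage approach, but feature shapes and model choices here are illustrative assumptions.

```python
# Early (feature) fusion vs. late (decision) fusion, sketched with scikit-learn.
import numpy as np
from sklearn.svm import SVR

def early_fusion(acoustic, linguistic):
    """Concatenate modality features before a single downstream model."""
    return np.concatenate([acoustic, linguistic], axis=-1)

def fit_late_fusion(acoustic_preds, text_preds, gold):
    """Train an SVM regressor on the predictions of the unimodal networks."""
    stacked = np.column_stack([acoustic_preds, text_preds])
    return SVR().fit(stacked, gold)
```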
12. How to extract acoustic embeddings that correlate with emotional speech?
Fine-tune a pre-trained model on an affective speech dataset (MSP-Podcast).
Acoustic embedding: last hidden states (1024-dim) + logits (3-dim, VAD).
[1] J. Wagner et al., “Dawn of the transformer era in speech emotion recognition: closing the valence gap,” arXiv:2203.07378, Mar. 2022.
[2] B. T. Atmaja and A. Sasou, “Leveraging Pre-Trained Acoustic Feature Extractor for Affective Vocal Bursts Tasks,” APSIPA ASC 2022. [GitHub: https://github.com/bagustris/A-VB2022]
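A minimal sketch of extracting the 1024-dim part of this embedding, using what is, to my knowledge, the checkpoint released alongside Wagner et al. [1]; loading it as a plain Wav2Vec2Model keeps only the encoder, so the 3-dim VAD logits from the fine-tuned head are omitted here.

```python
# Minimal sketch: time-pool the last hidden states of a wav2vec 2.0 model
# fine-tuned on MSP-Podcast as a 1024-dim acoustic embedding. Loading with
# Wav2Vec2Model drops the VAD regression head (the 3-dim logits).
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

name = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
encoder = Wav2Vec2Model.from_pretrained(name)

def acoustic_embedding(waveform, sr=16000):
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(inputs.input_values).last_hidden_state  # (1, T, 1024)
    return hidden.mean(dim=1).squeeze(0)  # mean-pooled 1024-dim embedding
```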
13. Do the fusion and the acoustic embedding also improve categorical SER?
[Figure: fusion pipeline, with ASR providing transcripts for the text branch]
Yes, they do. The acoustic embedding trained for dimensional emotion also improves categorical SER.
[1] B. T. Atmaja, K. Shirai, and M. Akagi, “Speech Emotion Recognition Using Speech Feature and Word Embedding,” in APSIPA ASC 2019, pp. 519–523. [GitHub: https://github.com/bagustris/Apsipa2019_SpeechText]
[2] B. T. Atmaja and A. Sasou, “Effects of Data Augmentations on Speech Emotion Recognition,” Sensors, vol. 22, no. 16, p. 5941, Aug. 2022, doi: 10.3390/s22165941. [GitHub: https://github.com/bagustris/ser_aug]
14. Most SER research was conducted in English; how about Japanese?
We conducted SER research on Japanese and found a notable phenomenon while evaluating different data splits.
On the IEMOCAP dataset, splitting by speaker and script (SP&SC = SI&TI) is the most challenging task (Pepino et al., 2020). On the JTES dataset, we spotted a similar phenomenon, highlighting the dependency of SER on linguistic information.
The text-independent (TI) split obtained the lowest scores, probably because (1) the number of test samples is small (200) and (2) the model learns the STI (speaker- and text-independent) split better than the TI split.
15. How to improve Japanese SER?
Use a pre-trained model plus data augmentation.
[1] B. T. Atmaja and A. Sasou, “Effects of Data Augmentations on Speech Emotion Recognition,” Sensors, vol. 22, no. 16, p. 5941, Aug. 2022, doi: 10.3390/s22165941. [GitHub: https://github.com/bagustris/ser_aug]
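A minimal sketch of common waveform augmentations; the cited paper evaluates several methods, and the specific functions and parameter values below are illustrative assumptions.

```python
# Illustrative waveform augmentations: additive noise, speed, and pitch.
import numpy as np
import librosa

def add_noise(y, snr_db=20):
    """Add white noise at a target signal-to-noise ratio (in dB)."""
    noise = np.random.randn(len(y)).astype(y.dtype)
    scale = np.sqrt((y ** 2).mean() / (10 ** (snr_db / 10) * (noise ** 2).mean()))
    return y + scale * noise

def speed_perturb(y, rate=1.1):
    """Time-stretch without changing pitch."""
    return librosa.effects.time_stretch(y, rate=rate)

def pitch_shift(y, sr=16000, n_steps=2):
    """Shift pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
```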
16. Is there any information besides emotion that can be obtained from speech simultaneously?
Yes, there is. Naturalness of speech shows a promising score in multitask learning with emotion recognition, and it helps improve SER scores.
B. T. Atmaja, A. Sasou, and M. Akagi, “Speech Emotion and Naturalness Recognitions with Multitask and Single-task Learnings,” IEEE Access, 2022, doi: 10.1109/ACCESS.2022.3189481. [Pre-trained model: https://github.com/bagustris/sner]
17. Is there any information besides emotion that can be obtained from speech simultaneously? (2)
[Figure: multitask architecture: input → shared layer → independent layers → emotion, age, and country outputs]
Yes, there is. Age and country can also be obtained from speech simultaneously, as sketched below.
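A minimal Keras sketch of the shared-layer/independent-layer idea in the diagram; the layer sizes, output dimensions, and losses are assumptions, not the published configuration.

```python
# Shared trunk with independent heads for emotion, age, and country.
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(1024,), name="acoustic_embedding")
shared = layers.Dense(256, activation="relu", name="shared_layer")(inputs)

emo = layers.Dense(4, activation="softmax", name="emotion")(
    layers.Dense(64, activation="relu")(shared))
age = layers.Dense(1, name="age")(
    layers.Dense(64, activation="relu")(shared))
country = layers.Dense(10, activation="softmax", name="country")(
    layers.Dense(64, activation="relu")(shared))

model = Model(inputs, [emo, age, country])
model.compile(optimizer="adam",
              loss={"emotion": "categorical_crossentropy",
                    "age": "mse",
                    "country": "categorical_crossentropy"})
```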
18. Potential for Production
● Cloud computing: the SER model runs on the cloud (e.g., Hugging Face, ABCI, AWS) behind a public API (see the sketch after this list).
● Edge computing (on-device): smaller (lite) models, private data, offline use.
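For the cloud route, a hedged sketch of querying a hosted SER model through the Hugging Face Inference API; the client usage and the public checkpoint named below are assumptions about one possible setup, not the author's deployment.

```python
# Query a cloud-hosted SER model via the Hugging Face Inference API.
# 'superb/wav2vec2-base-superb-er' is a public emotion-recognition checkpoint
# used purely as an example.
from huggingface_hub import InferenceClient

client = InferenceClient()
result = client.audio_classification(
    "speech.wav", model="superb/wav2vec2-base-superb-er"
)
print(result)  # ranked (label, score) predictions
```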
19. Some future works (offered final project/thesis topics)
● Build a dataset and pre-trained models for Bahasa Indonesia.
● Evaluate the effectiveness of single-task SER vs. multitask general speech processing (SER + ASR + sentiment + language/dialect + etc.) → I believe MTL is superior.
● Evaluate the ethics (purpose of use), bias (gender, ethnicity), and latency (real-time factor) of the model.
● On-device (edge computing) SER for data privacy, hardware optimization, etc.
Example topics: recognizing common sentiment (negative, neutral, positive) and emotion (neutral, happy, angry, sad) jointly, possibly on-device, with the MOSEI and JTES datasets; concordance correlation coefficient (CCC) loss with machine learning, possibly on-device.
20. Future research: AI for Speech
[Figure: AI extracting information manifested in speech (Fujisaki, 2003): text, dialect, speaker, age, gender, emotion, naturalness, disease → knowledge]
Short-term goals:
● Pre-trained models for four common emotion categories: multilingual SER
● Pre-trained model for SER suited to the Japanese language
● Insights on affective vocal bursts (cry, laughter, etc.)
21. Conclusions
● Most SER research has been conducted in English; there is a need to boost research on SER for non-English languages (Bahasa Indonesia?).
● Multitask learning and multimodal information fusion show better performance, but there are trade-offs.
● Not only emotion can be extracted from speech, but also naturalness, age, and country; more knowledge will follow.
● Data and the training paradigm (e.g., fine-tuning) are the key to producing models, since algorithms/methods are openly available.
● There is a need to protect user data while collecting information to increase well-being, without sacrificing model performance.