Research and development of models
for speech emotion recognition
Bagus Tris Atmaja, PhD
Lecturer, Engineering Physics, ITS
Postdoctoral Researcher, AIRC, AIST
Version: 10/13/22
2/22
Outline
● Introduction
● General approach for research
● Selected current research topics (problems & solutions):
  – General speech emotion recognition
  – Multimodal information fusion: video, audio, text
  – Multitask learning
● Future research & prospects
● Conclusions
3/22
Self Introduction
● 2005 – 2009: Undergraduate, Eng. Physics, ITS
● 2010 – 2012: Master, Eng. Physics, ITS
● 2011 – 2012: Research student, Kumamoto Univ.
● 2012 – 2014: Engineer, Shimizu Seisakusho, Mie
● 2014 – Now : Lecturer, ITS
● 2017 – 2018: Research student, JAIST
● 2018 – 2021: PhD, JAIST
● 2021 – Now : Postdoctoral researcher, AIST
4/22
Research Theme: General Approach
Problem-based and data-driven research, e.g., mental-state
monitoring, satisfaction evaluation, abnormal sound detection
5/22
Selected Research Topic:
Speech Emotion Recognition (SER)
Typical SER workflow: Dataset → Feature selection? → Model suited for SER? → Categorical emotion,
with open questions on pre-/post-processing, the loss function, linguistic features,
and whether acoustic embeddings correlate to emotion.
Problems in previous SER research:
- Most research was conducted for categorical emotion
- For categorical SER, the overall performance is not satisfactory (<70% accuracy)
- Most SER research was conducted in English
- Can any other information be obtained from speech in addition to emotion?
6/22
Why dimensional SER?
● Most previous studies were conducted for categorical emotion → Dimensional SER
● Why dimensional SER? Because biological categories such as discrete emotions have no fixed essence due to high variability; continuous scores are a more universal evaluation.
● For dimensional SER, the performance of valence is lower than that of the other dimensions → add linguistic information (in addition to acoustics)
7/22
Ideas for dimensional SER
● For dimensional SER, the performance of valence is lower than that of the other dimensions → add linguistic information (in addition to acoustics):
  – Evaluate different word embeddings
  – Adjust the loss function for each emotion dimension (valence, arousal, dominance)
  – Evaluate early and late fusion
  – Extract emotion-correlated speech embeddings
[Workflow diagram: dimensional emotion regression, with open questions on the loss function, linguistic features, and emotion-correlated acoustic embeddings.]
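Dimensional SER is commonly evaluated with the concordance correlation coefficient (CCC), which penalizes both decorrelation and bias between predicted and annotated scores. A minimal NumPy sketch of the metric (the function name is ours):

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D arrays."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2.0 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)

v_true = np.array([0.1, 0.5, 0.9, 0.3])
print(ccc(v_true, v_true))        # perfect agreement -> 1.0
print(ccc(v_true, v_true + 0.5))  # a constant bias pushes CCC below 1
```

Unlike the Pearson correlation, adding a constant bias to otherwise perfect predictions lowers the CCC, which is why it is preferred for dimensional emotion regression.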
8/22
How to improve low valence performance?
B. T. Atmaja and M. Akagi, “Improving Valence Prediction in Dimensional Speech Emotion Recognition Using Linguistic
Information,” in Oriental COCOSDA, 2020, pp. 166–171. [github https://github.com/bagustris/dser_with_text]
Add linguistic information, since valence (emotion) is similar to sentiment;
e.g., “Your service is bad” (negative sentiment, low valence).
The goal is to obtain linguistic scores from the transcribed text
(e.g., via GloVe) that correlate with valence (V), arousal (A),
and dominance (D); the scores shown above are from a dictionary.
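The dictionary-based scoring idea can be sketched as a lexicon lookup followed by per-dimension averaging. The mini lexicon below is hypothetical, with illustrative values only, not taken from any real affective dictionary:

```python
# Hypothetical mini VAD lexicon: word -> (valence, arousal, dominance).
# Values are illustrative only, not from a real affective dictionary.
lexicon = {
    "bad": (0.2, 0.6, 0.5),
    "service": (0.5, 0.3, 0.5),
    "your": (0.5, 0.3, 0.5),
}

def text_vad(words):
    """Average the per-word VAD scores of the words found in the lexicon."""
    hits = [lexicon[w] for w in words if w in lexicon]
    return tuple(sum(dim) / len(hits) for dim in zip(*hits))

v, a, d = text_vad("your service is bad".split())
print(v)  # low valence for a negative sentence
```

In practice the slide's approach replaces the fixed lexicon with learned word embeddings (e.g., GloVe) feeding a regressor, but the lookup-and-pool structure is the same.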
9/22
How about the result?
[Bar chart: relative improvement (%) of Valence STL, Valence MTL, and averaged CCC MTL
for HSF combined with WE, Word2Vec, FastText, GloVe, and BERT word embeddings.]
10/22
Is there any way to optimize acoustic+linguistic
fusion for SER?
B. T. Atmaja and M. Akagi, “Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning,”
APSIPA Trans. Signal Inf. Process., vol. 9, no. May, p. e17, May 2020. [Github https://github.com/bagustris/dimensional-ser]
Yes, one way is to adjust the loss-function weights for each emotion dimension.
CCC-based MTL loss variants: no weighting parameters, 2 parameters (α, β), or 3 parameters.
Best result: α = 0.7, β = 0.2, CCC = 0.51.
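A sketch of the 2-parameter weighted CCC loss follows. The constraint γ = 1 − α − β and the mapping of the weights onto the V/A/D dimensions are assumptions of this sketch, not necessarily the exact formulation of the cited paper:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D arrays."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2.0 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)

def weighted_ccc_loss(y_true, y_pred, alpha=0.7, beta=0.2):
    """CCC-based multitask loss with one weight per V/A/D dimension.

    gamma = 1 - alpha - beta and the weight-to-dimension mapping are
    assumptions of this sketch.
    """
    gamma = 1.0 - alpha - beta
    losses = [1.0 - ccc(y_true[:, k], y_pred[:, k]) for k in range(3)]
    return alpha * losses[0] + beta * losses[1] + gamma * losses[2]

rng = np.random.default_rng(0)
vad = rng.uniform(size=(16, 3))  # dummy V/A/D annotations
print(weighted_ccc_loss(vad, vad))  # perfect prediction -> 0.0
```

Tuning α and β lets the optimizer spend more capacity on the hardest dimension (valence) without training separate models.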
11/22
Is concatenation the only method to fuse acoustic
and linguistic information?
pAA: python audio analysis (34 acoustic features)
sil: silence
B. T. Atmaja and M. Akagi, “Two-stage dimensional emotion recognition by fusing predictions of acoustic and text networks using SVM,”
Speech Commun., vol. 126, pp. 9–21, Feb. 2021. [Github https://github.com/bagustris/two-stage-ser]
No; both early (feature-level) and late (decision-level) fusion can also be applied.
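The two fusion strategies can be sketched with dummy features; the dimensions (34 acoustic, 300 linguistic) mirror the slide, and the averaging in the late-fusion step is a stand-in for the SVM second stage used in the cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(8, 34))     # e.g., 34 pAA acoustic features
linguistic = rng.normal(size=(8, 300))  # e.g., 300-dim word embeddings

# Early (feature-level) fusion: concatenate, then train a single model.
early = np.concatenate([acoustic, linguistic], axis=1)
print(early.shape)  # (8, 334)

# Late (decision-level) fusion: fuse per-modality predictions; the cited
# paper feeds them to an SVM, a simple average stands in for it here.
pred_acoustic = acoustic.mean(axis=1)    # placeholder per-modality predictions
pred_linguistic = linguistic.mean(axis=1)
late = 0.5 * (pred_acoustic + pred_linguistic)
print(late.shape)  # (8,)
```

Early fusion lets one model learn cross-modal interactions; late fusion keeps the modality networks independent and combines only their outputs.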
12/22
How to extract acoustic embeddings that correlate with emotional speech?
Fine-tune a pre-trained model on an affective speech dataset (MSP-Podcast).
Acoustic embedding: last hidden states (1024-dim) + logits (3-dim, VAD).
[1] J. Wagner et al., “Dawn of the transformer era in speech emotion recognition: closing the valence gap,” arXiv:2203.07378, Mar. 2022.
[2] B. T. Atmaja and A. Sasou, “Leveraging Pre-Trained Acoustic Feature Extractor For Affective Vocal Bursts Tasks,” APSIPA ASC 2022.
[Github https://github.com/bagustris/A-VB2022]
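Structurally, the embedding is just the concatenation of the two model outputs per utterance. The values below are random placeholders standing in for the fine-tuned model's real time-pooled hidden state and VAD logits:

```python
import numpy as np

# Placeholder model outputs for one utterance (random values stand in for
# the fine-tuned model's real outputs): time-pooled last hidden state and
# valence/arousal/dominance logits.
last_hidden = np.random.default_rng(1).normal(size=1024)
vad_logits = np.array([0.4, 0.6, 0.5])

# Acoustic embedding: last hidden states + logits -> 1027-dim vector.
embedding = np.concatenate([last_hidden, vad_logits])
print(embedding.shape)  # (1027,)
```

The resulting 1027-dim vector is then used as the input feature for downstream (e.g., vocal-burst) tasks.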
13/22
Do the fusion and the acoustic embedding also improve categorical SER?
[1] B. T. Atmaja, K. Shirai, and M. Akagi, “Speech Emotion Recognition Using Speech Feature and Word Embedding,” in 2019 Asia-Pacific Signal and
Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 519–523.
[Github https://github.com/bagustris/Apsipa2019_SpeechText]
[2] B. T. Atmaja and A. Sasou, “Effects of Data Augmentations on Speech Emotion Recognition,” Sensors, vol. 22, no. 16, p. 5941, Aug. 2022, doi:
10.3390/s22165941. [Github https://github.com/bagustris/ser_aug]
Yes, they do. An acoustic embedding trained for dimensional SER also improves categorical SER.
14/22
Most SER research was conducted in English; how about Japanese?
We conducted SER research on Japanese and observed a notable phenomenon when evaluating with different data splits.
The text-independent (TI) split obtained the lowest scores, probably because: (1) the test set is small (200 samples), and (2) STI learns better than TI.
On the IEMOCAP dataset, splitting by speaker and script (SP&SC = SI&TI) was the most challenging task (Pepino et al., 2020).
On the JTES dataset, we observed the same phenomenon, highlighting the dependency of SER on linguistic information.
15/22
How to improve Japanese SER?
Pre-trained model + data augmentation.
[1] B. T. Atmaja and A. Sasou, “Effects of Data Augmentations on Speech Emotion Recognition,” Sensors, vol. 22, no. 16, p. 5941, Aug. 2022, doi: 10.3390/s22165941.
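One common waveform-level augmentation is additive noise at a target signal-to-noise ratio. The cited paper evaluates several augmentations; this sketch shows only the noise-addition idea, with our own function name:

```python
import numpy as np

def add_noise(wave, snr_db, rng=None):
    """Add white noise to a waveform at a target SNR in dB.

    Additive noise is one common augmentation; this sketch is
    illustrative, not the exact pipeline of the cited paper.
    """
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(size=wave.shape)
    p_signal = np.mean(wave ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that p_signal / p_scaled_noise == 10**(snr_db/10).
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return wave + scale * noise

wave = np.sin(np.linspace(0, 100, 16000))  # 1 s of a dummy waveform
noisy = add_noise(wave, snr_db=10, rng=np.random.default_rng(0))
```

Training on such perturbed copies exposes the model to more acoustic variation than the original corpus provides, which is the mechanism behind the reported gains.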
16/22
Can any information besides emotion be obtained from speech simultaneously?
Yes. Naturalness of speech achieves promising scores in multitask learning with emotion recognition, and it helps improve SER scores.
B. T. Atmaja, A. Sasou, and M. Akagi, “Speech Emotion and Naturalness Recognitions with Multitask and Single-task Learnings,” IEEE Access, 2022, doi: 10.1109/ACCESS.2022.3189481. [Pre-trained model: https://github.com/bagustris/sner]
17/22
Can any information besides emotion be obtained from speech simultaneously?
Yes. Age and country can also be obtained from speech simultaneously.
[Architecture diagram: input → shared layer → independent layers → emotion, age, country.]
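The shared-layer/independent-layer split in the diagram can be sketched as a single forward pass; all sizes below (40-dim input, 16-dim shared layer, 4 emotions, 10 countries) are illustrative choices, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 40))  # batch of 4 input feature vectors

# Shared layer: one weight matrix reused by all tasks.
w_shared = rng.normal(size=(40, 16))
h = np.tanh(x @ w_shared)

# Independent (task-specific) layers on top of the shared representation.
w_emo = rng.normal(size=(16, 4))       # e.g., 4 emotion classes
w_age = rng.normal(size=(16, 1))       # age regression
w_country = rng.normal(size=(16, 10))  # e.g., 10 countries

emotion, age, country = h @ w_emo, h @ w_age, h @ w_country
print(emotion.shape, age.shape, country.shape)  # (4, 4) (4, 1) (4, 10)
```

Because every task backpropagates through the same shared layer, the representation is pushed toward features useful for all three outputs at once.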
18/22
Potency for Production
● Cloud computing: SER model on the cloud (e.g., Huggingface, ABCI, AWS), served through a public API
● Edge computing (on-device): smaller (lite) models, private data, offline use
19/22
Some future works
(Proposed final-project/thesis topics)
● Build a dataset and pre-trained models for Bahasa Indonesia
● Evaluate the effectiveness of single-task SER vs. multitask general speech processing (SER + ASR + sentiment + language/dialect + etc.) → I believe MTL is superior.
● Evaluate the ethics (purpose of use), bias (gender, ethnicity), and latency (real-time factor) of the model.
● On-device (edge computing) SER for data privacy, hardware optimization, etc.
Recognizing common sentiment (negative, neutral, positive) and emotion (neutral, happy, angry, sad), possibly on-device; datasets: MOSEI, JTES.
Concordance correlation coefficient (CCC) loss with machine learning, possibly on-device.
20/22
Future research: AI for Speech
Information manifested in speech (Fujisaki, 2003): text, dialect, speaker, age, gender, emotion, naturalness, disease → AI → [knowledge]
Short-term goals:
● Pre-trained models for four common emotion categories: multilingual SER
● Pre-trained model for SER suited to the Japanese language
● Insights on affective vocal bursts (cry, laughter, etc.)
21/22
Conclusions
● Most SER research has been conducted in English; there is a need to boost SER research for non-English languages: Bahasa?
● Multitask learning and multimodal information fusion show better performance, but there are trade-offs.
● Not only emotion can be extracted from speech, but also naturalness, age, and country; more knowledge will follow.
● Data and the training paradigm (e.g., fine-tuning) are important for building models, since algorithms/methods are openly available.
● There is a need to protect user data while collecting information to increase well-being, without sacrificing model performance.
22/22
Contact information
● Email-only: bagus@ep.its.ac.id
● Keep healthy, mentally and physically!
Webinar PENS October 2022