Research and development of models
for speech emotion recognition
Bagus Tris Atmaja, PhD
Lecturer, Engineering Physics, ITS
Postdoctoral Researcher, AIRC, AIST
Version: 10/13/22
2/22
Outline
● Introduction
● General approach for research
● Selected current research topics (problems & solutions):
  – General speech emotion recognition
  – Multimodal information fusion: video, audio, text
  – Multitask learning
● Future research & prospects
● Conclusions
3/22
Self Introduction
● 2005 – 2009: Undergraduate, Eng. Physics, ITS
● 2010 – 2012: Master, Eng. Physics, ITS
● 2011 – 2012: Research student, Kumamoto Univ.
● 2012 – 2014: Engineer, Shimizu Seisakusho, Mie
● 2014 – Now : Lecturer, ITS
● 2017 – 2018: Research student, JAIST
● 2018 – 2021: PhD, JAIST
● 2021 – Now : Postdoctoral researcher, AIST
4/22
Research Theme: General Approach
Problem-based and data-driven research, e.g., mental-state
monitoring, satisfaction evaluation, abnormal sound detection
5/22
Selected Research Topic:
Speech Emotion Recognition (SER)
Typical SER workflow (diagram): Dataset → Feature selection? → Pre-processing/post-processing? → Model suited for SER? → Loss function → Categorical emotion. Open questions: Do acoustic embeddings correlate with emotion? Linguistic features?
Problems in previous SER research:
- Most research was conducted on categorical emotion
- For categorical SER, the overall performance is not satisfactory (<70% accuracy)
- Most SER research was conducted on English
- Can any other information be obtained from speech in addition to emotion?
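The workflow above can be sketched minimally. This is an illustrative stand-in, not the author's actual pipeline: it uses two hand-crafted frame-level features (log energy and zero-crossing rate) in place of richer feature sets such as openSMILE HSFs, and mean/std pooling to get a fixed-size utterance vector for a downstream classifier.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Frame-level hand-crafted features: log energy and zero-crossing rate."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        log_energy = np.log(np.sum(frame ** 2) + 1e-10)
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        feats.append([log_energy, zcr])
    return np.array(feats)

def utterance_embedding(signal):
    """Pool frame-level features to a fixed-size utterance vector (mean + std)."""
    f = frame_features(signal)
    return np.concatenate([f.mean(axis=0), f.std(axis=0)])

rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)  # 1 s of synthetic "audio" at 16 kHz
emb = utterance_embedding(utterance)
print(emb.shape)  # (4,)
```

The utterance vector would then feed the model block of the diagram (classifier + loss function).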
6/22
Why dimensional SER?
● Most previous studies were conducted on categorical emotion → dimensional SER
● Why dimensional? Biological categories such as emotion categories have no fixed essence due to their high variability; continuous scores are a more universal evaluation.
● For dimensional SER, the performance of valence is lower than the other dimensions → add linguistic information (in addition to acoustics)
7/22
Ideas for dimensional SER
● For dimensional SER, the performance of valence is lower than the other dimensions → add linguistic information (in addition to acoustics):
  – Evaluate different word embeddings
  – Adjust the loss function for each emotion dimension (valence, arousal, dominance)
  – Evaluate early and late fusion
  – Extract emotion-correlated speech embeddings
(Diagram: acoustic embeddings correlated with emotion? + linguistic features? → loss function → dimensional emotion regression)
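Dimensional SER is conventionally evaluated with the concordance correlation coefficient (CCC), the metric used throughout this line of work. A minimal implementation of Lin's CCC:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient (Lin, 1989), the standard
    metric for dimensional SER: 1 = perfect agreement, 0 = none."""
    mt, mp = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mt) * (y_pred - mp))
    return 2 * cov / (y_true.var() + y_pred.var() + (mt - mp) ** 2)

t = np.array([0.1, 0.4, 0.6, 0.9])
print(round(ccc(t, t), 3))        # 1.0 for perfect agreement
print(round(ccc(t, t[::-1]), 3))  # negative for reversed predictions
```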
8/22
How to improve low valence performance?
Add linguistic information, since valence is similar to sentiment, e.g., “Your service is bad” (negative sentiment, low valence).
The goal is to obtain linguistic scores from transcribed text (e.g., via GloVe) that correlate with valence (V), arousal (A), and dominance (D); the scores shown are from a dictionary.
B. T. Atmaja and M. Akagi, “Improving Valence Prediction in Dimensional Speech Emotion Recognition Using Linguistic Information,” in Oriental COCOSDA, 2020, pp. 166–171. [GitHub https://github.com/bagustris/dser_with_text]
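The dictionary-lookup idea can be sketched as follows. The lexicon entries and scores below are made up for illustration; they are not from the dictionary used in the paper:

```python
# Toy VAD lexicon: word -> (valence, arousal, dominance) in [0, 1].
# These scores are invented for this sketch, not from the actual dictionary.
VAD = {
    "good":    (0.85, 0.55, 0.60),
    "bad":     (0.15, 0.60, 0.40),
    "service": (0.50, 0.40, 0.50),
    "your":    (0.50, 0.30, 0.50),
    "is":      (0.50, 0.30, 0.50),
}

def text_vad(transcript):
    """Average word-level (V, A, D) scores over words found in the lexicon."""
    scores = [VAD[w] for w in transcript.lower().split() if w in VAD]
    if not scores:
        return (0.5, 0.5, 0.5)  # neutral fallback for out-of-vocabulary text
    v, a, d = (sum(col) / len(scores) for col in zip(*scores))
    return (v, a, d)

v, a, d = text_vad("Your service is bad")
print(v < 0.5)  # low valence for the negative-sentiment example
```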
9/22
How about the result?
(Bar chart: relative improvement (%), y-axis 0–140, for HSF+WE, HSF+Word2Vec, HSF+FastText, HSF+GloVe, and HSF+BERT, reported for Valence STL, Valence MTL, and Averaged CCC MTL.)
10/22
Is there any way to optimize acoustic+linguistic fusion for SER?
Yes; one way is to adjust the loss function for each emotion dimension.
CCC-based MTL loss variants: no parameters, 2 parameters (α, β), or 3 parameters. Best result: α = 0.7, β = 0.2, CCC = 0.51.
B. T. Atmaja and M. Akagi, “Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning,” APSIPA Trans. Signal Inf. Process., vol. 9, no. May, p. e17, May 2020. [GitHub https://github.com/bagustris/dimensional-ser]
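A plausible reading of the 2-parameter variant is sketched below: a weighted sum of per-dimension CCC losses (1 − CCC), with α and β from the slide and the dominance weight tied to 1 − α − β. This is an illustrative reconstruction, not code from the paper:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient."""
    mt, mp = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mt) * (y_pred - mp))
    return 2 * cov / (y_true.var() + y_pred.var() + (mt - mp) ** 2)

def mtl_ccc_loss(true_vad, pred_vad, alpha=0.7, beta=0.2):
    """Weighted multitask CCC loss over (valence, arousal, dominance).
    The dominance weight is tied as 1 - alpha - beta (2-parameter variant)."""
    gamma = 1.0 - alpha - beta
    loss_v = 1.0 - ccc(true_vad[0], pred_vad[0])
    loss_a = 1.0 - ccc(true_vad[1], pred_vad[1])
    loss_d = 1.0 - ccc(true_vad[2], pred_vad[2])
    return alpha * loss_v + beta * loss_a + gamma * loss_d

rng = np.random.default_rng(1)
true = rng.random((3, 8))  # 3 dimensions (V, A, D) x 8 samples
loss_perfect = mtl_ccc_loss(true, true)
print(round(loss_perfect, 6))  # 0.0 when predictions match targets
```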
11/22
Is concatenation the only method to fuse acoustic and linguistic information?
No; early (feature-level) and late (decision-level) fusion can also be applied.
pAA: pyAudioAnalysis (34 acoustic features); sil: silence.
B. T. Atmaja and M. Akagi, “Two-stage dimensional emotion recognition by fusing predictions of acoustic and text networks using SVM,” Speech Commun., vol. 126, pp. 9–21, Feb. 2021. [GitHub https://github.com/bagustris/two-stage-ser]
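The two fusion styles can be contrasted in a few lines. The feature dimensions and prediction values are placeholders, and the late-fusion combiner here is a simple average, whereas the paper's second stage uses an SVM:

```python
import numpy as np

rng = np.random.default_rng(2)
acoustic = rng.random(34)     # e.g., a pAA-style utterance feature vector
linguistic = rng.random(300)  # e.g., a word-embedding vector

# Early (feature-level) fusion: concatenate modalities before the model.
early = np.concatenate([acoustic, linguistic])

# Late (decision-level) fusion: each modality predicts (V, A, D) separately,
# then a second stage combines the predictions (averaging here; the paper
# trains an SVM on the stacked predictions instead).
pred_acoustic = np.array([0.40, 0.62, 0.55])
pred_linguistic = np.array([0.30, 0.58, 0.51])
late = (pred_acoustic + pred_linguistic) / 2

print(early.shape)  # (334,)
print(late)
```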
12/22
How to extract acoustic embeddings that correlate with emotional speech?
Fine-tune a pre-trained model on an affective speech dataset (MSP-Podcast). The acoustic embedding is then the last hidden states (1024-dim) concatenated with the logits (3-dim, VAD).
[1] J. Wagner et al., “Dawn of the transformer era in speech emotion recognition: closing the valence gap,” arXiv:2203.07378, Mar. 2022.
[2] B. T. Atmaja and A. Sasou, “Leveraging Pre-Trained Acoustic Feature Extractor For Affective Vocal Bursts Tasks,” APSIPA ASC 2022. [GitHub https://github.com/bagustris/A-VB2022]
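The embedding construction reduces to pooling and concatenation. The tensors below are random placeholders standing in for the fine-tuned model's outputs (in practice they would come from the model of [1]); only the shapes match the slide:

```python
import numpy as np

# Placeholder outputs of the fine-tuned pre-trained model (random here):
rng = np.random.default_rng(3)
n_frames = 49
last_hidden = rng.standard_normal((n_frames, 1024))  # frame-level hidden states
logits = rng.standard_normal(3)                      # V, A, D predictions

# Acoustic embedding: mean-pooled last hidden states + logits.
embedding = np.concatenate([last_hidden.mean(axis=0), logits])
print(embedding.shape)  # (1027,)
```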
13/22
Does the fusion / acoustic embedding also improve categorical SER?
Yes, it does. An acoustic embedding trained for dimensional SER also improves categorical SER (with transcripts obtained via ASR).
[1] B. T. Atmaja, K. Shirai, and M. Akagi, “Speech Emotion Recognition Using Speech Feature and Word Embedding,” in APSIPA ASC, 2019, pp. 519–523. [GitHub https://github.com/bagustris/Apsipa2019_SpeechText]
[2] B. T. Atmaja and A. Sasou, “Effects of Data Augmentations on Speech Emotion Recognition,” Sensors, vol. 22, no. 16, p. 5941, Aug. 2022, doi: 10.3390/s22165941. [GitHub https://github.com/bagustris/ser_aug]
14/22
Most SER research was conducted in English; how about Japanese?
We conducted SER research on Japanese and found a notable phenomenon when evaluating with different data splits.
On the IEMOCAP dataset, splitting by speaker and script (SP&SC = SI&TI) was the most challenging task (Pepino et al., 2020). On the JTES dataset, we observed a similar phenomenon, highlighting the dependency of SER on linguistic information.
The text-independent (TI) split obtained the lowest scores, probably due to (1) the small number of test samples (200) and (2) the model learning better in the STI condition than in TI.
15/22
How to improve Japanese SER?
Pre-trained model + data augmentation.
[1] B. T. Atmaja and A. Sasou, “Effects of Data Augmentations on Speech Emotion Recognition,” Sensors, vol. 22, no. 16, p. 5941, Aug. 2022, doi: 10.3390/s22165941.
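Two common waveform-level augmentations are sketched below: additive noise at a target SNR and speed perturbation via resampling. These are generic examples of the technique, not necessarily the exact augmentations evaluated in [1]:

```python
import numpy as np

def add_noise(signal, snr_db=20.0, rng=None):
    """Add white noise at a target signal-to-noise ratio (in dB)."""
    rng = rng if rng is not None else np.random.default_rng()
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return signal + rng.standard_normal(len(signal)) * np.sqrt(noise_power)

def speed_perturb(signal, factor=1.1):
    """Change playback speed (and pitch) by linear-interpolation resampling."""
    n_out = int(len(signal) / factor)
    idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(idx, np.arange(len(signal)), signal)

rng = np.random.default_rng(4)
x = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s tone at 16 kHz
noisy = add_noise(x, snr_db=20.0, rng=rng)
fast = speed_perturb(x, factor=1.1)
print(len(fast) < len(x))  # True: 10% faster -> shorter signal
```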
16/22
Can any other information besides emotion be obtained from speech simultaneously?
Yes. The naturalness of speech shows promising scores in multitask learning with emotion recognition, and it helps improve SER scores.
B. T. Atmaja, A. Sasou, and M. Akagi, “Speech Emotion and Naturalness Recognitions with Multitask and Single-task Learnings,” IEEE Access, pp. 1–1, 2022, doi: 10.1109/ACCESS.2022.3189481. [Pre-trained model is available: https://github.com/bagustris/sner]
17/22
Can any other information besides emotion be obtained from speech simultaneously?
Yes. Age and country can also be estimated from speech simultaneously.
(Diagram: input → shared layer → independent layers for emotion, age, and country.)
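The shared-plus-independent-layers architecture from the diagram can be sketched as a single forward pass. All layer sizes here are hypothetical (a 1027-dim input, a 128-dim shared layer, 4 emotion classes, 6 countries); the slide does not specify them:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(5)

# Hypothetical sizes: 1027-dim embedding in, 128-dim shared layer, then
# independent heads for emotion (4 classes), age (1 value), country (6).
W_shared = rng.standard_normal((1027, 128)) * 0.02
W_emo = rng.standard_normal((128, 4)) * 0.02
W_age = rng.standard_normal((128, 1)) * 0.02
W_country = rng.standard_normal((128, 6)) * 0.02

x = rng.standard_normal(1027)
h = relu(x @ W_shared)       # shared representation for all tasks
emo_logits = h @ W_emo       # independent (task-specific) heads
age = h @ W_age
country_logits = h @ W_country
print(emo_logits.shape, age.shape, country_logits.shape)
```

In training, each head gets its own loss and the shared layer receives gradients from all three tasks.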
18/22
Potential for Production
● Cloud computing: SER model on the cloud (e.g., Hugging Face, ABCI, AWS) served through a public API
● Edge computing (on-device): smaller (lite) models, private data, offline use
19/22
Some future works
(Offered undergraduate/master’s thesis topics)
● Build a dataset and pre-trained models for Bahasa Indonesia
● Evaluate the effectiveness of single-task SER vs. multitask general speech processing (SER + ASR + sentiment + language/dialect + etc.) → I believe MTL is superior.
● Evaluate the ethics (purpose of use), bias (gender, ethnicity), and latency (real-time factor) of the model.
● On-device (edge computing) SER for data privacy, hardware optimization, etc.
Example topics: recognizing common sentiment (negative, neutral, positive) and emotion (neutral, happy, angry, sad), possibly on-device; datasets: MOSEI, JTES. Concordance correlation coefficient (CCC) loss with machine learning (on-device?).
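The sentiment/emotion pairing on the slide suggests a simple category mapping. The grouping below (happy → positive; angry, sad → negative) is the conventional one, assumed here rather than taken from a specific paper:

```python
# Map the four emotion categories from the slide to the three sentiment
# classes. This grouping is an assumption for illustration.
EMO_TO_SENT = {
    "happy": "positive",
    "neutral": "neutral",
    "angry": "negative",
    "sad": "negative",
}

def sentiment_of(emotion):
    """Derive a sentiment label from a categorical emotion label."""
    return EMO_TO_SENT[emotion]

print(sentiment_of("angry"))  # negative
```

Such a mapping lets one dataset supervise both tasks, e.g., in a multitask setup over MOSEI and JTES.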
20/22
Future research: AI for Speech
Information manifested in speech (Fujisaki, 2003): text, dialect, speaker, age, gender, emotion, naturalness, disease [knowledge].
Short-term goals:
● Pre-trained models for four common emotion categories: multilingual SER
● Pre-trained model for SER suited for the Japanese language
● Insights on affective vocal bursts (cry, laughter, etc.)
21/22
Conclusions
● Most SER research has been conducted in English; there is a need to boost research on SER for non-English languages: Bahasa?
● Multitask learning and multimodal information fusion show better performance, but there are trade-offs.
● Not only emotion can be extracted from speech, but also naturalness, age, and country. More knowledge will follow.
● Data and training paradigms (e.g., fine-tuning) are important for building models, since algorithms/methods are openly available.
● There is a need to protect user data while collecting information to increase well-being, without sacrificing model performance.
22/22
Contact information
● Email-only: bagus@ep.its.ac.id
● Keep healthy, mentally and physically!
More Related Content

Similar to Webinar PENS October 2022

Speech emotion recognition using 2D-convolutional neural network
Speech emotion recognition using 2D-convolutional neural  networkSpeech emotion recognition using 2D-convolutional neural  network
Speech emotion recognition using 2D-convolutional neural network
IJECEIAES
 
An Overview Of Natural Language Processing
An Overview Of Natural Language ProcessingAn Overview Of Natural Language Processing
An Overview Of Natural Language Processing
Scott Faria
 
Review On Speech Recognition using Deep Learning
Review On Speech Recognition using Deep LearningReview On Speech Recognition using Deep Learning
Review On Speech Recognition using Deep Learning
IRJET Journal
 
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
mathsjournal
 
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
mathsjournal
 
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
mathsjournal
 
The Evolution of Speech Recognition Datasets: Fueling the Future of AI
The Evolution of Speech Recognition Datasets: Fueling the Future of AIThe Evolution of Speech Recognition Datasets: Fueling the Future of AI
The Evolution of Speech Recognition Datasets: Fueling the Future of AI
GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
 
A prior case study of natural language processing on different domain
A prior case study of natural language processing  on different domain A prior case study of natural language processing  on different domain
A prior case study of natural language processing on different domain
IJECEIAES
 
NATURAL LANGUAGE PROCESSING
NATURAL LANGUAGE PROCESSINGNATURAL LANGUAGE PROCESSING
NATURAL LANGUAGE PROCESSING
IJCI JOURNAL
 
Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...
karthik annam
 
Glis Localization Internationalization 05 20071030
Glis Localization Internationalization 05 20071030Glis Localization Internationalization 05 20071030
Glis Localization Internationalization 05 20071030
Jan Pawlowski
 
Unlocking the Power of AI Text-to-Speech
Unlocking the Power of AI Text-to-SpeechUnlocking the Power of AI Text-to-Speech
Unlocking the Power of AI Text-to-Speech
Nola58
 
Unlocking the Potential of Speech Recognition Dataset: A Key to Advancing AI ...
Unlocking the Potential of Speech Recognition Dataset: A Key to Advancing AI ...Unlocking the Potential of Speech Recognition Dataset: A Key to Advancing AI ...
Unlocking the Potential of Speech Recognition Dataset: A Key to Advancing AI ...
GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
 
Portofolio Muhammad Afrizal Septiansyah 2024
Portofolio Muhammad Afrizal Septiansyah 2024Portofolio Muhammad Afrizal Septiansyah 2024
Portofolio Muhammad Afrizal Septiansyah 2024
MuhammadAfrizalSepti
 
IRJET - Sign Language Converter
IRJET -  	  Sign Language ConverterIRJET -  	  Sign Language Converter
IRJET - Sign Language Converter
IRJET Journal
 
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHSPATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
kevig
 
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHSPATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
ijnlc
 
Text Mining for Lexicography
Text Mining for LexicographyText Mining for Lexicography
Text Mining for Lexicography
Leiden University
 
leewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdf
leewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdfleewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdf
leewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdf
robertsamuel23
 
Should we be afraid of Transformers?
Should we be afraid of Transformers?Should we be afraid of Transformers?
Should we be afraid of Transformers?
Dominik Seisser
 

Similar to Webinar PENS October 2022 (20)

Speech emotion recognition using 2D-convolutional neural network
Speech emotion recognition using 2D-convolutional neural  networkSpeech emotion recognition using 2D-convolutional neural  network
Speech emotion recognition using 2D-convolutional neural network
 
An Overview Of Natural Language Processing
An Overview Of Natural Language ProcessingAn Overview Of Natural Language Processing
An Overview Of Natural Language Processing
 
Review On Speech Recognition using Deep Learning
Review On Speech Recognition using Deep LearningReview On Speech Recognition using Deep Learning
Review On Speech Recognition using Deep Learning
 
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
 
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
 
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR...
 
The Evolution of Speech Recognition Datasets: Fueling the Future of AI
The Evolution of Speech Recognition Datasets: Fueling the Future of AIThe Evolution of Speech Recognition Datasets: Fueling the Future of AI
The Evolution of Speech Recognition Datasets: Fueling the Future of AI
 
A prior case study of natural language processing on different domain
A prior case study of natural language processing  on different domain A prior case study of natural language processing  on different domain
A prior case study of natural language processing on different domain
 
NATURAL LANGUAGE PROCESSING
NATURAL LANGUAGE PROCESSINGNATURAL LANGUAGE PROCESSING
NATURAL LANGUAGE PROCESSING
 
Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...
 
Glis Localization Internationalization 05 20071030
Glis Localization Internationalization 05 20071030Glis Localization Internationalization 05 20071030
Glis Localization Internationalization 05 20071030
 
Unlocking the Power of AI Text-to-Speech
Unlocking the Power of AI Text-to-SpeechUnlocking the Power of AI Text-to-Speech
Unlocking the Power of AI Text-to-Speech
 
Unlocking the Potential of Speech Recognition Dataset: A Key to Advancing AI ...
Unlocking the Potential of Speech Recognition Dataset: A Key to Advancing AI ...Unlocking the Potential of Speech Recognition Dataset: A Key to Advancing AI ...
Unlocking the Potential of Speech Recognition Dataset: A Key to Advancing AI ...
 
Portofolio Muhammad Afrizal Septiansyah 2024
Portofolio Muhammad Afrizal Septiansyah 2024Portofolio Muhammad Afrizal Septiansyah 2024
Portofolio Muhammad Afrizal Septiansyah 2024
 
IRJET - Sign Language Converter
IRJET -  	  Sign Language ConverterIRJET -  	  Sign Language Converter
IRJET - Sign Language Converter
 
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHSPATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
 
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHSPATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
 
Text Mining for Lexicography
Text Mining for LexicographyText Mining for Lexicography
Text Mining for Lexicography
 
leewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdf
leewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdfleewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdf
leewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdf
 
Should we be afraid of Transformers?
Should we be afraid of Transformers?Should we be afraid of Transformers?
Should we be afraid of Transformers?
 

Recently uploaded

clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
Priyankaranawat4
 
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Excellence Foundation for South Sudan
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
mulvey2
 
Advanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docxAdvanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docx
adhitya5119
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
Priyankaranawat4
 
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
RitikBhardwaj56
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
amberjdewit93
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
Nguyen Thanh Tu Collection
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Dr. Vinod Kumar Kanvaria
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
heathfieldcps1
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
Nicholas Montgomery
 
Hindi varnamala | hindi alphabet PPT.pdf
Hindi varnamala | hindi alphabet PPT.pdfHindi varnamala | hindi alphabet PPT.pdf
Hindi varnamala | hindi alphabet PPT.pdf
Dr. Mulla Adam Ali
 
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
PECB
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
GeorgeMilliken2
 
Smart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICTSmart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICT
simonomuemu
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
Jean Carlos Nunes Paixão
 
Cognitive Development Adolescence Psychology
Cognitive Development Adolescence PsychologyCognitive Development Adolescence Psychology
Cognitive Development Adolescence Psychology
paigestewart1632
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
History of Stoke Newington
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
Israel Genealogy Research Association
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
AyyanKhan40
 

Recently uploaded (20)

clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
 
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
 
Advanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docxAdvanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docx
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
 
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
 
Hindi varnamala | hindi alphabet PPT.pdf
Hindi varnamala | hindi alphabet PPT.pdfHindi varnamala | hindi alphabet PPT.pdf
Hindi varnamala | hindi alphabet PPT.pdf
 
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
 
Smart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICTSmart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICT
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
 
Cognitive Development Adolescence Psychology
Cognitive Development Adolescence PsychologyCognitive Development Adolescence Psychology
Cognitive Development Adolescence Psychology
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
 

Webinar PENS October 2022

  • 1. Research and development of models for speech emotion recognition Bagus Tris Atmaja, PhD Lecturer, Engineering Physics, ITS Postdoctoral Researcher, AIRC, AIST Version: 10/13/22
  • 2. 2/22 Outline ● Introduction ● General approach for research ● Selected current research topics (problems & solutions): – General speech emotion recognition – Multimodal information fusion: video, audio, text – Multitask learning ● Future research & prospects ● Conclusions
  • 3. 3/22 Self Introduction ● 2005 – 2009: Undergraduate, Eng. Physics, ITS ● 2010 – 2012: Master, Eng. Physics, ITS ● 2011 – 2012: Research student, Kumamoto Univ. ● 2012 – 2014: Engineer, Shimizu Seisakusho, Mie ● 2014 – Now : Lecturer, ITS ● 2017 – 2018: Research Student, JAIST ● 2018 – 2021: PhD, JAIST ● 2021 – Now : Postdoctoral researcher, AIST
  • 4. 4/22 Research Theme: General Approach Problem-based and data-driven research, e.g., mental-state monitoring, satisfaction evaluation, abnormal sound detection
  • 5. 5/22 Selected Research Topic: Speech Emotion Recognition (SER) Dataset Acoustic embeddings correlate to emotion? Model suited for SER? Pre-processing/ post processing? Feature Selection? Problems in previous SER Research: - Most research were conducted for categorical emotion - For categorical SER, the overall performance is not satisfactory (<70% Accuracy) - Most SER research were conducted for English - Is there any other information can be obtained from speech in addition to SER? Categorical Emotion Loss function Linguistic Features? Typical SER workflow:
  • 6. 6/22 Why dimensional SER? ● Most previous studies were conducted for categorical emotion → Dimensional SER ● Why dimensional SER? Because the biological category like categorical emotion doesn’t have an essence due to high variability; continuous scores are more universal evaluations. ● For dimensional SER, the performance of valence is lower than others → Adding linguistic information (in addition to Acoustics)
  • 7. 7/22 Ideas for dimensional SER ● For dimensional SER, the performance of valence is lower than others → Adding linguistic information (in addition to Acoustics): – Evaluate different word embeddings – Adjusting loss function for each emotion dimension (valence, arousal, dominance) – Evaluate early and late fusions – Extracting emotional-correlated speech embedding Dimensional Emotion Regression Loss function Linguistic Features? Acoustic embeddings correlate to emotion?
  • 8. 8/22 How to improve low valence performance? B. T. Atmaja and M. Akagi, “Improving Valence Prediction in Dimensional Speech Emotion Recognition Using Linguistic Information,” in Oriental COCOSDA, 2020, pp. 166–171. [github https://github.com/bagustris/dser_with_text] Adding linguistic information since valence is similar to sentiment e.g.: “Your service is bad “ (negative sentiment, low valence) (emotion) (e.g., prosody) The goal is to get linguistic scores from transcribed text (e.g., GloVe) which correlates to valence (V), arousal (A), and dominance (above scores are from a dictionary).
  • 9. 9/22 How about the result? HSF+WE HSF+Word2Vec HSF+FastText HSF+GloVe HSF+BERT 0 20 40 60 80 100 120 140 Valence STL Valence MTL Averaged CCC MTL Relative improvement (%) Im provem ent (% )
  • 10. 10/22 Is there any way to optimize acoustic+linguistic fusion for SER? B. T. Atmaja and M. Akagi, “Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning,” APSIPA Trans. Signal Inf. Process., vol. 9, no. May, p. e17, May 2020. [Github https://github.com/bagustris/dimensional-ser] α β α=0.7, β=0.2, CCC = 0.51 ccc MTL: ● No parameters ● 2 parameters ● 3 parameters Yes, one is by adjusting loss function for each emotion dimension.
  • 11. 11/22 Is concatenation the only method to fuse acoustic and linguistic information? pAA: python audio analysis (34 acoustic features) sil: silence B. T. Atmaja and M. Akagi, “Two-stage dimensional emotion recognition by fusing predictions of acoustic and text networks using SVM,” Speech Commun., vol. 126, pp. 9–21, Feb. 2021. [Github https://github.com/bagustris/two-stage-ser] No, there are early (feature) and late (decision) fusion that can be applied.
12/22
How to extract an acoustic embedding that correlates with emotional speech?
Fine-tune a pre-trained model on an affective speech dataset (MSP-Podcast) [1], then use the last hidden states (1024-dim) plus the logits (3-dim, VAD) as the acoustic embedding [2].
[1] J. Wagner et al., “Dawn of the transformer era in speech emotion recognition: closing the valence gap,” arXiv:2203.07378, Mar. 2022.
[2] B. T. Atmaja and A. Sasou, “Leveraging Pre-Trained Acoustic Feature Extractor For Affective Vocal Bursts Tasks,” APSIPA ASC 2022. [Github https://github.com/bagustris/A-VB2022]
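Building the utterance-level embedding from a model's outputs can be sketched as follows; mean pooling over time is an assumption, and tiny dimensions stand in for the real 1024-dim per-frame hidden states:

```python
def pooled_embedding(hidden_states, logits):
    """Build an utterance-level acoustic embedding: mean-pool the
    per-frame last hidden states over time, then append the VAD logits.
    With the real model this yields a 1024 + 3 = 1027-dim vector; toy
    dimensions are used here for illustration."""
    n_frames = len(hidden_states)
    dim = len(hidden_states[0])
    pooled = [sum(frame[i] for frame in hidden_states) / n_frames
              for i in range(dim)]
    return pooled + list(logits)
```

The pooled states carry general acoustic information, while the appended logits inject the fine-tuned model's explicit valence/arousal/dominance estimates.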
13/22
Does the fusion also improve categorical SER? Does the acoustic embedding also improve categorical SER?
Yes, it does [1]. An acoustic embedding trained for dimensional SER also improves categorical SER [2].
[1] B. T. Atmaja, K. Shirai, and M. Akagi, “Speech Emotion Recognition Using Speech Feature and Word Embedding,” in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 519–523. [Github https://github.com/bagustris/Apsipa2019_SpeechText]
[2] B. T. Atmaja and A. Sasou, “Effects of Data Augmentations on Speech Emotion Recognition,” Sensors, vol. 22, no. 16, p. 5941, Aug. 2022, doi: 10.3390/s22165941. [Github https://github.com/bagustris/ser_aug]
14/22
Most SER research has been conducted in English; how about Japanese?
We conducted SER research on Japanese and observed a phenomenon while evaluating different data splits.
On the IEMOCAP dataset, the split by speaker and script (SP&SC = SI&TI) proved the most challenging task (Pepino et al., 2020).
On the JTES dataset, we observed a similar phenomenon, highlighting the dependency of SER on linguistic information.
The text-independent (TI) split obtained the lowest scores, probably because (1) the number of test samples is small (200), and (2) the model learns the STI split better than the TI split.
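A speaker- and text-independent (SI&TI) split can be sketched as follows; the `speaker`/`script` keys and the policy of dropping mixed utterances are illustrative assumptions, not the datasets' official protocols:

```python
def split_si_ti(utterances, test_speakers, test_scripts):
    """Split so that both the speakers AND the scripts (linguistic
    content) of the test set never appear in training (SI & TI).
    Each utterance is assumed to be a dict with 'speaker' and
    'script' keys."""
    train, test = [], []
    for u in utterances:
        if u["speaker"] in test_speakers and u["script"] in test_scripts:
            test.append(u)
        elif (u["speaker"] not in test_speakers
              and u["script"] not in test_scripts):
            train.append(u)
        # Utterances that mix train/test speakers and scripts are
        # dropped to keep the split strictly independent.
    return train, test
```

Dropping the mixed cells is what makes this split small and hard: the model can exploit neither speaker identity nor memorized linguistic content, which is exactly why TI/SI&TI scores fall.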
15/22
How to improve Japanese SER?
Use a pre-trained model plus data augmentation.
[1] B. T. Atmaja and A. Sasou, “Effects of Data Augmentations on Speech Emotion Recognition,” Sensors, vol. 22, no. 16, p. 5941, Aug. 2022.
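One common augmentation, additive noise at a target signal-to-noise ratio, can be sketched as follows; this is a simplified stand-in for the augmentation pipelines (noise, pitch/speed shifts, etc.) evaluated in the Sensors paper:

```python
import random


def add_noise(waveform, snr_db, rng=None):
    """Additive white-noise augmentation: scale Gaussian noise so the
    signal-to-noise ratio matches snr_db, then add it to each sample.
    A fixed seed is used by default so the sketch is reproducible."""
    rng = rng or random.Random(0)
    signal_power = sum(x * x for x in waveform) / len(waveform)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = noise_power ** 0.5
    return [x + rng.gauss(0.0, sigma) for x in waveform]
```

Training on such perturbed copies of each utterance multiplies the effective dataset size, which matters for small corpora like JTES.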
16/22
Is there any other information that can be obtained from speech beside emotion, simultaneously?
Yes, there is. Speech naturalness shows promising scores in multitask learning with emotion recognition; it helps improve SER scores.
B. T. Atmaja, A. Sasou, and M. Akagi, “Speech Emotion and Naturalness Recognitions with Multitask and Single-task Learnings,” IEEE Access, pp. 1–1, 2022, doi: 10.1109/ACCESS.2022.3189481. [Pre-trained model is available: https://github.com/bagustris/sner]
17/22
Is there any other information that can be obtained from speech beside emotion, simultaneously?
Yes, there is. Age and country can also be obtained from speech simultaneously.
(Diagram: input → shared layer → independent layers for emotion, age, and country.)
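The hard parameter sharing behind the diagram can be sketched with a toy forward pass; the single shared layer, ReLU activation, and linear heads are illustrative simplifications of the actual networks:

```python
def linear(x, weights, bias):
    """Dense layer: y_j = sum_i x_i * W[j][i] + b[j]."""
    return [sum(xi * wj for xi, wj in zip(x, row)) + b
            for row, b in zip(weights, bias)]


def mtl_forward(features, shared, heads):
    """Hard parameter sharing: one shared layer feeds independent
    per-task heads (e.g., emotion, age, country)."""
    h = [max(0.0, v) for v in linear(features, *shared)]  # shared + ReLU
    return {task: linear(h, w, b) for task, (w, b) in heads.items()}
```

Because all task heads read the same shared representation, gradients from every task shape that representation jointly, which is the mechanism by which one task (e.g., naturalness or age) can improve another (emotion).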
18/22
Potency for Production
●
Cloud computing: SER model on the cloud (e.g., Hugging Face, ABCI, AWS) served through a public API
●
Edge computing (on-device): smaller (lite) models, private data, offline use
19/22
Some future works (offered undergraduate/master's thesis topics)
●
Build a dataset and pre-trained models for Bahasa Indonesia
●
Evaluate the effectiveness of single-task SER vs. multitask general speech processing (SER + ASR + sentiment + language/dialect + etc.) → I believe MTL is superior
●
Evaluate the ethics (purpose of use), bias (gender, ethnicity), and latency (real-time factor) of the models
●
On-device (edge computing) SER for data privacy, hardware optimization, etc.
Example topics: recognizing common sentiment (negative, neutral, positive) and emotion (neutral, happy, angry, sad) on-device (datasets: MOSEI, JTES); concordance correlation coefficient (CCC) loss with machine learning (on-device?)
20/22
Future research: AI for Speech
Information manifested in speech (Fujisaki, 2003) — text, dialect, speaker, age, gender, emotion, naturalness, disease — can be turned into knowledge by AI.
Short-term goals:
●
Pre-trained models for four common emotion categories: multilingual SER
●
Pre-trained model for SER suited to the Japanese language
●
Insights on affective vocal bursts (cry, laughter, etc.)
21/22
Conclusions
●
Most SER research has been conducted in English; there is a need to boost SER research for non-English languages: Bahasa?
●
Multitask learning and multimodal information fusion show better performance, but there are trade-offs.
●
Emotion is not the only attribute that can be extracted from speech; naturalness, age, and country can be extracted too. More knowledge will follow.
●
Data and the training paradigm (e.g., fine-tuning) are important for producing models, since algorithms/methods are openly available.
●
There is a need to protect user data while collecting information to increase well-being, without sacrificing model performance.