Emotion Modelling for Speech Generation
Kun Zhou (National University of Singapore)
Supervisors: Prof. Haizhou Li (Main Supervisor, NUS & CUHK)
Associate Prof. Thomas Yeo Boon Thye (NUS)
1
Outline
• Background & Introduction
• Topic 1: Seq2Seq Emotion Modelling
• Topic 2: Emotion Intensity and its Control
• Topic 3: Mixed Emotion Modelling and Synthesis
• Conclusion & Future Work
2
Background & Introduction
• Speech Generation:
- Text-to-Speech (TTS): Text → Speech
- Voice Conversion (VC): Speech → Speech
• Development:
Statistical Parametric-based → Neural Network-based
• Applications:
Conversational Agents (Apple Siri, Amazon Alexa, …)
• Research Limitations:
- Prosody Modelling: Lack of Expressivity;
- Data Dependency: Require Large & High-Quality Data;
- Personalization;
- Real-Time Generation;
Teach Machines to Speak
Picture Sources:
[1] “When Machines Speak”, Center for Science and Society, Columbia University, USA;
[2] “Talking with Machines”, News and Events, University of York, UK;
3
Emotion Modelling for Speech Generation
• Why Are Speech Emotions Difficult to Model?
1) Human Emotions are subtle and difficult to represent;
- Categorical: Angry, Happy, Sad, …
- Continuous: Valence, Arousal, Dominance, …
2) Speech Emotions relate to various acoustic features;
- Speech Quality, Energy, Intonation, Rhythm, …
3) Emotion Perception is subjective;
• Our Research Focus:
1) How to better imitate human emotions?
“Generalizability”
2) How to make synthesized emotions more creative?
“Controllability”
4
Towards Empathic AI
Empathic AI:
• Receiving: Receive Human/Environment Inputs;
• Appraisal: Appraise the Situation;
• Response: Generate Appropriate Responses;
Emotional Speech Generation:
• Generate Emotional Responses;
• Increase the Dialogue Richness;
• Enhance the Engagement in Human-Machine Interaction;
Picture Sources: “Should Algorithm and Robots Mimic Empathy?”, The Medical Futurist;
5
Publications
• Journal Publications (1st author):
[1] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Emotional Voice Conversion: Theory,
Databases and ESD”, Speech Communication, 2022;
[2] Kun Zhou, Berrak Sisman, Rajib Rana, Bjorn Schuller, Haizhou Li, “Emotion Intensity and its
Control for Emotional Voice Conversion”, IEEE Transactions on Affective Computing, 2023;
[3] Kun Zhou, Berrak Sisman, Rajib Rana, Bjorn Schuller, Haizhou Li, “Speech Synthesis with
Mixed Emotions”, IEEE Transactions on Affective Computing, 2022;
• Conference Publications (1st author):
[4] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming Spectrum and Prosody for Emotional
Voice Conversion with Non-Parallel Training Data”, Speaker Odyssey 2020;
[5] Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li, “Converting Anyone’s Emotion:
Towards Speaker-Independent Emotional Voice Conversion”, INTERSPEECH 2020;
[6] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Seen and Unseen Emotional Style Transfer
with a New Emotional Speech Dataset”, ICASSP 2021;
[7] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited Data Emotional Voice Conversion Leveraging
Text-to-Speech: Two-Stage Sequence-to-Sequence Training”, INTERSPEECH 2021;
[8] Kun Zhou, Berrak Sisman, Haizhou Li, “VAW-GAN for Disentanglement and Recomposition of
Emotional Elements in Speech”, IEEE SLT 2021; 6
• Conference Publications (Co-authored):
[9] Zongyang Du, Kun Zhou, Berrak Sisman, Haizhou Li, “Spectrum and Prosody Conversion for
Cross-Lingual Voice Conversion with CycleGAN”, APSIPA 2020;
[10] Junchen Lu, Kun Zhou, Berrak Sisman, Haizhou Li, “VAW-GAN for Singing Voice Conversion
with Non-Parallel Data”, APSIPA 2020;
[11] Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li, “Expressive Voice Conversion: A Joint
Framework for Speaker Identity and Emotional Style Transfer”, IEEE ASRU 2021;
[12] Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li, “Disentanglement of Emotional Style
and Speaker Identity for Expressive Voice Conversion”, INTERSPEECH 2022;
Publications
7
Emotion Modelling for Speech Generation
[1] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited Data Emotional Voice Conversion
Leveraging Text-to-Speech: Two-Stage Sequence-to-Sequence Training”,
INTERSPEECH 2021;
[2] Kun Zhou, Berrak Sisman, Rajib Rana, Bjorn Schuller, Haizhou Li, “Emotion
Intensity and its Control for Emotional Voice Conversion”, IEEE Transactions on
Affective Computing, 2023;
[3] Kun Zhou, Berrak Sisman, Rajib Rana, Bjorn Schuller, Haizhou Li, “Speech
Synthesis with Mixed Emotions”, IEEE Transactions on Affective Computing, 2022;
Topic 1: Seq2Seq Emotion Modelling (Generalizability)
Topic 2: Emotion Intensity Modelling and its Control (Controllability)
Topic 3: Mixed Emotion Modelling and Synthesis (Controllability)
8
Outline
• Background & Introduction
• Topic 1: Seq2Seq Emotion Modelling
• Topic 2: Emotion Intensity and its Control
• Topic 3: Mixed Emotion Modelling and Synthesis
• Conclusion & Future Work
9
Seq2Seq Emotion Modelling*
• Why Seq2Seq Emotion Models?
Frame-based Models:
- Convert emotion on a frame basis → cannot modify duration;
- Model spectrum and prosody separately → mismatch;
Seq2Seq-based Models:
- Jointly model spectrum, prosody, and duration → no mismatch;
- Sequence-level modelling → focus on emotion-relevant regions;
- Need a large amount of emotional speech training data;
Our Seq2Seq-based Models:
- Enable duration modelling;
- Only require limited emotional training data;
* Kun Zhou, Berrak Sisman, Haizhou Li, “Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to-
Sequence Training”, INTERSPEECH 2021.
10
Seq2Seq Emotion Modelling
• Two-Stage Training for Seq2Seq Emotion Modelling
(1) Style Initialization (2) Emotion Training
• Stage I: Style Initialization
- Speaker Encoder learns Speaker Styles from a TTS corpus;
• Stage II: Emotion Training
- Emotion Encoder learns Emotional Styles from limited emotional speech data; 11
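A minimal training-loop sketch of this two-stage recipe: the seq2seq model, the two encoders, the data loaders, and their interfaces below are hypothetical stand-ins, not the actual implementation.

    import torch

    def two_stage_training(model, speaker_encoder, emotion_encoder,
                           tts_loader, emo_loader, epochs=(100, 20)):
        # Stage I: style initialization on a multi-speaker TTS corpus.
        opt1 = torch.optim.Adam(list(model.parameters()) +
                                list(speaker_encoder.parameters()), lr=1e-3)
        for _ in range(epochs[0]):
            for mel, text in tts_loader:
                style = speaker_encoder(mel)              # speaker style embedding
                loss = model(text, style, target=mel)     # seq2seq reconstruction loss
                opt1.zero_grad(); loss.backward(); opt1.step()

        # Stage II: emotion training on limited (~50 min) emotional speech;
        # the emotion encoder now provides the style embedding.
        opt2 = torch.optim.Adam(list(model.parameters()) +
                                list(emotion_encoder.parameters()), lr=1e-4)
        for _ in range(epochs[1]):
            for mel, text, emotion_id in emo_loader:
                style = emotion_encoder(mel, emotion_id)  # emotion style embedding
                loss = model(text, style, target=mel)
                opt2.zero_grad(); loss.backward(); opt2.step()
        return model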
Seq2Seq Emotion Modelling
• Two-Stage Training for Seq2Seq Emotion Modelling
When testing with emotional data:
(1) Style Encoder from Stage I vs. (2) Emotion Encoder from Stage II
- The Emotion Encoder produces effective emotion embeddings;
- It only needs 50 minutes of emotional training data!
12
Seq2Seq Emotion Modelling
• Objective Evaluations
Seq2Seq Models outperformed the frame-based baselines.
DDUR (average absolute difference of utterance duration) → duration conversion performance
MCD (Mel-Cepstral Distortion) → spectrum conversion performance
13
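As a rough illustration of these two metrics (not the exact evaluation script), a numpy sketch assuming the converted and reference mel-cepstra have already been time-aligned, e.g. by dynamic time warping:

    import numpy as np

    def mcd(mc_conv, mc_ref):
        """Mel-Cepstral Distortion (dB) between aligned mel-cepstral sequences of
        shape (frames, dims); the energy coefficient c0 is assumed to be excluded."""
        diff = np.asarray(mc_conv) - np.asarray(mc_ref)
        return float(np.mean((10.0 / np.log(10.0)) *
                             np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

    def ddur(dur_conv, dur_ref):
        """Average absolute difference of utterance durations (in seconds)."""
        return float(np.mean(np.abs(np.asarray(dur_conv) - np.asarray(dur_ref))))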
Seq2Seq Emotion Modelling
• Subjective Evaluations
Seq2Seq Models outperformed the frame-based baselines.
MOS (Mean Opinion Score) Test for Emotion Similarity
BWS (Best-Worst Scaling) Test for Speech Quality
14
Outline
• Background & Introduction
• Topic 1: Seq2Seq Emotion Modelling
• Topic 2: Emotion Intensity and its Control
• Topic 3: Mixed Emotion Modelling and Synthesis
• Conclusion & Future Work
15
Emotion Intensity and its Control*
• Emotion Intensity
Figure 1. Example of different intensity images of “fear”*
Picture Source:
Hoffmann, Holger, et al. "Expression intensity, gender and facial emotion recognition: Women recognize only subtle facial emotions better than men.”
Acta psychologica 135.3 (2010): 278-283.
*Kun Zhou, Berrak Sisman, Rajib Rana, Bjorn Schuller, Haizhou Li, “Emotion Intensity and its Control for Emotional Voice Conversion”,
IEEE Transactions on Affective Computing, 14 (1) 31-48, 2023.
- The level at which an emotion is perceived by a listener;
- Not just the loudness of a voice;
- Correlates with all the acoustic cues that contribute to an emotion;
16
Emotion Intensity and its Control
• Research Challenges
- Entangled with multiple acoustic cues such as timbre, pitch and rhythm;
- Lack of explicit intensity labels;
- Subjective; (Hot or cold anger?)
• Previous Studies
- Use auxiliary features,
e.g. voiced/unvoiced/silence (VUS), attention weights, a saliency map;
- Manipulate internal emotion representations,
e.g. interpolation or scaling;
- These approaches lack interpretability;
- This limits their performance;
17
• Emovox – Emotional Voice Conversion with Intensity Control
Emotion Intensity and its Control
Sequence-to-Sequence Conversion
Models:
• Linguistic Transplant;
• Emotion Transfer;
• Intensity Control;
How to design an intensity encoder that can:
(1) Accept manual intensity labels; (2) Produce interpretable intensity embeddings?
→ How to formulate emotion intensity?
18
• Relative Attributes*
Emotion Intensity and its Control
He smiles more than the person on the right,
but less than the person on the left!
Our world is not binary:
- Before: predicting the presence of an attribute;
E.g., smiling or not smiling? → a regression problem;
- Relative Attributes*: predicting the strength of an attribute;
E.g., how much is he/she smiling? → a ranking problem;
* Parikh, Devi, and Kristen Grauman. "Relative attributes." 2011 International Conference on Computer Vision. IEEE, 2011.
19
• Formulation of Emotion Intensity
Emotion Intensity and its Control
Assumptions:
(1) “Neutral” does not contain any emotion variance → its intensity is always 0;
(2) Emotion intensity is the relative difference from “Neutral”;
Given a training set T = {x_t}, where N and E denote the neutral and the emotional subsets, we learn a ranking function:
r(x_t) = W x_t
It satisfies the following constraints (in the spirit of relative attributes; a minimal training sketch follows below):
∀(i, j) ∈ O: W x_i > W x_j;  ∀(i, j) ∈ S: W x_i ≈ W x_j
Supervision:
O → Ordered Set: emotional samples (E) have higher intensity than neutral samples (N);
S → Similar Set: N-N pairs and E-E pairs share similar intensity;
20
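As a rough illustration (not the paper's exact optimizer), a minimal numpy sketch that learns such a ranking function with stochastic pairwise updates and maps its score to an intensity value in [0, 1]; the features, hyper-parameters, and min-max normalisation are assumptions:

    import numpy as np

    def train_intensity_ranker(X_emo, X_neu, n_iters=5000, lr=1e-3, C=1.0, margin=1.0):
        """Learn W so that W.x is larger for emotional than for neutral samples
        (ordered pairs O) and similar within N-N pairs (similar pairs S)."""
        rng = np.random.default_rng(0)
        w = np.zeros(X_emo.shape[1])
        for _ in range(n_iters):
            grad = w.copy()                                 # gradient of 0.5*||w||^2
            e = X_emo[rng.integers(len(X_emo))]
            n = X_neu[rng.integers(len(X_neu))]
            diff = e - n                                    # ordered pair (E, N)
            if w @ diff < margin:                           # hinge loss on O
                grad -= C * diff
            a, b = X_neu[rng.integers(len(X_neu), size=2)]  # similar pair (N, N)
            sim = a - b
            grad += C * (w @ sim) * sim                     # squared penalty on S
            w -= lr * grad
        return w

    def intensity(w, x, r_min, r_max):
        """Min-max normalise the ranking score to an intensity value in [0, 1]."""
        return float(np.clip((w @ x - r_min) / (r_max - r_min), 0.0, 1.0))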
• Formulation of Emotion Intensity
Emotion Intensity and its Control
(1) Traditional Classifier: separates “Neutral” from “Angry”;
(2) Relative Ranking: orders samples from least angry to most angry;
A ranking model automatically predicts the intensity of the emotion with respect
to other speech samples. 21
• Modelling Emotion Styles with its Intensity
Emotion Intensity and its Control
Emotion Style Reconstruction:
- Emotion Modelling;
- Intensity Modelling;
22
• Run-time Emotion Intensity Control
Emotion Intensity and its Control
• Emotion Intensity Transfer: the emotion intensity is predicted from a reference audio;
• Emotion Intensity Control: the emotion intensity is given by humans;
23
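A toy sketch of the two run-time modes above, reusing ranking weights of the kind learned in the previous sketch; the reference feature extraction and the normalisation range are assumptions:

    import numpy as np

    def runtime_intensity(mode, w=None, ref_feats=None, manual_value=None,
                          r_min=0.0, r_max=1.0):
        """Intensity transfer: predict the value from a reference audio's features.
        Intensity control: take a value in [0, 1] given directly by a human."""
        if mode == "transfer":
            score = float(np.asarray(w) @ np.asarray(ref_feats))
            return float(np.clip((score - r_min) / (r_max - r_min), 0.0, 1.0))
        if mode == "control":
            return float(manual_value)
        raise ValueError("mode must be 'transfer' or 'control'")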
• Speech Samples
Emotion Intensity and its Control
Converting Neutral to Angry: Intensity = 0.1 (Weakest) | 0.3 (Weak) | 0.6 (Strong) | 0.9 (Strongest)
Converting Neutral to Sad: Intensity = 0.1 (Weakest) | 0.3 (Weak) | 0.6 (Strong) | 0.9 (Strongest)
Note: [1] All the speech samples you hear are synthesized from source neutral speech;
[2] The total duration of training data is less than 1 hour.
24
• Visual Comparisons (Duration)
Emotion Intensity and its Control
Sad (Weak), Intensity = 0.1
Sad (Medium), Intensity = 0.5
Sad (Strong), Intensity = 0.9
As intensity increases, speech becomes slower and more resonant. 25
• Visual Comparisons (Pitch and Energy)
Emotion Intensity and its Control
Pitch
Energy
26
• Intensity Control Evaluation
Emotion Intensity and its Control
Compared to other control methods, our proposed model with relative attributes
achieves the best results in terms of intensity control. 27
Emotion Intensity and its Control
Scan here for more speech samples:
Codes are publicly available:
https://github.com/KunZhou9646/Emovox
The Most Popular Article in IEEE Trans on Affective Computing!
28
Outline
• Background & Introduction
• Topic 1: Seq2Seq Emotion Modelling
• Topic 2: Emotion Intensity and its Control
• Topic 3: Mixed Emotion Modelling and Synthesis
• Conclusion & Future Work
29
Mixed Emotion Modelling & Synthesis*
• Mixed Emotions
Happy
Sad
Mixed Emotion
(Bittersweet)
* Kun Zhou, Berrak Sisman, Rajib Rana, Bjorn Schuller, Haizhou Li, “Speech Synthesis with Mixed Emotions”,
IEEE Transactions on Affective Computing, 2022.
30
Mixed Emotion Modelling & Synthesis
• Mixed Emotions
Humans can feel multiple emotions at the same time;
→ Some bittersweet moments:
- Remembering a lost love with warmth;
- The first time leaving home for college;
→ Psychologists have been studying the measures and paradigms of mixed emotions;*
Is it possible to synthesize mixed emotional effects in speech synthesis?
* Larsen, Jeff T., and A. Peter McGraw. "The case for mixed emotions." Social and Personality Psychology Compass 8.6
(2014): 263-274.
31
Mixed Emotion Modelling & Synthesis
• Theory of Emotion Wheel*
Humans can experience around 34,000 different emotions;
• Plutchik proposed 8 primary
emotions;
• All other emotions can be regarded
as mixed or derivative states of
primary emotions;
For example:
Delight = Joy + Surprise;
Disappointment = Sadness + Surprise;
*Robert Plutchik, “The Nature of Emotions”, American Scientist, 1984;
32
Mixed Emotion Modelling & Synthesis
• Research Gaps
To learn emotion information, previous work either:
- Associates emotion with explicit labels (e.g., discrete/continuous labels); or
- Imitates a reference audio (e.g., reference encoder, GST-Tacotron);
Both only synthesize an averaged style belonging to a specific emotion type.
• Research Challenges
• How to characterize and quantify the mixture of speech emotions?
• How to evaluate the synthesized mixed results?
Our focus!
33
Mixed Emotion Modelling & Synthesis
• The Ordinal Nature of Emotions*
Previous Methods:
- Assigning an absolute score (Arousal, Valence, …);
- Defining a discrete emotion category (Happy, Sad, …);
- Imitating a reference audio;
Our Method:
Characterize emotions through comparative assessments
(e.g., is sentence one happier than sentence two?)
Key idea → Learn to Rank
- Construct a ranking model using training data;
- Sort new objects according to their degree of relevance.
*Yannakakis, Georgios N., Roddy Cowie, and Carlos Busso. "The ordinal nature of emotions." 2017 Seventh International
Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 2017.
34
Mixed Emotion Modelling & Synthesis
• Three Assumptions*:
1. Mixed emotions are characterized by combinations, mixtures or compounds of primary emotions;
2. All emotions are related to some extent;
3. Each emotion has stereotypical styles;
• Proposed Diagram:
- Manually Controlled Attribute Vector;
- Proposed Relative Scheme (our focus);
- Emotional Text-to-Speech Model;
*All assumptions are supported by related psychological studies. Please refer to our paper for more details.
35
Mixed Emotion Modelling & Synthesis
• Design of a Novel Relative Scheme
- Train a relative ranking function f(x) between each pair of emotions;
- At run-time, the trained f(x) automatically predicts an emotion attribute (see the sketch after this slide);
- A smaller attribute value indicates a more similar emotional style;
36
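A small illustrative sketch of how such pairwise ranking functions could be wrapped to produce an emotion attribute vector for a reference utterance; the linear form, the min-max normalisation, and all names are assumptions based on our reading of this slide, not the actual implementation:

    import numpy as np

    def emotion_attributes(x, pair_rankers, score_ranges):
        """For a reference utterance feature vector x, predict one attribute value
        per (primary, other) emotion pair and normalise it to [0, 1]."""
        attrs = {}
        for other_emotion, w in pair_rankers.items():
            lo, hi = score_ranges[other_emotion]
            attrs[other_emotion] = float(np.clip((w @ x - lo) / (hi - lo), 0.0, 1.0))
        return attrs

    # pair_rankers:  {"Angry": w_surprise_vs_angry, "Happy": ..., "Sad": ...}
    # score_ranges:  per-emotion (min, max) ranking scores observed on training data.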
Mixed Emotion Modelling & Synthesis
• Training with Relative Scheme
During training, the framework learns to:
1. Characterize the input emotion styles;
2. Quantify the difference from other emotion types;
→ yielding an Emotion Embedding and Emotion Attribute Vectors;
37
Mixed Emotion Modelling & Synthesis
• Controlling Emotion Mixture in Speech
Given text inputs:
• Text Encoder: projects the linguistic information to internal representations;
• Emotion Encoder: captures the emotion style from a reference speech;
• Manually Controlled Attribute Vector: introduces & controls the characteristics of other emotion types (see the sketch after this slide);
38
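A toy sketch of the run-time control interface implied above: a manually controlled attribute vector (e.g. "100% Surprise + 30% Sad") combined with the reference emotion embedding before decoding; all names and the way the vector is injected are hypothetical:

    import numpy as np

    EMOTIONS = ["Angry", "Happy", "Sad", "Surprise"]

    def mixture_vector(primary="Surprise", **others):
        """Build a manually controlled attribute vector, e.g.
        mixture_vector("Surprise", Sad=0.3) for 100% Surprise + 30% Sad."""
        v = {e: 0.0 for e in EMOTIONS}
        v[primary] = 1.0
        for emotion, weight in others.items():
            v[emotion] = float(weight)
        return np.array([v[e] for e in EMOTIONS], dtype=np.float32)

    # The synthesizer would condition on (text encoding, emotion embedding, attr_vec);
    # how the vector is injected (concatenation, projection, ...) is model-specific.
    attr_vec = mixture_vector("Surprise", Sad=0.3)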
Mixed Emotion Modelling & Synthesis
• Experiments with Three Case Studies
• Case Study 1: “Delight”, “Outrage”, “Disappointment”
• Case Study 2: Conflicting Emotions – “Happy” & “Sad”
• Case Study 3: An Emotion Transition System
39
Mixed Emotion Modelling & Synthesis
• Case Study 1: “Delight”, “Outrage”, “Disappointment”
- We mix “Surprise” with “Happy”, “Angry” and “Sad”;
- We expect to synthesize mixed feelings closer to “Delight”, “Outrage”
and “Disappointment”;
40
Mixed Emotion Modelling & Synthesis
• Case Study 1: “Delight”, “Outrage”, “Disappointment”
(1) Analysis with Speech Emotion Recognition (SER)
• Analyze the mixture of emotions with the classification probabilities derived before the last projection layer of a pre-trained SER;
• The classification probabilities could capture the emotion mixtures!
41
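A sketch of this analysis with a generic pre-trained SER classifier; the model loading and feature extraction are placeholders, and only the idea of reading per-class probabilities is taken from the slide:

    import numpy as np

    def emotion_mixture_profile(logits):
        """Softmax over SER outputs gives per-class probabilities; averaging them
        over all utterances of one condition shows how much of each primary
        emotion the synthesized mixture carries."""
        logits = np.asarray(logits, dtype=np.float64)
        z = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
        return probs.mean(axis=0)

    # logits: array of shape (n_utterances, n_emotions) taken from a pre-trained SER;
    # how they are extracted depends on the SER toolkit and is not shown here.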
Mixed Emotion Modelling & Synthesis
• Case Study 1: “Delight”, “Outrage”, “Disappointment”
(2) Acoustic Evaluation
Mel-Cepstral Distortion (MCD) → spectrum similarity
Pearson Correlation Coefficient (PCC) → pitch similarity
42
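For the pitch similarity, a sketch of PCC between aligned F0 contours, computed over frames where both are voiced; the alignment and voicing decisions are assumed to be available:

    import numpy as np

    def f0_pcc(f0_conv, f0_ref):
        """Pearson correlation between two time-aligned F0 contours,
        using only frames that are voiced (F0 > 0) in both."""
        f0_conv, f0_ref = np.asarray(f0_conv), np.asarray(f0_ref)
        voiced = (f0_conv > 0) & (f0_ref > 0)
        return float(np.corrcoef(f0_conv[voiced], f0_ref[voiced])[0, 1])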
Mixed Emotion Modelling & Synthesis
• Case Study 1: “Delight”, “Outrage”, “Disappointment”
(3) Listening Experiments
(A) Evaluating the perception of “Angry”, “Happy” and “Sad”;
(B) Evaluating the perception of “Outrage”, “Delight” and “Disappointment”;
43
Mixed Emotion Modelling & Synthesis
• Case Study 1: “Delight”, “Outrage”, “Disappointment”
Speech Demos (Emotion Evaluation)
Only Surprise | Only Angry | Mixing Surprise with Angry (Outrage)
Only Surprise | Only Happy | Mixing Surprise with Happy (Delight)
Only Surprise | Only Sad | Mixing Surprise with Sad (Disappointment)
Note: [1] All the speech samples you hear are synthesized from text;
[2] The total duration of training data is less than 1 hour.
44
Mixed Emotion Modelling & Synthesis
• Case Study 1: “Delight”, “Outrage”, “Disappointment”
Speech Demos (Controllability Evaluation)
100% Surprise | 100% Surprise + 30% Angry | 100% Surprise + 60% Angry | 100% Surprise + 90% Angry
100% Surprise | 100% Surprise + 30% Happy | 100% Surprise + 60% Happy | 100% Surprise + 90% Happy
100% Surprise | 100% Surprise + 30% Sad | 100% Surprise + 60% Sad | 100% Surprise + 90% Sad
Note: [1] All the speech samples you hear are synthesized from text;
[2] The total duration of training data is less than 1 hour.
45
Mixed Emotion Modelling & Synthesis
• Case Study 2: Conflicting Emotions – “Happy” & “Sad”
- “Happy” and “Sad” are considered two conflicting emotions:
e.g., opposite valence (pleasant vs. unpleasant);
- “Bittersweet” describes a mixed feeling of both “Happy” and “Sad”;
- Professional actors are thought to be able to deliver such feelings through acting & speech;
- It is a challenging task to synthesize a “Bittersweet” feeling;
46
Mixed Emotion Modelling & Synthesis
Speech Demos
Note: All the speech samples you hear are synthesized from text
Happy | Sad | Happy + Sad
• Case Study 2: Conflicting Emotions – “Happy” & “Sad”
47
Mixed Emotion Modelling & Synthesis
• Case Study 3: An Emotion Transition System
- Emotion transition aims to gradually shift the emotion from one type to another;
- The key challenge is how to synthesize the internal states between different emotion types;
- We keep the sum of the emotion percentages at 100% and adjust each percentage manually (e.g., 80% Surprise with 20% Angry); see the sketch after this slide;
Internal Emotion States (Our Focus)
An Example of an Emotion Transition System:
48
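A sketch of the transition idea: keep the percentages summing to 100% and sweep them along a path, e.g. Surprise → Angry; the attribute-vector format follows the hypothetical sketch shown earlier:

    import numpy as np

    EMOTIONS = ["Angry", "Happy", "Sad", "Surprise"]

    def transition_path(src="Surprise", tgt="Angry", n_steps=5):
        """Yield attribute vectors moving from 100% src to 100% tgt while the
        two percentages always sum to 100%."""
        for alpha in np.linspace(0.0, 1.0, n_steps):
            v = {e: 0.0 for e in EMOTIONS}
            v[src] = 1.0 - float(alpha)
            v[tgt] = float(alpha)
            yield np.array([v[e] for e in EMOTIONS], dtype=np.float32)

    # e.g. 100% Surprise -> 75%/25% -> 50%/50% -> 25%/75% -> 100% Angry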
Mixed Emotion Modelling & Synthesis
• Case Study 3: An Emotion Transition System
Speech Demos - An Emotion Triangle
Angry
Sad
Happy
Note: All the speech samples you hear are synthesized from text
Emotion Triangle 49
Mixed Emotion Modelling & Synthesis
• Case Study 3: An Emotion Transition System
Note: All the speech samples you hear are synthesized from text
https://kunzhou9646.github.io/Emotion_Triangle/
50
Mixed Emotion Modelling & Synthesis
Scan here for more speech samples:
Codes & Pre-trained models are publicly available:
https://github.com/KunZhou9646/Mixed_Emotions
Published Version:
51
Outline
• Background & Introduction
• Topic 1: Seq2Seq Emotion Modelling
• Topic 2: Emotion Intensity and its Control
• Topic 3: Mixed Emotion Modelling and Synthesis
• Conclusion & Future Work
52
Conclusion
• Emotional speech generation is a key technology to achieve empathic AI;
• We study seq2seq emotion modelling for emotional voice conversion that improves
the generalizability of emotion models;
• We study emotion intensity modelling and control for emotional voice conversion;
• We study mixed emotion modelling and generation, and present three case studies
to validate our idea;
53
Future Work
• Inclusive Emotional Speech Generation
- Current Studies are “Exclusive”:
- Require high-quality recorded data;
- Acted emotional speech data may create stereotypes of emotions;
- Building an inclusive emotional speech generation framework:
- Associate emotional prosody with linguistic information;
- Reduce the impact of confounding factors:
Confounding Factors:
- Environmental Effects: Noise, Speech Overlaps, …
- Individual Effects: Accents, Age, Gender, …
54
Future Work
• Affective Vocal Bursts Synthesis
Speech Emotions can be manifested in:
- Para-Linguistics: Intonation, Energy, Speaking Rate, … (our focus so far)
- Non-Linguistics: Vocal Bursts → Laughs, Sighs, Grunts, Cries, … (not enough attention yet)
- Linguistics: Word Selection
Expressive Vocalization (ExVo) Workshops*:
To Understand and Synthesize Expressive Non-Verbal Vocalizations
*Baird, Alice, et al., “The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating,
and Personalizing Vocal Bursts”, Proceedings of ICML 2022.
55
Future Work
• Ethical Study
(1) “Deep Fake” Issues:
- Spreading disinformation;
- Defaming a politician by manipulating their emotions;
(2) Privacy Issues:
- Emotions are the results of human experiences;
- Privacy concerns in data collections;
(3) The Biases of Individuals, Languages and Cultures:
- Underrepresented languages and cultures;
- Individual differences in emotional expression and perception;
(4) Do We Want Our Artificial Agent to be Emotional?
- Influencing our emotions in ways we may prefer or may wish to avoid;
- Giving robots a certain personality that makes them indistinguishable from humans;
- Emotional overreliance on robots may increase loneliness in real life;
A. Triantafyllopoulos et al., "An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era,"
in Proceedings of the IEEE.
56
THANK YOU
• Prof. Haizhou Li (NUS & CUHK);
• Prof. Berrak Sisman (UT Dallas, USA), Prof. Rajib Rana (USQ, Australia),
Prof. Bjorn Schuller (Imperial College London, U.K.), Prof. Tanja Schultz (U of Bremen, Germany);
Family, Friends & Colleagues @ NUS, SUTD, UT Dallas, U of Bremen;
Picture Sources: Kun’s iPhone
2023 @ Dallas, USA
2023 @ Bremen, Germany
2022 @ Marina Bay, SG
2019 @ Oriental Hotel, SG
2023 @ Dallas, USA
2022 @ Bali, Indonesia
2020 @ Blue Horizon, SG
2023 @ Bremen, Germany
2022 @ Clark Quay, SG
57
A little bit more: As a logo designer …
Picture Source:
[1] “Define AI at ICASSP 2022”, Travelogue;
[2] Kun’s iPhone;
58
THANK YOU
59

More Related Content

What's hot

Advanced Voice Conversion
Advanced Voice ConversionAdvanced Voice Conversion
Advanced Voice ConversionNU_I_TODALAB
 
音声の声質を変換する技術とその応用
音声の声質を変換する技術とその応用音声の声質を変換する技術とその応用
音声の声質を変換する技術とその応用NU_I_TODALAB
 
多人数演奏楽譜から連弾譜への自動編曲
多人数演奏楽譜から連弾譜への自動編曲多人数演奏楽譜から連弾譜への自動編曲
多人数演奏楽譜から連弾譜への自動編曲kthrlab
 
喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法NU_I_TODALAB
 
イベント継続長を明示的に制御したBLSTM-HSMMハイブリッドモデルによる多重音響イベント検出
イベント継続長を明示的に制御したBLSTM-HSMMハイブリッドモデルによる多重音響イベント検出イベント継続長を明示的に制御したBLSTM-HSMMハイブリッドモデルによる多重音響イベント検出
イベント継続長を明示的に制御したBLSTM-HSMMハイブリッドモデルによる多重音響イベント検出Tomoki Hayashi
 
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3Naoya Takahashi
 
日本音響学会2017秋 ”Moment-matching networkに基づく一期一会音声合成における発話間変動の評価”
日本音響学会2017秋 ”Moment-matching networkに基づく一期一会音声合成における発話間変動の評価”日本音響学会2017秋 ”Moment-matching networkに基づく一期一会音声合成における発話間変動の評価”
日本音響学会2017秋 ”Moment-matching networkに基づく一期一会音声合成における発話間変動の評価”Shinnosuke Takamichi
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022NU_I_TODALAB
 
深層生成モデルに基づく音声合成技術
深層生成モデルに基づく音声合成技術深層生成モデルに基づく音声合成技術
深層生成モデルに基づく音声合成技術NU_I_TODALAB
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)Yuki Saito
 
テキスト音声合成技術と多様性への挑戦 (名古屋大学 知能システム特論)
テキスト音声合成技術と多様性への挑戦 (名古屋大学 知能システム特論)テキスト音声合成技術と多様性への挑戦 (名古屋大学 知能システム特論)
テキスト音声合成技術と多様性への挑戦 (名古屋大学 知能システム特論)Shinnosuke Takamichi
 
日本語モーラの持続時間長: 単音節語提示による知覚実験(JSLS2015)
日本語モーラの持続時間長: 単音節語提示による知覚実験(JSLS2015)日本語モーラの持続時間長: 単音節語提示による知覚実験(JSLS2015)
日本語モーラの持続時間長: 単音節語提示による知覚実験(JSLS2015)Kosuke Sugai
 
独立性に基づくブラインド音源分離の発展と独立低ランク行列分析 History of independence-based blind source sep...
独立性に基づくブラインド音源分離の発展と独立低ランク行列分析 History of independence-based blind source sep...独立性に基づくブラインド音源分離の発展と独立低ランク行列分析 History of independence-based blind source sep...
独立性に基づくブラインド音源分離の発展と独立低ランク行列分析 History of independence-based blind source sep...Daichi Kitamura
 
ここまで来た&これから来る音声合成 (明治大学 先端メディアコロキウム)
ここまで来た&これから来る音声合成 (明治大学 先端メディアコロキウム)ここまで来た&これから来る音声合成 (明治大学 先端メディアコロキウム)
ここまで来た&これから来る音声合成 (明治大学 先端メディアコロキウム)Shinnosuke Takamichi
 
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクトCREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクトNU_I_TODALAB
 
信号の独立性に基づく多チャンネル音源分離
信号の独立性に基づく多チャンネル音源分離信号の独立性に基づく多チャンネル音源分離
信号の独立性に基づく多チャンネル音源分離NU_I_TODALAB
 
Latest Frame interpolation Algorithms
Latest Frame interpolation AlgorithmsLatest Frame interpolation Algorithms
Latest Frame interpolation AlgorithmsHyeongmin Lee
 
音響信号に対する異常音検知技術と応用
音響信号に対する異常音検知技術と応用音響信号に対する異常音検知技術と応用
音響信号に対する異常音検知技術と応用Yuma Koizumi
 

What's hot (20)

Advanced Voice Conversion
Advanced Voice ConversionAdvanced Voice Conversion
Advanced Voice Conversion
 
音声の声質を変換する技術とその応用
音声の声質を変換する技術とその応用音声の声質を変換する技術とその応用
音声の声質を変換する技術とその応用
 
多人数演奏楽譜から連弾譜への自動編曲
多人数演奏楽譜から連弾譜への自動編曲多人数演奏楽譜から連弾譜への自動編曲
多人数演奏楽譜から連弾譜への自動編曲
 
喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法
 
イベント継続長を明示的に制御したBLSTM-HSMMハイブリッドモデルによる多重音響イベント検出
イベント継続長を明示的に制御したBLSTM-HSMMハイブリッドモデルによる多重音響イベント検出イベント継続長を明示的に制御したBLSTM-HSMMハイブリッドモデルによる多重音響イベント検出
イベント継続長を明示的に制御したBLSTM-HSMMハイブリッドモデルによる多重音響イベント検出
 
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
 
日本音響学会2017秋 ”Moment-matching networkに基づく一期一会音声合成における発話間変動の評価”
日本音響学会2017秋 ”Moment-matching networkに基づく一期一会音声合成における発話間変動の評価”日本音響学会2017秋 ”Moment-matching networkに基づく一期一会音声合成における発話間変動の評価”
日本音響学会2017秋 ”Moment-matching networkに基づく一期一会音声合成における発話間変動の評価”
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
 
深層生成モデルに基づく音声合成技術
深層生成モデルに基づく音声合成技術深層生成モデルに基づく音声合成技術
深層生成モデルに基づく音声合成技術
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)
 
テキスト音声合成技術と多様性への挑戦 (名古屋大学 知能システム特論)
テキスト音声合成技術と多様性への挑戦 (名古屋大学 知能システム特論)テキスト音声合成技術と多様性への挑戦 (名古屋大学 知能システム特論)
テキスト音声合成技術と多様性への挑戦 (名古屋大学 知能システム特論)
 
日本語モーラの持続時間長: 単音節語提示による知覚実験(JSLS2015)
日本語モーラの持続時間長: 単音節語提示による知覚実験(JSLS2015)日本語モーラの持続時間長: 単音節語提示による知覚実験(JSLS2015)
日本語モーラの持続時間長: 単音節語提示による知覚実験(JSLS2015)
 
独立性に基づくブラインド音源分離の発展と独立低ランク行列分析 History of independence-based blind source sep...
独立性に基づくブラインド音源分離の発展と独立低ランク行列分析 History of independence-based blind source sep...独立性に基づくブラインド音源分離の発展と独立低ランク行列分析 History of independence-based blind source sep...
独立性に基づくブラインド音源分離の発展と独立低ランク行列分析 History of independence-based blind source sep...
 
ここまで来た&これから来る音声合成 (明治大学 先端メディアコロキウム)
ここまで来た&これから来る音声合成 (明治大学 先端メディアコロキウム)ここまで来た&これから来る音声合成 (明治大学 先端メディアコロキウム)
ここまで来た&これから来る音声合成 (明治大学 先端メディアコロキウム)
 
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクトCREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
 
信号の独立性に基づく多チャンネル音源分離
信号の独立性に基づく多チャンネル音源分離信号の独立性に基づく多チャンネル音源分離
信号の独立性に基づく多チャンネル音源分離
 
Latest Frame interpolation Algorithms
Latest Frame interpolation AlgorithmsLatest Frame interpolation Algorithms
Latest Frame interpolation Algorithms
 
AXIS Nova Carlos Gomes - CYRELA GOLDSZTEIN
AXIS Nova Carlos Gomes - CYRELA GOLDSZTEIN AXIS Nova Carlos Gomes - CYRELA GOLDSZTEIN
AXIS Nova Carlos Gomes - CYRELA GOLDSZTEIN
 
AR/SLAM for end-users
AR/SLAM for end-usersAR/SLAM for end-users
AR/SLAM for end-users
 
音響信号に対する異常音検知技術と応用
音響信号に対する異常音検知技術と応用音響信号に対する異常音検知技術と応用
音響信号に対する異常音検知技術と応用
 

Similar to PhD_Oral_Defense_Kun.ppt

Oral Qualification Examination_Kun_Zhou
Oral Qualification Examination_Kun_ZhouOral Qualification Examination_Kun_Zhou
Oral Qualification Examination_Kun_ZhouKunZhou18
 
Literature Review On: ”Speech Emotion Recognition Using Deep Neural Network”
Literature Review On: ”Speech Emotion Recognition Using Deep Neural Network”Literature Review On: ”Speech Emotion Recognition Using Deep Neural Network”
Literature Review On: ”Speech Emotion Recognition Using Deep Neural Network”IRJET Journal
 
IEEE ICASSP 2021
IEEE ICASSP 2021IEEE ICASSP 2021
IEEE ICASSP 2021KunZhou18
 
Emotion Detection Using Noninvasive Low-cost Sensors
Emotion Detection Using Noninvasive Low-cost SensorsEmotion Detection Using Noninvasive Low-cost Sensors
Emotion Detection Using Noninvasive Low-cost SensorsNicole Novielli
 
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...ijtsrd
 
Speech Emotion Recognition Using Neural Networks
Speech Emotion Recognition Using Neural NetworksSpeech Emotion Recognition Using Neural Networks
Speech Emotion Recognition Using Neural Networksijtsrd
 
BASIC ANALYSIS ON PROSODIC FEATURES IN EMOTIONAL SPEECH
BASIC ANALYSIS ON PROSODIC FEATURES IN EMOTIONAL SPEECHBASIC ANALYSIS ON PROSODIC FEATURES IN EMOTIONAL SPEECH
BASIC ANALYSIS ON PROSODIC FEATURES IN EMOTIONAL SPEECHIJCSEA Journal
 
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...IJECEIAES
 
NATURAL LANGUAGE PROCESSING.pptx
NATURAL LANGUAGE PROCESSING.pptxNATURAL LANGUAGE PROCESSING.pptx
NATURAL LANGUAGE PROCESSING.pptxsaivinay93
 
Speech emotion recognition using 2D-convolutional neural network
Speech emotion recognition using 2D-convolutional neural  networkSpeech emotion recognition using 2D-convolutional neural  network
Speech emotion recognition using 2D-convolutional neural networkIJECEIAES
 
A Review Paper on Speech Based Emotion Detection Using Deep Learning
A Review Paper on Speech Based Emotion Detection Using Deep LearningA Review Paper on Speech Based Emotion Detection Using Deep Learning
A Review Paper on Speech Based Emotion Detection Using Deep LearningIRJET Journal
 
76201926
7620192676201926
76201926IJRAT
 
A-STUDY-ON-SENTIMENT-POLARITY.pdf
A-STUDY-ON-SENTIMENT-POLARITY.pdfA-STUDY-ON-SENTIMENT-POLARITY.pdf
A-STUDY-ON-SENTIMENT-POLARITY.pdfSUDESHNASANI1
 
Emotion recognition using facial expressions and speech
Emotion recognition using facial expressions and speechEmotion recognition using facial expressions and speech
Emotion recognition using facial expressions and speechLakshmi Sarvani Videla
 
A hybrid strategy for emotion classification
A hybrid strategy for emotion classificationA hybrid strategy for emotion classification
A hybrid strategy for emotion classificationnooriasukmaningtyas
 
ANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHM
ANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHMANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHM
ANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHMIRJET Journal
 
Emotion recognition based on the energy distribution of plosive syllables
Emotion recognition based on the energy distribution of plosive  syllablesEmotion recognition based on the energy distribution of plosive  syllables
Emotion recognition based on the energy distribution of plosive syllablesIJECEIAES
 
saito22research_talk_at_NUS
saito22research_talk_at_NUSsaito22research_talk_at_NUS
saito22research_talk_at_NUSYuki Saito
 
Text to Emotion Extraction Using Supervised Machine Learning Techniques
Text to Emotion Extraction Using Supervised Machine Learning TechniquesText to Emotion Extraction Using Supervised Machine Learning Techniques
Text to Emotion Extraction Using Supervised Machine Learning TechniquesTELKOMNIKA JOURNAL
 

Similar to PhD_Oral_Defense_Kun.ppt (20)

Oral Qualification Examination_Kun_Zhou
Oral Qualification Examination_Kun_ZhouOral Qualification Examination_Kun_Zhou
Oral Qualification Examination_Kun_Zhou
 
Literature Review On: ”Speech Emotion Recognition Using Deep Neural Network”
Literature Review On: ”Speech Emotion Recognition Using Deep Neural Network”Literature Review On: ”Speech Emotion Recognition Using Deep Neural Network”
Literature Review On: ”Speech Emotion Recognition Using Deep Neural Network”
 
IEEE ICASSP 2021
IEEE ICASSP 2021IEEE ICASSP 2021
IEEE ICASSP 2021
 
Emotion Detection Using Noninvasive Low-cost Sensors
Emotion Detection Using Noninvasive Low-cost SensorsEmotion Detection Using Noninvasive Low-cost Sensors
Emotion Detection Using Noninvasive Low-cost Sensors
 
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...
 
Speech Emotion Recognition Using Neural Networks
Speech Emotion Recognition Using Neural NetworksSpeech Emotion Recognition Using Neural Networks
Speech Emotion Recognition Using Neural Networks
 
BASIC ANALYSIS ON PROSODIC FEATURES IN EMOTIONAL SPEECH
BASIC ANALYSIS ON PROSODIC FEATURES IN EMOTIONAL SPEECHBASIC ANALYSIS ON PROSODIC FEATURES IN EMOTIONAL SPEECH
BASIC ANALYSIS ON PROSODIC FEATURES IN EMOTIONAL SPEECH
 
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...
 
NATURAL LANGUAGE PROCESSING.pptx
NATURAL LANGUAGE PROCESSING.pptxNATURAL LANGUAGE PROCESSING.pptx
NATURAL LANGUAGE PROCESSING.pptx
 
Speech emotion recognition using 2D-convolutional neural network
Speech emotion recognition using 2D-convolutional neural  networkSpeech emotion recognition using 2D-convolutional neural  network
Speech emotion recognition using 2D-convolutional neural network
 
A Review Paper on Speech Based Emotion Detection Using Deep Learning
A Review Paper on Speech Based Emotion Detection Using Deep LearningA Review Paper on Speech Based Emotion Detection Using Deep Learning
A Review Paper on Speech Based Emotion Detection Using Deep Learning
 
76201926
7620192676201926
76201926
 
A-STUDY-ON-SENTIMENT-POLARITY.pdf
A-STUDY-ON-SENTIMENT-POLARITY.pdfA-STUDY-ON-SENTIMENT-POLARITY.pdf
A-STUDY-ON-SENTIMENT-POLARITY.pdf
 
Emotion recognition using facial expressions and speech
Emotion recognition using facial expressions and speechEmotion recognition using facial expressions and speech
Emotion recognition using facial expressions and speech
 
A hybrid strategy for emotion classification
A hybrid strategy for emotion classificationA hybrid strategy for emotion classification
A hybrid strategy for emotion classification
 
ANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHM
ANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHMANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHM
ANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHM
 
Emotion recognition based on the energy distribution of plosive syllables
Emotion recognition based on the energy distribution of plosive  syllablesEmotion recognition based on the energy distribution of plosive  syllables
Emotion recognition based on the energy distribution of plosive syllables
 
saito22research_talk_at_NUS
saito22research_talk_at_NUSsaito22research_talk_at_NUS
saito22research_talk_at_NUS
 
N01741100102
N01741100102N01741100102
N01741100102
 
Text to Emotion Extraction Using Supervised Machine Learning Techniques
Text to Emotion Extraction Using Supervised Machine Learning TechniquesText to Emotion Extraction Using Supervised Machine Learning Techniques
Text to Emotion Extraction Using Supervised Machine Learning Techniques
 

Recently uploaded

Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...RKavithamani
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 

Recently uploaded (20)

Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 

PhD_Oral_Defense_Kun.ppt

  • 1. © Copyright National University of Singapore. All Rights Reserved. © Copyright National University of Singapore. All Rights Reserved. Emotion Modelling for Speech Generation Kun Zhou (National University of Singapore) Supervisors: Prof. Haizhou Li (Main Supervisor, NUS & CUHK) Associate Prof. Thomas Yeo Boon Thye (NUS) 1
  • 2. © Copyright National University of Singapore. All Rights Reserved. Outline • Background & Introduction • Topic 1: Seq2Seq Emotion Modelling • Topic 2: Emotion Intensity and its Control • Topic 3: Mixed Emotion Modelling and Synthesis • Conclusion & Future Work 2
  • 3. © Copyright National University of Singapore. All Rights Reserved. Background & Introduction • Speech Generation: - Text-to-Speech (TTS): Text  Speech - Voice Conversion (VC): Speech  Speech • Development: Statistical Parametric-based  Neural Network-based • Applications: Conversational Agents (Apple Siri, Amazon Alexa, …) • Research Limitations: - Prosody Modelling: Lack of Expressivity; - Data Dependency: Require Large & High-Quality Data; - Personalization; - Real-Time Generation; Teach Machines to Speak Picture Sources: [1] “When Machines Speak”, Center for Science and Society, Columbia University, USA; [2] “Talking with Machines”, News and Events, University of York, UK; 3
  • 4. © Copyright National University of Singapore. All Rights Reserved. Emotion Modelling for Speech Generation • Why Speech Emotions Are Difficult to Model? 1) Human Emotions are subtle and difficult to represent; - Categorical: Angry, Happy, Sad, … - Continuous: Valence, Arousal, Dominance, … 2) Speech Emotions relate to various acoustic features; - Speech Quality, Energy, Intonation, Rhythm, … 3) Emotion Perception is subjective; • Our Research Focus: 1) How to better imitate human emotions? “Generalizability” 2) How to make synthesized emotions more creative? “Controllability” 4
  • 5. © Copyright National University of Singapore. All Rights Reserved. Towards Empathic AI Empathic AI: • Receiving: Receive Human/Environment Inputs; • Appraisal: Appraise the Situation; • Response: Generate Appropriate Responses; Emotional Speech Generation: • Generate Emotional Responses; • Increase the Dialogue Richness; • Enhance the Engagement in Human- Machine Interaction; Picture Sources: “Should Algorithm and Robots Mimic Empathy?”, The Medical Futurist; 5
  • 6. © Copyright National University of Singapore. All Rights Reserved. Publications • Journal Publications (1st author): [1] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Emotional Voice Conversion: Theory, Databases and ESD”, Speech Communication, 2022; [2] Kun Zhou, Berrak Sisman, Rajib Rana, Bjorn Schuller, Haizhou Li, “Emotion Intensity and its Control for Emotional Voice Conversion”, IEEE Transactions on Affective Computing, 2023; [3] Kun Zhou, Berrak Sisman, Rajib Rana, Bjorn Schuller, Haizhou Li, “Speech Synthesis with Mixed Emotions”, IEEE Transactions on Affective Computing, 2022; • Conference Publications (1st author): [4] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data”, Speaker Odyssey 2019; [5] Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-Independent Emotional Voice Conversion”, INTERSPEECH 2020; [6] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Seen and Unseen Emotional Style Transfer with a New Emotional Speech Dataset”, ICASSP 2021; [7] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to-Sequence Training”, INTERSPEECH 2021; [8] Kun Zhou, Berrak Sisman, Haizhou Li, “VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech”, IEEE SLT 2021; 6
  • 7. © Copyright National University of Singapore. All Rights Reserved. • Conference Publications (Co-authored): [9] Zongyang Du, Kun Zhou, Berrak Sisman, Haizhou Li, “Spectrum and Prosody Conversion for Cross-Lingual Voice Conversion with CycleGAN”, APSIPA 2020; [10] Junchen Lu, Kun Zhou, Berrak Sisman, Haizhou Li, “VAW-GAN for Singing Voice Conversion with Non-Parallel Data”, APSIPA 2020; [11] Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li, “Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer”, IEEE ASRU 2020; [12] Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li, “Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion”, INTERSPEECH 2022; Publications 7
  • 8. © Copyright National University of Singapore. All Rights Reserved. Emotion Modelling for Speech Generation [1] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to-Sequence Training”, INTERSPEECH 2021; [2] Kun Zhou, Berrak Sisman, Rajib Rana, Bjorn Schuller, Haizhou Li, “Emotion Intensity and its Control for Emotional Voice Conversion”, IEEE Transactions on Affective Computing, 2022; [3] Kun Zhou, Berrak Sisman, Rajib Rana, Bjorn Schuller, Haizhou Li, “Speech Synthesis with Mixed Emotions”, IEEE Transactions on Affective Computing, 2022; Topic 1: Seq2Seq Emotion Modelling (Generalizability) Topic 2: Emotion Intensity Modelling and its Control (Controllability) Topic 3: Mixed Emotion Modelling and Synthesis (Controllability) 8
  • 9. © Copyright National University of Singapore. All Rights Reserved. Outline • Background & Introduction • Topic 1: Seq2Seq Emotion Modelling • Topic 2: Emotion Intensity and its Control • Topic 3: Mixed Emotion Modelling and Synthesis • Conclusion & Future Work 9
  • 10. © Copyright National University of Singapore. All Rights Reserved. Seq2Seq Emotion Modelling* • Why Seq2Seq Emotion Models? Frame-based Models: - Convert emotion on a frame-basis  Cannot Modify Duration; - Separate model spectrum and prosody  Mismatch; Seq2Seq-based Models: - Joint model spectrum, prosody, and duration  No Mismatch; - Sequence-level modelling  Focus on Emotion-Relevant Regions; - Needs a large amount of emotional speech training data; Our Seq2Seq-based Models: Enable duration modelling Only require limited emotion training data * Kun Zhou, Berrak Sisman, Haizhou Li, “Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to- Sequence Training”, INTERSPEECH 2021. 10
  • 11. © Copyright National University of Singapore. All Rights Reserved. Seq2Seq Emotion Modelling • Two-Stage Training for Seq2Seq Emotion Modelling (1) Style Initialization (2) Emotion Training • Stage I: Style Initialization - Speaker Encoder learns Speaker Styles from a TTS corpus; • Stage II: Emotion Training - Emotion Encoder learns Emotional Styles from limited emotional speech data; 11
  • 12. © Copyright National University of Singapore. All Rights Reserved. Seq2Seq Emotion Modelling • Two-Stage Training for Seq2Seq Emotion Modelling (1) Style Encoder from Stage I (2) Emotion Encoder from Stage II - Emotion Encoder produce effective emotion embeddings; - Only needs 50 mins emotion training data! When testing with emotional data: 12
  • 13. © Copyright National University of Singapore. All Rights Reserved. Seq2Seq Emotion Modelling • Objective Evaluations Seq2Seq Models outperformed the frame-based baselines. DDUR -> Duration Conversion Performance MCD -> Spectrum Conversion Performance 13
  • 14. © Copyright National University of Singapore. All Rights Reserved. Seq2Seq Emotion Modelling • Subjective Evaluations Seq2Seq Models outperformed the frame-based baselines. MOS Test for Emotion Similarity BWS Test for Speech Quality 14
  • 15. © Copyright National University of Singapore. All Rights Reserved. Outline • Background & Introduction • Topic 1: Seq2Seq Emotion Modelling • Topic 2: Emotion Intensity and its Control • Topic 3: Mixed Emotion Modelling and Synthesis • Conclusion & Future Work 15
  • 16. © Copyright National University of Singapore. All Rights Reserved. Emotion Intensity and its Control* • Emotion Intensity Figure 1. Example of different intensity images of “fear”* Picture Source: Hoffmann, Holger, et al. "Expression intensity, gender and facial emotion recognition: Women recognize only subtle facial emotions better than men.” Acta psychologica 135.3 (2010): 278-283. *Kun Zhou, Berrak Sisman, Rajib Rana, Bjorn Schuller, Haizhou Li, “Emotion Intensity and its Control for Emotional Voice Conversion”, IEEE Transactions on Affective Computing, 14 (1) 31-48, 2023. - The level that an emotion can be perceived by a listener; - Not just the loudness of a voice; - Correlates to all the acoustic cues that contribute to an emotion; 16
  • 17. © Copyright National University of Singapore. All Rights Reserved. Emotion Intensity and its Control • Research Challenges - Entangled with multiple acoustic cues such as timbre, pitch and rhythm; - Lack of explicit intensity labels; - Subjective; (Hot or cold anger?) • Previous Studies - Use auxiliary features, e.g. voiced/unvoiced/silence (VUS), attention weights, a saliency map; - Manipulate internal emotion representations, e.g. interpolation or scaling; - Lack of interpretability; - Limited performance; 17
  • 18. © Copyright National University of Singapore. All Rights Reserved. • Emovox – Emotional Voice Conversion with Intensity Control Emotion Intensity and its Control Sequence-to-Sequence Conversion Models: • Linguistic Transplant; • Emotion Transfer; • Intensity Control; How to design an intensity encoder that can: (1) Accept manual intensity labels; (2) Produce interpretable intensity embeddings -> How to formulate emotion intensity? 18
  • 19. © Copyright National University of Singapore. All Rights Reserved. • Relative Attributes* Emotion Intensity and its Control He smiles more than the person on the right, but less than the person on the left! Our world is not binary: - Before: Predicting the presence of an attribute; E.g. Smiling or not smiling? -> Classification Problem; - Relative Attributes*: Predicting the strength of an attribute; E.g. How much is he/she smiling? -> Ranking Problem; * Parikh, Devi, and Kristen Grauman. "Relative attributes." 2011 International Conference on Computer Vision. IEEE, 2011. 19
  • 20. © Copyright National University of Singapore. All Rights Reserved. • Formulation of Emotion Intensity Emotion Intensity and its Control Assumptions: (1) “Neutral” does not contain any emotional variance -> its intensity is always 0; (2) Emotion intensity is the relative difference from “Neutral”; Given a training set T = {x_t}, where N and E are the neutral and emotional subsets, we learn a linear ranking function r(x_t) = w·x_t that satisfies the following constraints (see the sketch below). Supervision: O -> Ordered Set, S -> Similar Set. O: E has higher intensity than N; S: N-N pairs / E-E pairs 20
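A possible written-out form of these constraints, following the standard relative-attributes objective of Parikh & Grauman (2011) cited on the previous slide (the thesis notation may differ slightly):

```latex
\[
\begin{aligned}
\min_{\mathbf{w}}\;\; & \tfrac{1}{2}\,\lVert \mathbf{w}\rVert_2^2 \;+\; C\Big(\textstyle\sum_{(i,j)\in O}\xi_{ij}^{2} \;+\; \sum_{(i,j)\in S}\gamma_{ij}^{2}\Big) \\
\text{s.t.}\;\; & \mathbf{w}^{\top}(\mathbf{x}_i - \mathbf{x}_j) \;\ge\; 1 - \xi_{ij}, \quad (i,j)\in O \quad \text{($x_i$ emotional, $x_j$ neutral)} \\
& \big|\mathbf{w}^{\top}(\mathbf{x}_i - \mathbf{x}_j)\big| \;\le\; \gamma_{ij}, \quad (i,j)\in S \quad \text{(neutral-neutral or emotional-emotional pairs)} \\
& \xi_{ij}\ge 0,\;\; \gamma_{ij}\ge 0, \qquad r(\mathbf{x}_t) = \mathbf{w}^{\top}\mathbf{x}_t .
\end{aligned}
\]
```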
  • 21. © Copyright National University of Singapore. All Rights Reserved. • Formulation of Emotion Intensity Emotion Intensity and its Control (1) Traditional Classifier (Neutral vs. Angry) (2) Relative Ranking (Least Angry -> Most Angry) A ranking model automatically predicts the intensity of the emotion with respect to other speech samples. 21
  • 22. © Copyright National University of Singapore. All Rights Reserved. • Modelling Emotion Style with its Intensity Emotion Intensity and its Control Emotion Style Reconstruction: - Emotion Modelling; - Intensity Modelling; 22
  • 23. © Copyright National University of Singapore. All Rights Reserved. • Run-time Emotion Intensity Control Emotion Intensity and its Control • Emotion Intensity Transfer: Emotion intensity is predicted from a reference audio; • Emotion Intensity Control: Emotion intensity is given by humans; 23
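A minimal sketch of these two run-time modes, assuming a trained linear ranking function w from the formulation above; the feature vectors, normalization bounds and function names are illustrative, not the Emovox interface:

```python
# Intensity transfer vs. manual intensity control (toy sketch with assumed names).
import numpy as np

def predict_intensity(w, ref_feats, w_min, w_max):
    """Transfer mode: score the reference with the ranker, then normalize to [0, 1]."""
    score = float(np.dot(w, ref_feats))
    return float(np.clip((score - w_min) / (w_max - w_min), 0.0, 1.0))

def get_intensity(w, ref_feats=None, manual_value=None, w_min=0.0, w_max=1.0):
    """Control mode: a manually given value overrides the reference-based prediction."""
    if manual_value is not None:
        return float(np.clip(manual_value, 0.0, 1.0))
    return predict_intensity(w, ref_feats, w_min, w_max)

w = np.random.randn(16)        # stand-in for a trained ranking function
ref = np.random.randn(16)      # stand-in for reference-utterance features
print(get_intensity(w, ref_feats=ref, w_min=-5, w_max=5))   # transfer mode
print(get_intensity(w, manual_value=0.9))                    # control mode (user-given)
```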
  • 24. © Copyright National University of Singapore. All Rights Reserved. • Speech Samples Emotion Intensity and its Control Intensity = 0.1 (Weakest) Intensity = 0.3 (Weak) Intensity = 0.6 (Strong) Intensity = 0.9 (Strongest) Intensity = 0.1 (Weakest) Intensity = 0.3 (Weak) Intensity = 0.6 (Strong) Intensity = 0.9 (Strongest) Converting Neutral to Angry: Converting Neutral to Sad: Note: [1] All the speech samples you hear are synthesized from source neutral speech; [2] The total duration of training data is less than 1 hour. 24
  • 25. © Copyright National University of Singapore. All Rights Reserved. • Visual Comparisons (Duration) Emotion Intensity and its Control Sad (Weak), Intensity = 0.1 Sad (Medium), Intensity = 0.5 Sad (Strong), Intensity = 0.9 As intensity increases, speech becomes slower and more resonant. 25
  • 26. © Copyright National University of Singapore. All Rights Reserved. • Visual Comparisons (Pitch and Energy) Emotion Intensity and its Control Pitch Energy 26
  • 27. © Copyright National University of Singapore. All Rights Reserved. • Intensity Control Evaluation Emotion Intensity and its Control Compared to other control methods, our proposed model with relative attributes achieves the best results in terms of intensity control. 27
  • 28. © Copyright National University of Singapore. All Rights Reserved. Emotion Intensity and its Control Scan here for more speech samples: Code is publicly available: https://github.com/KunZhou9646/Emovox The Most Popular Article in IEEE Trans on Affective Computing! 28
  • 29. © Copyright National University of Singapore. All Rights Reserved. Outline • Background & Introduction • Topic 1: Seq2Seq Emotion Modelling • Topic 2: Emotion Intensity and its Control • Topic 3: Mixed Emotion Modelling and Synthesis • Conclusion & Future Work 29
  • 30. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis* • Mixed Emotions Happy Sad Mixed Emotion (Bittersweet) * Kun Zhou, Berrak Sisman, Rajib Rana, Bjorn Schuller, Haizhou Li, “Speech Synthesis with Mixed Emotions”, IEEE Transactions on Affective Computing, 2022. 30
  • 31. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Mixed Emotions Humans can feel multiple emotions at the same time; Some bittersweet moments: - Remembering a lost love with warmth; - The first time leaving home for college; Psychologists have been studying the measures and paradigms of mixed emotions;* Is it possible to synthesize mixed emotional effects for speech synthesis? * Larsen, Jeff T., and A. Peter McGraw. "The case for mixed emotions." Social and Personality Psychology Compass 8.6 (2014): 263-274. 31
  • 32. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Theory of the Emotion Wheel* Humans can experience around 34,000 different emotions; • Plutchik proposed 8 primary emotions; • All other emotions can be regarded as mixed or derivative states of the primary emotions; For example: Delight = Joy + Surprise; Disappointment = Sad + Surprise; *Robert Plutchik, “The Nature of Emotions”, American Scientist, 1984; 32
  • 33. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Research Gaps To learn emotion information: - Associate with explicit labels (e.g., discrete/continuous labels); - Imitate a reference audio (e.g., reference encoder, GST-Tacotron); Only synthesize an averaged style belonging to a specific emotion type. • Research Challenges • How to characterize and quantify the mixture of speech emotions? • How to evaluate the synthesized mixed results? Our focus! 33
  • 34. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • The Ordinal Nature of Emotions* Previous Methods: - Assigning an absolute score (Arousal, Valence, …); - Defining a discrete emotion category (Happy, Sad, …); - Imitating a reference audio; Our Method: Characterize emotions through comparative assessments (e.g., is sentence one happier than sentence two?) Key idea -> Learn to Rank - Construct a ranking model using training data; - Sort new objects according to their degree of relevance. *Yannakakis, Georgios N., Roddy Cowie, and Carlos Busso. "The ordinal nature of emotions." 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 2017. 34
  • 35. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Three Assumptions*: 1. Mixed emotions are characterized by combinations, mixtures or compounds of primary emotions; 2. All emotions are related to some extent; 3. Each emotion has stereotypical styles; • Proposed Diagram: - Manually Controlled Attribute Vector; - Proposed Relative Scheme (Our Focus); - Emotional Text-to-Speech Model; *All assumptions are supported by related psychological studies. Please refer to our paper for more details. 35
  • 36. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Design of a Novel Relative Scheme - Train a relative ranking function f(x) between each pair of emotions (see the sketch below); - At run-time, the trained f(x) automatically predicts an emotion attribute; - A smaller emotion attribute value indicates a more similar emotional style; 36
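As a concrete illustration of one such pairwise ranking function f(x), the sketch below trains a linear RankSVM on difference vectors, a common way to implement relative attributes; the features are random stand-ins, so this is an assumption-laden toy rather than the paper's implementation:

```python
# One pairwise ranking function of the relative scheme, as a linear RankSVM sketch.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
happy = rng.normal(loc=+1.0, size=(100, 24))   # stand-in features of emotion A
sad   = rng.normal(loc=-1.0, size=(100, 24))   # stand-in features of emotion B

# Ordered pairs: every emotion-A sample should rank above every emotion-B sample
# for this attribute, so we train on difference vectors with symmetric labels.
pairs, labels = [], []
for a in happy[:50]:
    for b in sad[:50]:
        pairs.append(a - b); labels.append(+1)
        pairs.append(b - a); labels.append(-1)

ranker = LinearSVC(C=1.0, max_iter=10000).fit(np.array(pairs), np.array(labels))
w = ranker.coef_.ravel()                        # the ranking direction: f(x) = w·x

# At run time, f(x) scores a new utterance; a smaller gap to an emotion's typical
# score means a more similar style, filling one entry of the attribute vector.
x_new = rng.normal(loc=0.3, size=24)
print("attribute score:", float(w @ x_new))
```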
  • 37. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Training with the Relative Scheme During training, the framework learns to: 1. Characterize the input emotion style -> Emotion Embedding; 2. Quantify its difference from other emotion types -> Emotion Attribute Vectors; 37
  • 38. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Controlling the Emotion Mixture in Speech Given Text Inputs: • Text Encoder: Projects the linguistic information to an internal representation; • Emotion Encoder: Captures the emotion style from a reference speech; • Manually Controlled Attribute Vector: Introduces and controls the characteristics of other emotion types (see the sketch below); 38
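A toy sketch of how the manually controlled attribute vector could be combined with the reference emotion embedding at run-time; the emotion list, the concatenation-based fusion, and all shapes are illustrative assumptions rather than the exact design in the paper:

```python
# Conditioning the decoder on an emotion embedding plus a manual attribute vector.
import numpy as np

EMOTIONS = ["angry", "happy", "sad", "surprise"]

def attribute_vector(percentages):
    """percentages: dict like {"surprise": 1.0, "angry": 0.3}; missing emotions default to 0."""
    return np.array([float(percentages.get(e, 0.0)) for e in EMOTIONS])

def decoder_condition(emotion_embedding, percentages):
    """Concatenate the reference emotion embedding with the attribute vector."""
    return np.concatenate([emotion_embedding, attribute_vector(percentages)])

ref_embedding = np.random.randn(128)             # from the emotion encoder (stand-in)
cond = decoder_condition(ref_embedding, {"surprise": 1.0, "angry": 0.3})
print(cond.shape)                                 # (132,) -> fed to the seq2seq decoder
```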
  • 39. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Experiments with Three Case Studies • Case Study 1: “Delight”, “Outrage”, “Disappointment” • Case Study 2: Conflicting Emotions – “Happy” & “Sad” • Case Study 3: An Emotion Transition System 39
  • 40. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Case Study 1: “Delight”, “Outrage”, “Disappointment” - We mix “Surprise” with “Happy”, “Angry” and “Sad”; - We expect to synthesize mixed feelings closer to “Delight”, “Outrage” and “Disappointment”; 40
  • 41. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Case Study 1: “Delight”, “Outrage”, “Disappointment” (1) Analysis with Speech Emotion Recognition (SER) • Analyze the mixture of emotions with the classification probabilities derived before the last projection layer of a pre-trained SER; • The classification probabilities could capture the emotion mixtures! 41
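The idea of reading an emotion mixture off a classifier can be sketched as follows: softmax-normalize the logits of a (hypothetical) pre-trained SER and inspect how the probability mass spreads over the emotion classes; the logits below are made up purely for illustration:

```python
# Turning SER logits into class probabilities to inspect the emotion mixture.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

classes = ["angry", "happy", "sad", "surprise", "neutral"]
logits = np.array([1.8, 0.2, -0.5, 2.1, -1.0])   # e.g. output for a "Surprise + Angry" sample
probs = softmax(logits)
for c, p in sorted(zip(classes, probs), key=lambda t: -t[1]):
    print(f"{c:8s} {p:.2f}")                      # a mixed sample spreads mass over classes
```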
  • 42. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Case Study 1: “Delight”, “Outrage”, “Disappointment” (2) Acoustic Evaluation Mel-Cepstral Distortion (MCD) -> Spectrum Similarity Pearson Correlation Coefficient (PCC) -> Pitch Similarity 42
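For reference, the two metrics named above can be computed roughly as in the sketch below; frame alignment (e.g. DTW) is omitted for brevity and the feature values are random stand-ins:

```python
# MCD (spectrum similarity) and F0 Pearson correlation (pitch similarity) sketch.
import numpy as np

def mcd(mcep_a, mcep_b):
    """Mel-cepstral distortion in dB between two aligned MCEP sequences (T, D),
    excluding the 0th (energy) coefficient, as is common practice."""
    diff = mcep_a[:, 1:] - mcep_b[:, 1:]
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

def pitch_pcc(f0_a, f0_b):
    """Pearson correlation of F0 contours over frames voiced in both utterances."""
    voiced = (f0_a > 0) & (f0_b > 0)
    return float(np.corrcoef(f0_a[voiced], f0_b[voiced])[0, 1])

a, b = np.random.randn(200, 25), np.random.randn(200, 25)        # aligned MCEPs (stand-ins)
f0_a, f0_b = np.abs(np.random.randn(200)) * 120, np.abs(np.random.randn(200)) * 120
print("MCD:", mcd(a, b), "dB;  F0 PCC:", pitch_pcc(f0_a, f0_b))
```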
  • 43. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Case Study 1: “Delight”, “Outrage”, “Disappointment” (3) Listening Experiments (A) Evaluating the perception of “Angry”, “Happy” and “Sad” (B) Evaluating the perception of “Outrage”, “Delight” and “Disappointment” 43
  • 44. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Case Study 1: “Delight”, “Outrage”, “Disappointment” Speech Demos (Emotion Evaluation) Only Surprise Only Angry Mixing Surprise with Angry (Outrage) Only Surprise Only Happy Mixing Surprise with Happy (Delight) Only Surprise Only Sad Mixing Surprise with Sad (Disappointment) Note: [1] All the speech samples you hear are synthesized from text; [2] The total duration of training data is less than 1 hour. 44
  • 45. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Case Study 1: “Delight”, “Outrage”, “Disappointment” Speech Demos (Controllability Evaluation) 100% Surprise 100% Surprise + 30% Angry 100% Surprise + 60% Angry 100% Surprise + 90% Angry Note: [1] All the speech samples you hear are synthesized from text; [2] The total duration of training data is less than 1 hour. 100% Surprise 100% Surprise + 30% Happy 100% Surprise + 60% Happy 100% Surprise + 90% Happy 100% Surprise 100% Surprise + 30% Sad 100% Surprise + 60% Sad 100% Surprise + 90% Sad 45
  • 46. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Case Study 2: Conflicting Emotions – “Happy” & “Sad” - “Happy” and “Sad” are considered two conflicting emotions: e.g., opposite valence (pleasant vs. unpleasant); - “Bittersweet” describes a mixed feeling of both “Happy” and “Sad”; - Professional actors are thought to be able to deliver such feelings through action & speech; - It is a challenging task to synthesize a “Bittersweet” feeling; 46
  • 47. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis Speech Demos Note: All the speech samples you hear are synthesized from text Happy Sad Happy + Sad • Case Study 2: Conflicting Emotions – “Happy” & “Sad” 47
  • 48. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Case Study 3: An Emotion Transition System - Emotion transition aims to gradually transition from one emotion to another; - The key challenge is how to synthesize the internal states between different emotion types; - We keep the sum of the emotion percentages at 100% and adjust each percentage manually (e.g., 80% Surprise with 20% Angry), as in the sketch below; Internal Emotion States (Our Focus) An Example of an Emotion Transition System: 48
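A small sketch of the transition mechanism described above: linearly interpolate the attribute vector between two emotions while keeping the percentages summing to 100%; the emotion names and interface are illustrative assumptions:

```python
# Generating intermediate emotion states for a transition, with a fixed 100% budget.
import numpy as np

def transition(emotion_from, emotion_to, n_steps, emotions=("surprise", "angry")):
    """Yield attribute vectors moving from 100% `emotion_from` to 100% `emotion_to`."""
    for alpha in np.linspace(0.0, 1.0, n_steps):
        vec = {e: 0.0 for e in emotions}
        vec[emotion_from] = round(1.0 - alpha, 2)
        vec[emotion_to] = round(alpha, 2)
        yield vec                                 # e.g. {"surprise": 0.8, "angry": 0.2}

for v in transition("surprise", "angry", n_steps=6):
    print(v)                                      # intermediate states fed to the synthesizer
```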
  • 49. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Case Study 3: An Emotion Transition System Speech Demos - An Emotion Triangle Angry Sad Happy Note: All the speech samples you hear are synthesized from text Emotion Triangle 49
  • 50. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis • Case Study 3: An Emotion Transition System Note: All the speech samples you hear are synthesized from text https://kunzhou9646.github.io/Emotion_Triangle/ 50
  • 51. © Copyright National University of Singapore. All Rights Reserved. Mixed Emotion Modelling & Synthesis Scan here for more speech samples: Code & pre-trained models are publicly available: https://github.com/KunZhou9646/Mixed_Emotions Published Version: 51
  • 52. © Copyright National University of Singapore. All Rights Reserved. Outline • Background & Introduction • Topic 1: Seq2Seq Emotion Modelling • Topic 2: Emotion Intensity and its Control • Topic 3: Mixed Emotion Modelling and Synthesis • Conclusion & Future Work 52
  • 53. © Copyright National University of Singapore. All Rights Reserved. Conclusion • Emotional speech generation is a key technology to achieve empathic AI; • We study seq2seq emotion modelling for emotional voice conversion that improves the generalizability of emotion models; • We study emotion intensity modelling and control for emotional voice conversion; • We study mixed emotion modelling and generation, and present three case studies to validate our idea; 53
  • 54. © Copyright National University of Singapore. All Rights Reserved. Future Work • Inclusive Emotional Speech Generation - Current Studies are “Exclusive”: - Require high-quality recorded data; - Acted emotional speech data may create stereotypes of emotions; - Building an inclusive emotional speech generation framework: - Associate emotional prosody with linguistic information; - Reduce the impact of confounding factors: Confounding Factors: - Environmental Effects: Noise, Speech Overlaps, … - Individual Effects: Accents, Age, Gender, … 54
  • 55. © Copyright National University of Singapore. All Rights Reserved. Future Work • Affective Vocal Burst Synthesis Speech Emotions can be manifested in: - Para-Linguistics: Intonation, Energy, Speaking Rate, … (Our Focus) - Non-Linguistics: Vocal Bursts -> Laughs, Sighs, Grunts, Cries, … (Not Enough Attention Yet) - Linguistics: Word Selections Expressive Vocalization (ExVo) Workshops*: To Understand and Synthesize Expressive Non-Verbal Vocalizations *Baird, Alice, et al. "The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts." Proceedings of ICML 2022. 55
  • 56. © Copyright National University of Singapore. All Rights Reserved. Future Work • Ethical Study (1) “Deep Fake” Issues: - Spreading disinformation; - Defaming a politician by manipulating their emotions; (2) Privacy Issues: - Emotions are the results of human experiences; - Privacy concerns in data collection; (3) The Biases of Individuals, Languages and Cultures: - Underrepresented languages and cultures; - Individual differences in emotional expression and perception; (4) Do We Want Our Artificial Agents to be Emotional? - Influence our emotions in the ways we would prefer or avoid; - Give robots a certain personality that makes them indistinguishable from humans; - Emotional overreliance on robots can increase loneliness in real life; A. Triantafyllopoulos et al., "An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era," in Proceedings of the IEEE. 56
  • 57. © Copyright National University of Singapore. All Rights Reserved. THANK YOU • Prof. Haizhou Li (NUS & CUHK); • Prof. Berrak Sisman (UT Dallas, USA), Prof. Rajib Rana (USQ, Australia), Prof. Bjorn Schuller (Imperial College London, U.K.), Prof. Tanja Schultz (U of Bremen, Germany); Family, Friends & Colleagues @ NUS, SUTD, UT Dallas, U of Bremen; Picture Sources: Kun’s iPhone 2023 @ Dallas, USA 2023 @ Bremen, Germany 2022 @ Marina Bay, SG 2019 @ Oriental Hotel, SG 2023 @ Dallas, USA 2022 @ Bali, Indonesia 2020 @ Blue Horizon, SG 2023 @ Bremen, Germany 2022 @ Clark Quay, SG 57
  • 58. © Copyright National University of Singapore. All Rights Reserved. A little bit more: As a logo designer … Picture Source: [1] “Define AI at ICASSP 2022”, Travelogue; [2] Kun’s iPhone; 58
  • 59. © Copyright National University of Singapore. All Rights Reserved. THANK YOU 59

Editor's Notes

  1. Good afternoon. My name is Kun Zhou, a fourth-year PhD student from NUS. My PhD research is about Emotion Modelling for Speech Generation.
  2. This is the outline of this presentation, which has three major topics: seq2seq emotion modelling, emotion intensity modelling and control, and mixed emotion modelling and synthesis.
  3. My PhD work is on speech generation. Speech generation is a task to teach machines to speak. According to the input data, it can be divided into text-to-speech and voice conversion. Benefiting from deep learning models, we observe a shift from the previous statistical parametric methods to neural network-based methods. Speech generation has many applications in industry. It helps us to build conversational agents, like Apple Siri and Amazon Alexa. However, current studies face a few limitations: for example, the lack of expressivity, the need for large, high-quality training data, difficulty with personalization, and difficulty with real-time applications.
  4. My PhD research is focused on the study of emotion modelling for speech generation. Emotions are difficult to model, because human emotions are subtle and difficult to represent. We can use categories to classify emotions, or describe them in a continuous space. Speech emotion correlates with multiple acoustic features, such as voice quality, energy, intonation, and rhythm. Human emotion perception is subjective, which brings another challenge to the evaluation process. In my research, we are trying to answer two questions: 1/ how to better imitate human emotions, where we try to improve the generalizability of emotion models, and 2/ how to make the synthesized emotions more creative? We want to make the emotion models more controllable, so that we can create different emotional styles as we desire.
  5. The ultimate goal of my research is to build an AI with human empathy. An empathic AI first receives the inputs from humans or the environment, appraises the situation, and responds to humans in an appropriate way. Emotional speech generation is a key technology to generate emotional responses. It can enhance the engagement in human-machine interaction. In the following presentation, I will introduce our work on this topic.
  6. On this topic, I have published several papers in journals and conferences, like IEEE Transactions on Affective Computing, Speech Communication, INTERSPEECH, ICASSP, SLT and Speaker Odyssey.
  7. Also some co-authored papers in ASRU, INTERSPEECH and APSIPA.
  8. I chose three papers to formulate this presentation; the first one is about the generalizability, and the last two are about the controllability of emotion models.
  9. In previous frame-based methods, we model speech emotions frame by frame. When we convert the emotions, the output frame length is kept the same as the input, so the speech duration cannot be modified. But speaking rate is an important feature to express an emotion. Frame-based methods also model spectrum and prosody separately frame by frame, which causes a mismatch during inference. Compared with frame-based models, seq2seq models can be a better solution to model speech emotions. They jointly model spectrum, prosody and duration, so there is no mismatch during inference. With the attention mechanism, they can focus on the emotion-relevant regions. But seq2seq models usually need a large amount of training data, and it is difficult and expensive to collect emotional speech data. So we want to build a seq2seq model for emotion modelling, which can modify the speech duration and only requires limited emotional training data.
  10. We propose a seq2seq emotion modelling framework together with a two-stage training strategy. The first stage is style initialization: the style encoder learns speaker styles from a large neutral-speaking TTS corpus, and the ASR encoder and the text encoder predict the linguistic embedding from the audio and text separately. We use a classifier with adversarial training to help with the disentanglement between the speaker style and the linguistic content. In Stage 2, we conduct emotion training with a limited amount of emotional speech data, where the style encoder becomes the emotion encoder and learns emotional styles from the emotional speech data.
  11. We test the performance of the style encoder and the emotion encoder with some emotional data. We plot the embeddings in this figure. We can see that the emotion encoder learns a more effective emotion embedding and can form separate clusters for each emotion.
  12. We calculate the difference of duration and the mel-cepstral distortion to evaluate the performance of duration and spectrum conversion. We can see that our seq2seq models outperformed the frame-based baselines and achieve effective duration conversion performance.
  13. We also conduct listening experiments to validate our idea.
  14. In the second topic, we study how to model and control emotion intensity. Emotion intensity is the level at which an emotion can be perceived by a listener. So the intensity of an emotion is not just the loudness of a voice, but correlates with all the acoustic cues that contribute to an emotion.
  15. To model the emotion intensity, there are a few challenges. One is the lack of intensity labels. And the emotion intensity is even more subjective. It is also entangled with multiple acoustic cues such as timbre, pitch and rhythm. Previous studies use some auxiliary features, for example, a state of voiced/unvoiced/silence, attention weights or a saliency map from a pre-trained emotion recognizer. Another way is to manipulate the emotion representations, for example, through interpolation or scaling. But these methods lack interpretability and their performance is limited.
  16. Here we propose a novel framework, Emovox, for emotional voice conversion with effective intensity control. Emovox is a sequence-to-sequence conversion model that preserves the linguistic content of the source speech, transfers the emotional style of a reference speech, and also allows humans to control the output intensity. The key challenge here is how to design an intensity encoder that can accept manual intensity labels and produce interpretable intensity embeddings.
  17. We are inspired by the study of relative attributes in computer vision. Our world is not binary, and cannot always be modelled as a classification or regression problem. For example, we cannot simply tell if a person is smiling or not smiling. Because, you know, it could be a half smile. Relative attributes were proposed to solve this problem. They do not predict the presence of an attribute; instead, they predict the strength of the attribute compared with others. In this way, it can be formulated as a ranking problem to solve all these things.
  18. Inspired by relative attributes, we propose a novel formulation of emotion intensity. We assume that neutral speech does not contain any emotional variance, so its intensity should always be zero. In this way, the emotion intensity can be modelled as the relative difference from neutral samples. Given a neutral set and an emotional set, we can learn a ranking function, which needs to satisfy the following constraints. We construct an ordered set and a similar set. For the ordered set, the emotional speech samples always have higher intensity than the neutral samples. Then we can solve an SVM problem.
  19. We explain our formulation further in this figure. The left one is the traditional classifier; it aims to classify different emotions. But in our formulation, we do not seek to find a boundary. Instead, we learn a ranking, to rank all the samples given a criterion. In this way, for example, we can find the most angry one and the least angry one. When an unseen sample comes, the ranking model can automatically predict the emotion intensity with respect to other emotional speech.
  20. We further propose a sequence-to-sequence EVC framework. It enables duration modelling with the attention mechanism, and we also use the perceptual losses from a pre-trained SER to guide the emotion training. We design the intensity encoder with the relative ranking functions. We model the emotion style and intensity separately, and the decoder learns to reconstruct the emotion style from the combination of the emotion and intensity embeddings.
  21. At the run-time, the intensity can be predicted from a reference audio, or just given by humans. So we can achieve emotion intensity transfer and control at the same time.
  22. We visualize some acoustic features to validate our idea of intensity control. We first compare the speech duration with sad emotions. As we increase the intensity values, the speech becomes slower and the timbre becomes more resonant.
  23. We then compare the pitch and energy. We observe a large change in pitch and energy for angry and happy. But we don't have such observations for sad. It shows that the intensity of the sad emotion is more related to the speaking rate and the speech timbre.
  24. We ask listeners to listen to the speech samples with 3 different emotion intensities and choose the least and the most expressive one. We can see that our method with relative attributes achieves the best results in terms of intensity control.
  25. In the last topic, we study an interesting question: is it possible to synthesize a mixed emotion?
  26. In English, there is a word called “Bittersweet”, which describes a mixed feeling of both happy and sad. Previous text-to-speech studies can synthesize the human voice with different emotion types. For example, a happy voice or a sad voice. But can we synthesize a mixed feeling of both happy and sad in speech? This is the question that we would like to study in this research.
  27. Humans can experience around 34,000 different emotions. In the theory of the emotion wheel, …
  28. There are two types of methods to learn emotion information from speech. One is to associate with emotion labels; another is to imitate a reference audio. For example, in GST-Tacotron, the reference encoder learns the reference style and gives a reference embedding to the decoder. But current studies mostly generate an averaged style belonging to a specific emotion type. So they cannot synthesize mixed emotions. There are two challenges: one is …, another is … In this research, we are focusing on these two challenges and study a solution to mixed emotion modelling and synthesis.
  29. Another one is the ordinal nature of emotions. It is difficult to precisely characterize emotions. In psychology, there are two methods, … But researchers found that, for the human listener, it is more straightforward to use ordinal methods to model human perception.
  30. Inspired by the psychology studies, we propose 3 assumptions… Based on these 3 assumptions, we propose a diagram, which has three parts: 1/ a manually controlled attribute vector that can be defined by humans, where each number indicates the percentage of each emotion in the mixture. The proposed relative scheme accepts the attribute vector and generates a relative embedding. Given the relative embedding, reference speech and text input, the emotional text-to-speech model can generate speech with a mixture of emotions. Our major contribution is focused on designing the relative scheme.
  31. Our method is straightforward. We take the emotion style as an attribute of the emotional speech. We first pair each emotion and train a relative ranking function between each pair of emotions. At run-time, the trained ranking functions can automatically predict an emotion attribute. A smaller value of the emotion attribute represents a more similar style. And all the attributes form an emotion attribute vector here.
  32. We incorporate the relative scheme into a sequence-to-sequence text-to-speech framework. During the training, the framework learns two things at the same time: 1/ characterizing the input emotion styles, which results in an emotion embedding; 2/ quantifying the differences with other emotion types, which results in emotion attribute vectors. The decoder learns to reconstruct the input style from a combination of the emotion embedding and the emotion attribute vectors.
  33. At run-time, given text as the input, the text encoder projects the linguistic information to an internal representation, and the emotion encoder captures the emotion style from a reference speech. The manually controlled attribute vector further introduces the characteristics of other emotion types.
  34. We present three case studies to validate our idea.
  35. In the first case study, we would like to mix surprise with happy, angry, and sad, and we expect to synthesize mixed feelings closer to delight, outrage, and disappointment.
  36. We first perform the analysis with a pre-trained speech emotion recognition model. We analyze the mixture of emotions with the classification probabilities derived before the last projection layer. From these figures, we can see that if we increase the percentage of other emotion types in the mixture, the SER captures such variance.
  37. We further calculate the mel-cepstral distortion to measure the spectrum similarity, and the Pearson correlation coefficient to measure the pitch similarity. From these two metrics, we can also observe that the synthesized mixed emotions become more similar to the reference of the added emotion.
  38. We conduct one listening test to evaluate the perception of angry, happy, and sad, and another one to evaluate the perception of outrage, delight and disappointment. We can see that human listeners can distinguish the emotions in the mixture. And the results for mixed emotions, for example outrage, delight and disappointment, are better than those for the single emotions.
  39. We can feel that the mixed emotions keep the pattern of surprise while introducing the features of other emotion types.
  40. We can also flexibly control the percentage of each emotion in the mixture.
  41. We next present our second case study. In this case study, we are trying to mix happy and sad and synthesize a bittersweet feeling. This is a very challenging task, because happy and sad are two conflicting emotions with opposite valence. Professional actors can deliver such feelings of bittersweet through actions and speech. But it is difficult for us to synthesize and also difficult for humans to evaluate. With our proposed methods, we believe it is possible to synthesize a bittersweet feeling.
  42. The last case study is an emotion transition system. Emotion transition aims to gradually transit the emotion from one to another. The key challenge is how to synthesize the internal states between different emotion types. To achieve this, we keep the sum of the percentages of emotions at 100% and adjust each percentage manually. For example, 80% surprise with 20% angry.
  43. Here we show an emotion triangle, which can flexibly transition the emotional state among happy, sad, and angry.
  44. Here we show an emotion triangle, which can flexibly transition the emotional state among happy, sad, and angry.
  45. In the future work, there are several topics we would like to study. The first one is inclusive emotional speech generation. Current studies always require high-quality recorded speech data. We ask actors to perform the emotions, which may also create stereotypes of emotions. So we think these studies are exclusive. To build an inclusive emotional speech generation framework, we would like to use data from real life; we need to study how to associate the emotional prosody with the linguistic information, and how to reduce the impact of the confounding factors.
  46. The second research direction is affective vocal burst synthesis. In our research, we only focus on the modelling of para-linguistic features, such as intonation, energy and speaking rate. But in real life, humans also convey their emotions through non-linguistic features, which we call vocal bursts, such as laughs, sighs, grunts, and cries. However, the synthesis of vocal bursts hasn't received enough attention from the community. Recently, a workshop called the Expressive Vocalization Workshop was held at ICML to understand and synthesize expressive non-verbal vocalizations. This research topic is getting more and more attention from the community.
  47. The last one is the ethical study. It is evident that just because artificial intelligence can do something, it does not mean it should. While emotional speech generation has a huge potential to improve human-computer interaction, there are still several societal challenges that need to be carefully studied by our community. The first one is deep fake issues. With emotional speech generation techniques, we are able to manipulate the emotions of an individual, to change their mood, feelings or opinions. This can increase the risk of spreading disinformation or defaming a politician. The second one is privacy issues. Emotions are the results of human experiences. As mentioned previously, our next step is to build an emotional speech generation framework with naturalistic emotions, which raises privacy concerns in the process of data collection. Another issue is the biases of individuals, languages and cultures. In our research, we focus on English and didn't pay attention to other languages. This will introduce biases into our methods and results. The final question is whether we want our artificial agents to be emotional or not. Equipping artificial agents with emotions will give them capabilities to influence our emotions in the ways we would prefer or avoid. Furthermore, this technology will give robots a certain personality that makes them indistinguishable from humans. Emotional overreliance on robots will affect humans' social skills and make them feel more lonely in real life.
  48. I would like to take this opportunity to express my gratitude to my supervisor, Professor Haizhou Li. Professor Li hosted my Master's and PhD studies in the HLT lab. I spent an incredible five years and really grew a lot with his guidance. His hardworking spirit and scientific attitude encourage me a lot. He taught me how to be an independent researcher and offered me the opportunity to visit the US and Germany. I cannot imagine a better supervisor than him. I would also like to thank Prof Berrak Sisman from UTD, Prof Rajib Rana from USQ, Prof Bjorn Schuller from ICL and Prof. Tanja Schultz from U of Bremen for their help during my PhD. I also want to thank my friends and colleagues from NUS, SUTD, UT Dallas and U of Bremen. At last, I want to thank my dear parents for their constant love and support.
  49. Doing a PhD is also a period to discover myself. Beyond the research, as a logo designer, I designed logos for different laboratories in Singapore and Germany, and also for international conferences held in Singapore, for example, ASRU 2019 and ICASSP 2022. It gave me a sense of fulfillment and made my PhD journey much more meaningful.
  50. It is the end of this presentation. Thank you very much!