Good afternoon. My name is Kun Zhou, and I am a fourth-year PhD student at NUS. My PhD research is on emotion modelling for speech generation.
This is the outline of this presentation, which covers three major topics: seq2seq emotion modelling, emotion intensity modelling and control, and mixed emotion modelling and synthesis.
My PhD work is on speech generation.
Speech generation is the task of teaching machines to speak.
Depending on the input data, it can be divided into text-to-speech and voice conversion.
Benefiting from deep learning, we have observed a shift from the earlier statistical parametric methods to neural network-based methods.
Speech generation has many applications in industry. It helps us build conversational agents such as Apple Siri and Amazon Alexa.
However, current studies face a few limitations: for example, they lack expressivity, require large amounts of high-quality training data, are difficult to personalize, and are difficult to use in real-time applications.
My PhD research is focused on the study of emotion modelling for speech generation.
Emotions are difficult to model, because human emotions are subtle and hard to represent. We can classify emotions into categories, or describe them in a continuous space. Speech emotion also correlates with multiple acoustic features, such as voice quality, energy, intonation, and rhythm.
Human emotion perception is subjective, which brings another challenge to the evaluation process.
In my research, we try to answer two questions. First, how can we better imitate human emotions? We try to improve the generalizability of emotion models. Second, how can we make the synthesized emotions more creative? We want to make the emotion models more controllable, so that we can create different emotional styles as we desire.
The ultimate goal of my research is to build an AI with human empathy. An empathetic AI first receives inputs from humans or the environment, appraises the situation, and responds to humans in an appropriate way.
Emotional speech generation is a key technology for generating emotional responses. It can enhance engagement in human-machine interaction. In the following presentation, I will introduce our work on this topic.
On this topic, I have published several papers in journals and conferences, including IEEE Transactions on Affective Computing, Speech Communication, INTERSPEECH, ICASSP, SLT, and Speaker Odyssey.
I also have some co-authored papers in ASRU, INTERSPEECH, and APSIPA.
I chose three papers to structure this presentation: the first is about the generalizability of emotion models, and the last two are about their controllability.
Previous frame-based methods model speech emotions frame by frame. When we convert the emotions, the output frame length stays the same as the input, so the speech duration cannot be modified. But speaking rate is an important feature for expressing an emotion.
Frame-based methods also model spectrum and prosody separately, frame by frame, which causes a mismatch during inference.
Compared with frame-based models, seq2seq models can be a better solution to model speech emotions.
They jointly model spectrum, prosody, and duration, so there is no mismatch during inference. With the attention mechanism, we can focus on the emotion-relevant regions.
But seq2seq models usually need a large amount of training data, and it is difficult and expensive to collect emotional speech data.
So we want to build a seq2seq model for emotion modelling that can modify the speech duration and only requires limited emotional training data.
We propose a seq2seq emotion modelling framework together with a two-stage training strategy.
The first stage is style initialization: the style encoder learns speaker styles from a large neutral TTS corpus, while the ASR encoder and the text encoder predict the linguistic embedding from the audio and the text, respectively.
We use a classifier with adversarial training to help disentangle the speaker style from the linguistic content.
In Stage 2, we conduct emotion training with a limited amount of emotional speech data, where the style encoder becomes the emotion encoder and learns emotional styles from the emotional speech data.
We test the performance of the style encoder and the emotion encoder on some emotional data, and plot the embeddings in this figure. We can see that the emotion encoder learns a more effective emotion embedding and forms separate clusters for each emotion.
We calculate the duration difference and the mel-cepstral distortion to evaluate the performance of duration and spectrum conversion. We can see that our seq2seq model outperforms the frame-based baselines and achieves effective duration conversion.
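For reference, the mel-cepstral distortion mentioned above can be computed with the standard formula. This is a minimal sketch that assumes the two mel-cepstral sequences are already time-aligned; in practice, converted and target utterances are usually aligned first, for example with dynamic time warping.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_conv):
    """Mel-cepstral distortion in dB between two time-aligned
    mel-cepstral sequences of shape (frames, dims).
    The 0th coefficient (energy) is conventionally excluded."""
    diff = mcep_ref[:, 1:] - mcep_conv[:, 1:]
    # per-frame MCD: (10 / ln 10) * sqrt(2 * sum of squared differences)
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    return float(np.mean(const * np.sqrt(np.sum(diff ** 2, axis=1))))
```

A lower MCD indicates that the converted spectrum is closer to the target; identical sequences give exactly zero.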
We also conduct listening experiments to validate our idea.
In the second topic, we study how to model and control emotion intensity. Emotion intensity is the degree to which an emotion can be perceived by a listener. So the intensity of an emotion is not just the loudness of the voice; it correlates with all the acoustic cues that contribute to an emotion.
Modelling emotion intensity poses a few challenges. One is the lack of intensity labels. Emotion intensity is also even more subjective than emotion itself, and it is entangled with multiple acoustic cues such as timbre, pitch, and rhythm.
Previous studies use auxiliary features, for example a voiced/unvoiced/silence state, attention weights, or a saliency map from a pre-trained emotion recognizer. Another way is to manipulate the emotion representations, for example through interpolation or scaling. But these methods lack interpretability, and their performance is limited.
Here we propose a novel framework, Emovox, for emotional voice conversion with effective intensity control.
Emovox is a sequence-to-sequence conversion model that preserves the linguistic content of the source speech, transfers the emotional style of a reference speech, and also allows humans to control the output intensity.
The key challenge here is how to design an intensity encoder that can accept manual intensity labels, and produce interpretable intensity embeddings.
We are inspired by the study of relative attributes in computer vision.
Our world is not binary and cannot always be modelled as a classification or regression problem.
For example, we cannot simply tell whether a person is smiling or not, because it could be a half smile.
Relative attributes were proposed to solve this problem. Instead of predicting an attribute directly, they predict the strength of the attribute relative to other samples, which formulates the task as a ranking problem.
Inspired by relative attributes, we propose a novel formulation of emotion intensity. We assume that neutral speech does not contain any emotional variance, so its intensity should always be zero. In this way, emotion intensity can be modelled as the relative difference from neutral samples. Given a neutral set and an emotional set, we learn a ranking function that must satisfy the following constraints: we construct an ordered set, in which the emotional speech samples always have a higher intensity than the neutral samples, and a similar set, in which samples should have similar intensities. We can then solve this as an SVM problem.
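The idea above can be sketched in a toy form. This is a simplified subgradient-descent version of the RankSVM formulation, not the Newton-style solver used in the original relative-attributes work; the synthetic data and all hyperparameters are illustrative.

```python
import numpy as np

def train_ranker(ordered_pairs, similar_pairs, dim, lr=0.05, epochs=300, C=1.0):
    """Learn a linear ranking function f(x) = w.x such that for each
    ordered pair (e, n), f(e) exceeds f(n) by a margin, and for each
    similar pair (a, b), f(a) and f(b) stay close (soft hinge losses)."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=dim) * 0.01
    for _ in range(epochs):
        grad = w.copy()                    # from the 1/2 ||w||^2 regularizer
        for e, n in ordered_pairs:         # want w.e - w.n >= 1
            if w @ e - w @ n < 1.0:
                grad -= C * (e - n)
        for a, b in similar_pairs:         # want |w.a - w.b| <= 1
            d = w @ a - w @ b
            if abs(d) > 1.0:
                grad += C * np.sign(d) * (a - b)
        w -= lr * grad
    return w

# Toy demo: neutral samples cluster near the origin (intensity ~ zero),
# "angry" samples lie higher along the first feature dimension.
rng = np.random.default_rng(42)
neutral = rng.normal(0.0, 0.05, size=(8, 2))
angry = rng.normal(0.0, 0.05, size=(8, 2)) + np.array([1.0, 0.0])
ordered = [(e, n) for e in angry for n in neutral]    # emotional > neutral
similar = [(neutral[i], neutral[i + 1]) for i in range(7)]
w = train_ranker(ordered, similar, dim=2)
```

After training, every emotional sample ranks above every neutral one, and the score w.x serves as a relative intensity for unseen samples.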
We explain our formulation further in this figure. On the left is a traditional classifier, which aims to classify different emotions. In our formulation, we do not seek to find a boundary. Instead, we learn to rank all the samples given a criterion. In this way, for example, we can find the most angry sample and the least angry one. When an unseen sample comes, the ranking model can automatically predict its emotion intensity with respect to other emotional speech.
We further propose a sequence-to-sequence EVC framework. It enables duration modelling with the attention mechanism, and we use perceptual losses from a pre-trained SER to guide the emotion training. We design the intensity encoder with the relative ranking functions. We model emotion style and intensity separately, and the decoder learns to reconstruct the emotional style from the combination of the emotion and intensity embeddings.
At run-time, the intensity can be predicted from a reference audio or given directly by humans, so we can achieve emotion intensity transfer and control at the same time.
We visualize some acoustic features to validate our idea of intensity control. We first compare the speech duration for the sad emotion. As we increase the intensity value, the speech becomes slower and the timbre becomes more resonant.
We then compare the pitch and energy. We observe a large jump in pitch and energy for angry and happy, but not for sad. This shows that the intensity of the sad emotion is more related to the speaking rate and the timbre.
We ask listeners to listen to speech samples with three different emotion intensities and choose the least and the most expressive ones. We can see that our method with relative attributes achieves the best results in terms of intensity control.
In the last topic, we study an interesting question: is it possible to synthesize a mixed emotion?
In English, there is a word, "bittersweet", that describes a mixed feeling of both happiness and sadness. Previous text-to-speech studies can synthesize human voices with different emotion types, for example a happy voice or a sad voice. But can we synthesize a mixed feeling of both happy and sad in speech? This is the question we would like to study in this research.
Humans can experience around 34,000 different emotions. In the theory of the emotion wheel, …
There are two types of methods to learn emotion information from speech. One is to associate it with emotion labels; the other is to imitate a reference audio. For example, in GST-Tacotron, the reference encoder learns the reference style and gives a reference embedding to the decoder. But current studies mostly generate an averaged style belonging to a specific emotion type, so they cannot synthesize mixed emotions. There are two challenges: one is …, another is … In this research, we focus on these two challenges and study a solution for mixed emotion modelling and synthesis.
Another one is the ordinal nature of emotions, because it is difficult to precisely characterize emotions. In psychology, there are two methods, … But researchers have found that, for human listeners, it is more straightforward to model human perception with ordinal methods.
Inspired by these psychology studies, we propose three assumptions…
Based on these three assumptions, we propose a diagram with three parts. First, a manually controlled attribute vector that can be defined by humans, where each number indicates the percentage of each emotion in the mixture. Second, the proposed relative scheme, which accepts the attribute vector and generates a relative embedding. Third, given the relative embedding, the reference speech, and the text input, the emotional text-to-speech model generates speech with a mixture of emotions.
Our major contribution is the design of the relative scheme.
Our method is straightforward. We take the emotion style as an attribute of emotional speech. We first pair the emotions and train a relative ranking function for each emotion pair. At run-time, the trained ranking functions automatically predict an emotion attribute, where a smaller attribute value represents a more similar style. All the attributes together form an emotion attribute vector.
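The step from pairwise ranking scores to an attribute vector can be sketched as follows. The logistic squashing, the hand-set ranker weights, and all names here are illustrative assumptions, not the exact normalization used in the paper.

```python
import numpy as np

def attribute_vector(x, pair_rankers, target):
    """Hypothetical sketch: `pair_rankers` maps an emotion pair
    (target, other) to a linear ranking weight vector trained on
    that pair. Each ranking score w.x is squashed to (0, 1) with a
    logistic function; a value near 0 means the utterance `x`
    sounds close to the other emotion, a value near 1 means it is
    clearly the target emotion."""
    vec = {}
    for (a, b), w in pair_rankers.items():
        if a != target:
            continue
        vec[b] = float(1.0 / (1.0 + np.exp(-(w @ x))))
    return vec

# Toy usage with hand-set ranker weights for a "surprise" utterance.
rankers = {
    ("surprise", "happy"): np.array([2.0, 0.0]),
    ("surprise", "sad"): np.array([0.0, 2.0]),
}
x = np.array([1.5, -1.0])   # hypothetical utterance features
vec = attribute_vector(x, rankers, "surprise")
```

In this toy example, the utterance ranks clearly above "happy" but close to "sad", so its "sad" attribute is small, indicating a more similar style.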
We incorporate the relative scheme into a sequence-to-sequence text-to-speech framework. During training, the framework learns two things at the same time: 1) characterizing the input emotion style, which results in an emotion embedding; and 2) quantifying the differences from other emotion types, which results in an emotion attribute vector. The decoder learns to reconstruct the input style from a combination of the emotion embedding and the emotion attribute vector.
At run-time, given text as input, the text encoder projects the linguistic information into an internal representation, and the emotion encoder captures the emotion style from a reference speech. The manually controlled attribute vector further introduces the characteristics of other emotion types.
We present three case studies to validate our idea.
In the first case study, we would like to mix surprise with happy, angry, and sad, and we expect to synthesize mixed feelings closer to delight, outrage, and disappointment.
We first perform an analysis with a pre-trained speech emotion recognition model. We analyze the mixture of emotions with the classification probabilities derived before the last projection layer. From these figures, we can see that if we increase the percentage of another emotion type in the mixture, the SER can capture that variance.
We further calculate the mel-cepstral distortion to measure spectrum similarity, and the Pearson correlation coefficient to measure pitch similarity. From these two metrics, we also observe that the synthesized mixed emotions become more similar to the reference of the added emotion.
We conduct one listening test to evaluate the perception of angry, happy, and sad, and another to evaluate the perception of outrage, delight, and disappointment. We can see that human listeners can distinguish the emotions in the mixture, and the results for the mixed emotions, for example outrage, delight, and disappointment, are better than for the single emotions.
We can feel that the mixed emotions keep the pattern of surprise while introducing the features of other emotion types.
We can also flexibly control the percentage of each emotion in the mixture.
We next present our second case study, in which we try to mix happy and sad to synthesize a bittersweet feeling. This is a very challenging task, because happy and sad are two conflicting emotions with opposite valence. Professional actors can deliver such bittersweet feelings through actions and speech, but it is difficult for us to synthesize and also difficult for humans to evaluate. With our proposed method, we believe it is possible to synthesize a bittersweet feeling.
The last case study is an emotion transition system. Emotion transition aims to gradually move from one emotion to another. The key challenge is how to synthesize the internal states between different emotion types. To achieve this, we keep the sum of the emotion percentages at 100% and adjust each percentage manually, for example 80% surprise with 20% angry.
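The interpolation described above can be sketched in a few lines. The two-emotion setup and step count are illustrative; the key property is that every intermediate attribute vector still sums to 100%.

```python
import numpy as np

def transition_path(src, dst, steps):
    """Linearly interpolate between two attribute vectors whose
    entries sum to 100%, e.g. pure surprise -> pure angry. Every
    intermediate vector also sums to 100%, giving the internal
    emotion states along the transition."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    return [tuple((1.0 - t) * src + t * dst)
            for t in np.linspace(0.0, 1.0, steps)]

# From 100% surprise to 100% angry in six steps; the second step
# is the 80% surprise / 20% angry mixture mentioned above.
path = transition_path([100.0, 0.0], [0.0, 100.0], steps=6)
```

Feeding each intermediate vector to the relative scheme would then synthesize the gradual emotion transition.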
Here we show an emotion triangle, which can flexibly transition the emotion states among happy, sad, and angry.
In future work, there are several topics we would like to study. The first is inclusive emotional speech generation. Current studies always require high-quality recorded speech data: we ask actors to perform the emotions, which may also create stereotypes of emotions. So we think these studies are exclusive. To build an inclusive emotional speech generation framework, we would like to use data from real life. We need to study how to associate emotional prosody with linguistic information, and how to reduce the impact of confounding factors.
The second research direction is affective vocal burst synthesis. In our research, we only focus on modelling paralinguistic features such as intonation, energy, and speaking rate. But in real life, humans also convey their emotions through non-linguistic features, which we call vocal bursts, such as laughs, sighs, grunts, and cries. The synthesis of vocal bursts has not received enough attention from the community. Recently, the Expressive Vocalizations Workshop was held at ICML to promote the understanding and synthesis of expressive non-verbal vocalizations. This research topic is receiving more and more attention from the community.
The last one is the ethical study. It is evident that just because artificial intelligence can do something, it does not mean it should. While emotional speech generation has huge potential to improve human-computer interaction, there are still several societal challenges that need to be carefully studied by our community.
The first is the deepfake issue. With emotional speech generation techniques, we are able to manipulate the emotions of an individual, changing their mood, feelings, or opinions. This can increase the risk of spreading disinformation or defaming a politician. The second is the privacy issue. Emotions are the result of human experiences. As mentioned previously, our next step is to build an emotional speech generation framework with naturalistic emotions, which raises privacy concerns in the data collection process. Another issue is bias across individuals, languages, and cultures. In our research, we focused on English and did not pay attention to other languages, which introduces biases into our methods and results. The final question is whether we want our artificial agents to be emotional or not. Equipping artificial agents with emotions will give them the capability to influence our emotions in ways we may prefer or wish to avoid. Furthermore, this technology will give robots a certain personality that makes them indistinguishable from humans. Emotional overreliance on robots may affect humans' social skills and make them feel lonelier in real life.
I would like to take this opportunity to express my gratitude to my supervisor, Professor Haizhou Li. Professor Li hosted my Master's and PhD studies in the HLT lab. I spent an incredible five years there and really grew a lot under his guidance. His hardworking spirit and scientific attitude encouraged me greatly. He taught me how to be an independent researcher and offered me the opportunity to visit the US and Germany. I cannot imagine a better supervisor.
I would also like to thank Prof. Berrak Sisman from UTD, Prof. Rajib Rana from USQ, Prof. Bjorn Schuller from ICL, and Prof. Tanja Schultz from the University of Bremen for their help during my PhD. I also want to thank my friends and colleagues from NUS, SUTD, UT Dallas, and the University of Bremen.
At last, I want to thank my dear parents for their constant love and support.
Doing a PhD is also a period of self-discovery. Beyond research, as a logo designer, I designed logos for different laboratories in Singapore and Germany, and also for international conferences held in Singapore, for example ASRU 2019 and ICASSP 2022. It gave me a sense of fulfillment and made my PhD journey much more meaningful.
This is the end of my presentation. Thank you very much!