© Copyright National University of Singapore. All Rights Reserved.
Emotional Voice Conversion with
Non-parallel data
PhD candidate: Kun Zhou
Supervisor: Prof. Li Haizhou
Dept. of Electrical and Computer Engineering, National University of Singapore
Content
• Introduction
• Related Work
• My PhD Research
• Conclusion
Introduction
• Emotional voice conversion (EVC)
o Convert the emotional state from one to another (e.g. from happy to sad)
o Preserving linguistic content, speaker identity …
Figure 1: At run-time, an EVC framework converts the emotional state from one to another.
• Emotion in Speech
o Speech conveys information through:
- Linguistic aspect (what we speak)
- Para-linguistic aspect (how we speak): e.g. emotional state
Introduction
• Applications
- Conversational Agents
- Social Robots
Introduction
• Current Challenges in Emotional Voice Conversion:
o Non-parallel training;
o Limited data training;
o Emotional prosody modelling;
o Lack of controllability;
o Lack of generalizability;
Publications During PhD
• Accepted
[1] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel data”, Speaker Odyssey 2020;
[2] Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li, “Converting anyone’s emotion: towards speaker-independent emotional voice conversion”, Interspeech 2020;
[3] Kun Zhou, Berrak Sisman, Haizhou Li, “VAW-GAN for disentanglement and recomposition of emotional elements in speech”, IEEE Spoken Language Technology Workshop (SLT) 2021;
[4] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Seen and unseen emotional style transfer with a new emotional speech dataset”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021;
[5] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited data emotional voice conversion leveraging text-to-speech: two-stage sequence-to-sequence training”, Interspeech 2021;
• Submitted
[1] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Emotional voice conversion: theory, databases and ESD”, submitted to Speech Communication;
Related Work
• Training Stage
“Source Analysis → Mapping → Target Analysis”
[1] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, “World: a vocoder-based high-quality speech synthesis system for real-time applications,”
IEICE TRANSACTIONS on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
Related Work
Depending on the training data, EVC can be divided into two types:
- EVC with parallel data:
Source and target speech share the same linguistic content;
Expensive and difficult to collect!
- EVC with non-parallel data (our focus!):
Linguistic content can be different;
Easy for real-life applications!
Related Work
• Conversion Stage
“Analysis → Mapping → Synthesis”
Example: waveform generation with different vocoders:
Reference audio | Griffin-Lim | WaveRNN | Parallel WaveGAN
o We care about emotional expression;
o Emotional expression → Feature Mapping (our focus!);
o Speech quality → Waveform Generation (not our focus!);
o To get better speech quality → train a better vocoder;
Related Work
• Speaker voice conversion vs. Emotional voice conversion
o Speaker voice conversion:
- convert speaker identity;
- mainly focus on spectrum conversion;
- consider prosody as speaker-independent;
o Emotional voice conversion:
- convert emotional style;
- focus on both spectrum and prosody conversion;
- prosody (intonation, speech rate, energy, …) plays an
important role!
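These prosodic cues can be measured directly from frame-level features. As a rough numpy illustration (not taken from the papers; the function name and interface are illustrative):

```python
import numpy as np

def prosody_stats(f0, frames):
    """Utterance-level prosodic descriptors: mean/std of voiced F0 and
    mean per-frame RMS energy. f0: (T,) contour with 0 on unvoiced
    frames; frames: (T, frame_len) windowed waveform samples."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    # Per-frame root-mean-square energy.
    rms = np.sqrt(np.mean(np.asarray(frames, dtype=float) ** 2, axis=1))
    return {
        "f0_mean": float(f0[voiced].mean()),
        "f0_std": float(f0[voiced].std()),
        "energy_mean": float(rms.mean()),
    }
```

Statistics such as these are exactly what differs across emotions (e.g. a higher F0 mean and variance for happy speech), which is why EVC must convert prosody and not just the spectrum.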
My PhD Research
• CycleGAN-based EVC;
• Speaker-independent EVC;
• EVC for seen and unseen emotions;
• Limited data EVC;
We always focus on non-parallel & limited data solutions!
Themes across these works: improving generalizability, prosody modelling, duration modelling.
CycleGAN-based EVC [2]
• Motivation
o F0 is an essential part of intonation, but difficult to model:
- supra-segmental and hierarchical in nature;
- linear transformation is insufficient;
• Contribution
o A parallel-data-free emotional voice conversion framework;
o Convert spectral and prosodic features with CycleGAN;
o Modelling F0 over multiple time scales with continuous wavelet
transform (CWT);
o Investigate different training strategies: joint vs. separate training;
[2] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel
training data”, Speaker Odyssey 2020
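The linear transformation argued to be insufficient here is typically the log-Gaussian normalized F0 conversion used as a voice-conversion baseline; a minimal sketch (function name and interface are illustrative, not from [2]):

```python
import numpy as np

def linear_f0_transform(f0, src_stats, tgt_stats):
    """Log-Gaussian normalized F0 transform (a common EVC baseline).
    src_stats, tgt_stats: (mean, std) of log-F0 over voiced frames.
    Unvoiced frames (f0 == 0) are passed through unchanged."""
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    out = np.zeros_like(f0)
    # Shift/scale the log-F0 distribution from source to target emotion.
    lf0 = (np.log(f0[voiced]) - mu_s) / sd_s * sd_t + mu_t
    out[voiced] = np.exp(lf0)
    return out
```

Such a transform only matches global mean and variance; it cannot reshape the F0 contour at different temporal scales, which is the gap the CWT decomposition addresses.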
CycleGAN-based EVC [2]
- Lower scales capture the short-term variations, such as syllables and
phonemes;
- Higher scales capture the long-term variations, such as phrases and
utterances;
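A minimal multi-scale decomposition in this spirit can be sketched as a discrete approximation of the CWT with a Mexican-hat mother wavelet at dyadic scales (the exact scales and wavelet used in [2] may differ):

```python
import numpy as np

def mexican_hat(t):
    # Mexican-hat (Ricker) mother wavelet, up to a normalization constant.
    return (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def cwt_decompose(lf0, num_scales=10):
    """Decompose a continuous (interpolated) log-F0 contour into dyadic
    scales. Returns shape (num_scales, len(lf0)); lower rows capture
    short-term variation (phonemes/syllables), higher rows long-term
    variation (phrases/utterances)."""
    lf0 = np.asarray(lf0, dtype=float)
    coeffs = []
    for i in range(num_scales):
        s = 2 ** (i + 1)                      # scale in frames
        t = np.arange(-4 * s, 4 * s + 1) / s  # wavelet support
        w = mexican_hat(t) / np.sqrt(s)
        coeffs.append(np.convolve(lf0, w, mode="same"))
    return np.vstack(coeffs)
```

Each scale can then be converted by its own mapping and the contour reconstructed by summing the (re-weighted) scales.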
CycleGAN-based EVC [2]
o Training Stage:
- Spectral CycleGAN: learn the feature mapping of spectral features;
- Prosody CycleGAN: learn the feature mapping of CWT-based F0 features;
o Conversion Stage:
Spectral & Prosody CycleGAN convert the input features from source to target
emotion type;
[2] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel
training data”, Speaker Odyssey 2020
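Non-parallel training works because the CycleGAN objective never compares a converted utterance to an aligned target. A sketch of the non-adversarial terms (the λ weights are illustrative, and the adversarial terms from the discriminators are omitted):

```python
import numpy as np

def cyclegan_aux_losses(G_AB, G_BA, x_a, x_b, lam_cyc=10.0, lam_id=5.0):
    """Cycle-consistency and identity-mapping losses of a CycleGAN,
    computable from unpaired feature batches x_a (source emotion)
    and x_b (target emotion). G_AB, G_BA: generator callables."""
    # Cycle consistency: A -> B -> A should reproduce the input (L1 norm).
    l_cyc = np.mean(np.abs(G_BA(G_AB(x_a)) - x_a)) + \
            np.mean(np.abs(G_AB(G_BA(x_b)) - x_b))
    # Identity mapping: feeding target-domain features should change nothing.
    l_id = np.mean(np.abs(G_AB(x_b) - x_b)) + \
           np.mean(np.abs(G_BA(x_a) - x_a))
    return lam_cyc * l_cyc + lam_id * l_id
```

In the framework above, one such pair of generators is trained on spectral features and another on CWT-based F0 features.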
CycleGAN-based EVC [2]
• Experiments
o From the 1st XAB preference test:
CWT analysis of F0 improves emotion similarity to the target emotion type;
o From the 2nd XAB preference test:
Separate training of spectral and prosodic features outperforms joint training;
Speech samples: Source (Neutral) | Converted Angry | Converted Sad | Converted Surprise
[2] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel training data”, Speaker Odyssey 2020
Speaker-independent EVC [3]
[3] Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-independent Emotional
Voice Conversion”, Interspeech 2020
• Motivation
o Emotional expression is believed to share some common cues across individuals;
For example: Happy tends to have a higher mean and std of F0 than Sad;
o Previous EVC studies assume that emotion is speaker-dependent;
• Contribution
o Study emotion from a speaker-independent perspective (the first such study!);
o Study prosody modelling with CWT and F0 conditioning for emotion-independent encoder training;
Speaker-independent EVC [3]
• Related Work: VAW-GAN [4]
- Conditional VAE + Discriminator
• Disentangle Emotional Elements from Speech:
- Spectral features (SP) contain speaker, phonetic and prosodic information;
- Provide emotion ID and F0 to the decoder;
- The encoder learns to discard emotion-related information;
[4] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding wasserstein
generative adversarial networks,” Interspeech, 2017.
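The step "provide emotion ID and F0 to the decoder" is commonly realized by concatenating the conditions onto each latent frame; a sketch (shapes and names are illustrative, not the exact layout in [4]):

```python
import numpy as np

def condition_decoder_input(z, emotion_id, f0, n_emotions=4):
    """z: (T, D) latent frames from the encoder; f0: (T,) contour.
    Returns (T, D + n_emotions + 1): each frame is concatenated with a
    one-hot emotion ID and its F0 value, so emotion and pitch information
    reaches the decoder directly and the encoder can learn to discard it."""
    z = np.asarray(z, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    onehot = np.tile(np.eye(n_emotions)[emotion_id], (len(f0), 1))
    return np.concatenate([z, onehot, f0[:, None]], axis=1)
```

Because the decoder already receives emotion and F0 through this side channel, the reconstruction loss gives the encoder no incentive to keep that information in z.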
Speaker-independent EVC [3]
• Proposed Framework
o VAW-GAN for prosody (training and conversion stages):
- Encoder learns emotion-independent representations from CWT-F0;
- Generator learns to reconstruct prosody features conditioned on a one-hot emotion ID;
- Discriminator learns to judge whether the reconstructed features are real or not;
Speaker-independent EVC [3]
• Proposed Framework
o VAW-GAN for spectrum (training and conversion stages):
- Encoder learns emotion-independent representations from the spectrum;
- Generator learns to reconstruct spectral features conditioned on a one-hot emotion ID and F0;
- Discriminator learns to judge whether the reconstructed features are real or not;
Speaker-independent EVC [3]
• Experiments
We would like to show the effectiveness of:
- CWT analysis for prosody modelling;
- F0 conditioning for encoder training;
- Performance with seen and unseen speakers;
Evaluation:
- XAB preference test for emotion similarity;
- XAB preference test for speaker similarity;
Both validate the effectiveness of our proposed framework in both speaker-dependent and speaker-independent settings!
EVC for seen and unseen emotions [4]
• Motivation
o Current EVC frameworks represent each emotion with a one-hot emotion label:
- they learn to remember a fixed set of emotions;
- insufficient to describe emotional styles
(emotional styles present subtle differences even within the same emotion category)
• Contribution
o A one-to-many emotional style transfer framework;
o Use a pre-trained SER to describe emotional styles;
o Non-parallel training;
o Seen and unseen emotional style transfer (the first study!);
[4] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset”, IEEE ICASSP 2021
EVC for seen and unseen emotions
• Proposed Framework
• Stage I: Emotion Descriptor Training
• Stage II: Encoder-Decoder Training with VAW-GAN
• Stage III: Run-time Conversion
EVC for seen and unseen emotions
- AB preference test for speech
quality;
- XAB preference test for emotion
similarity;
- Validate the effectiveness of the proposed framework for both seen and unseen emotion conversion.
Limited data EVC [5]
[5] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-
to-Sequence Training”, Interspeech 2021
• Motivation
o Speech duration conversion has been a missing point in frame-based models;
o Sequence-to-sequence (seq2seq) methods predict duration with an attention mechanism;
o A seq2seq framework usually needs a large amount of training data
(but large amounts of emotional speech data are scarce!)
• Contribution
o A seq2seq EVC framework leveraging TTS:
- Emotional voice conversion & emotional text-to-speech;
- Requires only a limited amount of emotional speech data;
- Jointly models spectrum, prosody and duration;
- Non-parallel training;
- Many-to-many conversion;
Limited data EVC [5]
• Proposed Framework
• Two-stage Training:
- Stage I: Style Initialization: the Style Encoder learns speaker style from a large TTS corpus;
- Stage II: Emotion Training: the Style Encoder acts as an emotion encoder to learn emotional style;
Limited data EVC
Figure 3: Visualization of emotion embedding derived from (a) style encoder and (b) emotion encoder.
- Emotion embeddings derived from the emotion encoder form separate groups;
- A significant separation between {angry, happy, surprise} and {neutral, sad};
Limited data EVC
• Speech Samples*
Source | CycleGAN [2] | StarGAN [3] | Proposed | Target
Neutral-to-Angry
Neutral-to-Happy
Neutral-to-Sad
Neutral-to-Surprise
* For more speech samples: https://kunzhou9646.github.io/IS21/
[2] K. Zhou, B. Sisman, and H. Li, “Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data,” in Proc. Odyssey 2020: The Speaker and Language Recognition Workshop, 2020, pp. 230–237.
[3] G. Rizos, A. Baird, M. Elliott, and B. Schuller, “StarGAN for emotional speech conversion: Validated by data augmentation of end-to-end emotion recognition,” in Proc. IEEE ICASSP, 2020, pp. 3502–3506.
Limited data EVC
• Speech Samples* (Emotional Text-to-Speech)
Input text: “Clear than clear water”
Angry: Clear than clear water.
Happy: Clear than clear water…
Surprise: Clear than clear water?
Sad: Clear than clear water!
Limited data EVC
The proposed framework significantly outperforms the baselines in the emotion similarity evaluation.
Conclusion
o Emotional voice conversion: theory and challenges;
o Our work:
- Cycle-GAN based EVC [2]; [Speaker Odyssey 2020]
- Speaker-independent EVC [3]; [INTERSPEECH 2020]
- EVC for seen and unseen emotions [4]; [ICASSP 2021]
- Limited data EVC [5]; [INTERSPEECH 2021]
All code is publicly available!
o Future studies:
- Emotional voice conversion with emotion strength control;
- Emotion interpolations for emotional voice conversion;
- Cross-lingual representations for emotional voice conversion;
[2] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel training data”, Speaker Odyssey 2020
[3] Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-independent Emotional Voice Conversion”, Interspeech 2020
[4] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset”, IEEE ICASSP 2021
[5] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training”, Interspeech 2021
THANK YOU

RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
 
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptxBIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
 
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
 
Electric Fetus - Record Store Scavenger Hunt
Electric Fetus - Record Store Scavenger HuntElectric Fetus - Record Store Scavenger Hunt
Electric Fetus - Record Store Scavenger Hunt
 
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
 
BBR 2024 Summer Sessions Interview Training
BBR  2024 Summer Sessions Interview TrainingBBR  2024 Summer Sessions Interview Training
BBR 2024 Summer Sessions Interview Training
 
Pharmaceutics Pharmaceuticals best of brub
Pharmaceutics Pharmaceuticals best of brubPharmaceutics Pharmaceuticals best of brub
Pharmaceutics Pharmaceuticals best of brub
 
Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47
 
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptxBeyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
 
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
 
A Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two HeartsA Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two Hearts
 
Nutrition Inc FY 2024, 4 - Hour Training
Nutrition Inc FY 2024, 4 - Hour TrainingNutrition Inc FY 2024, 4 - Hour Training
Nutrition Inc FY 2024, 4 - Hour Training
 
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptxRESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
 
Wound healing PPT
Wound healing PPTWound healing PPT
Wound healing PPT
 

Oral Qualification Examination_Kun_Zhou

  • 6. Publications During PhD
  Accepted:
  [1] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel data”, Speaker Odyssey 2020;
  [2] Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li, “Converting anyone’s emotion: towards speaker-independent emotional voice conversion”, Interspeech 2020;
  [3] Kun Zhou, Berrak Sisman, Haizhou Li, “VAW-GAN for disentanglement and recomposition of emotional elements in speech”, IEEE Spoken Language Technology Workshop (SLT) 2021;
  [4] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Seen and unseen emotional style transfer with a new emotional speech dataset”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021;
  [5] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited data emotional voice conversion leveraging text-to-speech: two-stage sequence-to-sequence training”, Interspeech 2021;
  Submitted:
  [1] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Emotional voice conversion: theory, databases and ESD”, submitted to Speech Communication.
  • 7. Related Work
  • Training Stage: “Source Analysis → Mapping → Target Analysis”
  [1] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
  • 8. Related Work
  According to the training data, EVC can be divided into two types:
  - EVC with parallel data: source and target speech share the same linguistic content; expensive and difficult to collect!
  - EVC with non-parallel data (our focus!): the linguistic content is different; easy for real-life applications!
  • 9. Related Work
  • Conversion Stage: “Analysis → Mapping → Synthesis”
  Example: waveform generation with different vocoders (reference audio, Griffin-Lim, WaveRNN, Parallel WaveGAN).
  o We care about emotional expression;
  o Emotional expression depends on the feature mapping (our focus!);
  o Speech quality depends on the waveform generation (not our focus!);
  o To get better speech quality, train a better vocoder.
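To make the waveform-generation step concrete, here is a minimal sketch of the simplest vocoder named on the slide, Griffin-Lim, which recovers a waveform from a magnitude spectrogram by iteratively re-estimating phase. This is illustrative only, built on SciPy's STFT; it is not the implementation used in the experiments.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=256):
    """Recover a waveform from a magnitude spectrogram `mag` of shape
    (freq_bins, frames), as produced by scipy.signal.stft with the same
    `nperseg`, by iterative phase re-estimation (Griffin-Lim)."""
    rng = np.random.default_rng(0)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))   # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * angles, nperseg=nperseg)       # synthesize
        _, _, spec = stft(x, nperseg=nperseg)             # re-analyse
        # keep only the phase of the re-analysed signal, padded/trimmed
        # to the original frame count
        ang = np.angle(spec)
        if ang.shape[1] < mag.shape[1]:
            ang = np.pad(ang, ((0, 0), (0, mag.shape[1] - ang.shape[1])))
        angles = np.exp(1j * ang[:, :mag.shape[1]])
    _, x = istft(mag * angles, nperseg=nperseg)
    return x
```

Neural vocoders such as WaveRNN and Parallel WaveGAN replace this iterative signal-processing loop with a learned model, which is why they sound better in the demos.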
  • 10. Related Work
  • Speaker voice conversion vs. emotional voice conversion
  o Speaker voice conversion: converts the speaker identity; mainly focuses on spectrum conversion; considers prosody as speaker-independent;
  o Emotional voice conversion: converts the emotional style; focuses on both spectrum and prosody conversion; prosody (intonation, speech rate, energy, …) plays an important role!
  • 11. My PhD Research
  • CycleGAN-based EVC;
  • Speaker-independent EVC;
  • EVC for seen and unseen emotions;
  • Limited data EVC;
  We always focus on non-parallel & limited-data solutions!
  (Themes: prosody modelling, improving generalizability, duration modelling.)
  • 12. CycleGAN-based EVC [2]
  • Motivation
  o F0 is an essential part of the intonation: supra-segmental and hierarchical in nature; a linear transformation is insufficient; F0 is difficult to model!
  • Contribution
  o A parallel-data-free emotional voice conversion framework;
  o Convert spectral and prosodic features with CycleGAN;
  o Model F0 over multiple time scales with the continuous wavelet transform (CWT);
  o Investigate different training strategies: joint vs. separate training.
  [2] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel training data”, Speaker Odyssey 2020.
  • 13. CycleGAN-based EVC [2]
  - Lower scales capture the short-term variations, such as syllables and phonemes;
  - Higher scales capture the long-term variations, such as phrases and utterances.
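The multi-scale decomposition described above can be sketched as follows. This is a didactic stand-in using a Mexican-hat wavelet at ten dyadic scales; the paper's exact CWT recipe may differ.

```python
import numpy as np

def mexican_hat(half_width, scale):
    # Ricker (Mexican-hat) wavelet sampled on [-half_width, half_width]
    t = np.arange(-half_width, half_width + 1) / scale
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def cwt_f0(log_f0, num_scales=10):
    """Decompose a continuous, log-scale F0 contour into dyadic scales.
    Row k responds to F0 movements lasting roughly 2**k frames, so the
    low rows track syllable/phoneme-level detail and the high rows track
    phrase- and utterance-level trends, mirroring the slide's intuition."""
    x = log_f0 - log_f0.mean()
    rows = []
    for k in range(num_scales):
        scale = 2 ** k
        # cap the kernel so it never exceeds the contour length
        half = min(4 * scale, (len(x) - 1) // 2)
        w = mexican_hat(half, scale)
        rows.append(np.convolve(x, w, mode="same") / np.sqrt(scale))
    return np.stack(rows)   # shape: (num_scales, num_frames)
```

Each of the ten rows is then treated as a feature for the prosody conversion model.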
  • 14. CycleGAN-based EVC [2]
  o Training stage: a spectral CycleGAN learns the feature mapping of spectral features; a prosody CycleGAN learns the feature mapping of CWT-based F0 features;
  o Conversion stage: the spectral and prosody CycleGANs convert the input features from the source to the target emotion type.
  [2] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel training data”, Speaker Odyssey 2020.
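The property that lets both CycleGANs train without parallel data is the cycle-consistency loss. A minimal numeric sketch, with toy invertible "generators" standing in for the real neural networks:

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """G maps source-emotion features toward the target domain, F maps
    back. The loss penalises F(G(x)) differing from x and G(F(y)) from y,
    which constrains the mapping WITHOUT time-aligned utterance pairs.
    Illustrative only; the real generators are neural networks trained
    jointly with adversarial losses."""
    loss_fwd = np.mean(np.abs(F(G(x)) - x))   # x -> target domain -> back
    loss_bwd = np.mean(np.abs(G(F(y)) - y))   # y -> source domain -> back
    return loss_fwd + loss_bwd

# Toy linear generators that are exact inverses give zero cycle loss:
G = lambda x: 2.0 * x + 1.0
F = lambda y: (y - 1.0) / 2.0
x = np.random.default_rng(0).normal(size=(24, 100))   # e.g. 24-dim spectral features
assert cycle_consistency_loss(x, G(x), G, F) < 1e-9
```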
  • 15. CycleGAN-based EVC [2]
  - Experiments
  o From the 1st XAB preference test: CWT analysis of F0 improves the emotion similarity to the target emotion type;
  o From the 2nd XAB preference test: separate training of spectral and prosodic features outperforms joint training.
  Samples: source (neutral), converted angry, converted sad, converted surprise.
  [2] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel training data”, Speaker Odyssey 2020.
  • 16. Speaker-independent EVC [3]
  • Motivation
  o Emotional expression is believed to share some common cues across individuals; for example, happy tends to have a higher mean and standard deviation of F0 than sad;
  o Previous EVC studies assume that emotion is speaker-dependent.
  • Contribution
  o Study emotion from a speaker-independent perspective (the first such study!);
  o Study prosody modelling with CWT and F0 conditioning for emotion-independent encoder training.
  [3] Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-independent Emotional Voice Conversion”, Interspeech 2020.
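The speaker-independent cue mentioned above (happy speech showing a higher F0 mean and spread than sad speech) is an utterance-level statistic that is straightforward to compute; the sketch below uses made-up F0 values purely for illustration.

```python
import numpy as np

def f0_statistics(f0):
    """Mean and standard deviation of F0 over voiced frames only
    (unvoiced frames are conventionally marked as 0). These utterance-
    level statistics are the kind of cue that holds across speakers,
    which motivates studying emotion speaker-independently."""
    voiced = f0[f0 > 0]
    return float(voiced.mean()), float(voiced.std())

# Hypothetical contours in Hz (zeros mark unvoiced frames):
happy_f0 = np.array([0, 220, 260, 300, 0, 240], dtype=float)
sad_f0 = np.array([0, 150, 160, 155, 0, 148], dtype=float)
```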
  • 17. Speaker-independent EVC [3]
  • Related work: VAW-GAN [4], a conditional VAE followed by a discriminator.
  • Disentangling emotional elements from speech:
  - Spectral features (SP) carry speaker, phonetic and prosodic information;
  - Provide the emotion ID and F0 to the decoder;
  - The encoder learns to discard emotion-related information.
  [4] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” Interspeech, 2017.
  • 18. Speaker-independent EVC [3]
  • Proposed framework: VAW-GAN for prosody (training and conversion stages):
  - The encoder learns emotion-independent representations from CWT-F0;
  - The generator learns to reconstruct prosody features with a one-hot emotion ID;
  - The discriminator learns to judge whether the reconstructed features are real or not.
  • 19. Speaker-independent EVC [3]
  • Proposed framework: VAW-GAN for spectrum (training and conversion stages):
  - The encoder learns emotion-independent representations from the spectrum;
  - The generator learns to reconstruct spectral features with a one-hot emotion ID and F0;
  - The discriminator learns to judge whether the reconstructed features are real or not.
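The decoder conditioning described for the spectrum pipeline amounts to concatenating the emotion-independent latent code with a one-hot emotion ID and the F0 value. A minimal per-frame sketch; the latent size and five-emotion inventory here are illustrative assumptions, not the paper's:

```python
import numpy as np

def decoder_input(z, emotion_id, f0_value, num_emotions=5):
    """Build the conditioned decoder input for one frame: latent code z
    (emotion-independent), a one-hot emotion ID, and the F0 value.
    Swapping `emotion_id` at conversion time is what steers the output
    toward the target emotion."""
    one_hot = np.zeros(num_emotions)
    one_hot[emotion_id] = 1.0
    return np.concatenate([z, one_hot, [f0_value]])

# e.g. a 64-dim latent frame conditioned on emotion 2 and F0 = 5.1 (log-Hz)
v = decoder_input(np.zeros(64), emotion_id=2, f0_value=5.1)
```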
  • 20. Speaker-independent EVC [3]
  We would like to show the effectiveness of:
  - CWT analysis for prosody modelling;
  - F0 conditioning for encoder training;
  - performance with seen and unseen speakers.
  XAB preference tests for emotion similarity and for speaker similarity both validate the effectiveness of our proposed framework in both speaker-dependent and speaker-independent settings!
  • 21. EVC for seen and unseen emotions [4]
  • Motivation
  o Current EVC frameworks represent each emotion with a one-hot emotion label, which only learns to remember a fixed set of emotions and is insufficient to describe emotional styles (emotional styles present subtle differences even within the same emotion category).
  • Contribution
  o A one-to-many emotional style transfer framework with non-parallel training;
  o Use a pre-trained SER to describe emotional styles;
  o Seen and unseen emotional style transfer (the first such study!).
  [4] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset”, IEEE ICASSP 2021.
  • 22. EVC for seen and unseen emotions
  • Proposed framework:
  - Stage I: emotion descriptor training;
  - Stage II: encoder-decoder training with VAW-GAN;
  - Stage III: run-time conversion.
  • 23. EVC for seen and unseen emotions
  - AB preference test for speech quality;
  - XAB preference test for emotion similarity;
  - Both validate the effectiveness of the proposed framework for both seen and unseen emotion conversion.
  • 24. Limited data EVC [5]
  • Motivation
  o Speech duration conversion has been a missing point in frame-based models;
  o Sequence-to-sequence (seq2seq) methods predict the duration with an attention mechanism;
  o A seq2seq framework usually needs a large amount of training data, but large emotional speech datasets are lacking!
  • Contribution
  o A seq2seq EVC framework leveraging TTS: emotional voice conversion & emotional text-to-speech;
  o Requires only a limited amount of emotional speech data;
  o Jointly models spectrum, prosody and duration;
  o Non-parallel training; many-to-many conversion.
  [5] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training”, Interspeech 2021.
  • 25. Limited data EVC
  • Proposed framework: two-stage training:
  - Stage I (style initialization): the style encoder learns speaker style from a large TTS corpus;
  - Stage II (emotion training): the style encoder acts as an emotion encoder to learn the emotional style.
  • 26. Limited data EVC
  Figure 3: Visualization of emotion embeddings derived from (a) the style encoder and (b) the emotion encoder.
  - Emotion embeddings derived from the emotion encoder form separate groups;
  - There is a significant separation between angry, happy, surprise and neutral, sad.
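What the embedding visualization shows qualitatively can also be checked with a simple numeric proxy; the separation ratio below (mean inter-centroid distance over mean within-class spread) is our choice of metric for illustration, not the paper's. Well-separated emotion clusters give a value well above 1.

```python
import numpy as np

def separation_ratio(embeddings, labels):
    """Crude cluster-separation score for embeddings of shape (N, D)
    with integer labels of shape (N,): mean distance between class
    centroids divided by mean distance of points to their own centroid."""
    classes = sorted(set(labels))
    centroids = {c: embeddings[labels == c].mean(axis=0) for c in classes}
    intra = np.mean([
        np.linalg.norm(embeddings[labels == c] - centroids[c], axis=1).mean()
        for c in classes
    ])
    inter = np.mean([
        np.linalg.norm(centroids[a] - centroids[b])
        for i, a in enumerate(classes) for b in classes[i + 1:]
    ])
    return inter / intra
```

Applied to the two sets of embeddings, a higher ratio for the emotion encoder than for the style encoder would support the slide's claim.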
  • 27. Limited data EVC
  • Speech samples*: source, CycleGAN [2], StarGAN [3], our proposed, target; for neutral-to-angry, neutral-to-happy, neutral-to-sad and neutral-to-surprise.
  * For more speech samples: https://kunzhou9646.github.io/IS21/
  [2] K. Zhou, B. Sisman, and H. Li, “Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data,” in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 230–237.
  [3] G. Rizos, A. Baird, M. Elliott, and B. Schuller, “StarGAN for emotional speech conversion: Validated by data augmentation of end-to-end emotion recognition,” in ICASSP 2020, IEEE, 2020, pp. 3502–3506.
  • 28. Limited data EVC
  • Speech samples* (emotional text-to-speech). Input text: “Clear than clear water”, synthesized as angry, happy, surprise and sad.
  • 29. Limited data EVC
  The proposed framework significantly outperforms the baselines in the emotion similarity evaluation.
  • 30. Conclusion
  o Emotional voice conversion: theory and challenges;
  o Our work (all code is publicly available!):
  - CycleGAN-based EVC [2] (Speaker Odyssey 2020);
  - Speaker-independent EVC [3] (INTERSPEECH 2020);
  - EVC for seen and unseen emotions [4] (ICASSP 2021);
  - Limited data EVC [5] (INTERSPEECH 2021);
  o Future studies:
  - Emotional voice conversion with emotion strength control;
  - Emotion interpolation for emotional voice conversion;
  - Cross-lingual representations for emotional voice conversion.
  [2] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel training data”, Speaker Odyssey 2020.
  [3] Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-independent Emotional Voice Conversion”, Interspeech 2020.
  [4] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset”, IEEE ICASSP 2021.
  [5] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training”, Interspeech 2021.
  • 31. THANK YOU

Editor's Notes

  1. Good morning! I am a PhD candidate from the ECE department, supervised by Prof. Li Haizhou. I am going to present my work on emotional voice conversion with non-parallel data during my first two years.
  2. First I will give an introduction to emotional voice conversion and the related work. Next I will talk about my PhD research on this topic during these two years.
  3. As we know, speech conveys information through the linguistic aspect, which refers to what we speak, and the para-linguistic aspect, which refers to how we speak. Emotional state is a para-linguistic attribute that can even reshape the meaning and understanding of an utterance. Emotional voice conversion is a technique which aims to change the emotional state of an utterance. In the meantime, we would like to preserve the linguistic content and the speaker identity. At run-time, for example, if we give the system a happy utterance [Play demo], we can get a sad utterance with the same speaking content and the same speaker [Play demo].
  4. We can use this technology for many applications, such as social robots and conversational agents. It makes the synthesized voice more emotional and closer to the human voice.
  5. In emotional voice conversion, there are still many challenges. For example, how to train an emotional voice conversion framework with only non-parallel and limited data is a challenging topic. Besides, emotional prosody is difficult to model, and it is still hard to control the output emotion strength. Current frameworks also lack generalizability, which means that if we test the framework with unseen emotions or unseen speakers, the performance may not be as good.
  6. Here is the list of the publications during the first two years of my PhD study. They are all about emotional voice conversion. I have five conference papers published, and one journal paper submitted.
  7. Next I will introduce the related work on this topic. Previous studies on emotional voice conversion follow the pattern “analysis-mapping”, as shown in this figure. During training, imagine we have a source speech which is neutral [Demo], and a target speech which is angry [Demo]. We first extract the speech features from the source and target utterances. We mostly use the spectral features and the fundamental frequency to study spectrum and prosody. Then we train a conversion model to learn a feature mapping between the source and target features. Therefore, how to learn a feature mapping function has been the main focus of emotional voice conversion.
  8. According to the training data, EVC can be divided into two types. If the source and target speech are paired, which means only the emotion is different, we call it parallel data. But in real life, such data is expensive and difficult to collect. Therefore, our research has been focusing on how to find a feature mapping function with non-parallel data, which means we have a source speech which is neutral [Demo], but we don't have a target speech with the same content. Instead, we have a target speech with different content [Demo], and we try to learn only the difference in emotional style between source and target. Compared with parallel data, emotional voice conversion with non-parallel data is much more challenging but more suitable for real-life applications.
  9. When it comes to conversion, we have the feature mapping function learnt in the training stage. We first extract the speech features from the source speech and give these features to the feature mapping function as input. Then we get the converted features through the feature mapping. Next we need to reconstruct the speech waveform from the converted features. We call this step “waveform generation”, and the model used for this step is called a vocoder. The vocoder quality determines the quality of the output speech. But speech quality is not our focus; we care more about the emotional expression in the output speech. We want the output emotion to be more intelligible and more similar to the target emotion. Therefore, our main focus is to train a good feature mapping, since the emotional expression of the speech mostly depends on the feature mapping function. If we want better speech quality, all we need to do is train a better vocoder. Here are some speech samples generated by different vocoders: 1/ the reference audio [Demo]; 2/ one synthesized by Griffin-Lim [Demo]; 3/ one by WaveRNN [Demo]; 4/ one by Parallel WaveGAN [Demo]. From these demos, the vocoder quality determines the speech quality, but my research focuses on the emotional expression rather than the speech quality.
  10. In the speech community, speaker voice conversion is a popular and well-studied research topic. Speaker voice conversion aims to convert the speaker identity while preserving the other attributes. Compared with speaker voice conversion, emotional voice conversion can be a more challenging task. On the one hand, emotion is much more subjective and difficult to describe. On the other hand, emotional style is complex, involving multiple signal attributes. Prosody, such as intonation, speaking rate and speech energy, plays an important role in emotional expression and perception.
  11. During the first two years, I proposed and developed these four different frameworks on this topic. They all aim to tackle the current challenges we talked about in the previous slide. For all four frameworks, we always focus on finding non-parallel and limited-data solutions for emotional voice conversion.
  12. The first work I will talk about is CycleGAN-based EVC, which was published at Speaker Odyssey 2020. In this work, we focus on non-parallel training and F0 modelling. F0, the fundamental frequency, is an essential part of the intonation, varying from syllables to utterances, which makes it difficult to model. Therefore we propose to use the continuous wavelet transform to model F0 over multiple time scales, and to convert it together with the spectral features using CycleGAN. Moreover, we also investigate different training strategies to get a better performance.
  13. We use continuous wavelet transform to study F0 modelling in different time scales. As shown in this figure, we decompose the F0 into ten scales, and we assume the lower scales can capture the short-term variations, such as syllables and phonemes, and higher scales can capture the long-term variations, such as phrases and utterances.
  14. During training, we have two pipelines: the first is the spectral CycleGAN, which learns the feature mapping of spectral features; the other is the prosody CycleGAN, which learns the mapping of the CWT-F0 features given by the continuous wavelet transform. During conversion, these two models convert the spectral and F0 features from the source to the target emotion type.
  15. We further conduct two listening tests to assess the final performance. The first figure shows that, with the continuous wavelet transform, the emotion similarity is improved. The second figure shows that separate training of spectral and prosodic features outperforms joint training. Here you can listen to some samples: if we have a source speech which is neutral [demo], we can convert it to angry [demo], sad [demo], and surprise [demo].
  16. The second work I am going to talk about is speaker-independent EVC, which we published at Interspeech 2020. As we know, emotional expression shares some common cues across individuals. For example, no matter who speaks happily, the fundamental frequency usually has a higher mean and standard deviation than in sad speech. So in this work, we study the universal pattern of emotional expression across different speakers; thus we call it speaker-independent emotional voice conversion. On the technical side, we propose a VAW-GAN-based architecture to enable non-parallel training, and study prosody modelling with the continuous wavelet transform and F0 conditioning for encoder training.
  17. Our proposed framework is based on VAW-GAN. It is a conditional variational auto-encoder followed by a discriminator. Compared with a VAE, VAW-GAN can generate more realistic features. We further use VAW-GAN to study how to disentangle the emotional elements from speech. As the model input, the spectral features contain speaker information related to the speaker identity, phonetic information which is linguistic, and prosodic information. We propose to provide the emotion ID and F0 to the decoder, so the encoder learns to discard emotion-related information and the latent code becomes emotion-independent.
  18. Our framework also consists of two pipelines, one for prosody and one for spectrum. In the prosody pipeline, the encoder learns emotion-independent representations from the CWT-F0 features, and the generator learns to reconstruct prosody features with a one-hot emotion label. The discriminator learns to judge whether the reconstructed features are real or not.
  19. The spectrum pipeline is similar to the prosody one. The only difference is that we condition the generator not only on the emotion ID, but also on the F0 values.
  20. In the experiments, we would like to validate our ideas of 1/ CWT analysis for prosody modelling; 2/ F0 conditioning for encoder training; and 3/ performance with seen and unseen speakers. These two XAB preference tests show the effectiveness of our proposed framework.
  21. Next, I am going to talk about our work published at this year's ICASSP: EVC for seen and unseen emotions. Current emotional voice conversion frameworks represent each emotion with a one-hot emotion label, but such a representation only learns to remember a fixed set of emotions and may not be sufficient to describe different emotional styles. As we know, emotional styles can present subtle differences even within the same emotion category. In this work, we propose a one-to-many emotional style transfer framework. We use a pre-trained speech emotion recognition model to describe different emotional styles. Our framework can work with non-parallel data and transfer the emotional style for both seen and unseen emotions.
  22. There are three stages in our proposed framework. In the first stage, we train an SER model and use it to get deep emotional features for each utterance. Then we train the VAW-GAN framework with the deep emotional features; the decoder learns to reconstruct the spectral features from the latent representation from the encoder, F0 and the deep emotional features. In the last stage, run-time conversion, if we give the framework the deep emotional features of either a seen or an unseen emotion, it can reconstruct the speech features with the reference emotional style.
  23. We further conduct two preference tests for speech quality and emotion similarity. Both of them validate the effectiveness of our proposed framework.
  24. The last work I am going to talk about is our recent work published in this year Interspeech, limited data EVC. Previous work I talk about all convert the feature frame-by-frame, which means the speech duration always kept to be the same; but speech duration is an important factor in speech rhythm and it has been a missing point in these frame-based models. To convert the speech duration, one solution is to train a sequence-to-sequence framework, which can predict the speech duration with the attention mechanism. But to train a seq2seq framework, we need a large amount of training data, nearly tens of hours to achieve a good prediction. If the training data is not sufficient, the framework may not learn a good alignment and the final performance will be poor. But for emotional speech data, there is no such large-scale datasets. In this work, we are trying to build a seq2seq evc framework only requires a limited amount of emotional speech data, and can do both emotional voice conversion and emotional text-to-speech. Besides, it can jointly model spectrum, prosody, and duration; it can work with non-parallel training data and can do many-to-many conversion.
  25. We propose a two-stage training for limited data evc: Style initialization and emotion training. During style initialization, we leverage available large TTS corpus which is all neutral data. The style encoder learns speaker. During the emotion training, we retrain the whole framework with limited amount of emotional speech data. The style encoder becomes an emotion encoder to learn the emotional style.
26. To validate our idea of two-stage training, we visualize the emotion embeddings derived from the style encoder and the emotion encoder. From this figure, we observe that with emotion retraining, the emotion encoder generates meaningful emotion representations: each emotion forms a separate cluster, with significant separation between different emotion types.
27. We choose two state-of-the-art emotional voice conversion frameworks, CycleGAN and StarGAN, as the baselines, and conduct emotion conversion from neutral to angry, happy, sad, and surprise. Here are some speech samples. Given a source neutral speech [demo], we convert it to angry: this is CycleGAN [demo], this is StarGAN [demo], this is our proposed framework [demo], and this is the target [demo]. For neutral to happy, this is the source [demo]… From these speech samples, we can clearly hear that our proposed framework performs much better than the baselines, especially for neutral-to-surprise.
28. Our framework can also perform emotional text-to-speech. Given the input text "clear than clear water" [demo], we can synthesize an angry version [demo], sad [demo], surprise [demo], and happy [demo].
29. We further conduct listening experiments to evaluate the emotional expression. From this figure, we can observe that our framework significantly outperforms the baselines in the emotion similarity evaluation.
30. In conclusion, this presentation first introduced emotional voice conversion, its theory, and its challenges. Motivated by these challenges, we presented our works: CycleGAN-based EVC, speaker-independent EVC, EVC for seen and unseen emotions, and limited-data EVC. For all of these works, we provide a demo website, and the code is available on GitHub. As for future studies, we would like to explore the following topics: 1) emotional voice conversion with emotion strength control, which aims to control the strength of the output emotion; 2) emotion interpolation for emotional voice conversion, which aims to convert emotion on a continuous scale; and 3) cross-lingual representations for emotional voice conversion, which aims to study emotion representations shared across different languages.
31. This is the end of my presentation. Thank you for listening!