Seen and Unseen Emotional Style Transfer for Voice Conversion
with a New Emotional Speech Dataset
Kun Zhou1, Berrak Sisman2, Rui Liu2, Haizhou Li1
1National University of Singapore
2Singapore University of Technology and Design
Outline
1. Introduction
2. Related Work
3. Contributions
4. Proposed Framework
5. Experiments
6. Conclusions
1. Introduction
1.1 Emotional voice conversion
Emotional voice conversion aims to transfer the emotional style of an utterance
from one emotion to another, while preserving the linguistic content and speaker identity.
Applications: emotional text-to-speech, conversational agents
Compared with speaker voice conversion:
1) Speaker identity is mainly determined by voice quality;
-- speaker voice conversion mainly focuses on spectrum conversion
2) Emotion is the interplay of multiple acoustic attributes concerning both spectrum
and prosody.
-- emotional voice conversion focuses on both spectrum and prosody conversion
1.2 Current Challenges
1) Non-parallel training
Parallel data is difficult and expensive to collect in real life;
2) Emotional prosody modelling
Emotional prosody is hierarchical in nature and is perceived at the phoneme,
word, and sentence levels;
3) Unseen emotion conversion
Existing studies use discrete emotion categories, such as one-hot labels, to
remember a fixed set of emotional styles, which cannot generalize to emotions unseen in training.
Our focus: 1) non-parallel training and 3) unseen emotion conversion.
2. Related Work
2.1 Emotion representation
1) Categorical representation
Example: Ekman’s theory of six basic emotions [1],
i.e., anger, disgust, fear, happiness, sadness, and surprise
2) Dimensional representation
Example: Russell’s circumplex model [2],
i.e., arousal, valence, and dominance
[1] Paul Ekman, “An argument for basic emotions,” Cognition & Emotion, 1992.
[2] James A. Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.
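A minimal sketch contrasting the two representations; the emotion inventory ordering and the dimensional values below are illustrative choices, not values from the slides:

```python
import numpy as np

# Categorical representation: a one-hot vector over a fixed emotion inventory
# (here Ekman's six basic emotions; the ordering is our own choice).
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def one_hot(emotion: str) -> np.ndarray:
    vec = np.zeros(len(EMOTIONS))
    vec[EMOTIONS.index(emotion)] = 1.0
    return vec

# Dimensional representation: a point in a continuous arousal-valence-dominance
# space; the values below are hand-picked for illustration, not measured.
happy_categorical = one_hot("happiness")       # [0, 0, 0, 1, 0, 0]
happy_dimensional = np.array([0.7, 0.9, 0.4])  # high arousal, positive valence
```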
2.2 Emotional descriptor
1) One-hot emotion labels
2) Expert-crafted features, such as openSMILE [3]
3) Deep emotional features:
-- data-driven, less dependent on human knowledge, and more suitable for emotional style transfer
[3] Florian Eyben, Martin Wöllmer, and Björn Schuller, “openSMILE: The Munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
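As an illustration of item 2, expert-crafted features can be extracted with the opensmile Python wrapper; this sketch assumes the eGeMAPS feature set and a local file utterance.wav:

```python
# pip install opensmile
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,      # 88 expert-crafted functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("utterance.wav")        # one row of utterance-level features
print(features.shape)                                 # (1, 88)
```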
2.3 Variational Auto-encoding Wasserstein Generative Adversarial
Network (VAW-GAN)[4]
Fig. 2. An illustration of the VAW-GAN framework.
- Based on the variational auto-encoder (VAE)
- Consists of three components:
Encoder: learns a latent code from the input;
Generator: reconstructs the input from the latent code and the inference code;
Discriminator: judges whether the reconstructed input is real or not;
- First introduced for speaker voice conversion in 2017
[4] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. Interspeech, 2017.
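A schematic PyTorch sketch of the three components; layer sizes and feature dimensions are placeholders rather than the configuration used in [4] or in this paper:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):              # input features -> latent code z
    def __init__(self, feat_dim=36, z_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.mu, self.logvar = nn.Linear(64, z_dim), nn.Linear(64, z_dim)

    def forward(self, x):
        h = self.net(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        return z, mu, logvar

class Generator(nn.Module):            # (z, inference/condition code) -> features
    def __init__(self, z_dim=16, cond_dim=8, feat_dim=36):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + cond_dim, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim))

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=-1))

class Discriminator(nn.Module):        # features -> real/fake score (Wasserstein critic)
    def __init__(self, feat_dim=36):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x)
```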
3. Contributions
1. A one-to-many emotional style transfer framework;
2. Does not require parallel training data;
3. Uses a speech emotion recognition (SER) model to describe the emotional style;
4. A publicly available multi-lingual and multi-speaker emotional speech
corpus (ESD) that can be used for various speech synthesis tasks.
To the best of our knowledge, this is the first reported study on emotional style transfer
for unseen emotions.
4. Proposed Framework
1) Stage I: Emotion Descriptor Training
2) Stage II: Encoder-Decoder Training with VAW-GAN
3) Stage III: Run-time Conversion
Fig. 3. The training phase of the proposed DeepEST framework. Blue boxes represent the networks
involved in training, and red boxes represent the networks that are already trained.
1) Emotion Descriptor Training
- A pre-trained SER model is used to extract deep emotional features from the
input utterance.
- The utterance-level deep emotional features are thought to describe different
emotions in a continuous space.
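The slides do not specify the SER architecture, so the sketch below uses a hypothetical recurrent SER and mean-pools its frame-level states into an utterance-level deep emotional feature:

```python
import torch
import torch.nn as nn

class TinySER(nn.Module):
    """Hypothetical SER model standing in for the pre-trained recognizer."""
    def __init__(self, n_mels=80, hidden=128, n_emotions=4):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_emotions)   # used only for pre-training

    def deep_emotional_feature(self, mels):               # mels: (batch, frames, n_mels)
        frame_states, _ = self.rnn(mels)
        return frame_states.mean(dim=1)                   # utterance-level embedding

ser = TinySER().eval()                                    # pre-trained and frozen in Stage I
with torch.no_grad():
    emo_feat = ser.deep_emotional_feature(torch.randn(1, 200, 80))  # dummy mel input
```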
2) Encoder-Decoder Training with VAW-GAN
- The encoder learns to generate an emotion-independent latent representation z
from the input features;
- The decoder learns to reconstruct the input features from the latent representation,
the F0 contour, and the deep emotional features;
- The discriminator learns to determine whether the reconstructed features are real
or not (see the training-step sketch below).
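A simplified single training step, reusing the Encoder/Generator/Discriminator from the Section 2.3 sketch and 128-dimensional SER features from the Stage I sketch; loss weights are arbitrary, and the weight clipping or gradient penalty a Wasserstein critic normally needs is omitted for brevity:

```python
import torch
import torch.nn.functional as F

enc, gen, dis = Encoder(), Generator(cond_dim=1 + 128), Discriminator()
opt_vae = torch.optim.Adam(list(enc.parameters()) + list(gen.parameters()), lr=1e-4)
opt_dis = torch.optim.Adam(dis.parameters(), lr=1e-4)

x = torch.randn(8, 36)            # spectral features (e.g., mel-cepstra), dummy batch
f0 = torch.rand(8, 1)             # frame-averaged F0 condition (dummy)
emo = torch.randn(8, 128)         # deep emotional features from the frozen SER (dummy)

z, mu, logvar = enc(x)                         # emotion-independent latent code
x_hat = gen(z, torch.cat([f0, emo], dim=-1))   # reconstruct with F0 + emotion condition

# Critic update: Wasserstein-style real-vs-reconstructed score.
d_loss = dis(x_hat.detach()).mean() - dis(x).mean()
opt_dis.zero_grad(); d_loss.backward(); opt_dis.step()

# Encoder-decoder update: reconstruction + KL + adversarial term.
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
g_loss = F.l1_loss(x_hat, x) + 0.01 * kl - 0.001 * dis(x_hat).mean()
opt_vae.zero_grad(); g_loss.backward(); opt_vae.step()
```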
3) Run-time Conversion
- The pre-trained SER generates the deep emotional features from the reference
set;
- We then concatenate the deep emotional features with the converted F0 and the
emotion-independent z as the input to the decoder (a sketch follows).
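A run-time sketch under the same assumptions, reusing `enc` and `gen` (Stage II) and `ser` (Stage I) from the earlier sketches. The slides do not specify how F0 is converted; log-Gaussian normalization, a common choice in voice conversion, is assumed here, with illustrative log-F0 statistics:

```python
import torch

def convert_f0(f0_src, mu_src, std_src, mu_tgt, std_tgt):
    # Assumed log-Gaussian normalized F0 transformation (not from the slides).
    return torch.exp((torch.log(f0_src) - mu_src) / std_src * std_tgt + mu_tgt)

x_src = torch.randn(8, 36)                 # source spectral features (dummy)
f0_src = torch.rand(8, 1) + 0.5            # source F0 values, kept positive (dummy)

with torch.no_grad():
    emo_ref = ser.deep_emotional_feature(torch.randn(1, 200, 80)).expand(8, -1)
    z, _, _ = enc(x_src)                   # emotion-independent content code
    f0_conv = convert_f0(f0_src, 0.0, 1.0, 0.2, 1.1)
    x_conv = gen(z, torch.cat([f0_conv, emo_ref], dim=-1))  # -> vocoder for waveform
```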
5. Experiments
5.1 Emotional Speech Dataset (ESD)
1) The dataset consists of 350 parallel utterances with an average duration
of 2.9 seconds spoken by 10 native English and 10 native Mandarin
speakers;
2) For each language, the dataset consists of 5 male and 5 female speakers
in five emotions:
a) happy, b) sad, c) neutral, d) angry, and e) surprise
3) The ESD dataset is publicly available at:
https://github.com/HLTSingapore/Emotional-Speech-Data
To the best of our knowledge, this is the first parallel emotional voice
conversion dataset in a multi-lingual and multi-speaker setup.
5.2 Experimental setup
1) 300/30/20 utterances for the training/reference/test sets;
2) One universal model that takes neutral, happy, and sad utterances as input;
3) The SER is trained on a subset of IEMOCAP [5], with four emotion types (happy, angry,
sad, and neutral);
4) We conduct emotion conversion from neutral to seen emotional states (happy, sad)
and to an unseen emotional state (angry);
5) Baseline: VAW-GAN-EVC [6], only capable of one-to-one conversion and of generating
seen emotions;
Proposed: DeepEST, one-to-many conversion with both seen and unseen emotion generation.
[5] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
[6] Kun Zhou, Berrak Sisman, Mingyang Zhang, and Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-Independent Emotional Voice Conversion,” in Proc. Interspeech 2020, 2020, pp. 3416–3420.
5.3 Objective Evaluation
We calculate the Mel-cepstral distortion (MCD) [dB] to measure the spectral distortion.
Proposed DeepEST outperforms the baseline framework for seen emotion
conversion and achieves comparable results for unseen emotion conversion.
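For reference, a common formulation of MCD between time-aligned mel-cepstral sequences; the exact variant used in the paper may differ, and the exclusion of the 0th coefficient and the alignment step are assumptions here:

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_conv):
    """MCD [dB] between two aligned mel-cepstral sequences of shape (frames, dims).
    The 0th (energy) coefficient is excluded, as is common practice."""
    diff = c_ref[:, 1:] - c_conv[:, 1:]
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))))

# Dummy aligned sequences; in practice, align reference and converted frames first (e.g., DTW).
mcd = mel_cepstral_distortion(np.random.randn(100, 25), np.random.randn(100, 25))
```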
5.4 Subjective Evaluation
To evaluate speech quality: 1) MOS listening experiments; 2) AB preference tests.
To evaluate emotion similarity: 3) XAB preference tests.
The proposed DeepEST achieves competitive results for both seen and unseen emotions.
Code & Speech Samples
https://kunzhou9646.github.io/controllable-evc/
6. Conclusion
1. We propose a one-to-many emotional style transfer framework based on
VAW-GAN without the need for parallel data;
2. We propose to leverage deep emotional features from speech emotion
recognition to describe the emotional prosody in a continuous space;
3. We condition the decoder with controllable attributes such as deep emotional
features and F0 values;
4. We achieve competitive results for both seen and unseen emotions compared with
the baseline;
5. We introduce and release a new emotional speech dataset, ESD, that can be
used in speech synthesis and voice conversion.
Thank you for listening!
