NUS Presentation Title 2001
Seen and Unseen emotional style transfer for voice
conversion with a new emotional speech dataset
Kun Zhou1, Berrak Sisman2, Rui Liu2, Haizhou Li1
1National University of Singapore
2Singapore University of Technology and Design
Outline
1. Introduction
2. Related Work
3. Contributions
4. Proposed Framework
5. Experiments
6. Conclusions
1. Introduction
1.1 Emotional voice conversion
Emotional voice conversion aims to transfer the emotional style of an utterance
from one emotion to another, while preserving the linguistic content and speaker identity.
Applications: emotional text-to-speech, conversational agents
Compared with speaker voice conversion:
1) Speaker identity is mainly determined by voice quality;
-- speaker voice conversion mainly focuses on spectrum conversion
2) Emotion is the interplay of multiple acoustic attributes concerning both spectrum
and prosody.
-- emotional voice conversion focuses on both spectrum and prosody conversion
1.2 Current Challenges
1) Non-parallel training
Parallel data is difficult and expensive to collect in real life;
2) Emotional prosody modelling
Emotional prosody is hierarchical in nature, and perceived at phoneme,
word, and sentence level;
3) Unseen emotion conversion
Existing studies use discrete emotion categories, such as one-hot labels, to
represent a fixed set of emotional styles.
Our focus
2. Related Work
2.1 Emotion representation
1) Categorical representation
Example: Ekman’s six basic emotions theory [1]:
i.e., anger, disgust, fear, happiness, sadness, and surprise
2) Dimensional representation
Example: Russell’s circumplex model [2]:
i.e., arousal, valence, and dominance
[1] Paul Ekman, “An argument for basic emotions,” Cognition & emotion, 1992
[2] James A Russell, “A circumplex model of affect.,” Journal of personality and social psychology, vol. 39, no. 6, pp. 1161, 1980.
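The two representations above differ in shape: a categorical label is a one-hot vector over a fixed set, while a dimensional description is a point in a continuous space. A minimal sketch (the dimensional coordinates below are invented for illustration, not calibrated values):

```python
# Categorical representation: one-hot vector over Ekman's six basic emotions.
EKMAN_CATEGORIES = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def one_hot(emotion: str) -> list[int]:
    """Return a one-hot vector over the fixed label set."""
    return [1 if e == emotion else 0 for e in EKMAN_CATEGORIES]

# Dimensional representation: a point in arousal/valence/dominance space.
# These coordinates are made-up examples, only meant to show the format.
dimensional = {
    "happiness": {"arousal": 0.7, "valence": 0.8, "dominance": 0.6},
    "sadness":   {"arousal": 0.2, "valence": 0.1, "dominance": 0.3},
}

print(one_hot("happiness"))  # [0, 0, 0, 1, 0, 0]
```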
2.2 Emotional descriptor
1) One-hot emotion labels
2) Expert-crafted features, such as
openSMILE [3]
3) Deep emotional features:
-- data-driven, less dependent on human knowledge, more suitable for emotional
style transfer…
[3] Florian Eyben, Martin Wöllmer, and Björn Schuller, “Opensmile: the Munich versatile and fast open-source audio feature extractor,” in Proceedings of the
18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
2.3 Variational Auto-encoding Wasserstein Generative Adversarial
Network (VAW-GAN)[4]
Fig. 2. An illustration of the VAW-GAN framework.
- Based on variational auto-encoder
(VAE)
- Consists of three components:
Encoder: learns a latent code from the
input;
Generator: reconstructs the input from the
latent code and the inference code;
Discriminator: judges whether the
reconstructed input is real or not;
- First introduced for speaker voice
conversion in 2017
[4] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and HsinMin Wang, “Voice conversion from unaligned
corpora using variational autoencoding wasserstein generative adversarial networks,” in Proc. Interspeech, 2017.
3. Contributions
1. We propose a one-to-many emotional style transfer framework;
2. Our framework does not require parallel training data;
3. We use a speech emotion recognition model to describe the emotional style;
4. We release a publicly available multi-lingual and multi-speaker emotional speech
corpus (ESD) that can be used for various speech synthesis tasks.
To the best of our knowledge, this is the first reported study on emotional style transfer
for unseen emotions.
4. Proposed Framework
1) Stage I: Emotion Descriptor Training
2) Stage II: Encoder-Decoder Training with VAW-GAN
3) Stage III: Run-time Conversion
Fig. 3. The training phase of the proposed DeepEST framework. Blue boxes represent the networks
involved in training, and red boxes represent the networks that are already trained.
1) Emotion Descriptor Training
- A pre-trained SER model is used to extract deep emotional features from the
input utterance.
- The utterance-level deep emotional features are thought to describe different
emotions in a continuous space.
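One simple way to obtain an utterance-level feature from a frame-level SER model is average pooling over time; the paper's exact pooling is not restated here, so treat this as an assumption:

```python
# Sketch: collapse frame-level SER activations into one utterance-level
# "deep emotional feature" by mean pooling over the time axis.
import numpy as np

def utterance_embedding(frame_feats: np.ndarray) -> np.ndarray:
    """frame_feats: (num_frames, feat_dim) hidden activations from a SER model."""
    return frame_feats.mean(axis=0)

frames = np.random.default_rng(1).normal(size=(120, 64))  # 120 frames, 64-dim each
emb = utterance_embedding(frames)
assert emb.shape == (64,)
```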
2) Encoder-Decoder Training with VAW-GAN
- The encoder learns to generate emotion-independent latent representation z
from the input features
- The decoder learns to reconstruct the input features with latent representation,
F0 contour, and deep emotional features;
- The discriminator learns to determine whether the reconstructed features are real
or not.
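The decoder input described above is a per-frame concatenation of the latent representation, the F0 contour, and the utterance-level deep emotional feature broadcast to every frame (dimension choices below are illustrative, not the paper's):

```python
# Sketch of assembling the decoder input for Stage II.
import numpy as np

num_frames = 100
z = np.zeros((num_frames, 16))       # per-frame emotion-independent latent code
f0 = np.zeros((num_frames, 1))       # per-frame F0 value
emo = np.zeros(32)                   # one utterance-level deep emotional feature
emo_tiled = np.tile(emo, (num_frames, 1))  # repeat it for every frame

decoder_in = np.concatenate([z, f0, emo_tiled], axis=1)
assert decoder_in.shape == (num_frames, 16 + 1 + 32)
```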
3) Run-time Conversion
- The pre-trained SER generates the deep emotional features from the reference
set;
- We then concatenate the deep emotional features with the converted F0 and
emotion-independent z as the input to the decoder;
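A standard way to obtain the converted F0 mentioned above is log-Gaussian normalisation of the voiced frames; the slides do not restate the exact method, so this sketch is an assumption, not the paper's implementation:

```python
# Log-Gaussian normalised F0 conversion: map source log-F0 statistics
# (mean, std) onto the target's. Frames with F0 == 0 are unvoiced and kept at 0.
import numpy as np

def convert_f0(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """f0_src: per-frame F0 in Hz; means/stds are of log-F0 over voiced frames."""
    f0_src = np.asarray(f0_src, dtype=float)
    voiced = f0_src > 0
    out = np.zeros_like(f0_src)
    log_f0 = np.log(f0_src[voiced])
    out[voiced] = np.exp(tgt_mean + (log_f0 - src_mean) / src_std * tgt_std)
    return out

f0 = [0.0, 120.0, 130.0, 0.0, 140.0]
converted = convert_f0(f0, src_mean=np.log(128.0), src_std=0.1,
                       tgt_mean=np.log(200.0), tgt_std=0.15)
assert converted[0] == 0.0            # unvoiced frames untouched
assert converted[1] > f0[1]           # shifted toward the higher target mean
```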
5. Experiments
5.1 Emotional Speech Dataset -- ESD
1) The dataset consists of 350 parallel utterances with an average duration
of 2.9 seconds spoken by 10 native English and 10 native Mandarin
speakers;
2) For each language, the dataset consists of 5 male and 5 female speakers
in five emotions:
a) happy, b) sad, c) neutral, d) angry, and e) surprise
3) The ESD dataset is publicly available at:
https://github.com/HLTSingapore/Emotional-Speech-Data
To the best of our knowledge, this is the first parallel emotional voice
conversion dataset in a multi-lingual and multi-speaker setup.
5.2 Experimental setup
1) 300/30/20 utterances (training/reference/test sets);
2) One universal model that takes neutral, happy, and sad utterances as input;
3) The SER model is trained on a subset of IEMOCAP [5], with four emotion types (happy, angry,
sad, and neutral);
4) We conduct emotion conversion from neutral to seen emotional states (happy, sad)
and to an unseen emotional state (angry);
5) Baseline: VAW-GAN-EVC [6], which is only capable of one-to-one conversion and of generating
seen emotions;
Proposed: DeepEST, which performs one-to-many conversion and both seen and unseen emotion generation.
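The per-speaker split above can be sketched directly (utterance IDs are placeholders; ESD provides 350 parallel utterances per speaker):

```python
# 300/30/20 training/reference/test split over the 350 utterances of one speaker.
utterances = [f"utt_{i:04d}" for i in range(350)]  # placeholder IDs
train, reference, test = utterances[:300], utterances[300:330], utterances[330:]
assert (len(train), len(reference), len(test)) == (300, 30, 20)
```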
[5] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang,
Sungbok Lee, and Shrikanth S Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language
resources and evaluation, vol. 42, no. 4, pp. 335, 2008.
[6] Kun Zhou, Berrak Sisman, Mingyang Zhang, and Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-
Independent Emotional Voice Conversion,” in Proc. Interspeech 2020, 2020, pp. 3416–3420.
5.3 Objective Evaluation
We calculate Mel-cepstral distortion (MCD) [dB] to measure the spectral distortion.
Proposed DeepEST outperforms the baseline framework for seen emotion
conversion and achieves comparable results for unseen emotion conversion.
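MCD compares converted and target Mel-cepstral coefficients frame by frame; a minimal sketch using the usual definition, (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²) averaged over aligned frames (frame alignment, e.g. by DTW, is omitted here):

```python
# Mel-cepstral distortion in dB between two aligned MCEP sequences.
import numpy as np

def mcd_db(mcep_a: np.ndarray, mcep_b: np.ndarray) -> float:
    """mcep_a, mcep_b: (num_frames, num_coeffs), 0th (energy) coefficient excluded."""
    diff = mcep_a - mcep_b
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())

a = np.ones((5, 24))
b = np.zeros((5, 24))
assert mcd_db(a, a) == 0.0   # identical sequences -> zero distortion
assert mcd_db(a, b) > 0.0
```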
5.4 Subjective Evaluation
To evaluate speech quality:
1) MOS listening experiments; 2) AB preference test
To evaluate emotion similarity:
3) XAB preference test
Proposed DeepEST achieves competitive
results for both seen and unseen emotions.
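MOS results are conventionally reported as a mean over listener ratings with a 95% confidence interval; a sketch of that computation (the ratings below are invented for illustration, not results from the paper):

```python
# Mean opinion score with a normal-approximation 95% confidence interval.
import math

ratings = [4, 3, 4, 5, 3, 4, 4, 3, 5, 4]  # hypothetical 5-point listener ratings
n = len(ratings)
mean = sum(ratings) / n
var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
ci95 = 1.96 * math.sqrt(var / n)
print(f"MOS = {mean:.2f} +/- {ci95:.2f}")
```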
Code & Speech Samples
https://kunzhou9646.github.io/controllable-evc/
6. Conclusion
1. We propose a one-to-many emotional style transfer framework based on
VAW-GAN without the need for parallel data;
2. We propose to leverage deep emotional features from speech emotion
recognition to describe the emotional prosody in a continuous space;
3. We condition the decoder with controllable attributes such as deep emotional
features and F0 values;
4. We achieve competitive results for both seen and unseen emotions over the
baselines;
5. We introduce and release a new emotional speech dataset, ESD, that can be
used in speech synthesis and voice conversion.
Thank you for listening!

 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptx
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 

Emotional Voice Conversion with DeepEST Framework

  • 1. NUS Presentation Title 2001 Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset Kun Zhou1, Berrak Sisman2, Rui Liu2, Haizhou Li1 1National University of Singapore 2Singapore University of Technology and Design
  • 2. Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 3. Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 4. 1. Introduction 1.1 Emotional voice conversion Emotional voice conversion aims to transfer the emotional style of an utterance from one emotion to another, while preserving the linguistic content and speaker identity. Applications: emotional text-to-speech, conversational agents. Compared with speaker voice conversion: 1) Speaker identity is mainly determined by voice quality; -- speaker voice conversion mainly focuses on spectrum conversion. 2) Emotion is the interplay of multiple acoustic attributes concerning both spectrum and prosody; -- emotional voice conversion focuses on both spectrum and prosody conversion.
  • 5. 1. Introduction 1.2 Current Challenges 1) Non-parallel training: parallel data is difficult and expensive to collect in real life; 2) Emotional prosody modelling: emotional prosody is hierarchical in nature, and is perceived at the phoneme, word, and sentence level; 3) Unseen emotion conversion: existing studies use discrete emotion categories, such as one-hot labels, to memorize a fixed set of emotional styles. (Our focus: challenges 1 and 3)
  • 6. Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 7. 2. Related Work 2.1 Emotion representation 1) Categorical representation. Example: Ekman’s six basic emotions theory [1], i.e., anger, disgust, fear, happiness, sadness and surprise. 2) Dimensional representation. Example: Russell’s circumplex model [2], i.e., arousal, valence, and dominance. [1] Paul Ekman, “An argument for basic emotions,” Cognition & Emotion, 1992. [2] James A. Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, no. 6, pp. 1161, 1980.
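The two representations above can be sketched in a few lines. This is an illustrative toy (not from the slides): a categorical emotion becomes a one-hot vector over a fixed label set, while a dimensional emotion is a point in a continuous (arousal, valence, dominance) space; the coordinate values shown are placeholders, not measured data.

```python
# Toy contrast of categorical vs. dimensional emotion representations.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def one_hot(emotion: str) -> list[int]:
    """Categorical representation: a fixed index per emotion class."""
    return [1 if e == emotion else 0 for e in EMOTIONS]

# Dimensional representation: each emotion as continuous
# (arousal, valence, dominance) coordinates; values are illustrative only.
dimensional = {
    "happiness": (0.8, 0.9, 0.6),
    "sadness": (0.2, 0.1, 0.3),
}

print(one_hot("sadness"))  # [0, 0, 0, 0, 1, 0]
```

The one-hot form can only index emotions seen in training, which is exactly the limitation for unseen emotion conversion noted in Section 1.2; the continuous form has no such restriction.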
  • 8. 2. Related Work 2.2 Emotional descriptor 1) One-hot emotion labels; 2) Expert-crafted features, such as openSMILE [3]; 3) Deep emotional features: data-driven, less dependent on human knowledge, and more suitable for emotional style transfer. [3] Florian Eyben, Martin Wöllmer, and Björn Schuller, “openSMILE: the Munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
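A common way to obtain an utterance-level deep emotional feature is to average-pool the frame-level activations of a pretrained SER network. The sketch below, a minimal assumption-laden stand-in (the SER itself is simulated with random frame embeddings), illustrates only the pooling step:

```python
# Hedged sketch: utterance-level deep emotional feature via mean pooling
# of frame-level SER hidden states. The SER network is simulated here.
import random

def utterance_embedding(frames: list[list[float]]) -> list[float]:
    """Mean-pool frame-level features into one utterance-level vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]

# e.g. 120 frames of a 64-dim hidden layer from a pretrained SER (simulated)
frames = [[random.random() for _ in range(64)] for _ in range(120)]
emb = utterance_embedding(frames)
print(len(emb))  # 64
```

Because the pooled vector lives in a continuous space rather than a discrete label set, it can describe an emotional style the model never saw as a training category.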
  • 9. 2. Related Work 2.3 Variational Auto-encoding Wasserstein Generative Adversarial Network (VAW-GAN) [4] Fig 2. An illustration of the VAW-GAN framework. - Based on the variational auto-encoder (VAE); - Consists of three components: Encoder: learns a latent code from the input; Generator: reconstructs the input from the latent code and the inference code; Discriminator: judges whether the reconstructed input is real or not; - First introduced for speaker voice conversion in 2017. [4] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. Interspeech, 2017.
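The VAE machinery at the core of VAW-GAN can be sketched without any deep-learning framework: the encoder outputs a mean and log-variance, a latent code z is drawn with the reparameterization trick, and the generator reconstructs from z plus a conditioning (inference) code. The networks below are toy linear maps, not the authors' architecture:

```python
# Framework-free sketch of the VAE core of VAW-GAN (toy networks).
import math
import random

def encode(x):
    # Toy "encoder": derive mu and log-variance from the input frame.
    mu = [0.5 * v for v in x]
    log_var = [-1.0 for _ in x]
    return mu, log_var

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, with eps ~ N(0, 1), so sampling stays differentiable.
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def decode(z, condition):
    # Toy "generator": reconstruct from latent code + inference code.
    return [zi + ci for zi, ci in zip(z, condition)]

x = [1.0, -2.0, 0.5]
mu, log_var = encode(x)
z = reparameterize(mu, log_var)
x_hat = decode(z, condition=[0.1, 0.1, 0.1])
print(len(x_hat))  # 3
```

In the full VAW-GAN the reconstruction loss is augmented with a Wasserstein adversarial loss from the discriminator, which pushes the generator toward more realistic spectra than a plain VAE produces.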
  • 10. Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 11. 3. Contributions 1. A one-to-many emotional style transfer framework; 2. Does not require parallel training data; 3. Uses a speech emotion recognition model to describe the emotional style; 4. A publicly available multi-lingual and multi-speaker emotional speech corpus (ESD) that can be used for various speech synthesis tasks. To the best of our knowledge, this is the first reported study on emotional style transfer for unseen emotions.
  • 12. Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 13. 4. Proposed Framework 1) Stage I: Emotion Descriptor Training 2) Stage II: Encoder-Decoder Training with VAW-GAN 3) Stage III: Run-time Conversion Fig. 3. The training phase of the proposed DeepEST framework. Blue boxes represent the networks involved in the training, and red boxes represent the networks that are already trained.
  • 14. 4. Proposed Framework Fig. 3. The training phase of the proposed DeepEST framework. Blue boxes represent the networks involved in the training, and red boxes represent the networks that are already trained. 1) Emotion Descriptor Training - A pre-trained SER model is used to extract deep emotional features from the input utterance. - The utterance-level deep emotional features are expected to describe different emotions in a continuous space.
  • 15. 4. Proposed Framework Fig. 3. The training phase of the proposed DeepEST framework. Blue boxes represent the networks involved in the training, and red boxes represent the networks that are already trained. 2) Encoder-Decoder Training with VAW-GAN - The encoder learns to generate an emotion-independent latent representation z from the input features; - The decoder learns to reconstruct the input features from the latent representation, the F0 contour, and the deep emotional features; - The discriminator learns to determine whether the reconstructed features are real or not.
  • 16. 4. Proposed Framework Fig. 3. The training phase of the proposed DeepEST framework. Blue boxes represent the networks involved in the training, and red boxes represent the networks that are already trained. 3) Run-time Conversion - The pre-trained SER generates the deep emotional features from the reference set; - We then concatenate the deep emotional features with the converted F0 and the emotion-independent z as the input to the decoder.
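The run-time conditioning step above reduces to a concatenation of three streams before the decoder. A minimal sketch, with illustrative (not the paper's) dimensions:

```python
# Sketch of the run-time decoder conditioning: concatenate the
# emotion-independent latent code z, the converted F0, and the
# utterance-level deep emotional feature. Dimensions are illustrative.

def decoder_input(z, f0, deep_emotional_feature):
    """Concatenate the three conditioning streams for one frame."""
    return z + f0 + deep_emotional_feature

z = [0.1] * 8      # emotion-independent latent code
f0 = [120.0]       # converted F0 value for this frame
emo = [0.3] * 4    # utterance-level deep emotional feature

print(len(decoder_input(z, f0, emo)))  # 13
```

Swapping in the deep emotional feature of any reference utterance, seen or unseen during training, is what makes the decoder steer toward that emotion.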
  • 17. Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 18. 5. Experiments 5.1 Emotional Speech Dataset -- ESD 1) The dataset consists of 350 parallel utterances with an average duration of 2.9 seconds, spoken by 10 native English and 10 native Mandarin speakers; 2) For each language, the dataset covers 5 male and 5 female speakers in five emotions: a) happy, b) sad, c) neutral, d) angry, and e) surprise; 3) The ESD dataset is publicly available at: https://github.com/HLTSingapore/Emotional-Speech-Data To the best of our knowledge, this is the first parallel emotional voice conversion dataset in a multi-lingual and multi-speaker setup.
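Reading the statistics above as 350 parallel utterances per speaker, each recorded in all five emotions, gives a quick sense of the corpus size (this reading is an assumption from the slide, not an explicit statement on it):

```python
# Back-of-envelope size of ESD under the stated reading:
# 350 utterances per speaker per emotion, 20 speakers, 5 emotions.
utterances_per_speaker_per_emotion = 350
speakers = 10 + 10  # 10 English + 10 Mandarin
emotions = 5

total = utterances_per_speaker_per_emotion * speakers * emotions
print(total)  # 35000
```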
  • 19. 5. Experiments 5.2 Experimental setup 1) 300-30-20 utterances (training-reference-test split); 2) One universal model that takes neutral, happy and sad utterances as input; 3) The SER is trained on a subset of IEMOCAP [5], with four emotion types (happy, angry, sad and neutral); 4) We conduct emotion conversion from neutral both to seen emotional states (happy, sad) and to an unseen emotional state (angry); 5) Baseline: VAW-GAN-EVC [6], only capable of one-to-one conversion and of generating seen emotions; Proposed: DeepEST, one-to-many conversion with seen and unseen emotion generation. [5] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, pp. 335, 2008. [6] Kun Zhou, Berrak Sisman, Mingyang Zhang, and Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-Independent Emotional Voice Conversion,” in Proc. Interspeech 2020, 2020, pp. 3416–3420.
  • 20. 5. Experiments 5.3 Objective Evaluation We calculate Mel-cepstral distortion (MCD) [dB] to measure the spectral distortion. The proposed DeepEST outperforms the baseline framework for seen emotion conversion and achieves comparable results for unseen emotion conversion.
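The MCD used for this objective evaluation is a standard distortion measure between converted and target Mel-cepstral coefficient (MCC) frames. The sketch below implements the common definition, with the 0th (energy) coefficient excluded as is typical; it is the textbook formula, not code from the paper:

```python
# Standard per-frame Mel-cepstral distortion:
# MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), d >= 1.
import math

def mcd_db(mcc_converted, mcc_target):
    """MCD in dB for one pair of MCC frames (0th coefficient excluded)."""
    sq = sum((c - t) ** 2
             for c, t in zip(mcc_converted[1:], mcc_target[1:]))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * sq)

# Identical frames -> 0 dB distortion.
print(mcd_db([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]))  # 0.0
```

In practice the per-frame values are averaged over time-aligned frames of the whole test set; a lower average MCD indicates smaller spectral distortion.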
  • 21. 5. Experiments 5.4 Subjective Evaluation To evaluate speech quality: 1) MOS listening experiments; 2) AB preference test. To evaluate emotion similarity: 3) XAB preference test. The proposed DeepEST achieves competitive results for both seen and unseen emotions.
  • 22. 5. Experiments Codes & Speech Samples https://kunzhou9646.github.io/controllable-evc/
  • 23. Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 24. 6. Conclusions 1. We propose a one-to-many emotional style transfer framework based on VAW-GAN without the need for parallel data; 2. We leverage deep emotional features from speech emotion recognition to describe the emotional prosody in a continuous space; 3. We condition the decoder on controllable attributes such as deep emotional features and F0 values; 4. We achieve competitive results for both seen and unseen emotions over the baseline; 5. We introduce and release a new emotional speech dataset, ESD, which can be used for speech synthesis and voice conversion.
  • 25. Thank you for listening!