Good afternoon. My name is Kun Zhou, and I am a fourth-year PhD student at NUS. My PhD research is on emotion modelling for speech generation.
This is the outline of this presentation, which covers three major topics: seq2seq emotion modelling, emotion intensity modelling and control, and mixed emotion modelling and synthesis.
My PhD work is on speech generation.
Speech generation is the task of teaching machines to speak.
Depending on the input data, it can be divided into text-to-speech and voice conversion.
Benefiting from deep learning, we have observed a shift from the earlier statistical parametric methods to neural network-based methods.
Speech generation has many applications in industry. It helps us build conversational agents such as Apple Siri and Amazon Alexa.
However, current studies face a few limitations: for example, they lack expressivity, require large amounts of high-quality training data, are difficult to personalize, and are difficult to use in real-time applications.
My PhD research is focused on the study of emotion modelling for speech generation.
Emotions are difficult to model, because human emotions are subtle and hard to represent. We can classify emotions into categories, or describe them in a continuous space. Speech emotion also correlates with multiple acoustic features, such as voice quality, energy, intonation, and rhythm.
Human emotion perception is subjective, which brings another challenge to the evaluation process.
In my research, we try to answer two questions. First, how can we better imitate human emotions? We try to improve the generalizability of emotion models. Second, how can we make the synthesized emotions more creative? We want to make the emotion models more controllable, so that we can create different emotional styles as we desire.
The ultimate goal of my research is to build an AI with human empathy. An empathetic AI first receives inputs from humans or the environment, appraises the situation, and responds to humans in an appropriate way.
Emotional speech generation is a key technology for generating emotional responses. It can enhance engagement in human-machine interaction. In the following presentation, I will introduce our work on this topic.
On this topic, I have published several papers in journals and conferences, including IEEE Transactions on Affective Computing, Speech Communication, INTERSPEECH, ICASSP, SLT, and Speaker Odyssey.
I also have some co-authored papers in ASRU, INTERSPEECH, and APSIPA.
I chose three papers to structure this presentation: the first is about the generalizability of emotion models, and the last two are about their controllability.
Previous frame-based methods model speech emotions frame by frame. When we convert the emotions, the output frame length stays the same as the input, so the speech duration cannot be modified. But speaking rate is an important feature for expressing an emotion.
Frame-based methods also model spectrum and prosody separately, frame by frame, which causes a mismatch during inference.
Compared with frame-based models, seq2seq models can be a better solution to model speech emotions.
They jointly model spectrum, prosody, and duration, so there is no mismatch during inference. With the attention mechanism, we can focus on the emotion-relevant regions.
But seq2seq models usually need a large amount of training data, and it is difficult and expensive to collect emotional speech data.
So we want to build a seq2seq model for emotion modelling that can modify the speech duration and only requires limited emotional training data.
We propose a seq2seq emotion modelling framework together with a two-stage training strategy.
The first stage is style initialization: the style encoder learns speaker styles from a large neutral TTS corpus, while the ASR encoder and the text encoder predict the linguistic embedding from the audio and the text, respectively.
We use a classifier with adversarial training to help disentangle the speaker style from the linguistic content.
In Stage 2, we conduct emotion training with a limited amount of emotional speech data, where the style encoder becomes the emotion encoder and learns emotional styles from the emotional speech data.
We test the performance of the style encoder and the emotion encoder on some emotional data, and plot the embeddings in this figure. We can see that the emotion encoder learns a more effective emotion embedding and forms separate clusters for each emotion.
We calculate the duration difference and the mel-cepstral distortion to evaluate the performance of duration and spectrum conversion. We can see that our seq2seq model outperforms the frame-based baselines and achieves effective duration conversion.
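For reference, the mel-cepstral distortion mentioned above can be computed with the standard formula. This is a minimal sketch that assumes the two mel-cepstral sequences are already time-aligned; in practice, converted and target utterances are usually aligned first, for example with dynamic time warping.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_conv):
    """Mel-cepstral distortion in dB between two time-aligned
    mel-cepstral sequences of shape (frames, dims).
    The 0th coefficient (energy) is conventionally excluded."""
    diff = mcep_ref[:, 1:] - mcep_conv[:, 1:]
    # per-frame MCD: (10 / ln 10) * sqrt(2 * sum of squared differences)
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    return float(np.mean(const * np.sqrt(np.sum(diff ** 2, axis=1))))
```

A lower MCD indicates that the converted spectrum is closer to the target; identical sequences give exactly zero.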
We also conduct listening experiments to validate our idea.
In the second topic, we study how to model and control emotion intensity. Emotion intensity is the degree to which an emotion can be perceived by a listener. So the intensity of an emotion is not just the loudness of the voice; it correlates with all the acoustic cues that contribute to an emotion.
Modelling emotion intensity poses a few challenges. One is the lack of intensity labels. Emotion intensity is also even more subjective than emotion itself, and it is entangled with multiple acoustic cues such as timbre, pitch, and rhythm.
Previous studies use auxiliary features, for example a voiced/unvoiced/silence state, attention weights, or a saliency map from a pre-trained emotion recognizer. Another way is to manipulate the emotion representations, for example through interpolation or scaling. But these methods lack interpretability, and their performance is limited.
Here we propose a novel framework, Emovox, for emotional voice conversion with effective intensity control.
Emovox is a sequence-to-sequence conversion model that preserves the linguistic content of the source speech, transfers the emotional style of a reference speech, and also allows humans to control the output intensity.
The key challenge here is how to design an intensity encoder that can accept manual intensity labels, and produce interpretable intensity embeddings.
We are inspired by the study of relative attributes in computer vision.
Our world is not binary and cannot always be modelled as a classification or regression problem.
For example, we cannot simply tell whether a person is smiling or not, because it could be a half smile.
Relative attributes were proposed to solve this problem. Instead of predicting an attribute directly, they predict the strength of the attribute relative to other samples, which formulates the task as a ranking problem.
Inspired by relative attributes, we propose a novel formulation of emotion intensity. We assume that neutral speech does not contain any emotional variance, so its intensity should always be zero. In this way, emotion intensity can be modelled as the relative difference from neutral samples. Given a neutral set and an emotional set, we learn a ranking function that must satisfy the following constraints: we construct an ordered set, in which the emotional speech samples always have a higher intensity than the neutral samples, and a similar set, in which samples should have similar intensities. We can then solve this as an SVM problem.
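The idea above can be sketched in a toy form. This is a simplified subgradient-descent version of the RankSVM formulation, not the Newton-style solver used in the original relative-attributes work; the synthetic data and all hyperparameters are illustrative.

```python
import numpy as np

def train_ranker(ordered_pairs, similar_pairs, dim, lr=0.05, epochs=300, C=1.0):
    """Learn a linear ranking function f(x) = w.x such that for each
    ordered pair (e, n), f(e) exceeds f(n) by a margin, and for each
    similar pair (a, b), f(a) and f(b) stay close (soft hinge losses)."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=dim) * 0.01
    for _ in range(epochs):
        grad = w.copy()                    # from the 1/2 ||w||^2 regularizer
        for e, n in ordered_pairs:         # want w.e - w.n >= 1
            if w @ e - w @ n < 1.0:
                grad -= C * (e - n)
        for a, b in similar_pairs:         # want |w.a - w.b| <= 1
            d = w @ a - w @ b
            if abs(d) > 1.0:
                grad += C * np.sign(d) * (a - b)
        w -= lr * grad
    return w

# Toy demo: neutral samples cluster near the origin (intensity ~ zero),
# "angry" samples lie higher along the first feature dimension.
rng = np.random.default_rng(42)
neutral = rng.normal(0.0, 0.05, size=(8, 2))
angry = rng.normal(0.0, 0.05, size=(8, 2)) + np.array([1.0, 0.0])
ordered = [(e, n) for e in angry for n in neutral]    # emotional > neutral
similar = [(neutral[i], neutral[i + 1]) for i in range(7)]
w = train_ranker(ordered, similar, dim=2)
```

After training, every emotional sample ranks above every neutral one, and the score w.x serves as a relative intensity for unseen samples.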
We explain our formulation further in this figure. On the left is a traditional classifier, which aims to classify different emotions. In our formulation, we do not seek to find a boundary. Instead, we learn to rank all the samples given a criterion. In this way, for example, we can find the most angry sample and the least angry one. When an unseen sample comes, the ranking model can automatically predict its emotion intensity with respect to other emotional speech.
We further propose a sequence-to-sequence EVC framework. It enables duration modelling with the attention mechanism, and we use perceptual losses from a pre-trained SER to guide the emotion training. We design the intensity encoder with the relative ranking functions. We model emotion style and intensity separately, and the decoder learns to reconstruct the emotional style from the combination of the emotion and intensity embeddings.
At run-time, the intensity can be predicted from a reference audio or given directly by humans, so we can achieve emotion intensity transfer and control at the same time.
We visualize some acoustic features to validate our idea of intensity control. We first compare the speech duration for the sad emotion. As we increase the intensity value, the speech becomes slower and the timbre becomes more resonant.
We then compare the pitch and energy. We observe a large jump in pitch and energy for angry and happy, but not for sad. This shows that the intensity of the sad emotion is more related to the speaking rate and the timbre.
We ask listeners to listen to speech samples with three different emotion intensities and choose the least and the most expressive ones. We can see that our method with relative attributes achieves the best results in terms of intensity control.
In the last topic, we study an interesting question: is it possible to synthesize a mixed emotion?
In English, there is a word, "bittersweet", that describes a mixed feeling of both happiness and sadness. Previous text-to-speech studies can synthesize human voices with different emotion types, for example a happy voice or a sad voice. But can we synthesize a mixed feeling of both happy and sad in speech? This is the question we would like to study in this research.
Humans can experience around 34,000 different emotions. In the theory of the emotion wheel, …
There are two types of methods to learn emotion information from speech. One is to associate it with emotion labels; the other is to imitate a reference audio. For example, in GST-Tacotron, the reference encoder learns the reference style and gives a reference embedding to the decoder. But current studies mostly generate an averaged style belonging to a specific emotion type, so they cannot synthesize mixed emotions. There are two challenges: one is …, another is … In this research, we focus on these two challenges and study a solution for mixed emotion modelling and synthesis.
Another one is the ordinal nature of emotions, because it is difficult to precisely characterize emotions. In psychology, there are two methods, … But researchers have found that, for human listeners, it is more straightforward to model human perception with ordinal methods.
Inspired by these psychology studies, we propose three assumptions…
Based on these three assumptions, we propose a diagram with three parts. First, a manually controlled attribute vector that can be defined by humans, where each number indicates the percentage of each emotion in the mixture. Second, the proposed relative scheme, which accepts the attribute vector and generates a relative embedding. Third, given the relative embedding, the reference speech, and the text input, the emotional text-to-speech model generates speech with a mixture of emotions.
Our major contribution is the design of the relative scheme.
Our method is straightforward. We take the emotion style as an attribute of emotional speech. We first pair the emotions and train a relative ranking function for each emotion pair. At run-time, the trained ranking functions automatically predict an emotion attribute, where a smaller attribute value represents a more similar style. All the attributes together form an emotion attribute vector.
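The step from pairwise ranking scores to an attribute vector can be sketched as follows. The logistic squashing, the hand-set ranker weights, and all names here are illustrative assumptions, not the exact normalization used in the paper.

```python
import numpy as np

def attribute_vector(x, pair_rankers, target):
    """Hypothetical sketch: `pair_rankers` maps an emotion pair
    (target, other) to a linear ranking weight vector trained on
    that pair. Each ranking score w.x is squashed to (0, 1) with a
    logistic function; a value near 0 means the utterance `x`
    sounds close to the other emotion, a value near 1 means it is
    clearly the target emotion."""
    vec = {}
    for (a, b), w in pair_rankers.items():
        if a != target:
            continue
        vec[b] = float(1.0 / (1.0 + np.exp(-(w @ x))))
    return vec

# Toy usage with hand-set ranker weights for a "surprise" utterance.
rankers = {
    ("surprise", "happy"): np.array([2.0, 0.0]),
    ("surprise", "sad"): np.array([0.0, 2.0]),
}
x = np.array([1.5, -1.0])   # hypothetical utterance features
vec = attribute_vector(x, rankers, "surprise")
```

In this toy example, the utterance ranks clearly above "happy" but close to "sad", so its "sad" attribute is small, indicating a more similar style.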
We incorporate the relative scheme into a sequence-to-sequence text-to-speech framework. During training, the framework learns two things at the same time: 1) characterizing the input emotion style, which results in an emotion embedding; and 2) quantifying the differences from other emotion types, which results in an emotion attribute vector. The decoder learns to reconstruct the input style from a combination of the emotion embedding and the emotion attribute vector.
At run-time, given text as input, the text encoder projects the linguistic information into an internal representation, and the emotion encoder captures the emotion style from a reference speech. The manually controlled attribute vector further introduces the characteristics of other emotion types.
We present three case studies to validate our idea.
In the first case study, we would like to mix surprise with happy, angry, and sad, and we expect to synthesize mixed feelings closer to delight, outrage, and disappointment.
We first perform an analysis with a pre-trained speech emotion recognition model. We analyze the mixture of emotions with the classification probabilities derived before the last projection layer. From these figures, we can see that if we increase the percentage of another emotion type in the mixture, the SER can capture that variance.
We further calculate the mel-cepstral distortion to measure spectrum similarity, and the Pearson correlation coefficient to measure pitch similarity. From these two metrics, we also observe that the synthesized mixed emotions become more similar to the reference of the added emotion.
We conduct one listening test to evaluate the perception of angry, happy, and sad, and another to evaluate the perception of outrage, delight, and disappointment. We can see that human listeners can distinguish the emotions in the mixture, and the results for the mixed emotions, for example outrage, delight, and disappointment, are better than for the single emotions.
We can feel that the mixed emotions keep the pattern of surprise while introducing the features of other emotion types.
We can also flexibly control the percentage of each emotion in the mixture.
We next present our second case study, in which we try to mix happy and sad to synthesize a bittersweet feeling. This is a very challenging task, because happy and sad are two conflicting emotions with opposite valence. Professional actors can deliver such bittersweet feelings through actions and speech, but it is difficult for us to synthesize and also difficult for humans to evaluate. With our proposed method, we believe it is possible to synthesize a bittersweet feeling.
The last case study is an emotion transition system. Emotion transition aims to gradually move from one emotion to another. The key challenge is how to synthesize the internal states between different emotion types. To achieve this, we keep the sum of the emotion percentages at 100% and adjust each percentage manually, for example 80% surprise with 20% angry.
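The interpolation described above can be sketched in a few lines. The two-emotion setup and step count are illustrative; the key property is that every intermediate attribute vector still sums to 100%.

```python
import numpy as np

def transition_path(src, dst, steps):
    """Linearly interpolate between two attribute vectors whose
    entries sum to 100%, e.g. pure surprise -> pure angry. Every
    intermediate vector also sums to 100%, giving the internal
    emotion states along the transition."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    return [tuple((1.0 - t) * src + t * dst)
            for t in np.linspace(0.0, 1.0, steps)]

# From 100% surprise to 100% angry in six steps; the second step
# is the 80% surprise / 20% angry mixture mentioned above.
path = transition_path([100.0, 0.0], [0.0, 100.0], steps=6)
```

Feeding each intermediate vector to the relative scheme would then synthesize the gradual emotion transition.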
Here we show an emotion triangle, which can flexibly transition the emotion states among happy, sad, and angry.
In future work, there are several topics we would like to study. The first is inclusive emotional speech generation. Current studies always require high-quality recorded speech data: we ask actors to perform the emotions, which may also create stereotypes of emotions. So we think these studies are exclusive. To build an inclusive emotional speech generation framework, we would like to use data from real life. We need to study how to associate emotional prosody with linguistic information, and how to reduce the impact of confounding factors.
The second research direction is affective vocal burst synthesis. In our research, we only focus on modelling paralinguistic features such as intonation, energy, and speaking rate. But in real life, humans also convey their emotions through non-linguistic features, which we call vocal bursts, such as laughs, sighs, grunts, and cries. The synthesis of vocal bursts has not received enough attention from the community. Recently, the Expressive Vocalizations Workshop was held at ICML to promote the understanding and synthesis of expressive non-verbal vocalizations. This research topic is receiving more and more attention from the community.
The last one is the ethical study. It is evident that just because artificial intelligence can do something, it does not mean it should. While emotional speech generation has huge potential to improve human-computer interaction, there are still several societal challenges that need to be carefully studied by our community.
The first is the deepfake issue. With emotional speech generation techniques, we are able to manipulate the emotions of an individual, changing their mood, feelings, or opinions. This can increase the risk of spreading disinformation or defaming a politician. The second is the privacy issue. Emotions are the result of human experiences. As mentioned previously, our next step is to build an emotional speech generation framework with naturalistic emotions, which raises privacy concerns in the data collection process. Another issue is bias across individuals, languages, and cultures. In our research, we focused on English and did not pay attention to other languages, which introduces biases into our methods and results. The final question is whether we want our artificial agents to be emotional or not. Equipping artificial agents with emotions will give them the capability to influence our emotions in ways we may prefer or wish to avoid. Furthermore, this technology will give robots a certain personality that makes them indistinguishable from humans. Emotional overreliance on robots may affect humans' social skills and make them feel lonelier in real life.
I would like to take this opportunity to express my gratitude to my supervisor, Professor Haizhou Li. Professor Li hosted my Master's and PhD studies in the HLT lab. I spent an incredible five years there and really grew a lot under his guidance. His hardworking spirit and scientific attitude encouraged me greatly. He taught me how to be an independent researcher and offered me the opportunity to visit the US and Germany. I cannot imagine a better supervisor.
I would also like to thank Prof. Berrak Sisman from UTD, Prof. Rajib Rana from USQ, Prof. Bjorn Schuller from ICL, and Prof. Tanja Schultz from the University of Bremen for their help during my PhD. I also want to thank my friends and colleagues from NUS, SUTD, UT Dallas, and the University of Bremen.
At last, I want to thank my dear parents for their constant love and support.
Doing a PhD is also a period of self-discovery. Beyond research, as a logo designer, I designed logos for different laboratories in Singapore and Germany, and also for international conferences held in Singapore, for example ASRU 2019 and ICASSP 2022. It gave me a sense of fulfillment and made my PhD journey much more meaningful.
This is the end of my presentation. Thank you very much!