1. Advancements in Voice Cloning:
A Comprehensive Overview
Exploring Cutting-Edge Research in Voice Synthesis
Tribhuvan University
Institute of Science and Technology
School of Mathematical Sciences
by
Aatiz Ghimiré
MDS 555 - Natural Language Processing
2. Introduction
● Voice cloning aims to generate natural speech for a variety of speakers in a
data-efficient manner.
● Applications include dubbing and localization, character voices, voice
assistance for people with disabilities, personalized virtual assistants, and
podcasting and content creation.
3. Objectives
● This presentation covers early voice-cloning papers that reported effective results.
● Three papers were selected (listed below), but only two were analyzed in depth.
○ “Neural Voice Cloning with a Few Samples”, Sercan O. Arik, Jitong Chen, Kainan Peng,
Wei Ping, Yanqi Zhou [NeurIPS, 2018], Baidu
○ “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech
Synthesis”, Ye Jia, Yu Zhang, Ron J. Weiss [NeurIPS, 2018], Google
○ “Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and
Cross-Language Voice Cloning”, Yu Zhang, Ron J. Weiss, Google
● Shared trait: sequence-to-sequence synthesis, with Papers 2 and 3 built on the
Tacotron 2 architecture.
4. Terminologies
● Neural speech synthesis: generating speech waveforms with neural networks.
● Few-shot generative modeling: learning to generate from only a few examples.
● Speaker-dependent speech processing: processing tailored to the
characteristics of a specific speaker.
● Speaker adaptation: fine-tuning a multi-speaker generative model.
● Speaker encoding: generating a fixed-dimensional embedding vector that
represents a speaker.
● Sequence-to-sequence synthesis network: based on Tacotron 2; generates a
mel spectrogram from text, conditioned on the speaker embedding.
● WaveNet-based vocoder network: converts the mel spectrogram into
time-domain waveform samples.
5. Terminologies
Fig: Mel spectrogram from different text
Fig: Word embedding
Fig: Speaker embedding
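As a concrete illustration of the mel-spectrogram figure: a mel spectrogram is computed from a waveform by a short-time Fourier transform followed by a triangular mel filterbank. The sketch below is a minimal NumPy version; all parameter values (sample rate, FFT size, hop, number of mel bands) are illustrative, not taken from the papers.

```python
# Minimal sketch of computing a mel spectrogram from a waveform.
# Parameter values are illustrative placeholders.
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert mel-scale values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(wave, sr=16000, n_fft=512, hop=128, n_mels=40):
    # Short-time Fourier transform via framed, windowed FFTs.
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(wave) - n_fft, hop):
        frame = wave[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    power = np.array(frames).T ** 2               # (n_fft//2 + 1, n_frames)

    # Triangular mel filterbank spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return np.log(fb @ power + 1e-8)              # (n_mels, n_frames)

# A pure 440 Hz tone: energy concentrates in a few low mel bands.
t = np.arange(16000) / 16000.0
mel = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
```

Different texts produce different frame counts and energy patterns, which is why the figure shows distinct mel spectrograms per utterance.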
6. Paper 1: Neural Voice Cloning with a Few Samples
● From multi-speaker generative modeling to voice cloning.
● Multi-speaker generative model f(ti,j, si; W, esi), which takes a text ti,j and a
speaker identity si. The model is parameterized by the trainable weights W and
the speaker embeddings esi.
● For voice cloning, we extract the speaker characteristics of an unseen speaker sk
from a set of cloning audios Ask, and then generate audio for that speaker given
any text.
● Speaker adaptation: fine-tune a trained multi-speaker model. Fine-tuning can be
applied to either the speaker embedding alone or the whole model.
● The multi-speaker generative model is based on a convolutional
sequence-to-sequence architecture.
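A toy sketch of the embedding-only variant of speaker adaptation: with the pretrained weights W frozen, only the new speaker's embedding e is updated by gradient descent on a few cloning samples. The linear "model" and random data below are stand-ins, not the paper's convolutional architecture.

```python
# Toy speaker adaptation: fine-tune only the speaker embedding e of a
# "trained" multi-speaker model f(t, s; W, e_s), keeping W frozen.
# The linear model and random data are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
D_TEXT, D_EMB, D_OUT = 8, 4, 8

W = rng.normal(size=(D_TEXT + D_EMB, D_OUT)) * 0.5   # pretrained weights (frozen)
e = np.zeros(D_EMB)                                   # new speaker's embedding

def model(text, emb, weights):
    """f(t, s; W, e_s): predict audio features from text + speaker embedding."""
    x = np.concatenate([text, np.broadcast_to(emb, (len(text), D_EMB))], axis=1)
    return x @ weights

# A few "cloning samples" from the unseen speaker (targets synthesized
# from the same model with the speaker's true embedding).
texts = rng.normal(size=(5, D_TEXT))
targets = model(texts, rng.normal(size=D_EMB), W)

loss0 = float(np.mean((model(texts, e, W) - targets) ** 2))

lr = 0.1
for _ in range(2000):
    err = model(texts, e, W) - targets                # dL/dpred for 0.5*MSE
    grad_e = (err @ W[D_TEXT:].T).mean(axis=0)        # gradient w.r.t. e only
    e -= lr * grad_e

loss = float(np.mean((model(texts, e, W) - targets) ** 2))
```

Whole-model adaptation would additionally update W with its own gradient; the paper compares both regimes in terms of data and cloning-time requirements.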
7. Paper 1: Neural Voice Cloning with a Few Samples
● Speaker encoding trains a separate model to directly infer a new speaker
embedding, which is then applied to the multi-speaker generative model.
8. Paper 2: Transfer Learning from Speaker Verification to Multispeaker TTS Synthesis
● Multi-speaker speech synthesis model
● Three independently trained neural networks:
○ A recurrent speaker encoder, which computes a fixed-dimensional vector from a speech signal.
○ A sequence-to-sequence synthesizer, which predicts a mel spectrogram from a sequence of
phoneme inputs, conditioned on the speaker embedding vector.
○ An autoregressive WaveNet vocoder, which converts the spectrogram into time-domain
waveforms.
Fig: Model overview. Each of the three components is trained independently.
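The data flow through the three components can be sketched at the interface level. The internals below are toy stand-ins; only the shapes and conditioning mirror the paper's design (reference audio → embedding → mel spectrogram → waveform).

```python
# Interface-level sketch of the Paper 2 pipeline: three independently
# trained networks chained at inference time. All internals are toy
# placeholders; only the data flow mirrors the paper.
import numpy as np

rng = np.random.default_rng(2)
D_EMB, N_MELS, HOP = 4, 40, 128

def speaker_encoder(reference_audio):
    """Fixed-dimensional speaker vector from a reference utterance (stand-in)."""
    return np.tanh(np.resize(reference_audio, D_EMB))

def synthesizer(phonemes, embedding):
    """Mel spectrogram from phoneme IDs, conditioned on the embedding (stand-in)."""
    n_frames = 5 * len(phonemes)              # toy fixed-rate duration model
    base = rng.normal(size=(N_MELS, n_frames))
    return base + embedding.mean()            # toy conditioning on the speaker

def vocoder(mel):
    """Upsample a mel spectrogram to a time-domain waveform (stand-in)."""
    return np.repeat(mel.mean(axis=0), HOP)

ref = rng.normal(size=8000)                   # reference clip from target speaker
emb = speaker_encoder(ref)
mel = synthesizer([12, 7, 31, 5], emb)
wave = vocoder(mel)
```

Because each stage only consumes the previous stage's output, the three networks can be trained on different datasets, which is the transfer-learning point of the paper.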
9. Datasets
● LibriSpeech: 2,484 speakers, 820 hours total; relatively poor audio quality.
(Paper 1: training)
● VCTK: 108 native speakers of English with various accents; 44 hours of clean
speech. (Paper 1: cloning target; Paper 2: training)
10. Performance: Paper 1
Table: Naturalness, 5-scale mean opinion score (MOS)
Table: Summary of the approaches, listing requirements for training data, cloning time, and memory footprint.
11. Performance: Paper 1
Table: Similarity score evaluations, 4-scale similarity score
12. Performance: Paper 2
Table: Naturalness, 5-scale mean opinion score (MOS)
Table: Similarity score evaluations, 4-scale similarity score
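The 4-scale similarity scores in these tables come from human raters. A common objective proxy, in the spirit of the speaker-verification evaluations discussed in Paper 2, is the cosine similarity between speaker embeddings of the cloned and real audio; the embeddings below are random placeholders.

```python
# Objective similarity proxy: cosine similarity between speaker embeddings
# of cloned and real audio. The embeddings here are random placeholders,
# not outputs of a trained speaker encoder.
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
real = rng.normal(size=256)
cloned = real + 0.1 * rng.normal(size=256)   # clone close to the real speaker
other = rng.normal(size=256)                  # unrelated speaker

sim_same = cosine_similarity(real, cloned)
sim_diff = cosine_similarity(real, other)
```

A good clone should score near 1 against the real speaker and near 0 against unrelated speakers.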
13. Results, Challenges and Limitations
- The proposed techniques can potentially be improved with better multi-speaker
models in the future. (Paper 1)
- The proposed model does not attain human-level naturalness, despite the use of
a WaveNet vocoder (with its very high inference cost), in contrast to the
single-speaker results. (Paper 2)
- Use of datasets with lower data quality. An additional limitation is the
model’s inability to transfer accents. (Paper 2)
15. Ethical Considerations
● Potential for misuse of this technology
● For example, impersonating someone’s voice without their consent
● DeepFake
17. Acknowledgments
“Neural Voice Cloning with a Few Samples”, Sercan O. Arik, Jitong Chen, Kainan
Peng, Wei Ping, Yanqi Zhou [NeurIPS, 2018], Baidu
“Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech
Synthesis”, Ye Jia, Yu Zhang, Ron J. Weiss [NeurIPS, 2018], Google
“Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis
and Cross-Language Voice Cloning”, Yu Zhang, Ron J. Weiss, Google