1. Advancements in Voice Cloning:
A Comprehensive Overview
Exploring Cutting-Edge Research in Voice Synthesis
Tribhuvan University
Institute of Science and Technology
School of Mathematical Sciences
by
Aatiz Ghimiré
MDS 555 - Natural Language Processing
2. Introduction
● Voice cloning aims to generate natural speech for a variety of speakers in a
data-efficient manner.
● Applications include dubbing and localization, character voices, voice
assistance for people with disabilities, personalized virtual assistants, and
podcasting and content creation.
3. Objectives
● This presentation covers early voice-cloning papers that reported effective results.
● Three papers were selected (listed below), but only two were analyzed in depth.
○ “Neural Voice Cloning with a Few Samples”, Sercan O. Arik, Jitong Chen, Kainan Peng,
Wei Ping, Yanqi Zhou [NeurIPS, 2018], Baidu
○ “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech
Synthesis”, Ye Jia, Yu Zhang, Ron J. Weiss [NeurIPS, 2018], Google
○ “Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and
Cross-Language Voice Cloning”, Yu Zhang, Ron J. Weiss, Google
● Shared trait: sequence-to-sequence synthesis, with Papers 2 and 3 built on the
Tacotron 2 architecture.
4. Terminologies
● Neural speech synthesis: generating speech waveforms with neural networks.
● Few-shot generative modeling: learning to generate from only a few examples.
● Speaker-dependent speech processing: processing tailored to the
characteristics of a specific speaker.
● Speaker adaptation: fine-tuning a multi-speaker generative model.
● Speaker encoding: generating a fixed-dimensional embedding vector that
represents a speaker.
● Sequence-to-sequence synthesis network: based on Tacotron 2; generates a
mel spectrogram from text, conditioned on the speaker embedding.
● WaveNet-based vocoder network: converts the mel spectrogram into
time-domain waveform samples.
5. Terminologies
Fig: Mel spectrogram from different text
Fig: Word embedding
Fig: Speaker embedding
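As a concrete illustration of the mel-spectrogram figure: a mel spectrogram is computed from a waveform by a short-time Fourier transform followed by a triangular mel filterbank. The sketch below is a minimal NumPy version; all parameter values (sample rate, FFT size, hop, number of mel bands) are illustrative, not taken from the papers.

```python
# Minimal sketch of computing a mel spectrogram from a waveform.
# Parameter values are illustrative placeholders.
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert mel-scale values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(wave, sr=16000, n_fft=512, hop=128, n_mels=40):
    # Short-time Fourier transform via framed, windowed FFTs.
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(wave) - n_fft, hop):
        frame = wave[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    power = np.array(frames).T ** 2               # (n_fft//2 + 1, n_frames)

    # Triangular mel filterbank spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return np.log(fb @ power + 1e-8)              # (n_mels, n_frames)

# A pure 440 Hz tone: energy concentrates in a few low mel bands.
t = np.arange(16000) / 16000.0
mel = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
```

Different texts produce different frame counts and energy patterns, which is why the figure shows distinct mel spectrograms per utterance.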
6. Paper 1: Neural Voice Cloning with a Few Samples
● From multi-speaker generative modeling to voice cloning.
● Multi-speaker generative model f(ti,j, si; W, esi), which takes a text ti,j and a
speaker identity si. The model is parameterized by the trainable weights W and
the speaker embeddings esi.
● For voice cloning, we extract the speaker characteristics of an unseen speaker sk
from a set of cloning audios Ask, and then generate audio for that speaker given
any text.
● Speaker adaptation: fine-tune a trained multi-speaker model. Fine-tuning can be
applied to either the speaker embedding alone or the whole model.
● The multi-speaker generative model is based on a convolutional
sequence-to-sequence architecture.
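A toy sketch of the embedding-only variant of speaker adaptation: with the pretrained weights W frozen, only the new speaker's embedding e is updated by gradient descent on a few cloning samples. The linear "model" and random data below are stand-ins, not the paper's convolutional architecture.

```python
# Toy speaker adaptation: fine-tune only the speaker embedding e of a
# "trained" multi-speaker model f(t, s; W, e_s), keeping W frozen.
# The linear model and random data are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
D_TEXT, D_EMB, D_OUT = 8, 4, 8

W = rng.normal(size=(D_TEXT + D_EMB, D_OUT)) * 0.5   # pretrained weights (frozen)
e = np.zeros(D_EMB)                                   # new speaker's embedding

def model(text, emb, weights):
    """f(t, s; W, e_s): predict audio features from text + speaker embedding."""
    x = np.concatenate([text, np.broadcast_to(emb, (len(text), D_EMB))], axis=1)
    return x @ weights

# A few "cloning samples" from the unseen speaker (targets synthesized
# from the same model with the speaker's true embedding).
texts = rng.normal(size=(5, D_TEXT))
targets = model(texts, rng.normal(size=D_EMB), W)

loss0 = float(np.mean((model(texts, e, W) - targets) ** 2))

lr = 0.1
for _ in range(2000):
    err = model(texts, e, W) - targets                # dL/dpred for 0.5*MSE
    grad_e = (err @ W[D_TEXT:].T).mean(axis=0)        # gradient w.r.t. e only
    e -= lr * grad_e

loss = float(np.mean((model(texts, e, W) - targets) ** 2))
```

Whole-model adaptation would additionally update W with its own gradient; the paper compares both regimes in terms of data and cloning-time requirements.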
7. Paper 1: Neural Voice Cloning with a Few Samples
● Speaker encoding trains a separate model to directly infer a new speaker
embedding, which is then applied to the multi-speaker generative model.
8. Paper 2: Transfer Learning from Speaker Verification to Multispeaker TTS Synthesis
● Multi-speaker speech synthesis model
● Three independently trained neural networks:
○ A recurrent speaker encoder, which computes a fixed-dimensional vector from a speech signal.
○ A sequence-to-sequence synthesizer, which predicts a mel spectrogram from a sequence of
phoneme inputs, conditioned on the speaker embedding vector.
○ An autoregressive WaveNet vocoder, which converts the spectrogram into time-domain
waveforms.
Fig: Model overview. Each of the three components is trained independently.
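The data flow through the three components can be sketched at the interface level. The internals below are toy stand-ins; only the shapes and conditioning mirror the paper's design (reference audio → embedding → mel spectrogram → waveform).

```python
# Interface-level sketch of the Paper 2 pipeline: three independently
# trained networks chained at inference time. All internals are toy
# placeholders; only the data flow mirrors the paper.
import numpy as np

rng = np.random.default_rng(2)
D_EMB, N_MELS, HOP = 4, 40, 128

def speaker_encoder(reference_audio):
    """Fixed-dimensional speaker vector from a reference utterance (stand-in)."""
    return np.tanh(np.resize(reference_audio, D_EMB))

def synthesizer(phonemes, embedding):
    """Mel spectrogram from phoneme IDs, conditioned on the embedding (stand-in)."""
    n_frames = 5 * len(phonemes)              # toy fixed-rate duration model
    base = rng.normal(size=(N_MELS, n_frames))
    return base + embedding.mean()            # toy conditioning on the speaker

def vocoder(mel):
    """Upsample a mel spectrogram to a time-domain waveform (stand-in)."""
    return np.repeat(mel.mean(axis=0), HOP)

ref = rng.normal(size=8000)                   # reference clip from target speaker
emb = speaker_encoder(ref)
mel = synthesizer([12, 7, 31, 5], emb)
wave = vocoder(mel)
```

Because each stage only consumes the previous stage's output, the three networks can be trained on different datasets, which is the transfer-learning point of the paper.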
9. Datasets
● LibriSpeech: 2,484 speakers, 820 hours total; relatively poor audio quality.
(Paper 1: training)
● VCTK: 108 native speakers of English with various accents; 44 hours of clean
speech. (Paper 1: cloning target; Paper 2: training)
10. Performance: Paper 1
Table: Naturalness, 5-scale mean opinion score (MOS)
Table: Summary of the approaches, listing requirements for training data, cloning time, and memory footprint.
11. Performance: Paper 1
Table: Similarity score evaluations, 4-scale similarity score
12. Performance: Paper 2
Table: Naturalness, 5-scale mean opinion score (MOS)
Table: Similarity score evaluations, 4-scale similarity score
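The 4-scale similarity scores in these tables come from human raters. A common objective proxy, in the spirit of the speaker-verification evaluations discussed in Paper 2, is the cosine similarity between speaker embeddings of the cloned and real audio; the embeddings below are random placeholders.

```python
# Objective similarity proxy: cosine similarity between speaker embeddings
# of cloned and real audio. The embeddings here are random placeholders,
# not outputs of a trained speaker encoder.
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
real = rng.normal(size=256)
cloned = real + 0.1 * rng.normal(size=256)   # clone close to the real speaker
other = rng.normal(size=256)                  # unrelated speaker

sim_same = cosine_similarity(real, cloned)
sim_diff = cosine_similarity(real, other)
```

A good clone should score near 1 against the real speaker and near 0 against unrelated speakers.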
13. Results, Challenges and Limitations
- The proposed techniques can potentially be improved with better multi-speaker
models in the future. (Paper 1)
- The proposed model does not attain human-level naturalness, despite the use of
a WaveNet vocoder (with its very high inference cost), in contrast to the
single-speaker results. (Paper 2)
- Use of datasets with lower data quality. An additional limitation is the
model’s inability to transfer accents. (Paper 2)
15. Ethical Considerations
● Potential for misuse of this technology
● For example, impersonating someone’s voice without their consent
● DeepFake
17. Acknowledgments
“Neural Voice Cloning with a Few Samples”, Sercan O. Arik, Jitong Chen, Kainan
Peng, Wei Ping, Yanqi Zhou [NeurIPS, 2018], Baidu
“Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech
Synthesis”, Ye Jia, Yu Zhang, Ron J. Weiss [NeurIPS, 2018], Google
“Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis
and Cross-Language Voice Cloning”, Yu Zhang, Ron J. Weiss, Google