1. REAL TIME VOICE CLONING
PRESENTED BY
N.GAYATHRI PRIYA (19HQ1A0528)
L.RENUKA (19HQ1A0523)
K.SAI VARSHA (19HQ1A0519)
K.UMA MAHESWAR (19HQ1A0520)
K.BHANU TEJA (19HQ1A0508)
GUIDED BY
Y.GAYATRI
2. CONTENTS
• Abstract
• Introduction
• Existing System
• Proposed System
• System Architecture
• Best of Voice Cloning
• System Requirements
• Conclusion
3. ABSTRACT
• Deep learning models are becoming predominant in many fields of
machine learning, and Text-to-Speech (TTS), the process of synthesizing
artificial speech from text, is no exception. A deep neural network is
typically trained on a corpus of several hours of recorded speech from a
single speaker. Producing the voice of a speaker other than the one it
was trained on is expensive and labor-intensive, since it requires
recording a new dataset and retraining the model. This is the main
reason most TTS models are single-speaker. The proposed approach aims
to overcome these limitations by building a system that models a
multi-speaker acoustic space. This allows the generation of speech
audio resembling the voices of different target speakers, even ones not
observed during the training phase.
4. INTRODUCTION
• Text-to-Speech (TTS) synthesis, the process of generating
natural speech from text, remains a challenging task despite
decades of investigation. Several modern TTS systems achieve
impressive results, synthesizing voices very close to human
speech. Unfortunately, many of these systems learn to
synthesize text with only a single voice. The goal of this
work is to build a TTS system that can generate natural
speech for a wide variety of speakers in a data-efficient
manner, including speakers not seen during the training phase.
• The activity of building this type of model is called
Voice Cloning.
5. EXISTING SYSTEM
• As V2C is a new task, we briefly review several closely
related works in the fields of Text-to-Speech, Voice Cloning, and
Prosody Transfer. Many text-to-speech (TTS) synthesis methods
have been proposed to generate natural speech from text.
• Tacotron is a framework that integrates all the necessary
stages of text-to-speech synthesis, so the speech synthesis
model can be optimized in an end-to-end manner.
• FastSpeech is a more efficient Transformer-based model that
uses a non-autoregressive generation method. Building on
FastSpeech, the improved FastSpeech 2 adds control over the
generated speech through pitch and energy adjustment. However,
the TTS task mainly focuses on converting natural-language text
to speech with correct pronunciation.
6. PROPOSED SYSTEM
• The synthesizer used is the Google Tacotron 2 model, used without
WaveNet. Tacotron is a recurrent sequence-to-sequence model that
predicts a mel spectrogram from text.
• To build the encoder output frames, the input frames are passed
through a bidirectional LSTM.
• This is where SV2TTS modifies the architecture: the embedding of a
speaker is concatenated with each frame that the Tacotron encoder
produces.
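The conditioning step above can be sketched in a few lines of NumPy. This is a minimal illustration of concatenating a speaker embedding onto every encoder frame, not the actual SV2TTS implementation; the shapes (512-dim encoder frames, 256-dim embedding) are assumptions chosen for illustration.

```python
import numpy as np

# Assumed shapes for illustration: 80 encoder output frames of
# dimension 512, and a 256-dimensional speaker embedding.
encoder_frames = np.random.randn(80, 512)   # (time, channels)
speaker_embedding = np.random.randn(256)    # one vector per speaker

# SV2TTS-style conditioning: broadcast the embedding across time
# and concatenate it onto every encoder output frame.
tiled = np.tile(speaker_embedding, (encoder_frames.shape[0], 1))  # (80, 256)
conditioned = np.concatenate([encoder_frames, tiled], axis=1)     # (80, 768)

print(conditioned.shape)
```

Because the same embedding is attached to every frame, the decoder sees the target speaker's identity at every step of the sequence.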
8. MAIN COMPONENTS
The proposed system consists of three components :
1. Speaker Encoder : The speaker encoder takes a reference voice recording as
input and analyzes the wavelength and frequency of the referenced voice.
2. Synthesizer : The synthesizer takes text as input and synthesizes it with
the voice characteristics of the reference recording.
3. Neural Vocoder : Finally, the neural vocoder takes the output of the
synthesizer and generates the speech waveform.
• The synthesized voice is refined in a loop until it is clear, undisturbed,
and noise-free; only then does it proceed to the neural vocoder, which
generates the speech waveform.
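The three-stage pipeline above can be sketched as plain Python functions. The function names and bodies here are hypothetical stand-ins for the real trained networks (the actual encoder, synthesizer, and vocoder are neural models); the sketch only shows how data flows between the stages.

```python
import numpy as np

def speaker_encoder(reference_wav: np.ndarray) -> np.ndarray:
    """Map a reference utterance to a fixed-size speaker embedding.

    Stand-in: a 256-dim summary of the waveform (a real encoder
    would be a trained neural network).
    """
    return np.resize(reference_wav, 256)

def synthesizer(text: str, embedding: np.ndarray) -> np.ndarray:
    """Predict a mel spectrogram (frames x mel bins) conditioned
    on the speaker embedding. Stand-in: ~5 frames per character."""
    n_frames = max(1, len(text)) * 5
    return np.zeros((n_frames, 80)) + embedding[:80]

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Turn a mel spectrogram into a waveform.

    Stand-in: 256 samples per frame (a typical hop length)."""
    hop = 256
    return np.zeros(mel.shape[0] * hop)

# Wiring the stages together: reference audio -> embedding ->
# spectrogram -> waveform.
reference = np.random.randn(16000)        # 1 s of audio at 16 kHz
emb = speaker_encoder(reference)
mel = synthesizer("hello world", emb)
wav = vocoder(mel)
print(emb.shape, mel.shape, wav.shape)
```

The key point the sketch captures is that only the speaker encoder ever sees the reference voice; the synthesizer receives just the embedding, which is what lets the system clone speakers unseen during training.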
9. BEST OF VOICE CLONING
Best of voice cloning covers the following three key criteria:
1. Output quality : Real Time Voice Cloning provides the best output,
i.e., noise-free, crystal-clear speech from text.
2. Intuitive interface : The voice cloning application is easy to use.
3. Voice protection : The Real Time Voice Cloning application provides an
interface with many user-privacy features.
10. SYSTEM REQUIREMENTS
Software Requirements :
✓ Windows 10 or Ubuntu 20.04+ operating
system
Hardware Requirements :
✓ 5GB+ Disk space
✓ NVIDIA GPU with at least 4GB of memory
& driver version 456.38+ (optional)
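A quick pre-flight check for the requirements above can be done with the Python standard library. This is an assumed helper, not part of the project: the 5 GB figure comes from the slide, and the GPU check is only a tooling probe, matching the "(optional)" note on the NVIDIA GPU.

```python
import shutil

# Check free disk space against the 5 GB requirement listed above.
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk space: {free_gb:.1f} GB "
      f"({'OK' if free_gb >= 5 else 'insufficient'})")

# The GPU is optional: probe for the NVIDIA driver tools on PATH
# rather than requiring them.
has_nvidia_smi = shutil.which("nvidia-smi") is not None
print("NVIDIA driver tools found" if has_nvidia_smi
      else "No GPU tooling detected (CPU mode)")
```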
11. CONCLUSION
• In this work, our goal was to build a Voice
Cloning system that could generate natural
speech for a variety of target speakers in a
data-efficient manner. Our system combines an
independently trained speaker encoder network
with a sequence-to-sequence architecture with
attention and a neural vocoder model.
• Using transfer learning from a
speaker-discriminative encoder model based on
utterance embeddings rather than speaker
embeddings, the synthesizer and the vocoder are
able to generate good-quality speech even for
speakers not observed before.