Advancements in Voice Cloning:
A Comprehensive Overview
Exploring Cutting-Edge Research in Voice Synthesis
Tribhuvan University, Institute of Science and Technology, School of Mathematical Sciences
by
Aatiz Ghimiré
MDS 555 - Natural Language Processing
Introduction
● Voice cloning aims to generate natural speech for a variety of speakers in a
data-efficient manner.
● Applications include dubbing and localization, character voices, voice assistance
for people with disabilities, personalized virtual assistants, and podcasting and
content creation.
Objectives
● This talk covers early voice cloning papers that reported effective results.
● Three papers were selected (listed below), but only the first two were analyzed.
○ “Neural Voice Cloning with a Few Samples”, Sercan O. Arik, Jitong Chen, Kainan Peng,
Wei Ping, Yanqi Zhou [NeurIPS, 2018], Baidu
○ “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech
Synthesis”, Ye Jia, Yu Zhang, Ron J. Weiss, et al. [NeurIPS, 2018], Google
○ “Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and
Cross-Language Voice Cloning”, Yu Zhang, Ron J. Weiss, et al., Google
● Similarity: the two Google papers build on the Tacotron 2 architecture.
Terminologies
● Neural speech synthesis:
● Few-shot generative modeling:
● Speaker-dependent speech processing:
● Speaker adaptation: fine-tuning a multi-speaker generative model.
● Speaker encoding: generating a fixed-dimensional embedding vector.
● Synthesizer: a sequence-to-sequence synthesis network based on Tacotron 2 that
generates a mel spectrogram from text, conditioned on the speaker embedding.
● Vocoder: a WaveNet-based network that converts the mel spectrogram into
time-domain waveform samples.
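The "mel" in mel spectrogram refers to a perceptual frequency scale. As a minimal sketch, assuming the common HTK-style conversion formula (the papers discussed here may use a different variant), the mapping between Hz and mels and the resulting band spacing can be written as follows. The function names are illustrative only.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> list:
    """Return n_bands + 1 band edges in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]


# Example values (80 bands up to 8 kHz) are chosen for illustration only.
edges = mel_band_edges(0.0, 8000.0, 80)
print(f"first band width: {edges[1] - edges[0]:.1f} Hz")
print(f"last band width:  {edges[-1] - edges[-2]:.1f} Hz")
```

Bands that are equally wide in mels are narrow at low frequencies and wide at high frequencies, which is why a mel spectrogram gives more resolution to the lower part of the spectrum.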
Terminologies
Fig: Mel spectrogram from different text
Fig: Word embedding
Fig: Speaker embedding
Paper 1: Neural Voice Cloning with a Few Samples
● From multi-speaker generative modeling to voice cloning
● The multi-speaker generative model, $f(t_{i,j}, s_i; W, e_{s_i})$, takes a text $t_{i,j}$
and a speaker identity $s_i$. Its trainable parameters are $W$ and the speaker
embedding $e_{s_i}$.
● For voice cloning, we extract the speaker characteristics of an unseen speaker $s_k$
from a set of cloning audios $A_{s_k}$ and generate audio for that speaker given any
text.
● Speaker adaptation: fine-tune a trained multi-speaker model for the unseen speaker.
Fine-tuning can be applied to either the speaker embedding alone or the whole model.
● The multi-speaker generative model is based on a convolutional
sequence-to-sequence architecture.
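The embedding-only variant of speaker adaptation can be illustrated with a toy example. The sketch below is not the paper's network: it assumes a hypothetical linear stand-in for f(t; W, e), keeps the shared weights W frozen, and runs gradient descent on the speaker embedding e alone using a handful of cloning samples. All names, dimensions, and hyperparameters are assumptions made for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def model(text_feat, W_text, W_spk, e):
    """Toy stand-in for f(t; W, e): linear in text features and speaker embedding."""
    return [a + b for a, b in zip(matvec(W_text, text_feat), matvec(W_spk, e))]


def mse(samples, W_text, W_spk, e):
    """Mean squared error over (text_feat, target) cloning samples."""
    total = 0.0
    for text_feat, target in samples:
        pred = model(text_feat, W_text, W_spk, e)
        total += sum((p - y) ** 2 for p, y in zip(pred, target))
    return total / len(samples)


def adapt_embedding(samples, W_text, W_spk, e_init, lr=0.05, steps=200):
    """Fine-tune only the speaker embedding; W_text and W_spk stay frozen."""
    e = list(e_init)
    for _ in range(steps):
        grad = [0.0] * len(e)
        for text_feat, target in samples:
            pred = model(text_feat, W_text, W_spk, e)
            err = [p - y for p, y in zip(pred, target)]
            for k in range(len(e)):
                grad[k] += 2.0 * sum(err[i] * W_spk[i][k] for i in range(len(err)))
        e = [e_k - lr * g / len(samples) for e_k, g in zip(e, grad)]
    return e


rng = random.Random(0)
W_text = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
W_spk = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
e_true = [0.8, -0.5]  # embedding of the "unseen" speaker, unknown to the model
samples = []
for _ in range(5):  # a few cloning samples
    t = [rng.uniform(-1, 1) for _ in range(3)]
    samples.append((t, model(t, W_text, W_spk, e_true)))

e0 = [0.0, 0.0]
e_adapted = adapt_embedding(samples, W_text, W_spk, e0)
print("loss before:", mse(samples, W_text, W_spk, e0))
print("loss after: ", mse(samples, W_text, W_spk, e_adapted))
```

Adapting only the embedding touches very few parameters per speaker, whereas fine-tuning the whole model would also update the shared weights.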
Paper 1: Neural Voice Cloning with a Few Samples
● Speaker encoding trains a separate model to directly infer a new speaker
embedding, which is then supplied to the multi-speaker generative model.
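The interface of a speaker encoder (variable-length cloning audios in, one fixed-size vector out) can be sketched as below. The mean pooling and L2 normalization are placeholders standing in for the trained neural network; they are assumptions for illustration, not the architecture from the paper.

```python
import math


def utterance_embedding(frames):
    """Mean-pool per-frame feature vectors over time (placeholder for a trained network)."""
    n_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]


def speaker_embedding(cloning_audios):
    """Average the utterance embeddings and L2-normalize the result."""
    per_utt = [utterance_embedding(frames) for frames in cloning_audios]
    dim = len(per_utt[0])
    mean = [sum(u[d] for u in per_utt) / len(per_utt) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # guard against a zero vector
    return [x / norm for x in mean]


# Two hypothetical cloning audios of different lengths, 3 features per frame.
audio_a = [[0.2, 0.1, 0.4], [0.3, 0.0, 0.5]]
audio_b = [[0.1, 0.2, 0.3], [0.2, 0.2, 0.4], [0.3, 0.1, 0.6]]
print(speaker_embedding([audio_a, audio_b]))
```

The output size depends only on the feature dimension, not on how many cloning audios are supplied or how long they are.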
Paper 2: Transfer Learning from Speaker Verification to Multispeaker TTS Synthesis
● Multi-speaker speech synthesis model
● Three independently trained neural networks:
○ A recurrent speaker encoder, which computes a fixed-dimensional vector from a speech signal.
○ A sequence-to-sequence synthesizer, which predicts a mel spectrogram from a sequence of
phoneme inputs, conditioned on the speaker embedding vector.
○ An autoregressive WaveNet vocoder, which converts the spectrogram into time-domain
waveforms.
Fig: Model overview. Each of the three components is trained independently.
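How the three components connect at inference time can be sketched as a simple function composition. The toy encoder, synthesizer, and vocoder below are hypothetical stand-ins with made-up arithmetic; only the data flow (reference audio to embedding, text plus embedding to mel spectrogram, mel spectrogram to waveform) follows the description above.

```python
from typing import Callable, List

Waveform = List[float]
Embedding = List[float]
Mel = List[List[float]]

HOP = 4  # hypothetical number of waveform samples generated per mel frame


def toy_speaker_encoder(audio: Waveform) -> Embedding:
    """Stand-in encoder: summarize the audio as a fixed-size vector."""
    mean = sum(audio) / len(audio)
    peak = max(abs(x) for x in audio)
    return [mean, peak]


def toy_synthesizer(text: str, embedding: Embedding) -> Mel:
    """Stand-in synthesizer: one 'mel frame' per character, shifted by the embedding."""
    return [[ord(ch) / 255.0 + e for e in embedding] for ch in text]


def toy_vocoder(mel: Mel) -> Waveform:
    """Stand-in vocoder: expand each frame into HOP waveform samples."""
    wave: Waveform = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wave.extend([level] * HOP)
    return wave


def clone_voice(
    reference_audio: Waveform,
    text: str,
    encoder: Callable[[Waveform], Embedding],
    synthesizer: Callable[[str, Embedding], Mel],
    vocoder: Callable[[Mel], Waveform],
) -> Waveform:
    """Run the three stages in sequence."""
    embedding = encoder(reference_audio)
    mel = synthesizer(text, embedding)
    return vocoder(mel)


wave = clone_voice([0.1, -0.2, 0.3], "hi", toy_speaker_encoder, toy_synthesizer, toy_vocoder)
print(len(wave), "samples")
```

Because the stages communicate only through the embedding and the mel spectrogram, each one can be trained or replaced on its own.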
Datasets
● LibriSpeech: 2,484 speakers, totalling 820 hours; the audio samples are of poor
quality. (Paper 1: training)
● VCTK: 108 native speakers of English with various accents, 44 hours of clean
speech; the cloned voices are drawn from this dataset. (Paper 1: cloning; Paper 2: training)
Performance: Paper 1
Table: Naturalness, 5-scale mean opinion score (MOS)
Table: Summary of the approaches, listing the requirements for training, data, cloning time, and memory footprint.
Performance: Paper 1
Table: Similarity score evaluations, 4-scale similarity score
Performance: Paper 2
Table: Naturalness, 5-scale mean opinion score (MOS)
Table: Similarity score evaluations, 4-scale similarity score
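A mean opinion score is the average of listener ratings, usually reported with a confidence interval. The sketch below shows the arithmetic, assuming a normal-approximation 95% interval; the ratings are made-up values for illustration and are not taken from either paper.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of a normal-approximation 95% confidence interval)."""
    n = len(ratings)
    mos = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(n)
    return mos, half_width


# Hypothetical ratings on a 1-5 naturalness scale from eight listeners.
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With few raters the interval is wide, so small differences between systems in a MOS table should be read together with their intervals.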
Results, Challenges, and Limitations
- The proposed techniques can potentially be improved with better
multi-speaker models in the future. (Paper 1)
- The proposed model does not attain human-level naturalness, despite the
use of a WaveNet vocoder (along with its very high inference cost), in
contrast to the single-speaker results. (Paper 2)
- The training datasets have lower data quality. An additional limitation is the
model’s inability to transfer accents. (Paper 2)
Demo
Input Audio
Cloned Audio
Ethical Considerations
● Potential for misuse of this technology
● For example, impersonating someone’s voice without their consent
● Deepfakes
Q&A
Acknowledgments
“Neural Voice Cloning with a Few Samples”, Sercan O. Arik, Jitong Chen, Kainan
Peng, Wei Ping, Yanqi Zhou [NeurIPS, 2018], Baidu
“Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech
Synthesis”, Ye Jia, Yu Zhang, Ron J. Weiss, et al. [NeurIPS, 2018], Google
“Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis
and Cross-Language Voice Cloning”, Yu Zhang, Ron J. Weiss, et al., Google

Voice Cloning

  • 1. Advancements in Voice Cloning: A Comprehensive Overview Exploring Cutting-Edge Research in Voice Synthesis Tribhuvan University, Institute of Science and Technology, School of Mathematical Sciences Tribhuvan University Institute of Science and Technology School of Mathematical Sciences by Aatiz Ghimiré MDS 555 - Natural Language Processing
  • 2. Introduction ● Voice cloning is to generate natural speech for a variety of speakers in a data efficient manner. ● Dubbing and Localization, Character Voices, Voice Assistance for People with Disabilities, Personalized Virtual Assistants and Podcasting and Content Creation. Tribhuvan University, Institute of Science and Technology, School of Mathematical Sciences
  • 3. Objectives ● Here we talk about, the early voice clone papers that has effective results. ● There are 3 paper selected from the below, but actual two were analyzed. ○ “Neural Voice Cloning with a Few Samples”, Sercan O. Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou [ NeurIPS, 2018], Baidu ○ “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis”, , Ye Jia, Yu Zhang, Ron J. Weiss [NeurIPS, 2018], Google ○ “Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning”, Yu Zhang, Ron J. Weiss, Google ● Similarity, based on use of Tacotron 2 architecture. Tribhuvan University, Institute of Science and Technology, School of Mathematical Sciences
  • 4. Terminologies
    ● Neural speech synthesis
    ● Few-shot generative modeling
    ● Speaker-dependent speech processing
    ● Speaker adaptation: fine-tuning a multi-speaker generative model.
    ● Speaker encoding: generating a fixed-dimensional embedding vector.
    ● Synthesizer: a sequence-to-sequence synthesis network based on Tacotron 2 that generates a mel spectrogram from text, conditioned on the speaker embedding.
    ● Vocoder: a WaveNet-based network that converts the mel spectrogram into time-domain waveform samples.
  • 5. Terminologies
    Fig: Mel spectrogram from different text
    Fig: Word embedding
    Fig: Speaker embedding
  • 6. Paper 1: Neural Voice Cloning with a Few Samples
    ● From multi-speaker generative modeling to voice cloning.
    ● The multi-speaker generative model, $f(t_{i,j}, s_i; W, e_{s_i})$, takes a text $t_{i,j}$ and a speaker identity $s_i$. Its trainable parameters are $W$ and the speaker embedding $e_{s_i}$.
    ● For voice cloning, we extract the speaker characteristics of an unseen speaker $s_k$ from a set of cloning audios $\mathcal{A}_{s_k}$, and generate audio for any given text in that speaker's voice.
    ● Speaker adaptation: fine-tune a trained multi-speaker model for the unseen speaker. Fine-tuning can be applied to either the speaker embedding alone or the whole model.
    ● The multi-speaker generative model is based on a convolutional sequence-to-sequence architecture.
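The embedding-only variant of speaker adaptation can be illustrated with a deliberately tiny sketch. Everything here is hypothetical: the "model" is a linear toy standing in for f(t, s; W, e_s), not the convolutional architecture from the paper, and the names `f`, `adapt_embedding`, and `cloning_pairs` are made up for illustration. The point is only that the shared weights W stay frozen while a new embedding is fitted to a few (text, audio) pairs.

```python
# Toy stand-in for a multi-speaker model f(t, s; W, e_s): the "audio feature"
# is W[k] * t + e[k] for a scalar text feature t. This is an assumption made
# for illustration, not the architecture used in the paper.

def f(t, W, e):
    return [w * t + e_k for w, e_k in zip(W, e)]


def adapt_embedding(pairs, W, dim, lr=0.1, steps=200):
    """Fit a new speaker embedding by gradient descent; W is never updated."""
    e = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for t, target in pairs:
            pred = f(t, W, e)
            for k in range(dim):
                # Gradient of the mean squared error with respect to e[k].
                grad[k] += 2.0 * (pred[k] - target[k]) / len(pairs)
        e = [e_k - lr * g for e_k, g in zip(e, grad)]
    return e


W = [0.5, -1.0, 2.0]        # shared weights from multi-speaker training (frozen)
true_e = [0.3, -0.7, 1.2]   # hypothetical embedding of the unseen speaker
cloning_pairs = [(t, f(t, W, true_e)) for t in (0.0, 1.0, 2.0)]

e_new = adapt_embedding(cloning_pairs, W, dim=3)
print([round(x, 3) for x in e_new])
```

Whole-model adaptation would also update W in the same loop, which needs more care with so few samples.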
  • 7. Paper 1: Neural Voice Cloning with a Few Samples ● Speaker encoding is based on training a separate model to directly infer a new speaker embedding, which will be applied to a multi-speaker generative model.
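As a rough illustration of what "inferring a fixed-dimensional embedding" means, the sketch below maps clips of different lengths to vectors of the same size. Mean pooling followed by L2 normalization is a simplifying assumption used here in place of a trained neural encoder; `encode_speaker` and the frame values are hypothetical.

```python
import math


def encode_speaker(frames):
    """Mean-pool variable-length frame features, then L2-normalize.

    A stand-in for a trained speaker encoder: the output size depends only
    on the feature dimension, not on how many frames the clip has.
    """
    dim = len(frames[0])
    mean = [sum(frame[k] for frame in frames) / len(frames) for k in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


# Two hypothetical clips with different numbers of 3-dimensional frames.
short_clip = [[0.2, 0.1, 0.4], [0.3, 0.0, 0.5]]
long_clip = [[0.1, 0.2, 0.3], [0.2, 0.2, 0.2], [0.4, 0.1, 0.3],
             [0.3, 0.3, 0.1], [0.2, 0.1, 0.2]]

emb_short = encode_speaker(short_clip)
emb_long = encode_speaker(long_clip)
print(len(emb_short), len(emb_long))
```

Unlike speaker adaptation, no gradient steps are taken at cloning time: the embedding comes from a single forward pass.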
  • 8. Paper 2: Transfer Learning from Speaker Verification to Multispeaker TTS Synthesis
    ● Multi-speaker speech synthesis model
    ● Three independently trained neural networks:
      ○ A recurrent speaker encoder, which computes a fixed-dimensional vector from a speech signal.
      ○ A sequence-to-sequence synthesizer, which predicts a mel spectrogram from a sequence of phoneme inputs, conditioned on the speaker embedding vector.
      ○ An autoregressive WaveNet vocoder, which converts the spectrogram into time-domain waveforms.
    Fig: Model overview. Each of the three components is trained independently.
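The data flow between the three components can be sketched as plain function composition. The function `clone_voice` and the three toy stand-ins below are hypothetical placeholders, not the networks from the paper; they only show that the speaker embedding is computed once from reference audio and then conditions synthesis for every text.

```python
from typing import Callable, List, Sequence

Embedding = List[float]
Mel = List[List[float]]
Waveform = List[float]


def clone_voice(
    reference_audio: Sequence[float],
    texts: Sequence[str],
    encoder: Callable[[Sequence[float]], Embedding],
    synthesizer: Callable[[str, Embedding], Mel],
    vocoder: Callable[[Mel], Waveform],
) -> List[Waveform]:
    """Run encoder -> synthesizer -> vocoder for each text."""
    embedding = encoder(reference_audio)  # computed once per speaker
    return [vocoder(synthesizer(text, embedding)) for text in texts]


# Toy stand-ins (assumptions for illustration only).
def toy_encoder(audio):
    return [sum(audio) / len(audio)]


def toy_synthesizer(text, embedding):
    # One single-bin "mel frame" per character, shifted by the embedding.
    return [[float(ord(ch)) + embedding[0]] for ch in text]


def toy_vocoder(mel):
    return [frame[0] for frame in mel]


waves = clone_voice([0.1, 0.3, 0.2], ["hi", "hello"],
                    toy_encoder, toy_synthesizer, toy_vocoder)
print([len(w) for w in waves])
```

Because the stages only meet at these interfaces, each one can be trained or swapped independently.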
  • 9. Datasets
    ● LibriSpeech: lower audio quality (2,484 speakers, totalling 820 hours). Used for training in Paper 1.
    ● VCTK: 108 native speakers of English with various accents (44 hours of clean speech). Used for cloning in Paper 1 and for training in Paper 2.
  • 10. Performance: Paper 1
    Table: Naturalness, mean opinion score (MOS) on a 5-point scale
    Table: Summary of the approaches, listing the requirements for training, data, cloning time, and memory footprint
  • 11. Performance: Paper 1
    Table: Similarity score evaluations on a 4-point scale
  • 12. Performance: Paper 2
    Table: Naturalness, mean opinion score (MOS) on a 5-point scale
    Table: Similarity score evaluations on a 4-point scale
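For readers unfamiliar with MOS, the sketch below shows how a mean opinion score might be computed from listener ratings on a 5-point scale. The ratings are made up, and the 95% interval uses a normal approximation (z = 1.96), which is an assumption here rather than the procedure documented in either paper.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return the mean rating and the half-width of its confidence interval."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical listener ratings (1 = bad, 5 = excellent).
ratings = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The 4-point similarity score can be averaged in the same way, with raters judging how close a cloned sample sounds to the target speaker.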
  • 13. Results, Challenges and Limitations
    - The proposed techniques can potentially be improved with better multi-speaker models in the future. (Paper 1)
    - The proposed model does not attain human-level naturalness, despite the use of a WaveNet vocoder (along with its very high inference cost), in contrast to the single-speaker results. (Paper 2)
    - Use of datasets with lower data quality. An additional limitation lies in the model’s inability to transfer accents. (Paper 2)
  • 14. Demo: Input Audio, Cloned Audio
  • 15. Ethical Considerations
    ● Potential for misuse of this technology
    ● For example, impersonating someone’s voice without their consent
    ● Deepfakes
  • 16. Q&A
  • 17. Acknowledgments
    ● “Neural Voice Cloning with a Few Samples”, Sercan O. Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou [NeurIPS, 2018], Baidu
    ● “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis”, Ye Jia, Yu Zhang, Ron J. Weiss [NeurIPS, 2018], Google
    ● “Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning”, Yu Zhang, Ron J. Weiss, Google