Advancements in Voice Cloning:
A Comprehensive Overview
Exploring Cutting-Edge Research in Voice Synthesis
Tribhuvan University, Institute of Science and Technology, School of Mathematical Sciences
by
Aatiz Ghimiré
MDS 555 - Natural Language Processing
Introduction
● Voice cloning aims to generate natural-sounding speech for a variety of speakers in a data-efficient manner.
● Applications include dubbing and localization, character voices, voice assistance for people with disabilities, personalized virtual assistants, and podcasting and content creation.
Objectives
● This talk covers early voice cloning papers that reported effective results.
● Three papers were selected (listed below), but only the first two were analyzed.
○ “Neural Voice Cloning with a Few Samples”, Sercan O. Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou [NeurIPS 2018], Baidu
○ “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis”, Ye Jia, Yu Zhang, Ron J. Weiss, et al. [NeurIPS 2018], Google
○ “Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning”, Yu Zhang, Ron J. Weiss, et al., Google
● Similarity: the two Google papers are based on the Tacotron 2 architecture.
Terminologies
● Neural speech synthesis:
● Few-shot generative modeling:
● Speaker-dependent speech processing:
● Speaker adaptation: fine-tuning a multi-speaker generative model on a new speaker's audio.
● Speaker encoding: generating a fixed-dimensional embedding vector from a speaker's audio.
● Synthesizer: a sequence-to-sequence synthesis network based on Tacotron 2 that generates a mel spectrogram from text, conditioned on the speaker embedding.
● Vocoder: a WaveNet-based network that converts the mel spectrogram into time-domain waveform samples.
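To make the term "mel spectrogram" concrete, here is a minimal NumPy sketch of how one can be computed from a waveform. The mel-scale formula used is one common convention, and the frame size, hop size, and number of mel bands are illustrative assumptions, not values taken from either paper.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale convention (an assumption, not from the papers).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    # n_mels triangular filters whose edges are equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mel_spectrogram(wave, sample_rate=16000, n_fft=512, hop=128, n_mels=40):
    # Short-time Fourier transform with a Hann window, then mel pooling and log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-10)

# One second of a 440 Hz tone as a stand-in for speech.
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440.0 * t)
spec = mel_spectrogram(wave, sr)
print(spec.shape)  # (frames, mel bands)
```

The synthesizer predicts arrays of this kind (frames by mel bands), and the vocoder maps them back to waveform samples.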
Terminologies
Fig: Mel spectrogram from different text
Fig: Word embedding
Fig: Speaker embedding
Paper 1: Neural Voice Cloning with a Few Samples
● From multi-speaker generative modeling to voice cloning
● The multi-speaker generative model, $f(t_{i,j}, s_i; W, e_{s_i})$, takes a text $t_{i,j}$ and a speaker identity $s_i$. Its trainable parameters are $W$, and $e_{s_i}$ is the trainable speaker embedding for $s_i$.
● For voice cloning, the speaker characteristics of an unseen speaker $s_k$ are extracted from a set of cloning audios $A_{s_k}$, and audio is then generated for that speaker from any given text.
● Speaker adaptation: fine-tune a trained multi-speaker model on a few audio samples of the unseen speaker. Fine-tuning can be applied to either the speaker embedding alone or the whole model.
● The multi-speaker generative model is based on a convolutional sequence-to-sequence architecture.
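The embedding-only variant of speaker adaptation can be illustrated with a toy model. This is a minimal sketch under strong assumptions: the "generative model" is a single linear map standing in for f(t, s; W, e_s), the data are random, and all names and sizes (`W`, `adapt_embedding`, the dimensions) are hypothetical. It only shows the idea of keeping W frozen and fitting a new speaker embedding to a few cloning samples by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
text_dim, emb_dim, out_dim = 8, 4, 6

# Stand-in for the weights W of an already trained multi-speaker model.
W = rng.normal(size=(out_dim, text_dim + emb_dim))

def f(text_feat, speaker_emb, W):
    # Toy generative model: a linear map of [text features; speaker embedding].
    return W @ np.concatenate([text_feat, speaker_emb])

# Cloning set for an unseen speaker, generated from a hidden "true" embedding.
true_emb = rng.normal(size=emb_dim)
cloning_texts = rng.normal(size=(5, text_dim))
cloning_audio = np.stack([f(t, true_emb, W) for t in cloning_texts])

def loss(emb):
    preds = np.stack([f(t, emb, W) for t in cloning_texts])
    return float(np.mean((preds - cloning_audio) ** 2))

def adapt_embedding(steps=1000, lr=0.05):
    # Embedding-only adaptation: W stays frozen, only the new embedding moves.
    emb = np.zeros(emb_dim)
    W_e = W[:, text_dim:]  # columns of W that multiply the speaker embedding
    n_terms = len(cloning_texts) * out_dim
    for _ in range(steps):
        grad = np.zeros(emb_dim)
        for t, a in zip(cloning_texts, cloning_audio):
            grad += 2.0 * W_e.T @ (f(t, emb, W) - a)
        emb -= lr * grad / n_terms
    return emb

before = loss(np.zeros(emb_dim))
adapted = adapt_embedding()
after = loss(adapted)
print(f"loss before: {before:.4f}, after: {after:.6f}")
```

Whole-model adaptation would also update `W` in the same loop, which gives more capacity at the cost of more parameters to store per cloned speaker.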
Paper 1: Neural Voice Cloning with a Few Samples
● Speaker encoding is based on training a separate model to directly infer a new
speaker embedding, which will be applied to a multi-speaker generative model.
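As a rough illustration of speaker encoding, the sketch below maps a handful of cloning utterances directly to one embedding, with no fine-tuning. Everything here is an assumption made for illustration: the random `projection` stands in for trained encoder weights, the inputs are random mel-like arrays, and plain averaging across utterances is a simplification rather than the encoder described in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_mels, emb_dim = 40, 16

# Hypothetical stand-in for the weights of a separately trained speaker encoder.
projection = rng.normal(size=(n_mels, emb_dim)) / np.sqrt(n_mels)

def encode_utterance(mel):
    # Pool over time, then project to a fixed-dimensional vector.
    return np.tanh(mel.mean(axis=0) @ projection)

def encode_speaker(cloning_mels):
    # Combine per-utterance embeddings into one speaker embedding (simple mean).
    per_utterance = np.stack([encode_utterance(m) for m in cloning_mels])
    return per_utterance.mean(axis=0)

# Three cloning utterances of different lengths (frames x mel bands).
cloning_mels = [rng.normal(size=(n, n_mels)) for n in (120, 95, 143)]
speaker_emb = encode_speaker(cloning_mels)
print(speaker_emb.shape)
```

The output has the same size regardless of how many utterances are supplied or how long they are, which is what lets it be plugged into the multi-speaker generative model in place of a learned embedding.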
Paper 2: Transfer Learning from Speaker Verification to Multispeaker TTS Synthesis
● Multi-speaker speech synthesis model
● Three independently trained neural networks:
○ A recurrent speaker encoder, which computes a fixed-dimensional vector from a speech signal.
○ A sequence-to-sequence synthesizer, which predicts a mel spectrogram from a sequence of phoneme inputs, conditioned on the speaker embedding vector.
○ An autoregressive WaveNet vocoder, which converts the spectrogram into time-domain waveforms.
Fig: Model overview. Each of the three components is trained independently.
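The data flow between the three components can be sketched with placeholder functions. This is only a shape-level illustration: the three functions below are hypothetical stand-ins built from random matrices, not the recurrent encoder, Tacotron 2 synthesizer, or WaveNet vocoder themselves, and all dimensions are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
N_MELS, EMB_DIM, VOCAB, FRAMES_PER_PHONEME, HOP = 80, 256, 50, 5, 200

# Random matrices standing in for three independently trained networks.
ENC_W = rng.normal(size=(N_MELS, EMB_DIM))
PHONEME_TABLE = rng.normal(size=(VOCAB, N_MELS))
SPK_W = rng.normal(size=(EMB_DIM, N_MELS))

def speaker_encoder(reference_mel):
    # Speech signal (frames x mels) -> fixed-dimensional, unit-length vector.
    raw = reference_mel.mean(axis=0) @ ENC_W
    return raw / np.linalg.norm(raw)

def synthesizer(phoneme_ids, speaker_emb):
    # Phoneme sequence + speaker embedding -> mel spectrogram (frames x mels).
    per_phoneme = PHONEME_TABLE[phoneme_ids] + speaker_emb @ SPK_W
    return np.repeat(per_phoneme, FRAMES_PER_PHONEME, axis=0)

def vocoder(mel):
    # Mel spectrogram -> time-domain samples, HOP samples per frame.
    return np.tanh(np.repeat(mel.mean(axis=1), HOP))

def clone_voice(reference_mel, phoneme_ids):
    emb = speaker_encoder(reference_mel)
    mel = synthesizer(phoneme_ids, emb)
    return emb, mel, vocoder(mel)

reference_mel = rng.normal(size=(300, N_MELS))  # stand-in for reference speech
phoneme_ids = np.array([3, 17, 8, 42])          # stand-in for phonemized text
emb, mel, wave = clone_voice(reference_mel, phoneme_ids)
print(emb.shape, mel.shape, wave.shape)
```

Because the only interface between the stages is the embedding vector and the mel spectrogram, each stage can be trained on its own data and swapped out independently.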
Datasets
● LibriSpeech: lower audio quality (2,484 speakers, totalling 820 hours). Used for training in Paper 1.
● VCTK: 108 native speakers of English with various accents (44 hours of clean speech). Used as the cloning dataset in Paper 1 and for training in Paper 2.
Performance: Paper 1
Table: Naturalness, 5-point mean opinion score (MOS)
Table: Summary of the approaches, listing their requirements for training, data, cloning time, and memory footprint.
Performance: Paper 1
Table: Similarity score evaluations, 4-point similarity score
Performance: Paper 2
Table: Naturalness, 5-point mean opinion score (MOS)
Table: Similarity score evaluations, 4-point similarity score
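For readers unfamiliar with how a mean opinion score is summarized, here is a small sketch that averages listener ratings and attaches a 95% confidence interval. The ratings are made-up illustrative numbers, not results from either paper, and the normal approximation (z = 1.96) is an assumption of this sketch.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    # Mean rating plus the half-width of a normal-approximation 95% interval.
    mean = statistics.fmean(ratings)
    standard_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * standard_error

# Hypothetical ratings on a 1-5 naturalness scale.
ratings = [4, 5, 3, 4, 4, 5, 3, 4, 4, 4]
mos, half_width = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} ± {half_width:.2f}")
```

The same function applies to the 4-point similarity ratings; only the rating scale changes.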
Results, Challenges, and Limitations
- The proposed techniques can potentially be improved with better multi-speaker models in the future. (Paper 1)
- The proposed model does not attain human-level naturalness, despite the use of a WaveNet vocoder (along with its very high inference cost), in contrast to the single-speaker results. (Paper 2)
- The training datasets have lower data quality. An additional limitation is the model's inability to transfer accents. (Paper 2)
Demo
Input Audio
Cloned Audio
Ethical Considerations
● Potential for misuse of this technology
● For example, impersonating someone's voice without their consent
● Deepfakes
Q&A
Acknowledgments
“Neural Voice Cloning with a Few Samples”, Sercan O. Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou [NeurIPS 2018], Baidu
“Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis”, Ye Jia, Yu Zhang, Ron J. Weiss, et al. [NeurIPS 2018], Google
“Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning”, Yu Zhang, Ron J. Weiss, et al., Google
