Voice Conversion with Neural-based Speech Generation Model
Yi-Chiao Wu
Toda Lab, Nagoya University
About me
https://bigpon.github.io/
• Education
• Nagoya University, informatics (current)
• Topics: voice conversion, speech synthesis
• National Chiao Tung University, communication engineering (MS, BS)
• Topic: speaker verification
• Work experience
• Academia Sinica, research assistant [2015–2017]
• Topic: voice conversion, speech enhancement
• Asus, software R&D [2013-2015]
• Topic: speaker verification
• Realtek, system designer [2012-2013]
Voice conversion
• Changing the speaker identity of speech
• Keeping the linguistic content consistent
[Figure: Speaker A (source) → speech analysis → feature conversion (source to target) → speech synthesis → Speaker B (target)]
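The analysis, conversion, and synthesis stages of voice conversion can be sketched as a composition of three functions. This is only an illustrative skeleton: `convert_voice`, the `Features` container, and the three toy stand-ins are hypothetical names, and the F0 scaling is a placeholder rather than a conversion method from the talk.

```python
from typing import Any, Callable, Dict, List, Sequence

Features = Dict[str, Any]  # hypothetical container for spectral and prosodic features


def convert_voice(
    source_speech: Sequence[float],
    analyze: Callable[[Sequence[float]], Features],
    convert: Callable[[Features], Features],
    synthesize: Callable[[Features], List[float]],
) -> List[float]:
    """Run speech analysis, feature conversion, then speech synthesis."""
    source_features = analyze(source_speech)    # speaker A speech -> features
    target_features = convert(source_features)  # source features -> target features
    return synthesize(target_features)          # target features -> speech


# Toy stand-ins so the skeleton runs; a real system would use a vocoder
# and a trained conversion model here.
def toy_analyze(speech: Sequence[float]) -> Features:
    return {"spectral": list(speech), "f0": 120.0}


def toy_convert(features: Features) -> Features:
    # Placeholder conversion: keep the spectral part and scale F0.
    return {"spectral": features["spectral"], "f0": features["f0"] * 2.0}


def toy_synthesize(features: Features) -> List[float]:
    return list(features["spectral"])


converted = convert_voice([0.1, -0.2, 0.3], toy_analyze, toy_convert, toy_synthesize)
```

Keeping the three stages as separate callables mirrors the slide: the synthesis stage can be swapped for a neural generation model without touching analysis or conversion.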
Conventional speech synthesis
• Vocoder: voice coder
• Encoder (analyzer)
• decomposing speech into acoustic features
• Decoder (synthesizer)
• synthesizing speech from acoustic features
[Figure: vocoder. Speech → encoder (speech analysis) → spectral and prosodic features → decoder (speech synthesis) → speech]
Neural-based speech generation
• Neural-based decoder (synthesizer)
• Input: acoustic features
• Output: speech samples
[Figure: spectral and prosodic features → neural-based speech generation → speech]
Research overview
1. Build a baseline voice conversion (VC) system with a neural-based generation model (done)
2. Improve the robustness (done)
3. Improve the flexibility (done)
4. Build a real-time system (ongoing)
VC baseline system with a neural generation model
• Source-to-target feature conversion
• Speech generation from converted features
• Voice Conversion Challenge 2018 system
• Y.-C. Wu et al. “The NU non-parallel voice conversion system for
the voice conversion challenge 2018,” Proc. Odyssey, 2018.
[Figure: source features → feature conversion → converted target features → generation model → converted target]
Collapsed speech
• Collapsed speech → Refined speech
• Waveform shape constraint
• Y.-C. Wu et al. “Collapsed speech generation and suppression for
WaveNet vocoder,” Proc. Interspeech, 2018.
• Y.-C. Wu et al. “Non-parallel voice conversion system with
WaveNet vocoder and collapsed speech suppression,” IEEE
Access, 2020.
Acoustic Mismatch
• Mismatch between the training and testing stages
• Speech quality degradation
• Noisy generated speech (collapsed speech)
[Figure: training: target features → generation model → speech; testing: converted target features → generation model → speech; the input features of the two stages are mismatched]
Temporal Mismatch
• TTS postfilter
• Source: artificial speech; target: natural speech
• Source and target have the same data length, but the
temporal structures are different
• Y.-C. Wu et al. “A cyclical post-filtering approach to mismatch
refinement of neural vocoder for text-to-speech systems,” Submitted
to Interspeech, 2020.
[Figure: training: source features → generation model → speech, with a temporal mismatch between the two]
WaveNet [A. Oord+, 2016]
• Audio signals have very long-term dependencies
• A basic RNN cannot model such long-term correlations
• Stacked CNN layers
• Input: a segment of previous samples (receptive field)
• Output: the conditional probability of the current sample
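$P(y_n \mid y_{n-r}, \ldots, y_{n-1})$, where $r$ is the length of the receptive field
[Figure: the previous samples y0, y1, y2, y3 (receptive field) are used to predict p(y4)]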
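The conditional probability of the current sample given the previous samples leads to sample-by-sample generation. A minimal sketch of that loop follows, assuming quantized sample values and a stand-in model; `generate` and `toy_predictor` are hypothetical names, and the toy predictor is not a trained WaveNet.

```python
import random
from typing import Callable, List, Sequence


def generate(
    predict_next: Callable[[Sequence[int]], Sequence[float]],
    seed: Sequence[int],
    n_samples: int,
    receptive_field: int,
    rng: random.Random,
) -> List[int]:
    """Draw samples one at a time from P(y_n | y_{n-r}, ..., y_{n-1})."""
    samples = list(seed)
    for _ in range(n_samples):
        context = samples[-receptive_field:]  # the last r samples
        probs = predict_next(context)         # distribution over quantized values
        next_value = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        samples.append(next_value)
    return samples[len(seed):]


def toy_predictor(context: Sequence[int]) -> List[float]:
    """Hypothetical 4-level model that favors repeating the latest sample."""
    probs = [0.1, 0.1, 0.1, 0.1]
    probs[context[-1]] = 0.7
    return probs


generated = generate(toy_predictor, seed=[0, 1, 2, 3], n_samples=8,
                     receptive_field=4, rng=random.Random(0))
```

Each new sample needs one model evaluation and depends on the samples before it, so generation is inherently sequential in this scheme.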
Dilated CNN [F. Yu+, 2016]
• Convolution with holes (skipped input samples)
• Efficiently extend the receptive field
• Downsampling-like structure makes the network
capture information on different levels
[Figure: stacked dilated convolution layers with dilation sizes 1, 2, and 4, and the resulting receptive field]
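How dilation extends the receptive field can be checked with a little arithmetic. The sketch below assumes stride-1 layers that all share one kernel size, so each layer adds (kernel size − 1) × dilation samples; `receptive_field` is a hypothetical helper name.

```python
from typing import Sequence


def receptive_field(dilations: Sequence[int], kernel_size: int = 2) -> int:
    """Receptive field, in samples, of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


dilated = receptive_field([1, 2, 4])  # three layers with dilation sizes 1, 2, 4
plain = receptive_field([1, 1, 1])    # three layers without dilation
deep = receptive_field([2 ** i for i in range(10)])  # ten layers, doubling dilation
```

With kernel size 2, the three dilated layers cover 8 samples against 4 for three undilated layers, and ten doubling layers cover 1024 samples: the receptive field grows exponentially with depth while the parameter count grows only linearly.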
Is the WaveNet vocoder suitable for speech generation?
• Speech is a quasi-periodic signal
• Periodic part: long-term correlation
• Non-periodic part: short-term correlation
• WaveNet
• Fixed network architecture
• No prior knowledge of speech signals
• Problems of WaveNet as a vocoder
• Inefficient speech signal modeling
• Limited pitch controllability
Quasi-Periodic WaveNet
• Pitch-dependent dilated convolution
• Dynamically change the dilation size
• Model the periodic part with prior F0 knowledge (long-term correlations)
• Cascaded network
• Fixed modules model the non-periodic part with the
nearest samples (short-term correlations)
• Adaptive modules for periodic part
• Y.-C. Wu et al. “Quasi-Periodic WaveNet vocoder: a pitch dependent
dilated convolution model for parametric speech generation,” Proc.
Interspeech, 2019.
Pitch-dependent dilated convolution
• Pitch-dependent dilated factor: $E_t = F_s / (F_{0,t} \times a)$
[Figure: fixed dilated convolution (dilations 1 and 2) compared with pitch-dependent dilated convolution (dilations 1 × E_t and 2 × E_t), showing the receptive field and the effective receptive field of each; the pitch-dependent dilated factors change per sample, e.g. E_t = 2, E_{t-1} = 3]
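The pitch-dependent dilated factor can be computed directly from its definition. In the sketch below, F_s is read as the sampling rate in Hz, F_{0,t} as the fundamental frequency at time t in Hz, and a as a constant hyperparameter. Rounding to the nearest integer and falling back to a factor of 1 for unvoiced frames (F0 ≤ 0) are assumptions of this example, not details taken from the slides.

```python
def pitch_dependent_dilation_factor(fs: float, f0: float, a: float) -> int:
    """E_t = F_s / (F_{0,t} * a), rounded to a whole number of samples."""
    if f0 <= 0:
        return 1  # assumption: behave like a fixed convolution when unvoiced
    return max(1, round(fs / (f0 * a)))


def layer_dilation(base_dilation: int, e_t: int) -> int:
    """Dilation of one adaptive layer: the fixed dilation scaled by E_t."""
    return base_dilation * e_t


e_low = pitch_dependent_dilation_factor(fs=16000, f0=100.0, a=8)   # low pitch
e_high = pitch_dependent_dilation_factor(fs=16000, f0=400.0, a=8)  # high pitch
```

A lower F0 means a longer pitch period, so the factor grows (20 for 100 Hz against 5 for 400 Hz in this example) and the convolution taps spread further apart.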
Effective receptive field
• Fixed number of samples in a receptive field
• Fixed number of samples in one cycle
• The same number of past cycles in an effective
receptive field for arbitrary F0
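A short derivation shows why the number of past cycles does not depend on F0. It is a sketch under stated assumptions: kernel size 2, base dilations $d_1, \ldots, d_L$ in the pitch-dependent layers, F0 constant over the span, and $E_t$ not rounded to an integer. The symbols $T_t$, $R_t$, and $d_i$ are introduced here for illustration.

```latex
\begin{align*}
  T_t &= \frac{F_s}{F_{0,t}} = a \, E_t
      && \text{(pitch period in samples, from } E_t = F_s / (F_{0,t} \times a)\text{)} \\
  R_t &= 1 + E_t \sum_{i=1}^{L} d_i
      && \text{(effective receptive field length in samples)} \\
  \frac{R_t}{T_t} &= \frac{1}{a E_t} + \frac{1}{a} \sum_{i=1}^{L} d_i
      \;\approx\; \frac{1}{a} \sum_{i=1}^{L} d_i
      && \text{(number of past cycles covered)}
\end{align*}
```

The dominant term contains neither $F_s$ nor $F_{0,t}$, so the same layers span roughly the same number of pitch cycles whatever the pitch is.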
Cascaded networks
• Fixed modules for short-term correlations
• Fixed dilated convolution (DCNN)
• Adaptive modules for long-term correlations
• Pitch-dependent dilated convolution (PDCNN)
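[Figure: cascaded network of fixed blocks followed by adaptive blocks, joined by skip connections. The input passes through a causal layer into the residual blocks. Each fixed/adaptive residual block applies a 2×1 dilated convolution and a gated activation that also takes auxiliary features (acoustic features, upsampled and passed through 1×1 layers), then 1×1 layers that feed the next residual block and the skip connection. The skip connections pass through ReLU and 1×1 layers and a softmax to produce the output.]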
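The cascade of fixed and adaptive modules can be pictured as one list of per-layer dilations. The sketch below assumes that dilations double within each block, that fixed blocks come before adaptive blocks as in the figure, and that a single E_t value applies to the whole stack; in the actual model E_t changes per sample. `cascaded_dilations` and its parameters are hypothetical names.

```python
from typing import List


def cascaded_dilations(
    n_fixed_blocks: int,
    n_adaptive_blocks: int,
    layers_per_block: int,
    e_t: int,
) -> List[int]:
    """Per-layer dilations: fixed (DCNN) blocks, then adaptive (PDCNN) blocks."""
    base = [2 ** i for i in range(layers_per_block)]  # 1, 2, 4, ...
    fixed = base * n_fixed_blocks                     # short-term correlations
    adaptive = [d * e_t for d in base] * n_adaptive_blocks  # long-term correlations
    return fixed + adaptive


schedule = cascaded_dilations(n_fixed_blocks=1, n_adaptive_blocks=1,
                              layers_per_block=3, e_t=10)
```

For this example the schedule is `[1, 2, 4, 10, 20, 40]`: the fixed layers look at the nearest samples, while the adaptive layers reach back by multiples of the pitch-dependent factor.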
Parallel WaveGAN [R. Yamamoto+, 2020]
• GAN structure for speech waveform generation
• WaveNet-like generator
• Non-autoregressive and non-causal
• Fast (real-time factor (RTF): 0.02 with one Titan V GPU)
• Compact (3% of WaveNet)
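[Figure: Parallel WaveGAN. The generator (G) maps Gaussian noise and acoustic features to generated speech. In training, the discriminator (D) compares generated speech with natural speech, giving the discriminator loss L_D and the adversarial loss L_adv; a multi-resolution STFT loss L_sp is computed between generated and natural speech; the generator loss L_G combines L_sp with L_adv weighted by λ_adv. In synthesis, only the generator is used.]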
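The loss labels in the figure (L_D, L_adv, L_sp, L_G, λ_adv) can be tied together in a few lines. The combination L_G = L_sp + λ_adv × L_adv follows the figure; the least-squares form of the adversarial terms is an assumption of this sketch, and the function names are hypothetical. Discriminator outputs are passed in as plain lists of scores, and the multi-resolution STFT loss is taken as an already computed number.

```python
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """L_D: push scores of natural speech toward 1 and of generated speech toward 0."""
    real_term = _mean([(1.0 - r) ** 2 for r in d_real])
    fake_term = _mean([f ** 2 for f in d_fake])
    return real_term + fake_term


def adversarial_loss(d_fake: Sequence[float]) -> float:
    """L_adv: the generator wants the discriminator to score its output as 1."""
    return _mean([(1.0 - f) ** 2 for f in d_fake])


def generator_loss(l_sp: float, l_adv: float, lambda_adv: float) -> float:
    """L_G = L_sp + lambda_adv * L_adv."""
    return l_sp + lambda_adv * l_adv


# Example with made-up discriminator scores and a made-up STFT loss value.
l_adv = adversarial_loss([0.5, 0.5])
l_g = generator_loss(l_sp=0.5, l_adv=l_adv, lambda_adv=4.0)
```

The weight λ_adv balances the spectral term, which keeps the output close to natural speech in the time-frequency domain, against the adversarial term.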
Quasi-Periodic Parallel WaveGAN
• Pitch-dependent dilated convolution
• Cascaded network
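[Figure: Quasi-Periodic Parallel WaveGAN generator. Gaussian noise passes through a 1×1 layer into a cascade of adaptive blocks followed by fixed blocks (macroblock 0 and macroblock 1), joined by skip connections. Each adaptive/fixed residual block applies a 3×1 dilated convolution and a gated activation that also takes auxiliary features (acoustic features, upsampled and passed through 1×1 layers), then 1×1 layers that feed the next residual block and the skip connection. The skip connections pass through ReLU and 1×1 layers to produce the generated speech.]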
• Y.-C. Wu et al. “Quasi-Periodic Parallel WaveGAN vocoder: a
non-autoregressive pitch-dependent dilated convolution model for parametric
speech generation,” Submitted to Interspeech, 2020.
Other works
• Voice conversion (VC)
• Exemplar VC w/ LLE [Y.-C. Wu+, 2016]
• Variational AutoEncoder (VAE) [C.-C. Hsu+, 2016]
• VAE w/ WGAN [C.-C. Hsu+, 2016]
• CycleVC [P. L. Tobing+, 2019]
• Seq2Seq Transformer VC [W.-C. Huang+, 2020]
• Speech enhancement (SE)
• Exemplar VC w/ LLE for SE [Y.-C. Wu+, 2017]
Thank you for your attention!
https://bigpon.github.io/QuasiPeriodicParallelWaveGAN_demo/
https://bigpon.github.io/QuasiPeriodicWaveNet_demo/
https://bigpon.github.io/LpcConstrainedWaveNet_demo/