©Yuki Saito, 13/11/2018
GENERATIVE APPROACH USING THE NOISE GENERATION MODELS FOR DNN-BASED SPEECH SYNTHESIS TRAINED FROM NOISY SPEECH
Masakazu Une1, Yuki Saito2, Shinnosuke Takamichi2, Daichi Kitamura3, Ryoichi Miyazaki1, and Hiroshi Saruwatari2
1NIT, Tokuyama College, Japan, 2The Univ. of Tokyo, Japan, 3NIT, Kagawa College, Japan
APSIPA-ASC 2018 TU-P1-5.1
Text-To-Speech (TTS) synthesis using Deep Neural Networks (DNNs)

• Text-To-Speech (TTS) synthesis
• TTS using Deep Neural Networks (DNNs) [Zen et al., 2013]

[Diagram: Text → text analysis → linguistic features → DNN-based acoustic models → speech params. → speech synthesis → Speech]

To realize high-quality TTS, studio-quality clean speech data is required for training the DNNs.
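To make the acoustic-model block in this pipeline concrete, here is a minimal sketch assuming PyTorch: a feed-forward network mapping linguistic features to speech parameters. The 442-dimensional input and 257-dimensional output follow the experimental conditions later in this talk; the class name, hidden size, and depth are illustrative and not taken from the paper.

```python
# Minimal sketch (PyTorch assumed) of the DNN acoustic-model block above:
# a feed-forward network mapping linguistic features to speech parameters.
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, in_dim=442, out_dim=257, hidden=512):
        # 442-dim linguistic features and 257-dim log amplitude spectra follow the
        # experimental conditions later in this talk; the hidden size is illustrative.
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, linguistic_feats):
        # (batch_of_frames, 442) in -> (batch_of_frames, 257) predicted speech params out.
        return self.net(linguistic_feats)
```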
Outline of this talk

• Goal: realizing high-quality TTS using NOISY speech data
• Common approach: noise reduction before training DNNs
– Error caused by the noise reduction is propagated to TTS training...
• Proposed: training DNNs considering the noise additive process
– GAN*-based noise generation models are introduced into TTS training.
• Results: improved synthetic speech quality

*Generative Adversarial Network [Goodfellow et al., 2014]

[Diagram: conventional approach: Noisy (observed) → noise reduction → Clean (estimated) → TTS; proposed approach: TTS targets the Clean (unobserved) speech, and noise generation models with noise addition relate it to the Noisy (observed) speech]
Noise reduction using Spectral Subtraction (SS)*

• The amplitude spectrum after noise reduction, $y_\mathrm{s}^{(\mathrm{SS})}(t, f)$, is calculated from the noisy amplitude spectrum $y_\mathrm{ns}(t, f)$ as:

$$
y_\mathrm{s}^{(\mathrm{SS})}(t, f) =
\begin{cases}
\sqrt{y_\mathrm{ns}^{2}(t, f) - \beta\, \bar{y}_\mathrm{n}^{2}(f)} & \text{if } y_\mathrm{ns}^{2}(t, f) - \beta\, \bar{y}_\mathrm{n}^{2}(f) > 0 \\
0 & \text{otherwise}
\end{cases}
$$

• The estimated average power of the noise, $\bar{y}_\mathrm{n}^{2}(f)$, is defined as:

$$
\bar{y}_\mathrm{n}^{2}(f) = \frac{1}{T_\mathrm{n}} \sum_{t=1}^{T_\mathrm{n}} y_\mathrm{n}^{2}(t, f)
\qquad (T_\mathrm{n}\text{: total frame length of the noise})
$$

• Limitations
– Approximating the noise distribution with its expectation value $\bar{y}_\mathrm{n}^{2}$
– Causing a trade-off between noise reduction and speech distortion due to the setting of the hyper-parameter $\beta$ (noise suppression ratio)

*[Boll, 1979]
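A minimal sketch of the SS step above, assuming NumPy. The function name and the power-spectrogram inputs are illustrative, and the square root at the end reflects the amplitude-domain reading of the formula used here.

```python
# Minimal sketch (NumPy assumed) of power-domain spectral subtraction as described
# above; variable names and the STFT front-end are illustrative, not from the paper.
import numpy as np

def spectral_subtraction(noisy_power, noise_power, beta=1.0):
    """noisy_power: |y_ns(t, f)|^2, shape (T, F); noise_power: |y_n(t, f)|^2 from a
    non-speech segment, shape (T_n, F); beta: noise suppression ratio."""
    # Average noise power per frequency bin (the \bar{y}_n^2(f) above).
    noise_mean = noise_power.mean(axis=0, keepdims=True)      # shape (1, F)
    # Subtract, then half-wave rectify (the "0 otherwise" branch).
    residual = noisy_power - beta * noise_mean
    clean_power = np.maximum(residual, 0.0)
    # Amplitude spectrum of the estimated clean speech.
    return np.sqrt(clean_power)
```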
Training TTS from noisy speech using SS

[Diagram: the noisy amplitude spectra $\boldsymbol{y}_\mathrm{ns}$ are passed through noise reduction using SS to obtain the estimated clean amplitude spectra $\boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}$; the TTS model maps linguistic features to the predicted clean amplitude spectra $\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})}$, and the two are compared with the mean squared error $L_\mathrm{MSE}(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})}, \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})})$]

→ Minimize

$$
L_\mathrm{MSE}\!\left(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})}, \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}\right)
= \frac{1}{T}\left(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})} - \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}\right)^{\!\top}
\left(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})} - \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}\right)
\qquad (T\text{: total frame length of the features})
$$
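The conventional training loop then reduces to ordinary MSE regression onto the SS output. Below is a minimal sketch assuming PyTorch; tts_model, the optimizer, and the tensor shapes are illustrative (the shapes follow the experimental conditions), not code from the paper.

```python
# Minimal sketch (PyTorch assumed) of the conventional SS+MSE training step:
# the SS-cleaned spectrum is used as the regression target for the TTS model.
import torch.nn.functional as F

def ss_mse_step(tts_model, optimizer, linguistic_feats, y_s_ss):
    """linguistic_feats: (T, 442); y_s_ss: SS-estimated clean spectra, (T, 257)."""
    y_s_hat = tts_model(linguistic_feats)       # predicted clean spectra
    loss = F.mse_loss(y_s_hat, y_s_ss)          # L_MSE(ŷ_s^(SS), y_s^(SS))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```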
Issues in training TTS using SS

• 1. Speech distortion caused by the error of SS
• 2. Propagation of that distortion by using $\boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}$ as the target vector

[Diagram: the same SS-based training pipeline as on the previous slide, with the loss $L_\mathrm{MSE}(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})}, \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})})$]

These issues significantly degrade synthetic speech quality...
Proposed algorithm:
Training TTS using noise generation models based on GANs
Overview of the proposed algorithm

[Diagram: the TTS model maps linguistic features to the predicted clean spectra $\hat{\boldsymbol{y}}_\mathrm{s}$; the pre-trained noise generation model $G_\mathrm{n}(\cdot)$ maps prior noise to the generated noise $\boldsymbol{n}$; noise addition combines them into the estimated noisy spectra $\hat{\boldsymbol{y}}_\mathrm{ns}$, which are compared with the observed noisy spectra $\boldsymbol{y}_\mathrm{ns}$ through $L_\mathrm{MSE}(\hat{\boldsymbol{y}}_\mathrm{ns}, \boldsymbol{y}_\mathrm{ns})$]

We want $G_\mathrm{n}(\cdot)$ to model the distribution of the observed noise.
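A minimal sketch of this training step, assuming PyTorch. The uniform prior follows the experimental conditions, but the rest is illustrative: the attribute noise_generator.in_dim, keeping G_n frozen during TTS training, and adding the generated noise directly to the predicted spectra are assumptions of the sketch, not details confirmed by the slides.

```python
# Minimal sketch (PyTorch assumed) of the proposed training step: the TTS model
# predicts clean spectra, a pre-trained noise generator supplies noise, and the
# MSE is taken against the *observed noisy* spectra. Names are illustrative.
import torch
import torch.nn.functional as F

def proposed_step(tts_model, noise_generator, optimizer, linguistic_feats, y_ns):
    """linguistic_feats: (T, 442); y_ns: observed noisy amplitude spectra, (T, 257)."""
    y_s_hat = tts_model(linguistic_feats)                # predicted clean spectra
    with torch.no_grad():                                # G_n is pre-trained and kept fixed in this sketch
        prior = torch.rand(y_ns.shape[0], noise_generator.in_dim)   # uniform prior noise
        n_gen = noise_generator(prior)                   # generated noise
    y_ns_hat = y_s_hat + n_gen                           # noise addition (assumed additive in the spectral domain)
    loss = F.mse_loss(y_ns_hat, y_ns)                    # L_MSE(ŷ_ns, y_ns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```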
Pre-training of noise generation models based on GANs

[Diagram: prior noise → noise generation model $G_\mathrm{n}(\cdot)$ → generated noise $\boldsymbol{n}$; non-speech periods are extracted from the noisy speech $\boldsymbol{y}_\mathrm{ns}$ to obtain the observed noise $\boldsymbol{y}_\mathrm{n}$; the discriminative model $D(\cdot)$ receives either the observed or the generated noise and is trained to output 1 for observed and 0 for generated]

$$
\min_{G_\mathrm{n}} \max_{D} V(G_\mathrm{n}, D)
= \mathbb{E}\!\left[\log D(\boldsymbol{y}_\mathrm{n})\right]
+ \mathbb{E}\!\left[\log\left(1 - D(\boldsymbol{n})\right)\right]
$$

This minimizes the approximated Jensen–Shannon (JS) divergence between the distributions of the observed noise $\boldsymbol{y}_\mathrm{n}$ and the generated noise $\boldsymbol{n}$.
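A minimal sketch of this pre-training, assuming PyTorch. One deliberate deviation: the generator update uses the common non-saturating GAN loss rather than the minimax form above. The architectures of G_n and D, the learning rates, and the batching are illustrative.

```python
# Minimal sketch (PyTorch assumed) of GAN pre-training of the noise generation model:
# G_n maps uniform prior noise to spectral-domain noise, D separates observed noise
# frames (label 1) from generated ones (label 0). D is assumed to output one logit
# per example, shape (batch, 1). Architectures and hyper-parameters are illustrative.
import torch
import torch.nn as nn

def pretrain_noise_gan(G_n, D, observed_noise, prior_dim, steps=10000, batch=64):
    """observed_noise: (N, F) tensor of noise spectra taken from non-speech periods."""
    bce = nn.BCEWithLogitsLoss()                     # sigmoid is applied inside the loss
    opt_g = torch.optim.Adam(G_n.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    for _ in range(steps):
        idx = torch.randint(0, observed_noise.shape[0], (batch,))
        y_n = observed_noise[idx]                    # observed noise y_n
        n_gen = G_n(torch.rand(batch, prior_dim))    # generated noise n = G_n(prior)

        # Discriminator step: push D(y_n) toward 1 and D(n) toward 0.
        d_loss = bce(D(y_n), torch.ones(batch, 1)) + bce(D(n_gen.detach()), torch.zeros(batch, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: fool D (non-saturating form of the objective above).
        g_loss = bce(D(G_n(torch.rand(batch, prior_dim))), torch.ones(batch, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return G_n
```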
Comparison of observed/generated noise (generating Gaussian noise from uniform noise)

[Figure: spectrograms (frequency [kHz] vs. time [s]) and amplitude histograms of the observed noise and the generated noise]

Our noise generation models effectively reproduce the characteristics of the observed noise!
Discussion of the proposed algorithm

• Modeling the distribution of stationary noise by using GANs
– Musical noise [Miyazaki et al., 2012] (an unpleasant sound) can be reduced.
– By using recurrent networks, the distribution of non-stationary noise can also be modeled by our algorithm.
• Extending the proposed algorithm
– The distribution of context-dependent noise (e.g., pop noise) can be captured by using conditional GANs [Mirza et al., 2015].
– By using WaveNet [Oord et al., 2016], the noise distribution can be modeled in the waveform domain.
• Adapting the TTS or noise generation models
– Pre-recorded clean speech data can be used to build the initial models used in our algorithm.
Experimental evaluations
Experimental conditions

Dataset: Japanese female speaker (subset of the JSUT corpus [Sonobe et al., 2017])
Training / evaluation data: 3,000 / 53 sentences (16 kHz sampling)
Linguistic feats.: 442-dimensional vector (phoneme, accent type, F0, UV, duration, etc.)
Speech params.: 257-dimensional log amplitude spectrum
Waveform synthesis: Griffin & Lim's method [Griffin et al., 1986]
Prior / observed noise: Uniform / Gaussian (artificially added)
DNN architectures: Feed-forward (details are written in our manuscript)
Noise suppression ratio of SS (β): 0.5, 1.0, 2.0, and 5.0 (a larger value means stronger noise reduction)
Input SNR: 0, 5, and 10 [dB]
Evaluation method: Preference AB test in terms of speech quality (25 participants / evaluation)
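For the waveform-synthesis row, a minimal sketch of Griffin & Lim's method assuming librosa: the predicted 257-dimensional log amplitude spectra are exponentiated and the phase is reconstructed iteratively. The FFT size follows from the 257 bins (n_fft = 512); the hop length, window length, and iteration count are assumptions, not values from the paper.

```python
# Minimal sketch (librosa assumed) of the waveform-synthesis step: predicted log
# amplitude spectra -> linear magnitudes -> Griffin-Lim phase reconstruction.
import numpy as np
import librosa

def synthesize_waveform(log_amp_spectra, n_fft=512, hop_length=80, sr=16000):
    """log_amp_spectra: (T, 257) predicted log amplitude spectra (257 = 1 + n_fft / 2)."""
    magnitude = np.exp(log_amp_spectra).T      # librosa expects shape (n_freq, n_frames)
    waveform = librosa.griffinlim(magnitude, n_iter=60, hop_length=hop_length,
                                  win_length=n_fft)
    return waveform, sr
```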
Results of subjective evaluation of speech quality (input SNR = 0 [dB])

[Chart: preference scores of SS+MSE (β = 0.5, 1.0, 2.0, 5.0) vs. Proposed; the proposed method scored 0.632–0.747, against 0.253–0.368 for the SS+MSE baselines]

In all cases, the p-values between the methods were smaller than 10⁻⁶.
Our algorithm significantly improves speech quality compared with TTS using SS!
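The slides do not state which statistical test produced these p-values; as one common choice for AB preference tests, a two-sided binomial test against the 50% chance level can be computed as sketched below (SciPy assumed). The response counts in the example are hypothetical.

```python
# Minimal sketch (SciPy assumed) of testing a preference score against chance level.
from scipy.stats import binomtest

def preference_p_value(n_preferred_proposed, n_total):
    """n_preferred_proposed: responses choosing the proposed method;
    n_total: all responses in that AB comparison."""
    result = binomtest(n_preferred_proposed, n_total, p=0.5, alternative="two-sided")
    return result.pvalue

# Hypothetical counts: 747 of 1000 responses prefer the proposed method.
print(preference_p_value(747, 1000))   # far below 1e-6
```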
Results of subjective evaluation of speech quality (input SNR = 5 [dB])

[Chart: preference scores of SS+MSE (β = 0.5, 1.0, 2.0, 5.0) vs. Proposed; the proposed method scored 0.677–0.784, against 0.216–0.323 for the SS+MSE baselines]

In all cases, the p-values between the methods were smaller than 10⁻⁶.
Our algorithm significantly improves speech quality compared with TTS using SS!
Results of subjective evaluation of speech quality (input SNR = 10 [dB])

[Chart: preference scores of SS+MSE (β = 0.5, 1.0, 2.0, 5.0) vs. Proposed; the proposed method scored 0.707–0.744, against 0.256–0.292 for the SS+MSE baselines]

In all cases, the p-values between the methods were smaller than 10⁻⁶.
Our algorithm significantly improves speech quality compared with TTS using SS!
Conclusion

• Purpose
– Training high-quality TTS using noisy speech data
• Proposed
– A training algorithm that considers the noise additive process
• Our noise generation models can learn the distribution of the observed noise through GAN-based training.
• Results
– Improved synthetic speech quality compared with TTS using SS
• Future work
– Modeling non-stationary noise with the proposed algorithm
• Using richer DNN architectures (e.g., long short-term memory)
– Comparing our algorithm with state-of-the-art noise suppression methods

Thank you for your attention!
