Une18apsipa

Yuki Saito

APSIPA-ASC 2018 @ Hawaii

©Yuki Saito, 13/11/2018
GENERATIVE APPROACH USING THE NOISE GENERATION MODELS FOR DNN-BASED SPEECH SYNTHESIS TRAINED FROM NOISY SPEECH
Masakazu Une (1), Yuki Saito (2), Shinnosuke Takamichi (2), Daichi Kitamura (3), Ryoichi Miyazaki (1), and Hiroshi Saruwatari (2)
(1) NIT, Tokuyama College, Japan; (2) The Univ. of Tokyo, Japan; (3) NIT, Kagawa College, Japan
APSIPA-ASC 2018 TU-P1-5.1
Text-To-Speech (TTS) synthesis using Deep Neural Networks (DNNs)
• Text-To-Speech (TTS) synthesis
• TTS using Deep Neural Networks (DNNs) [Zen et al., 2013]
[Figure: TTS pipeline. Text → text analysis → linguistic features → DNN-based acoustic models → speech params. → speech synthesis → speech]
To realize high-quality TTS, studio-quality clean speech data is required for training the DNNs.
Outline of this talk
• Goal: realizing high-quality TTS using NOISY speech data
• Common approach: noise reduction before training DNNs
  – Error caused by the noise reduction is propagated to TTS training...
• Proposed: training DNNs while taking the additive-noise process into account
  – GAN*-based noise generation models are introduced to TTS training.
• Results: improved synthetic speech quality
*Generative Adversarial Network [Goodfellow et al., 2014]
[Figure: two pipelines. Common approach, with labels: Noisy (observed), Noise reduction, Clean (estimated), TTS. Proposed approach, with labels: TTS, Clean (unobserved), Noise addition, Noise generation models, Noisy (observed)]
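Noise reduction using Spectral Subtraction (SS)*
• Amplitude spectra after noise reduction $\boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}$ are calculated as:
  – $y_{\mathrm{s}}^{(\mathrm{SS})}(t,f) = \begin{cases} \sqrt{y_{\mathrm{ns}}^{2}(t,f) - \beta\,\bar{y}_{\mathrm{n}}^{2}(f)} & \text{if } y_{\mathrm{ns}}^{2}(t,f) - \beta\,\bar{y}_{\mathrm{n}}^{2}(f) > 0 \\ 0 & \text{otherwise} \end{cases}$
• The estimated average power of noise $\bar{\boldsymbol{y}}_{\mathrm{n}}^{2}$ is defined as:
  – $\bar{y}_{\mathrm{n}}^{2}(f) = \frac{1}{T_{\mathrm{n}}} \sum_{t=1}^{T_{\mathrm{n}}} y_{\mathrm{n}}^{2}(t,f)$ ($T_{\mathrm{n}}$: total frame length of the noise)
• Limitations
  – Approximating the noise distribution with its expectation value $\bar{\boldsymbol{y}}_{\mathrm{n}}^{2}$
  – Causing a trade-off between noise reduction and speech distortion, depending on the setting of the hyper-parameter $\beta$ (noise suppression ratio)
*[Boll, 1979]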
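The spectral subtraction rule on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `spectral_subtraction`, the array layout (frames by frequency bins), and the toy input values are all assumptions made for the example.

```python
import numpy as np

def spectral_subtraction(y_ns, y_n, beta=1.0):
    """Sketch of SS on amplitude spectra (hypothetical helper).

    y_ns: (T, F) noisy amplitude spectra.
    y_n:  (T_n, F) amplitude spectra of a noise-only segment.
    beta: noise suppression ratio.
    """
    # Average noise power per frequency bin over the T_n noise frames.
    noise_power = np.mean(y_n ** 2, axis=0)       # shape (F,)
    # Subtract the scaled average noise power from the noisy power.
    diff = y_ns ** 2 - beta * noise_power         # shape (T, F)
    # Square root where the difference is positive, 0 otherwise.
    return np.sqrt(np.maximum(diff, 0.0))

# Toy values chosen only to exercise both branches of the rule.
y_ns = np.array([[3.0, 1.0], [5.0, 2.0]])
y_n = np.array([[2.0, 2.0], [2.0, 2.0]])
out = spectral_subtraction(y_ns, y_n, beta=1.0)
print(out)
```

In this toy case the second frequency bin is floored to zero in both frames, because the noisy power there does not exceed the subtracted noise power.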
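Training TTS from noisy speech using SS
[Figure: linguistic features → TTS → predicted clean amplitude spectra $\hat{\boldsymbol{y}}_{\mathrm{s}}^{(\mathrm{SS})}$; noisy amplitude spectra $\boldsymbol{y}_{\mathrm{ns}}$ → noise reduction using SS → estimated clean amplitude spectra $\boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}$; mean squared error $L_{\mathrm{MSE}}(\boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}, \hat{\boldsymbol{y}}_{\mathrm{s}}^{(\mathrm{SS})})$ computed between the two]
→ Minimize $L_{\mathrm{MSE}}(\boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}, \hat{\boldsymbol{y}}_{\mathrm{s}}^{(\mathrm{SS})}) = \frac{1}{T} (\hat{\boldsymbol{y}}_{\mathrm{s}}^{(\mathrm{SS})} - \boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})})^{\top} (\hat{\boldsymbol{y}}_{\mathrm{s}}^{(\mathrm{SS})} - \boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})})$
$T$: total frame length of the features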
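The training criterion above is a squared error between the SS output (the target) and the TTS prediction, divided by the number of frames. A minimal NumPy sketch follows. The function name `mse_loss`, the (T, F) array layout, and the toy values are assumptions for illustration, and the DNN that would produce the prediction is left out.

```python
import numpy as np

def mse_loss(y_target, y_pred):
    """(1/T) * (y_pred - y_target)^T (y_pred - y_target) (hypothetical helper).

    y_target: (T, F) amplitude spectra after spectral subtraction.
    y_pred:   (T, F) amplitude spectra predicted by the acoustic model.
    """
    T = y_target.shape[0]
    # Stack all frames into one long vector, as in the vector form of the loss.
    diff = (y_pred - y_target).reshape(-1)
    return float(diff @ diff) / T

# Toy values only; in practice y_pred would come from the DNN.
y_target = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.0, 1.0], [1.0, 4.0]])
loss = mse_loss(y_target, y_pred)
print(loss)
```

Note that the sum is normalized by the frame count T only, not by the feature dimension, which follows the 1/T factor in the formula.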
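Issues in training TTS using SS
• 1. Speech distortion caused by error of SS
• 2. Propagation of the distortion by using $\boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}$ as a target vector
[Figure: noisy amplitude spectra $\boldsymbol{y}_{\mathrm{ns}}$ → noise reduction using SS → $\boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}$; TTS → $\hat{\boldsymbol{y}}_{\mathrm{s}}^{(\mathrm{SS})}$; $L_{\mathrm{MSE}}(\boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}, \hat{\boldsymbol{y}}_{\mathrm{s}}^{(\mathrm{SS})})$ computed between the two]
These issues significantly degrade synthetic speech quality...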
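A toy experiment can make issue 1 concrete. The sketch below uses purely synthetic data and is not from the original study. It assumes that speech and noise powers add, draws frame-varying noise, and applies spectral subtraction with several values of the noise suppression ratio. The names `clean`, `noise`, `ss`, and `results` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F = 200, 8

# Synthetic stand-ins (assumptions, not real speech data).
clean = rng.uniform(0.5, 2.0, size=(T, F))   # "clean" amplitude spectra
noise = rng.uniform(0.0, 1.5, size=(T, F))   # frame-varying noise amplitudes
noisy = np.sqrt(clean ** 2 + noise ** 2)     # assumes powers are additive

# SS only sees the average noise power per frequency bin.
noise_power = np.mean(noise ** 2, axis=0)

def ss(noisy_amp, avg_noise_power, beta):
    """Spectral subtraction with flooring at zero."""
    return np.sqrt(np.maximum(noisy_amp ** 2 - beta * avg_noise_power, 0.0))

results = {}
for beta in (0.5, 1.0, 2.0, 4.0):
    est = ss(noisy, noise_power, beta)
    distortion = float(np.mean((est - clean) ** 2))  # error vs. true clean spectra
    zeroed = float(np.mean(est == 0.0))              # fraction of bins floored to 0
    results[beta] = (distortion, zeroed)
    print(f"beta={beta}: distortion={distortion:.4f}, zeroed={zeroed:.3f}")
```

Because the per-frame noise deviates from its average, the SS output never matches the clean spectra exactly, and a larger suppression ratio floors more bins to zero. A model trained with such an output as its target inherits this error, which is issue 2.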

Rootstock scion and Interstock Relationship Selection of Elite Mother Plants
 
Salesforce Starter Package Presentation.
Salesforce Starter Package Presentation.Salesforce Starter Package Presentation.
Salesforce Starter Package Presentation.
 
Study of X - Ray Spectra and its types
Study  of X  - Ray Spectra and its typesStudy  of X  - Ray Spectra and its types
Study of X - Ray Spectra and its types
 
REARING EQUIPMENT IN SERICULTURE . pptx
REARING EQUIPMENT IN SERICULTURE . pptxREARING EQUIPMENT IN SERICULTURE . pptx
REARING EQUIPMENT IN SERICULTURE . pptx
 
Earth and Planetary Science | Volume 01 | Issue 01 | April 2022
Earth and Planetary Science | Volume 01 | Issue 01 | April 2022Earth and Planetary Science | Volume 01 | Issue 01 | April 2022
Earth and Planetary Science | Volume 01 | Issue 01 | April 2022
 

Une18apsipa

  • 1. GENERATIVE APPROACH USING THE NOISE GENERATION MODELS FOR DNN-BASED SPEECH SYNTHESIS TRAINED FROM NOISY SPEECH
      – Masakazu Une (1), Yuki Saito (2), Shinnosuke Takamichi (2), Daichi Kitamura (3), Ryoichi Miyazaki (1), and Hiroshi Saruwatari (2)
      – (1) NIT, Tokuyama College, Japan; (2) The Univ. of Tokyo, Japan; (3) NIT, Kagawa College, Japan
      – APSIPA-ASC 2018, TU-P1-5.1. ©Yuki Saito, 13/11/2018
  • 2. Text-To-Speech (TTS) synthesis using Deep Neural Networks (DNNs)
      – Text-To-Speech (TTS) synthesis: Text → Text analysis → Linguistic features → DNN-based acoustic models → Speech params. → Speech synthesis → Speech
      – TTS using Deep Neural Networks (DNNs) [Zen et al., 2013]
      – To realize high-quality TTS, studio-quality clean speech data is required for training the DNNs.
  • 3. Outline of this talk
      – Goal: realizing high-quality TTS using NOISY speech data
      – Common approach: noise reduction before training DNNs. Error caused by the noise reduction is propagated to TTS training.
        [Diagram: Noisy (observed) → Noise reduction → Clean (estimated) → TTS]
      – Proposed: training DNNs considering the noise additive process. GAN*-based noise generation models are introduced to TTS training.
        [Diagram: TTS → Clean (unobserved) → Noise addition, fed by noise generation models → Noisy (observed)]
      – Results: improving synthetic speech quality
      – *Generative Adversarial Network [Goodfellow et al., 2014]
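  • 4. Noise reduction using Spectral Subtraction (SS) [Boll, 1979]
      – Amplitude spectra after noise reduction, $\boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}$, are calculated as:
        $$y_{\mathrm{s}}^{(\mathrm{SS})}(t, f) = \begin{cases} \sqrt{y_{\mathrm{ns}}^{2}(t, f) - \beta\, \overline{y_{\mathrm{n}}^{2}}(f)} & \text{if } y_{\mathrm{ns}}^{2}(t, f) - \beta\, \overline{y_{\mathrm{n}}^{2}}(f) > 0 \\ 0 & \text{otherwise} \end{cases}$$
      – The estimated average power of noise, $\overline{\boldsymbol{y}_{\mathrm{n}}^{2}}$, is defined as:
        $$\overline{y_{\mathrm{n}}^{2}}(f) = \frac{1}{T_{\mathrm{n}}} \sum_{t=1}^{T_{\mathrm{n}}} y_{\mathrm{n}}^{2}(t, f) \quad (T_{\mathrm{n}}\text{: total frame length of the noise})$$
      – Limitations:
        · Approximating the noise distribution with its expectation value $\overline{\boldsymbol{y}_{\mathrm{n}}^{2}}$
        · Causing a trade-off between noise reduction and speech distortion due to the setting of the hyper-parameter $\beta$ (noise suppression ratio)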
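The spectral subtraction rule on this slide can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `average_noise_power` and `spectral_subtraction` and the toy arrays are assumptions, and the square root assumes the subtraction is done on power spectra and mapped back to amplitude.

```python
import numpy as np


def average_noise_power(noise_amp):
    """Average noise power per frequency bin.

    noise_amp: array of shape (T_n, F) holding amplitude spectra
    of noise-only frames (hypothetical input layout).
    """
    return np.mean(noise_amp ** 2, axis=0)


def spectral_subtraction(noisy_amp, noise_power, beta=1.0):
    """Subtract beta * average noise power from the noisy power spectra.

    Bins whose residual power is not positive are floored to 0,
    as in the piecewise definition on the slide.
    """
    residual = noisy_amp ** 2 - beta * noise_power  # broadcast over frames
    return np.sqrt(np.maximum(residual, 0.0))


# Toy data: 2 frames x 2 frequency bins (values are illustrative only).
noise_amp = np.array([[1.0, 2.0], [1.0, 2.0]])
noisy_amp = np.array([[2.0, 1.0], [1.0, 3.0]])

noise_power = average_noise_power(noise_amp)
clean_est = spectral_subtraction(noisy_amp, noise_power, beta=1.0)
```

A larger `beta` removes more noise power but zeroes more bins, which is the trade-off between noise reduction and speech distortion listed under the limitations.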
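  • 5. Training TTS from noisy speech using SS
      – Linguistic features → TTS → predicted clean amplitude spectra $\hat{\boldsymbol{y}}_{\mathrm{s}}^{(\mathrm{SS})}$
      – Noisy amplitude spectra $\boldsymbol{y}_{\mathrm{ns}}$ → Noise reduction using SS → estimated clean amplitude spectra $\boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}$
      – Minimize the mean squared error:
        $$L_{\mathrm{MSE}}\left(\boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}, \hat{\boldsymbol{y}}_{\mathrm{s}}^{(\mathrm{SS})}\right) = \frac{1}{T} \left(\hat{\boldsymbol{y}}_{\mathrm{s}}^{(\mathrm{SS})} - \boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}\right)^{\top} \left(\hat{\boldsymbol{y}}_{\mathrm{s}}^{(\mathrm{SS})} - \boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}\right) \quad (T\text{: total frame length of the features})$$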
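The MSE criterion above is a squared error summed over all frames and frequency bins, then divided by the number of frames. A minimal sketch, assuming the spectra are stored as (T, F) arrays; the name `mse_loss` and the toy inputs are illustrative, not taken from the slides:

```python
import numpy as np


def mse_loss(target, predicted):
    """L_MSE = (1/T) * (predicted - target)^T (predicted - target).

    Both inputs have shape (T, F); they are flattened so that the
    inner product runs over every frame and frequency bin.
    """
    num_frames = target.shape[0]
    diff = (predicted - target).ravel()
    return float(diff @ diff) / num_frames


# Toy example: 2 frames x 3 bins, every bin off by exactly 1.
target = np.zeros((2, 3))
predicted = np.ones((2, 3))
loss = mse_loss(target, predicted)
```

When the target is the output of spectral subtraction, any error left by the noise reduction step becomes part of what the TTS model is trained to reproduce.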
  • 6. Issues in training TTS using SS
      – 1. Speech distortion caused by error of SS
      – 2. Propagation of the distortion by using $\boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}$ as a target vector in $L_{\mathrm{MSE}}\left(\boldsymbol{y}_{\mathrm{s}}^{(\mathrm{SS})}, \hat{\boldsymbol{y}}_{\mathrm{s}}^{(\mathrm{SS})}\right)$
      – These issues significantly degrade synthetic speech quality.
  • 7. Proposed algorithm: Training TTS using noise generation models based on GANs
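  • 8. Overview of the proposed algorithm
      – Linguistic features → TTS → predicted clean $\hat{\boldsymbol{y}}_{\mathrm{s}}$
      – Prior noise $\boldsymbol{n}$ → pre-trained noise generation models $G_{\mathrm{n}}(\cdot)$ → generated noise $\hat{\boldsymbol{y}}_{\mathrm{n}}$
      – Noise addition of $\hat{\boldsymbol{y}}_{\mathrm{s}}$ and $\hat{\boldsymbol{y}}_{\mathrm{n}}$ → estimated noisy $\hat{\boldsymbol{y}}_{\mathrm{ns}}$
      – Minimize $L_{\mathrm{MSE}}\left(\boldsymbol{y}_{\mathrm{ns}}, \hat{\boldsymbol{y}}_{\mathrm{ns}}\right)$ against the observed noisy $\boldsymbol{y}_{\mathrm{ns}}$
      – We want $G_{\mathrm{n}}(\cdot)$ to model the distribution of the observed noise.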
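The data flow of the proposed training criterion can be sketched as follows. Several parts are assumptions made only for illustration: the slide does not say in which domain the noise is added (power-domain addition is assumed here), the prior noise is given the same shape as the spectra, and `toy_generator` is a hypothetical stand-in for the pre-trained $G_{\mathrm{n}}(\cdot)$.

```python
import numpy as np


def add_noise(clean_amp, noise_amp):
    """Noise addition step (assumption: powers of speech and noise add)."""
    return np.sqrt(clean_amp ** 2 + noise_amp ** 2)


def proposed_loss(noisy_amp, pred_clean_amp, noise_generator, rng):
    """MSE between observed noisy spectra and estimated noisy spectra.

    noise_generator is a callable standing in for the pre-trained
    noise generation model; it maps prior noise to generated noise.
    """
    # Uniform prior noise, as listed in the experimental conditions.
    prior = rng.uniform(0.0, 1.0, size=pred_clean_amp.shape)
    generated_noise = noise_generator(prior)
    est_noisy = add_noise(pred_clean_amp, generated_noise)
    num_frames = noisy_amp.shape[0]
    diff = (est_noisy - noisy_amp).ravel()
    return float(diff @ diff) / num_frames


rng = np.random.default_rng(0)
spectra = np.array([[1.0, 2.0], [3.0, 4.0]])

# Hypothetical generators: one silent, one producing small positive noise.
silent_generator = lambda prior: np.zeros_like(prior)
toy_generator = lambda prior: 0.5 + prior

loss_silent = proposed_loss(spectra, spectra, silent_generator, rng)
loss_noisy = proposed_loss(spectra, spectra, toy_generator, rng)
```

The target of the loss is the observed noisy spectrum itself, so no noise-reduced estimate is needed as a training target.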
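  • 9. Pre-training of noise generation models based on GANs
      – Noisy $\boldsymbol{y}_{\mathrm{ns}}$ → Extraction of non-speech period → observed noise $\boldsymbol{y}_{\mathrm{n}}$
      – Prior noise $\boldsymbol{n}$ → noise generation models $G_{\mathrm{n}}(\cdot)$ → generated noise $\hat{\boldsymbol{y}}_{\mathrm{n}}$
      – Discriminative models $D(\cdot)$ output 1 (observed) or 0 (generated)
      – Objective:
        $$\min_{G_{\mathrm{n}}} \max_{D} V(G_{\mathrm{n}}, D), \quad V(G_{\mathrm{n}}, D) = E\left[\log D(\boldsymbol{y}_{\mathrm{n}})\right] + E\left[\log\left(1 - D(\hat{\boldsymbol{y}}_{\mathrm{n}})\right)\right]$$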
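  • 11. Pre-training of noise generation models based on GANs
      – Noisy $\boldsymbol{y}_{\mathrm{ns}}$ → Extraction of non-speech period → observed noise $\boldsymbol{y}_{\mathrm{n}}$
      – Prior noise $\boldsymbol{n}$ → noise generation models $G_{\mathrm{n}}(\cdot)$ → generated noise $\hat{\boldsymbol{y}}_{\mathrm{n}}$
      – Discriminative models $D(\cdot)$ (1: observed)
      – Objective:
        $$\min_{G_{\mathrm{n}}} \max_{D} V(G_{\mathrm{n}}, D), \quad V(G_{\mathrm{n}}, D) = E\left[\log D(\boldsymbol{y}_{\mathrm{n}})\right] + E\left[\log\left(1 - D(\hat{\boldsymbol{y}}_{\mathrm{n}})\right)\right]$$
      – This minimizes the approximated Jensen-Shannon (JS) divergence between the distributions of $\boldsymbol{y}_{\mathrm{n}}$ and $\hat{\boldsymbol{y}}_{\mathrm{n}}$.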
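The GAN value function can be evaluated from discriminator outputs alone, which makes the minimax game easy to inspect numerically. The sketch below is illustrative: `gan_value` and the `eps` clipping are assumptions added for numerical safety, and expectations are replaced by sample means over a batch.

```python
import numpy as np


def gan_value(d_observed, d_generated, eps=1e-12):
    """Sample estimate of V = E[log D(y_n)] + E[log(1 - D(y_n_hat))].

    d_observed:  discriminator outputs on observed noise, in (0, 1)
    d_generated: discriminator outputs on generated noise, in (0, 1)
    """
    d_observed = np.clip(d_observed, eps, 1.0 - eps)    # avoid log(0)
    d_generated = np.clip(d_generated, eps, 1.0 - eps)
    return float(np.mean(np.log(d_observed)) + np.mean(np.log(1.0 - d_generated)))


# A discriminator that cannot tell the two apart outputs 0.5 everywhere.
v_confused = gan_value(np.full(4, 0.5), np.full(4, 0.5))

# A discriminator that separates them well pushes V toward 0 from below.
v_confident = gan_value(np.full(4, 0.99), np.full(4, 0.01))
```

The discriminator is trained to raise this value and the generator to lower it. For the optimal discriminator the value equals $-\log 4$ plus twice the JS divergence [Goodfellow et al., 2014], so `v_confused`, which equals $-\log 4$, corresponds to matching distributions.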
  • 12. Comparison of observed/generated noise (generating Gaussian noise from uniform noise)
      – [Figure: spectrograms (Freq. [kHz] vs. Time [s]) and amplitude histograms (Frequency vs. Amplitude) of the observed and generated noise]
      – Our noise generation models effectively reproduce characteristics of the observed noise!
  • 13. Discussion of proposed algorithm
      – Modeling distribution of stationary noise by using GANs
        · Musical noise [Miyazaki et al., 2012] (unpleasant sound) can be reduced.
        · By using recurrent networks, the distribution of non-stationary noise can also be modeled by our algorithm.
      – Extending the proposed algorithm
        · Distribution of context-dependent noise (e.g., pop-noise) can be captured by using conditional GANs [Mirza et al., 2015].
        · By using WaveNet [Oord et al., 2016], the noise distribution can be modeled in the waveform domain.
      – Adapting TTS or noise generation models
        · Pre-recorded clean speech data can be used to build initial models used in our algorithm.
  • 15. Experimental conditions
      – Dataset: Japanese female speaker (subset of JSUT corpus [Sonobe et al., 2017])
      – Train / evaluate data: 3,000 / 53 sentences (16 kHz sampling)
      – Linguistic feats.: 442-dimensional vector (phoneme, accent type, F0, UV, duration, etc.)
      – Speech params.: 257-dimensional log amplitude spectrum
      – Waveform synthesis: Griffin & Lim's method [Griffin et al., 1986]
      – Prior / observed noise: Uniform / Gaussian (artificially added)
      – DNN architectures: Feed-Forward (details are written in our manuscript)
      – Noise suppression ratio of SS $\beta$: 0.5, 1.0, 2.0, and 5.0 (larger value means stronger noise reduction)
      – Input SNR: 0, 5, and 10 [dB]
      – Evaluation method: Preference AB test in terms of speech quality (25 participants / evaluation)
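The conditions above mention Gaussian noise artificially added at input SNRs of 0, 5, and 10 dB. One common way to do this is sketched below. It is not taken from the slides: the function name `add_gaussian_noise`, the sine-wave stand-in for speech, and time-domain mixing are all assumptions.

```python
import numpy as np


def add_gaussian_noise(speech, snr_db, rng):
    """Mix white Gaussian noise into a waveform at the requested SNR.

    Returns the noisy waveform and the scaled noise that was added.
    """
    noise = rng.standard_normal(speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain so that speech_power / scaled_noise_power = 10^(SNR/10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise


rng = np.random.default_rng(0)
sample_rate = 16000  # 16 kHz sampling, as in the experimental conditions
time = np.arange(sample_rate) / sample_rate
speech = np.sin(2.0 * np.pi * 440.0 * time)  # 1 s tone as a stand-in for speech

noisy, noise = add_gaussian_noise(speech, snr_db=5.0, rng=rng)
measured_snr_db = 10.0 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))
```

Because the gain is computed from the realized noise samples, `measured_snr_db` matches the requested value up to floating-point error.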
  • 16. Results of subjective evaluation of speech quality (input SNR = 0 [dB])
      – [Figure: preference scores (0.00 to 1.00) of SS+MSE ($\beta$ = 0.5, 1.0, 2.0, 5.0) vs. Proposed. Score pairs recovered from the chart (SS+MSE vs. Proposed), in extraction order: 0.368 vs. 0.632, 0.312 vs. 0.688, 0.312 vs. 0.688, 0.253 vs. 0.747]
      – In all cases, the $p$-values between the methods were smaller than $10^{-6}$.
      – Our algorithm significantly improves speech quality compared with TTS using SS!
  • 17. Results of subjective evaluation of speech quality (input SNR = 5 [dB])
      – [Figure: preference scores (0.00 to 1.00) of SS+MSE ($\beta$ = 0.5, 1.0, 2.0, 5.0) vs. Proposed. Score pairs recovered from the chart (SS+MSE vs. Proposed), in extraction order: 0.292 vs. 0.708, 0.320 vs. 0.680, 0.323 vs. 0.677, 0.216 vs. 0.784]
      – In all cases, the $p$-values between the methods were smaller than $10^{-6}$.
      – Our algorithm significantly improves speech quality compared with TTS using SS!
  • 18. Results of subjective evaluation of speech quality (input SNR = 10 [dB])
      – [Figure: preference scores (0.00 to 1.00) of SS+MSE ($\beta$ = 0.5, 1.0, 2.0, 5.0) vs. Proposed. Score pairs recovered from the chart (SS+MSE vs. Proposed), in extraction order: 0.268 vs. 0.732, 0.292 vs. 0.707, 0.256 vs. 0.744, 0.288 vs. 0.712]
      – In all cases, the $p$-values between the methods were smaller than $10^{-6}$.
      – Our algorithm significantly improves speech quality compared with TTS using SS!
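The slides report $p$-values for the preference AB tests but do not state which test was used or how many votes each comparison received. As one possible reading, an exact two-sided binomial test against the "no preference" hypothesis can be sketched as follows; the function name and the vote counts in the example are hypothetical.

```python
from math import comb


def two_sided_binomial_p(votes_for, total_votes):
    """Exact two-sided p-value under H0: both methods equally preferred.

    With a success probability of 0.5 the distribution is symmetric,
    so the two-sided p-value is twice the upper tail of the larger count.
    """
    larger = max(votes_for, total_votes - votes_for)
    upper_tail = sum(comb(total_votes, i) for i in range(larger, total_votes + 1))
    return min(1.0, 2.0 * upper_tail / 2 ** total_votes)


# Hypothetical counts, chosen only to exercise the function.
p_even_split = two_sided_binomial_p(5, 10)
p_unanimous = two_sided_binomial_p(10, 10)
```

With the same preference ratio, the $p$-value shrinks as the number of votes grows, so the reported scores alone do not determine it.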
  • 19. Conclusion
      – Purpose: training high-quality TTS using noisy speech data
      – Proposed: training algorithm considering the noise additive process
        · Our noise generation models can learn the distribution of observed noise through GAN-based training.
      – Results: improving synthetic speech quality compared with TTS using SS
      – Future work:
        · Modeling non-stationary noise by the proposed algorithm, using richer DNN architectures (e.g., long short-term memory)
        · Comparing our algorithm with state-of-the-art noise suppression
      – Thank you for your attention!