APSIPA-ASC 2018 @ Hawaii

1. ©Yuki Saito, 13/11/2018. Generative Approach Using the Noise Generation Models for DNN-Based Speech Synthesis Trained from Noisy Speech. Masakazu Une¹, Yuki Saito², Shinnosuke Takamichi², Daichi Kitamura³, Ryoichi Miyazaki¹, and Hiroshi Saruwatari². ¹NIT, Tokuyama College, Japan; ²The Univ. of Tokyo, Japan; ³NIT, Kagawa College, Japan. APSIPA-ASC 2018, TU-P1-5.1
2. Text-to-speech (TTS) synthesis using deep neural networks (DNNs)
   • TTS pipeline: text → text analysis → linguistic features → speech synthesis → speech
   • In DNN-based TTS [Zen et al., 2013], DNN-based acoustic models map the linguistic features to speech parameters.
   To realize high-quality TTS, studio-quality clean speech data is required for training the DNNs.
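The acoustic model in this pipeline maps frame-level linguistic features to speech parameters. A minimal NumPy sketch of such a feed-forward mapping, assuming a single hidden layer; the layer sizes and random weights here are hypothetical stand-ins, not the architecture used in the paper:

```python
import numpy as np

def acoustic_model(linguistic_feats, w1, b1, w2, b2):
    """Feed-forward DNN acoustic model: linguistic features -> speech params.

    linguistic_feats: (T, D_in) frame-level linguistic feature vectors
    w1, b1, w2, b2: weights of one hidden layer and one output layer (hypothetical)
    """
    h = np.maximum(linguistic_feats @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2                               # linear output layer

# Toy weights; 442-dim input / 257-dim output mirror the paper's feature sizes.
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 442))
w1, b1 = rng.standard_normal((442, 64)) * 0.01, np.zeros(64)
w2, b2 = rng.standard_normal((64, 257)) * 0.01, np.zeros(257)
params = acoustic_model(x, w1, b1, w2, b2)
```

At synthesis time the predicted parameters would be turned back into a waveform (the paper uses Griffin & Lim's method for this step).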
3. Outline of this talk
   • Goal: realizing high-quality TTS using NOISY speech data
   • Common approach: noise reduction before training DNNs (noisy (observed) → noise reduction → clean (estimated) → TTS)
     – The error caused by the noise reduction is propagated to TTS training...
   • Proposed: training DNNs considering the noise-additive process (clean (unobserved) → noise addition with noise generation models → noisy (observed) → TTS)
     – GAN*-based noise generation models are introduced into TTS training.
   • Results: improved synthetic speech quality
   *Generative Adversarial Network [Goodfellow et al., 2014]
4. Noise reduction using spectral subtraction (SS) [Boll, 1979]
   • The amplitude spectra after noise reduction, $y_s^{(SS)}$, are calculated as:
     $y_s^{(SS)}(t,f) = \begin{cases} \sqrt{y_{ns}^2(t,f) - \beta\,\overline{y_n^2}(f)} & \text{if } y_{ns}^2(t,f) - \beta\,\overline{y_n^2}(f) > 0 \\ 0 & \text{otherwise} \end{cases}$
   • The estimated average noise power $\overline{y_n^2}$ is defined as:
     $\overline{y_n^2}(f) = \frac{1}{T_n} \sum_{t=1}^{T_n} y_n^2(t,f)$ ($T_n$: total frame length of the noise)
   • Limitations
     – The noise distribution is approximated by its expected value $\overline{y_n^2}$.
     – There is a trade-off between noise reduction and speech distortion when setting the hyper-parameter $\beta$ (noise suppression ratio).
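The SS rule above translates directly into NumPy. This sketch assumes power-domain subtraction followed by a square root back to amplitude, with negative residuals floored at zero; the function and variable names are mine, not the paper's:

```python
import numpy as np

def estimate_noise_power(y_n):
    # Average noise power over the T_n non-speech frames:
    # (1/T_n) * sum_t y_n^2(t, f), one value per frequency bin f
    return np.mean(y_n ** 2, axis=0)

def spectral_subtraction(y_ns, noise_power, beta=1.0):
    """Spectral subtraction [Boll, 1979].

    y_ns: (T, F) noisy amplitude spectra
    noise_power: (F,) estimated average noise power
    beta: noise suppression ratio (larger = stronger reduction, more distortion)
    """
    residual = y_ns ** 2 - beta * noise_power   # power after subtraction
    return np.sqrt(np.maximum(residual, 0.0))   # floor at 0, return amplitudes
```

The `beta` parameter exposes the trade-off named on the slide: raising it removes more noise but also zeroes out more speech power.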
5. Training TTS from noisy speech using SS
   • The noisy amplitude spectra $y_{ns}$ are first cleaned by SS-based noise reduction; the resulting estimated clean amplitude spectra $y_s^{(SS)}$ serve as the targets for the clean amplitude spectra $\hat{y}_s^{(SS)}$ that the TTS model predicts from linguistic features.
   • Training minimizes the mean squared error
     $L_{MSE}(\hat{y}_s^{(SS)}, y_s^{(SS)}) = \frac{1}{T}\,(\hat{y}_s^{(SS)} - y_s^{(SS)})^\top (\hat{y}_s^{(SS)} - y_s^{(SS)})$
     where $T$ is the total frame length of the features.
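The loss on this slide, with T the total frame length, can be sketched as follows; the flattening of the per-frame spectra into one vector and the naming are my own:

```python
import numpy as np

def mse_loss(y_pred, y_target):
    """L_MSE = (1/T) * (y_pred - y_target)^T (y_pred - y_target).

    y_pred, y_target: (T, F) predicted and target amplitude spectra
    """
    diff = (y_pred - y_target).ravel()      # stack all frames into one vector
    return diff @ diff / y_pred.shape[0]    # normalize by total frame length T
```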
6. Issues in training TTS using SS
   1. Speech distortion caused by the error of SS
   2. Propagation of that distortion through the use of $y_s^{(SS)}$ as the target vector in $L_{MSE}(\hat{y}_s^{(SS)}, y_s^{(SS)})$
   These issues significantly degrade synthetic speech quality...
7. Proposed algorithm: training TTS using noise generation models based on GANs
8. Overview of the proposed algorithm
   • The TTS model predicts clean amplitude spectra $\hat{y}_s$ from linguistic features; pre-trained noise generation models $G_n(\cdot)$ turn prior noise into generated noise $\hat{n}$, which is added to $\hat{y}_s$ to form the estimated noisy spectra $\hat{y}_{ns}$.
   • Training minimizes $L_{MSE}(\hat{y}_{ns}, y_{ns})$ against the observed noisy spectra $y_{ns}$.
   • We want $G_n(\cdot)$ to model the distribution of the observed noise.
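In code terms, the proposed objective replaces the SS-cleaned target with the observed noisy spectra and corrupts the TTS prediction with noise drawn from the pre-trained generator before computing the loss. A hedged NumPy sketch; the linear rectified map below is only a stand-in for the pre-trained DNN generator $G_n$, and all names are mine:

```python
import numpy as np

def generate_noise(prior, gn_weights):
    # Stand-in for the pre-trained noise generation model G_n(.):
    # maps prior noise to non-negative amplitude-domain generated noise.
    return np.maximum(prior @ gn_weights, 0.0)

def proposed_loss(y_s_pred, y_ns_obs, prior, gn_weights):
    """MSE between the estimated noisy spectra (prediction + generated noise)
    and the observed noisy spectra, following the overview diagram."""
    n_hat = generate_noise(prior, gn_weights)   # generated noise
    y_ns_est = y_s_pred + n_hat                 # noise-addition step
    diff = (y_ns_est - y_ns_obs).ravel()
    return diff @ diff / y_s_pred.shape[0]
```

Because the noise is added to the prediction rather than subtracted from the data, no SS error ever enters the training targets.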
9. Pre-training of the noise generation models based on GANs
   • The observed noise $y_n$ is extracted from the non-speech periods of the noisy speech $y_{ns}$.
   • The noise generation models $G_n(\cdot)$ map prior noise to generated noise $\hat{n}$; the discriminative models $D(\cdot)$ output 1 for observed noise and 0 for generated noise.
   • Adversarial objective:
     $\min_{G_n} \max_{D} V(G_n, D) = E[\log D(y_n)] + E[\log(1 - D(\hat{n}))]$
   • This minimizes the approximated Jensen–Shannon (JS) divergence between the distributions of $y_n$ and $\hat{n}$.
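The value function $V(G_n, D)$ can be evaluated directly from discriminator outputs; at the equilibrium where $D(\cdot) = 0.5$ everywhere, $V = -\log 4$, the point at which the JS divergence term vanishes. A small NumPy sketch, with names of my choosing:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """V(G_n, D) = E[log D(y_n)] + E[log(1 - D(n_hat))].

    d_real: discriminator outputs on observed noise y_n
    d_fake: discriminator outputs on generated noise n_hat
    D is trained to maximize V; G_n to minimize it, which for the optimal D
    minimizes the JS divergence between observed and generated noise.
    """
    eps = 1e-12  # numerical floor to avoid log(0)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))
```

When the generator has fully fooled the discriminator, both output batches sit at 0.5 and the value settles at $-\log 4 \approx -1.386$.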
12. Comparison of observed and generated noise (generating Gaussian noise from uniform prior noise)
   [Figure: spectrograms (frequency in kHz vs. time in s) and amplitude histograms of the observed and generated noise]
   Our noise generation models effectively reproduce the characteristics of the observed noise!
13. Discussion of the proposed algorithm
   • Modeling the distribution of stationary noise by using GANs
     – Musical noise [Miyazaki et al., 2012] (an unpleasant sound) can be reduced.
     – By using recurrent networks, the distribution of non-stationary noise can also be modeled by our algorithm.
   • Extending the proposed algorithm
     – The distribution of context-dependent noise (e.g., pop noise) can be captured by using conditional GANs [Mirza et al., 2014].
     – By using WaveNet [Oord et al., 2016], the noise distribution can be modeled in the waveform domain.
   • Adapting the TTS or noise generation models
     – Pre-recorded clean speech data can be used to build the initial models used in our algorithm.
14. Experimental evaluations
15. Experimental conditions
   – Dataset: Japanese female speaker (subset of the JSUT corpus [Sonobe et al., 2017])
   – Train / evaluation data: 3,000 / 53 sentences (16 kHz sampling)
   – Linguistic features: 442-dimensional vector (phoneme, accent type, F0, U/V, duration, etc.)
   – Speech parameters: 257-dimensional log amplitude spectrum
   – Waveform synthesis: Griffin & Lim's method [Griffin et al., 1986]
   – Prior / observed noise: uniform / Gaussian (artificially added)
   – DNN architectures: feed-forward (details are given in our manuscript)
   – Noise suppression ratio of SS ($\beta$): 0.5, 1.0, 2.0, and 5.0 (a larger value means stronger noise reduction)
   – Input SNR: 0, 5, and 10 dB
   – Evaluation method: preference AB test on speech quality (25 participants per evaluation)
16. Results of the subjective evaluation of speech quality (input SNR = 0 dB)
   [Figure: preference scores of SS+MSE ($\beta$ = 0.5, 1.0, 2.0, 5.0) and MSE+SS ($\beta$ = 0.5, 1.0, 2.0, 5.0) versus the proposed method; the proposed method obtained the higher score in every pair.]
   In all cases, the $p$-values between the methods were smaller than $10^{-6}$.
   Our algorithm significantly improves speech quality compared with TTS using SS!
17. Results of the subjective evaluation of speech quality (input SNR = 5 dB)
   [Figure: preference scores of SS+MSE ($\beta$ = 0.5, 1.0, 2.0, 5.0) and MSE+SS ($\beta$ = 0.5, 1.0, 2.0, 5.0) versus the proposed method; the proposed method obtained the higher score in every pair.]
   In all cases, the $p$-values between the methods were smaller than $10^{-6}$.
   Our algorithm significantly improves speech quality compared with TTS using SS!
18. Results of the subjective evaluation of speech quality (input SNR = 10 dB)
   [Figure: preference scores of SS+MSE ($\beta$ = 0.5, 1.0, 2.0, 5.0) and MSE+SS ($\beta$ = 0.5, 1.0, 2.0, 5.0) versus the proposed method; the proposed method obtained the higher score in every pair.]
   In all cases, the $p$-values between the methods were smaller than $10^{-6}$.
   Our algorithm significantly improves speech quality compared with TTS using SS!
19. Conclusion
   • Purpose: training high-quality TTS using noisy speech data
   • Proposed: a training algorithm considering the noise-additive process
     – Our noise generation models can learn the distribution of the observed noise through GAN-based training.
   • Results: improved synthetic speech quality compared with TTS using SS
   • Future work
     – Modeling non-stationary noise with the proposed algorithm, using richer DNN architectures (e.g., long short-term memory)
     – Comparing our algorithm with state-of-the-art noise suppression
   Thank you for your attention!
