SlideShare a Scribd company logo
2018. 12. 11
DSP & AI Lab
Min-Jae Hwang
TOWARD WAVENET SPEECH SYNTHESIS
PROFILE
2
Min-Jae Hwang
• Affiliation : 7th semester of MS/Ph.D. student
DSP & AI Lab., Yonsei Univ., Seoul, Korea
• Adviser : Prof. Hong-Goo Kang
• E-mail : hmj234@dsp.yonsei.ac.kr
Work experience
• 2017.12 ~ 2017.12 – Intern at NAVER corporation
• 2018.01 ~ 2018.11 – Intern at Microsoft Research Asia
Research area
• 2015.8 ~ 2016.12 - Audio watermarking
• M. Hwang, J. Lee, M. Lee, and H. Kang,“사전 분석법을 통한 스프레드 스펙트럼 기반 오디오 워터마킹 알고리즘의 성능 향상,” 한국음향학회 제 33회 음성통신 및
신호처리 학술대회, 2016
• M. Hwang, J. Lee, M. Lee, and H. Kang, “SVD-based adaptive QIM watermarking on stereo audio signals,” in IEEE Transactions on Multimedia, 2017
• 2017.1 ~ Present - Deep learning-based statistical parametric speech synthesis
• M. Hwang, E. Song, K. Byun, and H. Kang, “Modeling-by-generation structure-based noise compensation algorithm for glottal vocoding speech synthesis syste
m,” in ICASSP, 2018
• M. Hwang, E. Song, J. Kim, and H. Kang, “A unified framework for the generation of glottal signals in deep learning-based parametric speech synthesis
systems,” in Interspeech, 2018
• M. Hwang, F. Soong, F. Xie, X. Wang, and H. Kang, “LP-WaveNet: linear prediction-based WaveNet speech synthesis,” in ICASSP, 2019 [Submitted]
CONTENTS
Introduction
• Story of WaveNet vocoder
Various types of WaveNet vocoders
• SoftMax-WaveNet
• MDN-WaveNet
• ExcitNet
• Proposed LP-WaveNet
Tuning of WaveNet
• Noise injection
• Waveform generation methods
Experiments
Conclusion & Summary
3
INTRODUCTION
Story of WaveNet vocoder
Traditional
Text-to-speech
WaveNet MDN-WaveNet ExcitNet LP-WaveNet
~2015 2016 2017 2018 2019
WaveNet
Vocoder
2017
CLASSICAL TTS SYSTEMS
• Parameterize speech signal using vocoder
technique
• Pros) High controllability, small database
• Cons) Low quality
5
• Concatenate tiny segment of real speech
signal
• Pros) High quality
• Cons) Low controllability, large database
Unit-selection speech synthesis [1] Statistical parametric speech synthesis (SPSS) [2]
Text-to-speech
GENERATIVE MODEL-BASED SPEECH SYNTHESIS
WaveNet [3]
• First generative model for raw audio waveform
• Predict the probability distribution of waveform sample auto-regressively
• Generate high quality of audio / speech signal
• Impact on the task of TTS, voice conversion, music synthesis, etc.
WaveNet in TTS task
6
1
( ) ( | )
N
n n
n
p p x <
=
= Õx x
[Mean opinion scores]
• Utilize linguistic features and F0
contour as a conditional information
[WaveNet speech synthesis]
• Present higher quality than the
conventional TTS systems
WAVENET VOCODER-BASED SPEECH SYNTHESIS
Utilize WaveNet as parametric vocoder [4]
• Use acoustic features as conditional information
Advantages
• Higher quality synthesized speech than the conventional vocoders
• Don’t require hand-engineered processing pipeline
• Higher controllability than the case of linguistic features
• Controlling acoustic features
• Higher training efficiency than the case of linguistic features
• Linguistic feature: 25~35 hour database
• Acoustic feature: 1 hour database
7
[WaveNet vocoder based parametric speech synthesis]
Conventional
vocoder
WaveNet
vocoder
Recorded
1. SoftMax-WaveNet
2. MDN-WaveNet
3. ExcitNet
4. Proposed LP-WaveNet
Traditional
Text-to-Speech
WaveNet MDN-WaveNet ExcitNet LP-WaveNet
~2015 2016 2017 2018 2019
WaveNet
Vocoder
2017
VARIOUS TYPES OF WAVENET VOCODERS
BASIC OF WAVENET
Auto-regressive generative model
• Predict probability distribution of speech samples
• Use past waveform samples as a condition information of WaveNet
Problem: Long-term dependency nature of speech signal
• Highly correlated speech signal in high sampling rate, e.g. 16000 Hz
• E.g. 1) At least 160 (=16000/100) speech samples to represent 100Hz of voice correctly
• E.g. 2) Averagely 6,000 speech samples to represent single word1
Solution
• Put the speech samples as an input of dilated causal convolution layer
• Stack the dilated causal convolution layer
• Result in exponentially growing receptive field
9
: 1
1
( ) ( | )
N
n n R n
n
p p x - -
=
= Õx x
[Dilated causal convolution]
1 Statistics based on the average speaking rate of a set of TED talk speakers
http://sixminutes.dlugan.com/speaking-rate/
Effective embedding method of long receptive field is required
BASIC OF WAVENET
WaveNet without dilation
10
[WaveNet without dilated convolution]
Receptive field: # layer - 1
BASIC OF WAVENET
WaveNet with dilation
11
[WaveNet with dilated convolution]
Receptive field: 𝟐# layer − 𝟏
BASIC OF WAVENET
12
Multiple stack of residual blocks
1. Dilated causal convolution
• Exponentially increase the receptive field
2. Gated activation
• Impose non-linearity on the model
• Enable conditional WaveNet
3. Residual / skip connections
• Speed up the convergence
• Enable deep layered model training
( ) ( )tanh * * sigmoid * *f f g gW V W V= + +z x h x h!
1
0
[ , ] [ ] [ ]
M
k
y d n h k x n d k
-
=
= - ×å
[Basic WaveNet architecture]
: input of activation
: conditional vector
: output of activation
x
h
z
SOFTMAX-WAVENET
Use WaveNet as multinomial logistic regression model [4]
• 𝜇-law companding for evenly distributed speech sample
• 8-bit one-hot encoding
• 256 symbols
Estimate each symbol using WaveNet
• Predict the sample by SoftMax distribution
Optimize network by cross-entropy (CE) loss
• Minimize the probabilistic distance between 𝑝& and 𝑞&
13
( )ln 1
( ) , =255
ln(1 )
x
y sign x
µ
µ
µ
+
= ×
+
8 ( )bitp OneHot y=
exp( )
exp( )
q
n
n q
i
i
z
q
z
=
å
[ ]logn n
n
L p q= -å
( | )q
nWaveNet <=z q h
[Distribution of speech samples]
[SoftMax-WaveNet]
SOFTMAX-WAVENET
Limitation by the usage of 8-bit quantization
• Noisy synthetic speech due to insufficient number of quantization bits
Intuitive solution
• Expand the SoftMax dimension to 65,536 corresponds to 16-bit quantization
à High computational cost & difficult to train
Mixture density network (MDN)-based solution [5]
• Train the WaveNet to predict the parameter of pre-defined speech distribution
14
MDN-WAVENET
Define the distribution of waveform sample as parameterized form [5]
• Discretized mixture of logistic (MoL) distribution
• Discretized logistic mixture with 16bit quantization (Δ = 1/2-.
)
Estimate mixture parameters by WaveNet
Optimize network by negative log-likelihood (NLL) loss
15
: logistic functions
[ ]log ( | , )n n
n
L p x x<= -å h
1
/ 2 / 2
( )
N
n n
n
n n n
x x
p x
s s
µ µ
p s s
=
é ùæ ö æ ö+ D - - D -
= × -ê úç ÷ ç ÷
è ø è øë û
å
[ , , ] ( , )s
nWaveNetp µ
<=z z z x h
softmax( )
exp( )s
p
µ
=
=
=
π z
µ z
s z
, for unity-summed mixture gain
, for positive value of mixture scale
Mixture density network
[MDN-WaveNet]
MDN-WAVENET
Higher quality than SoftMax-WaveNet
• Enable to model the speech signal by 16-bit
• Overcome difficulty of spectrum modeling
Difficulty of WaveNet training
• Due to increased quantization bit from 8-bit to 16-bit
Solution based on the human speech production model [6]
• Model the vocal source signal, whose physical behavior is much simpler than the speech signal
16
Mixture density network
[Spectrogram comparison] [Loss comparison]
SoftMax
WaveNet
MDN
WaveNet
Recorded
SPEECH PRODUCTION MODEL
Concept
• Modeling the speech as the filtered output of vocal source signal to vocal tract filter
Methodology: Linear prediction (LP) approach
• Define speech signal as linear combination of past speech samples
17
[ ]( ) ( ) ( ) ( ) ( )S z G z V z R z E z= × × ×
[Speech production model] [Spectral deconvolution by LP analysis]
1
P
n i n i n
i
s s ea -
=
= +å
1
1
( ) ( ) ( ), where ( )
1
p i
ii
S z H z E z H z
za -
=
= × =
- å
Spectrum part = LP coefficients
Excitation part = Residual signal of LP analysis
Speech = [vocal fold × vocal tract ×	lip radiation] × excitation
EXCITNET
Model the excitation signal by WaveNet, instead of speech signal [6]
Training stage
• Extract excitation signal by linear prediction (LP) analysis
• Periodically updated filter coefficients matched to frame rate of acoustic feature
• Train WaveNet to model excitation signal
Synthesis stage
• Generated excitation signal by WaveNet
• Synthesize speech signal by LP synthesis filtering
18
1
P
n n i n i
i
e s sa -
=
= - å
1
p
n i n i n
i
s s ea -
=
= +å
[ExcitNet]
Recorded Excitation LP synthesis
EXCITNET
High quality of synthesized speech even though 8-bit quantization is used
• Overcome the difficulty of spectrum modeling by using external spectral shaping filter
Limitation
• Independent modeling of excitation and spectrum parts results in unpredictable
prediction error
19
1
1
( ) ( )
1
p
i
i
i
S z E z
za -
=
= ×
- å
Contains error
Contains error
Objective
Represent LP synthesis process during WaveNet training / generation
Recorded
SoftMax
WaveNet ExcitNet
LP-WAVENET
Motivation from the assumption of WaveNet vocoder
1. Previous speech samples, 𝐱3&, are given
2. LP coefficients, {𝛼6}, are given
Probabilistic analysis
• Difference between the random variables 𝑋& and 𝐸& is only a constant value of 𝑥;&
Assume the discretized MoL distributed speech
• Shifting property of 2nd order random variable
20
1
ˆ
p
n i n i
i
x xa -
=
= åTheir linear combination, , are also given
( | , )n np e <x h
ˆnx
( | , )n np x <x h
[Difference of distributions between speech and excitation]
ˆ
x e
i i
x e
i i n
x e
i i
x
s s
p p
µ µ
=
= +
=
Only mean parameters
are different
Linear prediction
ˆ| ( , ) ( ) | ( , )
ˆ| ( , )
n n n n n
n n n
X E x
E x
< <
<
= +
= +
x h x h
x h
ˆn n nx e x= +
LP-WAVENET
Utilize the causality of WaveNet and the linearity of LP synthesis processes [7]
21
( ), , ,s
nWaveNetp µ
<
é ù =ë ûz z z x h
1
( | , )
/ 2 / 2
n n
N
i i
i
i i i
p x
x x
s s
µ µ
p s s
<
=
=
é ùæ ö æ ö+ D - - D -
× -ê úç ÷ ç ÷
è ø è øë û
å
x h
1. Mixture parameter prediction
2. Mixture parameter refinement
3. Discretized MoL loss calculation
[ ]log ( | , )n n
n
E p x <= -å x h
softmax( )
ˆ
exp( )
n
s
x
p
µ
=
= +
=
π z
µ z
s z
[LP-WaveNet]
Linear prediction Recorded ExcitNet LP-WaveNet
TUNING OF WAVENET
1. Solution to waveform divergence problem
2. Waveform generation methods
WAVEFORM DIVERGENCE PROBLEM
Waveform divergence problem
• Waveform divergence by a stacked error during WaveNet’s autoregressive generation
Cause – Overfitting on the silence region
• Unique solution in silence region
• Easier to be happen when the portion of silence region in training set is larger
• Be sensitive to tiny error in silence region during waveform generation
Solution – Noise injection
• Inject negligible amount of noise
• Allowing only 1 bit error
• Increase robustness to the prediction error
in silence region
23
( ),curr prevx WaveNet= x h ( )0 , silWaveNet= 0 h
16
ˆ , where = 2 2e e= + ×x x n 0 0 0
1 1 1
2 2 2
~ ( | )
~ ( | )
~ ( | )
~ ( | )n n n
x p x
x p x
x p x
x p x
<
<
<
<
x
x
x
x
!
0err
0 1err err+
i
i
errå
!
[Error stacking during waveform generation]
WAVEFORM GENERATION METHODS
Random sampling
Argmax sampling
Greedy sampling [8]
Mode sampling [7]
• Sharper distribution on voiced region
24
ˆ ( ) ~ ( ( ) | ( ), )randx n p x n x k n< h
max
ˆ ˆ ˆ( ) ( ) (1 ) ( )greedy randx n vuv x n vuv x n= × + - ×
max
( )
ˆ ( ) arg max ( ( ) | ( ), )
x n
x n p x n x k n= < h
,
ˆ ˆ ˆ( ) ( ) (1 ) ( )mode rand nar randx n vuv x n vuv x n= × + - ×
1
ˆ ( ) ~ ( , )
I
rand i i i
i
x n DistLogic sp µ
=
å
,
1
ˆ ( ) ~ ,
I
i
rand narr i i
i
s
x n DistLogic
c
p µ
=
æ ö
ç ÷
è ø
å
[Random & argmax sampling]
[Scale parameter control in mode sampling]
randx
maxx
Random
sampling
Recorded
Argmax
sampling
Greedy
sampling
Mode
sampling
EXPERIMENTS
EXPERIMENTS
Network architecture
Systems
• WN<: MDN-WaveNet that models the speech signal
• WN=: MDN-WaveNet that models the excitation signal
• WN>?: Proposed LP-WaveNet
26
Database Korean male: MBC YBS database
Training / validation / Test 2,500 (~3.2h) / 200 / 200
Minibatch size 4 GPU x 20000 samples
Dilation 3 * [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
Layer 30
Receptive field 3070 samples
Residual chn. 128
Skip chn. 128
Quantization 65536 uniform
Number of mixture 10 mixture components --> 30 output channels
Conditioning features Voicing flag, LSF, F0 (log), Eng (log), BAP
Normalization Gaussian normalization
Learning rate 1.00E-04
Initialization / optimizer Xavier / Adam
Sample generation method Mode sampling
LEARNING CURVE
27
Training speed
𝐖𝐍 𝑳𝑷 = WN= > WN<
OBJECTIVE EVALUATION
Performance measurements
• VUV: Voicing error rate (%)
• F0 RMSE: F0 root mean square error (Hz) in voiced region
• LSD: Log-spectral distance of AR spectrum (dB)
• F-LSD: LSD of synthesized speech in frequency domain (dB)
Results
28
A/S: analysis / synthesis
SPSS: synthesis by predicted acoustic features
SUBJECTIVE EVALUATION
Mean opinion score (MOS) test
• 20 random synthesized utterances from test set
• 12 native Korean speakers
• Include STRAIGHT (STR)-based speech synthesis system as baseline [9]
Results
29
A/S: analysis / synthesis
SPSS: synthesis by predicted acoustic features
SAMPLES
Recorded
STRAIGHT – A/S
WN< – A/S
WN= – A/S
WN>? – A/S
30
STRAIGHT – SPSS
WN< – SPSS
WN= – SPSS
WN>? – SPSS
A/S: analysis / synthesis
SPSS: synthesis by predicted acoustic features
SUMMARY & CONCLUSION
Investigated various types of WaveNet vocoder
• SoftMax-WaveNet
• MDN-WaveNet
• ExcitNet
Proposed an LP-WaveNet
• Explicitly represent linear prediction structure of speech waveform into WaveNet framework
Introduced tuning methods of WaveNet training / generation
• Noise injection
• Sample generation methods
Performed experiment in objective and subjective manner
• Achieve faster convergence speed than conventional WaveNet vocoders, whereas keep
similar speech quality
31
REFERENCES
[1] A. J. Hunt and A. W. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in
Proc. ICASSP, 1996
[2] H. Zen, et. al. “Statistical parametric speech synthesis using deep neural networks,” in Proc. ICASSP, 2013.
[3] A. van den Oord, et. al.,“WaveNet: A generative model for raw audio generation,” in CoRR, 2016.
[4] A.Tamamori, et. al., “Speaker-dependent wavenet vocoder,” in INTERSPEECH, 2017.
[5] A. van den Oord, et. al., "Parallel WaveNet: Fast High-fidelity speech synthesis," in CoRR, 2017
[6] E. Song, K. Byun, and H.-G. Kang, “A neural excitation vocoder for statistical parametric speech synthesis systems,” in
Arxiv, 2018.
[7] M. Hwang, F. Soong, F. Xie, X. Wang, and H. Kang, “LP-WaveNet: Linear Prediction-based WaveNet Speech Synthesis,”
Submitted to Proc. ICASSP, 2019.
[8] W. Xin, et. al., “A comparison of recent waveform generation and acoustic modeling methods for neural-network-based
speech synthesis,” in Proc. ICASSP, 2018.
[9] H. Kawahara, “STRAIGHT, exploration of the other aspect of vocoder: Perceptually isomorphic decomposition of speech
sounds,” Acoustical Science and Technology, 2006.
32
Thank you!
33

More Related Content

What's hot

Conformer review
Conformer reviewConformer review
Conformer review
June-Woo Kim
 
Introductory Lecture to Audio Signal Processing
Introductory Lecture to Audio Signal ProcessingIntroductory Lecture to Audio Signal Processing
Introductory Lecture to Audio Signal Processing
Angelo Salatino
 
Speech processing
Speech processingSpeech processing
[DL輪読会]Towards End-to-End Prosody Transfer for Expressive Speech Synthesis wi...
[DL輪読会]Towards End-to-End Prosody Transfer for Expressive Speech Synthesis wi...[DL輪読会]Towards End-to-End Prosody Transfer for Expressive Speech Synthesis wi...
[DL輪読会]Towards End-to-End Prosody Transfer for Expressive Speech Synthesis wi...
Deep Learning JP
 
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Yamagishi Laboratory, National Institute of Informatics, Japan
 
Speech Signal Processing
Speech Signal ProcessingSpeech Signal Processing
Speech Signal Processing
Murtadha Alsabbagh
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
Mark Chang
 
rnn BASICS
rnn BASICSrnn BASICS
rnn BASICS
Priyanka Reddy
 
Advancements in Neural Vocoders
Advancements in Neural VocodersAdvancements in Neural Vocoders
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
Alia Hamwi
 
VQ-VAE
VQ-VAEVQ-VAE
VQ-VAE
수철 박
 
Final ppt
Final pptFinal ppt
Final ppt
Sharu Sparky
 
RNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential DataRNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential Data
Yao-Chieh Hu
 
サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価
サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価
サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価
Shinnosuke Takamichi
 
Conditional Random Fields
Conditional Random FieldsConditional Random Fields
Conditional Random Fields
lswing
 
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
Deep Learning JP
 
FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向
Shinya Takamaeda-Y
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
Basha Chand
 
WaveNet
WaveNetWaveNet
WaveNet
AbeyHurtis
 
Diffusion models beat gans on image synthesis
Diffusion models beat gans on image synthesisDiffusion models beat gans on image synthesis
Diffusion models beat gans on image synthesis
BeerenSahu
 

What's hot (20)

Conformer review
Conformer reviewConformer review
Conformer review
 
Introductory Lecture to Audio Signal Processing
Introductory Lecture to Audio Signal ProcessingIntroductory Lecture to Audio Signal Processing
Introductory Lecture to Audio Signal Processing
 
Speech processing
Speech processingSpeech processing
Speech processing
 
[DL輪読会]Towards End-to-End Prosody Transfer for Expressive Speech Synthesis wi...
[DL輪読会]Towards End-to-End Prosody Transfer for Expressive Speech Synthesis wi...[DL輪読会]Towards End-to-End Prosody Transfer for Expressive Speech Synthesis wi...
[DL輪読会]Towards End-to-End Prosody Transfer for Expressive Speech Synthesis wi...
 
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
 
Speech Signal Processing
Speech Signal ProcessingSpeech Signal Processing
Speech Signal Processing
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
 
rnn BASICS
rnn BASICSrnn BASICS
rnn BASICS
 
Advancements in Neural Vocoders
Advancements in Neural VocodersAdvancements in Neural Vocoders
Advancements in Neural Vocoders
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
VQ-VAE
VQ-VAEVQ-VAE
VQ-VAE
 
Final ppt
Final pptFinal ppt
Final ppt
 
RNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential DataRNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential Data
 
サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価
サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価
サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価
 
Conditional Random Fields
Conditional Random FieldsConditional Random Fields
Conditional Random Fields
 
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
 
FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
WaveNet
WaveNetWaveNet
WaveNet
 
Diffusion models beat gans on image synthesis
Diffusion models beat gans on image synthesisDiffusion models beat gans on image synthesis
Diffusion models beat gans on image synthesis
 

Similar to Toward wave net speech synthesis

Speaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet VocoderSpeaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet Vocoder
Akira Tamamori
 
Speaker recognition using MFCC
Speaker recognition using MFCCSpeaker recognition using MFCC
Speaker recognition using MFCC
Hira Shaukat
 
Speech recognition final
Speech recognition finalSpeech recognition final
Speech recognition final
Archit Vora
 
H0814247
H0814247H0814247
H0814247
IOSR Journals
 
Real Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform DomainReal Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform Domain
Willy Marroquin (WillyDevNET)
 
Research_Wu.pptx
Research_Wu.pptxResearch_Wu.pptx
Research_Wu.pptx
Rakesh Pogula
 
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
IRJET Journal
 
Speech Compression using LPC
Speech Compression using LPCSpeech Compression using LPC
Speech Compression using LPC
Disha Modi
 
A_Noise_Reduction_Method_Based_on_LMS_Adaptive_Fil.pdf
A_Noise_Reduction_Method_Based_on_LMS_Adaptive_Fil.pdfA_Noise_Reduction_Method_Based_on_LMS_Adaptive_Fil.pdf
A_Noise_Reduction_Method_Based_on_LMS_Adaptive_Fil.pdf
Bala Murugan
 
Deep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementDeep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech Enhancement
NAVER Engineering
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
ijceronline
 
Performance Evaluation of Conventional and Hybrid Feature Extractions Using M...
Performance Evaluation of Conventional and Hybrid Feature Extractions Using M...Performance Evaluation of Conventional and Hybrid Feature Extractions Using M...
Performance Evaluation of Conventional and Hybrid Feature Extractions Using M...
IJERA Editor
 
Translatotron review
Translatotron reviewTranslatotron review
Translatotron review
June-Woo Kim
 
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
IRJET Journal
 
COLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech AnalysisCOLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech Analysis
Rushin Shah
 
F010334548
F010334548F010334548
F010334548
IOSR Journals
 
WaveNet.pdf
WaveNet.pdfWaveNet.pdf
WaveNet.pdf
ssuser849b73
 
Dereverberation in the stft and log mel frequency feature domains
Dereverberation in the stft and log mel frequency feature domainsDereverberation in the stft and log mel frequency feature domains
Dereverberation in the stft and log mel frequency feature domains
Takuya Yoshioka
 
Speech Recognition System By Matlab
Speech Recognition System By MatlabSpeech Recognition System By Matlab
Speech Recognition System By Matlab
Ankit Gujrati
 
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...
IJERA Editor
 

Similar to Toward wave net speech synthesis (20)

Speaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet VocoderSpeaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet Vocoder
 
Speaker recognition using MFCC
Speaker recognition using MFCCSpeaker recognition using MFCC
Speaker recognition using MFCC
 
Speech recognition final
Speech recognition finalSpeech recognition final
Speech recognition final
 
H0814247
H0814247H0814247
H0814247
 
Real Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform DomainReal Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform Domain
 
Research_Wu.pptx
Research_Wu.pptxResearch_Wu.pptx
Research_Wu.pptx
 
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
 
Speech Compression using LPC
Speech Compression using LPCSpeech Compression using LPC
Speech Compression using LPC
 
A_Noise_Reduction_Method_Based_on_LMS_Adaptive_Fil.pdf
A_Noise_Reduction_Method_Based_on_LMS_Adaptive_Fil.pdfA_Noise_Reduction_Method_Based_on_LMS_Adaptive_Fil.pdf
A_Noise_Reduction_Method_Based_on_LMS_Adaptive_Fil.pdf
 
Deep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementDeep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech Enhancement
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Performance Evaluation of Conventional and Hybrid Feature Extractions Using M...
Performance Evaluation of Conventional and Hybrid Feature Extractions Using M...Performance Evaluation of Conventional and Hybrid Feature Extractions Using M...
Performance Evaluation of Conventional and Hybrid Feature Extractions Using M...
 
Translatotron review
Translatotron reviewTranslatotron review
Translatotron review
 
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
 
COLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech AnalysisCOLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech Analysis
 
F010334548
F010334548F010334548
F010334548
 
WaveNet.pdf
WaveNet.pdfWaveNet.pdf
WaveNet.pdf
 
Dereverberation in the stft and log mel frequency feature domains
Dereverberation in the stft and log mel frequency feature domainsDereverberation in the stft and log mel frequency feature domains
Dereverberation in the stft and log mel frequency feature domains
 
Speech Recognition System By Matlab
Speech Recognition System By MatlabSpeech Recognition System By Matlab
Speech Recognition System By Matlab
 
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...
 

More from NAVER Engineering

React vac pattern
React vac patternReact vac pattern
React vac pattern
NAVER Engineering
 
디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX
NAVER Engineering
 
진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)
NAVER Engineering
 
서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트
NAVER Engineering
 
BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호
NAVER Engineering
 
이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라
NAVER Engineering
 
날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기
NAVER Engineering
 
쏘카프레임 구축 배경과 과정
 쏘카프레임 구축 배경과 과정 쏘카프레임 구축 배경과 과정
쏘카프레임 구축 배경과 과정
NAVER Engineering
 
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
NAVER Engineering
 
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
NAVER Engineering
 
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
NAVER Engineering
 
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
NAVER Engineering
 
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
NAVER Engineering
 
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
NAVER Engineering
 
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
NAVER Engineering
 
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
NAVER Engineering
 
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
NAVER Engineering
 
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
NAVER Engineering
 
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
NAVER Engineering
 
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
NAVER Engineering
 

More from NAVER Engineering (20)

React vac pattern
React vac patternReact vac pattern
React vac pattern
 
디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX
 
진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)
 
서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트
 
BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호
 
이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라
 
날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기
 
쏘카프레임 구축 배경과 과정
 쏘카프레임 구축 배경과 과정 쏘카프레임 구축 배경과 과정
쏘카프레임 구축 배경과 과정
 
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
 
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
 
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
 
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
 
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
 
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
 
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
 
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
 
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
 
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
 
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
 
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
 

Recently uploaded

5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
marufrahmanstratejm
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
saastr
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 

Recently uploaded (20)

5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 

Toward wave net speech synthesis

  • 1. 2018. 12. 11 DSP & AI Lab Min-Jae Hwang TOWARD WAVENET SPEECH SYNTHESIS
  • 2. PROFILE 2 Min-Jae Hwang • Affiliation : 7th semester of MS/Ph.D. student DSP & AI Lab., Yonsei Univ., Seoul, Korea • Adviser : Prof. Hong-Goo Kang • E-mail : hmj234@dsp.yonsei.ac.kr Work experience • 2017.12 ~ 2017.12 – Intern at NAVER corporation • 2018.01 ~ 2018.11 – Intern at Microsoft Research Asia Research area • 2015.8 ~ 2016.12 - Audio watermarking • M. Hwang, J. Lee, M. Lee, and H. Kang,“사전 분석법을 통한 스프레드 스펙트럼 기반 오디오 워터마킹 알고리즘의 성능 향상,” 한국음향학회 제 33회 음성통신 및 신호처리 학술대회, 2016 • M. Hwang, J. Lee, M. Lee, and H. Kang, “SVD-based adaptive QIM watermarking on stereo audio signals,” in IEEE Transactions on Multimedia, 2017 • 2017.1 ~ Present - Deep learning-based statistical parametric speech synthesis • M. Hwang, E. Song, K. Byun, and H. Kang, “Modeling-by-generation structure-based noise compensation algorithm for glottal vocoding speech synthesis syste m,” in ICASSP, 2018 • M. Hwang, E. Song, J. Kim, and H. Kang, “A unified framework for the generation of glottal signals in deep learning-based parametric speech synthesis systems,” in Interspeech, 2018 • M. Hwang, F. Soong, F. Xie, X. Wang, and H. Kang, “LP-WaveNet: linear prediction-based WaveNet speech synthesis,” in ICASSP, 2019 [Submitted]
  • 3. CONTENTS Introduction • Story of WaveNet vocoder Various types of WaveNet vocoders • SoftMax-WaveNet • MDN-WaveNet • ExcitNet • Proposed LP-WaveNet Tuning of WaveNet • Noise injection • Waveform generation methods Experiments Conclusion & Summary 3
  • 4. INTRODUCTION Story of WaveNet vocoder Traditional Text-to-speech WaveNet MDN-WaveNet ExcitNet LP-WaveNet ~2015 2016 2017 2018 2019 WaveNet Vocoder 2017
  • 5. CLASSICAL TTS SYSTEMS • Parameterize speech signal using vocoder technique • Pros) High controllability, small database • Cons) Low quality 5 • Concatenate tiny segment of real speech signal • Pros) High quality • Cons) Low controllability, large database Unit-selection speech synthesis [1] Statistical parametric speech synthesis (SPSS) [2] Text-to-speech
  • 6. GENERATIVE MODEL-BASED SPEECH SYNTHESIS WaveNet [3] • First generative model for raw audio waveform • Predict the probability distribution of waveform sample auto-regressively • Generate high quality of audio / speech signal • Impact on the task of TTS, voice conversion, music synthesis, etc. WaveNet in TTS task 6 1 ( ) ( | ) N n n n p p x < = = Õx x [Mean opinion scores] • Utilize linguistic features and F0 contour as a conditional information [WaveNet speech synthesis] • Present higher quality than the conventional TTS systems
  • 7. WAVENET VOCODER-BASED SPEECH SYNTHESIS Utilize WaveNet as parametric vocoder [4] • Use acoustic features as conditional information Advantages • Higher quality synthesized speech than the conventional vocoders • Don’t require hand-engineered processing pipeline • Higher controllability than the case of linguistic features • Controlling acoustic features • Higher training efficiency than the case of linguistic features • Linguistic feature: 25~35 hour database • Acoustic feature: 1 hour database 7 [WaveNet vocoder based parametric speech synthesis] Conventional vocoder WaveNet vocoder Recorded
  • 8. 1. SoftMax-WaveNet 2. MDN-WaveNet 3. ExcitNet 4. Proposed LP-WaveNet Traditional Text-to-Speech WaveNet MDN-WaveNet ExcitNet LP-WaveNet ~2015 2016 2017 2018 2019 WaveNet Vocoder 2017 VARIOUS TYPES OF WAVENET VOCODERS
  • 9. BASIC OF WAVENET Auto-regressive generative model • Predict probability distribution of speech samples • Use past waveform samples as a condition information of WaveNet Problem: Long-term dependency nature of speech signal • Highly correlated speech signal in high sampling rate, e.g. 16000 Hz • E.g. 1) At least 160 (=16000/100) speech samples to represent 100Hz of voice correctly • E.g. 2) Averagely 6,000 speech samples to represent single word1 Solution • Put the speech samples as an input of dilated causal convolution layer • Stack the dilated causal convolution layer • Result in exponentially growing receptive field 9 : 1 1 ( ) ( | ) N n n R n n p p x - - = = Õx x [Dilated causal convolution] 1 Statistics based on the average speaking rate of a set of TED talk speakers http://sixminutes.dlugan.com/speaking-rate/ Effective embedding method of long receptive field is required
  • 10. BASIC OF WAVENET WaveNet without dilation 10 [WaveNet without dilated convolution] Receptive field: # layer - 1
  • 11. BASIC OF WAVENET WaveNet with dilation 11 [WaveNet with dilated convolution] Receptive field: 𝟐# layer − 𝟏
  • 12. BASIC OF WAVENET 12 Multiple stack of residual blocks 1. Dilated causal convolution • Exponentially increase the receptive field 2. Gated activation • Impose non-linearity on the model • Enable conditional WaveNet 3. Residual / skip connections • Speed up the convergence • Enable deep layered model training ( ) ( )tanh * * sigmoid * *f f g gW V W V= + +z x h x h! 1 0 [ , ] [ ] [ ] M k y d n h k x n d k - = = - ×å [Basic WaveNet architecture] : input of activation : conditional vector : output of activation x h z
  • 13. SOFTMAX-WAVENET Use WaveNet as multinomial logistic regression model [4] • 𝜇-law companding for evenly distributed speech sample • 8-bit one-hot encoding • 256 symbols Estimate each symbol using WaveNet • Predict the sample by SoftMax distribution Optimize network by cross-entropy (CE) loss • Minimize the probabilistic distance between 𝑝& and 𝑞& 13 ( )ln 1 ( ) , =255 ln(1 ) x y sign x µ µ µ + = × + 8 ( )bitp OneHot y= exp( ) exp( ) q n n q i i z q z = å [ ]logn n n L p q= -å ( | )q nWaveNet <=z q h [Distribution of speech samples] [SoftMax-WaveNet]
  • 14. SOFTMAX-WAVENET Limitation by the usage of 8-bit quantization • Noisy synthetic speech due to insufficient number of quantization bits Intuitive solution • Expand the SoftMax dimension to 65,536 corresponds to 16-bit quantization à High computational cost & difficult to train Mixture density network (MDN)-based solution [5] • Train the WaveNet to predict the parameter of pre-defined speech distribution 14
  • 15. MDN-WAVENET Define the distribution of waveform sample as parameterized form [5] • Discretized mixture of logistic (MoL) distribution • Discretized logistic mixture with 16bit quantization (Δ = 1/2-. ) Estimate mixture parameters by WaveNet Optimize network by negative log-likelihood (NLL) loss 15 : logistic functions [ ]log ( | , )n n n L p x x<= -å h 1 / 2 / 2 ( ) N n n n n n n x x p x s s µ µ p s s = é ùæ ö æ ö+ D - - D - = × -ê úç ÷ ç ÷ è ø è øë û å [ , , ] ( , )s nWaveNetp µ <=z z z x h softmax( ) exp( )s p µ = = = π z µ z s z , for unity-summed mixture gain , for positive value of mixture scale Mixture density network [MDN-WaveNet]
  • 16. MDN-WAVENET Higher quality than SoftMax-WaveNet • Enable to model the speech signal by 16-bit • Overcome difficulty of spectrum modeling Difficulty of WaveNet training • Due to increased quantization bit from 8-bit to 16-bit Solution based on the human speech production model [6] • Model the vocal source signal, whose physical behavior is much simpler than the speech signal 16 Mixture density network [Spectrogram comparison] [Loss comparison] SoftMax WaveNet MDN WaveNet Recorded
  • 17. SPEECH PRODUCTION MODEL Concept • Modeling the speech as the filtered output of vocal source signal to vocal tract filter Methodology: Linear prediction (LP) approach • Define speech signal as linear combination of past speech samples 17 [ ]( ) ( ) ( ) ( ) ( )S z G z V z R z E z= × × × [Speech production model] [Spectral deconvolution by LP analysis] 1 P n i n i n i s s ea - = = +å 1 1 ( ) ( ) ( ), where ( ) 1 p i ii S z H z E z H z za - = = × = - å Spectrum part = LP coefficients Excitation part = Residual signal of LP analysis Speech = [vocal fold × vocal tract × lip radiation] × excitation
  • 18. EXCITNET Model the excitation signal by WaveNet, instead of speech signal [6] Training stage • Extract excitation signal by linear prediction (LP) analysis • Periodically updated filter coefficients matched to frame rate of acoustic feature • Train WaveNet to model excitation signal Synthesis stage • Generated excitation signal by WaveNet • Synthesize speech signal by LP synthesis filtering 18 1 P n n i n i i e s sa - = = - å 1 p n i n i n i s s ea - = = +å [ExcitNet] Recorded Excitation LP synthesis
  • 19. EXCITNET High quality of synthesized speech even though 8-bit quantization is used • Overcome the difficulty of spectrum modeling by using external spectral shaping filter Limitation • Independent modeling of excitation and spectrum parts results in unpredictable prediction error 19 1 1 ( ) ( ) 1 p i i i S z E z za - = = × - å Contains error Contains error Objective Represent LP synthesis process during WaveNet training / generation Recorded SoftMax WaveNet ExcitNet
  • 20. LP-WAVENET Motivation from the assumption of WaveNet vocoder 1. Previous speech samples, 𝐱3&, are given 2. LP coefficients, {𝛼6}, are given Probabilistic analysis • Difference between the random variables 𝑋& and 𝐸& is only a constant value of 𝑥;& Assume the discretized MoL distributed speech • Shifting property of 2nd order random variable 20 1 ˆ p n i n i i x xa - = = åTheir linear combination, , are also given ( | , )n np e <x h ˆnx ( | , )n np x <x h [Difference of distributions between speech and excitation] ˆ x e i i x e i i n x e i i x s s p p µ µ = = + = Only mean parameters are different Linear prediction ˆ| ( , ) ( ) | ( , ) ˆ| ( , ) n n n n n n n n X E x E x < < < = + = + x h x h x h ˆn n nx e x= +
  • 21. LP-WAVENET Utilize the causality of WaveNet and the linearity of LP synthesis processes [7] 21 ( ), , ,s nWaveNetp µ < é ù =ë ûz z z x h 1 ( | , ) / 2 / 2 n n N i i i i i i p x x x s s µ µ p s s < = = é ùæ ö æ ö+ D - - D - × -ê úç ÷ ç ÷ è ø è øë û å x h 1. Mixture parameter prediction 2. Mixture parameter refinement 3. Discretized MoL loss calculation [ ]log ( | , )n n n E p x <= -å x h softmax( ) ˆ exp( ) n s x p µ = = + = π z µ z s z [LP-WaveNet] Linear prediction Recorded ExcitNet LP-WaveNet
  • 22. TUNING OF WAVENET 1. Solution to waveform divergence problem 2. Waveform generation methods
  • 23. WAVEFORM DIVERGENCE PROBLEM Waveform divergence problem • Waveform divergence by a stacked error during WaveNet’s autoregressive generation Cause – Overfitting on the silence region • Unique solution in silence region • Easier to be happen when the portion of silence region in training set is larger • Be sensitive to tiny error in silence region during waveform generation Solution – Noise injection • Inject negligible amount of noise • Allowing only 1 bit error • Increase robustness to the prediction error in silence region 23 ( ),curr prevx WaveNet= x h ( )0 , silWaveNet= 0 h 16 ˆ , where = 2 2e e= + ×x x n 0 0 0 1 1 1 2 2 2 ~ ( | ) ~ ( | ) ~ ( | ) ~ ( | )n n n x p x x p x x p x x p x < < < < x x x x ! 0err 0 1err err+ i i errå ! [Error stacking during waveform generation]
  • 24. WAVEFORM GENERATION METHODS Random sampling Argmax sampling Greedy sampling [8] Mode sampling [7] • Sharper distribution on voiced region 24 ˆ ( ) ~ ( ( ) | ( ), )randx n p x n x k n< h max ˆ ˆ ˆ( ) ( ) (1 ) ( )greedy randx n vuv x n vuv x n= × + - × max ( ) ˆ ( ) arg max ( ( ) | ( ), ) x n x n p x n x k n= < h , ˆ ˆ ˆ( ) ( ) (1 ) ( )mode rand nar randx n vuv x n vuv x n= × + - × 1 ˆ ( ) ~ ( , ) I rand i i i i x n DistLogic sp µ = å , 1 ˆ ( ) ~ , I i rand narr i i i s x n DistLogic c p µ = æ ö ç ÷ è ø å [Random & argmax sampling] [Scale parameter control in mode sampling] randx maxx Random sampling Recorded Argmax sampling Greedy sampling Mode sampling
  • 26. EXPERIMENTS Network architecture Systems • WN<: MDN-WaveNet that models the speech signal • WN=: MDN-WaveNet that models the excitation signal • WN>?: Proposed LP-WaveNet 26 Database Korean male: MBC YBS database Training / validation / Test 2,500 (~3.2h) / 200 / 200 Minibatch size 4 GPU x 20000 samples Dilation 3 * [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] Layer 30 Receptive field 3070 samples Residual chn. 128 Skip chn. 128 Quantization 65536 uniform Number of mixture 10 mixture components --> 30 output channels Conditioning features Voicing flag, LSF, F0 (log), Eng (log), BAP Normalization Gaussian normalization Learning rate 1.00E-04 Initialization / optimizer Xavier / Adam Sample generation method Mode sampling
  • 28. OBJECTIVE EVALUATION Performance measurements • VUV: Voicing error rate (%) • F0 RMSE: F0 root mean square error (Hz) in voiced region • LSD: Log-spectral distance of AR spectrum (dB) • F-LSD: LSD of synthesized speech in frequency domain (dB) Results 28 A/S: analysis / synthesis SPSS: synthesis by predicted acoustic features
  • 29. SUBJECTIVE EVALUATION Mean opinion score (MOS) test • 20 random synthesized utterances from test set • 12 native Korean speakers • Include STRAIGHT (STR)-based speech synthesis system as baseline [9] Results 29 A/S: analysis / synthesis SPSS: synthesis by predicted acoustic features
  • 30. SAMPLES Recorded STRAIGHT – A/S WN< – A/S WN= – A/S WN>? – A/S 30 STRAIGHT – SPSS WN< – SPSS WN= – SPSS WN>? – SPSS A/S: analysis / synthesis SPSS: synthesis by predicted acoustic features
  • 31. SUMMARY & CONCLUSION Investigated various types of WaveNet vocoder • SoftMax-WaveNet • MDN-WaveNet • ExcitNet Proposed an LP-WaveNet • Explicitly represent linear prediction structure of speech waveform into WaveNet framework Introduced tuning methods of WaveNet training / generation • Noise injection • Sample generation methods Performed experiment in objective and subjective manner • Achieve faster convergence speed than conventional WaveNet vocoders, whereas keep similar speech quality 31
  • 32. REFERENCES [1] A. J. Hunt and A. W. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in Proc. ICASSP, 1996 [2] H. Zen, et. al. “Statistical parametric speech synthesis using deep neural networks,” in Proc. ICASSP, 2013. [3] A. van den Oord, et. al.,“WaveNet: A generative model for raw audio generation,” in CoRR, 2016. [4] A.Tamamori, et. al., “Speaker-dependent wavenet vocoder,” in INTERSPEECH, 2017. [5] A. van den Oord, et. al., "Parallel WaveNet: Fast High-fidelity speech synthesis," in CoRR, 2017 [6] E. Song, K. Byun, and H.-G. Kang, “A neural excitation vocoder for statistical parametric speech synthesis systems,” in Arxiv, 2018. [7] M. Hwang, F. Soong, F. Xie, X. Wang, and H. Kang, “LP-WaveNet: Linear Prediction-based WaveNet Speech Synthesis,” Submitted to Proc. ICASSP, 2019. [8] W. Xin, et. al., “A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis,” in Proc. ICASSP, 2018. [9] H. Kawahara, “STRAIGHT, exploration of the other aspect of vocoder: Perceptually isomorphic decomposition of speech sounds,” Acoustical Science and Technology, 2006. 32