1. Voice Conversion with Neural-based Speech Generation Model
Yi-Chiao Wu
Toda Lab, Nagoya University
2. About me
https://bigpon.github.io/
• Education
• Nagoya University, informatics (current)
• Topics: voice conversion, speech synthesis
• National Chiao Tung University, communication
engineering (MS, BS)
• Topic: speaker verification
• Work experience
• Academia Sinica, research assistant [2015–2017]
• Topic: voice conversion, speech enhancement
• Asus, software R&D [2013-2015]
• Topic: speaker verification
• Realtek, system designer [2012-2013]
3. Voice conversion
• Changing the speaker identity of speech
• Keeping the linguistic content consistent
[Diagram: source speaker A's speech → speech analysis → feature conversion (source to target) → speech synthesis → target speaker B's speech]
4. Conventional speech synthesis
• Vocoder: voice coder
• Encoder (analyzer)
• decomposing speech into acoustic features
• Decoder (synthesizer)
• synthesizing speech from acoustic features
[Diagram: vocoder = encoder (speech analysis: speech → spectral and prosodic features) + decoder (speech synthesis: features → speech)]
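The encoder/decoder split above can be sketched with a plain STFT analysis/synthesis pair — a minimal numpy stand-in for a real vocoder (which would extract spectral-envelope and prosodic features rather than raw spectra); the function names and frame settings here are illustrative only:

```python
import numpy as np

def analyze(x, frame_len=256, hop=64):
    """Encoder: decompose speech into frame-wise complex spectra."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # the "acoustic features"

def synthesize(spectra, frame_len=256, hop=64):
    """Decoder: inverse-transform each frame and overlap-add."""
    win = np.hanning(frame_len)
    n_frames = spectra.shape[0]
    out = np.zeros((n_frames - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    frames = np.fft.irfft(spectra, n=frame_len, axis=-1)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + frame_len] += f * win
        norm[i * hop : i * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)

# Round trip: a 100 Hz sine at 16 kHz survives analysis + synthesis.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t)
y = synthesize(analyze(x))
err = np.max(np.abs(x[256:-256] - y[256:-256]))  # interior reconstruction error
```

The point of the sketch is only the round trip: whatever features the encoder produces, the decoder must be able to map them back to a waveform.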
5. Neural-based speech generation
• Neural-based decoder (synthesizer)
• Input: acoustic features
• Output: speech samples
[Diagram: spectral and prosodic features → neural-based speech generation → speech]
6. Research overview
1. Build a baseline VC system with a neural-based generation model (done)
2. Improve the robustness (done)
3. Improve the flexibility (done)
4. Build a real-time system (on-going)
7. VC with neural-generation model
baseline system
• Source-to-target feature conversion
• Speech generation from converted features
• Voice conversion challenge 2018 system
• Y.-C. Wu et al. “The NU non-parallel voice conversion system for
the voice conversion challenge 2018,” Proc. Odyssey, 2018.
[Diagram: source features → feature conversion → converted target features → generation model → converted target speech]
8. Collapsed speech
• Collapsed speech → refined speech
• Waveform shape constraint
• Y.-C. Wu et al. “Collapsed speech generation and suppression for WaveNet vocoder,” Proc. Interspeech, 2018.
• Y.-C. Wu et al. “Non-parallel voice conversion system with WaveNet vocoder and collapsed speech suppression,” IEEE Access, 2020.
9. Acoustic Mismatch
• Mismatch between the training and testing stages
• Speech quality degradation
• Noisy generated speech (collapsed speech)
[Diagram: training feeds target features to the generation model; testing feeds converted target features, creating a mismatch between the two]
10. Temporal Mismatch
• TTS postfilter
• Source: artificial speech; target: natural speech
• Source and target have the same data length, but the
temporal structures are different
• Y.-C. Wu et al. “A cyclical post-filtering approach to mismatch
refinement of neural vocoder for text-to-speech systems,” Submitted
to Interspeech, 2020.
[Diagram: training feeds source features to the generation model, with a temporal mismatch between source and target speech]
11. WaveNet [A. Oord+, 2016]
• Audio signals have very long-term dependencies
• Basic RNNs cannot model such long-term correlations
• Stacked CNN layers
• Input: a segment of previous samples (receptive field)
• Output: the conditional probability
of the current sample
P(y_n | y_{n-r}, …, y_{n-1})
[Diagram: receptive-field samples y0, y1, y2, y3 condition the prediction p(y4)]
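The factorization P(y_n | y_{n-r}, …, y_{n-1}) implies sample-by-sample generation. A toy sketch of that loop, with a hypothetical 2-tap linear predictor standing in for the neural network:

```python
import numpy as np

def generate(model, seed, n_samples, receptive_field):
    """Autoregressive generation: each new sample is conditioned
    only on the previous `receptive_field` samples."""
    y = list(seed)
    for _ in range(n_samples):
        context = np.array(y[-receptive_field:])
        y.append(model(context))
    return np.array(y)

# Hypothetical stand-in "model": the recurrence
# y[n] = 2*cos(w)*y[n-1] - y[n-2] exactly continues sin(w*n).
w = 2 * np.pi * 0.01
model = lambda ctx: 2 * np.cos(w) * ctx[-1] - ctx[-2]

seed = np.sin(w * np.arange(2))          # two seed samples
y = generate(model, seed, 500, receptive_field=2)
ref = np.sin(w * np.arange(502))         # the sinusoid it should continue
```

A real WaveNet replaces the two-sample linear rule with a deep network over thousands of past samples, but the generation loop is the same — which is also why autoregressive synthesis is slow.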
12. Dilated CNN [F. Yu+, 2016]
• Convolution with holes: kernel taps skip intermediate samples
• Efficiently extends the receptive field
• The downsampling-like structure lets the network capture information at different levels
[Diagram: stacked layers with dilation sizes 1, 2, and 4 expanding the receptive field]
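With kernel size k, stacking layers whose dilations double gives a receptive field of 1 + Σ (k − 1)·d — exponential growth in depth. A quick check (the 10-layer × 3-stack configuration below is a typical WaveNet setting, not taken from the slides):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked causal dilated convolutions:
    1 + sum over layers of (kernel_size - 1) * dilation."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Dilations 1, 2, 4 as in the diagram (kernel size 2):
print(receptive_field(2, [1, 2, 4]))          # 8 samples
# A WaveNet-style stack: 3 repeats of dilations 1, 2, ..., 512.
wavenet_dilations = [2 ** i for i in range(10)] * 3
print(receptive_field(2, wavenet_dilations))  # 3070 samples
```

Three thousand samples is roughly 0.19 s at 16 kHz, which is what makes dilation practical where plain convolutions would need thousands of layers.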
13. Is WaveNet vocoder suitable for
speech generation?
• Speech signal is a quasi-periodic signal
• Periodic part: long-term correlation
• Non-periodic part: short-term correlation
• WaveNet
• Fixed network architecture
• Without prior knowledge of speech signal
• Problems of WaveNet as a vocoder
• Inefficient speech signal modeling
• Limited pitch controllability
14. Quasi-Periodic WaveNet
• Pitch-dependent dilated convolution
• Dynamically change the dilation size
• Model the periodic part with prior F0 knowledge (long-
term correlations)
• Cascaded network
• Fixed modules model the non-periodic part with the
nearest samples (short-term correlations)
• Adaptive modules model the periodic part
• Y.-C. Wu et al. “Quasi-Periodic WaveNet vocoder: a pitch dependent dilated convolution model for parametric speech generation,” Proc. Interspeech, 2019.
15. Pitch-dependent dilated
convolution
• Pitch-dependent dilation factor: Et = Fs / (F0,t × a), where Fs is the sampling rate, F0,t the fundamental frequency at sample t, and a the dense factor
[Diagram: fixed dilated convolution (dilation sizes 1 and 2) vs. pitch-dependent dilated convolution (dilation sizes 1×E_T and 2×E_T); the per-sample pitch-dependent dilation factors (e.g. E_t = 2, E_{t-1} = 3) keep the effective receptive field aligned with pitch cycles]
16. Effective receptive field
• Fixed number of samples in a receptive field
• Fixed number of samples in one cycle
• The same number of past cycles in an effective receptive field for arbitrary F0
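A sketch of the idea, assuming Fs = 16 kHz and a dense factor a = 4 (illustrative values, not from the slides): the per-sample factor E_t = Fs / (F0,t × a) stretches the dilations so the effective receptive field spans a near-constant number of pitch cycles at any F0.

```python
import numpy as np

def dilation_factors(f0, fs=16000, dense=4):
    """Per-sample pitch-dependent dilation factor E_t = Fs / (F0_t * a);
    `dense` plays the role of the dense factor a (an assumed value)."""
    return np.round(fs / (f0 * dense)).astype(int)

def cycles_covered(f0, fs, dense, base_dilations=(1, 2, 4), kernel=2):
    """Pitch cycles spanned by the effective receptive field at pitch F0."""
    e = fs / (f0 * dense)
    field = 1 + sum((kernel - 1) * d * e for d in base_dilations)
    samples_per_cycle = fs / f0
    return field / samples_per_cycle

# At 100 Hz one cycle is 160 samples; at 250 Hz only 64. Yet the
# effective receptive field covers almost the same number of cycles:
low  = cycles_covered(100, 16000, 4)
high = cycles_covered(250, 16000, 4)
```

So where a fixed-dilation stack sees many cycles of high-pitched speech but few of low-pitched speech, the pitch-adaptive dilations normalize the context to pitch periods.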
18. Parallel WaveGAN [R. Yamamoto+, 2020]
• GAN structure for speech waveform generation
• WaveNet-like generator
• Non-autoregressive and non-causal
• Fast (RTF: 0.02 with one Titan V)
• Compact (3% of WaveNet)
[Diagram: training — the generator (G) maps Gaussian noise and acoustic features to generated speech; the discriminator (D) compares it with natural speech, giving L_D and L_adv, while a multi-resolution STFT loss gives L_sp; the generator loss L_G combines L_sp with λ_adv-weighted L_adv; synthesis uses only G]
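The multi-resolution STFT loss L_sp can be sketched in numpy as the average, over several (FFT size, hop) pairs, of a spectral-convergence term and a log-magnitude L1 term; the resolutions below are common choices, not taken from the slide.

```python
import numpy as np

def stft_mag(x, fft_size, hop):
    """Magnitude STFT via framed rFFT (small epsilon keeps log finite)."""
    win = np.hanning(fft_size)
    n = 1 + (len(x) - fft_size) // hop
    frames = np.stack([x[i * hop : i * hop + fft_size] * win for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=-1)) + 1e-7

def multi_res_stft_loss(x, y,
                        resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average spectral-convergence + log-magnitude L1 loss
    over several (fft_size, hop) resolutions."""
    total = 0.0
    for fft_size, hop in resolutions:
        X, Y = stft_mag(x, fft_size, hop), stft_mag(y, fft_size, hop)
        sc = np.linalg.norm(X - Y) / np.linalg.norm(X)   # spectral convergence
        mag = np.mean(np.abs(np.log(X) - np.log(Y)))     # log STFT magnitude L1
        total += sc + mag
    return total / len(resolutions)

# Zero for identical signals, positive for mismatched ones.
rng = np.random.default_rng(0)
x = rng.standard_normal(8000)
```

Comparing magnitudes at several resolutions penalizes both fine spectral detail and coarse envelope errors, which is why a non-autoregressive generator can be trained well with this term alongside the adversarial loss.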
19. Quasi-Periodic Parallel WaveGAN
• Pitch-dependent
dilated convolution
• Cascaded network
[Diagram: QPPWG generator — Gaussian noise and upsampled acoustic features pass through cascaded macroblocks of adaptive and fixed residual blocks (3×1 dilated convolution, gated activation, 1×1 convolutions, skip connections) to produce the generated speech]
• Y.-C. Wu et al. “Quasi-Periodic Parallel WaveGAN vocoder: a non-autoregressive pitch-dependent dilated convolution model for parametric speech generation,” Submitted to Interspeech, 2020.
20. Other works
• Voice conversion (VC)
• Exemplar VC w/ LLE [Y.-C. Wu+, 2016]
• Variational AutoEncoder (VAE) [C.-C. Hsu+, 2016]
• VAE w/ WGAN [C.-C. Hsu+, 2016]
• CycleVC [P. L. Tobing+, 2019]
• Seq2Seq Transformer VC [W.-C. Huang+, 2020]
• Speech enhancement (SE)
• Exemplar VC w/ LLE for SE [Y.-C. Wu+, 2017]
21. Thank you for your attention!
https://bigpon.github.io/QuasiPeriodicParallelWaveGAN_demo/
https://bigpon.github.io/QuasiPeriodicWaveNet_demo/
https://bigpon.github.io/LpcConstrainedWaveNet_demo/