Invited Talk at IEEE SLT 2021
Title: "Recent progress on voice conversion: What is next?"
Speaker: Tomoki Toda
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
8. Series of VCCs
http://www.vc-challenge.org/
• Basic task: speaker conversion [Abe; '90]
  • Convert a source speaker's voice into a specific target speaker's voice
1st VCC (VCC2016)
• Parallel training
2nd VCC (VCC2018)
• Parallel training
• Non-parallel training
3rd VCC (VCC2020)
• Semi-parallel training
• Non-parallel training across different languages (cross-lingual VC)
12. Overall Result of VCC2016 Listening Tests
[Scatter plot: correct rate [%] on speaker similarity (0-100) vs. mean opinion score (MOS) on naturalness (1-5) for the target, the source, the baseline, and submitted systems A-Q; better systems sit toward the upper right. The J system [Kobayashi; '16] reached about MOS = 3.5 with a 75% correct rate.]
• 17 submitted systems
• 1 baseline (GMM [Toda; '07] with a vocoder freely available in 2004)
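The listening-test scores above come from aggregating many listener ratings. As a minimal sketch of that aggregation (synthetic ratings and a normal-approximation confidence interval; the actual VCC analysis protocol is not specified here), in pure Python:

```python
import math

def mean_opinion_score(ratings):
    """Mean opinion score (MOS) and a 95% confidence half-width
    from a list of 1-5 listener ratings (normal approximation)."""
    n = len(ratings)
    mos = sum(ratings) / n
    if n < 2:
        return mos, 0.0
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)
    return mos, half_width

# Hypothetical ratings for one converted-speech system
ratings = [4, 3, 4, 5, 3, 4, 4, 2, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")  # MOS = 3.60 +/- 0.52
```

Speaker-similarity "correct rate" is even simpler: the fraction of same/different judgments in which listeners judged the converted sample to be the target speaker.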
17. Overall Results of VCC2018 Listening Tests
[Two scatter plots, one per task: similarity score [%] (0-100) vs. MOS on naturalness (1-5).]
Parallel training task
• 23 submitted systems
• 1 baseline [Kobayashi; '18b]
• Top systems: N10 [Liu; '18] and N17 [Tobing; '18]
Nonparallel training task
• 11 submitted systems
• 1 baseline [Kobayashi; '18b]
• Top systems: N10 [Liu; '18] and N17 [Wu; '18]
18. Findings through VCC2018
• Effectiveness of waveform generation using WaveNet [van den Oord; '16] as a data-driven vocoder, a "neural vocoder" (in the top 2 systems, N10 & N17)
• Effectiveness of alignment-free training [Sun; '16] based on a reconstruction process using a pretrained encoder (in the top system, N10)
[Diagram: input speech → analysis → input features → feature conversion → synthesis w/ neural vocoder (trained on speech data) → converted speech. The conversion module uses an encoder pretrained on external ASR data to extract speaker-independent features, and a speaker-aware decoder that combines them with speaker information to produce reconstructed features.]
21. Trend in VCC2020 Systems
• Speech waveform generation using a neural vocoder
• Feature conversion w/ an encoder-decoder capable of alignment-free training
[Diagram: input speech → encoder → decoder (conditioned on speaker information) → neural vocoder → converted speech; components leverage external ASR data, external speech data, and external TTS data.]
Frame-to-frame encoder-decoder
• Autoencoder (baseline)
• Phonetic posteriorgram (PPG, baseline)
Sequence-to-sequence encoder-decoder
• Text representation (ASR+TTS, baseline)
Neural vocoders (baselines)
• Autoregressive (AR) model (WaveNet, WaveRNN, …)
• Non-AR model (Parallel WaveGAN, …)
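The AR vs. non-AR vocoder distinction comes down to the dependency structure of generation: an AR model produces each sample conditioned on previously generated samples (sequential, like WaveNet), while a non-AR model transforms an i.i.d. noise sequence in one shot (parallelizable, like Parallel WaveGAN). The toy sketch below shows only that dependency structure, not a real vocoder; the function names and the "transform" are invented for illustration.

```python
import random

def ar_generate(n, coef=0.9, seed=0):
    """AR-style generation: each sample depends on the previous output,
    so the loop cannot be parallelized across time steps."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = coef * x + 0.1 * rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def non_ar_generate(n, seed=0):
    """Non-AR-style generation: draw all noise up front, then apply a
    feed-forward transform (here a toy moving average) to every
    position independently of previously generated outputs."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [0.1 * sum(noise[max(0, i - 2):i + 1]) / 3 for i in range(n)]

print(len(ar_generate(16)), len(non_ar_generate(16)))  # 16 16
```

This structural difference is why non-AR vocoders synthesize much faster on parallel hardware, while (per the VCC2020 finding below) AR models still tended to win on quality.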
22. Overall Results of VCC2020 Listening Tests
Semi-parallel training task
• 28 submitted systems, 3 baseline systems
• Top system: ASR+TTS+WN [Zhang; '20]
• Baselines: T11 (VCC2018 top system), T16 (CycleVAE+PWG [Tobing; '20]), T22 (ASR+TTS+PWG [Huang; '20])
Cross-lingual VC task
• 25 submitted systems, 3 baseline systems
• Top system: PPG+WN [Liu; '20]
• Baselines: T11 (VCC2018 top system), T16 (CycleVAE+PWG), T22 (ASR+TTS+PWG)
23. Findings through VCC2020
• Some systems outperform the VCC2018 top system!
• Good performance of alignment-free training
  • Text representation works very well in intra-lingual VC.
    • Conversion of long-term features is well handled.
  • PPG works well in cross-lingual VC.
    • Leaving long-term features unconverted is imperfect but still reasonable.
• Good performance of neural vocoders
  • Autoregressive models tend to outperform non-autoregressive models.
• Current performance of intra-lingual VC
  • Speaker similarity is comparable with target recorded speech!
  • Naturalness is still distinguishable from target recorded speech.
• Current performance of cross-lingual VC
  • More difficult than intra-lingual VC
  • More difficult to judge speaker similarity between different languages
24. Summary of Recent Progress of VC Techniques (Summary of VCCs)
The overall trend is from simplified toward more complex, more flexible, and data-driven components:
• Analysis: from parametric decomposition w/ a high-quality vocoder (excitation parameters, spectral parameters) to no decomposition (a raw representation such as the mel-spectrogram)
• Conversion, mapping: from frame-to-frame mapping (parametric probability models, neural regression) to sequence-to-sequence mapping (encoder-decoder w/ attention, long-term feature modeling)
• Conversion, training: from supervised parallel training (regression using time-aligned features) to unsupervised (alignment-free) nonparallel training (reconstruction using speaker-independent features, pretrained models)
• Synthesis: from a high-quality vocoder (signal processing based on the source-filter model) to waveform modeling w/ a neural vocoder (AR and non-AR models)
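The "parametric probability model" end of the mapping axis can be sketched in its simplest form: a single joint Gaussian over time-aligned (source, target) feature pairs, converted by the conditional-mean (MMSE) rule. This is the one-component case of joint-density GMM regression in the style of [Toda; '07], reduced to 1-D features; the function names and toy data are mine.

```python
def fit_joint_gaussian(xs, ys):
    """Fit a single joint Gaussian over time-aligned (source, target)
    1-D feature pairs: means, source variance, and covariance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return mx, my, sxx, sxy

def convert(x, params):
    """MMSE mapping: E[y|x] = mu_y + (sigma_xy / sigma_xx) * (x - mu_x)."""
    mx, my, sxx, sxy = params
    return my + (sxy / sxx) * (x - mx)

# Toy aligned 1-D features where the target follows y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
params = fit_joint_gaussian(xs, ys)
print(convert(1.5, params))  # -> 4.0
```

A real GMM mixes many such components so the mapping is piecewise-linear rather than globally linear, and the "time-aligned" pairs on the supervised side of the training axis are typically obtained with dynamic time warping over parallel utterances.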
26. Future Direction 1: Make VC Techniques Better and Better!
• Further performance improvements
  • Higher quality and more controllable conversion
  • End-to-end and sequence-to-sequence VC
  • Unsupervised factorization
• Flexible use of huge amounts of existing speech data
  • Pretraining and transfer learning
  • Self-supervised training using unlabeled data
• More varieties of conversion
  • Beyond speaker conversion: from who speaks to how to speak
  • Common representation over arbitrary languages
  • Better embeddings of long-term features
  • Conversion of pronunciation, speaking styles, emotion, ...
→ Sequence-to-sequence VC + unsupervised learning w/o any labels
27. Future Direction 2: Convert Higher Level Features!
• Towards a new definition of VC: "VC is a technique to modify speech waveform to convert non-/para-linguistic information while preserving linguistic information," where "non-/para-linguistic" broadens to "arbitrary" and "linguistic" to "conceptual".
[Diagram: speech is decomposed into conceptual, linguistic, para-linguistic, and non-linguistic information. A higher-level encoder extracts short-term features, long-term features, word-level features, and higher-level embeddings (higher than a text representation; the word representation is also converted!), and a deeper decoder combines them with target information to generate converted speech.]
→ Sequence-to-sequence VC using higher-level embeddings
28. Future Direction 3: Develop (Interactive) VC Applications
• Augment speech production by using low-latency real-time VC [Toda; '14]
  • From speech & singing voices produced by our existing functions to speech & singing voices produced beyond physical constraints (soft constraints on speech production) [JST CREST CoAugmentation project]
  • Involuntary control of system behavior to avoid physically impossible output!
• Towards "interactive VC": cooperatively working functions between user and system built through additional input signals (e.g., gesture) and interaction
  • Instantaneous feedback of system behavior
  • Intentional control of system behavior as we want!
  • An interactive VC framework to understand the behaviors of a data-driven system through interaction
31. Summary
• What was Voice Conversion (VC)?
  • A technique to convert non-/para-linguistic information
  • Many useful applications, but to be recognized as a "kitchen knife" (a tool that can also be misused)
• Review of recent trends in VC research
  • Significant progress through the recent Voice Conversion Challenges (VCCs)
  • Neural vocoders and sequence-to-sequence mapping
  • Alignment-free training capable of handling unsupervised training
• Possible future directions of VC research
  • Further performance improvements
  • From linguistically unchanged to conceptually unchanged
  • Interactive VC to augment our speech production
Let's start VC research and develop useful and helpful applications!
33. Resources
• Freely available VCC2020 baseline systems
  • Sequence-to-sequence VC based on cascaded ASR + TTS w/ ESPnet
    • https://bit.ly/2RVwAVk [W.-C. Huang]
  • Frame-wise VC based on CycleVAE + Parallel WaveGAN
    • https://bit.ly/369AXUK [P.L. Tobing, Y.-C. Wu]
• Tutorial materials at INTERSPEECH 2019 [T. Toda, K. Kobayashi, T. Hayashi]
  • https://bit.ly/328LwSS
  • Lecture slides and hands-on material: development of VC w/ a WaveNet vocoder on a Google Colab notebook
• Summer school materials at SPCC 2018 (& 2019)
  • Lecture slides on "Advanced Voice Conversion"
    • https://bit.ly/2PpWEYx
    • More details on recent progress of VC techniques
  • Hands-on slides
    • https://bit.ly/2pmwuLC
    • More details on sprocket, used to develop the VCC2018 baseline system