Invited Talk at IEEE SLT 2021
Title: "Recent progress on voice conversion: What is next?"
Speaker: Tomoki Toda
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
8. Series of VCCs
http://www.vc-challenge.org/
• Basic task: speaker conversion [Abe; '90]
  • Convert a source speaker's voice into a specific target speaker's voice
1st VCC (VCC2016)
• Parallel training
2nd VCC (VCC2018)
• Parallel training
• Non-parallel training
3rd VCC (VCC2020)
• Semi-parallel training
• Non-parallel training across different languages (cross-lingual VC)
12. Overall Result of VCC2016 Listening Tests
[Scatter plot: correct rate [%] on speaker similarity (0-100) vs. mean opinion score (MOS) on naturalness (1-5) for the target, the source, the baseline, and submitted systems A-Q; better systems sit toward the upper right. The J system [Kobayashi; '16] reached about MOS = 3.5 with a 75% correct rate.]
• 17 submitted systems
• 1 baseline (GMM [Toda; '07] with a vocoder freely available in 2004)
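The listening-test scores above come from aggregating many listener ratings. As a minimal sketch of that aggregation (synthetic ratings and a normal-approximation confidence interval; the actual VCC analysis protocol is not specified here), in pure Python:

```python
import math

def mean_opinion_score(ratings):
    """Mean opinion score (MOS) and a 95% confidence half-width
    from a list of 1-5 listener ratings (normal approximation)."""
    n = len(ratings)
    mos = sum(ratings) / n
    if n < 2:
        return mos, 0.0
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)
    return mos, half_width

# Hypothetical ratings for one converted-speech system
ratings = [4, 3, 4, 5, 3, 4, 4, 2, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")  # MOS = 3.60 +/- 0.52
```

Speaker-similarity "correct rate" is even simpler: the fraction of same/different judgments in which listeners judged the converted sample to be the target speaker.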
17. Overall Results of VCC2018 Listening Tests
[Two scatter plots, one per task: similarity score [%] (0-100) vs. MOS on naturalness (1-5).]
Parallel training task
• 23 submitted systems
• 1 baseline [Kobayashi; '18b]
• Top systems: N10 [Liu; '18] and N17 [Tobing; '18]
Nonparallel training task
• 11 submitted systems
• 1 baseline [Kobayashi; '18b]
• Top systems: N10 [Liu; '18] and N17 [Wu; '18]
18. Findings through VCC2018
• Effectiveness of waveform generation using WaveNet [van den Oord; '16] as a data-driven vocoder, a "neural vocoder" (in the top 2 systems, N10 & N17)
• Effectiveness of alignment-free training [Sun; '16] based on a reconstruction process using a pretrained encoder (in the top system, N10)
[Diagram: input speech → analysis → input features → feature conversion → synthesis w/ neural vocoder (trained on speech data) → converted speech. The conversion module uses an encoder pretrained on external ASR data to extract speaker-independent features, and a speaker-aware decoder that combines them with speaker information to produce reconstructed features.]
21. Trend in VCC2020 Systems
• Speech waveform generation using a neural vocoder
• Feature conversion w/ an encoder-decoder capable of alignment-free training
[Diagram: input speech → encoder → decoder (conditioned on speaker information) → neural vocoder → converted speech; components leverage external ASR data, external speech data, and external TTS data.]
Frame-to-frame encoder-decoder
• Autoencoder (baseline)
• Phonetic posteriorgram (PPG, baseline)
Sequence-to-sequence encoder-decoder
• Text representation (ASR+TTS, baseline)
Neural vocoders (baselines)
• Autoregressive (AR) model (WaveNet, WaveRNN, …)
• Non-AR model (Parallel WaveGAN, …)
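The AR vs. non-AR vocoder distinction comes down to the dependency structure of generation: an AR model produces each sample conditioned on previously generated samples (sequential, like WaveNet), while a non-AR model transforms an i.i.d. noise sequence in one shot (parallelizable, like Parallel WaveGAN). The toy sketch below shows only that dependency structure, not a real vocoder; the function names and the "transform" are invented for illustration.

```python
import random

def ar_generate(n, coef=0.9, seed=0):
    """AR-style generation: each sample depends on the previous output,
    so the loop cannot be parallelized across time steps."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = coef * x + 0.1 * rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def non_ar_generate(n, seed=0):
    """Non-AR-style generation: draw all noise up front, then apply a
    feed-forward transform (here a toy moving average) to every
    position independently of previously generated outputs."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [0.1 * sum(noise[max(0, i - 2):i + 1]) / 3 for i in range(n)]

print(len(ar_generate(16)), len(non_ar_generate(16)))  # 16 16
```

This structural difference is why non-AR vocoders synthesize much faster on parallel hardware, while (per the VCC2020 finding below) AR models still tended to win on quality.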
22. Overall Results of VCC2020 Listening Tests
Semi-parallel training task
• 28 submitted systems, 3 baseline systems
• Top system: ASR+TTS+WN [Zhang; '20]
• Baselines: T11 (VCC2018 top system), T16 (CycleVAE+PWG [Tobing; '20]), T22 (ASR+TTS+PWG [Huang; '20])
Cross-lingual VC task
• 25 submitted systems, 3 baseline systems
• Top system: PPG+WN [Liu; '20]
• Baselines: T11 (VCC2018 top system), T16 (CycleVAE+PWG), T22 (ASR+TTS+PWG)
23. Findings through VCC2020
• Some systems outperform the VCC2018 top system!
• Good performance of alignment-free training
  • Text representation works very well in intra-lingual VC.
    • Conversion of long-term features is well handled.
  • PPG works well in cross-lingual VC.
    • Leaving long-term features unconverted is imperfect but still reasonable.
• Good performance of neural vocoders
  • Autoregressive models tend to outperform non-autoregressive models.
• Current performance of intra-lingual VC
  • Speaker similarity is comparable with target recorded speech!
  • Naturalness is still distinguishable from target recorded speech.
• Current performance of cross-lingual VC
  • More difficult than intra-lingual VC
  • More difficult to judge speaker similarity between different languages
24. Summary of Recent Progress of VC Techniques (Summary of VCCs)
The overall trend is from simplified toward more complex, more flexible, and data-driven components:
• Analysis: from parametric decomposition w/ a high-quality vocoder (excitation parameters, spectral parameters) to no decomposition (a raw representation such as the mel-spectrogram)
• Conversion, mapping: from frame-to-frame mapping (parametric probability models, neural regression) to sequence-to-sequence mapping (encoder-decoder w/ attention, long-term feature modeling)
• Conversion, training: from supervised parallel training (regression using time-aligned features) to unsupervised (alignment-free) nonparallel training (reconstruction using speaker-independent features, pretrained models)
• Synthesis: from a high-quality vocoder (signal processing based on the source-filter model) to waveform modeling w/ a neural vocoder (AR and non-AR models)
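The "parametric probability model" end of the mapping axis can be sketched in its simplest form: a single joint Gaussian over time-aligned (source, target) feature pairs, converted by the conditional-mean (MMSE) rule. This is the one-component case of joint-density GMM regression in the style of [Toda; '07], reduced to 1-D features; the function names and toy data are mine.

```python
def fit_joint_gaussian(xs, ys):
    """Fit a single joint Gaussian over time-aligned (source, target)
    1-D feature pairs: means, source variance, and covariance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return mx, my, sxx, sxy

def convert(x, params):
    """MMSE mapping: E[y|x] = mu_y + (sigma_xy / sigma_xx) * (x - mu_x)."""
    mx, my, sxx, sxy = params
    return my + (sxy / sxx) * (x - mx)

# Toy aligned 1-D features where the target follows y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
params = fit_joint_gaussian(xs, ys)
print(convert(1.5, params))  # -> 4.0
```

A real GMM mixes many such components so the mapping is piecewise-linear rather than globally linear, and the "time-aligned" pairs on the supervised side of the training axis are typically obtained with dynamic time warping over parallel utterances.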
26. Future Direction 1: Make VC Techniques Better and Better!
• Further performance improvements
  • Higher quality and more controllable conversion
  • End-to-end and sequence-to-sequence VC
  • Unsupervised factorization
• Flexible use of huge amounts of existing speech data
  • Pretraining and transfer learning
  • Self-supervised training using unlabeled data
• More varieties of conversion
  • Beyond speaker conversion: from who speaks to how to speak
  • Common representation over arbitrary languages
  • Better embeddings of long-term features
  • Conversion of pronunciation, speaking styles, emotion, ...
→ Sequence-to-sequence VC + unsupervised learning w/o any labels
27. Future Direction 2: Convert Higher Level Features!
• Towards a new definition of VC: "VC is a technique to modify speech waveform to convert non-/para-linguistic information while preserving linguistic information," where "non-/para-linguistic" broadens to "arbitrary" and "linguistic" to "conceptual".
[Diagram: speech is decomposed into conceptual, linguistic, para-linguistic, and non-linguistic information. A higher-level encoder extracts short-term features, long-term features, word-level features, and higher-level embeddings (higher than a text representation; the word representation is also converted!), and a deeper decoder combines them with target information to generate converted speech.]
→ Sequence-to-sequence VC using higher-level embeddings
28. Future Direction 3: Develop (Interactive) VC Applications
• Augment speech production by using low-latency real-time VC [Toda; '14]
  • From speech & singing voices produced by our existing functions to speech & singing voices produced beyond physical constraints (soft constraints on speech production) [JST CREST CoAugmentation project]
  • Involuntary control of system behavior to avoid physically impossible output!
• Towards "interactive VC": cooperatively working functions between user and system built through additional input signals (e.g., gesture) and interaction
  • Instantaneous feedback of system behavior
  • Intentional control of system behavior as we want!
  • An interactive VC framework to understand the behaviors of a data-driven system through interaction
31. Summary
• What was Voice Conversion (VC)?
  • A technique to convert non-/para-linguistic information
  • Many useful applications, but to be recognized as a "kitchen knife" (a tool that can also be misused)
• Review of recent trends in VC research
  • Significant progress through the recent Voice Conversion Challenges (VCCs)
  • Neural vocoders and sequence-to-sequence mapping
  • Alignment-free training capable of handling unsupervised training
• Possible future directions of VC research
  • Further performance improvements
  • From linguistically unchanged to conceptually unchanged
  • Interactive VC to augment our speech production
Let's start VC research and develop useful and helpful applications!
33. Resources
• Freely available VCC2020 baseline systems
  • Sequence-to-sequence VC based on cascaded ASR + TTS w/ ESPnet
    • https://bit.ly/2RVwAVk [W.-C. Huang]
  • Frame-wise VC based on CycleVAE + Parallel WaveGAN
    • https://bit.ly/369AXUK [P.L. Tobing, Y.-C. Wu]
• Tutorial materials at INTERSPEECH 2019 [T. Toda, K. Kobayashi, T. Hayashi]
  • https://bit.ly/328LwSS
  • Lecture slides and hands-on material: development of VC w/ a WaveNet vocoder on a Google Colab notebook
• Summer school materials at SPCC 2018 (& 2019)
  • Lecture slides on "Advanced Voice Conversion"
    • https://bit.ly/2PpWEYx
    • More details on recent progress of VC techniques
  • Hands-on slides
    • https://bit.ly/2pmwuLC
    • More details on sprocket, used to develop the VCC2018 baseline system