Nagoya University, Japan
Interactive Voice Conversion
for Augmented Speech Production
Tomoki TODA
July 2, 2021
Interactive VC
Physical 
functions
Machine 
learning
Interaction
Cooperatively augmented speech production
Physical Mechanism of Speech Production
• Produce speech signals by physically controlling speech organs
Sound source generation by vocal fold vibration
• Quasi-periodic excitation signal
Modulation by articulation
• Resonance characteristics
• Nonlinguistic information is not controlled...
• These physical functions are hard to replace and cause limitations...
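As a rough illustration of the two stages above, the following sketch generates a quasi-periodic pulse train (sound source) and passes it through a single two-pole resonator (articulation). The sampling rate, F0, and resonance settings are illustrative assumptions, not values from the talk, and real speech involves many resonances plus noise components.

```python
import numpy as np

def pulse_train(f0_hz, duration_s, fs):
    """Quasi-periodic excitation: one unit pulse per fundamental period."""
    n = int(round(duration_s * fs))
    excitation = np.zeros(n)
    period = int(round(fs / f0_hz))  # samples per fundamental period
    excitation[::period] = 1.0
    return excitation

def resonator(x, center_hz, bandwidth_hz, fs):
    """Two-pole resonance filter: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = np.exp(-np.pi * bandwidth_hz / fs)  # pole radius (< 1, so stable)
    a1 = 2.0 * r * np.cos(2.0 * np.pi * center_hz / fs)
    a2 = -r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += a1 * y[n - 1]
        if n >= 2:
            y[n] += a2 * y[n - 2]
    return y

# Assumed example values: 16 kHz sampling, 100 Hz F0, one resonance at 700 Hz.
fs = 16000
source = pulse_train(f0_hz=100.0, duration_s=0.05, fs=fs)
speech_like = resonator(source, center_hz=700.0, bandwidth_hz=100.0, fs=fs)
```

Changing `f0_hz` alters only the excitation, while changing `center_hz` alters only the resonance, which is the separation that the source-filter view relies on.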
Can We Produce Speech Beyond Constraints?
• Possibly use voice conversion to augment our speech production by
intentionally controlling more information [Toda; 2014]
[Figure: normal speech production (sound source generation → articulation → speech) followed by voice conversion, which provides augmented sound source generation and augmented articulation and yields converted speech]
Augment speech production beyond physical constraints
Even if some speech organs were lost… normal speech organs would be virtually implanted!
Voice Conversion (VC)
Technique to convert non-/para-linguistic information
while keeping linguistic information unchanged
Basic Process of Voice Conversion
• Combining signal processing for speech analysis-synthesis and
machine learning for statistical feature conversion
[Figure: input speech → analysis → extracted speech parameters → feature conversion (highly nonlinear function) → converted speech parameters → synthesis → converted speech]
Training data: source and target speech data (e.g., parallel data consisting of utterance pairs)
[Abe; 1990]
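The analysis, feature conversion, and synthesis steps above can be sketched as a toy pipeline. This is a minimal sketch under stated assumptions: a short-time Fourier analysis stands in for speech analysis, the original phase is reused at synthesis, and `convert` is a hypothetical placeholder (a fixed gain) where a trained conversion function would go. The frame and hop sizes are also assumptions.

```python
import numpy as np

FRAME = 512  # assumed frame length in samples
HOP = 256    # assumed frame shift (50% overlap)

def analyze(x):
    """Analysis: windowed frames -> log-magnitude spectra and phases."""
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(FRAME) / FRAME)
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    return np.log(np.abs(spec) + 1e-8), np.angle(spec)

def convert(log_mag, gain_db=0.0):
    """Placeholder for feature conversion (hypothetical: a constant gain)."""
    return log_mag + gain_db * np.log(10.0) / 20.0

def synthesize(log_mag, phase):
    """Synthesis: inverse FFT of each frame, then overlap-add."""
    spec = np.exp(log_mag) * np.exp(1j * phase)
    frames = np.fft.irfft(spec, n=FRAME, axis=1)
    y = np.zeros(HOP * (len(frames) - 1) + FRAME)
    for i, frame in enumerate(frames):
        y[i * HOP:i * HOP + FRAME] += frame
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=4096)           # stand-in for input speech
log_mag, phase = analyze(x)
y = synthesize(convert(log_mag), phase)
```

With the placeholder left at zero gain, the pipeline reproduces the input away from the signal edges, which is a useful sanity check before plugging in a learned conversion.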
Demo of VC: Vocal Effector
• Convert my singing voice into a specific character's singing voice!
Realtime VC software
[Dr. Kobayashi, Nagoya Univ.]
Famous virtual singer
[Toda; 2012][Kobayashi; 2018a]
VC
Recent Progress of VC Techniques
• Progress through Voice Conversion Challenge (VCC) [Toda; 2016]
http://www.vc-challenge.org/
1st VCC (VCC2016)
• Parallel training
2nd VCC (VCC2018)
• Parallel training
• Nonparallel training
3rd VCC (VCC2020)
• Semi-parallel training
• Nonparallel training across different languages
[Figure labels: source speaker, target speaker, freely available baseline system, top system]
[Kobayashi; 2016][Liu; 2018][Toda; 2007][Kobayashi; 2018b][Zhang; 2020][Liu; 2020][Tobing; 2020][Huang; 2020]
Recent Trend of VC Techniques
[Figure: input speech → analysis → feature conversion (trained on training data) → converted speech parameters → synthesis → converted speech]
Analysis: simplified
• Parametric decomposition (resonance & excitation parameters) → no decomposition (power spectrogram)
Synthesis: data-driven
• High-quality vocoder (signal processing) → deep waveform generation (neural vocoder)
Feature conversion: more complex
• Frame-to-frame (parametric probabilistic models, resonance modeling) → sequence-to-sequence (encoder-decoder w/ attention, joint resonance & excitation modeling)
Training: more flexible
• Supervised parallel training (regression using time-aligned source & target features) → unsupervised nonparallel training (reconstruction through speaker-independent features, pretrained models)
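Supervised parallel training, described above as regression on time-aligned source and target features, can be illustrated with a least-squares fit. This is a minimal sketch assuming the frames are already time-aligned and using an affine map as a stand-in for the nonlinear conversion models used in practice; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_affine_conversion(source_feats, target_feats):
    """Least-squares fit of target ~ source @ W + b from time-aligned frames."""
    ones = np.ones((source_feats.shape[0], 1))
    design = np.hstack([source_feats, ones])  # last column models the bias
    params, *_ = np.linalg.lstsq(design, target_feats, rcond=None)
    return params[:-1], params[-1]

def convert(source_feats, weight, bias):
    """Frame-to-frame conversion with the fitted affine map."""
    return source_feats @ weight + bias

# Synthetic "parallel data": 200 aligned frames of 4-dimensional features.
rng = np.random.default_rng(0)
true_w = rng.normal(size=(4, 4))
true_b = rng.normal(size=4)
src = rng.normal(size=(200, 4))
tgt = src @ true_w + true_b

w, b = fit_affine_conversion(src, tgt)
converted = convert(src, w, b)
```

Nonparallel training removes the need for the aligned `src`/`tgt` pairs, which is why it is described as more flexible.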
NOTE: Risk of VC
• Need to consider the possibility that VC is misused for spoofing
• VC makes it possible for someone to speak in your voice!
• But... we should NOT stop VC research because there are
many useful applications (e.g., speaking aid)!
• What can we do?
• Collaborate with anti-spoofing research [Wu; 2015][Kinnunen; 2017][Todisco; 2019]
• Need to widely tell people how to use VC correctly!
VC needs to be socially recognized as a “kitchen knife”.
From VC to Interactive VC
Limitations
• Batch-type processing
• Limited controllability
• Less interpretable
To augment speech production
• Quick response
• Better controllability
• Understandable behavior
Instantaneous feedback of system output to
understand system behavior through interaction
Desired speech
free from physical
constraints
Interactive VC w/
LLRT processing
Speech produced by
physical functions
Intentional control
of system output
Multimodal
behavior signals
Acquire unconscious
control skills?
Interactive VC
• Leverage interaction between user and system to develop cooperatively
working functions for augmenting speech production
• Achieve low-latency real-time (LLRT) processing
• Incorporate physical mechanism and multimodal behavior signals
Physical
mechanism
Involuntary control to avoid
physically impossible output
[JST CREST, CoAugmentation Project (PI: Toda), 2019-]
Recent Progress of Interactive VC Techniques
1. LLRT VC with computationally efficient network architecture
2. Controllable waveform generation considering physical mechanism
3. Speech expression control with multimodal behavior signals
Produced
speech
Desired
speech
Multimodal
behavior signals
Excitation
conversion
Resonance
conversion
Waveform
generation
LLRT conversion processing
Controllability
LLRT VC w/ Computationally Efficient Network
Feature conversion network [Tobing; 2021b]
• Based on VAE w/ sparse RNN
• Frame-wise processing
• Encoder: input speech → short-time frame analysis → mel-spectrogram → RNN encoders → latent features (speaker-independent features)
• Speaker-aware decoder: latent features + speaker code of target voice → RNN decoders → excitation parameters and converted mel-spectrogram
Waveform generation network [Tobing; 2021a]
• Auto-regressive neural vocoder
• Sample-wise processing in each frame
• Components: frame-wise CNN, modified WaveRNN, multi-band discrete waveforms, full-band waveform synthesis, time-variant IIR filtering → converted speech
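The vocoder above models discrete waveforms. One common way to discretize a waveform is mu-law companding; the sketch below assumes mu-law with 8 bits purely for illustration, since the slides do not state which quantization scheme or bit depth is used.

```python
import numpy as np

def mulaw_encode(x, bits=8):
    """Map samples in [-1, 1] to integer classes in [0, 2**bits - 1]."""
    mu = 2 ** bits - 1
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((compressed + 1.0) / 2.0 * mu).astype(np.int64)

def mulaw_decode(indices, bits=8):
    """Inverse mapping from integer classes back to samples in [-1, 1]."""
    mu = 2 ** bits - 1
    compressed = 2.0 * indices / mu - 1.0
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

samples = np.linspace(-1.0, 1.0, 101)
classes = mulaw_encode(samples)
restored = mulaw_decode(classes)
```

An auto-regressive model can then predict a categorical distribution over these classes for each sample, one sample at a time within each frame.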
Cascaded Network Training w/ Fine-Tuning
Cyclic training of VAE: CycleVAE [Tobing; 2019]
• Pseudo parallel data generation
• Natural mel-spectrogram → VAE → reconstructed mel-spectrogram and converted mel-spectrogram
• Converted mel-spectrogram → VAE → cyclically reconstructed mel-spectrogram
Training of universal neural vocoder
• Arbitrary speakers and languages
• Waveform → analysis → natural mel-spectrogram → neural vocoder → waveform
Fine-tuning of CycleVAE by propagating loss of universal neural vocoder
• Natural mel-spectrogram → CycleVAE → reconstructed mel-spectrogram → neural vocoder (frozen) → waveform
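The cyclic reconstruction flow above can be sketched with a toy model. This is only a data-flow illustration under strong assumptions: a deterministic linear encoder and decoder with random weights replace the actual VAE (no sampling, no KL term, no training step), and all dimensions and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MEL, N_LATENT, N_SPK = 8, 4, 2  # assumed toy dimensions

# Random weights stand in for trained encoder/decoder parameters.
W_enc = rng.normal(scale=0.1, size=(N_MEL, N_LATENT))
W_dec = rng.normal(scale=0.1, size=(N_LATENT + N_SPK, N_MEL))

def encode(mel):
    """Map mel-spectrogram frames to (ideally speaker-independent) latents."""
    return mel @ W_enc

def decode(latent, speaker_id):
    """Decode latents conditioned on a one-hot speaker code."""
    code = np.zeros((latent.shape[0], N_SPK))
    code[:, speaker_id] = 1.0
    return np.hstack([latent, code]) @ W_dec

def cyclic_losses(mel, src_id, tgt_id):
    """Reconstruction loss, cyclic reconstruction loss, and converted frames."""
    recon = decode(encode(mel), src_id)
    converted = decode(encode(mel), tgt_id)     # pseudo parallel data
    cyclic = decode(encode(converted), src_id)  # back to the source speaker
    recon_loss = np.mean((recon - mel) ** 2)
    cyclic_loss = np.mean((cyclic - mel) ** 2)
    return recon_loss, cyclic_loss, converted

mel = rng.normal(size=(50, N_MEL))  # stand-in for natural mel-spectrogram frames
recon_loss, cyclic_loss, converted = cyclic_losses(mel, src_id=0, tgt_id=1)
```

The key point is that `cyclic_loss` needs no target-speaker recording of the same utterance: the converted frames act as pseudo parallel data.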
Results of Listening Tests
Naturalness: max 5, min 1, higher is better. Speaker similarity: max 100, min 0, higher is better.
System | Naturalness | Speaker similarity
Natural voice of source speakers | 4.58 | 12.23
Natural voice of target speakers | 4.60 | 83.39
VCC2020 seq-to-seq baseline [Huang; 2020] | 4.01 | 81.19
VCC2020 frame-to-frame baseline [Tobing; 2020] | 3.84 | 69.91
LLRT VC w/o fine-tuning [Tobing; 2021b] | 3.34 | 59.56
LLRT VC w/ fine-tuning [Tobing; 2021b] | 3.93 | 69.28
w/ a single core of a 2.1‒2.7 GHz CPU
• Real-time processing w/ 10 ms frame shift
• Latency < 50 ms
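The real-time and latency conditions above can be checked with simple arithmetic. The sketch below uses a simplified latency budget (buffering of the current frame plus lookahead frames, then processing); the per-frame processing time and the number of lookahead frames are hypothetical values, not figures reported in the talk.

```python
def real_time_factor(processing_ms_per_frame, frame_shift_ms=10.0):
    """Values below 1.0 mean each frame is processed before the next arrives."""
    return processing_ms_per_frame / frame_shift_ms

def latency_budget_ms(lookahead_frames, processing_ms_per_frame,
                      frame_shift_ms=10.0):
    """Simplified budget: buffer the current frame plus lookahead, then process."""
    buffering_ms = (1 + lookahead_frames) * frame_shift_ms
    return buffering_ms + processing_ms_per_frame

# Hypothetical numbers: 6 ms of computation per 10 ms frame, 2 lookahead frames.
rtf = real_time_factor(processing_ms_per_frame=6.0)
latency = latency_budget_ms(lookahead_frames=2, processing_ms_per_frame=6.0)
meets_targets = rtf < 1.0 and latency < 50.0
```

Under these assumed numbers the budget is 36 ms, so each additional lookahead frame costs another 10 ms of latency at this frame shift.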
Controllable Deep Waveform Generation
• Improvement of controllability of unified neural vocoder by softly
implementing physical mechanism of speech production
[Figure: vocoder taxonomy. Source-filter models (features → excitation generation → resonance filtering → waveform) versus unified models (features → waveform generation → waveform), for parametric models and for deep neural networks. Labels: traditional vocoder (STRAIGHT, WORLD, ...); WaveNet, WaveRNN, PWG, ...; LPCNet, GlotGAN, GELP, ...; NSF, ...; proposed vocoders]
Quasi-Periodic Neural Vocoders
• Dilated convolution network (e.g., WaveNet [van den Oord; 2016])
• F0-dependent dilated convolution network
• Dynamically change dilation length w/ a given F0 pattern
[Wu; 2021a][Wu; 2021b]
Fundamental period: T_t = 1/F_{0,t} (time-varying, e.g., T_1, T_2, T_3)
[Figure: fixed dilated convolution, with dilation length 1 in the 1st layer and 2 in the 2nd layer: waveform sample sequence modeling w/ fixed receptive field]
[Figure: F0-dependent dilated convolution, with dilation length T in the 1st layer and 2T in the 2nd layer: waveform sample sequence modeling w/ time-varying receptive field]
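The idea of changing the dilation length with a given F0 pattern can be sketched as follows. This is a minimal sketch, not the published architecture: it assumes the dilation in samples is the rounded fundamental period (sampling rate divided by F0) scaled per layer, uses a fallback dilation for unvoiced samples (F0 = 0), and applies a two-tap causal filter; all names and defaults are hypothetical.

```python
import numpy as np

def pitch_dependent_dilations(f0_hz, fs, layer_factor=1, unvoiced_dilation=1):
    """Per-sample dilation: layer_factor * round(fs / F0), with a fallback at F0 = 0."""
    f0 = np.asarray(f0_hz, dtype=float)
    period = np.full(f0.shape, float(unvoiced_dilation))
    voiced = f0 > 0
    period[voiced] = np.round(fs / f0[voiced])
    return (layer_factor * np.maximum(period, 1)).astype(int)

def causal_dilated_conv(x, dilations, w_past, w_now):
    """y[t] = w_now * x[t] + w_past * x[t - d_t], with zeros before the start."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        past = x[t - dilations[t]] if t - dilations[t] >= 0 else 0.0
        y[t] = w_now * x[t] + w_past * past
    return y

fs = 16000
f0 = np.array([100.0, 200.0, 0.0])  # Hz; 0 marks an unvoiced sample
first_layer = pitch_dependent_dilations(f0, fs, layer_factor=1)
second_layer = pitch_dependent_dilations(f0, fs, layer_factor=2)

x = np.arange(10, dtype=float)
y = causal_dilated_conv(x, np.full(10, 2), w_past=1.0, w_now=1.0)
```

Because the dilation follows the fundamental period, each tap looks back a whole number of pitch cycles, so the receptive field stretches or shrinks as F0 changes.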
Behavior of Dilated Convolution Networks
[Wu; 2021a]
[Figure: noise signal → dilated convolution blocks (F0-dependent and fixed) → waveform, annotated with excitation generation and resonance filtering; panels show intermediate signals generated from the 5th, 10th, 15th, and 20th layers]
• Good factorization of the network into excitation and resonance parts
• Significantly improves F0 controllability, including extrapolation performance
Augmented Speech Production: Speaking Aid
• Laryngectomees
• Removal of larynx
• Separated trachea from vocal tract
• Alternative speaking methods
• Electrolaryngeal (EL) speech with an electrolarynx, esophageal speech, ...
• Suffer from unnatural speech quality and less expression
[Figure: anatomy before and after laryngectomy, labeled with vocal folds, esophagus, and trachea]
Unable to produce sound source in the usual manner with vibration of vocal folds…
Develop an augmented speech production system to recover lost voices!
Singing-Aid System with Interactive VC
• Interactive VC to convert EL speech into singing voice
• Real-time melody control by playing MIDI keyboard
• Freely sing an arbitrary song
[Figure: processing flow]
• EL speech → resonance features of EL speech → resonance conversion → resonance features of singing voice
• MIDI keyboard performance → MIDI melody pattern → F0 pattern conversion → F0 pattern of singing voice
• Resonance features and F0 pattern of singing voice → waveform generation → singing voice
[Morikawa; 2017][Li; 2019]
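As a simplified illustration of turning a MIDI melody into an F0 pattern, the sketch below maps note numbers to frequencies with the standard equal-temperament formula (note 69 = 440 Hz) and holds each note for its duration. The step-wise contour and the 5 ms frame shift are assumptions; the system described above converts the melody into a singing-voice F0 pattern rather than using a raw step-wise contour.

```python
import numpy as np

def midi_to_hz(note):
    """Equal temperament: MIDI note 69 is A4 = 440 Hz, 12 notes per octave."""
    return 440.0 * 2.0 ** ((np.asarray(note, dtype=float) - 69.0) / 12.0)

def melody_to_f0(notes, frame_shift_ms=5.0):
    """notes: list of (midi_note, duration_ms); returns frame-level F0 in Hz."""
    frames = []
    for note, duration_ms in notes:
        n_frames = int(round(duration_ms / frame_shift_ms))
        frames.append(np.full(n_frames, midi_to_hz(note)))
    return np.concatenate(frames)

# A4 for 50 ms followed by C5 for 100 ms.
f0_pattern = melody_to_f0([(69, 50.0), (72, 100.0)])
```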
Demo of Singing-Aid System
Expression Control w/ Multimodal Signals
• “Karaoke”-type singing aid system with interactive VC
• Sing a song to background music without playing a music instrument
• Control vibrato by moving an arm
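[Figure: processing flow; labels include background music]
• EL speech → resonance conversion
• Arm movements → arm position detection → vibrato control parameters
• MIDI melody pattern and vibrato control parameters → F0 pattern conversion
• Outputs of resonance conversion and F0 pattern conversion → waveform generation → singing voice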
[Okawa; 2021]
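Vibrato control can be illustrated as a sinusoidal modulation of the F0 pattern whose depth follows a control signal. In this sketch the mapping from a normalized arm height to vibrato depth, the vibrato rate, and the frame shift are all hypothetical choices for illustration, not the mapping used in the system above.

```python
import numpy as np

def arm_height_to_depth_cents(arm_height, max_depth_cents=100.0):
    """Hypothetical mapping: normalized arm height in [0, 1] -> vibrato depth."""
    return max_depth_cents * float(np.clip(arm_height, 0.0, 1.0))

def add_vibrato(f0_hz, depth_cents, rate_hz=5.5, frame_shift_ms=5.0):
    """Modulate F0 sinusoidally; depth is the peak deviation in cents."""
    f0 = np.asarray(f0_hz, dtype=float)
    t = np.arange(len(f0)) * frame_shift_ms / 1000.0
    cents = depth_cents * np.sin(2.0 * np.pi * rate_hz * t)
    return f0 * 2.0 ** (cents / 1200.0)

flat_f0 = np.full(200, 440.0)  # one second of A4 at a 5 ms frame shift
depth = arm_height_to_depth_cents(0.5)
vibrato_f0 = add_vibrato(flat_f0, depth)
```

Expressing the depth in cents keeps the modulation perceptually similar across low and high notes.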
Summary
• Voice Conversion (VC)
• Technique to convert non-/para-linguistic information
• Significant progress through recent Voice Conversion Challenges (VCCs)
• Need to be recognized as “kitchen knife”
• From VC to Interactive VC towards augmented speech production
• Low-latency real-time conversion to achieve quick response
• Incorporate the physical mechanism into the network and additionally use
multimodal behavior signals to achieve better controllability
• Immediate goal: achieve high-quality instantaneous feedback to help
users understand system behavior through interaction
References
[Abe; 1990] M. Abe, S. Nakamura, K. Shikano, H. Kuwabara. Voice conversion through vector quantization.
J. Acoust. Soc. Jpn (E), Vol. 11, No. 2, pp. 71‒76, 1990.
[Huang; 2020] W.-C. Huang, T. Hayashi, S. Watanabe, T. Toda. The sequence-to-sequence baseline for the
Voice Conversion Challenge 2020: cascading ASR and TTS. Proc. Joint workshop for the Blizzard Challenge
and Voice Conversion Challenge 2020, pp. 160‒164, 2020.
[Kinnunen; 2017] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, K.A. Lee.
The ASVspoof 2017 Challenge: assessing the limits of replay spoofing attack detection. Proc.
INTERSPEECH, pp. 2‒6, 2017.
[Kobayashi; 2016] K. Kobayashi, S. Takamichi, S. Nakamura, T. Toda. The NU-NAIST voice conversion
system for the Voice Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1667‒1671, 2016.
[Kobayashi; 2018a] K. Kobayashi, T. Toda, S. Nakamura. Intra-gender statistical singing voice conversion
with direct waveform modification using log-spectral differential. Speech Commun., Vol. 99, pp. 211‒220,
2018.
[Kobayashi; 2018b] K. Kobayashi, T. Toda. sprocket: open-source voice conversion software. Proc.
Odyssey, pp. 203‒210, 2018.
[Li; 2019] L. Li, T. Toda, K. Morikawa, K. Kobayashi, S. Makino. Improving singing aid system for
laryngectomees with statistical voice conversion and VAE-SPACE. Proc. ISMIR, pp. 784‒790, 2019.
[Liu; 2018] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, L.-R. Dai. WaveNet Vocoder with Limited Training Data
for Voice Conversion. Proc. INTERSPEECH, pp. 1983‒1987, 2018.
[Liu; 2020] L.-J. Liu, Y.-N. Chen, J.-X. Zhang, Y. Jiang, Y.-J. Hu, Z.-H. Ling, L.-R. Dai. Non-parallel voice
conversion with autoregressive conversion model and duration adjustment. Proc. Joint workshop for the
Blizzard Challenge and Voice Conversion Challenge 2020, pp. 126‒130, 2020.
[Morikawa; 2017] K. Morikawa, T. Toda. Electrolaryngeal speech modification towards singing aid system
for laryngectomees. Proc. APSIPA ASC, 4 pages, 2017.
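[Okawa; 2021] S. Okawa, Y. Ishiguro, K. Ohtani, T. Nishino, K. Kobayashi, T. Toda, K. Takeda. Proposal of adding
singing expression using natural body movements in a singing system using an electrolarynx (in Japanese).
Proc. 25th IPSJ Symposium INTERACTION 2021, 6 pages, Mar. 2021.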
[Tobing; 2019] P.L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, T. Toda. Non-parallel voice conversion with
cyclic variational autoencoder. Proc. INTERSPEECH, pp. 674‒678, 2019.
[Tobing; 2020] P.L. Tobing, Y. Wu, T. Toda. Baseline system of Voice Conversion Challenge 2020 with
cyclic variational autoencoder and parallel WaveGAN. Proc. Joint workshop for the Blizzard Challenge and
Voice Conversion Challenge 2020, pp. 155‒159, 2020.
[Tobing; 2021a] P.L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, T. Toda. Non-parallel voice conversion
with cyclic variational autoencoder. Proc. INTERSPEECH, 5 pages, 2021 (to appear).
[Tobing; 2021b] P.L. Tobing, T. Toda. Non-parallel voice conversion with cyclic variational autoencoder.
Proc. 11th ISCA Speech Synthesis Workshop (SSW11), 6 pages, 2021 (to appear).
[Toda; 2007] T. Toda, A.W. Black, K. Tokuda. Voice conversion based on maximum likelihood estimation of
spectral parameter trajectory. IEEE Trans. Audio, Speech & Lang. Process., Vol. 15, No. 8, pp. 2222‒2235,
2007.
[Toda; 2012] T. Toda, T. Muramatsu, H. Banno. Implementation of computationally efficient real-time voice
conversion. Proc. INTERSPEECH, 4 pages, 2012.
[Toda; 2014] T. Toda. Augmented speech production based on real-time statistical voice conversion. Proc.
GlobalSIP, pp. 755‒759, 2014.
[Toda; 2016] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The Voice
Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1632‒1636, 2016.
[Todisco; 2019] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N.
Evans, T.H. Kinnunen, K.A. Lee. ASVspoof 2019: future horizons in spoofed and fake audio detection. Proc.
INTERSPEECH, pp. 1008‒1012, 2019.
[van den Oord; 2016] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N.
Kalchbrenner, A. W. Senior, K. Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint,
arXiv:1609.03499, 15 pages, 2016.
[Wu; 2015] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, H. Li. Spoofing and countermeasures for
speaker verification: A survey. Speech Commun. Vol. 66, pp. 130‒153, 2015.
[Wu; 2021a] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, T. Toda. Quasi-periodic parallel WaveGAN: a
non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural
network. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp. 792‒806, 2021.
[Wu; 2021b] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, T. Toda. Quasi-periodic WaveNet: an
autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network.
IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp. 1134‒1148, 2021.
[Zhang; 2020] J.-X. Zhang, L.-J. Liu, Y.-N. Chen, Y.-J. Hu, Y. Jiang, Z.-H. Ling, L.-R. Dai. Voice conversion by
cascading automatic speech recognition and text-to-speech synthesis with prosody transfer. Proc. Joint
workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 121‒125, 2020.
* VCC series
[VCC2016 Summary] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The
Voice Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1632‒1636, 2016.
[VCC2016 Analysis] M. Wester, Z. Wu, J. Yamagishi. Analysis of the Voice Conversion Challenge 2016
evaluation results. Proc. INTERSPEECH, pp. 1637‒1641, 2016.
[VCC2018 Summary] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, Z.
Ling. The voice conversion challenge 2018: promoting development of parallel and nonparallel methods.
Proc. Odyssey, pp. 195‒202, 2018.
[VCC2018 Analysis] T. Kinnunen, J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, Z. Ling.
A spoofing benchmark for the 2018 voice conversion challenge: leveraging from spoofing countermeasures
for speech artifact assessment. Proc. Odyssey, pp. 187‒194, 2018.
[VCC2020 Summary] Z. Yi, W.-C. Huang, X. Tian, J. Yamagishi, R.K. Das, T. Kinnunen, Z. Ling, T. Toda.
Voice Conversion Challenge 2020 ‒ intra-lingual semi-parallel and cross-lingual voice conversion ‒. Proc.
Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 80‒98, 2020.
[VCC2020 Analysis] R.K. Das, T. Kinnunen, W.-C. Huang, Z. Ling, J. Yamagishi, Z. Yi, X. Tian, T. Toda.
Predictions of subjective ratings and spoofing assessments of Voice Conversion Challenge 2020
submissions. Proc. Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 99‒
120, 2020.
* Nonparallel VC w/ speaker-independent representations
[PPG] L. Sun, K. Li, H. Wang, S. Kang, H.M. Meng. Phonetic posteriorgrams for many-to-one voice
conversion without parallel data training. Proc. IEEE ICME, 6 pages, 2016.
[VAE] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, H.-M. Wang. Voice conversion from non-parallel corpora
using variational auto-encoder. Proc. APSIPA ASC, 6 pages, 2016.
[VQVAE] A. van den Oord, O. Vinyals, K. Kavukcuoglu. Neural discrete representation learning. arXiv
preprint, arXiv:1711.00937, 11 pages, 2017.
* Vocoder
[STRAIGHT] H. Kawahara, I. Masuda-Katsuse, A. de Cheveigne. Restructuring speech representations using
a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible
role of a repetitive structure in sounds. Speech Commun., Vol. 27, No. 3‒4, pp. 187‒207, 1999.
[WORLD] M. Morise, F. Yokomori, K. Ozawa. WORLD: a vocoder-based high-quality speech synthesis system
for real-time applications. IEICE Trans. Inf. & Syst., Vol. E99-D, No. 7, pp. 1877‒1884, 2016.
[LPCNet] J.-M. Valin, J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. Proc.
IEEE ICASSP, pp. 5891‒5895, 2019.
[GlotGAN] L. Juvela, B. Bollepalli, V. Tsiaras, P. Alku. GlotNet: a raw waveform model for the glottal
excitation in statistical parametric speech synthesis. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol.
27, No. 6, pp. 1019‒1030, 2019.
[GELP] L. Juvela, B. Bollepalli, J. Yamagishi, P. Alku. GELP: GAN-excited linear prediction for speech
synthesis from mel-spectrogram. Proc. INTERSPEECH, pp. 694‒698, 2019.
[NSF] X. Wang, S. Takaki, J. Yamagishi. Neural source-filter waveform models for statistical parametric
speech synthesis. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 28, pp. 402‒415, 2019.
[WaveNet] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W.
Senior, K. Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint, arXiv:1609.03499, 15
pages, 2016.
[Parallel WaveNet] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den
Driessche, E. Lockhart, L.C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N.
Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, D. Hassabis. Parallel WaveNet: fast high-
fidelity speech synthesis. arXiv preprint, arXiv:1711.10433, 11 pages, 2017.
[WaveRNN] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van
den Oord, S. Dieleman, K. Kavukcuoglu. Efficient neural audio synthesis. Proc. ICML, pp. 2410‒2419, 2018.
[PWG] R. Yamamoto, E. Song, J.-M. Kim. Parallel WaveGAN: A fast waveform generation model based on
generative adversarial networks with multi-resolution spectrogram. Proc. IEEE ICASSP, pp. 6199‒6203, 2020.
[aHM] G. Degottex, Y. Stylianou. Analysis and synthesis of speech using an adaptive full-band harmonic
model. IEEE Trans. Audio, Speech & Lang. Process., Vol. 21, No. 10, pp. 2085‒2095, 2013.
[DDSP] J. Engel, L. Hantrakul, C. Gu, A. Roberts. DDSP: differentiable digital signal processing. Proc. ICLR,
16 pages, 2020.

 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility Applications
 
Vip ℂall Girls Karkardooma Phone No 9999965857 High Profile ℂall Girl Delhi N...
Vip ℂall Girls Karkardooma Phone No 9999965857 High Profile ℂall Girl Delhi N...Vip ℂall Girls Karkardooma Phone No 9999965857 High Profile ℂall Girl Delhi N...
Vip ℂall Girls Karkardooma Phone No 9999965857 High Profile ℂall Girl Delhi N...
 
Introduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AIIntroduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AI
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx
 
Autodesk Construction Cloud (Autodesk Build).pptx
Autodesk Construction Cloud (Autodesk Build).pptxAutodesk Construction Cloud (Autodesk Build).pptx
Autodesk Construction Cloud (Autodesk Build).pptx
 
Operating System chapter 9 (Virtual Memory)
Operating System chapter 9 (Virtual Memory)Operating System chapter 9 (Virtual Memory)
Operating System chapter 9 (Virtual Memory)
 
electrical installation and maintenance.
electrical installation and maintenance.electrical installation and maintenance.
electrical installation and maintenance.
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
 
UNIT-2 image enhancement.pdf Image Processing Unit 2 AKTU
UNIT-2 image enhancement.pdf Image Processing Unit 2 AKTUUNIT-2 image enhancement.pdf Image Processing Unit 2 AKTU
UNIT-2 image enhancement.pdf Image Processing Unit 2 AKTU
 
Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdf
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1
 

Interactive voice conversion for augmented speech production

  • 1. Nagoya University, Japan Interactive Voice Conversion for Augmented Speech Production Tomoki TODA July 2, 2021 Interactive VC Physical  functions Machine  learning Interaction Cooperatively augmented speech production
  • 2. Physical Mechanism of Speech Production • Produce speech signals by physically controlling speech organs Sound source generation by vocal folds vibration • Quasi-periodic excitation signal Modulation by articulation • Resonance characteristics • Nonlinguistic information is not controlled... • These physical functions are hard to replace and cause limitations... 1
  • 3. Can We Produce Speech Beyond Constraints? • Possibly use voice conversion to augment our speech production by intentionally controlling more information [Toda; 2014] Sound source generation Articulation Speech Voice conversion Augmented sound source generation Augmented articulation Converted speech Augment speech production beyond physical constraints Hello… Hello… Hello… Normal speech organs would be virtually implanted! Even if some speech organs were lost… Hello! 2
  • 4. Voice Conversion (VC) Technique to convert non-/para-linguistic information while keeping linguistic information unchanged
  • 5. Basic Process of Voice Conversion • Combining signal processing for speech analysis-synthesis and machine learning for statistical feature conversion Converted speech parameters Training data Converted speech Input speech Feature conversion Converted speech parameters Synthesis Analysis Extracted speech parameters Highly nonlinear function Source and target speech data (e.g., parallel data consisting of utterance pairs) 3 [Abe; 1990]
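The analysis and feature-conversion stages of this process can be sketched in a few lines. This is only an illustrative toy under stated assumptions, not the method of [Abe; 1990] or of any system in this talk: the function names (`analyze`, `train_conversion`, `convert`) are hypothetical, the features are plain log-magnitude spectra, the "parallel data" is a synthetic pair in which the target is the source at twice the gain, and an affine least-squares map stands in for the highly nonlinear conversion function. The synthesis stage is omitted.

```python
import numpy as np

def analyze(signal, frame_len=64, hop=32):
    """Analysis: frame the waveform and return log-magnitude spectra (frames x bins)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)

def train_conversion(src_feats, tgt_feats):
    """Training: fit an affine map on time-aligned source/target feature pairs."""
    X = np.hstack([src_feats, np.ones((len(src_feats), 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X, tgt_feats, rcond=None)
    return W

def convert(src_feats, W):
    """Feature conversion: apply the trained map frame by frame."""
    X = np.hstack([src_feats, np.ones((len(src_feats), 1))])
    return X @ W

# Synthetic parallel data (assumption): the target is the source at twice the gain.
rng = np.random.default_rng(0)
train_src = rng.standard_normal(8000)
train_tgt = 2.0 * train_src
W = train_conversion(analyze(train_src), analyze(train_tgt))

# Convert held-out input and compare with the features of the true target.
test_src = rng.standard_normal(2000)
converted = convert(analyze(test_src), W)
error = np.mean(np.abs(converted - analyze(2.0 * test_src)))
```

In a real system the converted parameters would then be passed to a vocoder for synthesis, and the affine map would be replaced by a statistical or neural model.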
  • 6. Demo of VC: Vocal Effector • Convert my singing voice into a specific character's singing voice! Realtime VC software [Dr. Kobayashi, Nagoya Univ.] Famous virtual singer [Toda; 2012][Kobayashi; 2018a]
  • 7. Recent Progress of VC Techniques • Progress through Voice Conversion Challenge (VCC) [Toda; 2016] http://www.vc-challenge.org/ • 1st VCC (VCC2016): parallel training • 2nd VCC (VCC2018): parallel training, nonparallel training • 3rd VCC (VCC2020): semi-parallel training, nonparallel training across different languages • Audio samples: source speaker, target speaker, freely available baseline system, top system [Kobayashi; 2016] [Liu; 2018] [Toda; 2007] [Kobayashi; 2018b] [Zhang; 2020] [Liu; 2020] [Tobing; 2020] [Huang; 2020]
  • 8. Recent Trend of VC Techniques • Analysis (simplified): from parametric decomposition (resonance & excitation parameters) to no decomposition (power spectrogram) • Synthesis (data-driven): from high-quality vocoder (signal processing) to deep waveform generation (neural vocoder) • Feature conversion (more complex): from frame-to-frame (parametric probabilistic models, resonance modeling) to sequence-to-sequence (encoder-decoder w/ attention, joint resonance & excitation modeling) • Training (more flexible): from supervised parallel training (regression using time-aligned source & target features) to unsupervised nonparallel training (reconstruction through speaker-independent features, pretrained models)
  • 9. NOTE: Risk of VC • We need to consider the possibility that VC is misused for spoofing • VC makes it possible for someone else to speak with your voice! • But... we should NOT stop VC research because there are many useful applications (e.g., speaking aid)! • What can we do? • Collaborate with anti-spoofing research [Wu; 2015][Kinnunen; 2017][Todisco; 2019] • Tell people widely how to use VC correctly! VC needs to be socially recognized as a tool like a kitchen knife: useful, but dangerous when misused.
  • 10. From VC to Interactive VC Limitations • Batch-type processing • Limited controllability • Less interpretable To augment speech production • Quick response • Better controllability • Understandable behavior
  • 11. Interactive VC [JST CREST, CoAugmentation Project (PI: Toda), 2019-] • Leverage interaction between user and system to develop cooperatively working functions for augmenting speech production • Achieve low-latency real-time (LLRT) processing • Incorporate physical mechanism and multimodal behavior signals • Diagram labels: speech produced by physical functions; multimodal behavior signals; interactive VC w/ LLRT processing; desired speech free from physical constraints; instantaneous feedback of system output to understand system behavior through interaction; intentional control of system output (acquire unconscious control skills?); physical mechanism: involuntary control to avoid physically impossible output
  • 12. Recent Progress of Interactive VC Techniques 1. LLRT VC with computationally efficient network architecture 2. Controllable waveform generation considering physical mechanism 3. Speech expression control with multimodal behavior signals Produced speech Desired speech Multimodal behavior signals Excitation conversion Resonance conversion Waveform generation LLRT conversion processing Controllability 9
  • 14. LLRT VC w/ Computationally Efficient Network Short-time frame analysis Input speech Converted mel-spectrogram Speaker code of target voice RNN decoder Excitation parameters RNN decoder Mel-spectrogram RNN encoder RNN encoder Latent features Latent features Feature conversion network [Tobing; 2021b] • Based on VAE w/ sparse RNN 10 Encoder Speaker-aware decoder Speaker-independent features
  • 15. LLRT VC w/ Computationally Efficient Network Short-time frame analysis Input speech Converted mel-spectrogram Speaker code of target voice RNN decoder Excitation parameters RNN decoder Mel-spectrogram RNN encoder RNN encoder Latent features Latent features Feature conversion network [Tobing; 2021b] • Based on VAE w/ sparse RNN Multi-band discrete waveforms Converted speech Modified WaveRNN Frame-wise CNN Full-band waveform synthesis Time-variant IIR filtering Waveform generation network [Tobing; 2021a] • Auto-regressive neural vocoder 10 Frame-wise processing Sample-wise processing in each frame
  • 16. Cascaded Network Training w/ Fine-Tuning Natural mel- spectrogram VAE Reconstructed mel-spectrogram Converted mel-spectrogram VAE Cyclically reconstructed mel-spectrogram 11 Converted mel-spectrogram Cyclic training of VAE: CycleVAE [Tobing; 2019] • Pseudo parallel data generation Natural mel- spectrogram Waveform Neural vocoder Analysis Training of universal neural vocoder • Arbitrary speakers and languages CycleVAE Neural vocoder Natural mel- spectrogram Waveform Reconstructed mel-spectrogram Fine-tuning of CycleVAE by propagating loss of universal neural vocoder Freeze
  • 17. Results of Listening Tests (higher is better for both measures)
      System | Naturalness (max 5, min 1) | Speaker similarity (max 100, min 0)
      Natural voice of source speakers | 4.58 | 12.23
      Natural voice of target speakers | 4.60 | 83.39
      VCC2020 seq-to-seq baseline [Huang; 2020] | 4.01 | 81.19
      VCC2020 frame-to-frame baseline [Tobing; 2020] | 3.84 | 69.91
      LLRT VC w/o fine-tuning [Tobing; 2021b] | 3.34 | 59.56
      LLRT VC w/ fine-tuning [Tobing; 2021b] | 3.93 | 69.28
    LLRT VC w/ a single core of 2.1‒2.7 GHz CPU: • Real-time processing w/ 10 ms frame shift • Latency < 50 ms
  • 19. • Improvement of controllability of unified neural vocoder by softly implementing physical mechanism of speech production Controllable Deep Waveform Generation Resonance filtering Excitation generation Waveform Features Waveform generation Waveform Features Source-filter models Unified models Traditional vocoder STRAIGHT, WORLD, ... WaveNet, WaveRNN, PWG, ... Resonance filtering Excitation generation Waveform Features Proposed vocoders Resonance filtering Excitation generation Waveform Features Resonance filtering Excitation generation Waveform Features LPCNet, GlotGAN, GELP, ... NSF, ... Parametric models Deep neural networks 13
  • 20. Quasi-Periodic Neural Vocoders [Wu; 2021a][Wu; 2021b] • Dilated convolution network (e.g., WaveNet [van den Oord; 2016]): waveform sample sequence modeling w/ fixed receptive field (dilation length 1 in the 1st layer, 2 in the 2nd layer) • F0-dependent dilated convolution network: dynamically change dilation length w/ a given F0 pattern, giving waveform sample sequence modeling w/ time-varying receptive field (dilation length T in the 1st layer, 2T in the 2nd layer, where T = 1/F0 is the fundamental period)
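To make the difference between fixed and F0-dependent dilation concrete, here is a minimal sketch of a causal 2-tap convolution in both styles. It is an illustration under assumptions, not the architecture of [Wu; 2021a] or [Wu; 2021b]: the function names, the 2-tap kernel, the rounding of the period to whole samples, and the fallback for unvoiced samples are all hypothetical choices, and the cited papers' exact formulation may differ.

```python
import numpy as np

def fixed_dilated_conv(x, w_past, w_now, dilation):
    """Causal 2-tap convolution with a fixed dilation length (in samples)."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        past = x[t - dilation] if t - dilation >= 0 else 0.0
        y[t] = w_past * past + w_now * x[t]
    return y

def f0_dependent_dilated_conv(x, f0, fs, w_past, w_now, dilation=1):
    """Causal 2-tap convolution whose dilation follows the fundamental period.

    The period at sample t is round(fs / f0[t]) samples, and the effective
    dilation is dilation * period. Unvoiced samples (f0 == 0) fall back to
    the plain fixed dilation (an assumption made for this sketch).
    """
    y = np.zeros(len(x))
    for t in range(len(x)):
        period = int(round(fs / f0[t])) if f0[t] > 0 else 1
        d = dilation * period
        past = x[t - d] if t - d >= 0 else 0.0
        y[t] = w_past * past + w_now * x[t]
    return y

# A 100 Hz sine sampled at 8 kHz has a period of exactly 80 samples.
fs = 8000
t = np.arange(400)
x = np.sin(2.0 * np.pi * 100.0 * t / fs)
f0 = np.full(len(x), 100.0)

# With taps (-1, +1), the pitch-dependent tap lands exactly one period back,
# so the periodic component cancels; a fixed dilation of 1 does not cancel it.
y_qp = f0_dependent_dilated_conv(x, f0, fs, w_past=-1.0, w_now=1.0)
y_fixed = fixed_dilated_conv(x, w_past=-1.0, w_now=1.0, dilation=1)
```

The cancellation after the first period shows why tying the dilation to F0 lets a small kernel see pitch-synchronous context regardless of the pitch value.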
  • 25. Behavior of Dilated Convolution Networks [Wu; 2021a] [Figure: a noise signal input and the waveforms generated from the 5th, 10th, 15th, and 20th layers, compared between fixed dilated convolution and F0-dependent dilated convolution; the excitation generation and resonance filtering parts are labeled] • The network factorizes well into excitation and resonance parts • F0 controllability improves significantly, including extrapolation performance
  • 27. Augmented Speech Production: Speaking Aid • Laryngectomees • Removal of larynx • Separated trachea from vocal tract • Alternative speaking methods • Electrolaryngeal (EL) speech with an electrolarynx, esophageal speech, ... • Suffer from unnatural speech quality and less expression Vocal folds Esophagus Trachea Laryngectomy Unable to produce sound source in a usual manner with vibration of vocal folds… Esophagus Develop an augmented speech production system to recover lost voices! 16
  • 28. Singing-Aid System with Interactive VC • Interactive VC to convert EL speech into singing voice • Real-time melody control by playing MIDI keyboard • Freely sing an arbitrary song F0 pattern conversion MIDI keyboard performance EL speech Resonance conversion Resonance features of EL speech Resonance features of singing voice Singing voice Waveform generation MIDI melody pattern F0 pattern of singing voice [Morikawa; 2017][Li; 2019] 17
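As a rough illustration of the step that turns a MIDI keyboard performance into a melody pattern, the sketch below maps MIDI note numbers to a frame-wise F0 contour using the standard equal-temperament relation (A4 = note 69 = 440 Hz). The function names, the note-list format, and the 10 ms frame shift are assumptions made for this example. The statistical F0 pattern conversion of [Morikawa; 2017][Li; 2019] that follows this step is not reproduced here.

```python
import numpy as np

def midi_to_hz(note):
    """Equal-temperament pitch of a MIDI note number (note 69 = A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((np.asarray(note, dtype=float) - 69.0) / 12.0)

def melody_to_f0(notes, frame_shift=0.01):
    """Expand (midi_note or None, duration_sec) pairs into a frame-wise F0 contour.

    Rests (None) are written as 0 Hz, a common convention for unvoiced frames.
    """
    f0 = []
    for note, duration in notes:
        n_frames = int(round(duration / frame_shift))
        value = 0.0 if note is None else float(midi_to_hz(note))
        f0.extend([value] * n_frames)
    return np.array(f0)

# Hypothetical performance: A4 for 100 ms, a 50 ms rest, then A5 for 100 ms.
melody = [(69, 0.10), (None, 0.05), (81, 0.10)]
f0_pattern = melody_to_f0(melody)
```

A step-wise contour like this one is what a conversion model would then reshape into a natural singing F0 pattern.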
  • 29. Demo of Singing-Aid System 18
  • 30. Expression Control w/ Multimodal Signals • “Karaoke”-type singing aid system with interactive VC • Sing a song to background music without playing a music instrument • Control vibrato by moving an arm F0 pattern conversion Arm movements EL speech Resonance conversion Singing voice Waveform generation MIDI melody pattern Vibrato control parameters Arm position detection Background music [Okawa; 2021] 19
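One simple way to picture vibrato control is as a sinusoidal modulation of the F0 pattern whose depth is driven by an external signal. The sketch below is a hedged illustration only: the sinusoidal vibrato model, the function names, and the linear mapping from a normalized arm height to a depth in cents are all assumptions, not the method of [Okawa; 2021].

```python
import numpy as np

def arm_height_to_depth(height, max_depth_cents=100.0):
    """Hypothetical mapping: normalized arm height in [0, 1] -> vibrato depth in cents."""
    return max_depth_cents * np.clip(height, 0.0, 1.0)

def add_vibrato(f0, frame_shift, rate_hz, depth_cents):
    """Modulate a frame-wise F0 contour with sinusoidal vibrato.

    depth_cents may be a scalar or one value per frame. Frames with
    f0 == 0 (unvoiced or rests) are left at 0.
    """
    f0 = np.asarray(f0, dtype=float)
    t = np.arange(len(f0)) * frame_shift
    depth = np.broadcast_to(np.asarray(depth_cents, dtype=float), f0.shape)
    ratio = 2.0 ** (depth / 1200.0 * np.sin(2.0 * np.pi * rate_hz * t))
    return np.where(f0 > 0, f0 * ratio, 0.0)

# A flat 220 Hz note with one rest frame; the arm rises over the note,
# so the vibrato depth grows from 0 to 100 cents.
f0 = np.array([0.0] + [220.0] * 99)
arm_height = np.linspace(0.0, 1.0, len(f0))
f0_vibrato = add_vibrato(f0, frame_shift=0.01, rate_hz=5.0,
                         depth_cents=arm_height_to_depth(arm_height))
```

Working in cents keeps the modulation symmetric on a log-frequency scale, which matches how pitch deviations are usually perceived.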
  • 31. Summary • Voice Conversion (VC) • Technique to convert non-/para-linguistic information • Significant progress through recent Voice Conversion Challenges (VCCs) • Needs to be recognized as a “kitchen knife” • From VC to Interactive VC towards augmented speech production • Low-latency real-time conversion to achieve quick response • Incorporate the physical mechanism into the network and additionally use multimodal behavior signals to achieve better controllability • Immediate goal: achieve high-quality instantaneous feedback to help users understand system behavior through interaction
  • 33. References: 1
    [Abe; 1990] M. Abe, S. Nakamura, K. Shikano, H. Kuwabara. Voice conversion through vector quantization. J. Acoust. Soc. Jpn (E), Vol. 11, No. 2, pp. 71‒76, 1990.
    [Huang; 2020] W.-C. Huang, T. Hayashi, S. Watanabe, T. Toda. The sequence-to-sequence baseline for the Voice Conversion Challenge 2020: cascading ASR and TTS. Proc. Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 160‒164, 2020.
    [Kinnunen; 2017] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, K.A. Lee. The ASVspoof 2017 Challenge: assessing the limits of replay spoofing attack detection. Proc. INTERSPEECH, pp. 2‒6, 2017.
    [Kobayashi; 2016] K. Kobayashi, S. Takamichi, S. Nakamura, T. Toda. The NU-NAIST voice conversion system for the Voice Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1667‒1671, 2016.
    [Kobayashi; 2018a] K. Kobayashi, T. Toda, S. Nakamura. Intra-gender statistical singing voice conversion with direct waveform modification using log-spectral differential. Speech Commun., Vol. 99, pp. 211‒220, 2018.
    [Kobayashi; 2018b] K. Kobayashi, T. Toda. sprocket: open-source voice conversion software. Proc. Odyssey, pp. 203‒210, 2018.
    [Li; 2019] L. Li, T. Toda, K. Morikawa, K. Kobayashi, S. Makino. Improving singing aid system for laryngectomees with statistical voice conversion and VAE-SPACE. Proc. ISMIR, pp. 784‒790, 2019.
    [Liu; 2018] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, L.-R. Dai. WaveNet Vocoder with Limited Training Data for Voice Conversion. Proc. INTERSPEECH, pp. 1983‒1987, 2018.
    [Liu; 2020] L.-J. Liu, Y.-N. Chen, J.-X. Zhang, Y. Jiang, Y.-J. Hu, Z.-H. Ling, L.-R. Dai. Non-parallel voice conversion with autoregressive conversion model and duration adjustment. Proc. Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 126‒130, 2020.
    [Morikawa; 2017] K. Morikawa, T. Toda. Electrolaryngeal speech modification towards singing aid system for laryngectomees. Proc. APSIPA ASC, 4 pages, 2017.
    [Okawa; 2021] S. Okawa, Y. Ishiguro, K. Ohtani, T. Nishino, K. Kobayashi, T. Toda, K. Takeda. Proposal for adding singing expression using natural body movements in a singing system with an electrolarynx (in Japanese). Proc. 25th IPSJ Symposium on Interaction (INTERACTION 2021), 6 pages, Mar. 2021.
    [Tobing; 2019] P.L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, T. Toda. Non-parallel voice conversion with cyclic variational autoencoder. Proc. INTERSPEECH, pp. 674‒678, 2019.
  • 34. References: 2
    [Tobing; 2020] P.L. Tobing, Y. Wu, T. Toda. Baseline system of Voice Conversion Challenge 2020 with cyclic variational autoencoder and parallel WaveGAN. Proc. Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 155‒159, 2020.
    [Tobing; 2021a] P.L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, T. Toda. Non-parallel voice conversion with cyclic variational autoencoder. Proc. INTERSPEECH, 5 pages, 2021 (to appear).
    [Tobing; 2021b] P.L. Tobing, T. Toda. Non-parallel voice conversion with cyclic variational autoencoder. Proc. 11th ISCA Speech Synthesis Workshop (SSW11), 6 pages, 2021 (to appear).
    [Toda; 2007] T. Toda, A.W. Black, K. Tokuda. Voice conversion based on maximum likelihood estimation of spectral parameter trajectory. IEEE Trans. Audio, Speech & Lang. Process., Vol. 15, No. 8, pp. 2222‒2235, 2007.
    [Toda; 2012] T. Toda, T. Muramatsu, H. Banno. Implementation of computationally efficient real-time voice conversion. Proc. INTERSPEECH, 4 pages, 2012.
    [Toda; 2014] T. Toda. Augmented speech production based on real-time statistical voice conversion. Proc. GlobalSIP, pp. 755‒759, 2014.
    [Toda; 2016] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The Voice Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1632‒1636, 2016.
    [Todisco; 2019] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T.H. Kinnunen, K.A. Lee. ASVspoof 2019: future horizons in spoofed and fake audio detection. Proc. INTERSPEECH, pp. 1008‒1012, 2019.
    [van den Oord; 2016] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, K. Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint, arXiv:1609.03499, 15 pages, 2016.
    [Wu; 2015] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, H. Li. Spoofing and countermeasures for speaker verification: A survey. Speech Commun., Vol. 66, pp. 130‒153, 2015.
    [Wu; 2021a] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, T. Toda. Quasi-periodic parallel WaveGAN: a non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp. 792‒806, 2021.
    [Wu; 2021b] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, T. Toda. Quasi-periodic WaveNet: an autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp. 1134‒1148, 2021.
  • 35. References: 3
    [Zhang; 2020] J.-X. Zhang, L.-J. Liu, Y.-N. Chen, Y.-J. Hu, Y. Jiang, Z.-H. Ling, L.-R. Dai. Voice conversion by cascading automatic speech recognition and text-to-speech synthesis with prosody transfer. Proc. Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 121‒125, 2020.
    * VCC series
    [VCC2016 Summary] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The Voice Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1632‒1636, 2016.
    [VCC2016 Analysis] M. Wester, Z. Wu, J. Yamagishi. Analysis of the Voice Conversion Challenge 2016 evaluation results. Proc. INTERSPEECH, pp. 1637‒1641, 2016.
    [VCC2018 Summary] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, Z. Ling. The voice conversion challenge 2018: promoting development of parallel and nonparallel methods. Proc. Odyssey, pp. 195‒202, 2018.
    [VCC2018 Analysis] T. Kinnunen, J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, Z. Ling. A spoofing benchmark for the 2018 voice conversion challenge: leveraging from spoofing countermeasures for speech artifact assessment. Proc. Odyssey, pp. 187‒194, 2018.
    [VCC2020 Summary] Z. Yi, W.-C. Huang, X. Tian, J. Yamagishi, R.K. Das, T. Kinnunen, Z. Ling, T. Toda. Voice Conversion Challenge 2020 ‒ intra-lingual semi-parallel and cross-lingual voice conversion ‒. Proc. Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 80‒98, 2020.
    [VCC2020 Analysis] R.K. Das, T. Kinnunen, W.-C. Huang, Z. Ling, J. Yamagishi, Z. Yi, X. Tian, T. Toda. Predictions of subjective ratings and spoofing assessments of Voice Conversion Challenge 2020 submissions. Proc. Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 99‒120, 2020.
    * Nonparallel VC w/ speaker-independent representations
    [PPG] L. Sun, K. Li, H. Wang, S. Kang, H.M. Meng. Phonetic posteriorgrams for many-to-one voice conversion without parallel data training. Proc. IEEE ICME, 6 pages, 2016.
    [VAE] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, H.-M. Wang. Voice conversion from non-parallel corpora using variational auto-encoder. Proc. APSIPA ASC, 6 pages, 2016.
    [VQVAE] A. van den Oord, O. Vinyals, K. Kavukcuoglu. Neural discrete representation learning. arXiv preprint, arXiv:1711.00937, 11 pages, 2017.
  • 36. References: 4
    * Vocoder
    [STRAIGHT] H. Kawahara, I. Masuda-Katsuse, A. de Cheveigne. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun., Vol. 27, No. 3‒4, pp. 187‒207, 1999.
    [WORLD] M. Morise, F. Yokomori, K. Ozawa. WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. & Syst., Vol. E99-D, No. 7, pp. 1877‒1884, 2016.
    [LPCNet] J.-M. Valin, J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. Proc. IEEE ICASSP, pp. 5891‒5895, 2019.
    [GlotGAN] L. Juvela, B. Bollepalli, V. Tsiaras, P. Alku. GlotNet - a raw waveform model for the glottal excitation in statistical parametric speech synthesis. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 27, No. 6, pp. 1019‒1030, 2019.
    [GELP] L. Juvela, B. Bollepalli, J. Yamagishi, P. Alku. GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram. Proc. INTERSPEECH, pp. 694‒698, 2019.
    [NSF] X. Wang, S. Takaki, J. Yamagishi. Neural source-filter waveform models for statistical parametric speech synthesis. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 28, pp. 402‒415, 2019.
    [WaveNet] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, K. Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint, arXiv:1609.03499, 15 pages, 2016.
    [Parallel WaveNet] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L.C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, D. Hassabis. Parallel WaveNet: fast high-fidelity speech synthesis. arXiv preprint, arXiv:1711.10433, 11 pages, 2017.
    [WaveRNN] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, K. Kavukcuoglu. Efficient neural audio synthesis. Proc. ICML, pp. 2410‒2419, 2018.
    [PWG] R. Yamamoto, E. Song, J.-M. Kim. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. Proc. IEEE ICASSP, pp. 6199‒6203, 2020.
    [aHM] G. Degottex, Y. Stylianou. Analysis and synthesis of speech using an adaptive full-band harmonic model. IEEE Trans. Audio, Speech & Lang. Process., Vol. 21, No. 10, pp. 2085‒2095, 2013.
    [DDSP] J. Engel, L. Hantrakul, C. Gu, A. Roberts. DDSP: differentiable digital signal processing. Proc. ICLR, 16 pages, 2020.