Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance

Hieu-Thi Luong, Junichi Yamagishi
Yamagishi Laboratory, National Institute of Informatics, Japan
Motivation
● In general, speech synthesis research and development (TTS: text-to-speech, VC: voice conversion, ...) focuses on the end product, which is natural and expressive synthesized speech (with or without an emphasis on speaker individuality).
● However, by training latent representations with specific traits, we open up possibilities for many applications/scenarios. Specifically, a discrete latent feature is a low bit-rate representation compared with a continuous one.
● In this work, we investigate the use of a vector quantized variational autoencoder (VQVAE) component to model the linguistic latent space of our unified TTS/VC voice cloning system (named NAUTILUS).
● Besides general evaluations of subjective quality/similarity, we also show how our carefully designed system can be used to investigate the latent space, as a preliminary study for future research on this topic.
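A back-of-the-envelope illustration of the bit-rate point, with assumed numbers rather than values from the paper: a codebook of K = 256 vectors at a 50 Hz frame rate costs 50 × log2(256) = 400 bit/s to store or transmit, whereas a 64-dimensional float32 continuous embedding at the same frame rate costs 50 × 64 × 32 = 102,400 bit/s.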
NAUTILUS-VQ
(←left) A voice cloning system that combines TTS and VC (NAUTILUS*) with a vector quantization (VQ) component used to turn the continuous latent linguistic embedding (LLE) into discrete vectors.
*We proposed the NAUTILUS system with continuous LLE in our previous publication:
[1] Luong, Hieu-Thi, and Junichi Yamagishi. "Nautilus: a versatile voice cloning system." IEEE/ACM Transactions
on Audio, Speech, and Language Processing 28 (2020): 2967-2981.
Text-to-speech Stack (Multispeaker TTS)
Speech-to-speech Stack (Many-to-many VC)
MAE: mean-absolute error
MSE: mean-square error
CE: cross-entropy
x: phonetic feature
y: acoustic feature
o: waveform
z: LLE
q: quantized LLE
spk: with speaker components (ohv: one-hot vector)
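As a minimal sketch of what the VQ component does (illustrative names and shapes, not the paper's implementation): each continuous LLE frame z is mapped to its nearest codebook vector q, with a straight-through gradient as in the standard VQVAE.

    import torch

    def quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
        # z: (frames, dim) continuous LLE; codebook: (K, dim) learned code vectors
        d = torch.cdist(z, codebook)       # pairwise distances, shape (frames, K)
        q = codebook[d.argmin(dim=1)]      # nearest codebook vector per frame
        return z + (q - z).detach()        # straight-through estimator for training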
NAUTILUS-VQ: 1. INITIAL TRAINING
The main difference from the original is that we replace the VAE-based encoders with VQVAE-based encoders, which changes the setup of the initial supervised (text+speech) training steps:
VAE-based encoders (Original)
VQVAE-based encoders (Proposed)
Equation (1) is the joint training objective of a multimodal neural net with joint-goal and tied-layer losses [1]. Equations (2) and (3) take the form of a typical VQVAE setup.
[Equations (1)-(4) appear as images in the original slides; the text and speech encoder modules are interchangeable.]
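As a hedged reconstruction of the form that Equations (2) and (3) follow, the typical VQVAE objective in the deck's notation (z: continuous LLE, q: quantized LLE, sg: stop-gradient, β: commitment weight) is:

    \mathcal{L}_{\text{VQ}} = \mathcal{L}_{\text{rec}} + \lVert \mathrm{sg}[z] - q \rVert_2^2 + \beta \lVert z - \mathrm{sg}[q] \rVert_2^2

The second term pulls codebook vectors toward encoder outputs, and the third keeps encoder outputs committed to their assigned codes; the paper's exact losses may differ.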
NAUTILUS-VQ: 2. VOICE CLONING
Using pretrained modules from the previous stage, we want to "clone" new voices of unseen speakers using their untranscribed speech data:
Step 1 (←left): adapting the speech decoder and neural vocoder independently using the STS stack (no text involved).
Step 2 (right→): tuning the speech decoder and neural vocoder together using the STS stack.
Step 1 - Adaptation | Step 2 - "Welding"
[Equations (5)-(7) appear as images in the original slides: (5) tunes the speech decoder, (6) tunes the neural vocoder.]
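A minimal PyTorch-style sketch of this two-step recipe is below. Module and loss names (speech_encoder, speech_decoder, vocoder, mae, ce) are hypothetical stand-ins for the components in the figures, and the losses only gesture at Equations (5)-(7):

    import torch

    def set_trainable(module: torch.nn.Module, flag: bool) -> None:
        for p in module.parameters():
            p.requires_grad = flag

    def clone_voice(speech_encoder, speech_decoder, vocoder, loader, mae, ce):
        set_trainable(speech_encoder, False)  # encoder stays frozen throughout

        # Step 1 (adaptation): tune decoder and vocoder independently.
        opt_dec = torch.optim.Adam(speech_decoder.parameters(), lr=1e-4)
        opt_voc = torch.optim.Adam(vocoder.parameters(), lr=1e-4)
        for y, o in loader:                       # y: acoustic features, o: waveform
            with torch.no_grad():
                q = speech_encoder(y)             # quantized LLE, no gradient
            loss_dec = mae(speech_decoder(q), y)  # cf. Eq. (5)
            opt_dec.zero_grad()
            loss_dec.backward()
            opt_dec.step()
            loss_voc = ce(vocoder(y), o)          # cf. Eq. (6)
            opt_voc.zero_grad()
            loss_voc.backward()
            opt_voc.step()

        # Step 2 ("welding"): tune decoder and vocoder together, so the vocoder
        # learns to handle the decoder's outputs rather than natural features.
        opt = torch.optim.Adam(
            list(speech_decoder.parameters()) + list(vocoder.parameters()), lr=1e-5)
        for y, o in loader:
            y_hat = speech_decoder(speech_encoder(y))
            loss = mae(y_hat, y) + ce(vocoder(y_hat), o)  # cf. Eq. (7)
            opt.zero_grad()
            loss.backward()
            opt.step()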
NAUTILUS-VQ: 3. TTS/VC INFERENCE
Even though we only used the speech encoder in the voice cloning stage, we can use either the text or the speech encoder to create a complete TTS/VC system.
→ This works because we rely on the consistency between the text and speech encoders, which we enforced with various policies (jointly training TTS/VC, "tying" the latent spaces, ...) during the initial training stage.
*Figures are highly simplified for presentation purposes; the network architecture and detailed setup can be found in our paper and in [1].
Inference/Text-to-Speech | Inference/Voice Conversion
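In the same hypothetical notation as the sketch above, the two inference modes differ only in which encoder produces the quantized LLE; the decoder and vocoder are shared:

    def synthesize_tts(x, text_encoder, speech_decoder, vocoder):
        q = text_encoder(x)                # phonetic features x -> quantized LLE
        return vocoder(speech_decoder(q))  # LLE -> acoustic features -> waveform

    def convert_voice(y, speech_encoder, speech_decoder, vocoder):
        q = speech_encoder(y)              # source acoustic features y -> quantized LLE
        return vocoder(speech_decoder(q))  # resynthesized in the cloned target voice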
Shaping the linguistic latent spaces
The purpose of the multimodal setup is to train a highly consistent and interchangeable text encoder and speech encoder. We enforce this by assuming certain properties of the linguistic latent space:
(a) Standard: simple, but text and speech do not have a one-to-one relation.
(b) VAE-based [1]: denoising training with a density-wise tying function, but it inherits the weaknesses of the VAE.
(c) VQVAE-based: discrete latent feature; its other properties are still an open question.
Consistency (ideally): text and speech of the same content will be transformed into an identical LLE sequence by the text and speech encoders.
Experiments
A Japanese voice cloning TTS/VC system using untranscribed speech of target speakers:
● Initial training: jointly trained (supervised) using speech and text from ~236 hours of the 16 kHz CSJ corpus and ~134 hours of a 24 kHz in-house corpus.
● Voice cloning: using only the speech of target speakers (Table 1) to clone new voices for the TTS/VC system.
*Detailed experimental setups and the network architecture can be found in the paper.
Speaker  Gender  Quantity    Duration
F001     female   483 utt.   45.0 min
F002     female   481 utt.   44.4 min
F003     female   484 utt.   47.4 min
F004     female   468 utt.   40.8 min
F005     female   485 utt.   47.6 min
XL10     female    10 utt.   55 s
                  125 utt.   10.9 min
                  500 utt.   44.5 min
                 2000 utt.   2.9 h
                 8750 utt.   12.9 h

Table 1: Japanese target speakers for the voice cloning task.
Evaluations
[Figures: subjective quality and similarity results for Exp01 (left) and Exp02 (right)]
Exp01 (←left): comparing NAUTILUS and NAUTILUS-VQ on the voice cloning task with ~45 minutes of untranscribed speech.
→ NAUTILUS is better at most data points.
Exp02 (right→): comparing the proposed voice cloning systems with a conventional speaker-dependent (SD) TTS system (BASELINE) trained on ~12 hours of transcribed speech from a single speaker.
→ NAUTILUS and NAUTILUS-VQ are not as good as BASELINE but require far less demanding data.
Each speaker/system/task was evaluated 300 times.
Discrete LLE sequences
The discrete LLE sequences generated by the speech and text encoders from the respective speech and text inputs of the same content overlap by 54.41% in this particular example utterance.
[Figure: speech-encoded LLE, text-encoded LLE, and their overlap over time for the example utterance]
The text encoder transforms the /SIL/ phoneme into three different discrete vectors, while the speech encoder transforms silence segments into many more discrete vectors.
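The overlap figure can be computed with a simple frame-level comparison; the sketch below assumes both sequences are already time-aligned lists of codebook indices of equal length (the alignment step itself is not shown):

    def code_overlap(text_codes, speech_codes):
        # percentage of frames receiving the same discrete code from both encoders
        assert len(text_codes) == len(speech_codes)
        matches = sum(t == s for t, s in zip(text_codes, speech_codes))
        return 100.0 * matches / len(text_codes)

Under this reading, the reported 54.41% means just over half of the frames receive the same discrete code from both encoders.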
VQ codes histogram (1)
[Figure: histograms of discrete LLE indices (bin percentages over all training data) for speech-encoded LLE, text-encoded LLE, and their overlap, over all data and for the /SIL/ phoneme]
The /SIL/ phoneme histogram concentrates on two vectors for both the text-encoded and the speech-encoded LLE.
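These histograms can be produced by counting code indices per phoneme label; a small sketch, assuming frame-level code indices and forced-alignment phoneme labels are available (names are hypothetical):

    from collections import Counter

    def code_histogram(codes, phonemes, target=None):
        # bin percentage of each VQ code, optionally restricted to one phoneme
        selected = [c for c, p in zip(codes, phonemes)
                    if target is None or p == target]
        total = len(selected)
        return {code: 100.0 * n / total for code, n in Counter(selected).items()}

    # e.g. code_histogram(text_codes, text_phonemes, target="SIL")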
VQ codes histogram (2)
[Figure: per-phoneme histograms of discrete LLE indices for /i/, /r/, /a/, /k/, and /g/]
Conclusion
● Unified voice cloning TTS/VC (why?): convenient and versatile. Moreover, by creating a system that operates in two parallel modes (TTS/VC), we created a way to inspect the hidden feature.
→ If we can observe it, then we can utilize it ... somehow (future work).
● Discrete LLE (why?): useful for specific applications (low bit-rate representation, a finite set of discrete vectors). Moreover, it is easier to analyze a discrete latent space than a continuous one.
→ A model with high quality/similarity is required for latent space inspection, as low performance suggests that no useful information is learned by the LLE.
In summary: VQVAE-based encoders can be used for our unified voice cloning system with acceptable performance and allow more interesting analyses of the linguistic latent space.
Thank you for your attention