Interspeech 2017 s_miyoshi

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities
Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari
(The University of Tokyo)
INTERSPEECH Tue-O-4-10-1
Stockholm, Sweden
Aug. 22, 2017

Outline of This Talk
Issue:
 Voice conversion needs parallel data of source and target speakers.
Conventional method
 Voice conversion using context posterior probabilities (CPPs). [Sun et al., 2016]
1. Recognition: source speech feats. → source CPPs.
2. Synthesis: copied source CPPs. → target speech feats.
Pros. : Non-parallel voice conversion
Cons. : Difficulty of converting speaker individuality included in CPPs
Proposed:
 Sequence-to-sequence (Seq2Seq) conversion from source CPPs to target CPPs
 Joint training of recognition and synthesis to increase conversion performance
Results:
 Seq2Seq learning achieved variable-length voice conversion.
 Joint training improved speaker similarity and quality of converted speech.

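Conventional Voice Conversion Algorithm:
Copied Context Posterior Probability
[Sun et al., 2016] Training
[Figure: two LSTM networks trained separately, with time on the horizontal axis; the CPPs are illustrated for the phonemes a, i, u.]
1. Recognition: $\mathbf{R}(\cdot)$ (LSTM) maps source speech feats. $\mathbf{x}$ to CPPs $\hat{\mathbf{p}}_x$.
Recognition error (softmax cross entropy): $L_C(\hat{\mathbf{p}}_x, \mathbf{l}_x)$, where $\mathbf{l}_x$ is the context label.
2. Synthesis: $\mathbf{G}(\cdot)$ (LSTM) maps target CPPs $\hat{\mathbf{p}}_y$ to target speech feats. $\mathbf{y}$.
Synthesis error (mean squared error): $L_G(\mathbf{G}(\hat{\mathbf{p}}_y), \mathbf{y})$
Separated training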
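The two training criteria on this slide can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the toy sizes, the random arrays standing in for the output of R(·), the linear map standing in for G(·), and the use of integer class indices for the context labels are all assumptions.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def recognition_loss(cpp, labels):
    """L_C: mean cross entropy between CPPs (T, K) and integer labels (T,)."""
    t = np.arange(cpp.shape[0])
    return float(-np.mean(np.log(cpp[t, labels] + 1e-12)))

def synthesis_loss(pred_feats, ref_feats):
    """L_G: mean squared error between predicted and reference features."""
    return float(np.mean((pred_feats - ref_feats) ** 2))

rng = np.random.default_rng(0)
T, K, D = 5, 4, 3  # frames, context classes, feature dims (toy sizes)

# Recognition side: source CPPs scored against source context labels.
p_x = softmax(rng.normal(size=(T, K)))  # stand-in for R(x)
l_x = rng.integers(0, K, size=T)
loss_c = recognition_loss(p_x, l_x)

# Synthesis side: target CPPs mapped to target features, trained independently.
p_y = softmax(rng.normal(size=(T, K)))
W_g = rng.normal(size=(K, D))           # hypothetical linear stand-in for G(.)
y = rng.normal(size=(T, D))
loss_g = synthesis_loss(p_y @ W_g, y)
```

Neither loss sees the other network's parameters here, which is what "separated training" means in this sketch.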

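Conventional Voice Conversion Algorithm:
Copied Context Posterior Probability
[Sun et al., 2016] Conversion (conventional)
[Figure: conversion pipeline, with time on the horizontal axis.]
Source speech feats. $\mathbf{x}$ → $\mathbf{R}(\cdot)$ (LSTM) → source CPPs $\hat{\mathbf{p}}_x$ → COPY (used in place of the target CPPs $\hat{\mathbf{p}}_y$) → $\mathbf{G}(\cdot)$ (LSTM) → predicted speech feats. $\hat{\mathbf{y}}$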
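A rough sketch of the copy-based conversion, assuming frame-wise stand-ins for the two networks: `convert_by_copy`, `W_r`, and `W_g` are hypothetical names, and the linear maps replace the LSTMs only to keep the example short.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def convert_by_copy(x, recognize, synthesize):
    """Conventional conversion: source CPPs go to the target synthesizer unchanged."""
    p_x = recognize(x)      # source CPPs, one row per source frame
    return synthesize(p_x)  # COPY: p_x is used where target CPPs are expected

rng = np.random.default_rng(0)
T, D, K = 7, 3, 4               # frames, feature dims, context classes (toy sizes)
W_r = rng.normal(size=(D, K))   # hypothetical stand-in for R(.)
W_g = rng.normal(size=(K, D))   # hypothetical stand-in for G(.)

x = rng.normal(size=(T, D))
y_hat = convert_by_copy(x, lambda f: softmax(f @ W_r), lambda p: p @ W_g)
print(y_hat.shape)  # (7, 3): same number of frames as the source
```

Because every step is frame-synchronous, the converted speech always has the source's frame count.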

Issues of Conventional Voice Conversion
1. CPPs' shapes and lengths differ significantly between speakers.
・Shapes are different.
・The length of each phoneme is different.
2. Improving recognition accuracy ≠ improving synthesis accuracy
The conventional method trains speech recognition and synthesis separately.

Proposed Algorithms
1. Sequence-to-Sequence Conversion from
Source CPP to Target CPP
2. Joint Training of Recognition and Synthesis
(like auto-encoding)

Sequence-to-Sequence Learning [Sutskever et al., 2014]
 Sequence-to-Sequence Learning: variable-length conversion
Example: Japanese-to-English translation using Seq2Seq learning
Input sequence (Encoder): 雨が降る ("It rains")
Output sequence (Decoder): It rains
 Problems of Seq2Seq conversion of CPPs
・Determining duration is difficult.
・Conversion failures propagate if the number of frames to be generated is large. [Weng et al., 2016]
Constraints
 Phoneme duration is given.
 Conversion is done phoneme by phoneme.
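The two constraints can be illustrated with a small sketch: the CPP sequence is cut at phoneme boundaries, and each segment is converted to a given number of output frames. The per-segment converter below is plain linear interpolation, swapped in as a placeholder for the Seq2Seq model; the function names and the reading that target durations are supplied externally are assumptions.

```python
import numpy as np

def split_by_durations(seq, durations):
    """Split a (T, K) sequence into per-phoneme segments with the given frame counts."""
    assert sum(durations) == len(seq)
    bounds = np.cumsum(durations)[:-1]
    return np.split(seq, bounds)

def resample_segment(segment, n_out):
    """Placeholder converter: linear interpolation to n_out frames (NOT a Seq2Seq model)."""
    n_in, k = segment.shape
    src_pos = np.linspace(0.0, 1.0, n_in)
    dst_pos = np.linspace(0.0, 1.0, n_out)
    cols = [np.interp(dst_pos, src_pos, segment[:, j]) for j in range(k)]
    return np.stack(cols, axis=1)

def convert_phoneme_by_phoneme(src_cpp, src_durs, tgt_durs, convert=resample_segment):
    """Convert each phoneme segment separately, then concatenate the results."""
    segments = split_by_durations(src_cpp, src_durs)
    converted = [convert(seg, n) for seg, n in zip(segments, tgt_durs)]
    return np.concatenate(converted, axis=0)

src_cpp = np.tile([0.7, 0.2, 0.1], (6, 1))  # 6 source frames, 3 classes
out = convert_phoneme_by_phoneme(src_cpp, src_durs=[2, 4], tgt_durs=[3, 2])
print(out.shape)  # (5, 3)
```

Converting per phoneme keeps each generated sequence short, which limits how far a conversion failure can propagate.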

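Sequence-to-Sequence Conversion of CPPs
 Conversion
[Figure: conversion pipeline, with time on the horizontal axis.]
Source speech feats. $\mathbf{x}$ → $\mathbf{R}(\cdot)$ (LSTM) → source CPPs $\hat{\mathbf{p}}_x$ → Seq2Seq conversion $\mathbf{C}(\cdot)$ → converted CPPs $\mathbf{C}(\hat{\mathbf{p}}_x)$ → $\mathbf{G}(\cdot)$ (LSTM) → target speech feats. $\hat{\mathbf{y}} = \mathbf{G}(\mathbf{C}(\hat{\mathbf{p}}_x))$
Loss function: $L_G(\mathbf{C}(\hat{\mathbf{p}}_x), \hat{\mathbf{p}}_y) + L_C(\mathbf{C}(\hat{\mathbf{p}}_x), \mathbf{l}_y)$
・$L_G$: mean squared error. Minimizes conversion error.
・$L_C$: softmax cross entropy between predicted CPPs and target labels. Alleviates recognition error included in target CPPs.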
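A small numeric sketch of the two-term loss, assuming equal weighting of the terms (the slide shows no weights), integer class indices for the target labels, and converted CPPs that already have the same number of frames as the target. The arrays are made-up toy values.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two (T, K) arrays."""
    return float(np.mean((a - b) ** 2))

def cross_entropy(probs, labels):
    """Mean cross entropy between (T, K) probabilities and integer labels (T,)."""
    t = np.arange(len(labels))
    return float(-np.mean(np.log(probs[t, labels] + 1e-12)))

def seq2seq_cpp_loss(converted_cpp, target_cpp, target_labels):
    """Conversion term (vs. target CPPs) plus label term (vs. target labels)."""
    conversion_term = mse(converted_cpp, target_cpp)
    label_term = cross_entropy(converted_cpp, target_labels)
    return conversion_term + label_term, conversion_term, label_term

l_y = np.array([0, 0, 1, 2])       # target context labels (toy values)
p_y = np.array([[0.8, 0.1, 0.1],
                [0.6, 0.3, 0.1],
                [0.2, 0.7, 0.1],
                [0.1, 0.5, 0.4]])  # target CPPs; the last frame is misrecognized

# A converter that reproduces the target CPPs exactly, recognition error included.
total, conversion_term, label_term = seq2seq_cpp_loss(p_y, p_y, l_y)
```

In this toy case the conversion term is zero, but the label term stays positive, so the converter is still pushed away from copying the recognition error in the target CPPs.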

Effect of the Proposed Algorithm
 Variable-length voice conversion
[Figure: source CPP, target CPP, and CPP after Seq2Seq conversion; horizontal axis: time (frames), vertical axis: 0 to 1.]
Variable-length conversion of CPPs is achieved!

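Joint Training of
Speech Recognition and Synthesis
 Training
[Figure: recognition and synthesis LSTMs trained jointly, with time on the horizontal axis.]
Source speech feats. $\mathbf{x}$ → $\mathbf{R}(\cdot)$ (LSTM) → source CPPs $\hat{\mathbf{p}}_x$ → $\mathbf{G}(\cdot)$ (LSTM) → reconstructed source speech feats. $\mathbf{G}(\hat{\mathbf{p}}_x)$
Joint training loss: $L_C(\mathbf{R}(\mathbf{x}), \mathbf{l}_x) + L_G(\mathbf{G}(\hat{\mathbf{p}}_x), \mathbf{x})$
・$L_C$: recognition error (conventional term).
・$L_G$: synthesis error using the predicted CPPs.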
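The joint loss can be sketched as follows, with linear maps standing in for the two LSTMs. `joint_loss`, `W_r`, and `W_g` are hypothetical names, the terms are assumed to be equally weighted, and the context labels are assumed to be integer class indices.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_loss(x, l_x, W_r, W_g):
    """L_C(R(x), l_x) + L_G(G(p_x), x) with linear stand-ins for R and G."""
    p_x = softmax(x @ W_r)                                  # R(x): predicted source CPPs
    t = np.arange(len(l_x))
    loss_c = float(-np.mean(np.log(p_x[t, l_x] + 1e-12)))   # recognition error
    loss_g = float(np.mean((p_x @ W_g - x) ** 2))           # error of reconstructing x
    return loss_c + loss_g, loss_c, loss_g

rng = np.random.default_rng(0)
T, D, K = 6, 3, 4  # frames, feature dims, context classes (toy sizes)
x = rng.normal(size=(T, D))
l_x = rng.integers(0, K, size=T)
W_r = rng.normal(size=(D, K))
W_g = rng.normal(size=(K, D))

total, loss_c, loss_g = joint_loss(x, l_x, W_r, W_g)
# Changing only the recognizer also changes the synthesis term.
_, _, loss_g_scaled = joint_loss(x, l_x, 2.0 * W_r, W_g)
```

Since the synthesis term is computed from the predicted CPPs, its value depends on the recognizer's parameters as well, which is the coupling that separated training lacks.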

Experimental Setup
Dataset                        ATR Japanese speech database (503 phonetically balanced sentences)
Training / test                450 sentences / 53 sentences (16 kHz sampling)
Linguistic feats.              224-dimensional vectors (phonemes)
Speech feats.                  Mel-cepstrum (1st through 24th) + delta
Optimization algorithm         AdaGrad (learning rate = 0.01) [Duchi et al., 2011]
Recognition / synthesis model  Bidirectional LSTM (256 units)
Encoder / decoder              Bidirectional LSTM / LSTM (256 units each)
Number of speakers             8, including the source and target speakers
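For reference, the setup can be written down as a small configuration object. The class and field names are hypothetical, and `speech_feat_dim` assumes that "+ delta" adds one delta coefficient per static coefficient, which the slide does not spell out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    """Values copied from the setup slide; names are illustrative only."""
    n_sentences: int = 503
    n_train: int = 450
    n_test: int = 53
    sampling_rate_hz: int = 16000
    linguistic_dim: int = 224
    mcep_first: int = 1
    mcep_last: int = 24
    use_delta: bool = True
    optimizer: str = "AdaGrad"
    learning_rate: float = 0.01
    lstm_units: int = 256
    n_speakers: int = 8

    @property
    def speech_feat_dim(self) -> int:
        # Assumption: static coefficients plus an equal number of deltas.
        static = self.mcep_last - self.mcep_first + 1
        return static * (2 if self.use_delta else 1)

cfg = ExperimentConfig()
print(cfg.n_train + cfg.n_test)  # 503
print(cfg.speech_feat_dim)       # 48 under the delta assumption above
```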

Objective and Subjective Evaluations of
Seq2Seq Learning
[Figures: objective evaluation and subjective evaluation results, with the "Better" and "Worse" directions marked. Error bars denote 95% confidence intervals.]
Voice samples (source and target) are available online.
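As an illustration of what a 95% confidence interval on a listening-test score involves, here is a sketch that assumes binary preference votes and a normal approximation. The slides do not state the test design or how the intervals were computed, and the vote counts below are made up.

```python
import math

def preference_ci95(n_preferred, n_total):
    """Preference score with a normal-approximation 95% confidence interval."""
    p = n_preferred / n_total
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / n_total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical counts: 60 of 100 votes prefer the proposed method.
score, low, high = preference_ci95(60, 100)
```

If the interval excludes 0.5, the preference is unlikely to be explained by chance alone under these assumptions.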

Objective Evaluation of Joint Training
[Figure: mel-cepstral distortion results, with the "Better" direction marked.]
Joint training achieved a better mel-cepstral distortion score.
Auto-encoding case: the reconstruction error is calculated after recognition and synthesis.
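Mel-cepstral distortion is commonly computed per frame as (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²) and averaged over frames. The sketch below uses that common definition; the slides do not say which variant was used, and the inputs are assumed to be time-aligned with the 0th coefficient already excluded.

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_conv):
    """Frame-averaged MCD in dB for time-aligned (T, D) mel-cepstra."""
    diff = c_ref - c_conv
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

c_ref = np.zeros((10, 24))  # toy reference: 10 frames, 24 coefficients
c_conv = c_ref + 0.1        # toy converted features with a constant offset
mcd = mel_cepstral_distortion(c_ref, c_conv)
```

In the auto-encoding case described on this slide, `c_ref` would be the input features and `c_conv` their reconstruction after recognition and synthesis.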

Subjective Evaluation of Joint Training
[Figures: subjective evaluation results, with the "Better" direction marked.]
Joint training improved both speaker similarity and speech quality.

Conclusion
Issue:
 Difficulty of converting speaker individuality included in CPPs.
 Improving recognition accuracy ≠ improving synthesis accuracy.
Proposed:
 Sequence-to-sequence (Seq2Seq) conversion from source CPPs to target CPPs.
 Joint training of recognition and synthesis.
Results:
 Seq2Seq learning achieved variable-length voice conversion.
 Joint training improved speaker similarity and quality of converted speech.
