Interspeech 2017 s_miyoshi


Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality, such as phonetic properties and speaking rate, contained in the posterior probabilities, because the source posterior probabilities are used directly to predict the target speech parameters. In this work, we assume that the training data partly include parallel speech data and propose sequence-to-sequence learning between the source and target posterior probabilities. The conversion models perform a non-linear, variable-length transformation from the source probability sequence to the target one. Further, we propose a joint training algorithm for the modules. In contrast to conventional VC, which separately trains the speech recognition module that estimates posterior probabilities and the speech synthesis module that predicts target speech parameters, our proposed method trains these modules jointly along with the proposed probability conversion modules. Experimental results demonstrate that our approach outperforms conventional VC.

  1. Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari (The University of Tokyo). "Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities." INTERSPEECH 2017, Tue-O-4-10-1, Stockholm, Sweden, Aug. 22, 2017.
  2. Outline of This Talk
     Issue:
     • Voice conversion needs parallel data of source and target speakers.
     Conventional method:
     • Voice conversion using context posterior probabilities (CPPs) [Sun et al., 2016].
       1. Recognition: source speech feats. → source CPPs.
       2. Synthesis: copied source CPPs → target speech feats.
     • Pros: non-parallel voice conversion. Cons: difficulty of converting speaker individuality included in CPPs.
     Proposed:
     • Sequence-to-sequence (Seq2Seq) conversion from source CPPs to target CPPs.
     • Joint training of recognition and synthesis to increase conversion performance.
     Results:
     • Seq2Seq learning achieved variable-length voice conversion.
     • Joint training improved speaker similarity and quality of converted speech.
  3. Conventional Voice Conversion Algorithm: Copied Context Posterior Probability [Sun et al., 2016]
     Training (two LSTM modules, trained separately):
     1. Recognition $R(\cdot)$: source speech feats. $\mathbf{x}$ → source CPP $\hat{\mathbf{p}}_x$.
        Recognition error (softmax cross entropy): $L_C(\hat{\mathbf{p}}_x, \mathbf{l}_x)$, where $\mathbf{l}_x$ is the source context label.
     2. Synthesis $G(\cdot)$: target CPP $\hat{\mathbf{p}}_y$ → target speech feats. $G(\hat{\mathbf{p}}_y)$.
        Synthesis error (mean squared error): $L_G(G(\hat{\mathbf{p}}_y), \mathbf{y})$, where $\mathbf{y}$ is the target speech feats.
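To make the two separately trained modules concrete, here is a minimal PyTorch-style sketch (not the authors' code; the dimensions, names, and dummy tensors are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMModule(nn.Module):
    """A bidirectional LSTM followed by a per-frame linear projection."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):          # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)
        return self.proj(h)        # (batch, frames, out_dim)

R = LSTMModule(in_dim=25, out_dim=224)   # recognition: speech feats. -> CPP logits
G = LSTMModule(in_dim=224, out_dim=25)   # synthesis: CPP -> speech feats.

# Dummy data standing in for one utterance (shapes are assumptions):
x = torch.randn(1, 100, 25)                        # source speech feats.
y = torch.randn(1, 100, 25)                        # target speech feats.
l_x = torch.randint(0, 224, (1, 100))              # source context labels
p_y = torch.softmax(torch.randn(1, 100, 224), -1)  # target CPPs

# 1. Recognition error L_C: softmax cross entropy against source labels.
L_C = F.cross_entropy(R(x).transpose(1, 2), l_x)
# 2. Synthesis error L_G: mean squared error against target speech feats.
L_G = F.mse_loss(G(p_y), y)
# "Separated training": L_C updates only R, and L_G updates only G.
```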
  4. Conventional Voice Conversion Algorithm: Copied Context Posterior Probability [Sun et al., 2016]
     Conversion (conventional):
     1. Recognition: source speech feats. $\mathbf{x}$ → source CPP $\hat{\mathbf{p}}_x = R(\mathbf{x})$.
     2. Copy: the source CPP is used directly as the target CPP.
     3. Synthesis: predicted speech feats. $\hat{\mathbf{y}} = G(\hat{\mathbf{p}}_x)$.
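Reusing the illustrative names from the sketch above, conversion in the conventional method is simply recognize-then-synthesize with the CPP copied across speakers:

```python
# Conversion (conventional): the source CPP is copied unchanged, so any
# speaker individuality it carries leaks into the predicted target speech.
p_x = torch.softmax(R(x), dim=-1)   # 1. recognition: source feats. -> source CPP
y_hat = G(p_x)                      # 2.-3. synthesis from the copied CPP
```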
  5. Issues of Conventional Voice Conversion
     1. CPPs' shapes and lengths are significantly different between speakers: the shapes differ, and the lengths of each phoneme differ.
     2. Improving recognition accuracy ≠ improving synthesis accuracy: the conventional method trains speech recognition and synthesis separately.
  6. Proposed Algorithms
     1. Sequence-to-Sequence conversion from source CPP to target CPP.
     2. Joint training of recognition and synthesis (like auto-encoding).
  7. Sequence-to-Sequence Learning [Sutskever et al., 2014]
     • Sequence-to-sequence learning performs variable-length conversion, e.g., Japanese-to-English translation: input sequence "雨 が 降る" (encoder) → output sequence "It rains" (decoder).
     • Problems of Seq2Seq conversion of CPPs:
       - Determining duration is difficult.
       - Conversion failures propagate if the number of frames to be generated is large [Weng et al., 2016].
     • Constraints adopted here:
       - Phoneme duration is given.
       - Conversion is done phoneme by phoneme.
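As a rough sketch of what such an encoder-decoder looks like (assumed PyTorch code, not the paper's implementation; the output length is supplied externally, matching the "phoneme duration is given" constraint):

```python
class Seq2Seq(nn.Module):
    """Encoder-decoder: a bidirectional-LSTM encoder summarizes the input;
    an LSTM decoder autoregressively emits a sequence of a different length."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.enc = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.dec = nn.LSTM(dim, hidden, batch_first=True)
        self.bridge_h = nn.Linear(2 * hidden, hidden)  # fold both directions
        self.bridge_c = nn.Linear(2 * hidden, hidden)  # into the decoder state
        self.out = nn.Linear(hidden, dim)

    def forward(self, src, out_len):
        _, (h, c) = self.enc(src)
        h = torch.tanh(self.bridge_h(torch.cat([h[0], h[1]], -1))).unsqueeze(0)
        c = torch.tanh(self.bridge_c(torch.cat([c[0], c[1]], -1))).unsqueeze(0)
        step = src.new_zeros(src.size(0), 1, src.size(-1))  # start-of-sequence
        outputs = []
        for _ in range(out_len):            # decode for the given duration
            o, (h, c) = self.dec(step, (h, c))
            step = self.out(o)              # feed each prediction back in
            outputs.append(step)
        return torch.cat(outputs, dim=1)    # (batch, out_len, dim)
```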
  8. Sequence-to-Sequence Conversion of CPPs
     Conversion pipeline:
     1. Recognition: source speech feats. $\mathbf{x}$ → source CPP $\hat{\mathbf{p}}_x = R(\mathbf{x})$.
     2. Seq2Seq conversion $C(\cdot)$: source CPP → converted CPP $C(\hat{\mathbf{p}}_x)$.
     3. Synthesis: predicted speech feats. $\hat{\mathbf{y}} = G(C(\hat{\mathbf{p}}_x))$.
     Loss function: $L_G(C(\hat{\mathbf{p}}_x), \hat{\mathbf{p}}_y) + L_C(C(\hat{\mathbf{p}}_x), \mathbf{l}_y)$
     • The first term (mean squared error) minimizes the conversion error.
     • The second term (softmax cross entropy between the predicted CPPs and the target labels) alleviates the recognition error included in the target CPPs.
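Continuing the assumed names from the earlier snippets (C is the Seq2Seq sketch above; l_y is a hypothetical target label tensor), the proposed loss could be computed as follows. For brevity this converts the whole utterance at once, whereas the paper converts phoneme by phoneme:

```python
C = Seq2Seq(dim=224)                           # CPP-to-CPP conversion model
l_y = torch.randint(0, 224, (1, p_y.size(1)))  # target context labels (dummy)

c_logits = C(p_x, out_len=p_y.size(1))         # converted CPP logits
L_G_term = F.mse_loss(torch.softmax(c_logits, -1), p_y)    # conversion error
L_C_term = F.cross_entropy(c_logits.transpose(1, 2), l_y)  # vs. target labels
loss = L_G_term + L_C_term
```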
  9. Effect of the Proposed Algorithm
     [Figure: source CPP, target CPP, and CPP after Seq2Seq conversion, plotted as probability (0-1) over frames.]
     Variable-length conversion of CPPs is achieved!
 10. Joint Training of Speech Recognition and Synthesis
     Training (auto-encoding: the source speech feats. are recognized and then resynthesized):
     1. Recognition: source speech feats. $\mathbf{x}$ → source CPP $\hat{\mathbf{p}}_x = R(\mathbf{x})$.
     2. Synthesis: reconstructed speech feats. $G(\hat{\mathbf{p}}_x)$.
     Joint loss: $L_C(R(\mathbf{x}), \mathbf{l}_x) + L_G(G(\hat{\mathbf{p}}_x), \mathbf{x})$
     • The first term is the conventional recognition error.
     • The second term is the synthesis error computed using the predicted CPP, so recognition and synthesis are trained jointly.
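In terms of the earlier sketch, joint training backpropagates one summed loss through both modules (again an assumed rendering, not the released code):

```python
# Joint training (auto-encoding): recognize the source speech, resynthesize
# it from the predicted CPP, and update R and G with a single summed loss.
logits = R(x)
p_hat = torch.softmax(logits, dim=-1)                  # predicted source CPP
loss = (F.cross_entropy(logits.transpose(1, 2), l_x)   # conventional L_C term
        + F.mse_loss(G(p_hat), x))                     # synthesis error via predicted CPP
loss.backward()                                        # gradients reach both R and G
```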
 11. Experimental Evaluations
 12. Experimental Setup
     Dataset: ATR Japanese speech database (phonetically balanced, 503 sentences)
     Training/Test: 450 sentences / 53 sentences (16 kHz sampling)
     Linguistic feats.: 224-dimensional vectors (phonemes)
     Speech feats.: mel-cepstrum (1st through 24th) + delta
     Optimization algorithm: AdaGrad (learning rate = 0.01) [Duchi et al., 2011]
     Recognition/Synthesis model: bidirectional LSTM (256 units)
     Encoder/Decoder: bidirectional LSTM / LSTM (256 units each)
     Number of speakers: 8, including the source and target speakers
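For completeness, the reported optimizer setting maps onto a short snippet (grouping all three illustrative modules into one optimizer is an assumption):

```python
# AdaGrad with the learning rate from the table, over all three modules.
params = list(R.parameters()) + list(G.parameters()) + list(C.parameters())
optimizer = torch.optim.Adagrad(params, lr=0.01)
```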
 13. Objective and Subjective Evaluations of Seq2Seq Learning
     [Figures: objective and subjective evaluation results comparing source and target; better on some measures, worse on one. Error bars denote 95% confidence intervals.]
     Voice samples are available online.
 14. Objective Evaluation of Joint Training
     Joint training achieved a better mel-cepstral distortion score.
     (Auto-encoding case: the reconstruction error is calculated after recognition and synthesis.)
 15. Subjective Evaluation of Joint Training
     Joint training improved both speaker similarity and speech quality.
 16. Conclusion
     Issue:
     • Difficulty of converting speaker individuality included in CPPs.
     • Improving recognition accuracy ≠ improving synthesis accuracy.
     Proposed:
     • Sequence-to-sequence (Seq2Seq) conversion from source CPPs to target CPPs.
     • Joint training of recognition and synthesis.
     Results:
     • Seq2Seq learning achieved variable-length voice conversion.
     • Joint training improved speaker similarity and quality of converted speech.
