Interspeech 2017 s_miyoshi
The University of Tokyo, Aug. 12, 2017
  1. Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari (The University of Tokyo): "Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities." INTERSPEECH 2017, Tue-O-4-10-1, Stockholm, Sweden, Aug. 22, 2017.
  2. Outline of This Talk
     Issue:
     - Voice conversion needs parallel data of source and target speakers.
     Conventional method:
     - Voice conversion using context posterior probabilities (CPPs) [Sun et al., 2016]:
       1. Recognition: source speech feats. → source CPPs.
       2. Synthesis: copied source CPPs → target speech feats.
     - Pros: non-parallel voice conversion. Cons: difficulty of converting the speaker individuality included in the CPPs.
     Proposed:
     - Sequence-to-sequence (Seq2Seq) conversion from source CPPs to target CPPs.
     - Joint training of recognition and synthesis to increase conversion performance.
     Results:
     - Seq2Seq learning achieved variable-length voice conversion.
     - Joint training improved speaker similarity and quality of converted speech.
  3. Conventional Voice Conversion Algorithm: Copied Context Posterior Probability [Sun et al., 2016]
     Training (two LSTMs, trained separately):
     1. Recognition $R(\cdot)$: source speech feats. $x$ → source CPP $\hat{p}_x$, trained with the softmax cross entropy $L_C(\hat{p}_x, l_x)$ against the context labels $l_x$ (recognition error).
     2. Synthesis $G(\cdot)$: target CPP $\hat{p}_y$ → target speech feats. $y$, trained with the mean squared error $L_G(G(\hat{p}_y), y)$ (synthesis error).
     [Figure: recognition and synthesis LSTMs unrolled over time; the CPPs are per-frame posteriors over contexts such as /a/, /i/, /u/. The two errors are minimized by separated training.]
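     As a concrete illustration of the separated training, here is a minimal PyTorch sketch. This is hypothetical code, not the authors' implementation: the feature dimension, batch shapes, and all variable names are assumptions (the 224 context classes and 256-unit bidirectional LSTMs follow the setup on slide 12).

     import torch
     import torch.nn as nn

     FEAT_DIM = 50   # speech feature dimension (assumed; e.g., mel-cepstra + deltas)
     CPP_DIM = 224   # number of context classes (from the setup slide)
     HIDDEN = 256

     class Recognizer(nn.Module):
         """R(.): speech features -> per-frame context posterior logits."""
         def __init__(self):
             super().__init__()
             self.lstm = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True, bidirectional=True)
             self.out = nn.Linear(2 * HIDDEN, CPP_DIM)

         def forward(self, x):          # x: (batch, frames, FEAT_DIM)
             h, _ = self.lstm(x)
             return self.out(h)         # (batch, frames, CPP_DIM) logits

     class Synthesizer(nn.Module):
         """G(.): CPPs -> speech features."""
         def __init__(self):
             super().__init__()
             self.lstm = nn.LSTM(CPP_DIM, HIDDEN, batch_first=True, bidirectional=True)
             self.out = nn.Linear(2 * HIDDEN, FEAT_DIM)

         def forward(self, p):          # p: (batch, frames, CPP_DIM)
             h, _ = self.lstm(p)
             return self.out(h)         # (batch, frames, FEAT_DIM)

     R, G = Recognizer(), Synthesizer()
     ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

     # Random stand-ins for real data (shapes only).
     x = torch.randn(1, 100, FEAT_DIM)            # source speech feats.
     l_x = torch.randint(0, CPP_DIM, (1, 100))    # source context labels
     y = torch.randn(1, 120, FEAT_DIM)            # target speech feats.
     p_y = torch.softmax(torch.randn(1, 120, CPP_DIM), -1)  # target CPP (from a trained R)

     # Separated training: the two losses never share gradients.
     loss_R = ce(R(x).transpose(1, 2), l_x)   # L_C(p_hat_x, l_x), softmax cross entropy
     loss_G = mse(G(p_y), y)                  # L_G(G(p_hat_y), y), mean squared error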
  4. Conventional Voice Conversion Algorithm: Copied Context Posterior Probability [Sun et al., 2016]
     Conversion (conventional):
     1. Recognition: the source CPP $\hat{p}_x = R(x)$ is obtained from the source speech feats.
     2. Synthesis: the source CPP is COPIED as the target CPP, giving the predicted speech feats. $\hat{y} = G(\hat{p}_x)$.
     [Figure: the recognition LSTM feeds the synthesis LSTM; the source CPP is copied unchanged between them.]
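     Conversion then reuses the same hypothetical models from the sketch above: the recognized source CPP is fed, unchanged, into the synthesizer trained on the target speaker.

     # Conventional conversion: copy the source CPP into the target-trained G(.).
     with torch.no_grad():
         p_x = torch.softmax(R(x), dim=-1)   # 1. recognition: source CPP p_hat_x
         y_hat = G(p_x)                      # 2. synthesis: y_hat = G(p_hat_x), CPP copied as-is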
  5. Issues of Conventional Voice Conversion
     1. CPPs' shapes and lengths are significantly different between speakers: the shapes differ, and the lengths of the individual phonemes differ.
     2. Improving recognition accuracy ≠ improving synthesis accuracy: the conventional method trains speech recognition and synthesis separately.
  6. Proposed Algorithms
     1. Sequence-to-sequence conversion from the source CPP to the target CPP.
     2. Joint training of recognition and synthesis (like auto-encoding).
  7. Sequence-to-Sequence Learning [Sutskever et al., 2014]
     - Sequence-to-sequence (Seq2Seq) learning performs variable-length conversion; e.g., in Japanese-to-English translation, the input sequence "雨 が 降る" (encoder) maps to the output sequence "It rains" (decoder).
     - Problems of Seq2Seq conversion of CPPs: determining the duration is difficult, and conversion failures propagate when the number of frames to be generated is large [Weng et al., 2016].
     - Constraints adopted here: the phoneme duration is given, and conversion is done phoneme by phoneme (see the sketch below).
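     Below is a minimal sketch of an encoder-decoder over one phoneme's CPP frames under these constraints. The bridging layers, the all-zero start frame, and the unrolling scheme are assumptions for illustration, not the paper's exact architecture; the bidirectional-LSTM encoder and LSTM decoder follow the setup on slide 12.

     import torch
     import torch.nn as nn

     CPP_DIM, HIDDEN = 224, 256

     class Seq2SeqCPP(nn.Module):
         """C(.): encode one phoneme's source-CPP frames, decode the target's."""
         def __init__(self):
             super().__init__()
             self.encoder = nn.LSTM(CPP_DIM, HIDDEN, batch_first=True, bidirectional=True)
             self.decoder = nn.LSTM(CPP_DIM, HIDDEN, batch_first=True)
             self.bridge_h = nn.Linear(2 * HIDDEN, HIDDEN)  # fold both directions into the decoder state
             self.bridge_c = nn.Linear(2 * HIDDEN, HIDDEN)
             self.out = nn.Linear(HIDDEN, CPP_DIM)

         def forward(self, p_src, n_frames):
             # p_src: (batch, src_frames, CPP_DIM), CPPs of one source phoneme.
             _, (h, c) = self.encoder(p_src)
             h = torch.tanh(self.bridge_h(torch.cat([h[0], h[1]], -1))).unsqueeze(0)
             c = torch.tanh(self.bridge_c(torch.cat([c[0], c[1]], -1))).unsqueeze(0)
             frame = torch.zeros(p_src.size(0), 1, CPP_DIM)  # start-of-phoneme frame
             outputs = []
             for _ in range(n_frames):   # the target duration is given, so no stop token
                 o, (h, c) = self.decoder(frame, (h, c))
                 frame = torch.softmax(self.out(o), dim=-1)  # next target-CPP frame
                 outputs.append(frame)
             return torch.cat(outputs, dim=1)  # (batch, n_frames, CPP_DIM)

     C = Seq2SeqCPP()
     p_src = torch.softmax(torch.randn(1, 12, CPP_DIM), -1)
     p_conv = C(p_src, n_frames=15)   # the target phoneme may have a different length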
  8. Sequence-to-Sequence Conversion of CPPs
     1. Recognition $R(\cdot)$: source speech feats. $x$ → source CPP $\hat{p}_x$.
     Conversion $C(\cdot)$: Seq2Seq conversion of the source CPP into the target CPP, $C(\hat{p}_x)$.
     2. Synthesis $G(\cdot)$: predicted target speech feats. $\hat{y} = G(C(\hat{p}_x))$.
     Loss function: $L_G(C(\hat{p}_x), \hat{p}_y) + L_C(C(\hat{p}_x), l_y)$. The first term (mean squared error) minimizes the conversion error; the second term (softmax cross entropy between the predicted CPPs and the target labels) alleviates the recognition error included in the target CPPs.
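     Continuing the sketches above, the combined loss might be computed as follows. Since the hypothetical converter already outputs probabilities rather than logits, the cross-entropy term is written as negative log-likelihood over their log, which is the same quantity as softmax cross entropy over logits.

     mse, nll = nn.MSELoss(), nn.NLLLoss()

     p_x = torch.softmax(torch.randn(1, 12, CPP_DIM), -1)   # source CPP p_hat_x
     p_y = torch.softmax(torch.randn(1, 15, CPP_DIM), -1)   # target CPP p_hat_y
     l_y = torch.randint(0, CPP_DIM, (1, 15))               # target context labels

     p_conv = C(p_x, n_frames=p_y.size(1))                  # C(p_hat_x)
     # L_G(C(p_hat_x), p_hat_y): minimizes the conversion error.
     # L_C(C(p_hat_x), l_y): alleviates the recognition error in the target CPPs.
     loss = mse(p_conv, p_y) + nll(torch.log(p_conv + 1e-8).transpose(1, 2), l_y)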
  9. Effect of the Proposed Algorithm: Variable-Length Voice Conversion
     [Figure: source CPP, target CPP, and the CPP after Seq2Seq conversion, plotted per frame over time with values in [0, 1].] Variable-length conversion of CPPs is achieved!
  10. Joint Training of Speech Recognition and Synthesis
      Training (auto-encoding of the source speech):
      1. Recognition $R(\cdot)$: source speech feats. $x$ → source CPP $\hat{p}_x$.
      2. Synthesis $G(\cdot)$: $\hat{p}_x$ → reconstructed source speech feats. $G(\hat{p}_x)$.
      Joint loss: $L_C(R(x), l_x) + L_G(G(\hat{p}_x), x)$, i.e., the conventional recognition term plus a synthesis error computed on the predicted CPP.
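     In code, joint training sums the two terms and backpropagates once, so gradients from the synthesis error also reach $R(\cdot)$. A sketch, reusing the hypothetical Recognizer/Synthesizer and data stand-ins from the slide-3 snippet:

     # Joint loss: L_C(R(x), l_x) + L_G(G(p_hat_x), x), auto-encoding the source.
     logits = R(x)                         # per-frame context logits
     p_x = torch.softmax(logits, dim=-1)   # predicted CPP p_hat_x
     loss = ce(logits.transpose(1, 2), l_x) + mse(G(p_x), x)
     loss.backward()                       # gradients reach both R(.) and G(.)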
  11. Experimental Evaluations
  12. Experimental Setup
      - Dataset: ATR Japanese speech database (503 phonetically balanced sentences)
      - Training / test: 450 sentences / 53 sentences (16 kHz sampling)
      - Linguistic feats.: 224-dimensional vectors (phonemes)
      - Speech feats.: mel-cepstrum (1st through 24th) + delta
      - Optimization algorithm: AdaGrad (learning rate = 0.01) [Duchi et al., 2011]
      - Recognition / synthesis model: bidirectional LSTM (256 units)
      - Encoder / decoder: bidirectional LSTM / LSTM (256 units each)
      - Number of speakers: 8, including the source and target speakers
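     For reference, the optimizer row maps onto a standard toolkit call; a sketch using the models from the earlier snippets (grouping all parameters into one optimizer is an assumption):

     import torch.optim as optim

     # AdaGrad with learning rate 0.01, as in the setup table.
     params = list(R.parameters()) + list(G.parameters()) + list(C.parameters())
     optimizer = optim.Adagrad(params, lr=0.01)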
  13. Objective and Subjective Evaluations of Seq2Seq Learning
      [Figures: objective and subjective evaluation results, annotated "Better!", "Better!", and "Worse"; error bars denote 95% confidence intervals; source and target speakers shown for reference.] Voice samples are available online.
  14. Objective Evaluation of Joint Training
      [Figure: mel-cepstral distortion in the auto-encoding case, which calculates the reconstruction error after recognition and synthesis.] Joint training got the better score on mel-cepstral distortion!
  15. Subjective Evaluation of Joint Training
      [Figure: subjective evaluation scores.] Joint training made both speaker similarity and speech quality better!
  16. Conclusion
      Issues:
      - Difficulty of converting the speaker individuality included in CPPs.
      - Improving recognition accuracy ≠ improving synthesis accuracy.
      Proposed:
      - Sequence-to-sequence (Seq2Seq) conversion from source CPPs to target CPPs.
      - Joint training of recognition and synthesis.
      Results:
      - Seq2Seq learning achieved variable-length voice conversion.
      - Joint training improved speaker similarity and quality of converted speech.