Modern Text-to-Speech
Systems Review
Lenar Gabdrakhmanov
ML Engineer at Provectus
Table of contents
1. Introduction to TTS
2. Synthesized speech quality evaluation
3. WaveNet (2016-09-12)
4. Fast WaveNet (2016-11-29)
5. Deep Voice (2017-02-25)
6. Tacotron (2017-03-29)
7. Deep Voice 2 (2017-05-24)
8. TTS + ASR = Speech Chain (2017-07-16)
9. Deep Voice 3 (2017-10-20)
10. Tacotron 2 (2017-12-16)
Table of contents
11. GST-Tacotron (2018-03-23)
12. Transfer Learning Tacotron (2018-06-12)
13. Transformer TTS (2018-09-19)
14. RUSLAN: Russian Spoken Language Corpus For Speech Synthesis
15. References
Introduction to Text-to-Speech
Two traditional speech synthesis techniques:
1. Concatenative (unit selection);
2. Statistical parametric.
Introduction to Text-to-Speech
Introduction to Text-to-Speech
Synthesized speech quality evaluation
● Mean Opinion Score (MOS) is the most frequently used subjective measure of speech quality (a small computation sketch follows this list);
● Speech quality is rated on a 5-point scale:
○ 1 - bad quality;
○ 5 - excellent quality.
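Below is a minimal illustrative sketch (in Python) of how an MOS and its 95% confidence interval are typically computed from listener ratings; the ratings themselves are made up for the example.

    import numpy as np

    # Made-up listener ratings on the 5-point scale (1 = bad, 5 = excellent),
    # purely to illustrate how an MOS and its confidence interval are computed.
    ratings = np.array([4, 5, 4, 3, 5, 4, 4, 5, 3, 4], dtype=float)

    mos = ratings.mean()
    ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))  # normal approximation

    print(f"MOS: {mos:.2f} +/- {ci95:.2f}")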
WaveNet: A Generative Model For Raw Audio [1]
● Not truly end-to-end system yet;
● Generates raw audio waveforms (≥ 16k samples per second!);
● Fully convolutional neural network built from dilated causal convolutions (see the sketch after this list);
● Every new sample is conditioned on all previous samples;
● A softmax layer models the conditional distribution over each individual audio sample;
● Extra linguistic features (e.g. phone identities, syllable stress, fundamental frequency F0) are required;
● Can be extended to multi-speaker TTS;
● Achieves a 4.21 ± 0.081 MOS on US English, a 4.08 ± 0.085 on Chinese;
● Computationally expensive.
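The sketch below illustrates the core idea in PyTorch: a stack of dilated causal 1-D convolutions followed by a softmax over 256 quantized sample values, as in the paper's mu-law setup. The channel count, number of layers, and the class name TinyWaveNet are illustrative assumptions, and the full model's gated activations, residual and skip connections are omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyWaveNet(nn.Module):
        """Toy dilated causal conv stack; not the full gated/residual WaveNet."""
        def __init__(self, channels=32, n_classes=256, dilations=(1, 2, 4, 8)):
            super().__init__()
            self.input_conv = nn.Conv1d(n_classes, channels, kernel_size=1)
            self.layers = nn.ModuleList([
                nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
                for d in dilations
            ])
            self.output_conv = nn.Conv1d(channels, n_classes, kernel_size=1)

        def forward(self, x_onehot):
            # x_onehot: (batch, 256, time) one-hot mu-law quantized samples.
            h = self.input_conv(x_onehot)
            for conv in self.layers:
                # Left-pad so the convolution is causal: output at t sees only <= t.
                pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
                h = F.relu(conv(F.pad(h, (pad, 0))))
            return self.output_conv(h)  # logits over 256 classes per time step

    x = F.one_hot(torch.randint(0, 256, (1, 1000)), 256).float().transpose(1, 2)
    logits = TinyWaveNet()(x)
    probs = F.softmax(logits, dim=1)  # conditional distribution for each sample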
WaveNet: A Generative Model For Raw Audio
WaveNet: A Generative Model For Raw Audio
WaveNet: A Generative Model For Raw Audio
Samples:
1. “The Blue Lagoon is a 1980 American romance and adventure film directed
by Randal Kleiser” (parametric);
2. <same> (concatenative);
3. <same> (WaveNet).
Fast Wavenet Generation Algorithm [2]
Fast Wavenet Generation Algorithm
Deep Voice: Real-time Neural Text-to-Speech [3]
● Not truly end-to-end system yet: “Deep Voice lays the groundwork for truly
end-to-end neural speech synthesis”;
● Consists of five blocks (a simplified dataflow sketch follows this list):
○ grapheme-to-phoneme conversion model (encoder - Bi-GRU with 1024 units x 3, decoder -
GRU with 1024 units x 3);
○ segmentation model for locating phoneme boundaries (Convs + GRU + Convs);
○ phoneme duration prediction model (FC-256 x 2, GRU with 128 units x 2, FC);
○ fundamental frequency prediction model (joint model with above);
○ audio synthesis model (variant of WaveNet).
● Faster-than-real-time inference (up to 400x faster on both CPU and GPU compared to Fast WaveNet [2]);
● Achieves a 2.67 ± 0.37 MOS on US English.
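A rough, assumption-laden sketch of the inference-time dataflow through the blocks above (the segmentation model is used only during training, for alignment). Every function body here is a trivial placeholder, not an actual Deep Voice component.

    import numpy as np

    # Illustrative inference dataflow only; the bodies are trivial stand-ins.
    def grapheme_to_phoneme(text):
        return text.split()                          # stand-in for the seq2seq G2P model

    def predict_durations_and_f0(phonemes):
        return [(p, 0.1, 120.0) for p in phonemes]   # (phoneme, duration in s, F0 in Hz)

    def synthesize_audio(annotated, sample_rate=16000):
        total = sum(duration for _, duration, _ in annotated)
        return np.zeros(int(total * sample_rate))    # stand-in for the WaveNet vocoder

    audio = synthesize_audio(predict_durations_and_f0(grapheme_to_phoneme("hello world")))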
Deep Voice: Real-time Neural Text-to-Speech
Deep Voice: Real-time Neural Text-to-Speech
Tacotron: Towards End-to-End Speech Synthesis [4]
● Fully end-to-end: given <text, audio> pairs, the model can be trained completely from scratch with random initialization;
● Predicts linear- and mel-scale spectrograms (target extraction is sketched after this list);
● Achieves a 3.82 MOS on US English.
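As a hedged illustration of the spectrogram targets, the snippet below extracts linear- and mel-scale magnitude spectrograms with librosa from a synthetic signal; the frame parameters (n_fft=1024, hop_length=256, 80 mel bands) are common choices, not necessarily Tacotron's exact configuration.

    import numpy as np
    import librosa

    # Synthetic one-second tone as a stand-in for a recorded utterance.
    sr = 22050
    t = np.linspace(0, 1.0, sr, endpoint=False)
    y = 0.5 * np.sin(2 * np.pi * 220.0 * t)

    linear = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))          # linear-scale
    mel = librosa.feature.melspectrogram(S=linear**2, sr=sr, n_mels=80)   # mel-scale

    print(linear.shape, mel.shape)  # (513, frames), (80, frames)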
Tacotron: Towards End-to-End Speech Synthesis
Tacotron: Towards End-to-End Speech Synthesis
Samples:
1. “Generative adversarial network or variational auto-encoder.”;
2. “He has read the whole thing.”;
3. “He reads books.”;
4. “The quick brown fox jumps over the lazy dog.”;
5. “Does the quick brown fox jump over the lazy dog?”.
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
[5]
● Improved architecture based on Deep Voice [3];
● Can learn hundreds of unique voices from less than half an hour of data per speaker;
● Voice model based on the WaveNet [1] architecture, conditioned on learned per-speaker embeddings (see the sketch after this list);
● Achieves a 3.53 ± 0.12 MOS.
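A minimal sketch of the trainable per-speaker embedding idea used for multi-speaker synthesis: each speaker ID looks up a learned vector that conditions the model. The table size and embedding dimension below are arbitrary illustrative choices, not Deep Voice 2's actual hyperparameters.

    import torch
    import torch.nn as nn

    # Each speaker ID indexes a vector that is trained jointly with the model.
    num_speakers, spk_dim = 100, 16
    speaker_table = nn.Embedding(num_speakers, spk_dim)

    speaker_ids = torch.tensor([0, 3, 42])            # one ID per utterance in the batch
    speaker_vectors = speaker_table(speaker_ids)      # (3, 16) conditioning vectors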
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Samples:
1. “About half the people who are infected also lose weight.”;
2. <same>;
3. <same>.
Listening while Speaking: Speech Chain by Deep
Learning [6]
● Two parts: TTS model and ASR model;
● Single- and multi-speaker;
● Joint training:
○ Supervised step on paired <text, speech> data;
○ Unsupervised step on unpaired text and speech.
● Both the TTS and ASR models are Tacotron-like [4] (a toy training-loop sketch follows this list).
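A toy sketch of the speech-chain training pattern under heavy simplifying assumptions: "text" and "speech" are random fixed-size vectors and the TTS/ASR models are single linear layers; only the combination of a supervised step with unsupervised reconstruction cycles reflects the paper.

    import torch
    import torch.nn as nn

    TEXT_DIM, SPEECH_DIM = 16, 32
    tts = nn.Linear(TEXT_DIM, SPEECH_DIM)   # stand-in text -> speech model
    asr = nn.Linear(SPEECH_DIM, TEXT_DIM)   # stand-in speech -> text model
    opt = torch.optim.Adam(list(tts.parameters()) + list(asr.parameters()), lr=1e-3)
    mse = nn.MSELoss()

    paired_text, paired_speech = torch.randn(8, TEXT_DIM), torch.randn(8, SPEECH_DIM)
    unpaired_text = torch.randn(8, TEXT_DIM)
    unpaired_speech = torch.randn(8, SPEECH_DIM)

    for _ in range(100):
        # Supervised step: both models learn from <text, speech> pairs.
        loss = mse(tts(paired_text), paired_speech) + mse(asr(paired_speech), paired_text)
        # Unsupervised cycles: text -> TTS -> ASR -> text, speech -> ASR -> TTS -> speech.
        loss = loss + mse(asr(tts(unpaired_text)), unpaired_text)
        loss = loss + mse(tts(asr(unpaired_speech)), unpaired_speech)
        opt.zero_grad(); loss.backward(); opt.step()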
Listening while Speaking: Speech Chain by Deep
Learning [6]
Deep Voice 3: Scaling Text-to-Speech with
Convolutional Sequence Learning [7]
● Fully-convolutional sequence-to-sequence attention-based model;
● Converts input text to spectrograms (or other acoustic parameters);
● Suitable for both single- and multi-speaker TTS;
● Needs 10x less training time and converges after 500K iterations (compared to Tacotron [4], which converges after 2M iterations);
● Novel attention mechanism to encourage monotonic alignment (see the sketch after this list);
● MOS: 3.78 ± 0.30 (with WaveNet), the same score for Tacotron [4] (with WaveNet), 2.74 ± 0.35 for Deep Voice 2 [5] (with WaveNet).
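A hedged sketch of one way to force a monotonic alignment at inference time: mask the attention scores so that only a small window ahead of the previously attended encoder position is allowed. This is a simplification for illustration, not the paper's exact mechanism (which also relies on positional encodings).

    import torch
    import torch.nn.functional as F

    def windowed_monotonic_attention(scores, prev_pos, window=3):
        # Allow only encoder positions in [prev_pos, prev_pos + window), then renormalize.
        mask = torch.full_like(scores, float("-inf"))
        mask[prev_pos:prev_pos + window] = 0.0
        weights = F.softmax(scores + mask, dim=0)
        return weights, int(weights.argmax())

    scores = torch.randn(50)  # made-up attention scores for one decoder step
    weights, new_pos = windowed_monotonic_attention(scores, prev_pos=10)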
Deep Voice 3: Scaling Text-to-Speech with
Convolutional Sequence Learning [7]
Deep Voice 3: Scaling Text-to-Speech with
Convolutional Sequence Learning [7]
Deep Voice 3: Scaling Text-to-Speech with
Convolutional Sequence Learning [7]
Deep Voice 3: Scaling Text-to-Speech with
Convolutional Sequence Learning [7]
Samples:
1. … (trained for single speaker - 20 hours total);
2. … (trained for 108 speakers - 44 hours total);
3. <same>;
4. … (trained for 2484 speakers - 820 hours of ASR data total);
5. <same>.
Natural TTS Synthesis by Conditioning WaveNet on
Mel Spectrogram Predictions [8]
● Uses simpler building blocks, in contrast to original Tacotron [4];
● Maps input characters to mel-scale spectrogram;
● Modified WaveNet [1] synthesizes audio waveforms from spectrograms
directly (no need for linguistic, phoneme duration and other features);
● However, this WaveNet is pretrained separately;
● Achieves a 4.526 ± 0.066 MOS on US English.
Natural TTS Synthesis by Conditioning WaveNet on
Mel Spectrogram Predictions [8]
Natural TTS Synthesis by Conditioning WaveNet on
Mel Spectrogram Predictions [8]
Samples:
1. “Generative adversarial network or variational auto-encoder.”;
2. “Don't desert me here in the desert!”;
3. “He thought it was time to present the present.”;
4. “The buses aren't the problem, they actually provide a solution.”;
5. “The buses aren't the PROBLEM, they actually provide a SOLUTION.”.
Style Tokens: Unsupervised Style Modeling, Control
and Transfer in End-to-End Speech Synthesis [9]
● Based on Tacotron [4] with slight changes;
● Learns embeddings for 10 style tokens;
● Reference encoder: stack of 2-D convolutions followed by a GRU with 128 units;
● Style token layer: attention over 256-D token embeddings (see the sketch after this list).
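A simplified single-head sketch of the style token layer: a reference embedding (assumed here to be already projected to the token dimension) attends over a bank of 10 learned 256-D tokens, and the attention-weighted sum becomes the style embedding. The paper uses multi-head attention; this only illustrates the idea.

    import torch
    import torch.nn.functional as F

    num_tokens, dim = 10, 256
    style_tokens = torch.nn.Parameter(torch.randn(num_tokens, dim))  # learned token bank
    reference_embedding = torch.randn(1, dim)     # output of the reference encoder (assumed 256-D)

    attn = F.softmax(reference_embedding @ style_tokens.t() / dim ** 0.5, dim=-1)
    style_embedding = attn @ style_tokens         # (1, 256) conditioning vector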
Style Tokens: Unsupervised Style Modeling, Control
and Transfer in End-to-End Speech Synthesis [9]
Style Tokens: Unsupervised Style Modeling, Control
and Transfer in End-to-End Speech Synthesis [9]
Samples:
1. “United Airlines five six three from Los Angeles to New Orleans has Landed”;
2. <same>;
3. <same>;
4. <same>;
5. <same>.
Transfer Learning from Speaker Verification to
Multispeaker Text-To-Speech Synthesis [10]
● Consists of several independently trained components:
○ Speaker encoder network that produces speaker embedding vectors: 3-layer LSTM with 768 units per layer, each followed by a 256-D projection;
○ Synthesis network - Tacotron 2 [8], conditioned on the speaker embedding (see the sketch after this list).
● MOS: 4.22 ± 0.06 and 3.28 ± 0.07 for seen and unseen speakers
respectively.
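A minimal sketch of how the fixed-size speaker embedding can condition the synthesizer: broadcast it along the text axis and concatenate it with the encoder outputs before attention and decoding. All tensor sizes below are illustrative.

    import torch

    batch, text_len, enc_dim, spk_dim = 2, 40, 512, 256
    encoder_outputs = torch.randn(batch, text_len, enc_dim)
    speaker_embedding = torch.randn(batch, spk_dim)        # from the speaker encoder

    conditioned = torch.cat(
        [encoder_outputs, speaker_embedding.unsqueeze(1).expand(-1, text_len, -1)],
        dim=-1,
    )  # (batch, text_len, enc_dim + spk_dim), fed to the attention/decoder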
Transfer Learning from Speaker Verification to
Multispeaker Text-To-Speech Synthesis [10]
Transfer Learning from Speaker Verification to
Multispeaker Text-To-Speech Synthesis [10]
Transfer Learning from Speaker Verification to
Multispeaker Text-To-Speech Synthesis [10]
Samples:
1. … (reference audio, seen speaker);
2. … (synthesized audio, seen speaker);
3. … (ref., unseen);
4. … (synth., unseen);
5. … (ref., different language);
6. … (synth., different language).
Close to Human Quality TTS with Transformer [11]
● Based on the state-of-the-art NMT model “Transformer”;
● Based on the Tacotron 2 [8] model, but all RNN layers are replaced with Transformer blocks;
● Using the Transformer allows encoder and decoder hidden states to be constructed in parallel (see the sketch after this list);
● Trains about 4.25 times faster than Tacotron 2;
● Achieves state-of-the-art performance: 4.39 ± 0.05 (proposed model), 4.38 ± 0.05 (Tacotron 2), 4.44 ± 0.05 (ground truth).
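A small sketch of the parallelism point: a Transformer encoder layer produces hidden states for all positions in one call, while an RNN steps through time internally. Layer sizes here are arbitrary illustrative choices, not the paper's.

    import torch
    import torch.nn as nn

    seq_len, batch, d_model = 100, 8, 256
    x = torch.randn(seq_len, batch, d_model)

    transformer_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4)
    hidden_parallel = transformer_layer(x)    # all time steps processed at once

    rnn = nn.GRU(d_model, d_model)
    hidden_sequential, _ = rnn(x)             # internally steps through time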
Close to Human Quality TTS with Transformer
Close to Human Quality TTS with Transformer
Close to Human Quality TTS with Transformer
Samples:
1. “Two to five inches of rain is possible by early Monday, resulting in some
flooding.”;
2. “Flooding is likely in some parishes of southern Louisiana.”;
3. “Defending champ tiger woods is one of eight Golfers within two strokes.”.
RUSLAN: Russian Spoken Language Corpus For
Speech Synthesis
● Authors: Rustem Garaev, Evgenii Razinkov, Lenar Gabdrakhmanov;
● Largest annotated single-speaker speech corpus in Russian: more than 31 hours of speech (<text, audio> pairs), 22,200 samples;
● Several improvements to Tacotron [4] were proposed:
○ All GRU layers replaced with layer-normalized LSTM;
○ Fast Griffin-Lim algorithm used to synthesize waveforms from spectrograms (see the sketch after this list).
● MOS: 4.05 for intelligibility and 3.78 for naturalness (vs 3.12 and 2.17 for the original model);
● Corpus site with recordings and synthesized speech: https://ruslan-corpus.github.io/
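A minimal sketch of waveform reconstruction with librosa's Griffin-Lim implementation (whose momentum parameter corresponds to the “fast” variant); the STFT parameters and the synthetic input signal are illustrative assumptions, not the exact RUSLAN configuration.

    import numpy as np
    import librosa

    # Synthetic tone as a stand-in for a predicted magnitude spectrogram's source.
    sr = 22050
    t = np.linspace(0, 1.0, sr, endpoint=False)
    y = 0.5 * np.sin(2 * np.pi * 220.0 * t)

    magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
    y_hat = librosa.griffinlim(magnitude, n_iter=60, hop_length=256, momentum=0.99)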
RUSLAN: Russian Spoken Language Corpus For
Speech Synthesis
RUSLAN: Russian Spoken Language Corpus For
Speech Synthesis
Samples:
1. “Тринадцать лет назад я взялся за перо.” (“Thirteen years ago, I took up the pen.”) (from dataset);
2. “Тема работы - разработка и реализация метода генерации русской речи на основе текста.” (“The topic of this work is the design and implementation of a method for generating Russian speech from text.”);
3. “Спасибо за внимание.” (“Thank you for your attention.”).
References
[1] Van Den Oord, Aäron, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol
Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray
Kavukcuoglu. "WaveNet: A generative model for raw audio." In SSW, p. 125.
2016.
[2] Paine, Tom Le, Pooya Khorrami, Shiyu Chang, Yang Zhang, Prajit
Ramachandran, Mark A. Hasegawa-Johnson, and Thomas S. Huang. "Fast
wavenet generation algorithm." arXiv preprint arXiv:1611.09482 (2016).
References (2)
[3] Arik, Sercan O., Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew
Gibiansky, Yongguo Kang, Xian Li et al. "Deep voice: Real-time neural
text-to-speech." arXiv preprint arXiv:1702.07825 (2017).
[4] Wang, Yuxuan, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss,
Navdeep Jaitly, Zongheng Yang et al. "Tacotron: Towards end-to-end speech
synthesis." arXiv preprint arXiv:1703.10135 (2017).
[5] Arik, Sercan, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng,
Wei Ping, Jonathan Raiman, and Yanqi Zhou. "Deep voice 2: Multi-speaker
neural text-to-speech." arXiv preprint arXiv:1705.08947 (2017).
References (3)
[6] Tjandra, Andros, Sakriani Sakti, and Satoshi Nakamura. "Listening while
speaking: Speech chain by deep learning." Automatic Speech Recognition
and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 2017.
[7] Ping, Wei, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan,
Sharan Narang, Jonathan Raiman, and John Miller. "Deep voice 3: Scaling
text-to-speech with convolutional sequence learning." (2018).
[8] Shen, Jonathan, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep
Jaitly, Zongheng Yang, Zhifeng Chen et al. "Natural tts synthesis by
conditioning wavenet on mel spectrogram predictions." In 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 4779-4783. IEEE, 2018.
References (4)
[9] Wang, Yuxuan, et al. "Style Tokens: Unsupervised Style Modeling, Control
and Transfer in End-to-End Speech Synthesis." arXiv preprint
arXiv:1803.09017 (2018).
[10] Jia, Ye, et al. "Transfer Learning from Speaker Verification to Multispeaker
Text-To-Speech Synthesis." arXiv preprint arXiv:1806.04558 (2018).
[11] Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M., & Zhou, M. (2018). Close to Human
Quality TTS with Transformer. arXiv preprint arXiv:1809.08895.
Stay in Touch with Us
https://provectus.com/
Lenar Gabdrakhmanov:
@morelen17
RUSLAN Corpus:
https://ruslan-corpus.github.io/
