Translatotron: Jia, Ye, et al. "Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model." Proc. Interspeech 2019 (2019): 1123-1127. Review by June-Woo Kim
Direct speech translation with seq2seq model
1. Direct speech-to-speech translation with a sequence-to-sequence model
Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson,
Zhifeng Chen, Yonghui Wu
Google Research.
arXiv date: 12-April-2019
Presented by: June-Woo Kim
Artificial Brain Research Lab., School of Sensor and Display,
Kyungpook National University
05-June-2019
2. Table of Contents
• Overview of the paper
• The Proposed Model
• Experiments and Results
• Conclusion
• Major Take-away
3. Introduction & Main goal of the Paper
• The new system could soon greatly improve foreign-language interactions.
• Current translators break the translation process down into three steps, based on converting the speech to text (see the sketch after this list):
– Speech Recognition (ASR): used to convert the source speech into text.
– Machine Translation (MT): used to translate the recognized text into the target language.
– Text-to-Speech Synthesis (TTS): used to produce speech in the target language from the translated text.
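To make the contrast concrete, here is a tiny Python sketch of how a cascade composes three systems via text, versus a direct model that maps spectrograms to spectrograms. The asr/mt/tts stand-ins are hypothetical stubs for illustration only, not real systems.

# Hypothetical stand-ins for the three cascade components (illustration only).
asr = lambda audio: "¿cómo estás?"      # source speech -> source-language text
mt = lambda text: "how are you?"        # source text -> target-language text
tts = lambda text: [0.0] * 24000        # target text -> waveform samples

def cascade_s2st(source_audio):
    # Conventional pipeline: three separately trained systems chained via text.
    return tts(mt(asr(source_audio)))

def direct_s2st(source_spectrogram, model):
    # Translatotron-style: a single seq2seq model, no intermediate text.
    return model(source_spectrogram)    # target-language spectrogram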
4. Introduction & Main goal of the Paper
• Google's new system, however, unlike cascaded systems, does not rely on an intermediate text representation in either language.
• The new system, called Translatotron, uses machine learning to bypass the text representation steps, converting spectrograms of speech in one language into spectrograms in another language.
– An attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language without relying on an intermediate text representation.
– The network is trained end-to-end, learning to map source speech spectrograms into target spectrograms in another language.
– Experiments are conducted on two different Spanish-to-English datasets.
– The proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model.
5. Introduction & Main goal of the Paper
• Although it is in its early stages, the system can reproduce some aspects of the original speaker's voice and tone.
6. The Proposed Model
• Primary task
– Attention-based Seq2Seq model.
• Speaker Encoder
– Pretrained speaker D-Vector.
• Vocoder
– Converts target spectrograms to time-domain waveforms.
• Auxiliary tasks (Secondary tasks)
– Predict source and target phoneme sequences.
7. Primary task: Attention-based Seq2Seq network (Encoder)
• The sequence-to-sequence encoder stack maps 80-channel log-mel spectrogram input features into hidden states, which are passed through an attention-based alignment mechanism to condition an autoregressive decoder.
– 8 BLSTM layers.
– The final layer's output is passed to the primary decoder, whereas intermediate activations are passed to auxiliary decoders predicting phoneme sequences.
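A minimal PyTorch sketch of this kind of encoder stack, exposing intermediate activations for the auxiliary decoders. The BLSTMEncoder class, hidden sizes, and shapes are my own illustrative assumptions; the paper's model is implemented in the Lingvo framework.

import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    """Stack of bidirectional LSTM layers that also exposes intermediate activations."""
    def __init__(self, input_dim=80 * 3, hidden_dim=256, num_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.LSTM(input_dim if i == 0 else 2 * hidden_dim,
                     hidden_dim, batch_first=True, bidirectional=True)
             for i in range(num_layers)]
        )

    def forward(self, x):
        intermediates = []           # per-layer activations, usable by auxiliary decoders
        for lstm in self.layers:
            x, _ = lstm(x)
            intermediates.append(x)
        return x, intermediates      # final output goes to the primary decoder

# Example: batch of 4 utterances, 100 frames of stacked 80-channel log-mel features.
enc = BLSTMEncoder()
out, inters = enc(torch.randn(4, 100, 240))
print(out.shape, len(inters))        # torch.Size([4, 100, 512]) 8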
8. Primary task: Attention-based Seq2Seq network (Decoder)
• It is similar to the Tacotron 2 TTS model, including pre-net, autoregressive LSTM stack, and post-net components.
– Pre-net: a 2-layer fully connected network.
– Autoregressive LSTM stack: a bidirectional LSTM in the encoder, 2 unidirectional LSTM layers in the decoder.
– Post-net: a 5-layer CNN with residual connections, which refines the mel spectrogram.
• Multi-head attention with 4 heads is used instead of location-sensitive attention (sketched below).
– Location-sensitive attention: connects the encoder and decoder.
• Using 4 or 6 decoder LSTM layers leads to good performance.
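A rough PyTorch sketch of one autoregressive decoder step using 4-head attention over the encoder states. The pre-net/LSTM sizes are assumptions, the post-net is omitted, and this is not the exact Translatotron code.

import torch
import torch.nn as nn

d_model = 512
prenet = nn.Sequential(nn.Linear(1025, 256), nn.ReLU(), nn.Linear(256, d_model), nn.ReLU())
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
decoder_lstm = nn.LSTM(2 * d_model, d_model, num_layers=4, batch_first=True)
frame_proj = nn.Linear(d_model, 1025 * 2)                 # reduction factor 2: two frames per step

encoder_out = torch.randn(1, 100, d_model)                # encoder hidden states
prev_frame = torch.randn(1, 1, 1025)                      # previously predicted spectrogram frame

query = prenet(prev_frame)                                # (1, 1, 512)
context, _ = attention(query, encoder_out, encoder_out)   # attend over the encoder states
lstm_out, _ = decoder_lstm(torch.cat([query, context], dim=-1))
next_frames = frame_proj(lstm_out).view(1, 2, 1025)       # two 1025-dim spectrogram frames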
9. What is Tacotron 2?
• Input: a character sequence, e.g., the Korean greeting "ㅇ ㅏ ㄴ ㄴ ㅕ ㅇ ㅎ ㅏ ㅅ ㅔ 요" ("hello") → Character Embedding
• Encoder
– 3 convolution layers → bidirectional LSTM (512 units)
• Attention unit: location-sensitive attention connecting the encoder and decoder
• Decoder
– LSTM layers (2 unidirectional layers with 1024 units) → Linear Transform → Predicted Spectrogram Frame
• PostNet (5 convolutional layers) → Enhanced Prediction
• WaveNet Vocoder
– Maps the predicted 80-dimensional mel spectrogram (12.5 ms frames) to a 24 kHz waveform
• Final output: a wav file
10. Vocoder: Converts target spectrograms to time-domain waveforms
• They use the Griffin-Lim vocoder.
– However, they use a WaveRNN neural vocoder when evaluating speech naturalness in listening tests (MOS tests).
• Using a reduction factor r
– Multiple frames are generated at the same decoder time step, controlled by the reduction factor r.
– In this paper, the reduction factor is set to 2.
– That is, the decoder generates a log spectrogram corresponding to two frames in one time step.
– This exploits the continuity of the speech signal.
– The assumption holds because a single speech sound typically spans several consecutive frames, so adjacent frames carry largely overlapping information.
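A small sketch of what the reduction factor means for output shapes: with r = 2 the decoder emits r times as many values per step, which are then reshaped into consecutive frames. The frame counts below are arbitrary examples.

import torch

r, n_freq, steps = 2, 1025, 50
decoder_output = torch.randn(1, steps, r * n_freq)           # one utterance, 50 decoder steps
spectrogram = decoder_output.reshape(1, steps * r, n_freq)    # 100 spectrogram frames
print(spectrogram.shape)                                      # torch.Size([1, 100, 1025])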
11. Speaker Encoder: D-vector (Speaker Independent System’s features)
• Pretrained on a speaker verification task; its output is a d-vector.
– 851K speakers across 8 languages.
– Not updated during the training of Translatotron.
– This model computes a 256-dim speaker embedding from a reference utterance of the speaker, which is passed into a linear projection layer to reduce the dimensionality to 16.
• The speaker encoder output is concatenated to the output of the last BLSTM layer.
• This component is only used for the voice transfer task.
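A minimal sketch of the speaker conditioning described above, following the dimensions on this slide (256-dim d-vector projected to 16 dims); the 512-dim encoder output is an assumption.

import torch
import torch.nn as nn

proj = nn.Linear(256, 16)                        # 256-dim d-vector -> 16-dim projection

d_vector = torch.randn(1, 256)                   # from the frozen, pretrained speaker encoder
encoder_out = torch.randn(1, 100, 512)           # output of the last BLSTM encoder layer

spk = proj(d_vector).unsqueeze(1).expand(-1, encoder_out.size(1), -1)  # (1, 100, 16)
conditioned = torch.cat([encoder_out, spk], dim=-1)                    # (1, 100, 528)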
12. Auxiliary tasks: Two other networks branching from the encoder
• Two optional auxiliary decoders, each with its own attention component, predict source and target phoneme sequences.
– 2-layer LSTMs with single-head additive attention.
• One decoder predicts source phonemes and the other predicts target phonemes.
– In the Conversational experiment, both auxiliary decoders take the output of the 8th BLSTM layer as input.
– In the Fisher experiment, however, the source decoder takes the output of the 4th BLSTM layer and the target decoder takes the output of the 6th BLSTM layer.
• Used only during training; not run at test time.
• Multitask training (using 3 losses to obtain better BLEU scores).
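A tiny sketch of the multitask objective: the spectrogram loss plus the two auxiliary phoneme losses, combined during training only. The loss values and weights are placeholders; the slide does not state the exact weighting.

import torch

spec_loss = torch.tensor(1.3)         # e.g. regression loss on predicted target spectrograms
src_phoneme_loss = torch.tensor(0.7)  # cross-entropy of the source phoneme decoder
tgt_phoneme_loss = torch.tensor(0.9)  # cross-entropy of the target phoneme decoder

w_src, w_tgt = 1.0, 1.0               # example weights, not the paper's values
total_loss = spec_loss + w_src * src_phoneme_loss + w_tgt * tgt_phoneme_loss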
14. Experiment
• 1. Conversational Spanish-to-English
• 2. Fisher Spanish-to-English
– Spanish corpus of telephone conversations and corresponding English translations
• 3. MOS
• 4. Voice Cloning
• To evaluate speech-to-speech translation performance, they compute BLEU scores as an objective measure of speech intelligibility and translation quality, by using a pretrained ASR system to recognize the generated speech and comparing the resulting transcripts to ground-truth reference translations.
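A short sketch of this evaluation protocol using the sacrebleu package. The transcripts below are hard-coded stand-ins; in practice they would be ASR output on the generated audio.

import sacrebleu

# Transcribe the generated target speech with a pretrained ASR system (placeholders here),
# then score the transcripts against the reference translations.
hypotheses = ["how are you doing today", "i will see you tomorrow"]
references = [["how are you doing today", "i will see you tomorrow"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)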
15. Experiments 1 – Conversational Spanish-to-English
• Datasets (979k parallel utterance pairs)
– Spanish: the authors crowdsourced Spanish speakers to read both sides of a conversational Spanish-English machine translation dataset.
– English: TTS-generated English speech (single female voice).
• Data specifics
– Input speech is 16 kHz; feature frames are created by stacking 3 adjacent frames of an 80-channel log-mel spectrogram (see the sketch at the end of this slide).
– Output is 24 kHz, using a reduction factor of 2, predicting two spectrogram frames for each decoding step.
• The speaker encoder was not used in these experiments since the target speech always came from the same (TTS) speaker.
• Using the auxiliary decoders helps training, so they use 3 losses with multi-task learning.
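A small NumPy sketch of the frame-stacking step mentioned above. Whether the stacked windows overlap or the sequence is downsampled is my assumption; the slide only says that 3 adjacent frames are stacked.

import numpy as np

log_mel = np.random.randn(300, 80)          # (frames, mel channels) for one utterance
n = (log_mel.shape[0] // 3) * 3             # drop any trailing frames that don't fill a group
stacked = log_mel[:n].reshape(-1, 3 * 80)   # (100, 240): 3 adjacent frames per input step
print(stacked.shape)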
16. Experiments 2 – Fisher Spanish-to-English
• Datasets (120k parallel utterance pairs, spanning 127 hours of source speech)
– Spanish: the Fisher Spanish corpus of telephone conversations.
– English: TTS-generated English speech (single female voice).
• Data specifics
– Input speech is 8 kHz; features are constructed by stacking 80-channel log-mel spectrograms with deltas and accelerations.
– Obtaining good performance required significantly more careful regularization and tuning; Gaussian weight noise was added to all LSTM weights as regularization (see the sketch at the end of this slide).
– Output is 24 kHz, using a reduction factor of 2, predicting two spectrogram frames for each decoding step.
• This dataset is especially sensitive to the auxiliary decoder hyperparameters.
• They find that pre-training the bottom 6 encoder layers on a speech-to-text translation (ST) task improves BLEU scores by over 5 points.
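A small PyTorch sketch of Gaussian weight noise regularization: before each training step, zero-mean Gaussian noise is added to the LSTM weights. The 0.075 standard deviation is a common value from the literature, not necessarily the paper's setting.

import torch
import torch.nn as nn

lstm = nn.LSTM(240, 256, batch_first=True, bidirectional=True)

def add_weight_noise(module, stddev=0.075):
    # Perturb only weight matrices (not biases) with Gaussian noise.
    with torch.no_grad():
        for name, param in module.named_parameters():
            if "weight" in name:
                param.add_(torch.randn_like(param) * stddev)

add_weight_noise(lstm)   # applied each step during training only, not at inference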
17. Experiments 3 – Naturalness MOS
• Using WaveRNN vocoders dramatically improves ratings over Griffin-Lim, into the “Very Good” range.
18. Experiments 4 – Cross-language voice transfer
• Using 606k utterance pairs
– Since the target recordings contained noise, they apply the denoising and volume normalization from [15] to improve output quality.
• Data specifics
– Input speech is 16 kHz; feature frames are created by stacking 3 adjacent frames of an 80-channel log-mel spectrogram.
– Output is 24 kHz, using a reduction factor of 2, predicting two spectrogram frames for each decoding step.
• The full model depicted in Figure 1 is trained.
[15] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Proc. NeurIPS, 2018.
19. Experiments - Summary
• The Google AI engineers validated Translatotron’s translation quality by measuring the BLEU (bilingual evaluation understudy) score, computed on text produced by a speech recognition system.
• They use a 16k word-piece attention-based ASR model trained on the 960-hour LibriSpeech corpus.
• In addition, they conduct listening tests to measure subjective speech naturalness via Mean Opinion Score (MOS), as well as speaker similarity MOS for voice transfer.
24. Conclusion
• The authors conclude that Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language, and that it can retain the source speaker's voice in the translated speech.
• They consider this a starting point for future research on end-to-end speech-to-speech translation systems.
• In addition, they find that it is important to use speech transcripts during training (for the auxiliary tasks).
25. Major Take-away
• End-to-end direct speech-to-speech translation.
• A speaker encoder is used for voice transfer.
• The primary decoder predicts 1025-dim log spectrogram frames corresponding to the translated speech.
– They also use a reduction factor of 2, predicting two spectrogram frames for each decoding step.
• Auxiliary tasks
– 3 losses with multitask learning, used only during training.
26. Major Take-away
• In my research, the encoder and decoder each use 6 stacked layers with 8-head attention.
• The encoder maps 80-channel log-mel spectrogram input features through multi-head attention with positional encoding, and the decoder predicts 80-channel mel spectrogram frames corresponding to the transferred voice.
• In this paper, however, the encoder stack maps 80-channel log-mel spectrogram input features into hidden states which are passed through an attention-based alignment mechanism to condition an autoregressive decoder, which predicts 1025-dim log spectrogram frames.
• The auxiliary decoders, each with their own attention components, predict source and target phoneme sequences.
Editor's Notes
Contents
Introduction1
Introduction2
Introduction3, End
Overview of model
This model composed of several separately trained components: an attention-based seq2seq network, which generates target spectrograms.
A vocoder which converts target spectrograms to time-domain waveforms and, optionally, a pretrained speaker encoder which can be used to condition the decoder on the identity of the source speaker, enabling cross-language voice conversion with translation.
Primary tasks
Primary tasks
Vocoder
Vocoder
The decoder reduces training time, synthesis time, and model size by predicting multiple spectrogram frames per time step instead of just one. This is possible because consecutive spectrogram frames share a lot of overlapping information. The number of frames predicted per decoder time step is called the reduction factor (r).
Speaker Encoder (D-Vector)
Auxiliary tasks
Training Parameter
We study two Spanish-to-English translation datasets: the large scale “conversational” corpus of parallel text and read speech pairs from [21], and the Spanish Fisher corpus of telephone conversations and corresponding English translations [38], which is smaller and more challenging due to the spontaneous and informal speaking style. In Sections 3.1 and 3.2, we synthesize target speech from the target transcript using a single (female) speaker English TTS system; In Section 3.4, we use real human target speech for voice transfer experiments on the conversational dataset. Models were implemented using the Lingvo framework [39].
English: Instead of using the human target speech, they use a TTS model to synthesize target speech in a single female English speaker’s voice in order to simplify the learning objective (English Tacotron 2 TTS model)