Translatotron: Jia, Ye, et al. "Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model." Proc. Interspeech 2019 (2019): 1123-1127. Review by June-Woo Kim
Direct speech translation with seq2seq model
1. Direct speech-to-speech translation with a sequence-to-sequence model
Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson,
Zhifeng Chen, Yonghui Wu
Google Research.
arXiv date: 12-April-2019
Presented by: June-Woo Kim
Artificial Brain Research Lab., School of Sensor and Display,
Kyungpook National University
05-June-2019
2. Table of Contents
• Overview of the paper
• The Proposed Model
• Experiments and Results
• Conclusion
• Major Take-away
3. Introduction & Main goal of the Paper
• The new system could soon greatly improve foreign-language interactions.
• Current translators break the translation process down into three steps, based on converting the speech to text (see the sketch after this list):
– Speech Recognition (ASR): used to convert the source speech into text.
– Machine Translation (MT): used to translate the recognized text into the target language.
– Text-to-Speech Synthesis (TTS): used to produce speech in the target language from the translated text.
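To make the contrast concrete, here is a tiny Python sketch of how a cascade composes three systems via text, versus a direct model that maps spectrograms to spectrograms. The asr/mt/tts stand-ins are hypothetical stubs for illustration only, not real systems.

# Hypothetical stand-ins for the three cascade components (illustration only).
asr = lambda audio: "¿cómo estás?"      # source speech -> source-language text
mt = lambda text: "how are you?"        # source text -> target-language text
tts = lambda text: [0.0] * 24000        # target text -> waveform samples

def cascade_s2st(source_audio):
    # Conventional pipeline: three separately trained systems chained via text.
    return tts(mt(asr(source_audio)))

def direct_s2st(source_spectrogram, model):
    # Translatotron-style: a single seq2seq model, no intermediate text.
    return model(source_spectrogram)    # target-language spectrogram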
4. Introduction & Main goal of the Paper
• Google's new system, however, unlike cascaded systems, does not rely on an intermediate text representation in either language.
• The new system, called Translatotron, uses machine learning to bypass the text representation steps, converting spectrograms of speech in one language into spectrograms in another language.
– An attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language without relying on an intermediate text representation.
– The network is trained end-to-end, learning to map source speech spectrograms into target spectrograms in another language.
– Experiments are conducted on two different Spanish-to-English datasets.
– The proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model.
5. Introduction & Main goal of the Paper
• Although it is in its early stages, the system can reproduce some aspects of the original speaker's voice and tone.
6. The Proposed Model
• Primary task
– Attention-based Seq2Seq model.
• Speaker Encoder
– Pretrained speaker D-Vector.
• Vocoder
– Converts target spectrograms to time-domain waveforms.
• Auxiliary tasks (Secondary tasks)
– Predict source and target phoneme sequences.
7. Primary task: Attention-based Seq2Seq network (Encoder)
• The sequence-to-sequence encoder stack maps 80-channel log-mel spectrogram input features into hidden states, which are passed through an attention-based alignment mechanism to condition an autoregressive decoder.
– 8 BLSTM layers.
– The final layer's output is passed to the primary decoder, whereas intermediate activations are passed to auxiliary decoders predicting phoneme sequences.
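A minimal PyTorch sketch of this kind of encoder stack, exposing intermediate activations for the auxiliary decoders. The BLSTMEncoder class, hidden sizes, and shapes are my own illustrative assumptions; the paper's model is implemented in the Lingvo framework.

import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    """Stack of bidirectional LSTM layers that also exposes intermediate activations."""
    def __init__(self, input_dim=80 * 3, hidden_dim=256, num_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.LSTM(input_dim if i == 0 else 2 * hidden_dim,
                     hidden_dim, batch_first=True, bidirectional=True)
             for i in range(num_layers)]
        )

    def forward(self, x):
        intermediates = []           # per-layer activations, usable by auxiliary decoders
        for lstm in self.layers:
            x, _ = lstm(x)
            intermediates.append(x)
        return x, intermediates      # final output goes to the primary decoder

# Example: batch of 4 utterances, 100 frames of stacked 80-channel log-mel features.
enc = BLSTMEncoder()
out, inters = enc(torch.randn(4, 100, 240))
print(out.shape, len(inters))        # torch.Size([4, 100, 512]) 8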
8. Primary task: Attention-based Seq2Seq network (Decoder)
• It is similar to the Tacotron 2 TTS model, including pre-net, autoregressive LSTM stack, and post-net components.
– Pre-net: a 2-layer fully connected network.
– Autoregressive LSTM stack: a bidirectional LSTM in the encoder, 2 unidirectional LSTM layers in the decoder.
– Post-net: a 5-layer CNN with residual connections, which refines the mel spectrogram.
• Multi-head attention with 4 heads is used instead of location-sensitive attention (sketched below).
– Location-sensitive attention: connects the encoder and decoder.
• Using 4 or 6 decoder LSTM layers leads to good performance.
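A rough PyTorch sketch of one autoregressive decoder step using 4-head attention over the encoder states. The pre-net/LSTM sizes are assumptions, the post-net is omitted, and this is not the exact Translatotron code.

import torch
import torch.nn as nn

d_model = 512
prenet = nn.Sequential(nn.Linear(1025, 256), nn.ReLU(), nn.Linear(256, d_model), nn.ReLU())
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
decoder_lstm = nn.LSTM(2 * d_model, d_model, num_layers=4, batch_first=True)
frame_proj = nn.Linear(d_model, 1025 * 2)                 # reduction factor 2: two frames per step

encoder_out = torch.randn(1, 100, d_model)                # encoder hidden states
prev_frame = torch.randn(1, 1, 1025)                      # previously predicted spectrogram frame

query = prenet(prev_frame)                                # (1, 1, 512)
context, _ = attention(query, encoder_out, encoder_out)   # attend over the encoder states
lstm_out, _ = decoder_lstm(torch.cat([query, context], dim=-1))
next_frames = frame_proj(lstm_out).view(1, 2, 1025)       # two 1025-dim spectrogram frames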
9. What is Tacotron 2?
• Input: a character sequence, e.g., the Korean greeting "ㅇ ㅏ ㄴ ㄴ ㅕ ㅇ ㅎ ㅏ ㅅ ㅔ 요" ("hello") → Character Embedding
• Encoder
– 3 convolution layers → bidirectional LSTM (512 units)
• Attention unit: location-sensitive attention connecting the encoder and decoder
• Decoder
– LSTM layers (2 unidirectional layers with 1024 units) → Linear Transform → Predicted Spectrogram Frame
• PostNet (5 convolutional layers) → Enhanced Prediction
• WaveNet Vocoder
– Maps the predicted 80-dimensional mel spectrogram (12.5 ms frames) to a 24 kHz waveform
• Final output: a wav file
10. Vocoder: Converts target spectrograms to time-domain waveforms
• They use the Griffin-Lim vocoder.
– However, they use a WaveRNN neural vocoder when evaluating speech naturalness in listening tests (MOS tests).
• Using a reduction factor r
– Multiple frames are generated at the same decoder time step, controlled by the reduction factor r.
– In this paper, the reduction factor is set to 2.
– That is, the decoder generates a log spectrogram corresponding to two frames in one time step.
– This exploits the continuity of the speech signal.
– The assumption holds because a single speech sound typically spans several consecutive frames, so adjacent frames carry largely overlapping information.
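A small sketch of what the reduction factor means for output shapes: with r = 2 the decoder emits r times as many values per step, which are then reshaped into consecutive frames. The frame counts below are arbitrary examples.

import torch

r, n_freq, steps = 2, 1025, 50
decoder_output = torch.randn(1, steps, r * n_freq)           # one utterance, 50 decoder steps
spectrogram = decoder_output.reshape(1, steps * r, n_freq)    # 100 spectrogram frames
print(spectrogram.shape)                                      # torch.Size([1, 100, 1025])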
11. Speaker Encoder: D-vector (Speaker Independent System’s features)
• Pretrained on a speaker verification task; its output is a d-vector.
– 851K speakers across 8 languages.
– Not updated during the training of Translatotron.
– This model computes a 256-dim speaker embedding from a reference utterance of the speaker, which is passed into a linear projection layer to reduce the dimensionality to 16.
• The speaker encoder output is concatenated to the output of the last BLSTM layer.
• This component is only used for the voice transfer task.
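A minimal sketch of the speaker conditioning described above, following the dimensions on this slide (256-dim d-vector projected to 16 dims); the 512-dim encoder output is an assumption.

import torch
import torch.nn as nn

proj = nn.Linear(256, 16)                        # 256-dim d-vector -> 16-dim projection

d_vector = torch.randn(1, 256)                   # from the frozen, pretrained speaker encoder
encoder_out = torch.randn(1, 100, 512)           # output of the last BLSTM encoder layer

spk = proj(d_vector).unsqueeze(1).expand(-1, encoder_out.size(1), -1)  # (1, 100, 16)
conditioned = torch.cat([encoder_out, spk], dim=-1)                    # (1, 100, 528)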
12. Auxiliary tasks: Two other networks branching from the encoder
• Two optional auxiliary decoders, each with its own attention component, predict source and target phoneme sequences.
– 2-layer LSTMs with single-head additive attention.
• One decoder predicts source phonemes and the other predicts target phonemes.
– In the Conversational experiment, both auxiliary decoders take the output of the 8th BLSTM layer as input.
– In the Fisher experiment, however, the source decoder takes the output of the 4th BLSTM layer and the target decoder takes the output of the 6th BLSTM layer.
• Used only during training; not run at test time.
• Multitask training (using 3 losses to obtain better BLEU scores).
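A tiny sketch of the multitask objective: the spectrogram loss plus the two auxiliary phoneme losses, combined during training only. The loss values and weights are placeholders; the slide does not state the exact weighting.

import torch

spec_loss = torch.tensor(1.3)         # e.g. regression loss on predicted target spectrograms
src_phoneme_loss = torch.tensor(0.7)  # cross-entropy of the source phoneme decoder
tgt_phoneme_loss = torch.tensor(0.9)  # cross-entropy of the target phoneme decoder

w_src, w_tgt = 1.0, 1.0               # example weights, not the paper's values
total_loss = spec_loss + w_src * src_phoneme_loss + w_tgt * tgt_phoneme_loss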
14. Experiment
• 1. Conversational Spanish-to-English
• 2. Fisher Spanish-to-English
– Spanish corpus of telephone conversations and corresponding English translations
• 3. MOS
• 4. Voice Cloning
• To evaluate speech-to-speech translation performance, they compute BLEU scores as an objective measure of speech intelligibility and translation quality, by using a pretrained ASR system to recognize the generated speech and comparing the resulting transcripts to ground-truth reference translations.
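A short sketch of this evaluation protocol using the sacrebleu package. The transcripts below are hard-coded stand-ins; in practice they would be ASR output on the generated audio.

import sacrebleu

# Transcribe the generated target speech with a pretrained ASR system (placeholders here),
# then score the transcripts against the reference translations.
hypotheses = ["how are you doing today", "i will see you tomorrow"]
references = [["how are you doing today", "i will see you tomorrow"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)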
15. Experiments 1 – Conversational Spanish-to-English
• Datasets (979k parallel utterance pairs)
– Spanish: the authors crowdsourced Spanish speakers to read both sides of a conversational Spanish-English machine translation dataset.
– English: TTS-generated English speech (single female voice).
• Data specifics
– Input speech is 16 kHz; feature frames are created by stacking 3 adjacent frames of an 80-channel log-mel spectrogram (see the sketch at the end of this slide).
– Output is 24 kHz, using a reduction factor of 2, predicting two spectrogram frames for each decoding step.
• The speaker encoder was not used in these experiments since the target speech always came from the same (TTS) speaker.
• Using the auxiliary decoders helps training, so they use 3 losses with multi-task learning.
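A small NumPy sketch of the frame-stacking step mentioned above. Whether the stacked windows overlap or the sequence is downsampled is my assumption; the slide only says that 3 adjacent frames are stacked.

import numpy as np

log_mel = np.random.randn(300, 80)          # (frames, mel channels) for one utterance
n = (log_mel.shape[0] // 3) * 3             # drop any trailing frames that don't fill a group
stacked = log_mel[:n].reshape(-1, 3 * 80)   # (100, 240): 3 adjacent frames per input step
print(stacked.shape)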
16. Experiments 2 – Fisher Spanish-to-English
• Datasets (120k parallel utterance pairs, spanning 127 hours of source speech)
– Spanish: the Fisher Spanish corpus of telephone conversations.
– English: TTS-generated English speech (single female voice).
• Data specifics
– Input speech is 8 kHz; features are constructed by stacking 80-channel log-mel spectrograms with deltas and accelerations.
– Obtaining good performance required significantly more careful regularization and tuning; Gaussian weight noise was added to all LSTM weights as regularization (see the sketch at the end of this slide).
– Output is 24 kHz, using a reduction factor of 2, predicting two spectrogram frames for each decoding step.
• This dataset is especially sensitive to the auxiliary decoder hyperparameters.
• They find that pre-training the bottom 6 encoder layers on a speech-to-text translation (ST) task improves BLEU scores by over 5 points.
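A small PyTorch sketch of Gaussian weight noise regularization: before each training step, zero-mean Gaussian noise is added to the LSTM weights. The 0.075 standard deviation is a common value from the literature, not necessarily the paper's setting.

import torch
import torch.nn as nn

lstm = nn.LSTM(240, 256, batch_first=True, bidirectional=True)

def add_weight_noise(module, stddev=0.075):
    # Perturb only weight matrices (not biases) with Gaussian noise.
    with torch.no_grad():
        for name, param in module.named_parameters():
            if "weight" in name:
                param.add_(torch.randn_like(param) * stddev)

add_weight_noise(lstm)   # applied each step during training only, not at inference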
17. Experiments 3 – Naturalness MOS
• Using WaveRNN vocoders dramatically improves ratings over Griffin-Lim, into the “Very Good” range.
18. Experiments 4 – Cross-language voice transfer
• Using 606k utterance pairs
– Since the target recordings contained noise, they apply the denoising and volume normalization from [15] to improve output quality.
• Data specifics
– Input speech is 16 kHz; feature frames are created by stacking 3 adjacent frames of an 80-channel log-mel spectrogram.
– Output is 24 kHz, using a reduction factor of 2, predicting two spectrogram frames for each decoding step.
• The full model depicted in Figure 1 is trained.
[15] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Proc. NeurIPS, 2018.
19. Experiments - Summary
• The Google AI engineers validated Translatotron’s translation quality by measuring the BLEU (bilingual evaluation understudy) score, computed on text produced by a speech recognition system.
• They use a 16k word-piece attention-based ASR model trained on the 960-hour LibriSpeech corpus.
• In addition, they conduct listening tests to measure subjective speech naturalness via Mean Opinion Score (MOS), as well as speaker similarity MOS for voice transfer.
24. Conclusion
• The authors conclude that Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language, and that it can retain the source speaker's voice in the translated speech.
• They consider this a starting point for future research on end-to-end speech-to-speech translation systems.
• In addition, they find that it is important to use speech transcripts during training (for the auxiliary tasks).
25. Major Take-away
• End-to-end direct speech-to-speech translation.
• A speaker encoder is used for voice transfer.
• The primary decoder predicts 1025-dim log spectrogram frames corresponding to the translated speech.
– They also use a reduction factor of 2, predicting two spectrogram frames for each decoding step.
• Auxiliary tasks
– 3 losses with multitask learning, used only during training.
26. Major Take-away
• In my research, the encoder and decoder each use 6 stacked layers with 8-head attention.
• The encoder maps 80-channel log-mel spectrogram input features through multi-head attention with positional encoding, and the decoder predicts 80-channel mel spectrogram frames corresponding to the transferred voice.
• In this paper, however, the encoder stack maps 80-channel log-mel spectrogram input features into hidden states which are passed through an attention-based alignment mechanism to condition an autoregressive decoder, which predicts 1025-dim log spectrogram frames.
• The auxiliary decoders, each with their own attention components, predict source and target phoneme sequences.
Editor's Notes
Contents
Introduction1
Introduction2
Introduction3, End
Overview of model
This model composed of several separately trained components: an attention-based seq2seq network, which generates target spectrograms.
A vocoder which converts target spectrograms to time-domain waveforms and, optionally, a pretrained speaker encoder which can be used to condition the decoder on the identity of the source speaker, enabling cross-language voice conversion with translation.
Primary tasks
Primary tasks
Vocoder
Vocoder
The decoder reduces training time, synthesis time, and model size by predicting multiple spectrogram frames per time step instead of just one. This is possible because consecutive spectrogram frames share a lot of overlapping information. The number of frames predicted per decoder time step is called the reduction factor (r).
Speaker Encoder (D-Vector)
Auxiliary tasks
Training Parameter
We study two Spanish-to-English translation datasets: the large scale “conversational” corpus of parallel text and read speech pairs from [21], and the Spanish Fisher corpus of telephone conversations and corresponding English translations [38], which is smaller and more challenging due to the spontaneous and informal speaking style. In Sections 3.1 and 3.2, we synthesize target speech from the target transcript using a single (female) speaker English TTS system; In Section 3.4, we use real human target speech for voice transfer experiments on the conversational dataset. Models were implemented using the Lingvo framework [39].
English: Instead of using the human target speech, they use a TTS model to synthesize target speech in a single female English speaker’s voice in order to simplify the learning objective (English Tacotron 2 TTS model)