Direct speech-to-speech translation with a
sequence-to-sequence model
Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson,
Zhifeng Chen, Yonghui Wu
Google Research.
arXiv date: 12 April 2019
Presented by: June-Woo Kim
Artificial Brain Research Lab., School of Sensor and Display,
Kyungpook National University
05-June-2019
Table of Contents
• Overview of the paper
• The Proposed Model
• Experiments and Results
• Conclusion
• Major Take-away
Introduction & Main goal of the Paper
• The new system could soon greatly improve foreign-language interactions.
• Conventional translation systems break the process into three steps built around converting speech to text (a minimal sketch follows below):
– Speech recognition: converts the source speech into text.
– Machine translation: translates the recognized text into the target language.
– Text-to-speech synthesis (TTS): produces speech in the target language from the translated text.
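As a rough illustration of the cascade above, the Python sketch below composes the three stages; asr_transcribe, translate_text, and synthesize_speech are hypothetical placeholders, not a real API.

```python
# Hypothetical cascaded speech-to-speech translation pipeline.
# Each stage stands in for a separately trained model.

def asr_transcribe(source_audio):
    """Speech recognition: source-language speech -> source-language text."""
    ...

def translate_text(source_text):
    """Machine translation: source-language text -> target-language text."""
    ...

def synthesize_speech(target_text):
    """TTS: target-language text -> target-language speech."""
    ...

def cascade_s2st(source_audio):
    # Text is the explicit intermediate representation at both seams,
    # which is exactly what Translatotron avoids.
    text = asr_transcribe(source_audio)
    translation = translate_text(text)
    return synthesize_speech(translation)
```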
Introduction & Main goal of the Paper
• Google's system, however, unlike cascaded systems, does not rely on an intermediate text representation in either language.
• The new system called Translatotron uses machine learning to
bypass the text representation steps, converting spectrograms of
speech from one language into another language.
– Attention-based sequence-to-sequence neural network which can directly
translate speech from one language into speech in another language without
relying on an intermediate text representation.
– The network is trained end-to-end, learning to map source speech
spectrograms into target spectrograms in another language.
– Experiments are conducted on two different Spanish-to-English datasets.
– The proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model.
Introduction & Main goal of the Paper
• Although it is in its early stages, the system can reproduce some aspects of the original speaker's voice and tone.
The Proposed Model
• Primary task
– Attention-based Seq2Seq model.
• Speaker Encoder
– Pretrained speaker D-Vector.
• Vocoder
– Converts target spectrograms to time-domain waveforms.
• Auxiliary tasks (Secondary tasks)
– Predict source and target phoneme sequences.
Primary task: Attention-based Seq2Seq network (Encoder)
• The sequence-to-sequence encoder stack maps 80-channel log-mel spectrogram input features into hidden states, which are passed through an attention-based alignment mechanism to condition an autoregressive decoder.
– 8 BLSTM layers.
– The final layer's output is passed to the primary decoder, while intermediate activations are passed to auxiliary decoders that predict phoneme sequences (see the sketch below).
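A minimal PyTorch sketch of such an encoder stack, assuming a hidden size (256 per direction) that the slide does not state; it keeps every layer's output so the auxiliary decoders can tap intermediate activations.

```python
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    """Stack of 8 bidirectional LSTM layers over 80-channel log-mel frames.
    The hidden size (256 per direction) is an assumed value."""

    def __init__(self, input_dim=80, hidden_dim=256, num_layers=8):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            in_dim = input_dim if i == 0 else 2 * hidden_dim
            self.layers.append(
                nn.LSTM(in_dim, hidden_dim, batch_first=True, bidirectional=True))

    def forward(self, x):
        # x: (batch, time, 80). Keep every layer's output so auxiliary
        # decoders can read intermediate activations (e.g. layers 4 and 6
        # in the Fisher experiments); the last entry feeds the primary decoder.
        activations = []
        for lstm in self.layers:
            x, _ = lstm(x)
            activations.append(x)
        return x, activations
```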
Primary task: Attention-based Seq2Seq network (Decoder)
• It is similar to the Tacotron 2 TTS model, including pre-net, autoregressive LSTM stack, and post-net components.
– Pre-net: a 2-layer fully connected network.
– Autoregressive LSTM stack: unidirectional LSTM layers in the decoder (Tacotron 2 uses 2), fed from the bidirectional LSTM encoder.
– Post-net: a 5-layer CNN with residual connections that refines the predicted mel spectrogram.
• Uses multi-head attention with 4 heads instead of the location-sensitive attention that connects encoder and decoder in Tacotron 2.
• A deeper decoder with 4 or 6 LSTM layers leads to good performance (a single decoding step is sketched below).
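A hedged PyTorch sketch of one decoding step with the components named above; the 1025-dim output, 4 attention heads, and reduction factor 2 come from the slides, while all other dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One autoregressive decoding step: pre-net -> multi-head attention
    -> LSTM stack -> linear projection to r spectrogram frames."""

    def __init__(self, enc_dim=512, spec_dim=1025, r=2,
                 prenet_dim=256, lstm_dim=1024, lstm_layers=4):
        super().__init__()
        self.prenet = nn.Sequential(            # 2-layer fully connected pre-net
            nn.Linear(spec_dim * r, prenet_dim), nn.ReLU(),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU())
        self.attention = nn.MultiheadAttention(enc_dim, num_heads=4,
                                               batch_first=True)
        self.lstm = nn.LSTM(prenet_dim + enc_dim, lstm_dim,
                            num_layers=lstm_layers, batch_first=True)
        self.proj = nn.Linear(lstm_dim, spec_dim * r)   # r frames per step

    def forward(self, prev_frames, query, enc_out, state=None):
        # prev_frames: (batch, 1, spec_dim * r) frames from the previous step
        # query:       (batch, 1, enc_dim) attention query (e.g. decoder state)
        # enc_out:     (batch, time, enc_dim) encoder hidden states
        context, _ = self.attention(query, enc_out, enc_out)
        x = torch.cat([self.prenet(prev_frames), context], dim=-1)
        out, state = self.lstm(x, state)
        return self.proj(out), state
```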
What is Tacotron 2?
• Input: ㅇ ㅏ ㄴ ㄴ ㅕ ㅇ ㅎ ㅏ ㅅ ㅔ 요 (Korean jamo for “안녕하세요”, “hello”) → character embedding
• Encoder
– 3 convolution layers → bidirectional LSTM (512 neurons)
• Attention unit: location-sensitive attention connects the encoder and decoder
• Decoder
– LSTM stack (2 unidirectional layers with 1024 neurons) → linear transform → predicted spectrogram frame
• Post-net (5 convolutional layers) → enhanced prediction
• WaveNet vocoder
– Maps the 80-dimensional mel spectrogram (one frame every 12.5 ms) to a 24 kHz waveform
• Final output: a wav file
Vocoder: Converts target spectrograms to
time-domain waveforms
• They use a Griffin-Lim vocoder.
– However, they use a WaveRNN neural vocoder when evaluating speech naturalness in listening tests (MOS tests).
• Reduction factor r
– The decoder generates multiple frames at each time step; the number of frames predicted per step is the reduction factor r.
– In this paper, the reduction factor is set to 2: the decoder generates a log spectrogram covering two frames in each time step.
– This works because of the continuity of the speech signal: a single pronunciation typically spans several frames, so adjacent frames carry largely redundant information (see the sketch below).
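A sketch of both ideas, assuming non-overlapping frame groups and illustrative Griffin-Lim parameters; librosa's griffinlim is used as a stand-in for the paper's vocoder.

```python
import numpy as np
import librosa

def group_frames(spec, r=2):
    """Group decoder targets so each step predicts r adjacent frames:
    (num_frames, num_bins) -> (num_frames // r, r * num_bins).
    Trailing frames that do not fill a group are dropped in this sketch."""
    t = (spec.shape[0] // r) * r
    return spec[:t].reshape(t // r, r * spec.shape[1])

def griffin_lim_to_wave(magnitude, n_iter=60):
    """Reconstruct a waveform from a linear magnitude spectrogram.
    magnitude: (num_bins, num_frames); 1025 bins corresponds to n_fft=2048,
    matching the model's 1025-dim output frames."""
    return librosa.griffinlim(magnitude, n_iter=n_iter)
```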
Speaker Encoder: D-vector (speaker-independent features)
• Pretrained on a speaker verification task; its output is a d-vector.
– 851K speakers across 8 languages.
– Not updated during the training of Translatotron.
– This model computes a 256-dim speaker embedding from the speaker
reference utterance, which is passed into a linear projection layer to
reduce the dimensionality to 16.
• The speaker encoder's output is concatenated with the output of the last BLSTM layer (see the sketch below).
• This component is used only for the voice transfer task.
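A minimal PyTorch sketch of the conditioning described above: a frozen 256-dim d-vector is projected to 16 dims and broadcast-concatenated onto the encoder outputs. Dimensions other than 256 and 16 are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Project a frozen 256-dim d-vector to 16 dims and concatenate it
    onto every encoder time step."""

    def __init__(self, dvector_dim=256, proj_dim=16):
        super().__init__()
        self.proj = nn.Linear(dvector_dim, proj_dim)

    def forward(self, enc_out, dvector):
        # enc_out: (batch, time, enc_dim); dvector: (batch, 256), not updated.
        emb = self.proj(dvector)                              # (batch, 16)
        emb = emb.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        return torch.cat([enc_out, emb], dim=-1)              # (..., enc_dim + 16)
```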
Auxiliary tasks: Two other networks branching from the encoder
• Two optional auxiliary decoders, each with its own attention component, predict the source and target phoneme sequences.
– 2-layer LSTMs with single-head additive attention.
• One decoder predicts source phonemes, the other target phonemes.
– In the conversational experiment, both auxiliary decoders take the output of the 8th BLSTM layer as input.
– In the Fisher experiment, however, the source decoder takes the output of the 4th BLSTM layer and the target decoder the output of the 6th.
• Used only during training; not run at test time.
• Multitask training (three losses combined to get better BLEU scores; see the sketch below).
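A hedged sketch of the three-loss multitask objective: the primary spectrogram loss plus the two auxiliary phoneme cross-entropies. The L1 spectrogram loss and the loss weights are assumptions; the slide does not specify them.

```python
import torch.nn as nn

spec_loss_fn = nn.L1Loss()               # assumed spectrogram loss
phoneme_loss_fn = nn.CrossEntropyLoss()  # auxiliary phoneme prediction

def multitask_loss(pred_spec, tgt_spec,
                   src_logits, src_phones, tgt_logits, tgt_phones,
                   w_src=1.0, w_tgt=1.0):
    # pred_spec/tgt_spec: (batch, steps, bins)
    # *_logits: (batch, steps, vocab); *_phones: (batch, steps) int labels
    loss = spec_loss_fn(pred_spec, tgt_spec)
    loss = loss + w_src * phoneme_loss_fn(src_logits.transpose(1, 2), src_phones)
    loss = loss + w_tgt * phoneme_loss_fn(tgt_logits.transpose(1, 2), tgt_phones)
    return loss
```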
Training Parameters
• Batch size: 1024
• Optimizer: Adafactor
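One way to reproduce this setup, shown here with the Adafactor implementation from the Hugging Face transformers library; the paper itself was implemented in Lingvo, and whether it used these exact settings is not stated.

```python
import torch.nn as nn
from transformers.optimization import Adafactor

model = nn.Linear(80, 80)  # stand-in for the full Translatotron model

optimizer = Adafactor(
    model.parameters(),
    lr=None,                # use Adafactor's internal relative step size
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
)
```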
Experiment
• 1. Conversational Spanish-to-English
• 2. Fisher Spanish-to-English
– Spanish corpus of telephone conversations and corresponding English
translations
• 3. MOS
• 4. Voice Cloning
• To evaluate speech-to-speech translation performance, they compute BLEU scores as an objective measure of speech intelligibility and translation quality: a pretrained ASR system recognizes the generated speech, and the resulting transcripts are compared to ground-truth reference translations (see the sketch below).
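A sketch of this evaluation loop; asr_transcribe is a hypothetical placeholder for the pretrained ASR model, while sacrebleu provides a standard corpus-level BLEU.

```python
import sacrebleu

def asr_transcribe(wav):
    """Hypothetical placeholder for the pretrained ASR model
    (16k word-piece, attention-based, trained on LibriSpeech)."""
    ...

def evaluate_s2st(generated_wavs, reference_translations):
    """Recognize model outputs with ASR, then score the transcripts
    against ground-truth reference translations with corpus BLEU."""
    hypotheses = [asr_transcribe(wav) for wav in generated_wavs]
    return sacrebleu.corpus_bleu(hypotheses, [reference_translations]).score
```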
Experiments 1 – Conversational
Spanish-to-English
• Datasets (979k parallel utterance pairs)
– Spanish: the authors crowdsourced Spanish speakers to read both sides of a conversational Spanish-English machine translation dataset.
– English: TTS output (female) English speech.
• Data specifics
– Input speech is 16 kHz; feature frames are created by stacking 3 adjacent frames of an 80-channel log-mel spectrogram (see the sketch after this slide).
– Output is 24 kHz, using reduction factor 2: two spectrogram frames are predicted per decoding step.
• The speaker encoder was not used in these experiments since the target speech always came from the same (TTS) speaker.
• Using the auxiliary decoders helps training, so they use three losses with multi-task learning.
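A NumPy sketch of the frame stacking described above; whether the stacking overlaps or downsamples is not stated on the slide, so this version uses non-overlapping groups of 3.

```python
import numpy as np

def stack_adjacent_frames(logmel, k=3):
    """Stack k adjacent 80-channel log-mel frames into one input feature:
    (num_frames, 80) -> (num_frames // k, 80 * k). Frames that do not fill
    a final group are dropped in this sketch."""
    t = (logmel.shape[0] // k) * k
    return logmel[:t].reshape(t // k, k * logmel.shape[1])

# Example: a 300-frame utterance becomes 100 stacked frames of width 240.
features = stack_adjacent_frames(np.random.randn(300, 80))
assert features.shape == (100, 240)
```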
Experiments 2 – Fisher Spanish-to-English
• Datasets (120k parallel utterance pairs, spanning 127 hours of source speech)
– Spanish: the Spanish Fisher corpus of telephone conversations.
– English: TTS output (female) English speech.
• Data specifics
– Input speech is 8 kHz; features are constructed from 80-channel log-mel spectrograms stacked with their deltas and accelerations.
– Obtaining good performance required significantly more careful regularization and tuning; Gaussian weight noise was added to all LSTM weights as regularization (see the sketch after this slide).
– Output is 24 kHz, using reduction factor 2: two spectrogram frames are predicted per decoding step.
• This dataset is especially sensitive to the auxiliary decoder hyperparameters.
• They find that pre-training the bottom 6 encoder layers on an ST task
improves BLEU scores by over 5 points.
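A hedged PyTorch sketch of Gaussian weight-noise regularization on LSTM weights; the noise standard deviation and the name-based parameter filter are assumptions, not values from the paper.

```python
import torch

def perturb_lstm_weights(model, stddev=0.05):
    """Temporarily add zero-mean Gaussian noise to every LSTM weight.
    Returns the noise so the caller can subtract it after the step.
    The 'lstm'-in-name filter assumes LSTM submodules are named as such."""
    noise = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            if "lstm" in name.lower() and "weight" in name:
                n = torch.randn_like(param) * stddev
                param.add_(n)
                noise[name] = n
    return noise

def remove_perturbation(model, noise):
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in noise:
                param.sub_(noise[name])

# Per training step: perturb, run forward/backward and optimizer.step(),
# then remove the perturbation before the next step.
```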
Experiments 3 – Naturalness MOS
• Using a WaveRNN vocoder dramatically improves naturalness ratings over Griffin-Lim, into the “very good” range.
Experiments 4 – Cross-language voice transfer
• 606k utterance pairs are used.
– Since target recordings contained noise, they apply the denoising
and volume normalization from [15] to improve output quality.
• Data Specific
– Input speech is 16kHz and feature frames are created by stacking 3
adjacent frames of an 80-channel log-mel spectrogram.
– Output is 24kHz, using Reduction Factor 2, predicting two
spectrogram frames for each decoding step.
• The full model depicted in Figure 1 of the paper is trained.
[15] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen et al., “Transfer learning from speaker verification to
multispeaker text-to-speech synthesis,” in Proc. NeurIPS, 2018.
Experiments - Summary
• The Google AI engineers validated Translatotron’s
translation quality by measuring the BLEU (bilingual
evaluation understudy) score, computed with text converted
by a speech recognition system.
• They use a 16k word-piece attention-based ASR model trained on the 960-hour LibriSpeech corpus.
• In addition, they conduct listening tests to measure
subjective speech naturalness Mean Opinion Score (MOS),
as well as speaker similarity MOS for voice transfer.
Results – Experiments 1
Results – Experiments 2
Results – Experiments 3
Results – Experiments 4
(The result tables and figures appeared as images on the original slides and are not reproduced in this text.)
Conclusion
• The authors concluded that Translatotron is the first end-to-
end model that can directly translate speech from one
language into speech in another language and can retain the
source voice in the translated speech.
• They are considering this as a starting point for future
research on end-to-end speech-to-speech translation systems.
• In addition, they find that it is important to use speech transcripts during training (for the auxiliary tasks).
Major Take-away
• End-to-End direct speech-to-speech translation.
• A speaker encoder is used for voice transfer.
• The primary decoder predicts 1025-dim log spectrogram frames corresponding to the translated speech.
– A reduction factor of 2 is also used, predicting two spectrogram frames per decoding step.
• Auxiliary tasks
– Three losses with multitask learning, used only during training.
Major Take-away
• In my own research, the encoder and decoder are 6-layer stacks with 8-head attention.
• My encoder maps 80-channel log-mel spectrogram input features through multi-head attention with positional encoding, and the decoder predicts 80-channel mel spectrogram frames corresponding to the transferred voice.
• In this paper, however, the encoder stack maps 80-channel log-mel spectrogram input features into hidden states that are passed through an attention-based alignment mechanism to condition an autoregressive decoder, which predicts 1025-dim log spectrogram frames.
• Its auxiliary decoders, each with its own attention component, predict the source and target phoneme sequences.
Editor's Notes
1. Contents
2. Introduction 1
3. Introduction 2
4. Introduction 3, end
5. Overview of the model: the model is composed of several separately trained components: an attention-based seq2seq network, which generates target spectrograms; a vocoder, which converts target spectrograms to time-domain waveforms; and, optionally, a pretrained speaker encoder, which can be used to condition the decoder on the identity of the source speaker, enabling cross-language voice conversion together with translation.
6. Primary tasks
7. Primary tasks
8. Vocoder
9. Vocoder: by predicting several spectrogram frames per decoder time step instead of one, the model reduces training time, synthesis time, and model size. This is possible because the spectrograms of consecutive frames share a great deal of overlapping information. The number of frames the decoder predicts per time step is called the reduction factor (r).
10. Speaker encoder (d-vector)
11. Auxiliary tasks
12. Training parameters
13. Two Spanish-to-English translation datasets are studied: the large-scale “conversational” corpus of parallel text and read speech pairs from [21], and the Spanish Fisher corpus of telephone conversations and corresponding English translations [38], which is smaller and more challenging due to its spontaneous and informal speaking style. In Sections 3.1 and 3.2, target speech is synthesized from the target transcript using a single (female) speaker English TTS system; in Section 3.4, real human target speech is used for the voice transfer experiments on the conversational dataset. Models were implemented using the Lingvo framework [39].
14. English: instead of using the human target speech, they use a TTS model to synthesize target speech in a single female English speaker's voice in order to simplify the learning objective (an English Tacotron 2 TTS model).
15. Fisher
16. Talking about MOS
17. About voice transfer using the speaker encoder.