This deck discusses how to utilize pre-trained models effectively for end-to-end speech translation. It describes how cascaded and end-to-end speech translation systems work, and the advantages and disadvantages of each. It then presents stacked acoustic-and-textual encoding (SATE), which stacks a pre-trained acoustic encoder and a pre-trained textual encoder, with an adaptor in between to alleviate representation inconsistencies. Experimental results on the MuST-C dataset show that SATE achieves performance comparable to cascaded systems when large pre-trained ASR and MT models are available. The deck also covers the offline speech translation task of the IWSLT2021 campaign, where ensembles of SATE-based models outperformed the baselines.
Utilizing the Pre-trained Model Effectively for Speech Translation
1. Utilizing the Pre-trained Model Effectively for Speech Translation
许晨 / Chen Xu
东北大学自然语言处理实验室 / NEU NLP Lab.
小牛翻译 / NiuTrans
2. Table of contents
• Introduction to end-to-end speech translation
• End-to-end speech translation with pre-training
• Stacked acoustic-and-textual encoding (SATE)
• Offline speech translation in IWSLT2021 Campaign
• Future work
4. Speech Translation - classification
• Modality
• Speech to text translation (ST)
• Speech to speech translation (SS)
• Latency
• Offline
• Simultaneous/Streaming
[Figure: ST maps source speech "你好" to target text "Hello"; SS maps source speech "你好" to target speech "Hello"]
5. Cascaded System
• Cascade multiple independent models
[Diagram: source speech → ASR → source text → MT → target text]
10. Cascaded System
[Diagram: source speech → ASR → source text → MT → target text]
• Cascade multiple independent models
• Pros
• Studied for a long time
• Large amounts of data available
• Each model can be evaluated and optimized explicitly
• Cons
• Error propagation
• High latency
• Loses the paralinguistic information not present in the text
14. A Unified View
• Learn a mapping function Y = F(X) from sequence to sequence
• MT: source text → target text
• ASR: source audio → source text
• ST: source audio → target text
• SS: source audio → target audio
• TTS: source text → source audio
17. End-to-end System
[Diagram: source speech → ST → target text]
• A single model that translates directly from audio to text in the target language, without an intermediate discrete representation.
• Pros
• Avoids error propagation
• Low latency
• Keeps the paralinguistic information
• Cons
• Heavy burden on the encoder
• Limited training data (low resource)
21. End-to-end System - CTC
• Connectionist Temporal Classification (CTC): learns a soft alignment between the speech features and the source text.
• Decoding: merge repeated tokens, then remove the blank symbol ϵ.
• Example alignments of the predicted sequence that all collapse to the output "hello":
• h h e ϵ l ϵ l l o o
• h e e ϵ l l ϵ l o o
• h h e e ϵ l ϵ l ϵ o
• …
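The collapse rule above can be sketched as a small function (a minimal sketch; here `ϵ` stands for the CTC blank symbol):

```python
def ctc_collapse(path, blank="ϵ"):
    """Collapse a CTC alignment: merge adjacent repeated tokens, then drop blanks."""
    output = []
    prev = None
    for token in path:
        if token != prev and token != blank:
            output.append(token)
        prev = token
    return "".join(output)

# All three example alignments yield the same output:
for path in ["hheϵlϵlloo", "heeϵllϵloo", "hheeϵlϵlϵo"]:
    print(ctc_collapse(path))  # each prints "hello"
```

Note that the blank between the two `l` tokens is what allows a repeated character in the output to survive the merge step.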
22. End-to-end System
ST Target Text
Source speech
• A single model that learns to translate from audio to text in target
language, without intermediate discrete representation.
• Pros
• Avoiding error propagation
• Low latency
• Keep the paralinguistic information
• Cons
• The heavy burden of the encoder
• Limited training data (Low resource)
23. End-to-end System - Pre-training
• Pre-trained ASR encoder + pre-trained MT decoder
[Figure: ASR = encoder + decoder mapping speech features to source text; MT = encoder + decoder mapping source text to target text; ST reuses the ASR encoder and the MT decoder to map speech features to target text]
27. End-to-end System - Rethinking Pre-training
• Is the ASR encoder sufficient for the ST encoder?
• Intuition:
• ASR encoder: transcription (local dependency)
• MT encoder: understanding (global dependency)
• ST encoder: transcription + understanding (both)
• Verification via the localness of self-attention in the encoder: the sum of the attention weights assigned to the surrounding words/features within a fixed window.
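The localness measure can be sketched as follows (a minimal sketch; the window size and the averaging over positions are assumptions, not necessarily the exact definition used in the paper):

```python
import numpy as np

def localness(attention, window=3):
    """attention: (T, T) row-stochastic self-attention matrix.
    Returns the average attention mass each position assigns to
    neighbors within +/- `window` positions (1.0 = fully local)."""
    T = attention.shape[0]
    positions = np.arange(T)
    local_mask = np.abs(positions[:, None] - positions[None, :]) <= window
    return float((attention * local_mask).sum() / T)

# A purely diagonal attention matrix is maximally local:
print(localness(np.eye(8), window=1))  # prints 1.0
```

Under this measure, a high score indicates ASR-like local behavior, while uniformly spread attention (MT-like global behavior) scores low for small windows.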
30. End-to-end System - Rethinking Pre-training
• Different behavior between the MT encoder and the ASR/ST encoders
• Is local attention sufficient for speech translation?
• CTC shows a strong preference for local attention
• Global attention yields better ST performance
• Global attention yields worse ASR performance
• Conclusion
• The ST encoder is not a simple substitution of the ASR encoder (or the MT encoder).
34. SATE - Architecture
• Stacked Acoustic-and-Textual Encoding (SATE)
• Acoustic encoder: processes the acoustic features
• Textual encoder: generates the global representation
• Adaptor: alleviates the representation inconsistency issue
35. SATE - Adaptor
• Adaptor: alleviate the representation inconsistency issue
• Two principles:
• Adaptive: generate an embedding-like representation for the textual encoder
[Figure: the CTC distribution (e.g., 0.1, 0.25, 0.05, 0.2, 0.1, …) weights the token embeddings; X = soft token embedding]
38. SATE - Adaptor
• Adaptor: alleviate the representation inconsistency issue
• Two principles:
• Adaptive: generate an embedding-like representation for the textual encoder
• Informative: keep the paralinguistic information in the acoustic encoder output
• To do
• Length inconsistency: reduce the sequence length for the textual encoder
• Shrink mechanism, downsampling
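The adaptive principle, forming a soft token embedding by weighting the embedding matrix with the CTC distribution, can be sketched as (a minimal sketch; the shapes and function name are assumptions):

```python
import numpy as np

def soft_token_embedding(ctc_logits, embedding):
    """ctc_logits: (T, V) per-frame logits over the source vocabulary.
    embedding: (V, d) token embedding matrix of the textual encoder.
    Returns (T, d): each frame's expected token embedding under the
    CTC distribution, i.e. the soft token embedding X."""
    shifted = ctc_logits - ctc_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)  # CTC posterior per frame
    return probs @ embedding

# Frames with uniform logits average all token embeddings equally:
X = soft_token_embedding(np.zeros((4, 10)), np.ones((10, 8)))
print(X.shape)  # prints (4, 8)
```

Because X lives in the same space as the textual encoder's input embeddings, the textual encoder sees inputs close to what it was pre-trained on.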
39. SATE - MTKD
• Multi-teacher Knowledge Distillation (MTKD)
• Mimic the predicted distributions of the pre-trained models
[Figure: the pre-trained ASR encoder teaches the acoustic encoder via the CTC loss; the pre-trained MT model teaches the SATE output (Linear + Softmax) via a KD loss, combined with the translation loss]
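The KD term, which matches the student's softened output distribution to the teacher's, can be sketched as (a minimal sketch using cross-entropy against the teacher distribution; the temperature parameter is an assumption):

```python
import numpy as np

def softmax(logits, axis=-1):
    shifted = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return shifted / shifted.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between the teacher's and the student's softened
    output distributions, averaged over positions."""
    teacher_probs = softmax(teacher_logits / temperature)
    student_log_probs = np.log(softmax(student_logits / temperature))
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())
```

The total objective on the slide combines such a KD loss with the CTC loss on the acoustic encoder and the standard translation loss.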
44. SATE - Experiments
• Results on MuST-C En-De
• Degraded performance of the cascaded system
• Large performance margin when additional data is allowed
• Significant improvement with SATE
• SATE achieves performance comparable to the cascaded counterpart when large-scale ASR and MT data is available!
45. SATE - Experiments
• Results on LibriSpeech En-Fr
• SATE again achieves performance comparable to the cascaded counterpart when large-scale ASR and MT data is available!
48. SATE - Experiments
• Performance and speedup
• Effects of the adaptor
• Impact on localness
50. IWSLT2021 - Offline Speech Translation
• Data statistics of the ASR, MT, and ST corpora
• Data augmentation: translate the source-language transcriptions into target-language text to build synthetic ST data
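The augmentation step can be sketched as (a minimal sketch; `mt_translate` is a hypothetical stand-in for the pre-trained MT model):

```python
def augment_asr_corpus(asr_pairs, mt_translate):
    """asr_pairs: iterable of (audio, transcription) from the ASR corpus.
    mt_translate: hypothetical function, source text -> target text.
    Yields synthetic ST training pairs (audio, pseudo_translation)."""
    for audio, transcription in asr_pairs:
        yield audio, mt_translate(transcription)

# With a toy "translator" that upper-cases the text:
pairs = list(augment_asr_corpus([("utt1.wav", "hello")], str.upper))
print(pairs)  # prints [('utt1.wav', 'HELLO')]
```

This turns the large ASR corpus into extra audio-to-target-text training data for the end-to-end ST model.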
53. IWSLT2021 - Offline Speech Translation
• Baseline: CTC-based deep Transformer
• Architecture improvements: Conformer + relative position encoding (RPE) + SATE
• Final result: ensemble of multiple diverse models
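One common way to ensemble diverse models, averaging their per-step output distributions during decoding, can be sketched as (a minimal sketch; the slide does not specify the exact combination rule):

```python
import numpy as np

def ensemble_step(step_distributions):
    """step_distributions: list of (V,) next-token distributions,
    one per model in the ensemble. Returns the averaged distribution
    from which the decoder picks the next token."""
    return np.mean(np.stack(step_distributions), axis=0)

# Two models that disagree; the ensemble follows the stronger vote:
avg = ensemble_step([np.array([0.7, 0.3]), np.array([0.2, 0.8])])
print(avg)  # prints [0.45 0.55]
```

Diversity between the ensembled models (different architectures, different training data) is what makes the averaged distribution more robust than any single model.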
54. Future Work
• Adaptation of existing methods
• ASR, MT, NLP, and CV
• Multi-modal
• Utilization of additional training data
• Pre-training, multi-task learning
• Data augmentation
• Unified modeling of the speech input
• Speech features, wav2vec
• Simultaneous speech translation