This deck discusses how to utilize pre-trained models effectively for end-to-end speech translation. It describes how cascaded and end-to-end speech translation systems work, and the advantages and disadvantages of each. It then presents stacked acoustic-and-textual encoding (SATE), which stacks a pre-trained acoustic encoder and a pre-trained textual encoder, with an adaptor in between to alleviate representation inconsistencies. Experimental results on the MuST-C dataset show that SATE achieves performance comparable to cascaded systems when large pre-trained ASR and MT models are available. The deck also covers the offline speech translation task of the IWSLT2021 campaign, where ensembles of SATE-based models outperformed the baselines.
Utilizing the Pre-trained Model Effectively for Speech Translation
1. Utilizing the Pre-trained Model Effectively for Speech Translation
许晨 / Chen Xu
东北大学自然语言处理实验室 / NEU NLP Lab.
小牛翻译 / NiuTrans
2. Table of contents
• Introduction to end-to-end speech translation
• End-to-end speech translation with pre-training
• Stacked acoustic-and-textual encoding (SATE)
• Offline speech translation in IWSLT2021 Campaign
• Future work
4. Speech Translation - classification
• Modality
• Speech to text translation (ST)
• Speech to speech translation (SS)
• Latency
• Offline
• Simultaneous/Streaming
[Figure: ST maps source speech "你好" to target text "Hello"; SS maps source speech "你好" to target speech "Hello"]
5. Cascaded System
• Cascade multiple independent models
[Diagram: source speech → ASR → source text → MT → target text]
10. Cascaded System
[Diagram: source speech → ASR → source text → MT → target text]
• Cascade multiple independent models
• Pros
• Studied for a long time
• Large amounts of data available
• Each model can be evaluated and optimized explicitly
• Cons
• Error propagation
• High latency
• Loses the paralinguistic information not present in the text
14. A Unified View
• Learn a mapping function Y = F(X) from sequence to sequence
• MT: source text → target text
• ASR: source audio → source text
• ST: source audio → target text
• SS: source audio → target audio
• TTS: source text → source audio
17. End-to-end System
[Diagram: source speech → ST → target text]
• A single model that translates directly from audio to text in the target language, without an intermediate discrete representation.
• Pros
• Avoids error propagation
• Low latency
• Keeps the paralinguistic information
• Cons
• Heavy burden on the encoder
• Limited training data (low resource)
21. End-to-end System - CTC
• Connectionist Temporal Classification (CTC): learns a soft alignment between the speech features and the source text.
• Decoding: merge repeated tokens, then remove the blank symbol ϵ.
• Example alignments of the predicted sequence that all collapse to the output "hello":
• h h e ϵ l ϵ l l o o
• h e e ϵ l l ϵ l o o
• h h e e ϵ l ϵ l ϵ o
• …
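The collapse rule above can be sketched as a small function (a minimal sketch; here `ϵ` stands for the CTC blank symbol):

```python
def ctc_collapse(path, blank="ϵ"):
    """Collapse a CTC alignment: merge adjacent repeated tokens, then drop blanks."""
    output = []
    prev = None
    for token in path:
        if token != prev and token != blank:
            output.append(token)
        prev = token
    return "".join(output)

# All three example alignments yield the same output:
for path in ["hheϵlϵlloo", "heeϵllϵloo", "hheeϵlϵlϵo"]:
    print(ctc_collapse(path))  # each prints "hello"
```

Note that the blank between the two `l` tokens is what allows a repeated character in the output to survive the merge step.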
22. End-to-end System
ST Target Text
Source speech
• A single model that learns to translate from audio to text in target
language, without intermediate discrete representation.
• Pros
• Avoiding error propagation
• Low latency
• Keep the paralinguistic information
• Cons
• The heavy burden of the encoder
• Limited training data (Low resource)
23. End-to-end System - Pre-training
• Pre-trained ASR encoder + pre-trained MT decoder
[Figure: ASR = encoder + decoder mapping speech features to source text; MT = encoder + decoder mapping source text to target text; ST reuses the ASR encoder and the MT decoder to map speech features to target text]
27. End-to-end System - Rethinking Pre-training
• Is the ASR encoder sufficient for the ST encoder?
• Intuition:
• ASR encoder: transcription (local dependency)
• MT encoder: understanding (global dependency)
• ST encoder: transcription + understanding (both)
• Verification via the localness of self-attention in the encoder: the sum of the attention weights assigned to the surrounding words/features within a fixed window.
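The localness measure can be sketched as follows (a minimal sketch; the window size and the averaging over positions are assumptions, not necessarily the exact definition used in the paper):

```python
import numpy as np

def localness(attention, window=3):
    """attention: (T, T) row-stochastic self-attention matrix.
    Returns the average attention mass each position assigns to
    neighbors within +/- `window` positions (1.0 = fully local)."""
    T = attention.shape[0]
    positions = np.arange(T)
    local_mask = np.abs(positions[:, None] - positions[None, :]) <= window
    return float((attention * local_mask).sum() / T)

# A purely diagonal attention matrix is maximally local:
print(localness(np.eye(8), window=1))  # prints 1.0
```

Under this measure, a high score indicates ASR-like local behavior, while uniformly spread attention (MT-like global behavior) scores low for small windows.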
30. End-to-end System - Rethinking Pre-training
• Different behavior between the MT encoder and the ASR/ST encoders
• Is local attention sufficient for speech translation?
• CTC shows a strong preference for local attention
• Global attention yields better ST performance
• Global attention yields worse ASR performance
• Conclusion
• The ST encoder is not a simple substitution of the ASR encoder (or the MT encoder).
34. SATE - Architecture
• Stacked Acoustic-and-Textual Encoding (SATE)
• Acoustic encoder: processes the acoustic features
• Textual encoder: generates the global representation
• Adaptor: alleviates the representation inconsistency issue
35. SATE - Adaptor
• Adaptor: alleviate the representation inconsistency issue
• Two principles:
• Adaptive: generate an embedding-like representation for the textual encoder
[Figure: the CTC distribution (e.g., 0.1, 0.25, 0.05, 0.2, 0.1, …) weights the token embeddings; X = soft token embedding]
38. SATE - Adaptor
• Adaptor: alleviate the representation inconsistency issue
• Two principles:
• Adaptive: generate an embedding-like representation for the textual encoder
• Informative: keep the paralinguistic information in the acoustic encoder output
• To do
• Length inconsistency: reduce the sequence length for the textual encoder
• Shrink mechanism, downsampling
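The adaptive principle, forming a soft token embedding by weighting the embedding matrix with the CTC distribution, can be sketched as (a minimal sketch; the shapes and function name are assumptions):

```python
import numpy as np

def soft_token_embedding(ctc_logits, embedding):
    """ctc_logits: (T, V) per-frame logits over the source vocabulary.
    embedding: (V, d) token embedding matrix of the textual encoder.
    Returns (T, d): each frame's expected token embedding under the
    CTC distribution, i.e. the soft token embedding X."""
    shifted = ctc_logits - ctc_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)  # CTC posterior per frame
    return probs @ embedding

# Frames with uniform logits average all token embeddings equally:
X = soft_token_embedding(np.zeros((4, 10)), np.ones((10, 8)))
print(X.shape)  # prints (4, 8)
```

Because X lives in the same space as the textual encoder's input embeddings, the textual encoder sees inputs close to what it was pre-trained on.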
39. SATE - MTKD
• Multi-teacher Knowledge Distillation (MTKD)
• Mimic the predicted distributions of the pre-trained models
[Figure: the pre-trained ASR encoder teaches the acoustic encoder via the CTC loss; the pre-trained MT model teaches the SATE output (Linear + Softmax) via a KD loss, combined with the translation loss]
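The KD term, which matches the student's softened output distribution to the teacher's, can be sketched as (a minimal sketch using cross-entropy against the teacher distribution; the temperature parameter is an assumption):

```python
import numpy as np

def softmax(logits, axis=-1):
    shifted = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return shifted / shifted.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between the teacher's and the student's softened
    output distributions, averaged over positions."""
    teacher_probs = softmax(teacher_logits / temperature)
    student_log_probs = np.log(softmax(student_logits / temperature))
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())
```

The total objective on the slide combines such a KD loss with the CTC loss on the acoustic encoder and the standard translation loss.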
44. SATE - Experiments
• Results on MuST-C En-De
• Degraded performance of the cascaded system
• Large performance margin when additional data is allowed
• Significant improvement with SATE
• SATE achieves performance comparable to the cascaded counterpart when large-scale ASR and MT data is available!
45. SATE - Experiments
• Results on LibriSpeech En-Fr
• SATE again achieves performance comparable to the cascaded counterpart when large-scale ASR and MT data is available!
48. SATE - Experiments
• Performance and speedup
• Effects of the adaptor
• Impact on localness
50. IWSLT2021 - Offline Speech Translation
• Data statistics of the ASR, MT, and ST corpora
• Data augmentation: translate the source-language transcriptions into target-language text to build synthetic ST data
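The augmentation step can be sketched as (a minimal sketch; `mt_translate` is a hypothetical stand-in for the pre-trained MT model):

```python
def augment_asr_corpus(asr_pairs, mt_translate):
    """asr_pairs: iterable of (audio, transcription) from the ASR corpus.
    mt_translate: hypothetical function, source text -> target text.
    Yields synthetic ST training pairs (audio, pseudo_translation)."""
    for audio, transcription in asr_pairs:
        yield audio, mt_translate(transcription)

# With a toy "translator" that upper-cases the text:
pairs = list(augment_asr_corpus([("utt1.wav", "hello")], str.upper))
print(pairs)  # prints [('utt1.wav', 'HELLO')]
```

This turns the large ASR corpus into extra audio-to-target-text training data for the end-to-end ST model.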
53. IWSLT2021 - Offline Speech Translation
• Baseline: CTC-based deep Transformer
• Architecture improvements: Conformer + relative position encoding (RPE) + SATE
• Final result: ensemble of multiple diverse models
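One common way to ensemble diverse models, averaging their per-step output distributions during decoding, can be sketched as (a minimal sketch; the slide does not specify the exact combination rule):

```python
import numpy as np

def ensemble_step(step_distributions):
    """step_distributions: list of (V,) next-token distributions,
    one per model in the ensemble. Returns the averaged distribution
    from which the decoder picks the next token."""
    return np.mean(np.stack(step_distributions), axis=0)

# Two models that disagree; the ensemble follows the stronger vote:
avg = ensemble_step([np.array([0.7, 0.3]), np.array([0.2, 0.8])])
print(avg)  # prints [0.45 0.55]
```

Diversity between the ensembled models (different architectures, different training data) is what makes the averaged distribution more robust than any single model.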
54. Future Work
• Adaptation of existing methods
• ASR, MT, NLP, and CV
• Multi-modal
• Utilization of additional training data
• Pre-training, multi-task learning
• Data augmentation
• Unified modeling of the speech input
• Speech features, wav2vec
• Simultaneous speech translation