Utilizing the Pre-trained Model Effectively
for Speech Translation
许晨 / Chen Xu
东北大学自然语言处理实验室 / NEU NLP Lab.
小牛翻译 / NiuTrans
Table of contents
• Introduction to end-to-end speech translation
• End-to-end speech translation with pre-training
• Stacked acoustic-and-textual encoding (SATE)
• Offline speech translation in the IWSLT2021 campaign
• Future work
Speech Translation - Classification
• Modality
  • Speech-to-text translation (ST)
  • Speech-to-speech translation (SS)
• Latency
  • Offline
  • Simultaneous/Streaming
• [Diagram: ST maps spoken 你好 to the text "Hello"; SS maps spoken 你好 to spoken "Hello".]
Cascaded System
• Cascade multiple independent models (a minimal sketch follows below)
• [Pipeline: Source speech → ASR → Source text → MT → Target text]
• Pros
  • Studied for a long time
  • Large amounts of data available
  • Each model can be evaluated and optimized explicitly
• Cons
  • Error propagation
  • High latency
  • Loses paralinguistic information that is not present in the text
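A minimal sketch of the cascade, assuming hypothetical asr_model and mt_model objects that each expose a predict() method (illustrative only, not a specific system's implementation):

    def cascaded_st(speech, asr_model, mt_model):
        # Speech -> source text via the ASR model
        source_text = asr_model.predict(speech)
        # Source text -> target text via the MT model
        target_text = mt_model.predict(source_text)
        # Any ASR mistake is passed on to MT (error propagation), and
        # running two models in sequence adds latency.
        return target_text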
A Unified View
• Learn a mapping function Y = F(X) from sequence to sequence
  • MT: source text → target text
  • ASR: source audio → source text
  • ST: source audio → target text
  • SS: source audio → target audio
  • TTS: source text → source audio
End-to-end System
• [Diagram: Source speech → ST → Target text]
• A single model that learns to translate from audio to text in the target language, without intermediate discrete representations.
• Pros
  • Avoids error propagation
  • Low latency
  • Keeps the paralinguistic information
• Cons
  • The heavy burden on the encoder
  • Limited training data (low resource)
End-to-end System - CTC
• Connectionist Temporal Classification (CTC): learn a soft alignment between the speech features and the source text (see the collapse-rule sketch below).
• [Diagram: Speech feature → Predicted sequence → Merge repeated tokens and remove ϵ → Output]
• Example alignment paths that all collapse to "hello":
  • h h e ϵ l ϵ l l o o
  • h e e ϵ l l ϵ l o o
  • h h e e ϵ l ϵ l ϵ o
  • …
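A minimal sketch of the CTC collapse rule in Python (the blank symbol ϵ and the single-character tokens are just for illustration):

    EPSILON = "ϵ"  # the CTC blank symbol

    def ctc_collapse(path):
        # Merge repeated tokens, then remove the blank symbol
        output, prev = [], None
        for token in path:
            if token != prev and token != EPSILON:
                output.append(token)
            prev = token
        return "".join(output)

    # All three example paths collapse to the same output
    assert ctc_collapse("hheϵlϵlloo") == "hello"
    assert ctc_collapse("heeϵllϵloo") == "hello"
    assert ctc_collapse("hheeϵlϵlϵo") == "hello"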
End-to-end System - Pre-training
• Pre-trained ASR encoder + pre-trained MT decoder (see the loading sketch below)
• [Diagram: ASR: speech feature → Encoder → Decoder → source text; MT: source text → Encoder → Decoder → target text; ST: speech feature → Encoder (from ASR) → Decoder (from MT) → target text]
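A minimal PyTorch sketch of this initialization, assuming hypothetical checkpoint files whose encoder/decoder state dicts match the ST architecture (all names and sizes are illustrative):

    import torch
    import torch.nn as nn

    class STModel(nn.Module):
        """Hypothetical ST model; any matching encoder/decoder pair works."""
        def __init__(self, hidden=256, nhead=4):
            super().__init__()
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=hidden, nhead=nhead), num_layers=12)
            self.decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model=hidden, nhead=nhead), num_layers=6)

    st_model = STModel()
    asr_ckpt = torch.load("asr_checkpoint.pt")  # hypothetical: {"encoder": state_dict}
    mt_ckpt = torch.load("mt_checkpoint.pt")    # hypothetical: {"decoder": state_dict}
    st_model.encoder.load_state_dict(asr_ckpt["encoder"])  # ASR encoder -> ST encoder
    st_model.decoder.load_state_dict(mt_ckpt["decoder"])   # MT decoder -> ST decoder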
End-to-end System - Rethinking Pre-training
• Is the ASR encoder sufficient for the ST encoder?
• Intuition:
  • ASR encoder: transcription (local dependency)
  • MT encoder: understanding (global dependency)
  • ST encoder: transcription + understanding (both)
• Verification with the localness of self-attention in the encoder
  • Localness: the sum of the attention weights assigned to the surrounding words/features within a fixed window (see the sketch below)
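A minimal sketch of the localness metric, assuming a single self-attention weight matrix (averaging over heads and layers is left out for brevity):

    import torch

    def localness(attn, window=3):
        # attn: [seq_len, seq_len] attention weights, each row sums to 1
        seq_len = attn.size(0)
        scores = []
        for i in range(seq_len):
            lo, hi = max(0, i - window), min(seq_len, i + window + 1)
            scores.append(attn[i, lo:hi].sum())  # attention mass inside the window
        return torch.stack(scores).mean()  # high value => local attention behavior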
End-to-end System - Rethinking Pre-training
• Different behavior between the MT encoder and the ASR/ST encoders
• Is local attention sufficient for speech translation?
  • Strong preference of CTC for local attention
  • Better performance with global attention in ST
  • Worse performance with global attention in ASR
• Conclusion
  • The ST encoder is not a simple substitute for the ASR encoder (or the MT encoder).
SATE - Architecture
• Stacked Acoustic-and-Textual Encoding (SATE)
  • Acoustic encoder: processes the acoustic features
  • Textual encoder: generates the global representation
  • Adaptor: alleviates the representation inconsistency issue
SATE - Adaptor
• Adaptor: alleviates the representation inconsistency issue
• Two principles:
  • Adaptive: generate an embedding-like representation for the textual encoder
    • [Diagram: soft token embedding = CTC distribution × token embedding matrix, e.g. weights 0.1 0.25 0.05 0.2 0.1 ··· over the vocabulary]
  • Informative: keep the paralinguistic information in the acoustic encoder output (see the sketch below)
• To do
  • Length inconsistency: reduce the sequence length for the textual encoder
    • Shrink mechanism, downsampling
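A minimal sketch of both principles, assuming per-frame CTC logits and the textual encoder's token embedding matrix; the interpolation weight alpha is a hypothetical illustration, not the paper's exact formulation:

    import torch

    def adaptor(acoustic_out, ctc_logits, embedding, alpha=0.5):
        # "Adaptive": soft token embedding = CTC distribution x embedding matrix
        # ctc_logits: [seq_len, vocab]; embedding: [vocab, hidden]
        soft_embedding = torch.softmax(ctc_logits, dim=-1) @ embedding
        # "Informative": mix the acoustic encoder output back in so that the
        # paralinguistic information survives (alpha is a hypothetical weight)
        return alpha * soft_embedding + (1 - alpha) * acoustic_out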
SATE - MTKD
• Multi-teacher Knowledge Distillation (MTKD)
• Mimic the predicted distributions of the pre-trained models (see the loss sketch below)
• [Diagram: the ASR teacher (speech features → encoder → CTC loss) and the MT teacher (source text → encoder → decoder → linear → softmax) each supply a KD loss for the corresponding part of SATE; SATE is trained with these KD losses plus the translation loss.]
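A minimal sketch of one KD term, assuming student and teacher logits over the same vocabulary (standard temperature-scaled distillation, not necessarily the paper's exact loss):

    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, temperature=1.0):
        # The student mimics the teacher's predicted distribution
        t = temperature
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        student_logp = F.log_softmax(student_logits / t, dim=-1)
        return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)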
SATE - Experiments
• Datasets

  Language | Restricted         | Unrestricted ASR   | Unrestricted MT
  En-De    | MuST-C (400h)      | LibriSpeech (960h) | OpenSubtitles2018 (18M)
  En-Fr    | LibriSpeech (100h) | —                  | WMT14 (10M)

• Models

  Model           | Restricted  | Unrestricted
  Arch            | Transformer | Conformer
  Hidden size     | 256         | 512
  FFN size        | 2048        | 2048
  Attention heads | 4           | 8
SATE - Experiments
• Results on MuST-C En-De
  • Degraded performance of the cascaded system
  • Large performance margin when additional data is allowed
  • Significant improvement with SATE
• SATE achieves comparable performance with the cascaded ST counterpart when large-scale ASR and MT data is available!
SATE - Experiments
• Results on LibriSpeech En-Fr
• SATE again achieves comparable performance with the cascaded ST counterpart when large-scale ASR and MT data is available!
SATE - Experiments
• Performance and speedup
• Effects of the adaptor
• Impact on localness
IWSLT2021 - Offline Speech Translation
• Data statistics of the ASR, MT, and ST corpora
• Data augmentation: translate the source-language transcriptions into target-language text
IWSLT2021 - Offline Speech Translation
• Baseline: CTC-based deep Transformer
• Architecture improvements: Conformer + relative position encoding (RPE) + SATE
• Final result: an ensemble of multiple diverse models (see the sketch below)
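A minimal sketch of decoding-time ensembling, assuming each model maps the current decoder input to next-token logits over a shared vocabulary (illustrative, not the submission's exact procedure):

    import torch

    def ensemble_step(models, decoder_input):
        # Average the member models' per-step output distributions
        probs = [torch.softmax(m(decoder_input), dim=-1) for m in models]
        return torch.stack(probs).mean(dim=0)  # fed into beam search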
Future Work
• Adaptation of existing methods
  • ASR, MT, NLP, and CV
  • Multi-modality
• Utilization of additional training data
  • Pre-training, multi-task learning
  • Data augmentation
• Unified modeling of the speech input
  • Speech features, wav2vec
• Simultaneous speech translation