Utilizing the Pre-trained Model Effectively
for Speech Translation
许晨 / Chen Xu
东北大学自然语言处理实验室 / NEU NLP Lab.
小牛翻译 / NiuTrans
Table of contents
• Introduction to end-to-end speech translation
• End-to-end speech translation with pre-training
• Stacked acoustic-and-textual encoding (SATE)
• Offline speech translation in the IWSLT2021 campaign
• Future work
Speech Translation - classification
• Modality
• Speech to text translation (ST)
• Speech to speech translation (SS)
• Latency
• Offline
• Simultaneous/Streaming
ST: 你好 (speech) → Hello (text)
SS: 你好 (speech) → Hello (speech)
Cascaded System
• Pipeline: Source speech → ASR → Source text → MT → Target text
• Cascades multiple independent models
• Pros
  • Studied for a long time
  • Large amounts of data available
  • Each model can be evaluated and optimized explicitly
• Cons
  • Error propagation
  • High latency
  • Loses paralinguistic information that is not present in the text
A Unified view
• Learn a mapping function Y = F(X) from sequence to sequence
  • ASR: source audio → source text
  • MT: source text → target text
  • ST: source audio → target text
  • TTS: source text → source audio
  • SS: source audio → target audio
End-to-end System
• Pipeline: Source speech → ST → Target text
• A single model that learns to translate audio into text in the target language, without an intermediate discrete representation.
• Pros
  • Avoids error propagation
  • Low latency
  • Preserves paralinguistic information
• Cons
  • Heavy burden on the encoder
  • Limited training data (low resource)
End-to-end System - CTC
• Connectionist Temporal Classification (CTC): learns a soft alignment between the speech features and the source text.
• Pipeline: speech feature → predicted sequence → merge repeated tokens and remove ϵ → output
• Example alignments that all yield "hello":
  • h h e ϵ l ϵ l l o o
  • h e e ϵ l l ϵ l o o
  • h h e e ϵ l ϵ l ϵ o
  • …
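The collapse rule above (merge consecutive repeats, then remove the blank ϵ) can be sketched as a small function; the blank symbol is just a placeholder token name here:

```python
def ctc_collapse(path, blank="ϵ"):
    """Collapse a CTC alignment: merge consecutive repeats, drop blanks."""
    out, prev = [], None
    for tok in path:
        # keep a token only when it differs from the previous frame
        # and is not the blank symbol
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out

# All three alignments above collapse to the same output:
for p in ["hheϵlϵlloo", "heeϵllϵloo", "hheeϵlϵlϵo"]:
    print("".join(ctc_collapse(list(p))))  # hello
```

Many different frame-level alignments map to the same output, which is why CTC sums over all of them during training.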
End-to-end System - Pre-training
• Pre-trained ASR encoder + pre-trained MT decoder
  • ASR: speech feature → Encoder → Decoder → source text
  • MT: source text → Encoder → Decoder → target text
  • ST: speech feature → Encoder (from the ASR model) → Decoder (from the MT model) → target text
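The transfer itself is just parameter copying. A minimal sketch with plain name → tensor dicts standing in for framework state_dicts; the "encoder."/"decoder." name prefixes are illustrative assumptions, not the actual parameter names:

```python
def init_st_from_pretrained(st_params, asr_params, mt_params):
    """Initialize an ST model from pre-trained ASR and MT models (sketch).

    Encoder weights come from the ASR model and decoder weights from
    the MT model; anything else keeps its random initialization.
    """
    for name in list(st_params):
        if name.startswith("encoder.") and name in asr_params:
            st_params[name] = asr_params[name]   # encoder from ASR
        elif name.startswith("decoder.") and name in mt_params:
            st_params[name] = mt_params[name]    # decoder from MT
    return st_params

st = {"encoder.w": 0, "decoder.w": 0, "head.w": 0}
st = init_st_from_pretrained(st, {"encoder.w": 1}, {"decoder.w": 2})
```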
End-to-end System - Rethinking Pre-training
• Is the ASR encoder sufficient for the ST encoder?
• Intuition:
  • ASR encoder: transcription (local dependency)
  • MT encoder: understanding (global dependency)
  • ST encoder: transcription + understanding (both)
• Verification via the localness of self-attention in the encoder:
  • the sum of attention weights assigned to the surrounding words/features within a fixed window
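This localness metric can be computed directly from an attention matrix. A sketch under stated assumptions: the window size and the choice to exclude attention to the position itself are illustrative, not fixed by the slide:

```python
import numpy as np

def localness(attn, window=3):
    """Average attention mass within +/-`window` of each position.

    attn: (T, T) self-attention matrix, each row summing to 1.
    The diagonal (attention to the position itself) is excluded here;
    whether to exclude it is a modeling choice.
    """
    T = attn.shape[0]
    mass = []
    for i in range(T):
        lo, hi = max(0, i - window), min(T, i + window + 1)
        mass.append(attn[i, lo:hi].sum() - attn[i, i])
    return float(np.mean(mass))
```

A head that places most of its mass on nearby frames (high localness) matches the ASR-like behavior; low localness indicates the global behavior seen in MT encoders.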
End-to-end System - Rethinking Pre-training
• Different behavior between the MT encoder and the ASR/ST encoders
• Is local attention sufficient for speech translation?
  • CTC shows a strong preference for local attention
  • Global attention improves performance in ST
  • Global attention hurts performance in ASR
• Conclusion
  • The ST encoder is not a simple substitute for the ASR encoder (or the MT encoder).
SATE - Architecture
• Stacked Acoustic-and-Textual Encoding (SATE)
• Acoustic encoder: processes the acoustic features
• Textual encoder: generates the global representation
• Adaptor: alleviates the representation inconsistency issue
SATE - Adaptor
• Adaptor: alleviates the representation inconsistency between the acoustic and textual encoders
• Two principles:
  • Adaptive: generate an embedding-like representation for the textual encoder
    • Soft token embedding X = CTC distribution × embedding matrix (e.g., a frame with CTC weights 0.1, 0.25, 0.05, 0.2, 0.1, … mixes the corresponding token embeddings)
  • Informative: keep the paralinguistic information in the output of the acoustic encoder
• Remaining issue:
  • Length inconsistency: reduce the sequence length for the textual encoder
    • Shrink mechanism, downsampling
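Both ideas fit in a few lines. A sketch, with shapes and the blank id as illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def soft_token_embedding(ctc_logits, embedding):
    """Adaptive principle: expected token embedding under the CTC
    distribution, X = softmax(logits) @ embedding_matrix.

    ctc_logits: (T, V) per-frame logits over the source vocabulary.
    embedding:  (V, d) embedding matrix of the textual encoder.
    Returns (T, d) embedding-like frame representations.
    """
    e = np.exp(ctc_logits - ctc_logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)  # softmax over the vocab
    return probs @ embedding

def ctc_shrink(feats, labels, blank=0):
    """Length reduction: average consecutive frames that share the same
    CTC argmax label and drop blank frames (shrink-mechanism sketch)."""
    groups, cur = [], [0]
    for t in range(1, len(labels)):
        if labels[t] == labels[t - 1]:
            cur.append(t)          # same label: extend the group
        else:
            groups.append(cur)
            cur = [t]
    groups.append(cur)
    kept = [np.mean(feats[g], axis=0) for g in groups if labels[g[0]] != blank]
    return np.stack(kept) if kept else np.zeros((0, feats.shape[1]))
```

The soft embedding keeps the output differentiable and close to the textual encoder's input space, while the shrink step addresses the length mismatch between frames and tokens.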
SATE - MTKD
• Multi-teacher Knowledge Distillation (MTKD)
• Mimic the predicted distributions of the pre-trained models
  • ASR teacher: speech features → encoder, supervised with a CTC loss
  • MT teacher: source text → encoder → decoder → linear → softmax, providing a KD loss
  • SATE student: linear → softmax outputs trained with the KD losses together with the translation loss
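A minimal sketch of one distillation term: the cross-entropy of the student against the teacher's (optionally temperature-softened) distribution. This shows a single KD term only; MTKD combines such terms from both teachers with the translation loss, and the temperature knob is an illustrative assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student against the teacher's soft targets.

    Shapes: (N, V) logits over the vocabulary, one row per position.
    """
    p_teacher = softmax(teacher_logits / temperature)
    log_p_student = np.log(softmax(student_logits / temperature))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())
```

The loss is minimized exactly when the student reproduces the teacher's distribution, which is what lets the pre-trained models guide training.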
SATE - Experiments
• Datasets

  Language | Restricted         | Unrestricted (ASR) | Unrestricted (MT)
  En-De    | MuST-C (400h)      | LibriSpeech (960h) | OpenSubtitles2018 (18M)
  En-Fr    | LibriSpeech (100h) |                    | WMT14 (10M)

• Models

  Setting         | Restricted  | Unrestricted
  Arch            | Transformer | Conformer
  Hidden size     | 256         | 512
  FFN size        | 2048        | 2048
  Attention heads | 4           | 8
SATE - Experiments
• Results on MuST-C En-De
  • Degraded performance of the cascaded system
  • Large performance margin when additional data is allowed
  • Significant improvement with SATE
• SATE achieves performance comparable to its cascaded ST counterpart when large-scale ASR and MT data is available!
SATE - Experiments
• Results on LibriSpeech En-Fr
• SATE achieves performance comparable to its cascaded ST counterpart when large-scale ASR and MT data is available!
SATE - Experiments
• Performance and speedup
• Effects of the adaptor
• Impact on localness
IWSLT2021 - Offline Speech Translation
• Data statistics of the ASR, MT, and ST corpora
• Data augmentation: translate the source-language transcriptions into target-language text
IWSLT2021 - Offline Speech Translation
• Baseline: CTC-based deep Transformer
• Architecture improvements: Conformer + RPE + SATE
• Final result: an ensemble of multiple diverse models
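Ensembling at decoding time is typically done by averaging the models' per-step output distributions. A minimal sketch, where the probability vectors stand in for real decoder outputs:

```python
import numpy as np

def ensemble_step(prob_dists):
    """Average the next-token distributions from several models.

    prob_dists: list of (V,) probability vectors, one per model.
    Averaging in probability space keeps the result a valid
    distribution; averaging log-probabilities is a common alternative.
    """
    avg = np.mean(np.stack(prob_dists), axis=0)
    return int(np.argmax(avg)), avg
```

Diversity among the ensembled models (different architectures, seeds, or data) is what makes the averaged prediction more robust than any single model.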
Future Work
• Adaptation of existing methods
  • ASR, MT, NLP, and CV
  • Multi-modality
• Utilization of additional training data
  • Pre-training, multi-task learning
  • Data augmentation
• Unified modeling of the speech input
  • Speech features, wav2vec
• Simultaneous speech translation
More Related Content

Similar to Utilizing the Pre-trained Model Effectively for Speech Translation

Using Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, Microsoft
Using Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, MicrosoftUsing Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, Microsoft
Using Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, MicrosoftGuhan Suriyanarayanan
 
Speaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet VocoderSpeaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet VocoderAkira Tamamori
 
Scaling Massive Elasticsearch Clusters
Scaling Massive Elasticsearch ClustersScaling Massive Elasticsearch Clusters
Scaling Massive Elasticsearch ClustersSematext Group, Inc.
 
How We Scaled Bert To Serve 1+ Billion Daily Requests on CPU
How We Scaled Bert To Serve 1+ Billion Daily Requests on CPUHow We Scaled Bert To Serve 1+ Billion Daily Requests on CPU
How We Scaled Bert To Serve 1+ Billion Daily Requests on CPUDatabricks
 
What is machine translation
What is machine translationWhat is machine translation
What is machine translationStephen Peacock
 
Rust All Hands Winter 2011
Rust All Hands Winter 2011Rust All Hands Winter 2011
Rust All Hands Winter 2011Patrick Walton
 
AltaVista Search Engine Architecture
AltaVista Search Engine ArchitectureAltaVista Search Engine Architecture
AltaVista Search Engine ArchitectureChangshu Liu
 
CMP 221.pptx computer science machine and assembly language
CMP 221.pptx computer science machine and assembly languageCMP 221.pptx computer science machine and assembly language
CMP 221.pptx computer science machine and assembly languageomotunwaserejoice
 
Compiler Construction
Compiler ConstructionCompiler Construction
Compiler ConstructionAhmed Raza
 
An Introduction to Natural Language Processing
An Introduction to Natural Language ProcessingAn Introduction to Natural Language Processing
An Introduction to Natural Language ProcessingTyrone Systems
 
2017:12:06 acl読み会"Learning attention for historical text normalization by lea...
2017:12:06 acl読み会"Learning attention for historical text normalization by lea...2017:12:06 acl読み会"Learning attention for historical text normalization by lea...
2017:12:06 acl読み会"Learning attention for historical text normalization by lea...ayaha osaki
 
STS with Presto Engineering
STS with Presto EngineeringSTS with Presto Engineering
STS with Presto EngineeringHank Lydick
 

Similar to Utilizing the Pre-trained Model Effectively for Speech Translation (20)

SPEECH CODING
SPEECH CODINGSPEECH CODING
SPEECH CODING
 
Using Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, Microsoft
Using Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, MicrosoftUsing Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, Microsoft
Using Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, Microsoft
 
Build your own ASR engine
Build your own ASR engineBuild your own ASR engine
Build your own ASR engine
 
Speaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet VocoderSpeaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet Vocoder
 
Scaling Massive Elasticsearch Clusters
Scaling Massive Elasticsearch ClustersScaling Massive Elasticsearch Clusters
Scaling Massive Elasticsearch Clusters
 
How We Scaled Bert To Serve 1+ Billion Daily Requests on CPU
How We Scaled Bert To Serve 1+ Billion Daily Requests on CPUHow We Scaled Bert To Serve 1+ Billion Daily Requests on CPU
How We Scaled Bert To Serve 1+ Billion Daily Requests on CPU
 
What is machine translation
What is machine translationWhat is machine translation
What is machine translation
 
Deep Learning Summit (DLS01-4)
Deep Learning Summit (DLS01-4)Deep Learning Summit (DLS01-4)
Deep Learning Summit (DLS01-4)
 
Speech encoding techniques
Speech encoding techniquesSpeech encoding techniques
Speech encoding techniques
 
Rust All Hands Winter 2011
Rust All Hands Winter 2011Rust All Hands Winter 2011
Rust All Hands Winter 2011
 
AltaVista Search Engine Architecture
AltaVista Search Engine ArchitectureAltaVista Search Engine Architecture
AltaVista Search Engine Architecture
 
CMP 221.pptx computer science machine and assembly language
CMP 221.pptx computer science machine and assembly languageCMP 221.pptx computer science machine and assembly language
CMP 221.pptx computer science machine and assembly language
 
Compiler Construction
Compiler ConstructionCompiler Construction
Compiler Construction
 
Data representation
Data representationData representation
Data representation
 
add9.5.ppt
add9.5.pptadd9.5.ppt
add9.5.ppt
 
Data representation
Data representationData representation
Data representation
 
An Introduction to Natural Language Processing
An Introduction to Natural Language ProcessingAn Introduction to Natural Language Processing
An Introduction to Natural Language Processing
 
2017:12:06 acl読み会"Learning attention for historical text normalization by lea...
2017:12:06 acl読み会"Learning attention for historical text normalization by lea...2017:12:06 acl読み会"Learning attention for historical text normalization by lea...
2017:12:06 acl読み会"Learning attention for historical text normalization by lea...
 
Speaker Segmentation (2006)
Speaker Segmentation (2006)Speaker Segmentation (2006)
Speaker Segmentation (2006)
 
STS with Presto Engineering
STS with Presto EngineeringSTS with Presto Engineering
STS with Presto Engineering
 

Recently uploaded

Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe中 央社
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform EngineeringMarcus Vechiato
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuidePixlogix Infotech
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdfMuhammad Subhan
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTopCSSGallery
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfdanishmna97
 
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdfFrisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdfAnubhavMangla3
 
Microsoft BitLocker Bypass Attack Method.pdf
Microsoft BitLocker Bypass Attack Method.pdfMicrosoft BitLocker Bypass Attack Method.pdf
Microsoft BitLocker Bypass Attack Method.pdfOverkill Security
 
Vector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxVector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxjbellis
 
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...SOFTTECHHUB
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxMarkSteadman7
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewDianaGray10
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsLeah Henrickson
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireExakis Nelite
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!Memoori
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingScyllaDB
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentationyogeshlabana357357
 

Recently uploaded (20)

Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development Companies
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdfFrisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
 
Microsoft BitLocker Bypass Attack Method.pdf
Microsoft BitLocker Bypass Attack Method.pdfMicrosoft BitLocker Bypass Attack Method.pdf
Microsoft BitLocker Bypass Attack Method.pdf
 
Vector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxVector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptx
 
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overview
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 

Utilizing the Pre-trained Model Effectively for Speech Translation

  • 1. Utilizing the Pre-trained Model Effectively for Speech Translation 许晨 / Chen Xu 东北大学自然语言处理实验室 / NEU NLP Lab. 小牛翻译 / NiuTrans
  • 2. Table of contents • Introduction to end-to-end speech translation • End-to-end speech translation with pre-training • Stacked acoustic-and-textual encoding (SATE) • Offline speech translation in IWSLT2021 Campaign • Future work
  • 3. Speech Translation - classification • Modality • Speech to text translation (ST) • Speech to speech translation (SS) ST 你好 Hello SS 你好 Hello
  • 4. Speech Translation - classification • Modality • Speech to text translation (ST) • Speech to speech translation (SS) • Latency • Offline • Simultaneous/Streaming ST 你好 Hello SS 你好 Hello
  • 5. Cascaded System • Cascade the multiple independent models ASR MT Target Text Source speech Source Text
  • 6. Cascaded System • Cascade the multiple independent models ASR MT Target Text Source speech Source Text
  • 7. Cascaded System • Cascade the multiple independent models ASR MT Target Text Source speech Source Text
  • 8. Cascaded System • Cascade the multiple independent models ASR MT Target Text Source speech Source Text
  • 9. Cascaded System ASR MT Target Text Source speech Source Text • Cascade the multiple independent models • Pros • Study for long time • A large amounts of data available • Evaluate and optimize each model explicitly
  • 10. Cascaded System ASR MT Target Text Source speech Source Text • Cascade the multiple independent models • Pros • Study for long time • A large amounts of data available • Evaluate and optimize each model explicitly • Cons • Error propagation • High latency • Missing the paralinguistic information not present in the text
  • 11. A Unified view ST Source audio Target text
  • 12. A Unified view MT Source text Target text ASR Source audio Source text ST Source audio Target text
  • 13. • Learn a mapping function Y = F(x) from sequence to sequence A Unified view MT Source text Target text ASR Source audio Source text ST Source audio Target text
  • 14. MT Source text Target text ASR Source audio Source text ST Source audio Target text SS Source audio Target audio TTS Source text Source audio A Unified view • Learn a mapping function Y = F(x) from sequence to sequence
  • 15. End-to-end System • A single model that learns to translate from audio to text in target language, without intermediate discrete representation.
  • 16. End-to-end System ST Target Text Source speech • A single model that learns to translate from audio to text in target language, without intermediate discrete representation. • Pros • Avoiding error propagation • Low latency • Keep the paralinguistic information
  • 17. End-to-end System ST Target Text Source speech • A single model that learns to translate from audio to text in target language, without intermediate discrete representation. • Pros • Avoiding error propagation • Low latency • Keep the paralinguistic information • Cons • The heavy burden of the encoder • Limited training data (Low resource)
  • 18. End-to-end System ST Target Text Source speech • A single model that learns to translate from audio to text in target language, without intermediate discrete representation. • Pros • Avoiding error propagation • Low latency • Keep the paralinguistic information • Cons • The heavy burden of the encoder • Limited training data (Low resource)
  • 19. End-to-end System - CTC • Connectionist Temporal Classification: learn the soft alignment between the speech feature and source text.
  • 20. End-to-end System - CTC • Connectionist Temporal Classification: learn the soft alignment between the speech feature and source text. Speech feature Predicted sequence Merge repeat tokens and remove ϵ Output
  • 21. End-to-end System - CTC • Connectionist Temporal Classification: learn the soft alignment between the speech feature and source text. Speech feature Predicted sequence Merge repeat tokens and remove ϵ Output h h e ϵ l ϵ l l o o h e e ϵ l l ϵ l o o h h e e ϵ l ϵ l ϵ o …
  • 22. End-to-end System ST Target Text Source speech • A single model that learns to translate from audio to text in target language, without intermediate discrete representation. • Pros • Avoiding error propagation • Low latency • Keep the paralinguistic information • Cons • The heavy burden of the encoder • Limited training data (Low resource)
  • 23. End-to-end System - Pre-training • Pre-trained ASR encoder + Pre-trained MT decoder ASR Speech feature Decoder Encoder Source text MT Source text Decoder Encoder Target text ST Speech feature Decoder Encoder Target text
  • 24. End-to-end System - Rethinking Pre-training • Is the ASR encoder sufficient for the ST encoder?
  • 25. End-to-end System - Rethinking Pre-training • Is the ASR encoder sufficient for the ST encoder? • Intuition: • ASR encoder: transcription • MT encoder: understanding • ST encoder: transcription + understanding
  • 26. End-to-end System - Rethinking Pre-training • Is the ASR encoder sufficient for the ST encoder? • Intuition: • ASR encoder: transcription (local dependency) • MT encoder: understanding (global dependency) • ST encoder: transcription + understanding (both)
  • 27. End-to-end System - Rethinking Pre-training • Is the ASR encoder sufficient for the ST encoder? • Intuition: • ASR encoder: transcription (local dependency) • MT encoder: understanding (global dependency) • ST encoder: transcription + understanding (both) • Verification with the Localness of self-attention in the encoder • The sum of attention weights to the surrounding words/features within a fixed window
  • 28. End-to-end System - Rethinking Pre-training • Different behavior between MT encoder and ASR/ST encoder
  • 29. End-to-end System - Rethinking Pre-training • Different behavior between MT encoder and ASR/ST encoder • Is local attention sufficient for speech translation? • Strong preference of CTC for local attention • Better performance with global attention in ST • Worse performance with global attention in ASR
  • 30. End-to-end System - Rethinking Pre-training • Different behavior between MT encoder and ASR/ST encoder • Is local attention sufficient for speech translation? • Strong preference of CTC for local attention • Better performance with global attention in ST • Worse performance with global attention in ASR • Conclusion • The ST encoder is not a simple substitution of the ASR encoder (or the MT encoder).
  • 31. SATE - Architecture • Stacked Acoustic-and-Textual Encoding (SATE)
  • 32. SATE - Architecture • Stacked Acoustic-and-Textual Encoding (SATE) • Acoustic encoder: process the acoustic features
  • 33. SATE - Architecture • Stacked Acoustic-and-Textual Encoding (SATE) • Acoustic encoder: process the acoustic features • Textual encoder: generate the global representation
  • 34. SATE - Architecture • Stacked Acoustic-and-Textual Encoding (SATE) • Acoustic encoder: process the acoustic features • Textual encoder: generate the global representation • Adaptor: alleviate the representation inconsistency issue
  • 35. SATE - Adaptor • Adaptor: alleviate the representation inconsistency issue • Two principles: • Adaptive: generate the embedding-like representation for the textual encoder (figure: the soft token embedding, computed by weighting the token embeddings with the CTC distribution)
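The soft token embedding on this slide, i.e. the expectation of token embeddings under the CTC distribution, can be written directly. This is a hedged numpy sketch; the function and argument names are my own, not from the paper's code.

```python
import numpy as np

def soft_token_embedding(ctc_logits, embedding):
    """Adaptive principle of the SATE adaptor (sketch): instead of
    committing to the CTC argmax token, take the expectation of token
    embeddings under the CTC softmax distribution at every frame.

    ctc_logits: (frames, vocab)  raw CTC scores from the acoustic encoder
    embedding:  (vocab, dim)     token embedding table of the textual encoder
    """
    # Numerically stable softmax over the vocabulary at each frame
    z = ctc_logits - ctc_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Expected embedding: CTC-probability-weighted sum of token embeddings
    return probs @ embedding  # (frames, dim)
```

When the CTC distribution is confident, the result is close to a single token embedding; when it is uncertain, the frame is represented by a mixture, which keeps the interface to the textual encoder differentiable.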
  • 36. SATE - Adaptor • Adaptor: alleviate the representation inconsistency issue • Two principles: • Adaptive: generate the embedding-like representation for the textual encoder • Informative: keep the paralinguistic information in the output of the acoustic encoder
  • 37. SATE - Adaptor • Adaptor: alleviate the representation inconsistency issue • Two principles: • Adaptive: generate the embedding-like representation for the textual encoder • Informative: keep the paralinguistic information in the output of the acoustic encoder
  • 38. SATE - Adaptor • Adaptor: alleviate the representation inconsistency issue • Two principles: • Adaptive: generate the embedding-like representation for the textual encoder • Informative: keep the paralinguistic information in the output of the acoustic encoder • To do • Length inconsistency: reduce the sequence length for the textual encoder • Shrink mechanism, downsampling
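One common shrink mechanism operates on the frame-level CTC argmax path: merge consecutive repeats, then drop blanks. This is an illustrative sketch of that idea, one option alongside downsampling, not necessarily the exact variant used in SATE.

```python
def ctc_shrink(frame_ids, blank=0):
    """Shrink a frame-level CTC argmax path to roughly token length:
    collapse runs of the same label, then remove blank frames.
    In practice the acoustic states of each collapsed run would be
    pooled (e.g. averaged) to form the shortened feature sequence.

    frame_ids: per-frame argmax token ids from the CTC output
    """
    shrunk, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            shrunk.append(t)
        prev = t
    return shrunk

# e.g. blank=0: [0, 3, 3, 0, 0, 5, 5, 5, 0, 3] -> [3, 5, 3]
```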
  • 39. SATE - MTKD • Multi-teacher Knowledge Distillation (MTKD) • Mimic the predicted distributions of the pre-trained models (figure: SATE training combines the CTC loss on the source text, KD losses against the softmax outputs of the pre-trained ASR and MT teachers, and the translation loss)
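The per-position distillation term, in which the student mimics a teacher's predicted distribution, is the standard soft-label cross-entropy. A minimal numpy sketch, assuming temperature-softened softmax on both sides (the temperature and reduction are illustrative choices, not confirmed details of MTKD):

```python
import numpy as np

def kd_loss(student_logits, teacher_logits, T=1.0):
    """Cross-entropy between the teacher's softened distribution and the
    student's, averaged over positions -- the usual KD objective used to
    mimic a pre-trained teacher (here, the ASR or MT model).

    *_logits: (positions, vocab) raw scores.
    """
    def softmax(x):
        z = x - x.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(teacher_logits / T)               # teacher distribution
    log_q = np.log(softmax(student_logits / T))   # student log-probs
    return float(-(p * log_q).sum(axis=-1).mean())
```

In MTKD this term appears twice, once per teacher, and is summed with the CTC and translation losses.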
  • 40. SATE - Experiments
  • Datasets:
    Language | Restricted         | Unrestricted ASR   | Unrestricted MT
    En-De    | MuST-C (400h)      | LibriSpeech (960h) | OpenSubtitles2018 (18M)
    En-Fr    | LibriSpeech (100h) | -                  | WMT14 (10M)
  • Models:
    Model           | Restricted  | Unrestricted
    Arch            | Transformer | Conformer
    Hidden size     | 256         | 512
    FFN size        | 2048        | 2048
    Attention heads | 4           | 8
  • 41. SATE - Experiments • Results on the MuST-C En-De • Degraded performance of the cascaded system
  • 42. SATE - Experiments • Results on the MuST-C En-De • Degraded performance of the cascaded system • Large performance margin when the additional data is allowed
  • 43. SATE - Experiments • Results on the MuST-C En-De • Degraded performance of the cascaded system • Large performance margin when the additional data is allowed • Significant improvement with SATE
  • 44. SATE - Experiments • Results on the MuST-C En-De • Degraded performance of the cascaded system • Large performance margin when the additional data is allowed • Significant improvement with SATE • Achieves performance comparable to the cascaded ST counterpart when large-scale ASR and MT data are available!
  • 45. SATE - Experiments • Results on the LibriSpeech En-Fr • Achieves performance comparable to the cascaded ST counterpart when large-scale ASR and MT data are available!
  • 46. SATE - Experiments • Performance and speedup
  • 47. SATE - Experiments • Performance and speedup • Effects of the adaptor
  • 48. SATE - Experiments • Performance and speedup • Effects of the adaptor • Impact on localness
  • 49. IWSLT2021 - Offline Speech Translation • Data statistics of the ASR, MT, and ST corpora
  • 50. IWSLT2021 - Offline Speech Translation • Data statistics of the ASR, MT, and ST corpora • Data augmentation: translate the source-language transcriptions into target-language text
  • 51. IWSLT2021 - Offline Speech Translation • Baseline: CTC-based deep Transformer
  • 52. IWSLT2021 - Offline Speech Translation • Baseline: CTC-based deep Transformer • Architecture improvement: Conformer + RPE + SATE
  • 53. IWSLT2021 - Offline Speech Translation • Baseline: CTC-based deep Transformer • Architecture improvement: Conformer + RPE + SATE • Final result: ensemble multiple diverse models
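The ensemble step combines several diverse models at decoding time. A simple sketch of one common recipe, averaging the members' next-token distributions in probability space, is shown below; this is an illustration of ensembling in general, not the exact combination used in the IWSLT submission.

```python
import numpy as np

def ensemble_log_probs(log_probs_list):
    """One decoding step of a model ensemble: average the member
    models' next-token distributions in probability space, then
    return log-probs for the beam-search scorer.

    log_probs_list: list of (vocab,) log-probability arrays,
                    one per ensemble member.
    """
    probs = np.mean([np.exp(lp) for lp in log_probs_list], axis=0)
    return np.log(probs)
```

Diversity among the members (e.g. Transformer vs. Conformer backbones, with and without SATE) is what makes the averaged distribution stronger than any single model.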
  • 54. Future Work • Adaptation of the existing methods • ASR, MT, NLP, and CV • Multi-modal • Utilization of the additional training data • Pre-training, multi-task learning • Data augmentation • Unified modeling for the speech input • Speech features, wav2vec • Simultaneous speech translation