• Standard text-based translation systems are not enough in a
world with thousands of languages, because traditional systems
have drawbacks when used to build speech-to-speech translation.
• Traditional systems employ a cascade of processes, so computing
costs and inference latency increase with each stage.
• The cascaded method cannot be used to translate into every
spoken language, because more than 40% of the world's languages
lack a writing system.
Direct Speech-to-Speech
Translation (S2ST)
Meta's Version
of Direct S2ST
Advancing S2ST with discrete
units
• Enables faster inference and supports translation
between unwritten languages.
• It does not rely on text generation as an intermediate
step
• Trained using actual, publicly available audio data
instead of synthetic audio for numerous language pairs.
• Instead of spectrograms, the researchers used discretized
speech units, which they derived by clustering
self-supervised speech representations.
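As a rough illustration of how clustering can turn continuous speech features into discrete units, here is a minimal sketch in plain Python. It is not the researchers' code: the 2-D toy frames stand in for real self-supervised features, the cluster count of 3 is a hypothetical choice, and collapsing runs of repeated units is included as an assumption about common post-processing.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, frame):
    """Index of the centroid closest to the given frame."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(centroids[k], frame))


def kmeans(frames, k, iterations=20, seed=0):
    """Plain k-means; initial centroids are k frames sampled at random."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for frame in frames:
            clusters[nearest(centroids, frame)].append(frame)
        for idx, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[idx] = [sum(m[d] for m in members) / len(members)
                                  for d in range(dim)]
    return centroids


def to_units(frames, centroids):
    """Map each frame to the index of its nearest centroid (its unit)."""
    return [nearest(centroids, f) for f in frames]


def collapse_repeats(units):
    """Merge consecutive duplicate units, e.g. [3, 3, 7] becomes [3, 7]."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


# Toy "utterance": nine 2-D frames (hypothetical values, not real features).
frames = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
          [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],
          [9.0, 0.1], [9.1, 0.0], [9.0, 0.0]]
centroids = kmeans(frames, k=3)
units = to_units(frames, centroids)
print("frame-level units:", units)
print("collapsed units:  ", collapse_repeats(units))
```

The unit sequence, rather than a spectrogram, then serves as the prediction target for the translation model.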
Meta’s Grip
on Translation
Much Faster
and Better
The S2ST system performs
better than earlier direct S2ST
systems
Trained on real
data
It is the first direct S2ST system to
be trained on real S2ST data
for many language pairings
Use of
Pretraining
It makes use of pretraining with
unlabeled speech data.
Mark with a better
solution
• To enable direct speech-to-speech translation, the researchers trained the direct S2ST
system with self-supervised discrete units as targets (speech-to-unit translation, or S2UT).
• They propose a transformer-based sequence-to-sequence model with a speech encoder
and a discrete unit decoder.
Models and Improvements
S2ST model with discrete units.
A transformer-based S2UT model with a
speech encoder and a discrete unit
decoder
Flowchart and Finetuning process
Speech encoder and decoder
Two-pass decoding mechanism
The first-pass decoder generates text in
a related language (Mandarin), and the
second-pass decoder creates units.
Illustration of the textless S2ST model
• The left side is the speech-to-unit translation (S2UT) model with an auxiliary task, while the right side is the
unit-based HiFi-GAN vocoder for unit-to-speech conversion.
Experiment Results
An average 3.2 BLEU gain when training the S2ST model on
the VoxPopuli S2ST dataset, compared to a baseline trained
on un-normalized speech targets. They also incorporated
automatically mined S2ST data.
The S2ST model that predicts discrete units outperforms
earlier direct S2ST systems.
6.6-12.1 BLEU gain
Additional 2.0 BLEU gain
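For readers unfamiliar with the metric behind these gains, the following is a minimal sketch of corpus-level BLEU in plain Python. It assumes whitespace tokenization, a single reference per hypothesis, and no smoothing, so its numbers will not match standard evaluation tools. For speech outputs, BLEU is usually computed on ASR transcripts of the generated audio; the example sentences here are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus BLEU (0-100) with one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it appears in the reference.
            matches[n - 1] += sum(min(c, r_counts[g])
                                  for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t)
                        for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


reference = ["the cat sat on the mat"]
baseline = ["the cat sat on a mat"]    # hypothetical system outputs
improved = ["the cat sat on the mat"]
gain = corpus_bleu(improved, reference) - corpus_bleu(baseline, reference)
print(f"BLEU gain: {gain:.1f}")
```

A "BLEU gain" as reported on the slide is simply the difference between two such scores on the same test set.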
Experiment Data:
• Their study uses the Fisher Spanish-English speech translation corpus, which comprises 139K sentences (about
170 hours) from Spanish-speaking telephone conversations, transcribed in both Spanish and English.
• To model target speech in English, Spanish, or French, they train a single mHuBERT model on the 100k subset of
VoxPopuli unlabeled speech, which contains 4.5k hours of data for En, Es, and Fr.
• They use the VoxPopuli ASR dataset and convert its text transcriptions to reference units for training the speech
normalizer. TTS data is used for the HiFi-GAN vocoder, along with VAD to remove the silence at both ends of the audio.
https://github.com/pytorch/fairseq/blob/main/examples/speech_to_speech/docs/textless_s2st_real_data.md
En: https://huggingface.co/facebook/tts_transformer-en-ljspeech, Es: https://huggingface.co/facebook/tts_transformer-es-css10
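The silence-removal step can be illustrated with a simple energy threshold. This is only a stand-in: the slides do not say which VAD was used, and the frame size, threshold, and sample values below are hypothetical.

```python
def frame_energies(samples, frame_size):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [sum(s * s for s in samples[i:i + frame_size]) / frame_size
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def trim_silence(samples, frame_size=4, threshold=0.01):
    """Drop leading and trailing frames whose energy is below threshold.

    Silence inside the utterance is kept; only the two ends are trimmed.
    A trailing partial frame is ignored and therefore dropped.
    """
    energies = frame_energies(samples, frame_size)
    voiced = [i for i, e in enumerate(energies) if e > threshold]
    if not voiced:
        return []
    start = voiced[0] * frame_size
    end = (voiced[-1] + 1) * frame_size
    return samples[start:end]


# Toy waveform: silence, a short burst, then silence again.
audio = [0.0] * 8 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 8
print(trim_silence(audio))
```

A production VAD would typically be more robust to background noise than a fixed threshold.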
Future of Translation
Simultaneous
translation
Large collection of S2ST
developed through our
innovative NLP toolkit called
LASER.
SpeechMatrix
Building high-quality S2ST
models without any human
annotations.
Unsupervised Learning
Break down language barriers
in both the physical world and
the metaverse
Handshake between realms
References
“Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation”
https://arxiv.org/abs/2204.02967
“Direct Speech-to-Speech Translation With Discrete Units”
https://arxiv.org/abs/2107.05604
“Textless Speech-to-Speech Translation on Real Data”
https://arxiv.org/abs/2112.08352
“Speech-to-speech translation between untranscribed unknown languages”
https://arxiv.org/abs/1910.00795
SiddhantSancheti_MediumShortStory.pptx
