SiddhantSancheti_MediumShortStory.pptx

• Standard text-based translation systems are not enough in
today's world, which has thousands of languages, because the
traditional design has drawbacks when used to build
speech-to-speech translation systems.
• Traditional systems employ a cascade of stages, so computing
cost and inference latency increase with each stage.
• This method cannot be used to translate into every spoken
language, because more than 40% of the languages in the
world lack a writing system.
Direct Speech-to-Speech
Translation (S2ST)

Meta's Version
of Direct S2ST
Advancing S2ST with discrete
units
• Enables faster inference and supports translation
between unwritten languages.
• It does not rely on text generation as an intermediate
step.
• Trained on real, publicly available audio data
instead of synthetic audio for numerous language pairs.
• Instead of spectrograms, the researchers used discretized
speech units, which they derived by clustering
self-supervised speech representations.
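
The clustering step in the last bullet can be illustrated with a minimal sketch. It assumes each speech frame has already been turned into a feature vector by a self-supervised model (not shown here), and it uses plain k-means as the clustering method. The function names, the toy 2-dimensional features, and the choice of 2 units are hypothetical and picked only for illustration; they are not taken from the slides or the papers.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(frame, centroids):
    # Index of the closest centroid; this index is the frame's discrete unit.
    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))

def kmeans(frames, num_units, iterations=20, seed=0):
    # Plain k-means: start from randomly chosen frames, then alternate
    # between assigning frames to centroids and re-averaging each cluster.
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, num_units)]
    for _ in range(iterations):
        clusters = [[] for _ in range(num_units)]
        for frame in frames:
            clusters[nearest_centroid(frame, centroids)].append(frame)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[k] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids

def discretize(frames, centroids):
    # Replace every continuous frame with the ID of its nearest centroid.
    return [nearest_centroid(f, centroids) for f in frames]

# Toy stand-in for self-supervised features: two well-separated groups.
toy_frames = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1],
              [5.1, 4.9], [0.05, 0.05], [4.9, 5.0]]
centroids = kmeans(toy_frames, num_units=2)
units = discretize(toy_frames, centroids)
print(units)
```

The resulting list of integers is the kind of discrete unit sequence that a unit decoder could be trained to predict in place of spectrogram frames.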

Meta’s Grip
over Translation
Much Faster
and Better
The S2ST system performs
better than earlier direct S2ST
systems
Trained on real
data
It is the first direct S2ST system to
be trained on real S2ST data
for many language pairings
Use of
Pretraining
It makes use of pretraining with
unlabeled speech data.

Mark with a better
solution
• The researchers used self-supervised discrete units as training targets (speech-to-unit
translation, or S2UT) so that the direct S2ST system translates speech to speech
through discrete units.
• They propose a transformer-based sequence-to-sequence model with a
speech encoder and a discrete unit decoder.

Models and Improvements
S2ST model with discrete units.
A transformer-based S2UT model with a
speech encoder and a discrete unit
decoder
Flowchart and Finetuning process
Speech encoder and decoder
Two-pass decoding mechanism
The first-pass decoder generates text in
a related language (Mandarin), and the
second-pass decoder creates units.

Illustration of the textless S2ST model
• The left side is the speech-to-unit translation (S2UT) model with an auxiliary task, while the right side is the
unit-based HiFi-GAN vocoder for unit-to-speech conversion.

Experiment Results
Average 3.2 BLEU gain when training the S2ST model on
the VoxPopuli S2ST dataset, compared to a baseline trained
on un-normalized speech targets. They also incorporated
automatically mined S2ST data.
The S2ST model that predicts discrete units outperforms
earlier direct S2ST systems.
6.6-12.1 BLEU gain
Additional 2.0 BLEU gain
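
Since every result above is reported in BLEU, a simplified sentence-level version of the metric may help in reading the numbers. This is only a sketch: it scores one hypothesis against one reference using clipped n-gram precision and a brevity penalty, on a 0-100 scale. It is not the scoring script used in the papers, and the function name `simple_bleu` is hypothetical. For speech output, the generated audio would first have to be transcribed, which is not shown here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Count every contiguous n-gram in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, reference, max_n=4):
    # Simplified BLEU for a single sentence pair, on a 0-100 scale.
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped overlap: each hypothesis n-gram is credited at most as
        # many times as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(overlap / total)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return 100.0 * brevity_penalty * math.exp(log_precision_sum / max_n)

score = simple_bleu("the cat sat on the mat", "the cat sat on a mat")
print(round(score, 1))
```

A gain of a few BLEU points therefore means that the system output shares noticeably more word sequences with the reference translations.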

Experiment Data:
• The study uses the Fisher Spanish-English speech translation corpus, which comprises 139K sentences (about
170 hours) from Spanish telephone conversations, with transcriptions in both Spanish and English.
• To model target speech in English, Spanish, or French, they train a single mHuBERT model on the 100k subset of
VoxPopuli unlabeled speech, which contains 4.5k hours of data for En, Es, and Fr.
• They use the VoxPopuli ASR dataset and convert the text transcriptions to reference units for training the speech
normalizer. TTS data is used for the HiFi-GAN vocoder, along with VAD to remove the silence at both ends of the audio.
https://github.com/pytorch/fairseq/blob/main/examples/speech_to_speech/docs/textless_s2st_real_data.md
En: https://huggingface.co/facebook/tts_transformer-en-ljspeech, Es: https://huggingface.co/facebook/tts_transformer-es-css10
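
Removing silence at both ends of the audio, as mentioned in the last bullet, can be sketched with a very simple energy threshold. The slides do not say which VAD method was used, so this is a stand-in under stated assumptions, not the authors' pipeline. The function name `trim_silence`, the frame size of 160 samples, and the threshold value are all hypothetical.

```python
def trim_silence(samples, frame_size=160, energy_threshold=1e-4):
    # Split the signal into fixed-size frames and mark each frame as voiced
    # when its mean energy reaches the threshold.
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples), frame_size)]
    voiced = [sum(s * s for s in f) / len(f) >= energy_threshold
              for f in frames]
    if not any(voiced):
        return []  # the whole clip is below the threshold
    first = voiced.index(True)
    last = len(voiced) - 1 - voiced[::-1].index(True)
    # Keep everything from the first to the last voiced frame, so pauses
    # inside the utterance are preserved.
    return samples[first * frame_size:(last + 1) * frame_size]

# Toy signal: two silent frames, one loud frame, two silent frames.
silence = [0.0] * 320
speech = [0.5, -0.5] * 80
trimmed = trim_silence(silence + speech + silence)
print(len(trimmed))
```

Only the leading and trailing frames are dropped; a production VAD would typically be more robust to background noise than a fixed energy threshold.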

Future of Translation
Simultaneous
translation
SpeechMatrix
Large collection of S2ST
developed through our
innovative NLP toolkit called
LASER.
Unsupervised Learning
Building high-quality S2ST
models without any human
annotations.
Handshake between realms
Break down language barriers
in both the physical world and
the metaverse

References
“Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation”
https://arxiv.org/abs/2204.02967
“Direct Speech-to-Speech Translation With Discrete Units”
“Textless Speech-to-Speech Translation on Real Data”
“Speech-to-speech translation between untranscribed unknown languages”
