Weapons Manufacturer Raytheon Open Sources Speech Translation Dataset.pdf

RAYTHEON, A COMPANY THAT MANUFACTURES WEAPONS, HAS MADE ITS SPEECH TRANSLATION DATASET PUBLICLY AVAILABLE
www.slator.com

Researchers Shannon Wotherspoon, William Hartmann, and Matthew Snover from Raytheon BBN published a paper in March 2024 introducing a corpus that pairs Mandarin Chinese audio training data for speech machine translation (MT) with matched text translations into English. Raytheon BBN, based in Cambridge, Massachusetts, USA, is a research and technology company within the Raytheon (RTX) group, a major defense contractor.

The researchers explained that the purpose of this paired dataset of source-language speech and target-language text was to provide a Mandarin-English corpus for training end-to-end speech translation systems and for improving cascaded systems as well.

The researchers argue that the resulting corpus is “addressing a critical gap in resources and underscoring the importance of domain-specific data in advancing the state-of-the-art in speech translation.”
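
To make the distinction between cascaded and end-to-end systems concrete, here is a minimal sketch. It is not the researchers' implementation: the `Audio` type, `make_cascaded`, and the toy models are hypothetical stand-ins, and the "audio" is just UTF-8 bytes so the example can run without any speech toolkit.

```python
from typing import Callable

# Hypothetical stand-in: real audio would be a waveform, not UTF-8 bytes.
Audio = bytes


def make_cascaded(
    asr: Callable[[Audio], str],
    mt: Callable[[str], str],
) -> Callable[[Audio], str]:
    """Build a cascaded translator: speech -> Mandarin text -> English text."""

    def translate(audio: Audio) -> str:
        transcript = asr(audio)   # step 1: automatic speech recognition
        return mt(transcript)     # step 2: text-to-text machine translation

    return translate


# Toy components, invented purely for illustration.
def toy_asr(audio: Audio) -> str:
    return audio.decode("utf-8")


def toy_mt(text: str) -> str:
    return {"你好": "hello"}.get(text, "<unk>")


def toy_end_to_end(audio: Audio) -> str:
    """An end-to-end system maps speech to English text in a single step."""
    return {"你好".encode("utf-8"): "hello"}.get(audio, "<unk>")


cascaded = make_cascaded(toy_asr, toy_mt)
sample = "你好".encode("utf-8")
print(cascaded(sample), toy_end_to_end(sample))
```

In this framing, paired speech and English text can supervise the end-to-end function directly, while the transcript and translation pairs can serve the MT step of a cascade.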

The Raytheon BBN researchers built the corpus from 123.5 hours of Mandarin telephone conversations. The speech data was obtained from two public datasets: the CallHome Mandarin Chinese Speech and the HKUST Mandarin Telephone Speech datasets.

The CallHome dataset contained 242 unscripted telephone conversations between native Mandarin speakers, whereas the HKUST dataset contained 90 hours of speech from 1,124 conversations between Mandarin speakers (not necessarily all native speakers) in Mainland China.

The data were split into train, development, and test sets, with the train set being a mix of both Mandarin datasets. For the two development sets and the test set, the researchers used only CallHome dataset conversations.
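
The split rule described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' actual split: the `Conversation` record, the `split_corpus` helper, the split names, and all conversation IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    conv_id: str  # hypothetical identifier
    corpus: str   # "callhome" or "hkust"


def split_corpus(conversations, heldout_ids):
    """Assign conversations to splits.

    heldout_ids maps a split name (e.g. "dev1") to a set of conversation IDs.
    Anything not held out goes to "train". Held-out conversations must come
    from CallHome, mirroring the rule reported in the article.
    """
    splits = {"train": []}
    for name in heldout_ids:
        splits[name] = []
    for conv in conversations:
        target = "train"
        for name, ids in heldout_ids.items():
            if conv.conv_id in ids:
                if conv.corpus != "callhome":
                    raise ValueError(f"{conv.conv_id}: held-out data must be CallHome")
                target = name
                break
        splits[target].append(conv)
    return splits


# Invented example data.
conversations = [
    Conversation("ch_001", "callhome"),
    Conversation("ch_002", "callhome"),
    Conversation("ch_003", "callhome"),
    Conversation("ch_004", "callhome"),
    Conversation("hk_001", "hkust"),
]
splits = split_corpus(
    conversations,
    {"dev1": {"ch_001"}, "dev2": {"ch_002"}, "test": {"ch_003"}},
)
print({name: [c.conv_id for c in convs] for name, convs in splits.items()})
```

Splitting by whole conversation rather than by utterance keeps one speaker's turns from landing in both train and test, which is a common precaution for conversational speech data.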

The English text translations were produced by Mandarin-English bilingual annotators at Appen, working from transcripts. The annotators did not have access to the audio for the conversations and used the surrounding transcript text as context. For the final text corpus, identical speech utterances were translated only once, regardless of frequency, and the annotators were instructed “to preserve any disfluencies, hesitations, or code-switching present in the data.”
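
Translating identical utterances only once can be sketched as a deduplicate-then-project step. This is a minimal illustration, assuming "identical" means an exact string match on the transcript; the function names, utterances, and translations below are invented for the example and do not come from the corpus.

```python
def unique_utterances(transcripts):
    """Return the distinct utterances in first-seen order."""
    return list(dict.fromkeys(transcripts))


def project_translations(transcripts, translations):
    """Pair every utterance, repeats included, with its single translation."""
    return [(text, translations[text]) for text in transcripts]


# Invented transcript lines; short backchannels repeat often in phone calls.
transcripts = ["嗯", "对", "嗯", "你好吗", "对"]

to_translate = unique_utterances(transcripts)

# Hypothetical annotator output, one entry per distinct utterance.
translations = {"嗯": "Mm.", "对": "Right.", "你好吗": "How are you?"}

corpus = project_translations(transcripts, translations)
print(len(transcripts), "utterances,", len(to_translate), "sent for translation")
```

Here five utterances need only three translations. One trade-off of this approach is that a repeated utterance receives the same English text everywhere it occurs, whatever the surrounding context.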
For their experiments, the researchers used the output of an automatic speech recognition (ASR) model from Raytheon BBN’s own speech processing platform, called “Sage,” which the company introduced in 2016.

Slator is the leading source of news and research for the global translation, localization, and language technology industry. Our Advisory practice is a trusted partner to clients looking for independent analysis. Headquartered in Zurich, Slator has a presence in Asia, Europe, and the US.
