RAYTHEON, A COMPANY THAT MANUFACTURES WEAPONS, HAS MADE ITS SPEECH TRANSLATION DATASET PUBLICLY AVAILABLE
www.slator.com
Researchers Shannon Wotherspoon, William Hartmann, and Matthew Snover from Raytheon BBN published a paper in March 2024 introducing a corpus for speech machine translation (MT) that pairs Mandarin Chinese audio training data with matched text translations into English. Raytheon BBN is a research and technology company based in Cambridge, Massachusetts, USA, within the Raytheon (RTX) group, a major defense contractor.
The researchers explained that the purpose of this dataset, which pairs source-language speech with target-language text, was to provide a Mandarin-English corpus for training end-to-end speech translation systems and for improving cascaded systems as well.
The researchers argue that the resulting corpus is “addressing a critical gap in resources and underscoring the importance of domain-specific data in advancing the state-of-the-art in speech translation.”
The Raytheon BBN researchers sourced their data from 123.5 hours of Mandarin telephone conversations. The speech data were obtained from two public datasets: the CallHome Mandarin Chinese Speech dataset and the HKUST Mandarin Telephone Speech dataset.
The CallHome dataset contained 242 unscripted telephone conversations between native Mandarin speakers, whereas the HKUST dataset contained 90 hours of speech from 1,124 conversations between Mandarin speakers (not necessarily all native speakers) in Mainland China.
The data were split into train, development, and test sets, with the train set being a mix of both Mandarin datasets. For the two development sets and the test set, the researchers used only conversations from the CallHome dataset.
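
The split described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Conversation` record, the corpus labels, the split names, and the held-out ID sets are hypothetical and do not come from the paper. The only property taken from the text is that the development and test sets draw on CallHome conversations alone, while the train set mixes both corpora.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    conv_id: str  # hypothetical conversation identifier
    corpus: str   # "callhome" or "hkust" (hypothetical labels)


def split_conversations(conversations, dev1_ids, dev2_ids, test_ids):
    """Assign conversations to train, two development sets, and a test set.

    Held-out IDs must refer to CallHome conversations; every other
    conversation, from either corpus, goes to train.
    """
    held_out = {
        "dev1": set(dev1_ids),
        "dev2": set(dev2_ids),
        "test": set(test_ids),
    }
    splits = {"train": [], "dev1": [], "dev2": [], "test": []}
    for conv in conversations:
        target = "train"
        for name, ids in held_out.items():
            if conv.conv_id in ids:
                if conv.corpus != "callhome":
                    raise ValueError(
                        f"{conv.conv_id}: held-out sets must be CallHome only"
                    )
                target = name
                break
        splits[target].append(conv)
    return splits
```

Splitting at the conversation level, rather than by utterance, keeps all utterances of one call in the same set. That is a design choice of this sketch, not something the article states.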
The text translations into English were done by Mandarin-English bilingual annotators at Appen, working from transcripts. The annotators did not have access to the audio for the conversations and used the surrounding transcript text as context. For the final text corpus, identical speech utterances were translated only once, regardless of frequency, and the annotators were instructed “to preserve any disfluencies, hesitations, or code-switching present in the data.”
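
The rule that identical utterances are translated only once amounts to deduplicating transcript lines before annotation and then mapping each translation back to every occurrence. A minimal sketch, assuming exact string matching after whitespace trimming (the article does not say how identity was determined) and using hypothetical helper names:

```python
def unique_utterances(transcripts):
    """Return the distinct utterances in first-seen order."""
    seen = set()
    unique = []
    for text in transcripts:
        key = text.strip()  # assumption: whitespace differences are ignored
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def apply_translations(transcripts, translations):
    """Map every utterance, including repeats, to its single translation."""
    return [translations[text.strip()] for text in transcripts]


# Hypothetical example: the repeated backchannel is sent for translation once.
transcripts = ["对", "你好吗", "对"]
to_translate = unique_utterances(transcripts)
translations = {"对": "Right", "你好吗": "How are you"}
english = apply_translations(transcripts, translations)
```

Here `to_translate` holds two items for three transcript lines, and `english` restores one translation per original line.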
For their experiments, the researchers used output from an automatic speech recognition (ASR) model from Raytheon BBN’s own speech processing platform, called “Sage,” which the company introduced in 2016.
Slator is the leading source of news and research for the global translation, localization, and language technology industry. Our Advisory practice is a trusted partner to clients looking for independent analysis. Headquartered in Zurich, Slator has a presence in Asia, Europe, and the US.