Context: Real-time speech translation technology is available today, but we still lack a complete understanding of how it may affect communication in global software projects. Goal: To investigate the adoption of combined speech recognition and machine translation to overcome language barriers among stakeholders who remotely negotiate software requirements.
Method: We performed an empirical simulation-based study involving the Google Web Speech API and the Google Translate service, two groups of four subjects speaking Italian and Brazilian Portuguese, and a test set of 60 technical and non-technical utterances.
Results: Our findings revealed that, overall: (i) satisfactory speech recognition accuracy was achieved, although significantly affected by speaker and utterance differences; (ii) adequate translations tend to follow accurate transcripts, meaning that speech recognition is the most critical part of speech translation technology.
Conclusions: Results provide initial but positive evidence for the possibility of using speech translation technologies to help globally distributed team members communicate in their native languages.
1. An Empirical Simulation-based Study of Real-Time Speech Translation for Multilingual Global Project Teams
Fabio Calefato
Filippo Lanubile
University of Bari, Italy
Rafael Prikladnicki
João Henrique Stocker Pinto
PUCRS, Brazil
ESEM'14 - Turin, Sept. 18-19, 2014 1
2. Motivation
• Global software projects are challenged by language differences
  – Especially in requirements meetings
• Speech translation technology can support remote meetings in countries with
  – Opportunities for global projects
  – A lack of English-speaking professionals
• Goal:
  Evaluate the feasibility of adopting real-time speech translation to support multilingual requirements meetings
3. Speech Translation = Speech Recognition + Machine Translation

Speech Recognition
• First prototypes date back to the early '70s
  – Appropriate for dictation only, not for real-time captioning of speech
  YET
  – Recent progress, especially with mobile devices
  – Need for further investigation

Machine Translation
• First prototypes date back to the '50s
  – Still far from 100% accurate for multilingual group communication
  YET
  – Not disruptive of the conversation flow
  – Does not prevent completion of complex tasks
  – Even grants more balanced discussions
4. Research questions
• RQ1: How well does speech translation work
for continuous speech in global software
projects?
• RQ2: How does technical jargon affect speech
translation in global software projects?
5. Simulation-based study
• 8 Participants
– Software engineering professionals
– 4 from Bari (Italy) and 4 from Porto Alegre (Brazil)
– 7 males, 1 female
6. Instrumentation
• 60 sentences from 5 requirements workshop logs
– Half containing jargon
– Half generic
– Increasing length (# words 5-30)
• Manually translated EN -> IT, PT
• Google Chrome Web Speech API Demo + Google
Translate
[Slides 7-8: screenshots of the simulation pipeline: a speech transcript (IT / PT) produced by speech recognition, and the resulting translation (IT / PT / EN)]
9. Translation Adequacy scoring scheme

Category 4: Completely adequate. The translation clearly reflects the information contained in the original sentence. It is perfectly clear, intelligible, grammatically correct, and reads like ordinary text.

Category 3: Fairly adequate. The translation generally reflects the information contained in the original sentence, despite some inaccuracies or infelicities in the text. It is generally clear and intelligible, and one can (almost) immediately understand what it means.

Category 2: Poorly adequate. The translation poorly reflects the information contained in the original sentence. It contains grammatical errors and/or poor word choices. The general idea of the text is intelligible only after considerable study.

Category 1: Completely inadequate. The translation is unintelligible and it is not possible to obtain the information contained in the original sentence. Studying the meaning of the text is hopeless and, even allowing for context, one feels that guessing would be too unreliable.

Adapted from: D. Arnold et al., "Machine Translation: An Introductory Guide" (1994)
10. Results: Speech Recognition Accuracy (1/2)

• Minimal differences in mean accuracy
  – by Language and Lexicon
  – by Speaker (except PT-Speaker2)

Language   Mean          Lexicon   Mean
IT         .81           Generic   .80
PT         .75           Jargon    .77

Speaker       Language   Mean
PT-Speaker1   PT         .78
PT-Speaker2   PT         .68
PT-Speaker3   PT         .79
PT-Speaker4   PT         .73
IT-Speaker1   IT         .76
IT-Speaker2   IT         .88
IT-Speaker3   IT         .78
IT-Speaker4   IT         .82
11. Results: Speech Recognition Accuracy (2/2)

UNIANOVA

Source                            df   Mean Square   F         Sig.
Intercept                         1    290.785       587.408   .017
Language                          1    .460          2.948     .144
Speaker(Language)                 6    .166          12.907    .003†
Lexicon                           1    .125          3.285     .104
Replication(Lexicon)              58   .082          1.740     .018†
Language * Lexicon                1    .003          .068      .797
Language * Replication(Lexicon)   58   .047          3.004     .000
Lexicon * Speaker(Language)       6    .013          .817      .557
† Significant at 5% level

• Speaker(Language) and Replication(Lexicon) are the only significant factors
14. Conclusions: RQ1
How well does speech translation work for
continuous speech?
• Our study setup: Simulation of a conversation
– Similar to: Automatic generation of closed captioning
from webcasts
• Our findings
– In line with baseline: 75% word accuracy [1]
YET
• Adequate speech translations 31-41%
• Some domains more critical than others
[1] Munteanu et al. “Collaborative editing for improved usefulness and usability of transcript-enhanced webcasts”, CHI’08
15. Conclusions: RQ2
How does technical jargon affect speech
translation?
• No evidence that jargon generates worse
speech translations
– At least, in the CS domain
HOWEVER
• Professionals read jargon differently
– e.g., “SQL” → SEQUEL, spelled in Italian, in
English…
16. Study limitations & future work

Limitations:
- Simulation-based study
  - What would happen in a real setting?
  - Refine the transcription accuracy construct (errors)
- One technology only
  - i.e., Google's Web Speech API and Translate
- Effect of accents, pronunciations, gender?
  - i.e., only 8 speakers, 1 female

Future work:
+ Run a controlled experiment
  + Multi-language group task
  + Distinguish between incorrect and missing words
+ Compare more speech translation solutions
  + e.g., Nuance, Sphinx, Bing
+ Involve more speakers in experiments
  + Also include EN native speakers
Some background information on speech translation, which is the combination of two technologies: SR + MT.
SR
Research in the past decade showed it appropriate for dictation only, not for real-time captioning of speech [1].
However, recent technological progress in automatic speech recognition has also found its way into mobile devices, which definitely calls for further investigation, especially in combination with machine translation.
MT More established technology (~60 years in the making)
Our findings indicate that state-of-the-art MT technology is already a viable solution for multilingual group communication, since it is not disruptive of the conversation flow, it does not prevent groups from completing complex tasks, and it even grants more balanced discussions. Yet, currently available MT technology is still far from 100% accurate and, as such, its adoption comes with costs. In fact, translation inaccuracies need to be repaired by rephrasing the original content, thus causing a decrease in efficiency.
I am going to present a study whose overall goal is to assess the use of real-time speech translation to support communication in multilingual requirements meetings.
Let me first discuss the research questions addressed by this study.
RQ1 –Research from the past decade has shown evidence that the speech recognition technology available was unsuitable for providing real-time captioning or transcription of speech [14]. Although commercial speech recognition tools available today claim to achieve a word recognition accuracy as high as 99%, they have been developed for dictation rather than to produce a transcript from a continuous and unbroken stream without any punctuation [2][3][18].
RQ2 – When stakeholders communicate during requirements meetings, many technical words are used. On top of that, technical words might even be in a language different from the one used by speakers. For instance, lawyers sometimes use Latin jargon; computer scientists typically use technical words in English. As such, technical jargon is less likely to occur both in real communication and in training sets used to build language models for speech recognition engines. Therefore, it has been previously observed that speech recognition errors are more likely to occur in words given a very low probability by the language model [16].
This simulation-based study is a necessary preliminary step towards the design of future experiments that will involve real-time communication among individuals, augmented with speech translation.
As the test set, we selected 60 sentences of growing length (word count, min. 5, max. 30). The sentences were selected from real chat logs in English, collected from five requirements workshops run as part of an experiment on the effects of text-based communication in distributed requirements engineering [6]. Participants in each workshop ranged from five to eight undergraduate students attending a requirements engineering course at the University of Victoria, Canada. During a workshop, the participants, either acting as a client or as a developer, had first to elicit the requirements specification of a web application (first session); then, they had to negotiate and reach closure on the previously collected requirements (second session).
Generic utterances contained only words that are included in an Italian or Portuguese dictionary.
Jargon utterances contained one or more technical terms or characteristic acronyms used by software developers.
The selected sentences were manually translated by two of the researchers from English into both Italian and Brazilian Portuguese. The original utterances in English, together with the manual translations into Italian and Portuguese, formed the experimental sample.
We were extremely careful to keep both the original meaning and the interaction style intact.
For each of the 60 utterances in the sample, a speaker started by clicking on the microphone icon and spoke until the end of the utterance. Participants spoke in a colloquial style at their own pace. If the researcher realized that the spoken utterance differed from the original content, or the speaker stopped before reaching the end of the utterance, then the researcher invited the speaker to try again. On average, a speaker finished the simulation in about 30 minutes, at a pace of two utterances per minute.
We have different independent and dependent variables for speech recognition and speech translation.
As for SR:
Lexicon is a fixed-effect factor.
The 30 replications under each lexicon level are considered a random-effect factor nested under Lexicon.
Transcript accuracy is the standard measure used to evaluate speech recognition system performance, where errors include missing and wrong words.
As T_acc is defined in [-1, 1], it is then normalized as T'_acc = (T_acc + 1) / 2 to express its values as percentages ([0, 1]).
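The accuracy computation can be sketched in code (a minimal sketch with hypothetical helper names, assuming T_acc = (N - errors) / N over word-level edit errors; the study's exact error-counting procedure may differ):

```python
# Sketch of transcript accuracy (T_acc) and its normalization (T'_acc).
# Assumption: errors are counted as word-level edit operations
# (missing, wrong, and extra words); the paper may count differently.

def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # missing word
                            curr[j - 1] + 1,          # extra word
                            prev[j - 1] + (r != h)))  # wrong word
        prev = curr
    return prev[-1]

def transcript_accuracy(reference, hypothesis):
    """T_acc = (N - errors) / N, bounded below at -1."""
    n = len(reference.split())
    return max((n - word_errors(reference, hypothesis)) / n, -1.0)

def normalized_accuracy(t_acc):
    """T'_acc = (T_acc + 1) / 2, mapping [-1, 1] onto [0, 1]."""
    return (t_acc + 1) / 2
```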
As for MT
In the notation, the language on the left is the source language in which the speaker read the sentences; the one on the right is the target language into which they were translated.
Adequacy
Whereas the effectiveness of a machine translation service relates to the fluency and fidelity of the translated output, the effectiveness of a speech recognition system relates to the number of words correctly recognized in a spoken sentence (errors in the speech recognition process negatively affect the outcome of machine translation).
The scoring scheme is:
- not too fine-grained
- without middle values: this avoids central tendency bias by forcing raters to judge a translation as either adequate (points 3-4) or inadequate (points 1-2)
- clear in its descriptions to raters
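The forced adequate/inadequate split can be sketched as follows (hypothetical ratings, for illustration only):

```python
# Sketch: binarize 4-point adequacy ratings into adequate (3-4)
# vs. inadequate (1-2); the ratings below are made up for illustration.

def is_adequate(rating):
    return rating >= 3

ratings = [4, 2, 3, 1, 3]  # hypothetical ratings from one language pair
adequate = sum(is_adequate(r) for r in ratings)
inadequate = len(ratings) - adequate
```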
Once all the subjects completed their tasks, one researcher at UniBari rated the quality of all the translations from Italian to English (IT->EN) and from Portuguese to Italian (PT->IT); likewise, one researcher at PUCRS rated the translations from Brazilian Portuguese to English (PT->EN) and from Italian to Brazilian Portuguese (IT->PT).
We note that the sets of language pairs rated by the two researchers were disjoint, so no inter-rater agreement (e.g., through Cronbach's alpha) could be measured.
The left-hand-side table reports the mean values of the transcript accuracy measured by language and lexicon. In both cases, we observe minimal differences. In fact, the mean accuracy for utterances spoken in Italian is 81%, whereas for Brazilian Portuguese it is 75%.
Likewise, slightly better accuracy results were achieved on average for generic utterances (80%) as compared to jargon utterances (77%).
The right-hand-side table, instead, reports the average accuracy per speaker. In this case, too, we cannot observe large differences. The only noticeable result is the performance of the Brazilian subject PT-Speaker2, who achieved the lowest accuracy (68%), especially when compared to the best accuracy, achieved by the Italian subject IT-Speaker2 (88%).
Finally, to identify any differences in transcript accuracy produced by the factors and their interactions, we ran a univariate analysis of variance (UNIANOVA procedure). The analysis showed that differences in the speakers and the sentences (replications) significantly affected the result of the speech recognition process.
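The full model is a nested UNIANOVA, but the underlying F-ratio can be illustrated with a plain one-way ANOVA over a single factor such as Speaker (a simplified sketch, not the study's actual analysis, which also nests replications and tests interactions):

```python
# Simplified sketch: one-way ANOVA F statistic, F = MS_between / MS_within.
# Each inner list holds the accuracy values observed for one speaker.

def one_way_f(groups):
    k = len(groups)                      # number of groups (speakers)
    n = sum(len(g) for g in groups)      # total observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```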
Goal: identify differences in the quality of translation produced according to the various combinations of language pairs and lexicon
We first evaluated translation results by language pair, counting how many sentences were rated adequate (i.e., categories 4 and 3) and inadequate (i.e., categories 1 and 2). The figure on the left shows this breakdown.
We can observe a similar behavior for all the combinations, with a minimum of 75 (PT->IT, 31%) and a maximum of 99 (IT->EN, 41%) adequately translated utterances.
Then, we performed a similar analysis evaluating the adequacy of translation results further grouped by Lexicon. The figure on the right-hand side shows that, for all four language pairs, the inadequate translations again outnumber the adequate ones regardless of the lexicon. In other words, generic utterances were translated no more adequately than jargon utterances, independently of the language pair.
Finally, because recognition and translation are clearly interdependent, as translation adequacy is affected by the accuracy of the transcript produced in the first step of the process, we computed Spearman's rho to measure this correlation.
The results in the table show that, regardless of both the lexicon and the language pair, there is a moderate positive correlation between transcription accuracy and translation adequacy. In other words, when the speech recognition component produced an inaccurate transcription, the machine translation tended to produce a less adequate translation.
One explanation for the only moderate correlation is that, while there may be cases where an inadequate translation occurred despite an accurate transcription, the opposite, an adequate translation from an inaccurate transcription, can never happen.
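Spearman's rho itself can be sketched in a few lines (a pure-Python sketch using mean ranks for ties; the inputs would be per-utterance accuracy and adequacy values, not reproduced here):

```python
# Sketch of Spearman's rank correlation: Pearson correlation of rank vectors,
# with tied values receiving their mean rank.

def rank(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                        # extend over a run of tied values
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1  # mean of 1-based positions i..j
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```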
75% is perfectly in line with our own findings.
Yet, speech recognition in our case is only the first of two steps. The final translation results show that, no matter what the language pair was, the number of inadequate translations outnumbered the number of adequate ones.
Moreover, the correlation test, which showed only a moderate correlation between accurate transcriptions and adequate translations, indicates that speech recognition is the critical component of a speech translation system, so 75% as a baseline might be too low.
And when we’re talking about domains, we’re talking about technical words part of the specific domain vocabulary, or in one word, jargon
A more encouraging result, with respect to RQ2, is that we found no evidence that jargon generates worse speech translations.
As future work, we intend to seek confirmation of these initial results.
In the simulation just described, several professional developers read utterances unrelated to each other into a speech translation system. Although collected from several real requirements meetings, such a set does not fully represent a real requirements workshop augmented with speech translation. In fact, our simulation does not take into account factors like task completion, communication flow, context, and grounding. Therefore, we acknowledge the need to perform future controlled experiments that involve cross-language group communication augmented with speech translation. In particular, we will compare groups of people who communicate through a speech translation system, using either English or their native languages, to complete communication-intensive tasks in the context of globally distributed development teams.
In our simulation we only used one speech translation system (Google Translate mobile). Therefore, findings might not extend to other existing speech translation technologies available. We acknowledge the need to compare the performance of more systems in our future work.
Finally, our findings showed that speech recognition accuracy was significantly affected by speaker and utterance differences. As such, we acknowledge that the limited number of speakers (4 for each of the two source languages) and utterances (30 for each of the two kinds of lexicon) is not ideal from a statistical point of view. Such limitations will be addressed in future replications.
Therefore, in future controlled experiments we will involve more subjects, possibly also native EN speakers, to understand the effect of non-native pronunciation when using EN as a lingua franca in multilingual groups.