Analyzing and Predicting MT Utility and Post-Editing Productivity in Enterprise-Scale Translation Projects by Olga Beregovaya and David Clarke from Welocalize
Alon Lavie and Michael Denkowski from Safaba Translation Solutions
Safaba Welocalize MT Summit 2013 Analyzing MT Utility and Post-Editing
1. Analyzing and Predicting MT Utility and Post-Editing Productivity in Enterprise-Scale Translation Projects
Olga Beregovaya and David Clarke, Welocalize
Alon Lavie and Michael Denkowski, Safaba Translation Solutions
2. Challenges & Objectives
Status quo – “big picture” unknowns at the launch of an enterprise MT-based program:
o Is the source content suitable for MT?
o Is the MT-driven program going to render productivity gains compared to human translation across all languages?
o Are all the segments in the job going to perform at the same level?
Solution: Segment-level Predictive Analysis
o Reveal correlations between productivity, expected MT PE quality and intrinsic properties of the text being translated
o Predict machine-translated segment utility and level of effort
3. Data Statistics
Feature                 DATA SET 1                                 DATA SET 2
Content Domain          Website – Combined Marketing & Technical   Website – Combined Marketing & Technical
Content Status          Live (production)                          Live (production)
File Origin             TMS System                                 TMS System
Total Unique Segments   8168                                       2855
Locales                 16                                         11
5. Methodology
o Analysis performed by Welocalize and Safaba in a live, enterprise-scale MT post-editing project environment
o Underlying data based on MT post-editing productivity information collected on a per-segment basis via an open-source CAT tool (iOmegaT)
o The analysis contrasts and correlates the collected productivity data with several MT quality evaluation metrics, human evaluation by trained post-editors, and detailed characteristic properties of the source text
o The data is used to develop segment-level automated quality estimation scores, which are used to predict the expected utility of MT-generated translation segments in future production projects.
6. Evaluation Environment
Pre-processing middleware
o Used for workflow/kitting
o iOmegaT
o A tool built on top of OmegaT, an open-source CAT tool adapted to measure various aspects of post-editing MT output
o Developed by John Moran (CNGL) in collaboration with Dave Clarke (Welocalize), it records:
  - Translation time
  - MT post-editing time
  - Fuzzy match editing time
+ an extended suite of industry-standard automated evaluation methodologies, a human evaluation environment and translator surveys
7. Source Text Features
Source text features considered:
o Content type category (e.g. marketing/UI/UA)
o Length of the source segment
o Source segment morpho-syntactic complexity
o Presence/absence of pre-defined glossary terms or multi-word glossary elements, UI elements, numeric variables, product lists, ‘do-not-translate’ and transliteration lists
o Metadata attributes and their representation in localization industry standard formats (“tags”)
8. Content Types
Source content types generally passed to the engine:
o Technical/IT/Training Exams
o Business/Management Comms/Training
o Corporate Image/Branding/Advertising
o Voiceover/Subtitles/Video
o Marketing/Transcreation/Copywriting/Blurbs
o Technical Documentation
o User Interface/website
o User Assistance/Consumer Documentation
*Content type explicitly set in the GMS within the project/TM attributes
Content used for this study: User Interface/Website
9. Analyzing Tag Projection Accuracy
• Commercial enterprise translation data often takes the form of files in structured formats, converted for translation into XML-based schemas with heavily tag-annotated segments of source text
• Example:
Source (EN): Click the <g0>Advanced</g0> tab, and click <g1>Change</g1>.
Reference (PT): Clique no separador <g0>Avançado</g0> e em <g1>Alterar</g1>.
• Correctly projecting and placing these segment-internal tags from the
source language to the target language is a well-known difficult challenge
for MT in general, and statistical MT engines in particular
• Safaba has devoted significant effort over the past year to developing advanced, high-accuracy algorithms for source-to-target tag projection within our EMTGlobal MT solution
10. Analyzing Tag Projection Accuracy
o Goal: Assess tag projection and placement accuracy of EMTGlobal
version 1.1 versus 2.1, based on analysis of post-edited MT segments
generated by Welocalize for Safaba’s Dell MarkCom MT engines in
production
o Methodology: Estimate accuracy by aligning the target language raw MT
output with the post-edited MT version and assess whether each tag is
placed between the same target words on both sides
o Example:
Reference: Clique no separador <g0>Avançado</g0> e em <g1>Alterar</g1>.
EMTGlobal v1.1: <g0>Clique na guia Avançado e em</g0> <g1> Alterar.</g1>
EMTGlobal v2.1: Clique na guia <g0>Avançado</g0> e em <g1>Alterar</g1>.
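The context-matching methodology above can be sketched in a few lines of code. This is an illustrative reconstruction, not Safaba's actual alignment code, and it assumes each tag occurs at most once per segment:

```python
import re

TAG = re.compile(r"</?g\d+>")

def contexts(segment):
    """Map each tag to the (left word, right word) pair surrounding it.
    Assumes each tag occurs at most once per segment."""
    # Split on whitespace but keep the tags as their own tokens.
    tokens = [t for t in re.split(r"(<[^>]+>)|\s+", segment) if t]
    ctx = {}
    for i, tok in enumerate(tokens):
        if TAG.fullmatch(tok):
            left = next((t for t in reversed(tokens[:i]) if not TAG.fullmatch(t)), None)
            right = next((t for t in tokens[i + 1:] if not TAG.fullmatch(t)), None)
            ctx[tok] = (left, right)
    return ctx

def classify_tags(raw_mt, post_edited):
    """For each tag in the post-edited segment, report whether its context
    words in the raw MT output match on both sides, left only, right only,
    or neither."""
    mt_ctx, pe_ctx = contexts(raw_mt), contexts(post_edited)
    result = {}
    for tag, (pl, pr) in pe_ctx.items():
        ml, mr = mt_ctx.get(tag, (None, None))
        left, right = pl == ml, pr == mr
        result[tag] = ("both" if left and right
                       else "left" if left
                       else "right" if right
                       else "neither")
    return result
```

A tag whose raw-MT placement agrees with the post-edited version on both neighboring words is counted as confirmed correct; "neither" indicates a likely misplacement, as in the v1.1 example above.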
11. Analyzing Tag Projection Accuracy
EMTGlobal version 1.1 (contexts matched)
Tag Type      Both      Left      Right     Neither   Total
Beginning     33.33%    19.44%    11.46%    35.76%    100.00%
Ending        32.06%    10.10%     8.01%    49.83%    100.00%
Stand-alone   56.91%    23.98%    18.29%     0.81%    100.00%
Total         39.95%    17.54%    12.30%    30.21%    100.00%

EMTGlobal version 2.1 (contexts matched)
Tag Type      Both      Left      Right     Neither   Total
Beginning     66.67%    12.50%     9.38%    11.46%    100.00%
Ending        63.41%    10.80%    11.50%    14.29%    100.00%
Stand-alone   67.89%    18.29%    13.01%     0.81%    100.00%
Total         65.90%    13.64%    11.21%     9.26%    100.00%
• Fraction of “Neither” tags (likely incorrectly placed) reduced from 30% to 9%
• Fraction of “Both” tags (confirmed correctly placed) improved from 40% to 66%
• Fraction of tags with partially matched contexts reduced from 30% to 25%
• Data: Welocalize post-editing productivity data set
• 26 target languages, one document per language, 4907 segments
• For 15 languages (3211 segments), EMTGlobal v1.1 was post-edited
• For 11 languages (1696 segments), EMTGlobal v2.1 was post-edited
• Total of 830 tags in PE segments, 821 aligned with MT output (98.9%)
12. Tag Placement
Requirements:
o All formatting elements have been retained and for the most part are in valid sequence order
o The majority of translators’ work is fairly minor repositioning of tags
Source: <1><2>Thin design: </2></1><3>At a mere 0.9 inches (23 mm) and with up to 6 hours and 42 minutes</3><4><5>1</5></4><6> of battery life, XPS 14z is super-portable and ready to go anywhere.</6>
Target: <1><2> Schlankes Design <3>: Mit </3></2></1> einer Höhe von lediglich 23 mm (0,9 Zoll) und bis zu 6 Stunden und 42 Minuten <4><5><6> 1 Akkubetriebsdauer </6></5></4>, XPS 14z ist extrem mobil und einsatzbereit überall.
<1> correct; <2> correct; <3>, </3>, </2> and </1> slightly misplaced; <4> and <5> correct; <6>, </6>, </5> and </4> misplaced.
100% of tags retained, 33% accurate tag placement (EMTGlobal v1.1).
13. Tag Density Ratio
Goal: analyze the impact that the presence and ratio of the standard XLIFF
tags have on the post-editing task duration and number of edit visits and
factor this impact in the post-editing effort evaluation
New variable: Tag Density Ratio (tags per word) for the machine-translated
segments
Tag Density Ratio components: string length (word count) ranges, tag
quantification, tag density and visit frequency data
Hypothesis: segments with high tag density exhibit considerably higher than expected post-edit time as compared with low tag density segments of the same length, since tag placement adjustment is necessary during post-editing.
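As a concrete illustration, the per-segment Tag Density Ratio can be computed as follows. This is a minimal sketch assuming angle-bracket inline tags as in the examples above; the function name is ours, not from the study:

```python
import re

TAG = re.compile(r"<[^>]+>")

def tag_density_ratio(segment):
    """Tags per word for one machine-translated segment: tag count divided
    by the word count of the segment with its tags stripped out."""
    tag_count = len(TAG.findall(segment))
    word_count = len(TAG.sub(" ", segment).split())
    return tag_count / word_count if word_count else 0.0
```

For example, `"<1>Thin design</1>"` has two tags over two words, giving a TDR of 1.0; segments could then be bucketed by word-count range and TDR before comparing post-edit times.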
16. Tag Density Ratio (TDR) - Findings
o Human Translation vs. MT - no difference in TDR impact
o Higher TDR has no major impact on PE time across all
sentence length groups
o The tags are handled intelligently/placed properly by the MT
engine (Safaba EMTGlobal v. 2.1)
17. “Lower Effort” Elements
Goal: identify segments that contain:
o Glossary terms
o “DoNotTranslate” elements
o URL strings
o Other identifiable entities
Analyze their post-edit session duration in comparison with segments of
similar length with no identified “easy-to-manipulate” or DNT elements
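Identifying such segments is straightforward to automate. A minimal sketch, with hypothetical names and a simple substring/regex matching strategy of our own (a production pipeline would match against the actual glossary and DNT resources):

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")

def lower_effort_flags(segment, glossary, dnt_list):
    """Flag 'lower effort' elements in a source segment: glossary hits,
    do-not-translate (DNT) elements, and URL strings."""
    lowered = segment.lower()
    return {
        "glossary_hit": any(term.lower() in lowered for term in glossary),
        "dnt_hit": any(item in segment for item in dnt_list),
        "url": bool(URL.search(segment)),
    }
```

Segments flagged here can then be compared against unflagged segments of similar length when analyzing post-edit session duration.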
20. “Lower Effort” Elements - Findings
o Presence of DNT elements and terminology hits has similar positive impact
on the post-editing time
o DNT lists were created retroactively while the terminology is explicitly
highlighted to the translator; creating DNT glossaries will render additional
productivity gains
o Unlike the DNT elements, terminology entries may require edits
(plural/singular, case), which demonstrates that the Safaba engine handles
the morphological variants of terminology hits correctly
o Single isolated terminology hits slow down the translator (a standalone term with no context possibly requires more validation)
o Past the 20-25 words-per-segment range the impact of DNT and term hits
is negligible
21. Source String Complexity
Goal: perform a morpho-syntactic analysis of the input source sentences and cross-compare with the known “most difficult to handle” errors.
22. Source String Complexity - Findings
In each “segment length” group, sentences falling under these or similar complexity categories required the most post-editing time and effort, even with the new, improved version of the Safaba translation engine (EMTGlobal 2.1) with post-editors’ feedback implemented:
- Combining brains with brawn the Alienware® M17x is the most powerful 17” gaming notebook in the universe.
- With the swipe of a finger, the keyboard appears from under the display as the system is open.
- Through PartnerDirect, Registered and Certified Channel Partners can access software licensing from all of the major publishers including Microsoft, Symantec, VMware, Citrix, Oracle and many more
- Features a top-of-rack, 1U, multiprotocol design that supports Converged Enhanced Ethernet (CEE) and traditional Ethernet protocols, upgradable to support Fiber Channel and Fibre Channel over Ethernet (FCoE)
- The evolutionary design consumes less than 2.5 watts of power per port for exceptional power and cooling efficiency, and features consolidated power and fan assemblies to help improve environmental performance and reduce ownership costs.
Conclusion: source pre-edit rules still appear to be the most viable solution; patterns are traceable, but more rules than those identified to date will be needed (project WIP)
25. Developing Quality Estimation Prediction Classifiers
• MT engines in production often vary significantly in their
translation performance from segment to segment
• Goal: develop MT-engine-specific Quality Estimation
components that generate for every MT-generated segment a
predicted estimate of its expected quality
• Useful information for a variety of MT applications:
• For MT post-editing: provide indicators of predicted level of required
post-editing effort
• For real-time raw MT applications: filter out MT-generated documents
that are poorly translated
26. Safaba Quality Estimation Preliminary Study
• Goal: Develop and analyze the performance of basic QE components for Safaba’s EMTGlobal Dell MT engines using Welocalize post-editing productivity data
• English into 12 target languages
• Very small amounts of post-edited data for each language
• Binary classification: will post-editing be required for this segment?
Reliable quality estimation built for free
30. Quality Estimation Systems
• Classifier: nu-support vector classifier (a class of support vector machine)
• Features: 17 standard quality estimation features from the ACL WMT shared tasks
• Training data: binary judgments on MT output post-edited by professional translators
All resources required for QE are available from the MT engine training process in a standard post-editing scenario
31. QE Feature Scoring
• Input: source sentence, MT-generated translation output
• Key features computed for classifier:
  − Source/target word count
  − Source/target language model probability
  − Average number of possible translations of each source word (by word-based translation model IBM-1)
  − Counts of high and low frequency source/target unigrams/bigrams/trigrams
  − Percentage of out-of-vocabulary source words
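Several of these baseline features need only word lists derived from the MT training data. A simplified sketch (the language-model and IBM-1 features require trained models and are omitted; all names are our own):

```python
def qe_features(source, target, src_vocab, src_unigram_freq, high_freq=100):
    """Compute a subset of WMT-style baseline QE features from a source
    sentence and its MT output, given a source vocabulary set and a
    source unigram frequency table."""
    src = source.split()
    tgt = target.split()
    n = max(len(src), 1)  # guard against empty source
    return {
        "src_word_count": len(src),
        "tgt_word_count": len(tgt),
        "length_ratio": len(tgt) / n,
        "oov_fraction": sum(w not in src_vocab for w in src) / n,
        "high_freq_fraction": sum(src_unigram_freq.get(w, 0) >= high_freq
                                  for w in src) / n,
    }
```

The resulting feature vectors are what the nu-SVC of the previous slide is trained on.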
32. Models Required
• Source and target 4-gram language models
• Source and target low and high frequency n-gram tables
• Source and target vocabularies
All built from existing MT system training data
33. Classifier Training
• Classify sentences into two groups:
  • Requires post-editing
  • Does not require post-editing
• Training data:
  • Safaba EMTGlobal MT systems used in production for post-editing
  • In this study: triples of source, MT output, edited translation available from the Welocalize productivity study
  • Compare MT-generated output to final post-edited translation to determine if editing was required
No additional human annotation required
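Deriving the binary labels from the post-editing triples is mechanical. A minimal sketch of this labeling step (our own simplification; in the full pipeline the feature vectors from slide 31, not the raw strings, would be fed to the nu-SVC):

```python
def needs_post_editing(mt_output, post_edited):
    """Binary QE label: True if the translator changed the raw MT output.
    Whitespace is normalized so spacing-only differences don't count."""
    return " ".join(mt_output.split()) != " ".join(post_edited.split())

def build_training_set(triples):
    """From (source, MT output, post-edited) triples, derive classifier
    inputs and binary labels -- no extra human annotation needed."""
    X = [(src, mt) for src, mt, _ in triples]
    y = [needs_post_editing(mt, pe) for _, mt, pe in triples]
    return X, y
```

Because the labels fall out of the normal post-editing workflow, no additional annotation effort is required.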
34. QE Prediction Preliminary Study
o Average of 250 sentences edited per language
o Classifiers trained and evaluated with 10-fold cross-validation (found to perform comparably to leave-one-out validation)
o Outperforms random selection and majority-class selection in 11 of 12 languages
37. Analysis of Results
o QE systems built entirely using small amounts of existing data
o 70–80% reliability in the majority of languages
o Most errors are false negatives (good sentences marked as bad, the less damaging case)
o Cases where QE performance is weaker:
  o Small model training data
  o Skewed classifier training data
  o High statistical similarity between positive and negative examples (Czech)
38. Future Work
o Build QE prediction components automatically for EMTGlobal
production MT systems.
o Train QE classifiers automatically as client data is edited and
fed back to Safaba
o Plug in additional sentence-level meta-data to predict other
useful measures:
o Translation time
o HTER
welocalize
www.welocalize.com
[t] +1.301.668.0330
[t] +1.800.370.9515 Toll Free
[e] sales@welocalize.com
Editor's Notes
Olga
Possibly add a slide on content types here (from VM Excel) ; otherwise – string types visualization from Nilesh’s study and examples
Olga
Create bullet points from the text
Using a method of calculating tag count and therefore tag density (tags/word) for each individual string from MySQL data exports, we can now identify segments with and without tags where the translatable content did not require post-editing, and test the hypothesis that higher tag density results in higher post-editing effort.
Olga
*The event information is captured in the database in raw XML event action form and can be extracted and interpreted.