TAUS Post-Editing Productivity Guidelines

1. TAUS Best Practices: Post-Editing Productivity Guidelines (December 2012)

2. Small-Scale, Short-Term Post-Editing Productivity Tests

Post-editing productivity measurement applies to scenarios where you might wish to use MT as a translator productivity tool. Generally, small-scale productivity tests should be used if you are thinking about getting started with MT in a particular language pair and wondering whether you should invest further effort. The productivity tests will help you understand your potential ROI per language pair/content type. You may also undertake such tests periodically, say annually, to get an indication of improvements (or not) in productivity.

Small-scale, short-term tests, by their nature, gather less data and are less rigorous than the larger-scale, long-term ones, for which we provide a separate set of best practice guidelines.

TAUS members are able to use productivity testing and human evaluation tools at tauslabs.com, a neutral and independent environment. These ensure that the best practices outlined below are applied, with the additional benefit of automated reporting.

3. Design: Compare apples with apples, not apples with oranges

Productivity measures are flawed when they don’t compare like with like. When comparing post-editing against translation, it should be clear whether or not ‘translation’ means ‘TEP’ (Translation, Edit, Proof) and whether or not post-editing includes final proofing. If translators have access to terminology, so too should post-editors.

4. Employ at least three post-editors per target language

Productivity will vary by individual. By including a number of post-editors with varying translation and post-editing experience, you will get a more accurate average measurement.

Exercise control over your participant profile

Engage the people who will actually do the post-editing in live projects.

5. Include appropriate content type

MT engines will have varying degrees of success with different content types. Test each content type you wish to machine translate. For SMT engines, the training data should not form part of the test set. Out-of-domain test data should not be tested on domain-specific engines.

Include a sufficient number of words

For each content type and target language, we recommend including at least 300 segments of MT output for short-term tests. The more words included, the more reliable the results will be.

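One way to apply the rule that SMT training data must stay out of the test set is to screen candidate test segments against the training corpus. The sketch below is a minimal illustration, not part of the TAUS guidelines; the file names and the exact-match comparison are assumptions.

```python
# Hypothetical sketch: drop candidate test segments that also appear in
# the SMT training data. File names are assumptions for illustration.

def load_segments(path):
    """Read one segment per line, stripped for comparison."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

training = set(load_segments("smt_training_source.txt"))
candidates = load_segments("candidate_test_set.txt")

# Keep only segments the engine never saw during training.
test_set = [seg for seg in candidates if seg not in training]

print(f"Kept {len(test_set)} of {len(candidates)} candidates "
      f"({len(candidates) - len(test_set)} overlapped with training data)")
```

Exact matching is deliberately conservative; for heavily templated content it may also be worth screening near-duplicates.
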
6. Provide clear guidelines

If post-editors should adhere to your standard style-guide, then this should be clearly stated. In addition, guidelines for post-editing should be provided so that all participants understand what is/is not required. See the TAUS Machine Translation Post-Editing Guidelines. It has been noted that people sometimes do not adhere to guidelines, so the clearer and more succinct they are, the better.

7. Measures

Measure actual post-editing effort and avoid measures of ‘perceived’ effort

When people are asked to rate which segment might be easier to post-edit, they are only rating perceived, not actual, effort. It has been frequently shown that agreement between participants in such exercises is only low to moderate.

Measure the delta between translation productivity and post-editing productivity

The delta between the two activities is the important measurement. Individuals will benefit to a greater or lesser extent from MT, so the delta should be calculated per individual and the average delta should then be calculated.

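The per-individual delta calculation can be made concrete with a short sketch. The throughput figures below are invented for illustration; only the method (delta per individual first, then the average) comes from the guideline above.

```python
# Hypothetical sketch of the delta calculation. Figures are invented.
throughput = {
    # post-editor: (translation words/hour, post-editing words/hour)
    "editor_a": (450, 700),
    "editor_b": (520, 600),
    "editor_c": (400, 650),
}

# Delta per individual, as a percentage gain over plain translation.
deltas = {name: (pe - tr) / tr * 100 for name, (tr, pe) in throughput.items()}
average_delta = sum(deltas.values()) / len(deltas)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
print(f"Average delta: {average_delta:+.1f}%")
```
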
8. Extrapolate daily productivity with caution

For short-term measures, numbers of post-editors and words are often limited, so potential daily productivity should be extrapolated with some caution, as you may have outliers who will skew results in a small group.

The TAUS productivity testing reports show average and individual post-editor productivity, enabling users to determine the influence of outliers on results.

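A quick way to spot outlier influence in a small group is to compare the mean against the median, as in this sketch; the figures are invented.

```python
# Hypothetical sketch: a wide mean/median gap in a small group warns
# against extrapolating daily productivity from the mean alone.
from statistics import mean, median

words_per_hour = [520, 560, 610, 1450]  # one unusually fast post-editor

print(f"Mean:   {mean(words_per_hour):.0f} words/hour")
print(f"Median: {median(words_per_hour):.0f} words/hour")
```
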
9. Measure final quality

Increased efficiencies are meaningless if the desired level of quality is not reached. Below are two examples of techniques for quality measurement.

• Human evaluation: For example, using a company’s standard error typology or adequacy/fluency evaluation.
• Edit distance measures: Some research has shown the TER (Translation Edit Rate) and GTM (General Text Matcher) measures to correlate fairly well with human assessments of quality.

The TAUS quality evaluation tools enable you to undertake adequacy/fluency evaluation and error typology review. Again, the reports show average and individual post-editor productivity, enabling users to determine the influence of outliers on results.

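As a rough illustration of an edit distance measure, the sketch below computes a word-level edit rate: the Levenshtein distance between the raw MT output and the post-edited result, normalized by the length of the post-edited reference. Full TER also counts block shifts, so this is a simplified stand-in, not the official metric.

```python
# Hypothetical sketch of a simplified, TER-like edit rate over words.

def edit_rate(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    # Classic dynamic-programming edit distance, one row at a time.
    dist = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, dist[0] = dist[0], i
        for j, r in enumerate(ref, 1):
            cur = min(dist[j] + 1,        # deletion
                      dist[j - 1] + 1,    # insertion
                      prev + (h != r))    # substitution / match
            prev, dist[j] = dist[j], cur
    return dist[len(ref)] / max(len(ref), 1)

raw_mt = "the cat sat on mat"
post_edited = "the cat sat on the mat"
print(f"Edit rate: {edit_rate(raw_mt, post_edited):.2f}")  # 0.17
```
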
10. Large-Scale, Longer-Term Post-Editing Productivity Tests

The larger-scale, longer-term tests would be used if you are already reasonably committed to MT, or at least to testing it on a large scale, and are looking to create a virtuous cycle towards operational excellence, guided by such tests.

11. Design: Compare apples with apples, not apples with oranges

Productivity measures are flawed when they don’t compare like with like. When comparing post-editing against translation, it should be clear whether or not ‘translation’ means ‘TEP’ (Translation, Edit, Proof) and whether or not post-editing includes final proofing. If translators have access to terminology, so too should post-editors.

12. Employ a sufficient number of post-editors per target language

Productivity will vary by individual. By including a broad group of post-editors with varying translation and post-editing experience, you will get a more accurate average measurement. For long-term measurements, we recommend employing more than three post-editors, preferably at least five or six.

13. Exercise control over your participant profile

Engage the people who will actually do the post-editing in live projects. Employing students or the ‘crowd’ is not valid if they are not the actual post-editors you would employ in a live project.

Conduct the productivity measurement as you would a commercial project

You want the participants to perform the task as they would any commercial project so that the measures you take are reliable.

14. Include appropriate content type

MT engines will have varying degrees of success with different content types. Test each content type you wish to machine translate.

• For SMT engines, the training data should not form part of the test set. Out-of-domain test data should not be tested on domain-specific engines.

15. Include a sufficient number of words

For each content type and target language, we recommend collating post-editing throughput over a number of weeks. The more words included, the more reliable the results will be.

Use realistic tools and environments

Commonly, MT is integrated into TM tools. If the post-editor will eventually work in this standard environment, then productivity tests should be done in the same environment, as more realistic measures can be obtained.

16. Provide clear guidelines

If post-editors should adhere to your standard style-guide, then this should be clearly stated. In addition, guidelines for post-editing should be provided so that all participants understand what is/is not required. See the TAUS Machine Translation Post-Editing Guidelines. It has been noted that people sometimes do not adhere to guidelines, so the clearer and more succinct they are, the better.

17. Involve representatives from your post-editing community in the design and analysis

As with the TAUS Machine Translation Post-Editing Guidelines, we recommend that representatives of the post-editing community be involved in the productivity measurement. Having a stake in such a process generally leads to a higher level of consensus.

18. Large-Scale, Long-Term Tests: Measures

Gauge the quality level of the raw MT output first

Productivity is directly related to the level of quality of the raw MT output. To understand post-editing productivity measurements, you need to understand the baseline quality of the MT output. Random sampling of the output is recommended.

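Random sampling for the baseline quality review can be as simple as the sketch below; the file name and the sample size of 100 segments are assumptions.

```python
# Hypothetical sketch: draw a reproducible random sample of raw MT
# segments for baseline quality review. File name and k are assumptions.
import random

with open("raw_mt_output.txt", encoding="utf-8") as f:
    segments = [line.strip() for line in f if line.strip()]

random.seed(42)  # fixed seed so the same sample can be re-drawn
sample = random.sample(segments, k=min(100, len(segments)))

for i, seg in enumerate(sample, 1):
    print(f"{i}\t{seg}")
```
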
19. Measure actual post-editing effort and avoid measures of ‘perceived’ effort

When people are asked to rate which segment might be easier to post-edit, they are only rating perceived, not actual, effort. It has been frequently shown that agreement between participants in such exercises is only low to moderate.

20. Measure more than ‘words per hour’

Words per hour or words per day give a simplistic view of productivity. The important question is: can production time be reduced across the life cycle of a project (without compromising quality)? Therefore, it may be more appropriate to measure the total gain in days for delivery or publication of the translated content. Spreading measurement out over time will also show whether post-editing productivity rates rise, plateau or fall over time.

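To make the life-cycle view concrete, the sketch below tracks weekly post-editing throughput and estimates the total gain in delivery days against a translation-only baseline. All figures are invented for illustration.

```python
# Hypothetical sketch: weekly throughput trend plus estimated gain in
# delivery days for a project. All figures are invented.

weekly_words_per_day = [1800, 2100, 2400, 2450, 2430]  # post-editing
baseline_words_per_day = 1500                          # plain translation
project_words = 120_000

for week, rate in enumerate(weekly_words_per_day, 1):
    print(f"Week {week}: {rate} words/day")

# Estimate delivery time under each mode using the latest observed rate.
pe_days = project_words / weekly_words_per_day[-1]
tr_days = project_words / baseline_words_per_day
print(f"Estimated gain: {tr_days - pe_days:.1f} days on a "
      f"{project_words}-word project")
```
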
21. Measure the delta between translation productivity and post-editing productivity

The delta between the two activities is the important measurement. Individuals will benefit to a greater or lesser extent from MT, so the delta should be calculated per individual and the average delta should then be calculated.

22. Measure final quality

Increased efficiencies are meaningless if the desired level of quality is not reached. Below are two examples of techniques for quality measurement. For longitudinal measures, average final quality can be compared to see what improvements or degradations occurred:

• Human evaluation: For example, using a company’s standard error typology. Note that human raters of quality often display low rates of agreement. The more raters, the better (at least three).
• Edit distance measures: Some research has shown the TER (Translation Edit Rate) and GTM (General Text Matcher) measures to correlate fairly well with human assessments of quality.

23. Measure opinions, get feedback

Some measurement of post-editor opinion/profile can be useful in helping to interpret the quantitative measures. Gathering feedback on the most common or problematic errors can help improve the MT system over time.

24. Questions About Measurement

Self-report or automate?

If a realistic environment is used, it is difficult to automate productivity measurements. Self-reporting is often used instead, where post-editors fill in a table reporting the number of words they post-edited, divided by the number of hours they worked. Self-reporting is error-prone, but with a large enough group of participants, under- or over-reporting should be mitigated.

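A self-reporting table reduces to a very simple aggregation, sketched below with invented data: words post-edited divided by hours worked, per post-editor, then averaged across the group.

```python
# Hypothetical sketch: aggregate self-reported throughput from rows of
# (post-editor, words post-edited, hours worked). Data is invented.

reports = [
    ("editor_a", 5200, 8.0),
    ("editor_b", 6100, 9.5),
    ("editor_c", 4800, 7.0),
]

rates = []
for name, words, hours in reports:
    rate = words / hours
    rates.append(rate)
    print(f"{name}: {rate:.0f} words/hour")

# A larger pool of reporters dampens individual under- or over-reporting.
print(f"Group average: {sum(rates) / len(rates):.0f} words/hour")
```
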
25. What about using confidence scores as indicators of productivity?

Confidence scores that are automatically generated by the MT system are potential indicators of both quality and productivity. However, development is still in the early stages and there is not yet enough research on the potential links between confidence scores and actual post-editing productivity.

26. What about re-training engines over time?

If SMT engines are re-trained with quality-approved post-edited content over time, then it can be expected that the MT engine will produce higher-quality raw output as time progresses and that post-editing productivity may increase. This should be tested over the long term.
