TAUS Best Practices
Post-Editing Productivity Guidelines
December 2012
Small-scale, Short-term Post-Editing
Productivity Tests
Post-editing productivity measurement applies to scenarios where you
might wish to use MT as a translator productivity tool. Generally,
small-scale productivity tests should be used if you are thinking about
getting started with MT in a particular language pair, and wondering
whether you should invest further effort. The productivity tests will
help you understand your potential ROI per language pair/content
type.
You may also undertake such tests periodically, say annually, to get an
indication of improvements (or not) in productivity.
Small-scale, short-term tests, by their nature, gather less data and are
less rigorous than the larger-scale, long-term ones for which we
provide a separate set of best practice guidelines.
TAUS members are able to use productivity testing and human
evaluation tools at tauslabs.com, a neutral and independent
environment. These tools ensure that the best practices outlined below are
applied, with the additional benefit of automated reporting.
Design
Compare apples with apples not apples with
oranges
Productivity measures are flawed when they don’t
compare like with like. When comparing post-
editing against translation, it should be clear
whether or not ‘translation’ means ‘TEP’ (Translation, Edit, Proof) and whether or not post-editing includes final proofing. If translators have
access to terminology, so too should post-editors.
Employ at least three post-editors per target
language
Productivity will vary by individual. By including
a number of post-editors with varying
translation and post-editing experience, you will
get a more accurate average measurement.
Exercise control over your participant profile
Engage the people who will actually do the post-
editing in live projects.
Include appropriate content type
MT engines will have varying degrees of success
with different content types. Test each content type
you wish to machine translate. For SMT engines the
training data should not form part of the test set.
Out-of-domain test data should not be tested on
domain-specific engines.
Include a sufficient number of words
For each content type and target language, we
recommend including at least 300 segments of MT
output for short-term tests. The more words
included, the more reliable the results will be.
Provide clear guidelines
If post-editors should adhere to your standard
style-guide, then this should be clearly stated. In
addition, guidelines for post-editing should be
provided so that all participants understand
what is/is not required. See the TAUS Machine
Translation Post-Editing Guidelines. It has been
noted that people sometimes do not adhere to
guidelines, so the clearer and more succinct
they are, the better.
Measures
Measure actual post-editing effort and avoid measures of
‘perceived’ effort
When people are asked to rate which segment might
be easier to post-edit, they are only
rating perceived, not actual effort. It has frequently been shown
that agreement between participants in such exercises is only
low to moderate.
Measure the delta between translation productivity and
post-editing productivity
The delta between the two activities is the important
measurement. Individuals will benefit to a greater or lesser
extent from MT, so the delta should be calculated per
individual and the average delta should then be calculated.
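To make the calculation concrete, here is a minimal Python sketch using hypothetical words-per-hour figures; the editor names and numbers are illustrative only:

```python
# Hypothetical words-per-hour figures for three post-editors,
# measured once for unaided translation and once for post-editing MT.
throughput = {
    "editor_1": {"translation": 520, "post_editing": 690},
    "editor_2": {"translation": 480, "post_editing": 600},
    "editor_3": {"translation": 610, "post_editing": 640},
}

# The delta is calculated per individual first ...
deltas = {
    name: (t["post_editing"] - t["translation"]) / t["translation"]
    for name, t in throughput.items()
}

# ... and only then averaged across the group.
average_delta = sum(deltas.values()) / len(deltas)

for name, d in deltas.items():
    print(f"{name}: {d:+.1%}")
print(f"average delta: {average_delta:+.1%}")
```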
Extrapolate daily productivity with caution
For short-term measures, numbers of post-
editors and words are often limited and so
potential daily productivity should be
extrapolated with some caution as you may
have outliers who will skew results in a small
group.
The TAUS productivity testing reports show average and individual post-editor productivity,
enabling users to determine the influence of
outliers on results.
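As a small illustration of why individual results matter, the sketch below (hypothetical figures, assuming an eight-hour day) compares the mean and the median before extrapolating to daily output; a large gap between the two suggests an outlier is skewing the average:

```python
import statistics

# Hypothetical post-editing throughput (words per hour), including one outlier.
words_per_hour = [690, 600, 640, 1450]  # the last editor is an outlier

hours_per_day = 8  # assumed working day

mean_daily = statistics.mean(words_per_hour) * hours_per_day
median_daily = statistics.median(words_per_hour) * hours_per_day

print(f"extrapolated daily words (mean):   {mean_daily:,.0f}")
print(f"extrapolated daily words (median): {median_daily:,.0f}")
# A large gap between mean and median signals that an outlier is skewing the average.
```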
Measure final quality
Increased efficiencies are meaningless if the desired level of quality is
not reached. Below are two examples of techniques for quality
measurement.
• Human evaluation: For example, using a company’s standard error
typology or adequacy/fluency evaluation.
• Edit distance measures: Some research has shown the TER
(Translation Edit Rate) and GTM (General Text Matcher) measures
to correlate fairly well with human assessments of quality.
The TAUS quality evaluation tools enable you to undertake adequacy/fluency
evaluation and error typology review. Again, the reports show average and
individual post-editor results, enabling users to determine the influence of
outliers.
Large-Scale, Longer-Term Post-Editing
Productivity Tests
The larger-scale, longer-term tests would be
used if you are already reasonably committed to
MT, or at least to testing it on a large scale, and
looking to create a virtuous cycle towards
operational excellence, guided by such tests.
Design
Compare apples with apples not apples with
oranges
Productivity measures are flawed when they don’t
compare like with like. When comparing post-
editing against translation, it should be clear
whether or not ‘translation’ means ‘TEP’ (Translation, Edit, Proof) and whether or not post-editing includes final proofing. If translators have
access to terminology, so too should post-editors.
Employ a sufficient number of post-editors per
target language
Productivity will vary by individual. By including
a broad group of post-editors with varying
translation and post-editing experience, you will
get a more accurate average measurement. For
long-term measurements, we recommend
employing more than three post-editors,
preferably at least five or six.
Exercise control over your participant profile
Engage the people who will actually do the post-
editing in live projects. Employing students or
the ‘crowd’ is not valid if they are not the actual
post-editors you would employ in a live project.
Conduct the productivity measurement as you
would a commercial project
You want the participants to perform the task as
they would any commercial project so that the
measures you take are reliable.
Include appropriate content type
MT engines will have varying degrees of success
with different content types. Test each content
type you wish to machine translate.
• For SMT engines the training data should not
form part of the test set. Out-of-domain test
data should not be tested on domain-specific
engines.
Include a sufficient number of words
For each content type and target language, we
recommend collating post-editing throughput over
a number of weeks. The more words included, the
more reliable the results will be.
Use realistic tools and environments
Commonly, MT is integrated into TM tools. If post-editors
will eventually work in this standard environment,
productivity tests should be run in that same
environment so that the measures obtained are more
realistic.
Provide clear guidelines
If post-editors should adhere to your standard
style-guide, then this should be clearly stated. In
addition, guidelines for post-editing should be
provided so that all participants understand
what is/is not required. See the TAUS Machine
Translation Post-Editing Guidelines. It has been
noted that people sometimes do not adhere to
guidelines, so the clearer and more succinct
they are, the better.
Involve representatives from your post-editing
community in the design and analysis
As with the TAUS Machine Translation Post-Editing
Guidelines, we recommend that representatives
of the post-editing community be involved in
the productivity measurement. Having a stake in
such a process generally leads to a higher level
of consensus.
Large-Scale, Longer-Term Tests:
Measures
Gauge the quality level of the raw MT output
first
Productivity is directly related to the level of
quality of the raw MT output. To understand PE
productivity measurements, you need to
understand the baseline quality of the MT
output. Random sampling of the output is
recommended.
Measure actual post-editing effort and avoid
measures of ‘perceived’ effort
When people are asked to rate which segment
might be easier to post-edit, they are only rating
perceived, not actual effort. It has been frequently
shown that agreement between participants in
such exercises is only low to moderate.
Measure more than ‘words per hour’
Words per hour or words per day give a
simplistic view of productivity. The important
question is: can production time be reduced
across the life cycle of a project (without
compromising quality)? Therefore, it may be
more appropriate to measure the total gain in
days for delivery or publication of the translated
content. Spreading measurement out over time
will also show whether post-editing productivity
rates rise, plateau or fall over time.
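A back-of-the-envelope sketch of translating throughput into a gain in delivery days, using entirely hypothetical project volumes and daily rates:

```python
# Hypothetical project volume and daily throughput figures.
project_words = 120_000
translation_words_per_day = 2_000   # unaided translation
post_editing_words_per_day = 3_200  # post-editing MT output

translation_days = project_words / translation_words_per_day
post_editing_days = project_words / post_editing_words_per_day

print(f"translation:  {translation_days:.1f} days")
print(f"post-editing: {post_editing_days:.1f} days")
print(f"gain:         {translation_days - post_editing_days:.1f} days")
```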
Measure the delta between translation
productivity and post-editing productivity
The delta between the two activities is the
important measurement. Individuals will benefit
to a greater or lesser extent from MT, so the
delta should be calculated per individual and the
average delta should then be calculated.
Measure final quality
Increased efficiencies are meaningless if the desired level
of quality is not reached. Below are two examples of
techniques for quality measurement. For longitudinal
measures, average final quality can be compared to see
what improvements or degradations occurred:
• Human evaluation: For example, using a company’s
standard error typology. Note that human raters of
quality often display low rates of agreement. The more
raters, the better (at least three).
• Edit distance measures: Some research has shown the
TER (Translation Edit Rate) and GTM (General Text
Matcher) measures to correlate fairly well with human
assessments of quality.
Measure opinions, get feedback
Some measurement of post-editor
opinion/profile can be useful in helping to
interpret the quantitative measures. Gathering
feedback on the most common or problematic
errors can help improve the MT system over
time.
Questions About Measurement
Self-report or automate?
If a realistic environment is used, it is difficult to
automate productivity measurements. Self-
reporting is often used instead, where post-editors
fill in a table reporting the number of words they
post-edited, divided by the number of hours they
worked. Self-reporting is error-prone, but with a
large enough group of participants, under- or over-
reporting should be mitigated.
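A minimal sketch of how such a self-reported table might be aggregated, with hypothetical figures; it shows both per-editor rates and the pooled rate (total words over total hours), which becomes more robust to individual mis-reporting as the group grows:

```python
import csv
import io

# Hypothetical self-reported log: one row per post-editor.
self_report = """editor,words_post_edited,hours_worked
editor_1,9800,16.0
editor_2,7200,12.5
editor_3,11000,20.0
"""

rows = list(csv.DictReader(io.StringIO(self_report)))

# Per-editor rate: each editor's own words divided by their own hours.
for row in rows:
    rate = float(row["words_post_edited"]) / float(row["hours_worked"])
    print(f"{row['editor']}: {rate:.0f} words/hour")

# Pooled rate: total words over total hours, which dampens individual
# under- or over-reporting as the group of participants grows.
total_words = sum(float(r["words_post_edited"]) for r in rows)
total_hours = sum(float(r["hours_worked"]) for r in rows)
print(f"pooled: {total_words / total_hours:.0f} words/hour")
```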
What about using confidence scores as
indicators of productivity?
Confidence scores that are automatically
generated by the MT system are potential
indicators of both quality and productivity.
However, development is still in the early stages
and there is not yet enough research on the
potential links between confidence scores and
actual post-editing productivity.
What about re-training engines over time?
If SMT engines are re-trained with quality-
approved post-edited content over time, then it
can be expected that the MT engine will
produce higher quality raw output as time
progresses and that post-editing productivity
may increase. This should be tested over the
long-term.