TAUS Best Practices
Adequacy/Fluency Guidelines
May 2013
Quality Evaluation using Adequacy and/or Fluency Approaches

WHY ARE TAUS INDUSTRY GUIDELINES NEEDED?

Adequacy and/or Fluency evaluations are regularly employed for assessing the quality of machine translation. However, they are also useful for evaluating human and/or computer-assisted translation in certain contexts. These methods are less costly and time-consuming to implement than an error typology approach, and can help to focus on assessing the quality attributes that are most relevant for specific content types and purposes.

Providing guidelines for best practices will enable the industry to:
• Adopt standard approaches, ensuring a shared language and understanding between translation buyers, suppliers and evaluators
• Better track and compare performance across projects, languages and vendors
• Reduce the cost of quality assurance
Adequacy/Fluency Best Practice Guidelines

Establish clear definitions

Adequacy
• “How much of the meaning expressed in the gold-standard translation or the source is also expressed in the target translation” (Linguistic Data Consortium).

Fluency
• To what extent the translation is “one that is well-formed grammatically, contains correct spellings, adheres to common use of terms, titles and names, is intuitively acceptable and can be sensibly interpreted by a native speaker” (Linguistic Data Consortium).
Clearly define evaluation criteria and rating scales, using examples to ensure clarity

Adequacy
• How much of the meaning expressed in the gold-standard translation or the source is also expressed in the target translation?
• On a 4-point scale, rate how much of the meaning is represented in the translation:
o Everything
o Most
o Little
o None

Fluency
• To what extent is a target-side translation grammatically well-formed, without spelling errors, and experienced as using natural/intuitive language by a native speaker?
• On a 4-point scale, rate the extent to which the translation is well-formed grammatically, contains correct spellings, adheres to common use of terms, titles and names, is intuitively acceptable and can be sensibly interpreted by a native speaker:
o Flawless
o Good
o Disfluent
o Incomprehensible
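For downstream analysis, the categorical labels are usually mapped to numbers. A minimal Python sketch, assuming a 4-to-1 encoding (higher is better) that the guidelines themselves do not prescribe:

```python
# Numeric encodings for the two 4-point scales. The 4-to-1 mapping
# (higher = better) is an illustrative assumption, not TAUS-mandated.
ADEQUACY_SCALE = {"Everything": 4, "Most": 3, "Little": 2, "None": 1}
FLUENCY_SCALE = {"Flawless": 4, "Good": 3, "Disfluent": 2, "Incomprehensible": 1}

def mean_score(labels, scale):
    """Average numeric score for a list of categorical ratings."""
    return sum(scale[label] for label in labels) / len(labels)
```

With this encoding, per-segment ratings can be averaged into project-level scores and tracked over time.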
Data segments should be at least sentence length
• The chosen evaluation data set must be representative of the entire data set/content
• For MT output, a minimum of two hundred segments must be reviewed
• For MT output, the order in which the data/content is presented should be randomized to eliminate bias
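The sampling and randomization steps above can be sketched in Python. The 200-segment floor comes from the guideline; the fixed seed is an added assumption so the same randomized set can be shown to every evaluator:

```python
import random

def randomized_sample(segments, n=200, seed=42):
    """Draw at least n segments from the content and present them in
    randomized order to reduce ordering bias. A fixed seed (an assumption,
    not part of the guideline) keeps the selection reproducible."""
    if len(segments) < n:
        raise ValueError(f"need at least {n} segments, got {len(segments)}")
    rng = random.Random(seed)
    # sample() draws without replacement and already returns a random order
    return rng.sample(segments, n)
```

Because the seed is fixed, repeated calls on the same content return the identical randomized list, so all evaluators rate the same data in the same shuffled order.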
Human evaluator teams are best suited to provide feedback for Adequacy and/or Fluency
• For periodic reviews there should be at least four evaluators per team
• The level of agreement between evaluators and confidence intervals must be measured
• Evaluators must rate the same data
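When a fixed panel rates the same data, agreement between evaluators can be measured with Fleiss' kappa. A minimal sketch; the choice of Fleiss' kappa as the agreement statistic is an assumption, since the guidelines do not name one:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-rater agreement.
    ratings: one list of category labels per item, one label per rater;
    every item must be rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # n_ij = number of raters assigning category j to item i
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # mean observed per-item agreement
    p_bar = sum(
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # chance agreement from marginal category proportions
    totals = [sum(row[j] for row in counts) for j in range(len(categories))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

Kappa is 1.0 for perfect agreement and near 0 when agreement is no better than chance; a team can set a minimum kappa as its acceptance threshold for an evaluation round.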
To ensure consistent quality, human evaluators must meet minimum requirements
• Ensure minimum requirements are met by developing training materials, screening tests, and guidelines with examples
• Evaluators should be native or near-native speakers, familiar with the domain of the data
• Evaluators should ideally be available to perform one evaluation pass without interruption

For adequacy evaluation, evaluators will need to be able to understand the source language. This requirement can be overcome if evaluators are given a gold reference segment for each translated segment. Note that the quality of the gold reference should be validated in advance.
Determine when your evaluations are suited for benchmarking by making sure results are repeatable
• Define tests and test sets for each model and determine minimal requirements for inter-rater agreement
• Train and retain evaluator teams
• Establish scalable and repeatable processes by using tools and automated processes for data preparation, evaluation setup and analysis
Capture evaluation results automatically to enable comparisons across time, projects and vendors
• Use color-coding for comparing performance over time, e.g. green for meeting or exceeding expectations, amber to signal a reduction in quality, red for problems that need addressing
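The traffic-light mapping described above can be sketched as a simple threshold function; the 0.25-point amber tolerance below is an illustrative assumption, to be tuned per program:

```python
def status_color(current, baseline, tolerance=0.25):
    """Map a current mean score against a baseline to a traffic-light color.
    The tolerance band for amber is an assumed value, not TAUS-specified."""
    if current >= baseline:
        return "green"   # meeting or exceeding expectations
    if current >= baseline - tolerance:
        return "amber"   # reduction in quality worth watching
    return "red"         # problem that needs addressing
```

Applied to each project or vendor per period, this yields a dashboard row of colors that makes trends visible at a glance.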
Implement a CAPA (Corrective Action Preventive Action) process
• Best practice is to have a process in place to deal with quality issues: corrective action processes along with preventive action processes. Examples might include the provision of training or the improvement of terminology management processes.
• Adequacy/Fluency review may not identify root causes. If review scores highlight major issues, more detailed analysis may be required, for example using Error Typology review.
If evaluations follow these recommendations, you will be able to achieve reliable and statistically significant results with measurable confidence scores.
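As a sketch of what “measurable confidence scores” can mean in practice, a confidence interval for the mean rating can be computed from per-segment numeric scores. The normal approximation and z = 1.96 for a 95% interval are standard statistical assumptions, not prescribed by the guidelines:

```python
import statistics
from math import sqrt

def mean_ci(scores, z=1.96):
    """Normal-approximation confidence interval for the mean of numeric
    per-segment scores; z=1.96 gives roughly a 95% interval."""
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / sqrt(len(scores))
    return m - z * se, m + z * se
```

With 200 or more rated segments, as the guidelines require, the interval is usually narrow enough to compare engines or vendors meaningfully.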
Further resources:
For TAUS members: for information on when to use adequacy and/or fluency approaches, conditions for success, step-by-step process guides, ready-to-use templates and guidance on training evaluators, please refer to the TAUS Dynamic Quality Framework Knowledge.
Our thanks to:
Karin Berghoefer (Appen Butler Hill) for drafting these guidelines.
The following organizations for reviewing and refining the Guidelines at the TAUS Quality Evaluation Summit, 15 March 2013, Dublin: ABBYY Language Services, Capita Translation and Interpreting, Crestec, Intel, Jensen Localization, Jonckers Translation & Engineering s.r.o., Lingo24, Logrus International, Microsoft, Palex Languages & Software, Symantec, and Tekom.
Consultation and Publication
• A public consultation was undertaken between 11 and 24 April 2013. The guidelines were published on 2 May 2013.

Feedback
• To give feedback on how to improve the guidelines, please write to firstname.lastname@example.org.