
Serge astm-presentation-chicago-2014-final

Consistent high quality is the hallmark of a reliable localization company. Irrespective of deadlines and volumes, translations should be done professionally and thoroughly: they should be relevant, complete, and coherent. Language quality assurance is therefore one of the key processes for ensuring the high quality of deliverables.

The presentation covers the content of the new ASTM (www.astm.org) work item WK46397, which is currently under discussion in ASTM working group F43 (Committee on Language Services and Products).

  1. The science of Language Quality Assurance: What’s behind two new ASTM work items, WK46396 and WK46397. Serge Gladkoff (GALA, Logrus International), Chicago, November 5, 2014
  2. LISA QA Model, SAE J2450, SDL TMS, Acrocheck, ApSIC XBench, CheckMate, QA Distiller, XLIFF:Doc, EN15038 { …Proprietary metrics and scorecards… } … Quality definition? What is translation quality? All of them disagree on what quality is. Would you dare to give a universal definition, considering that each of these authors had their own idea of what it is?
  3. A THEORY OF BIG GAME HUNTING. PROBLEM: To catch a lion in the Sahara Desert. SOLUTION: THE BOLZANO-WEIERSTRASS METHOD. Divide the desert by a line running from north to south. The lion is then either in the eastern or in the western part. Let's assume it is in the eastern part. Divide this part by a line running from east to west. The lion is either in the northern or in the southern part. Let's assume it is in the northern part. We can continue this process arbitrarily, with each step constructing an increasingly narrow fence around the selected area. The diameter of the chosen partitions converges to zero, so that the lion is caged inside a fence of arbitrarily small diameter.
  4. GENERAL CONSIDERATIONS: Reflecting the perception and priorities of the target audience. Means: A. Concentrating on the factors that make the strongest impression. B. Separating global (holistic) and local issues, with the former typically being more important and playing a bigger role.
  5. GENERAL CONSIDERATIONS: Universal applicability. Means: A. Covering the whole spectrum of potential uses, subject areas, and materials. B. From slightly post-edited MT to ultra-polished manual translations. C. A common approach. D. The same approach for technical materials and marketing content. E. Only acceptance criteria/thresholds are adjusted, based on expectations. We are all human and, irrespective of what exactly we are looking at, be it a restaurant menu or drug usage guidelines, we make our first judgment about text quality using exactly the same criteria. We do not need a different approach or a completely new metric for each subject area or type of content. In reality, the only thing that requires adjustment is the tolerance level. We are ready to accept a barely comprehensible menu translation, but expect perfect clarity and no ambiguity in the medical area. In technical terms, this means that we are still measuring the same thing, i.e. readability/clarity, just with different expectations, and the same applies to all other criteria.
  6. GENERAL CONSIDERATIONS: Viability of methodology. Means: A. It should be clear and not overly complicated. B. It should be process-friendly, i.e. reasonably economical and applicable in the real world.
  7. GENERAL CONSIDERATIONS: Flexibility of approach. Means: A. Concentrating on methodology rather than on particular cases/uses. B. Issue typology is not an inalienable part of the methodology, but rather an add-on component. It can be based, for instance, on MQM or another source, or on legacy criteria, including those used/provided by the client. C. Weights assigned to particular issues are expected to vary within a wide range depending on the goals set, subject matter, type of material, etc. Particular issues might simply prove irrelevant for the job or area of focus, in which case these issues are assigned zero weights. The client knows what types of issues are important for their content.
  8. ENTIRETY OF IMPRESSION: The reader/consumer is primarily interested in the overall readability and adequacy of the whole piece, and only then in the readability of its parts (sentences).
  9. TWO KEY FACTORS: Adequacy and Readability
  10. THRESHOLD OF ACCEPTANCE …is determined by usability expectations. The expectation of how readable and adequate the translated content should be determines the acceptable quality level for these two cornerstone factors.
  11. GRADING: If a piece has a serious defect, it has to be discarded without wasting time on further analysis. If the text is inadequate or unreadable, it makes no sense to count typos or check whether the terminology is right. (Diagram: good stuff vs. substandard.)
  12. Acceptance threshold: Neither readability nor adequacy is 100% objective. How can we deal with this lack of complete objectivity in a real-world scenario, when no reference translations are available, a single reviewer can only look at a certain percentage of the overall content, and we still need to evaluate and grade translated texts? The solution lies in evaluating each of the two major holistic criteria (readability and adequacy) separately, on a PASS/FAIL basis. The logical thing to do is to establish an acceptance threshold that corresponds to the lower end of the statistical range.
  13. The scale from 0 to 10: A smaller scale will not fit the bell curve. An important and direct consequence is that the scale used for holistic translation ratings should be at least 0 to 10, and by no means smaller.
  14. Atomistic Quality: Fluency (content), Fluency (mechanical), Spelling, Style Guide, Typography, Grammar, Locale convention …, Inconsistency, Idiomatic, Duplication, Ambiguity, Accuracy, Mistranslation, Omission, Addition, Untranslated, Printing, Copying, Color and black and white digital printing, Internationalization, Compatibility (other), Design, Global font choice, Headers and footers, Margins, Page break, Kerning …
  15. MQM Tree
  16. Atomistic Quality: $QA = \frac{\sum_{i=0}^{n} N_i \cdot W_i}{V}$
  17. Quality Triangle?
  18. SHOWSTOPPER PROBLEM
  19. …or quality square! There are things that you will know when you see them… Showstoppers…
  20. Building the concrete LQA metrics (five-step diagram): The methodology fully covers all types of translated content, including content produced using MT and/or MT plus post-editing.
  21. Applying LQA metrics correctly. Three-dimensional vector: holistic readability → readability threshold, Pass/Fail; holistic adequacy → adequacy threshold, Pass/Fail; atomic compound detailed metrics → atomistic rating, detailed score. Implementation keys: holistic parameters cannot be mixed; only materials that pass the holistic parameters (HP) are analyzed further; experts are required to produce a precise and reliable atomistic score; select the content to which the metrics are applied.
  22. In the vast majority of real-life cases, nobody can afford the luxury of employing an expert panel to evaluate the translation quality of any particular document or web portal. LSPs typically have to use a single reviewer who only looks at a certain percentage of the content. To produce meaningful, reliable results despite this limitation, proper sampling must be done.
  23. HOW FASTIDIOUS ARE YOU? (Chart at a 95% confidence level: sample size in words and percentage of content to be checked versus total volume (100 to 1,000,000 words), for confidence intervals of 0.5%, 0.25%, and 0.125%.)
  24. THE LUXURY OF FULL METRICS. Develop metrics (significant research): know your area; sustain R&D; develop metrics; develop processes; how much and what to QA? Build the supply chain: professional LSP; professional linguists; provide training; provide reference materials. Pay to apply: terminology maintenance and support; translation memory maintenance and management; localization quality assurance costs.
  25. The scene of PUBLIC SITE
  26. THE NEED AND THE CONSTRAINTS. CONSTRAINTS: A. Professional LQA would require a global federal program to develop applicable LQA metrics, allocate funding, and book professional LQA with specially trained LSPs. B. Yet there is still an acute need for it, as demonstrated by the CUIDADO DE SALUD website of the Affordable Care Act. C. A methodology for public LQA is very much needed. D. There IS feedback on the site from the public; how should it be handled? Executive Order 13166 (http://www.justice.gov/crt/about/cor/13166.php, www.lep.gov) “requires Federal agencies to examine the services they provide, identify any need for services to those with limited English proficiency (LEP), and develop and implement a system to provide those services so LEP persons can have meaningful access to them”.
  27. CONSTRAINTS of public feedback. The crowd: cannot be trained; is not ready to spend a lot of time; is opinionated by definition. The feedback: is limited in volume; is random by nature; uses arbitrary issue classification; can involve a large number of reviewers. The approach: using statistics to turn the tables and gain in another area what we have lost.
  28. THE METRICS: 1. Quality square approach: there MAY be showstopper errors. 2. The parameters are simplified (no detailed issue definitions): no detailed atomistic quality issue definitions can be applied. 3. Each reviewer produces four ratings on a 0–10 scale. The 0–10 scale is the smallest one that accommodates the bell curve. (Each reviewer is asked to provide examples.) 4. Calibration: (a) Showstopper: 0 = two or more major errors, 10 = no major errors; (b) Holistic readability (fluency): 0 = incomprehensible, 10 = a poem; (c) Holistic adequacy (accuracy): 0 = inadequate, 10 = perfectly conveying meaning; (d) Atomistic (small specific errors): 0 = full of small errors, 10 = completely error-free. For crowdsourced LQA, the atomistic quality category is not formalized in any way whatsoever.
  29. THE PROCESS: 1. The LQA review scope is defined and explained briefly and clearly, to prevent reviewers from straying into other areas. 2. The content needs to be final: updates and scope changes are outside the scope of crowdsourced review. 3. Communication is done via a simple online portal; there is no bandwidth to manage the crowd manually. 4. It is better if the volunteers are language professionals; this would compensate for the lack of special training. 5. Proper sampling: no fewer than 10 reviewers for each area; the more, the better. 6. Proper processing: the results are manually vetted to remove outliers: discard outliers without explanation and obvious reviewer errors; check whether major errors are statistically significant (a 30% threshold instead of 5% is recommended); apply statistics to analyze the results. Example analysis: …an average Readability Rating of 6.2 out of 10 with a standard deviation of 2.2, and an average Adequacy Rating of 6.5 out of 10 with a standard deviation of 1.9. The conclusion would be: the text is readable (rating above 5), but barely so, and leaves much to be desired in view of its importance and high level of public exposure. Again, it is up to the expert doing the analysis to define the threshold, for example, that for this type of content a proper target for average readability is at least 8 out of 10. …the adjusted value for technical errors of 4.7 out of 10 for the average atomistic quality rating, with a standard deviation of 2.4. Although the review was by design a less-than-ideal community feedback-based LQA, resulting in rating inconsistency among reviewers, most reviewers found too many noticeable and annoying technical/minor mistakes in the text, as reflected in the low average rating, which is unsatisfactory. Substantial remedial work is clearly called for in this area.
  30. THE ONLY PUBLIC LQA METRICS AVAILABLE. Is it appropriate? POSITIVES: • Both holistic measures can be relied upon with reasonable confidence • Good overall assessment • Allows showstoppers to be identified • Gives a good general idea of the level of technical errors • Affordable and available for US federal agencies. CONTRA: • Only a rough judgment • Not a good quantitative assessment • Not a complete roster of errors, even in the selected sample • No concrete process recommendations. YOU DECIDE!
  31. MORE INFORMATION ABOUT MQM: http://www.qt21.eu/mqm-definition/definition-2014-08-19.html
  32. MORE INFORMATION ABOUT METHODOLOGY
  33. WK46396= WK46397= The Proposal MQM The guide for LQA Methodology
  34. THANK YOU! sgladkoff@logrus.net sgladkoff@gala-global.org
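
Slide 16 gives the atomistic score as QA = (sum of N_i * W_i over i = 0..n) / V but does not define its symbols. A minimal sketch, assuming N_i is the number of issues of type i found in the reviewed sample, W_i is the weight assigned to that issue type, and V is the sample volume in words. The function name, issue names, and weights below are hypothetical; in line with slide 7, weights are project-specific and an irrelevant issue type simply gets weight zero.

```python
from typing import Mapping


def atomistic_score(counts: Mapping[str, int],
                    weights: Mapping[str, float],
                    volume_words: int) -> float:
    """Weighted issue count per word: sum(N_i * W_i) / V.

    Issue types absent from `weights` get weight 0, i.e. they are treated
    as irrelevant for this job.
    """
    if volume_words <= 0:
        raise ValueError("volume must be a positive number of words")
    weighted = sum(n * weights.get(issue, 0.0) for issue, n in counts.items())
    return weighted / volume_words


# Hypothetical weights and issue counts for a 2,000-word sample.
weights = {"mistranslation": 5.0, "omission": 5.0, "spelling": 1.0, "kerning": 0.0}
counts = {"mistranslation": 2, "spelling": 6, "kerning": 3}
score = atomistic_score(counts, weights, 2000)
print(score)  # 0.008
```

Under this reading a lower score is better (fewer weighted issues per word), and scores for samples of different sizes become comparable because of the division by V.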
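
Slides 11 and 21 describe a gated evaluation: each holistic parameter is compared with its own threshold, and only material that passes both goes on to the detailed atomistic analysis. A minimal sketch of that control flow; the `evaluate` function, the `LqaResult` type, and the threshold values are hypothetical, and the atomistic scorer is passed in as a callable so that it is only invoked for material that passes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LqaResult:
    readability_pass: bool
    adequacy_pass: bool
    atomistic: Optional[float]  # None when a holistic check failed

    @property
    def accepted(self) -> bool:
        return self.readability_pass and self.adequacy_pass


def evaluate(readability: float, adequacy: float,
             readability_threshold: float, adequacy_threshold: float,
             atomistic_scorer: Callable[[], float]) -> LqaResult:
    """Gate on the two holistic ratings before any detailed (atomistic) analysis."""
    # Each holistic parameter is compared with its own threshold; they are never mixed.
    readable = readability >= readability_threshold
    adequate = adequacy >= adequacy_threshold
    if not (readable and adequate):
        # Substandard: discard without spending expert time on counting small errors.
        return LqaResult(readable, adequate, None)
    return LqaResult(readable, adequate, atomistic_scorer())


# Hypothetical thresholds of 6 out of 10 for both holistic parameters.
print(evaluate(7.5, 4.0, 6.0, 6.0, lambda: 0.008))
# LqaResult(readability_pass=True, adequacy_pass=False, atomistic=None)
```

Keeping the two pass/fail flags separate, rather than averaging readability and adequacy into one number, preserves the "holistic parameters cannot be mixed" rule: a very readable but inadequate text still fails.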
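
Slide 23 plots, at a 95% confidence level, how many words must be checked as total volume grows, but the formula behind the chart is not stated. One standard way to produce curves of this shape, assumed here, is Cochran's sample-size formula with a finite-population correction, treating each word as a sampling unit and reading "CI 0.5%" as a margin of error of ±0.5%. The sketch illustrates the trend (the share of content to be checked falls as volume rises) and is not guaranteed to reproduce the chart's exact values.

```python
import math


def sample_size(total_words: int, margin: float,
                z: float = 1.96, p: float = 0.5) -> int:
    """Number of words to check (Cochran's formula with finite population correction).

    margin: half-width of the confidence interval as a fraction (0.005 = +/-0.5%).
    z: 1.96 corresponds to a two-sided 95% confidence level.
    p: expected proportion; 0.5 is the most conservative choice.
    """
    n0 = z * z * p * (1 - p) / (margin * margin)  # sample size for an infinite population
    n = n0 / (1 + (n0 - 1) / total_words)         # finite population correction
    return min(total_words, math.ceil(n))


for total in (1_000, 10_000, 100_000, 1_000_000):
    for margin in (0.005, 0.0025):
        n = sample_size(total, margin)
        print(f"{total:>9,} words, +/-{margin:.2%}: check {n:>7,} words ({n / total:.0%})")
```

For small jobs the formula asks for nearly everything to be checked, which matches the chart's left-hand side; only at large volumes does sampling save substantial review effort.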
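
Slide 28's four ratings and their calibration anchors map naturally onto a small record type. A minimal sketch, assuming integer ratings and a free-text list of examples per reviewer; the `CrowdRating` class and the `CALIBRATION` table are hypothetical names, while the anchor texts are taken from the slide.

```python
from dataclasses import dataclass, field
from typing import List

# Calibration anchors from the slide: (meaning of 0, meaning of 10).
CALIBRATION = {
    "showstopper": ("two or more major errors", "no major errors"),
    "readability": ("incomprehensible", "a poem"),
    "adequacy": ("inadequate", "perfectly conveying meaning"),
    "atomistic": ("full of small errors", "completely error-free"),
}


@dataclass
class CrowdRating:
    """One reviewer's four ratings on the 0-10 scale, plus supporting examples."""
    reviewer: str
    showstopper: int
    readability: int
    adequacy: int
    atomistic: int
    examples: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject anything outside the 0-10 scale before it reaches the statistics.
        for name in CALIBRATION:
            value = getattr(self, name)
            if not 0 <= value <= 10:
                raise ValueError(f"{name} must be on the 0-10 scale, got {value}")
```

Storing the examples alongside the numbers matters later: an extreme rating backed by examples can be checked, while an unexplained one is a candidate for removal during vetting.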
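
Slide 29 describes manual vetting of crowd ratings followed by basic statistics (mean, standard deviation, comparison with an expert-defined target). A minimal sketch of that last step, under stated assumptions: ratings for one area arrive as (rating, has_example) pairs; an "unexplained outlier" is taken to mean a rating more than two standard deviations from the mean with no example attached (the two-standard-deviation cut-off is a hypothetical choice, not from the presentation); the floor of 5 and the target of 8 follow the slide's readability example. The significance check for major errors is not modelled.

```python
import statistics
from typing import Dict, List, Sequence, Tuple

MIN_REVIEWERS = 10  # slide 29: no fewer than 10 reviewers per area


def vet(ratings: Sequence[Tuple[float, bool]], z_max: float = 2.0) -> List[float]:
    """Drop unexplained outliers from (rating, has_example) pairs.

    Hypothetical rule: a rating is discarded only if it lies more than z_max
    sample standard deviations from the mean AND the reviewer gave no example.
    """
    values = [rating for rating, _ in ratings]
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [rating for rating, has_example in ratings
            if has_example or abs(rating - mu) <= z_max * sd]


def summarize(ratings: Sequence[Tuple[float, bool]], target: float,
              floor: float = 5.0) -> Dict[str, object]:
    if len(ratings) < MIN_REVIEWERS:
        raise ValueError(f"need at least {MIN_REVIEWERS} reviewers, got {len(ratings)}")
    kept = vet(ratings)
    mean = statistics.mean(kept)
    return {
        "n": len(kept),
        "mean": round(mean, 1),
        "stdev": round(statistics.stdev(kept), 1),
        "above_floor": mean > floor,     # e.g. "readable, but barely so"
        "meets_target": mean >= target,  # expert-defined target, e.g. 8 out of 10
    }


# Hypothetical readability ratings from 12 reviewers.
readability = [(7, True), (6, False), (8, True), (5, True), (6, False), (7, False),
               (4, True), (9, True), (6, False), (7, True), (5, False), (0, False)]
print(summarize(readability, target=8.0))
```

In this made-up sample the lone unexplained 0 is dropped, and the remaining average clears the floor of 5 but misses the target of 8, which is the same kind of "readable, but barely so" verdict the slide describes.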
