The document discusses the development of new ASTM work items WK46396 and WK46397 regarding language quality assurance (LQA) methodology. It proposes a unified LQA methodology based on a three-dimensional evaluation of translation quality, including holistic ratings of readability and adequacy, as well as atomistic/detailed metrics. The methodology aims to provide flexibility for different content types and acceptance criteria while ensuring reliability from single reviewers or crowdsourced feedback.
1. The science of
Language Quality Assurance
What’s behind two new ASTM work items,
WK46396 and WK46397
Serge Gladkoff,
(GALA, Logrus International)
Chicago, November 5, 2014
2. LISA QA Model
SAE J2450
SDL TMS
Acrocheck
ApSIC XBench
CheckMate
QA Distiller
XLIFF:Doc
EN15038
{ …Proprietary metrics and
scorecards… }
…
What is translation quality? All of these standards and tools disagree on what quality is. Would you dare give a universal definition, considering that each of their authors had their own idea of what it is?
Quality
Definition?
3. A THEORY OF BIG GAME HUNTING
PROBLEM
To Catch a Lion in the Sahara Desert.
SOLUTION: THE BOLZANO-WEIERSTRASS
METHOD
Divide the desert by a line running from north to south. The lion is then either in the eastern or in the western part. Let's assume it is in the eastern part. Divide this part by a line running from east to west. The lion is either in the northern or in the southern part. Let's assume it is in the northern part. We can continue this process arbitrarily, constructing with each step an increasingly narrow fence around the selected area. The diameter of the chosen partitions converges to zero, so the lion ends up caged inside a fence of arbitrarily small diameter.
4. GENERAL CONSIDERATIONS
Reflecting the perception and priorities of the target audience
Means:
A. Concentrating on the factors that make the strongest impression
B. Separating global (holistic) and local issues, with the former typically being more important and playing a bigger role
5. GENERAL CONSIDERATIONS
Universal applicability
Means:
A. Covering the whole spectrum of potential uses, subject areas, and materials
B. From slightly post-edited MT to ultra-polished manual translations
C. A common approach
D. The same approach to technical materials and marketing content
E. Only acceptance criteria / thresholds are adjusted based on expectations
We are all human and, irrespective of what exactly we are looking at, be it a restaurant menu or drug usage guidelines, we make our first judgment about text quality using exactly the same criteria. We do not need a different approach or a completely new metric for each subject area or type of content. In reality, the only thing that requires adjustment is the tolerance level. We are ready to accept a barely comprehensible menu translation, but expect perfect clarity and lack of ambiguity in the medical area. In technical terms, this means that we are still measuring the same thing, i.e. readability/clarity, but with different expectations, and this approach applies to all other criteria.
6. GENERAL CONSIDERATIONS
Viability of methodology
Means:
A. Should be clear, not overly complicated
B. Should be process-friendly, i.e. reasonably economical and applicable to the real world
7. GENERAL CONSIDERATIONS
Flexibility of approach
Means:
A. Concentrating on methodology rather than particular cases/uses.
B. Issue typology is not an inalienable part of the methodology, but rather an add-on component. It can be based, for instance, on MQM or another source, or on legacy criteria, including those used/provided by the client.
C. Weights assigned to particular issues are expected to vary within a wide range depending on the goals set, subject matter, type of material, etc. Particular issues might simply prove irrelevant for the job or area of focus, which results in zero weights being assigned to these issues.
The client knows what types of issues are important to his content.
8. ENTIRETY OF IMPRESSION
The reader/consumer is primarily interested in the overall readability and adequacy of the whole piece, and only then in the readability of its parts (sentences).
10. THRESHOLD OF ACCEPTANCE
…is determined by usability expectations
The expectation of how readable and adequate the translated content should be determines the acceptable quality level for these key cornerstone factors.
11. GRADING
If a piece has a serious defect, it has to be discarded without wasting time on further analysis. If a text is inadequate or unreadable, it does not make sense to count typos or check whether the terminology is right.
Good stuff
Substandard
12. Acceptance threshold
Neither Readability nor Adequacy is 100% objective.
How can we deal with this lack of complete objectivity in a real-world scenario, when no reference translations are available, there is a single reviewer who can only look at a certain percentage of the overall content, and we still need to evaluate and grade translated texts?
The solution lies in evaluating each of the two major holistic criteria (readability and adequacy) separately, on a PASS/FAIL basis. The logical thing to do is to establish an acceptance threshold corresponding to the lower end of the statistical range.
13. The scale from 0 to 10
A smaller scale will not fit the Bell curve.
An important and direct consequence is that the scale used for holistic translation ratings should run at least from 0 to 10, and by no means be smaller.
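As a rough illustration of why a smaller scale cannot hold the Bell curve (the normality assumption and the concrete numbers here are mine, motivated by the standard deviations of roughly 2 points quoted later in this deck, not something the deck computes):

```python
# Sketch: a bell curve of reviewer ratings with a standard deviation near 2
# points needs roughly mean +/- 2 standard deviations of room, i.e. close to
# 8 points of span. Only a 0-10 (or larger) scale can accommodate that.
# The mean/stddev values below are illustrative assumptions.

def rating_range(mean: float, stddev: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate two-sided 95% range of individual ratings."""
    return (mean - z * stddev, mean + z * stddev)

low, high = rating_range(mean=6.0, stddev=2.0)
span = high - low  # ~7.8 points, nearly the full width of a 0-10 scale
print(round(low, 1), round(high, 1), round(span, 1))
```

On a 0-5 scale the same distribution would be clipped at both ends, distorting both averages and standard deviations.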
14. Atomistic Quality
Fluency (content): Inconsistency, Idiomatic, Duplication, Ambiguity, …
Fluency (mechanical): Spelling, Style Guide, Typography, Grammar, Locale convention, …
Accuracy: Mistranslation, Omission, Addition, Untranslated
Printing: Copying, Color and black-and-white digital printing
Internationalization
Compatibility (other)
Design: Global font choice, Headers and footers, Margins, Page break, Kerning, …
19. Showstoppers… or the quality square!
There are things that you will know when you see them…
20. Building the concrete LQA metrics
The methodology fully covers all types of translated content, including those produced using MT and/or MT + post-editing.
21. Applying LQA metrics
Applying it correctly: a three-dimensional vector:
- Holistic readability: readability threshold, Pass/Fail
- Holistic adequacy: adequacy threshold, Pass/Fail
- Atomistic compound detailed metrics: atomistic rating, detailed score
Implementation keys:
- Holistic parameters cannot be mixed
- Only those materials that pass the holistic parameters are analyzed further
- Experts are required to produce a precise and reliable atomistic score
- Select the content to apply the metrics to
22. In the vast majority of real-life cases,
nobody can afford the luxury of
employing an expert panel to evaluate
the translation quality of any particular
document or web portal. LSPs typically
have to use a single reviewer who only
looks at a certain percentage of the
content. To produce meaningful, reliable results despite this limitation, proper sampling must be done.
23. HOW FASTIDIOUS ARE YOU?
[Chart: required sample size vs. total volume at a 95% confidence level. Horizontal axis: total volume in words (100 to 1,000,000, logarithmic). Left vertical axis: percentage of the total volume to be checked (0% to 100%). Right vertical axis: sample size in words (logarithmic). Curve pairs are plotted for 0.125%, 0.25%, and 0.5% confidence intervals.]
24. THE LUXURY OF FULL METRICS
Localization Quality Assurance costs:
Develop metrics (significant research):
- Know your area
- Sustain R&D
- Develop metrics
- Develop processes
- How much and what to QA?
Build the supply chain:
- Professional LSP
- Professional linguists
- Provide training
- Provide reference materials
Pay to apply:
- Terminology maintenance & support
- Translation Memory maintenance and management
26. THE NEED AND THE CONSTRAINTS
CONSTRAINTS
A. Professional LQA would require a global federal program to develop applicable LQA metrics, allocate funding, and book professional LQA with specially trained LSPs.
B. Yet there is still an acute need for it, as demonstrated by the CUIDADO DE SALUD web site of the Affordable Care Act.
C. A methodology for public LQA is very much needed.
D. There IS feedback on the site from the public; how is it to be handled?
Executive Order 13166
http://www.justice.gov/crt/about/cor/13166.php
www.lep.gov
“requires Federal agencies to examine the services they provide, identify any need for services to those with limited English proficiency (LEP), and develop and implement a system to provide those services so LEP persons can have meaningful access to them”
27. CONSTRAINTS of public feedback
The Crowd:
- Cannot be trained
- Is not ready to spend a lot of time
- Is opinionated by definition
The Feedback:
- Is limited in volume
- Is random by nature
- Uses arbitrary issue classification
- Can be large in the number of reviewers
The Approach:
- Use a statistical approach to turn the tables and gain in another area what we have lost
28. THE METRICS
1. Quality square approach
There MAY be showstopper errors.
2. The parameters are simplified (no detailed issue definitions)
No detailed atomistic quality issue definitions can be applied.
3. Each reviewer produces four ratings on a 0-10 scale
The 0-10 scale is the smallest one that accommodates the Bell curve. (Each reviewer is asked to provide examples.)
4. The calibration:
(a) Showstopper: 0 = two or more major errors, 10 = no major errors
(b) Holistic readability (fluency): 0 = incomprehensible, 10 = a poem
(c) Holistic adequacy (accuracy): 0 = inadequate, 10 = perfectly conveying the meaning
(d) Atomistic (small specific errors): 0 = full of small errors, 10 = completely error-free
For crowdsourced LQA, the atomistic quality category is not formalized in any way whatsoever.
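One reviewer's submission under this four-rating metric might be captured as in the following sketch; the class and field names are illustrative choices of mine, not part of the work items:

```python
# Sketch: one crowd reviewer's record under the four-rating metric above,
# with the 0-10 calibration enforced. Names are illustrative.
from dataclasses import dataclass

@dataclass
class ReviewerRating:
    showstopper: int   # 0 = two or more major errors, 10 = no major errors
    readability: int   # holistic fluency: 0 = incomprehensible, 10 = a poem
    adequacy: int      # holistic accuracy: 0 = inadequate, 10 = perfect meaning
    atomistic: int     # small specific errors: 0 = full of them, 10 = error-free
    examples: str = "" # each reviewer is asked to provide examples

    def __post_init__(self):
        for name in ("showstopper", "readability", "adequacy", "atomistic"):
            value = getattr(self, name)
            if not 0 <= value <= 10:
                raise ValueError(f"{name} must be on the 0-10 scale, got {value}")

r = ReviewerRating(showstopper=10, readability=6, adequacy=7, atomistic=5,
                   examples="several untranslated strings on the landing page")
```

Collecting records in this shape is what makes the statistical processing of the next slide possible.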
29. THE PROCESS
1. The LQA review scope is defined and briefly and clearly explained
To prevent reviewers from straying into other areas.
2. The content needs to be final
Updates and scope changes are outside the scope of crowdsourced review.
3. Communication is done via a simple online portal
There is no bandwidth to manage the crowd manually.
4. It is better if volunteers are language professionals
This compensates for the lack of special training.
5. Proper sampling
No fewer than 10 reviewers for each area; the more, the better.
6. Proper processing
The results are manually vetted to remove outliers:
- discard outliers without explanation and obvious reviewer errors
- are major errors statistically significant? A 30% threshold instead of 5% is recommended.
- apply statistics to analyze the results

Example results: an average Readability Rating of 6.2 out of 10 with a standard deviation of 2.2, and Adequacy of 6.5 out of 10 with a standard deviation of 1.9. The conclusion would be: the text is readable (rating above 5), but barely so, and leaves much to be desired in view of its importance and high level of public exposure. Again, it is up to the expert doing the analysis to define the threshold: for example, that for this type of content a proper target for average readability is at least 8 out of 10.
The adjusted value for technical errors was 4.7 out of 10 for the average atomistic quality rating, with a 2.4 standard deviation. Despite the fact that the review was by design a less than ideal community feedback-based LQA, resulting in rating inconsistency among reviewers, most reviewers found too many noticeable and annoying technical/minor mistakes in the text, as reflected in the low average rating, which is unsatisfactory. Substantial remedial work is clearly called for in this area.
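The vetting and averaging of step 6 can be sketched as below; the concrete outlier rule (drop ratings more than two standard deviations from the raw mean) and the verdict wording are illustrative choices of mine, not prescribed by the deck:

```python
# Sketch of "proper processing": vet crowd ratings by discarding outliers,
# then report the mean and standard deviation per holistic parameter.
from statistics import mean, stdev

def vet_and_summarize(ratings: list[float]) -> tuple[float, float]:
    """Drop ratings more than 2 standard deviations from the raw mean
    (an illustrative rule), then return (mean, stdev) of the rest."""
    m, s = mean(ratings), stdev(ratings)
    vetted = [r for r in ratings if abs(r - m) <= 2 * s] or ratings
    return round(mean(vetted), 1), round(stdev(vetted), 1)

# Ten reviewers; one unexplained zero rating is treated as an outlier.
readability = [6, 7, 5, 8, 6, 7, 5, 6, 7, 0]
avg, sd = vet_and_summarize(readability)
# A conclusion in the spirit of slide 29: readable (above 5), but barely so.
verdict = ("meets target" if avg >= 8
           else "readable, but barely" if avg > 5
           else "unreadable")
print(avg, sd, verdict)
```

With real data the expert would run this per parameter (readability, adequacy, atomistic) and compare each average against the threshold chosen for the content type.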
30. THE ONLY PUBLIC LQA METRICS AVAILABLE
Is it appropriate? YOU DECIDE!
POSITIVES:
- Both holistic measures can be relied upon with reasonable confidence
- Good overall assessment
- Allows showstoppers to be identified
- Gives a good general idea of the level of technical errors
- Affordable and available to US federal agencies
CONTRA:
- Only a rough judgment
- Not a good quantitative assessment
- Not a complete roster of errors, even in the selected sample
- No concrete process recommendations
The results are presented on this graph. It's not as complex as it might seem at first glance. Let me explain.
The horizontal axis represents volume in words, between 100 on the left and 1,000,000 on the right. A logarithmic scale is used.
The left vertical axis displays the percentage of the total volume to be checked (between 0% and 100%), while the right vertical axis shows the sample size in words and uses a logarithmic scale for obvious reasons.
Now let’s proceed to the curves themselves. The ones starting at 100% and decreasing represent the percentage of the total volume to be checked.
The ones starting at 100 words and going up represent the word count to be checked, depending on the total volume.
The color of the curves represents different error margins: the higher the precision, the more we have to check. The red curves in the middle represent the medium error margin of a quarter of a percent. The green curves correspond to a tighter error margin (1/8 of a percent), and the blue ones to a more relaxed half-a-percent error margin.
Of course, we might be more or less fastidious depending on the situation, but generally it is recommended to stay between the blue and green curves. If we are checking more, we are probably checking too much, and if we are checking less, we are probably checking too little.
Now, let’s go over some specifics:
When volumes are low, below 10,000 words, the percentage to be checked is close to 100%. This means we have to check everything, which is quite reasonable since the overall volume is very low. We simply can't make reliable assumptions about the quality of the material using a small sample.
In the midrange between 20,000 and 200,000 words, the percentage to be checked starts going down.
When the volume exceeds 300,000 words the curves reach saturation. It means that the volume to be checked stays flat almost irrespective of the overall volume, somewhere between 100,000 and 150,000 words. This volume can be divided or multiplied by a factor of 3 depending on the precision we want to achieve.
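The shape of these curves, including the saturation, follows from the standard sample-size formula for estimating a proportion, with a finite population correction. The sketch below assumes the usual worst-case p = 0.5 and z = 1.96 for a 95% confidence level; that is an inference about how the chart was built, not something stated in the deck:

```python
# Sketch: words to check at a 95% confidence level, reproducing the shape of
# the chart above. Cochran's formula with finite population correction;
# p = 0.5 (worst case) and z = 1.96 are standard assumptions.
import math

def sample_size(total_words: int, margin: float,
                z: float = 1.96, p: float = 0.5) -> int:
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population sample size
    n = n0 / (1 + (n0 - 1) / total_words)       # finite population correction
    return math.ceil(n)

for total in (10_000, 100_000, 1_000_000):
    n = sample_size(total, margin=0.0025)       # 0.25% error margin (red curves)
    print(f"{total:>9,} words -> check {n:,} ({n / total:.0%})")
```

At 10,000 words the formula demands nearly the whole text; by 1,000,000 words it flattens out in the 100,000-150,000-word range, matching the saturation described above. Tightening the margin to 0.125% or relaxing it to 0.5% scales the required sample up or down, roughly by the factor of 3 mentioned in the notes.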