The document discusses the development of new ASTM work items WK46396 and WK46397 regarding language quality assurance (LQA) methodology. It proposes a unified LQA methodology based on a three-dimensional evaluation of translation quality, including holistic ratings of readability and adequacy, as well as atomistic/detailed metrics. The methodology aims to provide flexibility for different content types and acceptance criteria while ensuring reliability from single reviewers or crowdsourced feedback.
1. The science of
Language Quality Assurance
What’s behind two new ASTM work items,
WK46396 and WK46397
Serge Gladkoff,
(GALA, Logrus International)
Chicago, November 5, 2014
2. LISA QA Model
SAE J2450
SDL TMS
Acrocheck
ApSIC XBench
CheckMate
QA Distiller
XLIFF:Doc
EN15038
{ …Proprietary metrics and
scorecards… }
…
What is translation quality? All of these standards and tools disagree on what quality is. Would you dare give a universal definition, considering that each of their authors had their own idea of what it is?
Quality
Definition?
3. A THEORY OF BIG GAME HUNTING
PROBLEM
To Catch a Lion in the Sahara Desert.
SOLUTION: THE BOLZANO-WEIERSTRASS
METHOD
Divide the desert by a line running from north to south. The lion is then either in the eastern or in the western part. Let's assume it is in the eastern part. Divide this part by a line running from east to west. The lion is either in the northern or in the southern part. Let's assume it is in the northern part. We can continue this process arbitrarily, constructing with each step an increasingly narrow fence around the selected area. The diameter of the chosen partitions converges to zero, so the lion ends up caged inside a fence of arbitrarily small diameter.
4. GENERAL CONSIDERATIONS
Reflecting the perception and priorities of the target audience
Means:
A. Concentrating on the factors that make the strongest impression
B. Separating global (holistic) and local issues, with the former typically being more important and playing a bigger role
5. GENERAL CONSIDERATIONS
Universal applicability
Means:
A. Covering the whole spectrum of potential uses, subject areas, and materials
B. From slightly post-edited MT to ultra-polished manual translations
C. A common approach
D. The same approach to technical materials and marketing content
E. Only acceptance criteria / thresholds are adjusted based on expectations
We are all human and, irrespective of what exactly we are looking at, be it a restaurant menu or drug usage guidelines, we make our first judgment about text quality using exactly the same criteria. We do not need a different approach or a completely new metric for each subject area or type of content. In reality, the only thing that requires adjustment is the tolerance level. We are ready to accept a barely comprehensible menu translation, but expect perfect clarity and lack of ambiguity in the medical area. In technical terms, this means that we are still measuring the same thing, i.e. readability/clarity, but with different expectations, and this approach applies to all other criteria.
6. GENERAL CONSIDERATIONS
Viability of methodology
Means:
A. Should be clear, not overly complicated
B. Should be process-friendly, i.e. reasonably economical and applicable to the real world
7. GENERAL CONSIDERATIONS
Flexibility of approach
Means:
A. Concentrating on methodology rather than particular cases/uses.
B. Issue typology is not an inalienable part of the methodology, but rather an add-on component. It can be based, for instance, on MQM or another source, or on legacy criteria, including those used/provided by the client.
C. Weights assigned to particular issues are expected to vary within a wide range depending on the goals set, subject matter, type of material, etc. Particular issues might simply prove irrelevant for the job or area of focus, which results in zero weights being assigned to these issues.
The client knows what types of issues are important to his content.
8. ENTIRETY OF IMPRESSION
The reader/consumer is primarily interested in the overall readability and adequacy of the whole piece, and only then in the readability of its parts (sentences).
10. THRESHOLD OF ACCEPTANCE
…is determined by usability expectations
The expectation of how readable and adequate the translated content should be determines the acceptable quality level for these key cornerstone factors.
11. GRADING
If a piece has a serious defect, it has to be discarded without wasting time on further analysis. If a text is inadequate or unreadable, it does not make sense to count typos or check whether the terminology is right.
Good stuff
Substandard
12. Acceptance threshold
Neither Readability nor Adequacy is 100% objective.
How can we deal with this lack of complete objectivity in a real-world scenario, when no reference translations are available, there is a single reviewer who can only look at a certain percentage of the overall content, and we still need to evaluate and grade translated texts?
The solution lies in evaluating each of the two major holistic criteria (readability and adequacy) separately, on a PASS/FAIL basis. The logical thing to do is to establish an acceptance threshold corresponding to the lower end of the statistical range.
13. The scale from 0 to 10
A smaller scale will not fit the Bell curve.
An important and direct consequence is that the scale used for holistic translation ratings should run at least from 0 to 10, and by no means be smaller.
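As a rough illustration of why a smaller scale cannot hold the Bell curve (the normality assumption and the concrete numbers here are mine, motivated by the standard deviations of roughly 2 points quoted later in this deck, not something the deck computes):

```python
# Sketch: a bell curve of reviewer ratings with a standard deviation near 2
# points needs roughly mean +/- 2 standard deviations of room, i.e. close to
# 8 points of span. Only a 0-10 (or larger) scale can accommodate that.
# The mean/stddev values below are illustrative assumptions.

def rating_range(mean: float, stddev: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate two-sided 95% range of individual ratings."""
    return (mean - z * stddev, mean + z * stddev)

low, high = rating_range(mean=6.0, stddev=2.0)
span = high - low  # ~7.8 points, nearly the full width of a 0-10 scale
print(round(low, 1), round(high, 1), round(span, 1))
```

On a 0-5 scale the same distribution would be clipped at both ends, distorting both averages and standard deviations.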
14. Atomistic Quality
Fluency (content): Inconsistency, Idiomatic, Duplication, Ambiguity, …
Fluency (mechanical): Spelling, Style Guide, Typography, Grammar, Locale convention, …
Accuracy: Mistranslation, Omission, Addition, Untranslated
Printing: Copying, Color and black-and-white digital printing
Internationalization
Compatibility (other)
Design: Global font choice, Headers and footers, Margins, Page break, Kerning, …
19. Showstoppers… or the quality square!
There are things that you will know when you see them…
20. Building the concrete LQA metrics
The methodology fully covers all types of translated content, including those produced using MT and/or MT + post-editing.
21. Applying LQA metrics
Applying it correctly: a three-dimensional vector:
- Holistic readability: readability threshold, Pass/Fail
- Holistic adequacy: adequacy threshold, Pass/Fail
- Atomistic compound detailed metrics: atomistic rating, detailed score
Implementation keys:
- Holistic parameters cannot be mixed
- Only those materials that pass the holistic parameters are analyzed further
- Experts are required to produce a precise and reliable atomistic score
- Select the content to apply the metrics to
22. In the vast majority of real-life cases,
nobody can afford the luxury of
employing an expert panel to evaluate
the translation quality of any particular
document or web portal. LSPs typically
have to use a single reviewer who only
looks at a certain percentage of the
content. To produce meaningful, reliable results despite this limitation, proper sampling must be done.
23. HOW FASTIDIOUS ARE YOU?
[Chart: required sample size vs. total volume at a 95% confidence level. Horizontal axis: total volume in words (100 to 1,000,000, logarithmic). Left vertical axis: percentage of the total volume to be checked (0% to 100%). Right vertical axis: sample size in words (logarithmic). Curve pairs are plotted for 0.125%, 0.25%, and 0.5% confidence intervals.]
24. THE LUXURY OF FULL METRICS
Localization Quality Assurance costs:
Develop metrics (significant research):
- Know your area
- Sustain R&D
- Develop metrics
- Develop processes
- How much and what to QA?
Build the supply chain:
- Professional LSP
- Professional linguists
- Provide training
- Provide reference materials
Pay to apply:
- Terminology maintenance & support
- Translation Memory maintenance and management
26. THE NEED AND THE CONSTRAINTS
CONSTRAINTS
A. Professional LQA would require a global federal program to develop applicable LQA metrics, allocate funding, and book professional LQA with specially trained LSPs.
B. Yet there is still an acute need for it, as demonstrated by the CUIDADO DE SALUD web site of the Affordable Care Act.
C. A methodology for public LQA is very much needed.
D. There IS feedback on the site from the public; how is it to be handled?
Executive Order 13166
http://www.justice.gov/crt/about/cor/13166.php
www.lep.gov
“requires Federal agencies to examine the services they provide, identify any need for services to those with limited English proficiency (LEP), and develop and implement a system to provide those services so LEP persons can have meaningful access to them”
27. CONSTRAINTS of public feedback
The Crowd:
- Cannot be trained
- Is not ready to spend a lot of time
- Is opinionated by definition
The Feedback:
- Is limited in volume
- Is random by nature
- Uses arbitrary issue classification
- Can be large in the number of reviewers
The Approach:
- Use a statistical approach to turn the tables and gain in another area what we have lost
28. THE METRICS
1. Quality square approach
There MAY be showstopper errors.
2. The parameters are simplified (no detailed issue definitions)
No detailed atomistic quality issue definitions can be applied.
3. Each reviewer produces four ratings on a 0-10 scale
The 0-10 scale is the smallest one that accommodates the Bell curve. (Each reviewer is asked to provide examples.)
4. The calibration:
(a) Showstopper: 0 = two or more major errors, 10 = no major errors
(b) Holistic readability (fluency): 0 = incomprehensible, 10 = a poem
(c) Holistic adequacy (accuracy): 0 = inadequate, 10 = perfectly conveying the meaning
(d) Atomistic (small specific errors): 0 = full of small errors, 10 = completely error-free
For crowdsourced LQA, the atomistic quality category is not formalized in any way whatsoever.
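One reviewer's submission under this four-rating metric might be captured as in the following sketch; the class and field names are illustrative choices of mine, not part of the work items:

```python
# Sketch: one crowd reviewer's record under the four-rating metric above,
# with the 0-10 calibration enforced. Names are illustrative.
from dataclasses import dataclass

@dataclass
class ReviewerRating:
    showstopper: int   # 0 = two or more major errors, 10 = no major errors
    readability: int   # holistic fluency: 0 = incomprehensible, 10 = a poem
    adequacy: int      # holistic accuracy: 0 = inadequate, 10 = perfect meaning
    atomistic: int     # small specific errors: 0 = full of them, 10 = error-free
    examples: str = "" # each reviewer is asked to provide examples

    def __post_init__(self):
        for name in ("showstopper", "readability", "adequacy", "atomistic"):
            value = getattr(self, name)
            if not 0 <= value <= 10:
                raise ValueError(f"{name} must be on the 0-10 scale, got {value}")

r = ReviewerRating(showstopper=10, readability=6, adequacy=7, atomistic=5,
                   examples="several untranslated strings on the landing page")
```

Collecting records in this shape is what makes the statistical processing of the next slide possible.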
29. THE PROCESS
1. The LQA review scope is defined and briefly and clearly explained
To prevent reviewers from straying into other areas.
2. The content needs to be final
Updates and scope changes are outside the scope of crowdsourced review.
3. Communication is done via a simple online portal
There is no bandwidth to manage the crowd manually.
4. It is better if volunteers are language professionals
This compensates for the lack of special training.
5. Proper sampling
No fewer than 10 reviewers for each area; the more, the better.
6. Proper processing
The results are manually vetted to remove outliers:
- discard outliers without explanation and obvious reviewer errors
- are major errors statistically significant? A 30% threshold instead of 5% is recommended.
- apply statistics to analyze the results

Example results: an average Readability Rating of 6.2 out of 10 with a standard deviation of 2.2, and Adequacy of 6.5 out of 10 with a standard deviation of 1.9. The conclusion would be: the text is readable (rating above 5), but barely so, and leaves much to be desired in view of its importance and high level of public exposure. Again, it is up to the expert doing the analysis to define the threshold: for example, that for this type of content a proper target for average readability is at least 8 out of 10.
The adjusted value for technical errors was 4.7 out of 10 for the average atomistic quality rating, with a 2.4 standard deviation. Despite the fact that the review was by design a less than ideal community feedback-based LQA, resulting in rating inconsistency among reviewers, most reviewers found too many noticeable and annoying technical/minor mistakes in the text, as reflected in the low average rating, which is unsatisfactory. Substantial remedial work is clearly called for in this area.
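The vetting and averaging of step 6 can be sketched as below; the concrete outlier rule (drop ratings more than two standard deviations from the raw mean) and the verdict wording are illustrative choices of mine, not prescribed by the deck:

```python
# Sketch of "proper processing": vet crowd ratings by discarding outliers,
# then report the mean and standard deviation per holistic parameter.
from statistics import mean, stdev

def vet_and_summarize(ratings: list[float]) -> tuple[float, float]:
    """Drop ratings more than 2 standard deviations from the raw mean
    (an illustrative rule), then return (mean, stdev) of the rest."""
    m, s = mean(ratings), stdev(ratings)
    vetted = [r for r in ratings if abs(r - m) <= 2 * s] or ratings
    return round(mean(vetted), 1), round(stdev(vetted), 1)

# Ten reviewers; one unexplained zero rating is treated as an outlier.
readability = [6, 7, 5, 8, 6, 7, 5, 6, 7, 0]
avg, sd = vet_and_summarize(readability)
# A conclusion in the spirit of slide 29: readable (above 5), but barely so.
verdict = ("meets target" if avg >= 8
           else "readable, but barely" if avg > 5
           else "unreadable")
print(avg, sd, verdict)
```

With real data the expert would run this per parameter (readability, adequacy, atomistic) and compare each average against the threshold chosen for the content type.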
30. THE ONLY PUBLIC LQA METRICS AVAILABLE
Is it appropriate? YOU DECIDE!
POSITIVES:
- Both holistic measures can be relied upon with reasonable confidence
- Good overall assessment
- Allows showstoppers to be identified
- Gives a good general idea of the level of technical errors
- Affordable and available to US federal agencies
CONTRA:
- Only a rough judgment
- Not a good quantitative assessment
- Not a complete roster of errors, even in the selected sample
- No concrete process recommendations
The results are presented on this graph. It's not as complex as it might seem at first glance. Let me explain.
The horizontal axis represents volume in words, between 100 on the left and 1,000,000 on the right. A logarithmic scale is used.
The left vertical axis displays the percentage of the total volume to be checked (between 0% and 100%), while the right vertical axis shows the sample size in words and uses a logarithmic scale for obvious reasons.
Now let’s proceed to the curves themselves. The ones starting at 100% and decreasing represent the percentage of the total volume to be checked.
The ones starting at 100 words and going up represent the word count to be checked, depending on the total volume.
The color of the curves represents different error margins: the higher the precision, the more we have to check. The red curves in the middle represent the medium error margin of a quarter of a percent. The green curves correspond to a tighter error margin (1/8 of a percent), and the blue ones to a more relaxed half-a-percent error margin.
Of course, we might be more or less fastidious depending on the situation, but generally it is recommended to stay between the blue and green curves. If we are checking more, we are probably checking too much, and if we are checking less, we are probably checking too little.
Now, let’s go over some specifics:
When volumes are low, below 10,000 words, the percentage to be checked is close to 100%. This means we have to check everything, which is quite reasonable since the overall volume is very low. We simply can't make reliable assumptions about the quality of the material using a small sample.
In the midrange between 20,000 and 200,000 words, the percentage to be checked starts going down.
When the volume exceeds 300,000 words the curves reach saturation. It means that the volume to be checked stays flat almost irrespective of the overall volume, somewhere between 100,000 and 150,000 words. This volume can be divided or multiplied by a factor of 3 depending on the precision we want to achieve.
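The shape of these curves, including the saturation, follows from the standard sample-size formula for estimating a proportion, with a finite population correction. The sketch below assumes the usual worst-case p = 0.5 and z = 1.96 for a 95% confidence level; that is an inference about how the chart was built, not something stated in the deck:

```python
# Sketch: words to check at a 95% confidence level, reproducing the shape of
# the chart above. Cochran's formula with finite population correction;
# p = 0.5 (worst case) and z = 1.96 are standard assumptions.
import math

def sample_size(total_words: int, margin: float,
                z: float = 1.96, p: float = 0.5) -> int:
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population sample size
    n = n0 / (1 + (n0 - 1) / total_words)       # finite population correction
    return math.ceil(n)

for total in (10_000, 100_000, 1_000_000):
    n = sample_size(total, margin=0.0025)       # 0.25% error margin (red curves)
    print(f"{total:>9,} words -> check {n:,} ({n / total:.0%})")
```

At 10,000 words the formula demands nearly the whole text; by 1,000,000 words it flattens out in the 100,000-150,000-word range, matching the saturation described above. Tightening the margin to 0.125% or relaxing it to 0.5% scales the required sample up or down, roughly by the factor of 3 mentioned in the notes.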