Delivered at the TAUS Quality Evaluation Summit.
May 28th 2015
Dublin, Ireland.
In this talk, we describe how to carry out machine translation evaluation in order to extract meaningful business intelligence.
1. MT Evaluation
Seeing the Wood for the Trees
John Tinsley
CEO and Co-founder
TAUS QE Summit. Dublin. 28th May 2015
2. We need to marry data that we know from operations with data we produce during MT evaluations to create intelligence.
Let’s look at how we can find that out and what it means…
Making the business case for MT
KNOWNS
• Revenue from translation
• Costs (internal, outsourced)
• Variations of this information across content and languages
UNKNOWNS
• MT performance
• Cost of MT
• Variations of this information across content and languages
3. Calculating potential ROI
Parameters: per-word rate (LSP): €0.10 | vendor rate: €0.08 | productivity gain: ? | project word count: 5,000,000 | MT cost: ? | MT-weighted word count: ?

                       No Machine Translation    With Machine Translation
LSP revenue            €500,000                  €500,000
Vendor cost            €400,000                  ?
MT cost                0                         ?
Gross profit           €100,000                  ?
Gross profit margin    20.0%                     ?

Gross profit increase when using MT: ???%
**These numbers are for illustrative purposes only and not related to the case study
4. Problem
Large Chinese to English patent translation project. Challenging content and language.
Question
What, if any, efficiencies can machine translation add to the workflow of RWS translators?
How we applied different types of MT evaluation at different stages in the process, at various go/no-go decision points, to help RWS assess whether MT is viable for this project.
Client Case Study – RWS
- UK headquartered public company
- Founded 1958
- 9th largest LSP (CSA 2013 report)
- Leader in specialist IP translations
5. Lots of different ways to do evaluation**
– automatic scores
• BLEU, METEOR, GTM, TER
– fluency, adequacy, comparative ranking
– task-based evaluation
• error analysis, post-edit productivity
Different metrics, different intelligence
– what does each type of metric tell us?
– which ones are usable at which stage of evaluation?
e.g. can we really use automatic scores to assess productivity?
e.g. does productivity delta really tell us how good the output is?
MT Evaluation – where do we start!?
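To make the automatic scores above concrete, here is a minimal sketch of sentence-level BLEU in Python. This is an illustration only, not the reference implementation: real-world BLEU is computed at corpus level with smoothing (e.g. via the sacreBLEU tool), and this sketch omits both.

```python
import math
from collections import Counter

def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..4) times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each hypothesis n-gram count by its count in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # floor avoids log(0)
    # Brevity penalty: punish hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0; any divergence pulls the score below 1, which is why these scores rank engines well but say little about post-editing effort on their own.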
6. Can we improve our baseline engines through customisation?
Step 1: Baseline and Customisation
[Chart: BLEU and TER scores (0–0.8), Iconic Baseline vs. Iconic Customised]
What next?
How good is the output relative to the task, i.e. post-editing?
- fluency/adequacy not going to tell us
- let’s start with segment level TER
- Huge improvement
- Intuitively, scores reflect well but don’t really say anything
- Let’s dig deeper
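For context, TER (Translation Edit Rate) is roughly the minimum number of word-level edits needed to turn the MT output into the reference, normalised by reference length. A simplified sketch (my illustration; real TER, as computed by the tercom tool, also counts block shifts as single edits):

```python
def simple_ter(hypothesis: str, reference: str) -> float:
    """Simplified TER: word-level edit distance (insertions, deletions,
    substitutions) divided by reference length. Block shifts, which
    full TER also counts, are omitted here."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard Levenshtein dynamic programme over words
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

A TER of 0 means the segment needs no edits at all; scores near 1 mean it is essentially rewritten, which maps naturally onto post-editing effort.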
7. Translation Edit Rate: correlates well with practical evaluations
If we look deeper, what can we learn?
INTELLIGENCE
• Proportion of full matches (i.e. big savings)
• Proportion of close matches (i.e. faster than fuzzy matches)
• Proportion of poor matches
ACTIONABLE INFORMATION
• Type of sentence with high/low matches
• Weaknesses and gaps
• Segments to compare and analyse in translation memory
8. Step 2: Segment-level automatic analysis
[Chart: distribution of segment-level TER scores by segment length]
This represents a 24% potential productivity gain**
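The kind of calculation behind that figure can be sketched as follows. The TER bands and per-band word discounts below are illustrative assumptions of mine, not the ones used in the case study:

```python
def bucket_segments(segments):
    """segments: list of (ter_score, word_count) pairs.
    Returns counts of full (TER = 0), close (TER <= 0.3), and poor
    matches, plus an MT-weighted word count under assumed discounts."""
    # Assumed payable fraction of words per band (hypothetical values)
    weights = {"full": 0.2, "close": 0.6, "poor": 1.0}
    totals = {"full": 0, "close": 0, "poor": 0}
    weighted_words = 0.0
    for ter, words in segments:
        band = "full" if ter == 0 else "close" if ter <= 0.3 else "poor"
        totals[band] += 1
        weighted_words += weights[band] * words
    return totals, weighted_words

# Three toy segments: 12 words fully matched, 20 close, 9 poor
counts, weighted = bucket_segments([(0.0, 12), (0.15, 20), (0.6, 9)])
# weighted ≈ 23.4 of the original 41 words
```

The ratio of weighted to raw words is the potential discount; run over a representative set of test documents, it gives the projected productivity gain.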
9. With MT experience and previous MT integration, productivity testing can be run in the production environment. In this case, we used the Dynamic Quality Framework.
Beware the variables**!
• Translators: different experience, speed, perceptions of MT
– 24 translators: senior, staff, and interns
• Test sets: not representative; particularly difficult
– 2 test sets, comprising 5 documents, and cross-fold validation
• Environment and task: inexperience and unfamiliarity
– Training materials, videos, and “dummy” segments
Step 3: Productivity testing
10. Findings and Learnings
Overall average: 25% productivity gain
By translator profile: Experienced: 22% | Staff: 23% | Interns: 30%
By test set: Test set 1.1: 25% | Test set 1.2: 35% | Test set 2.1: 6% | Test set 2.2: 35%
What it tells us:
- Correlates with TER
- Rollout with junior staff for more immediate impact on bottom line?
- Don’t be over-concerned by outliers.
- Use data to facilitate source content profiling?
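Those percentages are relative speed-ups of post-editing over translating from scratch. A sketch of the underlying arithmetic; the throughput figures below are hypothetical, since the slides report only the resulting percentages:

```python
def productivity_gain(ht_words_per_hour: float, pe_words_per_hour: float) -> float:
    """Relative speed-up of post-editing (PE) over translating from
    scratch (HT), expressed as a fraction of the HT rate."""
    return (pe_words_per_hour - ht_words_per_hour) / ht_words_per_hour

# Hypothetical throughputs: a translator doing 500 words/hour from
# scratch and 625 words/hour when post-editing shows a 25% gain.
gain = productivity_gain(500, 625)  # 0.25
```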
11. Look out for anomalies**
– segments with long timings (above average ratio words/minute)
– sentences that don’t change much from MT to post-edit*
– segments with unusually short timings
In this case, the next step is production roll-out to validate these
in the actual translator workflow over an extended period.
Warnings, Tips, and Next Steps
Now would be the right time to do fluency/adequacy if you need to
verify that post-editing is producing, at least, similar quality output
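A sketch of how such anomaly checks might be automated. The record fields and thresholds here are my assumptions for illustration, not part of the DQF tooling:

```python
from statistics import mean, stdev

def flag_anomalies(records, z=2.0, min_ter=0.02):
    """records: list of dicts with 'words', 'minutes' (post-edit time),
    and 'ter' (edit distance from raw MT to the post-edited segment).
    Flags segments whose words-per-minute rate sits more than z standard
    deviations from the mean (unusually slow or fast timings), and
    segments barely changed from the MT output (possible under-editing)."""
    rates = [r["words"] / r["minutes"] for r in records]
    mu, sigma = mean(rates), stdev(rates)
    flagged = []
    for r, rate in zip(records, rates):
        if abs(rate - mu) > z * sigma:
            flagged.append((r, "unusual timing"))
        elif r["ter"] < min_ter:
            flagged.append((r, "little change from MT"))
    return flagged
```

Flagged segments are candidates for exclusion or manual review before the productivity numbers are taken at face value.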
12. Calculating the ROI - revisited
Parameters: per-word rate (LSP): €0.10 | vendor rate: €0.08 | productivity gain: ? | project word count: 5,000,000 | MT cost: ? | MT-weighted word count: ?

                       No Machine Translation    With Machine Translation
LSP revenue            €500,000                  €500,000
Vendor cost            €400,000                  ?
MT cost                0                         ?
Gross profit           €100,000                  ?
Gross profit margin    20.0%                     ?

Gross profit increase when using MT: ???%
**These numbers are for illustrative purposes only and not related to the case study
13. Calculating the ROI – plugging in the numbers
Parameters: per-word rate (LSP): €0.10 | vendor rate: €0.08 | productivity gain: 25% | project word count: 5,000,000 | MT cost: €0.008 per word | MT-weighted word count: 3,750,000

                       No Machine Translation    With Machine Translation
LSP revenue            €500,000                  €500,000
Vendor cost            €400,000                  €300,000
MT cost                0                         €40,000
Gross profit           €100,000                  €160,000
Gross profit margin    20.0%                     32%

Gross profit increase when using MT: 60%
**These numbers are for illustrative purposes only and not related to the case study
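The table’s arithmetic can be sketched as a small function. The function and variable names are mine; the inputs are the slide’s illustrative figures:

```python
def mt_roi(lsp_rate, vendor_rate, mt_rate, word_count, productivity_gain):
    """Reproduces the slide's illustrative ROI arithmetic: the
    MT-weighted word count discounts vendor-payable words by the
    productivity gain, while LSP revenue is unchanged."""
    revenue = lsp_rate * word_count
    weighted_words = word_count * (1 - productivity_gain)
    vendor_cost = vendor_rate * weighted_words
    mt_cost = mt_rate * word_count          # MT is paid on the full count
    gross_profit = revenue - vendor_cost - mt_cost
    margin = gross_profit / revenue
    return weighted_words, vendor_cost, mt_cost, gross_profit, margin

# Slide figures: €0.10 LSP rate, €0.08 vendor rate, €0.008 MT rate,
# 5,000,000 words, 25% productivity gain.
ww, vc, mc, gp, margin = mt_roi(0.10, 0.08, 0.008, 5_000_000, 0.25)
# gp ≈ €160,000 at a 32% margin, versus €100,000 at 20% without MT:
# a 60% increase in gross profit.
```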
14. 3 take home messages
- Identify the gaps in your data
- Understand the process to collect the right information
- Continuous assessment
16. Iconic Translation Machines
• Machine Translation with Subject Matter Expertise
• Headquartered here in Dublin
• Strong tradition of MT research and development
underpinning the company and its technologies
This presentation
• MT evaluation: what, how, when, why?
– What ways can we evaluate MT?
– How do we carry out the evaluation?
– When in the process do we carry out certain types of evaluation?
– Why do we do certain evaluations and what do they tell us?
By way of introduction…
17. Step 2: Segment-level automatic analysis
[Chart: plot of TER scores by segment length, with productivity threshold marked]
One of the biggest challenges as an MT provider is helping the LSP client make the business case for MT.
In order to do this, we need to look at what data we HAVE and COMBINE that with data we collect through MT evaluations, to create the business intelligence around making the decision
- TM leverage, translator speeds also possibly
I won’t dwell on this but I’ll refer to it.
It helps visualise what information we may HAVE and what information we NEED in order to complete the picture
I’ll talk about how we collected this information through MT evaluations via a case study with RWS.
What I’ll focus on is WHAT MT evaluation we carried out and at what STAGES, to give us the information we needed to know
This is the part where we’re looking into the forest and trying to pick out the right approach.
Different metrics tell us different things, but perhaps more important is what the metrics don’t tell us
There are lots of them out there, you need to know which ones to use and when.
We’ve obviously got a lot of experience in this area given our background, I won’t focus on this now but maybe we can leave a more detailed discussion for the breakout sessions
First step is can we improve our engines through customisation.
These automatic scores tell us CONCLUSIVELY: yes.
But they don’t really tell us anything about QUALITY, or SUITABILITY for the TASK
We need to dig deeper on a segment level and for this, we use TER. WHY?
TER has correlated well with practical evaluations for us.
It gives us practical information which we can correlate with the bottom line
It also gives us practicable (actionable) information which we can use to improve MT and do further analysis
**If you do this over a variety of test documents like we did with RWS, where we used 10, you’ll get a sense of what the MT can bring**
For example, here we see FOR EACH SEGMENT, the TER range and how long the segments are within those ranges.
This allows us to do some calculations, which I won’t detail now (we can discuss them in the breakout session), but they resulted in a 24% gain
Experience is crucial here. Lots of variables and things to look out for, like TRANSLATORS, TEST SETS, and the ENVIRONMENT, as I’m sure people here can attest to.
I won’t go into detail but here’s a high level look at what we did to try to find out different information.
Again, I can go into details in the breakout session.
In terms of analysing information, there are a number of things to look out for to make sure we’re getting more accurate results.
Suffice to say, now would be the right time to look at quality evaluation and make sure post-editing is not affecting things
To revisit this (just pointing out these aren’t real numbers… ;-) If we plug it in…
You need to know what information to collect before you can set up the evaluation
You need to understand the different evaluation processes, or work with someone who does, to make sure you collect that information
This is just the start. Engines will improve. Productivity assessment is ongoing too, to ensure at least the same gains, and improvements over time.