Thinking Strategically About Content Destined for Machine Translation
2. Who Am I?
Founder and CEO of Content Rules
25+ years in content arena
Specialty areas:
Global content strategy
Terminology management
Content quality
Single-sourcing / XML / DITA
Finishing third book, “Global Content Strategy,” due out in 2014
© 2013. Content Rules, Inc. All rights reserved.
3. What is Content Rules?
Professional services firm specializing in:
• Content strategy / Global content strategy
• Content creation
• Content quality / Global readiness
Based in Silicon Valley
Founded in 1994
Acrolinx Authorized Services Provider
Authorized provider of The Rockley Strategic Method™
4. Global Readiness
Ensure content is translatable
Readability
Grammar and style
Reuse
Evaluate and improve content quality using state-of-the-art tools
Reports
Metrics
Recommendations
Fixes
Save money on translation
6. Today’s Presentation
Importance of content
Historic background
Types of machine translation
Content quality affects machine translation results
Bleu scores
Pre-editing instead of post-editing
7. Content Is Important
87% of respondents to a recent CMO Council survey said that content had
a moderate to major impact on their buying decisions
8. Content Is A Strategic Asset
9. What Does It Mean to be Strategic?
stra·te·gic
[struh-tee-jik]
adjective
1. pertaining to, characterized by, or of the
nature of strategy: strategic movements.
2. important in or essential to strategy.
3. forming an integral part of a stratagem:
a strategic move in a game of chess.
10. Content Creation In the Past
Content wasn't so easy to create and distribute
Created by trained professionals
Only they had access to the content
11. Content Creation Today
Everyone creates content
Very easy to distribute
Now, we have loads and loads of content
• Some of it good
• Some of it mediocre
• Some of it downright awful
12. Translation In The Past
Content wasn't so easy to translate.
Trained professionals
Only they understood multiple
languages well enough to translate
content
13. Translation Today
It is easy and free to translate content
We have loads and loads of translated content
• Some of it good
• Some of it mediocre
• Some of it downright awful
14. More Machine Translation All The Time
Machine Translation (MT) is increasingly relied upon as a way to
get cost-effective, fast translations
18.05% year-over-year growth of MT expected over next 3 years*
Must pay more attention to the source content that goes into it
A machine cannot figure out what we meant to say based on what we
actually wrote
Garbage In – Garbage Out
*http://www.researchandmarkets.com/research/2gpj3p/global_machine
15. Source Content And Machine Translation
Types of MT engines and the effect of source content
on them
What are Bleu scores
How quality of content affects MT output
16. MT Engine Types
There are three types of MT Engines:
1. Rule-based
2. Statistical
3. Hybrid
17. Rule-Based MT (RBMT)
Uses linguistic rules
Extensive use of bilingual dictionaries
Transfers structure of source language into target language
Results are literal translations based on rules
Does not handle ambiguity well (word or phrase having more than
one meaning)
18. Statistical MT (SMT)
Based on analysis of content
Engine trained over time
More content = better results
Need at least 2,000,000 words per domain
Better quality content = better results
Results are more natural translations, based on previous source |
destination pairs
Google Translate
19. Hybrid
Combines rule-based and statistical approaches
Provides predictability and consistency of RBMT
Provides fluency and flexibility of SMT
Reduces the amount of data needed to train the engine
20. Training The SMT Beast
Training SMT software is extremely important
Poor quality source = Poor quality translations
Some companies have such poorly trained MT engines
that fixing the content first is actually not an option
The engine has been trained to translate poor quality
source
21. The Effect Of Poor Content On SMT And Hybrid MT
Poor or unpredictable translations
Increased time to retrain the system with correct
information
Increased post-editing, per language
Wasted money
22. Evaluating MT Precision - Bleu Scores
Introduced in 2002 by the IBM Watson Research Center
Automatic evaluation metric used to compare MT output
with reference human translation
“The closer a machine translation is to a professional human translation, the
better it is.” *
Metric widely used throughout the industry
*http://acl.ldc.upenn.edu/P/P02/P02-1040.pdf
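To make the metric concrete, here is a minimal sketch of a sentence-level Bleu calculation against a single reference translation. It follows the general recipe from the cited paper (clipped n-gram precision combined with a brevity penalty), but the function names, the whitespace tokenization, and the lack of smoothing are simplifying assumptions made for illustration. Production tools differ in these details.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level Bleu against one reference, with no smoothing."""
    # Assumption: lowercase and split on whitespace; real tools tokenize more carefully.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            # Without smoothing, one n-gram level with no matches zeroes the score.
            return 0.0
        log_precision_sum += math.log(matched / total)

    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(log_precision_sum / max_n)
```

Calling `bleu(mt_output, human_translation)` returns a value between 0 and 1, where 1 means the machine output matches the reference exactly. In practice the score is usually computed over a whole test set rather than one sentence at a time.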
23. Bleu Scores – Helpful Or Hype?
According to Callison-Burch, Osborne, and Koehn of the School of
Informatics, University of Edinburgh, Bleu scores have many issues*:
Synonyms and paraphrases difficult to score
All words are weighted equally
Difficult to calculate
*http://homepages.inf.ed.ac.uk/pkoehn/publications/bleu2006.pdf
24. That’s Okay. We Can Post Edit.
[Diagram: Original Source Content → Post-Edited Translations]
25. Why Not Pre-Edit Instead?
Fewer issues = less post-editing
Save time
Save money
Improve quality
27. Results of Pre-Editing
Save money
Improve quality
Faster time to market
Fewer in-country iterations
Better translation consistency
28. Summary
Content is a strategic asset
Machine translation is becoming more popular
Poor quality content incorrectly trains MT engines
Poor quality content results in increased post-editing
Pre-editing saves money and time, and improves
translation quality
32. Reduce word count
We recommend a maximum of 24 words per sentence for machine translation.
Even people struggle to understand long sentences.
Imagine software having to parse through all of those
commas (half of which are probably missing or misplaced).
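A check like this is easy to automate during pre-editing. The sketch below flags sentences over the 24-word limit. The function name and the naive sentence splitter (a sentence ends at ".", "!" or "?" followed by whitespace) are assumptions made for illustration, so abbreviations such as "e.g." will be split incorrectly.

```python
import re

MAX_WORDS = 24  # the limit recommended on the slide


def long_sentences(text, max_words=MAX_WORDS):
    """Return (word_count, sentence) pairs for sentences over the limit."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sentence in sentences:
        count = len(sentence.split())  # words are whitespace-separated tokens
        if count > max_words:
            flagged.append((count, sentence))
    return flagged
```

Running `long_sentences(document_text)` on a draft gives a list of sentences to shorten before the content goes to the MT engine.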
33. Let's say we have 100,000 words of source content.
We are going to translate the content into 14 languages.
We will end up with 1.4 million words of content.
Let's say the 100,000 words contain all types of errors. We will have to post-edit and fix
1.4 million words on the other side.
Let's say we have to pay someone <<<$ .xx>>> per word to post-edit the content.
That's <<<$.xx>>> * 1,400,000 words.
If we paid <<<$ .07>>> per word to pre-edit the content, we would have spent $7,000 for
preparation.
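The arithmetic above can be sketched in a few lines. The word count, the number of languages, and the $0.07 pre-editing rate come from the slide. The post-editing rate is not given in the source, so it stays a parameter (`rate_per_word` is a hypothetical name), and the break-even figure is only a derived illustration.

```python
from decimal import Decimal

SOURCE_WORDS = 100_000
LANGUAGES = 14
PRE_EDIT_RATE = Decimal("0.07")  # dollars per source word, from the slide

translated_words = SOURCE_WORDS * LANGUAGES   # 1,400,000 words
pre_edit_cost = SOURCE_WORDS * PRE_EDIT_RATE  # 7,000 dollars


def post_edit_cost(rate_per_word):
    """Cost of post-editing every translated word at a given (hypothetical) rate."""
    return translated_words * rate_per_word


# Pre-editing pays for itself once it saves this much post-editing
# effort per translated word, averaged over all 14 languages.
break_even_saving = pre_edit_cost / translated_words
```

Under these numbers, `break_even_saving` works out to half a cent per translated word, because the pre-editing cost is paid once while post-editing is paid again in every target language.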
Editor's Notes
According to the CMO Council and Netline in their June 2013 survey, “Understanding How BtoB Buyers Source, Value, and Share Content Online,” 87% of respondents stated that content had a moderate to major impact on their buying decisions.
For far too long, content has been treated as something that simply describes, positions, or touts a product. Technical content, in particular, has long been an afterthought, something not deemed important.
If we don't treat content as a strategic asset, it is just garbage. And if we put garbage into machine translation, it is just exponentiated garbage.
One poorly written source document = many poorly translated resulting documents
Poor source content = more post-editing
Problems exponentiate based on number of language pairs