This document discusses tips for preparing high quality training data for machine translation systems. It covers:
- The key factors that influence training data quality are quantity, quality, and relevance to the domain. Balancing these is important.
- Suitable training data sources include translation memories, terminology databases, and client translated documents.
- Statistical machine translation systems use bilingual and monolingual data to form patterns and map source to target language. Additional data and rules can improve accuracy.
- Data preparation includes preprocessing, training the translation and language models, and postprocessing. Ensuring data is clean, normalized, and domain relevant improves results.
3. What do we aim to cover today?
About KantanMT.com
Who are we and what do we stand for?
What Makes Good Training Data?
The 3 Main Factors that influence
Quality
Transistent.com – An insider's view
5 Things to Look out for in Good
Training Data
Q&A
4. What is KantanMT.com?
Statistical MT System
Cloud-based
Highly scalable
Inexpensive to operate
Fusion of TM & MT & rules
High speed, high quality
translations
Our Vision
To put Machine Translation
Customization
Improvement
Deployment
into your hands
Active KantanMT Engines
7,501
Training Words Uploaded
105,533,605,925
Member Words Translated
1,00,291,925
Fully Operational 18 months
6. Our Journey has just started…
Q2 2013 Q3 2013 Q3 2014 Q1 2013
Adoption: Uploaded 10b training
words and 200m words
translated. KantanAPI launched
www.kantanmt.com:
1st SMT Cloud Based
Platform (TotalRecall)
KantanAutoScale: Using
the power of the cloud to
maximise performance
Kantan BuildAnalytics:
Helping engineers build
better MT
Q1 2014
Kantan Analytics: 1st
Predictive Quality
Estimation Technology
Massive Adoption: 879m
translated and 100b training
words uploaded
Q1 2015 Q1 2014
7. What Makes Good Training Data?
Training Data - Three main factors:
Quality
The linguistic quality of the training
material is crucially important
Relevance to domain
A high quality MT system has good
domain knowledge
Similar to the way you’ve always worked
with Translation Memories and CAT tools
Quantity
The more training data you use to build
your engine the better its capacity to
generate translations that mimic your
translation style and terminology
8. What Makes Good Training Data?
Training Data – Balancing the equation
Quality
9. What Makes Good Training Data?
Suitable Training Data Sources
• KantanMT Stock Engines
• 200+ Language Combinations
• Translation Memories
• TMX, XLIFF, TXT
• Terminology Databases
• (TBX)
• Client Translated Data
• DOCX, PDF, TXT
Training Data
1) Bilingual TMs
2) Monolingual Translated Data
3) Glossary/Terms Sources
4) Language Base Data (Optional)
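Translation memories in TMX are listed above as a training data source. As a rough illustration of how bilingual segment pairs might be pulled out of such a file, here is a minimal Python sketch. The function name read_tmx_pairs and the sample document are assumptions made for this example; this is not how KantanMT ingests TMX files.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX document string.

    Hypothetical helper: language codes are matched exactly
    (case-insensitive), so "en" will not match "en-US".
    """
    root = ET.fromstring(tmx_text)
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segs[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


# Invented sample, trimmed to the elements the sketch reads.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="de"><seg>Speichern Sie die Datei.</seg></tuv>
    </tu>
  </body>
</tmx>"""

print(read_tmx_pairs(SAMPLE_TMX, "en", "de"))
```

Translation units that lack either language are skipped rather than reported, which keeps the sketch short; a real pipeline would probably want to log them.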
13. › Established in December, 2014
› Based in Istanbul
› MT Services including raw output, custom engine
and post-editing
› Additional services including quality automation,
training consultancy and traditional translation
22. Tasks - Optimization
Testing Team
• Gap Analysis
• QE and Test Reports
• Output – Corpus Analytics
• Including New Data and Rules
Data Analysts
23. Including New Data and Rules
Before the first two training steps? To reach a mature production system?
Bilingual Corpus Analytics
Lemmatization
Missing Inflections
Word/lemma distribution map
Gap and Broken Pattern Detection
This process requires GA and QE reports
to be utilized.
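The word distribution map and gap detection mentioned above can be approximated with simple frequency counts. The Python sketch below is a simplified illustration under stated assumptions: it works on surface word forms only (no lemmatization, so inflection handling is not covered), and the function names and the min_count threshold are invented for this example.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase and split a segment into word tokens."""
    return TOKEN_RE.findall(text.lower())


def word_distribution(segments):
    """Count how often each word form occurs across the segments."""
    counts = Counter()
    for segment in segments:
        counts.update(tokenize(segment))
    return counts


def find_gaps(training_segments, test_segments, min_count=1):
    """List test-set words seen fewer than min_count times in training."""
    train = word_distribution(training_segments)
    test = word_distribution(test_segments)
    return sorted(word for word in test if train[word] < min_count)


training = ["Open the file", "Close the file"]
test = ["Open the files"]
print(find_gaps(training, test))
```

Running the same comparison on lemmas instead of surface forms would separate true vocabulary gaps from missing inflections, but that needs a lemmatizer for each language and is left out here.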
Monolingual Corpus Analytics
Bilingual – monolingual comparison
Defining the most appropriate LM config
Rule and Data Patch Distinction
The issues included in the reports are
identified.
Term Extraction
Extracting candidate terms
Term and lexical unit separation
Specific glossary and dictionary
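Candidate term extraction can be sketched with a basic frequency heuristic: collect short word sequences that recur in the corpus and do not start or end with a function word. The Python example below is only one possible approach, not the method used by the presenters; the stopword list, the candidate_terms name, and the thresholds are assumptions for illustration.

```python
import re
from collections import Counter

# Tiny illustrative English stopword list (an assumption, not a standard).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}


def candidate_terms(segments, max_len=3, min_count=2):
    """Return (term, count) for n-grams that recur at least min_count times."""
    counts = Counter()
    for segment in segments:
        tokens = re.findall(r"\w+", segment.lower())
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Skip n-grams that begin or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]


segments = [
    "Open the translation memory.",
    "The translation memory is empty.",
]
print(candidate_terms(segments))
```

The output is a list of candidates only. Separating real terms from ordinary lexical units still needs a linguist's review before anything goes into a glossary.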
Feedback Loops
Next chapter!
What we do
24. Ensure that your training data
• Is clean and normalized
• Is relevant to the related domain
• Has a complete and healthy linguistic pattern form
• Consists of coherent monolingual and bilingual data
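As one possible reading of "clean and normalized", the Python sketch below normalizes Unicode and whitespace in each segment and then drops empty and duplicate pairs. The function names are hypothetical, and the exact cleaning steps a given engine needs may differ.

```python
import re
import unicodedata


def normalize_segment(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    # \s also matches non-breaking spaces in Python 3 string patterns.
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs):
    """Normalize (source, target) pairs; drop empty and duplicate ones."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize_segment(src), normalize_segment(tgt)
        if not src or not tgt:
            continue  # one side is empty after cleaning
        if (src, tgt) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


raw = [
    ("Hello  world", "Hallo Welt"),
    ("Hello world", "Hallo Welt"),
    ("", "Leer"),
]
print(clean_pairs(raw))
```

Normalizing before deduplicating matters: the first two pairs above only become identical once the double space is collapsed.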
25. KantanMT Rejected Segments Feature
• Segments too long
• Mismatched Tags/Placeholders
• Source/Target misalignment
• Bad formatting
• Incorrect language combinations
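To make the rejection categories above more concrete, here is a minimal Python sketch of how a pre-upload check for some of them could look. It is not KantanMT's implementation: the function name, the tag pattern, and the thresholds (max_tokens, max_ratio) are all assumptions, and language identification is omitted because it would need an external library.

```python
import re
from collections import Counter

# Assumed placeholder styles: XML-like tags, {0}-style and %s/%d-style.
TAG_RE = re.compile(r"</?\w+[^>]*>|\{\d+\}|%[sd]")


def rejection_reasons(src, tgt, max_tokens=100, max_ratio=3.0):
    """Return a list of reasons why a segment pair might be rejected."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return ["empty segment"]
    reasons = []
    if src_len > max_tokens or tgt_len > max_tokens:
        reasons.append("segment too long")
    # Tags and placeholders should appear the same number of times
    # on both sides, regardless of order.
    if Counter(TAG_RE.findall(src)) != Counter(TAG_RE.findall(tgt)):
        reasons.append("mismatched tags/placeholders")
    # A large length ratio is a crude hint that the pair is misaligned.
    if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
        reasons.append("possible misalignment")
    return reasons


print(rejection_reasons("Click <b>Save</b>.", "Klicken Sie auf <b>Speichern</b>."))
print(rejection_reasons("Hello {0}", "Hallo"))
print(rejection_reasons("Yes", "Dies ist ein sehr langer Satz"))
```

The length-ratio check is deliberately loose, since legitimate translations can differ a lot in word count between languages; the threshold would need tuning per language pair.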
No more expensive deployments
Monthly subscription plan
Customised subscription plan
No more complexity
KantanMT does all the heavy lifting
You focus on what you do best – grow and develop your business
We are the fastest growing MT provider, even though we are one of the young guns!