Tips for Preparing Training Data for High Quality Machine Translation

Machine Translation (MT) has experienced a surge in popularity in recent years. However, achieving the right level of quality output can be challenging, even for the most expert MT engineers.

MT engines learn from carefully selected bilingual and monolingual training data, and engine quality is enhanced through the use of terminology, fine-tuning and a series of pre- and post-processing steps. Since these practices have a significant effect on the results of an MT workflow, it’s important to map out each step and develop a clear training strategy before deploying an MT solution.

Joining KantanMT’s Founder and Chief Architect, Tony O’Dowd, is Selçuk Özcan, Co-founder of Transistent Language Automation Services. Transistent helps companies invest in and integrate new language automation procedures into translation workflows. It is also the first company to focus on MT and quality automation services in Turkey and the Middle East.

During this webinar, Selçuk will talk about Transistent’s experience using KantanMT.com to build and deploy high quality KantanMT engines.

During this webinar you will learn:

• About the potential uses of Machine Translation
• The importance of training data and how it affects MT quality
• Tips for Preparing Training Data for High Quality MT

Published in: Technology

Tips for Preparing Training Data for High Quality Machine Translation

  1. 1. in association with#KantanWebinar Tips for Preparing Training Data for High Quality MT
  2. 2. What do we aim to cover today? • About KantanMT.com • Who are we and what do we stand for? • What Makes Good Training Data? • The 3 Main Factors that Influence Quality • Transistent.com – An insider's view • 5 Things to Look Out for in Good Training Data • Q&A
  3. 3. What is KantanMT.com? • Statistical MT System • Cloud-based • Highly scalable • Inexpensive to operate • Fusion of TM & MT & rules • High speed, high quality translations • Our Vision: To put Machine Translation Customization, Improvement and Deployment into your hands. Active KantanMT Engines: 7,501. Training Words Uploaded: 105,533,605,925. Member Words Translated: 1,00,291,925. Fully Operational: 18 months.
  4. 4. The KantanMT Community
  5. 5. Our Journey has just started… Timeline from Q1 2013 to Q1 2015 (quarter labels on the slide: Q1 2013, Q2 2013, Q3 2013, Q1 2014, Q3 2014, Q1 2015). Milestones: • Adoption: Uploaded 10b training words and 200m words translated. KantanAPI launched • www.kantanmt.com: 1st SMT Cloud Based Platform (TotalRecall) • KantanAutoScale: Using the power of the cloud to maximise performance • Kantan BuildAnalytics: Helping engineers build better MT • Kantan Analytics: 1st Predictive Quality Estimation Technology • Massive Adoption: 879m translated and 100b training words uploaded
  6. 6. What Makes Good Training Data? • Training Data – Three main factors: Quality • The linguistic quality of the training material is crucially important. Relevance to domain • A high quality MT system has good domain knowledge. • Similar to the way you’ve always worked with Translation Memories and CAT tools. Quantity • The more training data you use to build your engine, the better its capacity to generate translations that mimic your translation style and terminology. (Diagram labels: Quantity, Quality, Relevance)
  7. 7. What Makes Good Training Data? • Training Data – Balancing the equation (Diagram label: Quality)
  8. 8. What Makes Good Training Data? • Suitable Training Data Sources • KantanMT Stock Engines (200+ Language Combinations) • Translation Memories (TMX, XLIFF, TXT) • Terminology Databases (TBX) • Client Translated Data (DOCX, PDF, TXT). Diagram: 1. Bilingual TMs, 2. Monolingual Translated Data, 3. Glossary/Terms Sources, 4. Language Base Data (Optional), combined into Training Data
  9. 9. In conclusion • What makes good training data? Quantity, Quality, Relevance
  10. 10. in association with#KantanWebinar Tips for Preparing Training Data for High Quality MT
  11. 11. Training Data Preparation for SMT Systems Selçuk Özcan selcuk.ozcan@transistent.com
  12. 12. › Established in December, 2014 › Based in Istanbul › MT Services including raw output, custom engine and post-editing › Additional services including quality automation, training consultancy and traditional translation
  13. 13. Statistical Machine Translation (SMT) Utilized components • Monolingual Data • Bilingual Data • Glossaries • Rules and Tasks Language Model Translation Model
  14. 14. Pattern Formation and Mapping [Diagram: columns of Source Segments and Target Segments, labeled Bilingual Data]
  15. 15. Pattern Formation and Mapping [Diagram: Target Segments and Source Segments, labeled Translation Model]
  16. 16. Pattern Formation and Mapping [Diagram: Target Segments and Additional Monolingual Data, labeled Language Model]
  17. 17. SOURCE TEXT → Pre-processing Rules → Translation Model and Language Model → Post-processing Rules → TARGET TEXT
  18. 18. Pre-processing Rules: Sentence Segmentation; Word Segmentation; Word/Phrase Re-ordering; Date – Numbers; Formulas; Pre-normalization; Spellcheck; Pre-process Formatting. Post-processing Rules: Capitalization; Post-process Formatting; Grammar Check; Tag Injection; Currency – Metric Unit Conversion; Final Normalization; Reference Check; Customized Tasks.
  19. 19. Tasks - Optimization Data Analysts • Data Crawling • Gathering and Normalizing Data • Building Corpus • Corpus Analytics Testing Team ?
  20. 20. QL – Version Diagram
  21. 21. Tasks - Optimization Testing Team • Gap Analysis • QE and Test Reports • Output – Corpus Analytics • Including New Data and Rules Data Analysts
  22. 22. Including New Data and Rules: before the first two training steps? To reach a mature production system? What we do: • Bilingual Corpus Analytics: Lemmatization; Missing Inflections; Word/lemma distribution map • Gap and Broken Pattern Detection: This process requires the gap analysis (GA) and QE reports. • Monolingual Corpus Analytics: Bilingual – monolingual comparison; Defining the most appropriate LM config • Rule and Data Patch Distinction: The issues included in the reports are identified. • Term Extraction: Extracting candidate terms; Term and lexical unit separation; Specific glossary and dictionary • Feedback Loops: Next chapter!
  23. 23. Ensure that your training data • Is clean and normalized • Is relevant to the related domain • Has a complete and healthy linguistic pattern form • Consists of coherent monolingual and bilingual data
  24. 24. KantanMT Rejected Segments Feature • Segments too long • Mismatched Tags/Placeholders • Source/Target mis-alignment • Bad formatting • Incorrect language combinations
  25. 25. Tweet your questions to #KantanWebinar, or via the webinar chat feature.
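
Slide 22 lists a word/lemma distribution map and gap detection among the corpus analytics tasks. The Python sketch below is one possible reading of those two ideas, not Transistent's or KantanMT's actual tooling. It counts surface tokens only (no lemmatization) and reports test-set tokens that never occur in the training corpus. The function names, the regex tokenizer and the sample sentences are illustrative assumptions.

```python
import re
from collections import Counter

# Assumption: a simple word-character tokenizer stands in for real word segmentation.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and return its word-like tokens."""
    return TOKEN_RE.findall(text.lower())


def word_distribution(segments):
    """Count how often each token occurs across a list of segments."""
    counts = Counter()
    for segment in segments:
        counts.update(tokenize(segment))
    return counts


def vocabulary_gaps(training_segments, test_segments):
    """Return (token, count) pairs for test tokens unseen in training, most frequent first."""
    seen = set(word_distribution(training_segments))
    test_counts = word_distribution(test_segments)
    gaps = {tok: n for tok, n in test_counts.items() if tok not in seen}
    return sorted(gaps.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data, for illustration only.
training = ["Press the power button.", "Hold the button for five seconds."]
test = ["Press the reset button.", "Reset the router."]

gaps = vocabulary_gaps(training, test)
print(gaps)
```

A gap list like this only flags missing surface forms. Detecting missing inflections, as the slide suggests, would need a lemmatizer for the language pair in question.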

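Slides 23 and 24 ask for clean, normalized training data and list reasons a segment may be rejected. The following is a minimal sketch of such a filter and is not KantanMT's Rejected Segments implementation. The thresholds `MAX_TOKENS` and `MAX_LENGTH_RATIO`, the tag pattern and the helper names are hypothetical choices. The sketch covers overlong segments, mismatched tags/placeholders, a crude length-ratio test for misalignment, and duplicates. Language identification and formatting checks are left out.

```python
import re
import unicodedata

# Assumption: tags look like <b>, </b> or numbered placeholders such as {1}.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>|\{\d+\}")

MAX_TOKENS = 80          # hypothetical limit for "segment too long"
MAX_LENGTH_RATIO = 3.0   # hypothetical limit for source/target length mismatch


def normalize(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def rejection_reason(source, target):
    """Return a reason string if the pair should be rejected, otherwise None."""
    src_tokens, tgt_tokens = source.split(), target.split()
    if not src_tokens or not tgt_tokens:
        return "empty segment"
    longest = max(len(src_tokens), len(tgt_tokens))
    shortest = min(len(src_tokens), len(tgt_tokens))
    if longest > MAX_TOKENS:
        return "segment too long"
    if sorted(TAG_RE.findall(source)) != sorted(TAG_RE.findall(target)):
        return "mismatched tags/placeholders"
    if longest / shortest > MAX_LENGTH_RATIO:
        return "possible source/target misalignment"
    return None


def clean_corpus(pairs):
    """Normalize pairs, then split them into kept pairs and (pair, reason) rejections."""
    kept, rejected, seen = [], [], set()
    for source, target in pairs:
        pair = (normalize(source), normalize(target))
        reason = rejection_reason(*pair)
        if reason is None and pair in seen:
            reason = "duplicate"
        if reason is None:
            seen.add(pair)
            kept.append(pair)
        else:
            rejected.append((pair, reason))
    return kept, rejected


# Hypothetical English-French sample pairs, for illustration only.
pairs = [
    ("Click  <b>Save</b>.", "Cliquez sur <b>Enregistrer</b>."),
    ("Click <b>Save</b>.", "Cliquez sur Enregistrer."),
    ("Yes", "Oui, mais seulement après le redémarrage"),
    ("Click <b>Save</b>.", "Cliquez sur <b>Enregistrer</b>."),
]
kept, rejected = clean_corpus(pairs)
for pair, reason in rejected:
    print(reason, "->", pair)
```

Returning the rejection reason alongside each pair keeps the rejected data reviewable, so segments can be repaired and resubmitted rather than silently lost.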