in association with #KantanWebinar
Tips for Preparing Training Data for High Quality MT
What we aim to cover today
• About KantanMT.com
• Who we are and what we stand for
• What Makes Good Training Data?
• The 3 Main Factors that Influence Quality
• Transistent.com – An insider's view
• 5 Things to Look Out for in Good Training Data
• Q&A
What is KantanMT.com?
• Statistical MT System
• Cloud-based
• Highly scalable
• Inexpensive to operate
• Fusion of TM & MT & rules
• High speed, high quality translations
• Our Vision
• To put Machine Translation Customization, Improvement and Deployment into your hands
Active KantanMT Engines: 7,501
Training Words Uploaded: 105,533,605,925
Member Words Translated: 1,00,291,925
Fully Operational: 18 months
The KantanMT Community
Our Journey has just started…
Quarters shown on the timeline: Q1 2013, Q2 2013, Q3 2013, Q1 2014, Q3 2014, Q1 2015
• Adoption: Uploaded 10b training words and 200m words translated. KantanAPI launched
• www.kantanmt.com: 1st SMT Cloud Based Platform (TotalRecall)
• KantanAutoScale: Using the power of the cloud to maximise performance
• Kantan BuildAnalytics: Helping engineers build better MT
• Kantan Analytics: 1st Predictive Quality Estimation Technology
• Massive Adoption: 879m translated and 100b training words uploaded
What Makes Good Training Data?
• Training Data - Three main factors:
Quality
• The linguistic quality of the training material is crucially important
Relevance to domain
• A high quality MT system has good domain knowledge
• Similar to the way you’ve always worked with Translation Memories and CAT tools
Quantity
• The more training data you use to build your engine, the better its capacity to generate translations that mimic your translation style and terminology
What Makes Good Training Data?
• Training Data – Balancing the equation
Quality
What Makes Good Training Data?
• Suitable Training Data Sources
• KantanMT Stock Engines (200+ Language Combinations)
• Translation Memories (TMX, XLIFF, TXT)
• Terminology Databases (TBX)
• Client Translated Data (DOCX, PDF, TXT)
Training Data
1. Bilingual TMs
2. Monolingual Translated Data
3. Glossary/Terms Sources
4. Language Base Data (Optional)
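Since translation memories in TMX are listed as a primary source, here is a minimal sketch of pulling source/target pairs out of a TMX file with the Python standard library. The sample document, the language codes and the helper name read_tmx_pairs are illustrative assumptions, not part of KantanMT; real TMX files also carry inline markup and metadata that this sketch flattens or ignores.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit translation memory used only for illustration.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="demo" creationtoolversion="1" o-tmf="demo"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="tr"><seg>Dosyayı açın.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save your changes.</seg></tuv>
      <tuv xml:lang="tr"><seg>Değişikliklerinizi kaydedin.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for one language combination."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens any inline markup inside <seg>.
                segments[lang] = "".join(seg.itertext()).strip()
        # Keep only translation units that contain both languages.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


pairs = read_tmx_pairs(SAMPLE_TMX, "en", "tr")
```

Translation units that lack one of the two requested languages are skipped, which is one simple way to avoid feeding incomplete pairs into training.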
In conclusion
• What makes good training data? Quantity, Quality, Relevance
Training Data Preparation for SMT Systems
Selçuk Özcan
selcuk.ozcan@transistent.com
› Established in December 2014
› Based in Istanbul
› MT services including raw output, custom engine and post-editing
› Additional services including quality automation, training consultancy and traditional translation
Statistical Machine Translation (SMT)
Utilized components
• Monolingual Data
• Bilingual Data
• Glossaries
• Rules and Tasks
[Diagram labels: Language Model, Translation Model]
Pattern Formation and Mapping
[Diagram: Source Segments and Target Segments make up the Bilingual Data]
Pattern Formation and Mapping
[Diagram: Target Segments and Source Segments feed the Translation Model]
Pattern Formation and Mapping
[Diagram: Target Segments and Additional Monolingual Data feed the Language Model]
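To make the role of additional monolingual data concrete, here is a minimal sketch of a bigram language model with add-one smoothing. It is a toy illustration under stated assumptions, not the model used by any SMT platform: the sentences, the function names and the choice of add-one smoothing are all hypothetical.

```python
from collections import Counter


def train_bigram_counts(sentences):
    """Count context tokens and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return contexts, bigrams


def vocabulary_size(sentences):
    """Number of distinct tokens that can follow a context (includes </s>)."""
    vocab = {"</s>"}
    for sentence in sentences:
        vocab.update(sentence.lower().split())
    return len(vocab)


def bigram_prob(prev, word, contexts, bigrams, vocab_size):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)


# Hypothetical data: target side of the bilingual corpus plus extra monolingual text.
target_segments = ["save the file", "open the file"]
extra_monolingual = ["close the window", "save the window"]
all_text = target_segments + extra_monolingual
V = vocabulary_size(all_text)

base = train_bigram_counts(target_segments)
full = train_bigram_counts(all_text)
p_base = bigram_prob("the", "window", *base, V)
p_full = bigram_prob("the", "window", *full, V)
```

With only the target segments, "the window" is unseen and receives just the smoothing mass; after the monolingual data is added, its probability rises, which is the effect the diagram describes.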
[Diagram: SOURCE TEXT → Pre-processing Rules → Translation Model and Language Model → Post-processing Rules → TARGET TEXT]
Pre-processing Rules | Post-processing Rules
Sentence Segmentation | Capitalization
Word Segmentation | Post-process Formatting
Word/Phrase Re-ordering | Grammar Check
Date – Numbers | Tag Injection
Formulas | Currency – Metric Unit Conversion
Pre-normalization | Final Normalization
Spellcheck | Reference Check
Pre-process Formatting | Customized Tasks
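As a rough illustration of how such rules wrap the engine, here is a minimal sketch with one pre-normalization rule, a naive sentence segmenter and a capitalization post-processing rule. The function names are hypothetical, the engine argument is a stand-in for a real MT call, and production rules would be language-specific and far more careful.

```python
import re
import unicodedata


def pre_normalize(text):
    """Pre-normalization: unify Unicode forms and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def capitalize_sentence(text):
    """Capitalization rule: upper-case the first alphabetic character."""
    for i, ch in enumerate(text):
        if ch.isalpha():
            return text[:i] + ch.upper() + text[i + 1:]
    return text


def run_pipeline(source, engine):
    """Apply pre-processing rules, the engine, then post-processing rules."""
    sentences = split_sentences(pre_normalize(source))
    return " ".join(capitalize_sentence(engine(s)) for s in sentences)


# str.lower stands in for the translation engine in this illustration.
result = run_pipeline("  open\u00a0the   file.\tsave it. ", str.lower)
```

Keeping each rule as a small function makes it easier to add, remove or reorder rules when test reports point at a specific issue.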
Tasks - Optimization
Data Analysts
• Data Crawling
• Gathering and Normalizing Data
• Building Corpus
• Corpus Analytics
Testing Team
?
QL – Version Diagram
Tasks - Optimization
Testing Team
• Gap Analysis
• QE and Test Reports
• Output – Corpus Analytics
• Including New Data and Rules
Data Analysts
What we do
Including New Data and Rules: before the first two training steps? To reach a mature production system?
Bilingual Corpus Analytics
• Lemmatization
• Missing Inflections
• Word/lemma distribution map
Gap and Broken Pattern Detection
• This process requires GA and QE reports to be utilized.
Monolingual Corpus Analytics
• Bilingual – monolingual comparison
• Defining the most appropriate LM config
Rule and Data Patch Distinction
• The issues included in the reports are identified.
Term Extraction
• Extracting candidate terms
• Term and lexical unit separation
• Specific glossary and dictionary
Feedback Loops
• Next chapter!
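The candidate term extraction step can be sketched as frequency counting over word n-grams. This is a minimal sketch assuming a tiny English stopword list, a frequency threshold and a three-sentence corpus, all chosen only for illustration; it is not Transistent's actual method, and real term extraction normally adds lemmatization and statistical association measures.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}


def candidate_terms(sentences, max_len=3, min_count=2):
    """Return (term, count) pairs for n-grams that do not start or end with a stopword."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[^\W\d_]+", sentence.lower())  # alphabetic tokens only
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    kept = [(term, c) for term, c in counts.items() if c >= min_count]
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


corpus = [
    "Restart the print server.",
    "The print server is offline.",
    "Check the print queue.",
]
terms = candidate_terms(corpus)
```

Multi-word candidates such as "print server" are the kind of entry a reviewer would consider for the glossary, while single words are closer to ordinary lexical units for a dictionary.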
Ensure that your training data
• Is clean and normalized
• Is relevant to the domain
• Has a complete and healthy linguistic pattern form
• Consists of coherent monolingual and bilingual data
KantanMT Rejected Segments Feature
• Segments too long
• Mismatched Tags/Placeholders
• Source/Target misalignment
• Bad formatting
• Incorrect language combinations
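Some of these rejection reasons can be approximated with simple checks before upload. The following is a minimal sketch, not KantanMT's implementation: the thresholds, the tag pattern and the function name reject_reason are assumptions, and bad formatting and language identification are not covered.

```python
import re

# Matches simple inline tags such as <b> or </b> and placeholders such as {0}.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>|\{\d+\}")


def reject_reason(source, target, max_words=80, max_ratio=3.0):
    """Return a rejection reason for a segment pair, or None if it passes."""
    src_words, tgt_words = source.split(), target.split()
    if not src_words or not tgt_words:
        return "empty segment"
    if len(src_words) > max_words or len(tgt_words) > max_words:
        return "segment too long"
    # Both sides should carry the same multiset of tags and placeholders.
    if sorted(TAG_RE.findall(source)) != sorted(TAG_RE.findall(target)):
        return "mismatched tags/placeholders"
    # A large length ratio is a cheap hint that the pair may be misaligned.
    longer = max(len(src_words), len(tgt_words))
    shorter = min(len(src_words), len(tgt_words))
    if longer / shorter > max_ratio:
        return "possible source/target misalignment"
    return None


ok = reject_reason("Click <b>Save</b>.", "<b>Kaydet</b>'e tıklayın.")
bad_tags = reject_reason("File {0} saved", "Dosya kaydedildi")
```

The length-ratio threshold in particular depends on the language pair, so it should be tuned on a sample of known-good segments rather than taken from this sketch.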
Tweet your questions to #KantanWebinar, or ask them via the webinar chat feature.


Editor's Notes

  1. No more expensive deployments: monthly subscription plan, customised subscription plan. No more complexity: KantanMT does all the heavy lifting. You focus on what you do best – grow and develop your business.
  2. We are the fastest growing MT provider, even though we are one of the young guns!