SlideShare a Scribd company logo
1 of 21
Better translations through automated
source and post-edit analysis
David Landan
Welocalize
Background
• MT is here to stay
– Better MT = less PE effort = higher throughput
for less money

• MT quality depends on training data
quantity, quality, and relevance
– Selecting in-domain data increases BLEU
scores by 10-20 BLEU over generic engines

• LSPs have less control over quantity, so
we need to focus on quality & relevance
A data-driven approach
• Analytics at each step
Training

MT Production

•

•

Perplexity Evaluator

•

Candidate Scorer

•

Source Content
Profiler (joint project
w/CNGL)

•

StyleScorer

•

Number checking

StyleScorer

•

UGC Normalization

Post-Editing

•

WeScore

•

StyleScorer
Candidate Scorer
• Uses corpus of known “difficult” text
• Compares part of speech (POS) n-grams
– Generates per-sentence scores
Perplexity (PPL) Evaluator
• Build language models (LMs) from
multiple corpora
– Known “good” sentences for MT
– Known “bad” sentences for MT
– Client-specific in-domain data

• Each document gets a PPL score
against each LM
StyleScorer
• Combines PPL ratio, dissimilarity
score, and classification score
– Each document receives a score from 0-4
– Higher score indicates better match to
style established by client’s documents
– Does not require parallel data

• Source scored for training/tuning
suitability
Source Content Profiler
• CNGL project (beta)
– Classification of docs into profiles
– Features based on:
•
•
•
•
•
•
•

Word & sent. length
Readability score
Syntactic structure
Terminology
Tag ratios
Do Not Translate lists
Glossary matches
Does it work?
en-USnl-NL

en-USpl-PL

en-UShu-HU

Plain vanilla

21.26

16.88

17.31

Domain match

36.39

37.07

38.36

Plain + target

44.07

34.61

30.43

Domain + target

64.40

54.55

49.53

Engine
A data-driven approach
• Analytics at each step
Training

MT Production

•

•

Perplexity Evaluator

•

Candidate Scorer

•

Source Content
Profiler (joint project
w/CNGL)

•

StyleScorer

•

Number checking

StyleScorer

•

UGC Normalization

Post-Editing

•

WeScore

•

StyleScorer
UGC normalization
• Make substitutions in source for known
MT pain points before translating
– Frequent misspellings – “teh”, “mroe”, etc.
– Abbreviations – “imho”, “tyvm”, etc.
– Missing punctuation – “cant”, “theyll”, etc.
– Emoticons
– Spelling variants/slang – “cuz”, “usu”, etc.
Number checking
• Verify that numeric MT output is
localized correctly
– Currency – “$1B” vs “1 млрд. $”
– Dates – “2/28/2014” vs “28/2/2014”
– Time – “2pm” vs “14h00”
– Separator & radix – “1,234.5” vs “1 234,5”
StyleScorer revisited
• MT output is compared to client’s
historical (in-domain) PE data
– Treat each target segment as a
document
– Lower scores indicate segments likely to
require greater PE effort
A data-driven approach
• Analytics at each step
Training

MT Production

•

•

Perplexity Evaluator

•

Candidate Scorer

•

Source Content
Profiler (joint project
w/CNGL)

•

StyleScorer

•

Number checking

StyleScorer

•

UGC Normalization

Post-Editing

•

WeScore

•

StyleScorer
WeScore
• Dashboard for viewing MT metrics
– Tokenizes input from variety of formats &
runs several scoring algorithms in parallel
– Exports detailed analysis to spreadsheet
for sentence-by-sentence review
WeScore
WeScore
StyleScorer III
• PE output is compared to client’s
historical (in-domain) data
– Treat each PE segment as a document
– Lower score indicates possible deviation
from established style
Feedback loop
• Data collected and lessons learned
– Update client-specific data for future
engine training
– Mine data for generalizable patterns in
problem areas
– Work with post-editors to understand how
to make a better system & how to
improve PE experience and throughput
Q&A
Thank you!

More Related Content

Similar to Automated source and post-edit analysis improves MT quality

Kerstin Berns (berns language consulting) at Industry Leaders Forum 2015
Kerstin Berns (berns language consulting) at Industry Leaders Forum 2015Kerstin Berns (berns language consulting) at Industry Leaders Forum 2015
Kerstin Berns (berns language consulting) at Industry Leaders Forum 2015TAUS - The Language Data Network
 
Data anylysis consultancy services india
Data anylysis consultancy services indiaData anylysis consultancy services india
Data anylysis consultancy services indiaPRAKASH BHOSALE
 
Is Your Enterprise “fire-fighting” translation issues? Optimize the process w...
Is Your Enterprise “fire-fighting” translation issues? Optimize the process w...Is Your Enterprise “fire-fighting” translation issues? Optimize the process w...
Is Your Enterprise “fire-fighting” translation issues? Optimize the process w...dclsocialmedia
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comSimon Hughes
 
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Lucidworks
 
Custom spellchecker for SOLR
Custom spellchecker for SOLRCustom spellchecker for SOLR
Custom spellchecker for SOLRMurthy Remella
 
How Much Cake to Eat: The Case for Targeted MT Engines
How Much Cake to Eat: The Case for Targeted MT EnginesHow Much Cake to Eat: The Case for Targeted MT Engines
How Much Cake to Eat: The Case for Targeted MT EnginesWelocalize
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxElasticsearch
 
What machine translation developers are doing to make post-editors happy
What machine translation developers are doing to make post-editors happyWhat machine translation developers are doing to make post-editors happy
What machine translation developers are doing to make post-editors happyIconic Translation Machines
 
Pushing translation quality upstream (Klaus Fleischman, Managing Director of ...
Pushing translation quality upstream (Klaus Fleischman, Managing Director of ...Pushing translation quality upstream (Klaus Fleischman, Managing Director of ...
Pushing translation quality upstream (Klaus Fleischman, Managing Director of ...TAUS - The Language Data Network
 
Evaluation of MT Quality/Productivity at eBay - AMTA 2018
Evaluation of MT Quality/Productivity at eBay - AMTA 2018Evaluation of MT Quality/Productivity at eBay - AMTA 2018
Evaluation of MT Quality/Productivity at eBay - AMTA 2018Jose Luis Bonilla Sánchez
 
LavaCon 2015: Efficient Translation Management - 5 Specific Metrics That Wil...
LavaCon 2015:  Efficient Translation Management - 5 Specific Metrics That Wil...LavaCon 2015:  Efficient Translation Management - 5 Specific Metrics That Wil...
LavaCon 2015: Efficient Translation Management - 5 Specific Metrics That Wil...Scott Carothers
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxElasticsearch
 
The effects of preferred text formatting on performance and perceptual appeal
The effects of preferred text formatting on performance and perceptual appealThe effects of preferred text formatting on performance and perceptual appeal
The effects of preferred text formatting on performance and perceptual appealPaul Doncaster
 
Creating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyCreating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyRTTS
 

Similar to Automated source and post-edit analysis improves MT quality (20)

What Data would you like to Track? - Fred Tuinstra
What Data would you like to Track? - Fred TuinstraWhat Data would you like to Track? - Fred Tuinstra
What Data would you like to Track? - Fred Tuinstra
 
Kerstin Berns (berns language consulting) at Industry Leaders Forum 2015
Kerstin Berns (berns language consulting) at Industry Leaders Forum 2015Kerstin Berns (berns language consulting) at Industry Leaders Forum 2015
Kerstin Berns (berns language consulting) at Industry Leaders Forum 2015
 
Data anylysis consultancy services india
Data anylysis consultancy services indiaData anylysis consultancy services india
Data anylysis consultancy services india
 
Is Your Enterprise “fire-fighting” translation issues? Optimize the process w...
Is Your Enterprise “fire-fighting” translation issues? Optimize the process w...Is Your Enterprise “fire-fighting” translation issues? Optimize the process w...
Is Your Enterprise “fire-fighting” translation issues? Optimize the process w...
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
 
5. bleu
5. bleu5. bleu
5. bleu
 
Ogdc 2007 David Lakritz
Ogdc 2007 David LakritzOgdc 2007 David Lakritz
Ogdc 2007 David Lakritz
 
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
 
Custom spellchecker for SOLR
Custom spellchecker for SOLRCustom spellchecker for SOLR
Custom spellchecker for SOLR
 
How Much Cake to Eat: The Case for Targeted MT Engines
How Much Cake to Eat: The Case for Targeted MT EnginesHow Much Cake to Eat: The Case for Targeted MT Engines
How Much Cake to Eat: The Case for Targeted MT Engines
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolbox
 
What machine translation developers are doing to make post-editors happy
What machine translation developers are doing to make post-editors happyWhat machine translation developers are doing to make post-editors happy
What machine translation developers are doing to make post-editors happy
 
Churn modelling
Churn modellingChurn modelling
Churn modelling
 
Pushing translation quality upstream (Klaus Fleischman, Managing Director of ...
Pushing translation quality upstream (Klaus Fleischman, Managing Director of ...Pushing translation quality upstream (Klaus Fleischman, Managing Director of ...
Pushing translation quality upstream (Klaus Fleischman, Managing Director of ...
 
Evaluation of MT Quality/Productivity at eBay - AMTA 2018
Evaluation of MT Quality/Productivity at eBay - AMTA 2018Evaluation of MT Quality/Productivity at eBay - AMTA 2018
Evaluation of MT Quality/Productivity at eBay - AMTA 2018
 
LavaCon 2015: Efficient Translation Management - 5 Specific Metrics That Wil...
LavaCon 2015:  Efficient Translation Management - 5 Specific Metrics That Wil...LavaCon 2015:  Efficient Translation Management - 5 Specific Metrics That Wil...
LavaCon 2015: Efficient Translation Management - 5 Specific Metrics That Wil...
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolbox
 
The effects of preferred text formatting on performance and perceptual appeal
The effects of preferred text formatting on performance and perceptual appealThe effects of preferred text formatting on performance and perceptual appeal
The effects of preferred text formatting on performance and perceptual appeal
 
IR
IRIR
IR
 
Creating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyCreating a Data validation and Testing Strategy
Creating a Data validation and Testing Strategy
 

More from Welocalize

Automating the Localization Workflow. What Works?
Automating the Localization Workflow. What Works?Automating the Localization Workflow. What Works?
Automating the Localization Workflow. What Works?Welocalize
 
MT Quality Evaluations: From Test Environment to Production
MT Quality Evaluations: From Test Environment to ProductionMT Quality Evaluations: From Test Environment to Production
MT Quality Evaluations: From Test Environment to ProductionWelocalize
 
EAMT Presentation by Welocalize Olga Beregovaya May 2015
EAMT Presentation by Welocalize Olga Beregovaya May 2015EAMT Presentation by Welocalize Olga Beregovaya May 2015
EAMT Presentation by Welocalize Olga Beregovaya May 2015Welocalize
 
Localizing for Travel: Diverse Solutions for Diverse Needs by Laura Casanell...
Localizing for Travel: Diverse Solutions for Diverse Needs by Laura Casanell...Localizing for Travel: Diverse Solutions for Diverse Needs by Laura Casanell...
Localizing for Travel: Diverse Solutions for Diverse Needs by Laura Casanell...Welocalize
 
Welocalize Throughputs and Post-Editing Productivity Webinar Laura Casanellas
Welocalize Throughputs and Post-Editing Productivity Webinar Laura CasanellasWelocalize Throughputs and Post-Editing Productivity Webinar Laura Casanellas
Welocalize Throughputs and Post-Editing Productivity Webinar Laura CasanellasWelocalize
 
Content Marketing World 2014 Language Fun Fact Challenge by welocalize
Content Marketing World 2014 Language Fun Fact Challenge by welocalizeContent Marketing World 2014 Language Fun Fact Challenge by welocalize
Content Marketing World 2014 Language Fun Fact Challenge by welocalizeWelocalize
 
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...Welocalize
 
Welocalize Cisco CNGL Partnership Shared at Localization World Dublin 2014
Welocalize Cisco CNGL Partnership Shared at Localization World Dublin 2014Welocalize Cisco CNGL Partnership Shared at Localization World Dublin 2014
Welocalize Cisco CNGL Partnership Shared at Localization World Dublin 2014Welocalize
 
TAUS Quality Summit Dublin Welocalize Presentation by Olga Beregovaya and Len...
TAUS Quality Summit Dublin Welocalize Presentation by Olga Beregovaya and Len...TAUS Quality Summit Dublin Welocalize Presentation by Olga Beregovaya and Len...
TAUS Quality Summit Dublin Welocalize Presentation by Olga Beregovaya and Len...Welocalize
 
Beyond Disruption: Make Way for Return on Content by Welocalize Olga Beregovaya
Beyond Disruption: Make Way for Return on Content by Welocalize Olga BeregovayaBeyond Disruption: Make Way for Return on Content by Welocalize Olga Beregovaya
Beyond Disruption: Make Way for Return on Content by Welocalize Olga BeregovayaWelocalize
 
2013 CHAT tcworld tekom Welocalize Teaminology
2013 CHAT tcworld tekom Welocalize Teaminology 2013 CHAT tcworld tekom Welocalize Teaminology
2013 CHAT tcworld tekom Welocalize Teaminology Welocalize
 
Overcoming “Old Fears” in the “New Marketing” World by Informatica and Weloca...
Overcoming “Old Fears” in the “New Marketing” World by Informatica and Weloca...Overcoming “Old Fears” in the “New Marketing” World by Informatica and Weloca...
Overcoming “Old Fears” in the “New Marketing” World by Informatica and Weloca...Welocalize
 
WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...
WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...
WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...Welocalize
 
An MT Journey Intuit and Welocalize Localization World 2013
An MT Journey Intuit and Welocalize Localization World 2013An MT Journey Intuit and Welocalize Localization World 2013
An MT Journey Intuit and Welocalize Localization World 2013Welocalize
 
Safaba Welocalize MT Summit 2013 Analyzing MT Utility and Post-Editing
Safaba Welocalize MT Summit 2013 Analyzing MT Utility and Post-EditingSafaba Welocalize MT Summit 2013 Analyzing MT Utility and Post-Editing
Safaba Welocalize MT Summit 2013 Analyzing MT Utility and Post-EditingWelocalize
 
MT Summit 2013 Welocalize Getting the MT Recipe Right by L Casanellas and L Marg
MT Summit 2013 Welocalize Getting the MT Recipe Right by L Casanellas and L MargMT Summit 2013 Welocalize Getting the MT Recipe Right by L Casanellas and L Marg
MT Summit 2013 Welocalize Getting the MT Recipe Right by L Casanellas and L MargWelocalize
 

More from Welocalize (16)

Automating the Localization Workflow. What Works?
Automating the Localization Workflow. What Works?Automating the Localization Workflow. What Works?
Automating the Localization Workflow. What Works?
 
MT Quality Evaluations: From Test Environment to Production
MT Quality Evaluations: From Test Environment to ProductionMT Quality Evaluations: From Test Environment to Production
MT Quality Evaluations: From Test Environment to Production
 
EAMT Presentation by Welocalize Olga Beregovaya May 2015
EAMT Presentation by Welocalize Olga Beregovaya May 2015EAMT Presentation by Welocalize Olga Beregovaya May 2015
EAMT Presentation by Welocalize Olga Beregovaya May 2015
 
Localizing for Travel: Diverse Solutions for Diverse Needs by Laura Casanell...
Localizing for Travel: Diverse Solutions for Diverse Needs by Laura Casanell...Localizing for Travel: Diverse Solutions for Diverse Needs by Laura Casanell...
Localizing for Travel: Diverse Solutions for Diverse Needs by Laura Casanell...
 
Welocalize Throughputs and Post-Editing Productivity Webinar Laura Casanellas
Welocalize Throughputs and Post-Editing Productivity Webinar Laura CasanellasWelocalize Throughputs and Post-Editing Productivity Webinar Laura Casanellas
Welocalize Throughputs and Post-Editing Productivity Webinar Laura Casanellas
 
Content Marketing World 2014 Language Fun Fact Challenge by welocalize
Content Marketing World 2014 Language Fun Fact Challenge by welocalizeContent Marketing World 2014 Language Fun Fact Challenge by welocalize
Content Marketing World 2014 Language Fun Fact Challenge by welocalize
 
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
 
Welocalize Cisco CNGL Partnership Shared at Localization World Dublin 2014
Welocalize Cisco CNGL Partnership Shared at Localization World Dublin 2014Welocalize Cisco CNGL Partnership Shared at Localization World Dublin 2014
Welocalize Cisco CNGL Partnership Shared at Localization World Dublin 2014
 
TAUS Quality Summit Dublin Welocalize Presentation by Olga Beregovaya and Len...
TAUS Quality Summit Dublin Welocalize Presentation by Olga Beregovaya and Len...TAUS Quality Summit Dublin Welocalize Presentation by Olga Beregovaya and Len...
TAUS Quality Summit Dublin Welocalize Presentation by Olga Beregovaya and Len...
 
Beyond Disruption: Make Way for Return on Content by Welocalize Olga Beregovaya
Beyond Disruption: Make Way for Return on Content by Welocalize Olga BeregovayaBeyond Disruption: Make Way for Return on Content by Welocalize Olga Beregovaya
Beyond Disruption: Make Way for Return on Content by Welocalize Olga Beregovaya
 
2013 CHAT tcworld tekom Welocalize Teaminology
2013 CHAT tcworld tekom Welocalize Teaminology 2013 CHAT tcworld tekom Welocalize Teaminology
2013 CHAT tcworld tekom Welocalize Teaminology
 
Overcoming “Old Fears” in the “New Marketing” World by Informatica and Weloca...
Overcoming “Old Fears” in the “New Marketing” World by Informatica and Weloca...Overcoming “Old Fears” in the “New Marketing” World by Informatica and Weloca...
Overcoming “Old Fears” in the “New Marketing” World by Informatica and Weloca...
 
WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...
WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...
WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...
 
An MT Journey Intuit and Welocalize Localization World 2013
An MT Journey Intuit and Welocalize Localization World 2013An MT Journey Intuit and Welocalize Localization World 2013
An MT Journey Intuit and Welocalize Localization World 2013
 
Safaba Welocalize MT Summit 2013 Analyzing MT Utility and Post-Editing
Safaba Welocalize MT Summit 2013 Analyzing MT Utility and Post-EditingSafaba Welocalize MT Summit 2013 Analyzing MT Utility and Post-Editing
Safaba Welocalize MT Summit 2013 Analyzing MT Utility and Post-Editing
 
MT Summit 2013 Welocalize Getting the MT Recipe Right by L Casanellas and L Marg
MT Summit 2013 Welocalize Getting the MT Recipe Right by L Casanellas and L MargMT Summit 2013 Welocalize Getting the MT Recipe Right by L Casanellas and L Marg
MT Summit 2013 Welocalize Getting the MT Recipe Right by L Casanellas and L Marg
 

Recently uploaded

8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR
8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR
8447779800, Low rate Call girls in New Ashok Nagar Delhi NCRashishs7044
 
Organizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessOrganizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessSeta Wicaksana
 
International Business Environments and Operations 16th Global Edition test b...
International Business Environments and Operations 16th Global Edition test b...International Business Environments and Operations 16th Global Edition test b...
International Business Environments and Operations 16th Global Edition test b...ssuserf63bd7
 
Guide Complete Set of Residential Architectural Drawings PDF
Guide Complete Set of Residential Architectural Drawings PDFGuide Complete Set of Residential Architectural Drawings PDF
Guide Complete Set of Residential Architectural Drawings PDFChandresh Chudasama
 
Financial-Statement-Analysis-of-Coca-cola-Company.pptx
Financial-Statement-Analysis-of-Coca-cola-Company.pptxFinancial-Statement-Analysis-of-Coca-cola-Company.pptx
Financial-Statement-Analysis-of-Coca-cola-Company.pptxsaniyaimamuddin
 
Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024Kirill Klimov
 
Darshan Hiranandani [News About Next CEO].pdf
Darshan Hiranandani [News About Next CEO].pdfDarshan Hiranandani [News About Next CEO].pdf
Darshan Hiranandani [News About Next CEO].pdfShashank Mehta
 
Innovation Conference 5th March 2024.pdf
Innovation Conference 5th March 2024.pdfInnovation Conference 5th March 2024.pdf
Innovation Conference 5th March 2024.pdfrichard876048
 
Market Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 EditionMarket Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 EditionMintel Group
 
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City GurgaonCall Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaoncallgirls2057
 
Traction part 2 - EOS Model JAX Bridges.
Traction part 2 - EOS Model JAX Bridges.Traction part 2 - EOS Model JAX Bridges.
Traction part 2 - EOS Model JAX Bridges.Anamaria Contreras
 
Church Building Grants To Assist With New Construction, Additions, And Restor...
Church Building Grants To Assist With New Construction, Additions, And Restor...Church Building Grants To Assist With New Construction, Additions, And Restor...
Church Building Grants To Assist With New Construction, Additions, And Restor...Americas Got Grants
 
Unlocking the Future: Explore Web 3.0 Workshop to Start Earning Today!
Unlocking the Future: Explore Web 3.0 Workshop to Start Earning Today!Unlocking the Future: Explore Web 3.0 Workshop to Start Earning Today!
Unlocking the Future: Explore Web 3.0 Workshop to Start Earning Today!Doge Mining Website
 
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607dollysharma2066
 
1911 Gold Corporate Presentation Apr 2024.pdf
1911 Gold Corporate Presentation Apr 2024.pdf1911 Gold Corporate Presentation Apr 2024.pdf
1911 Gold Corporate Presentation Apr 2024.pdfShaun Heinrichs
 
FULL ENJOY Call girls in Paharganj Delhi | 8377087607
FULL ENJOY Call girls in Paharganj Delhi | 8377087607FULL ENJOY Call girls in Paharganj Delhi | 8377087607
FULL ENJOY Call girls in Paharganj Delhi | 8377087607dollysharma2066
 
Investment in The Coconut Industry by Nancy Cheruiyot
Investment in The Coconut Industry by Nancy CheruiyotInvestment in The Coconut Industry by Nancy Cheruiyot
Investment in The Coconut Industry by Nancy Cheruiyotictsugar
 

Recently uploaded (20)

8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR
8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR
8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR
 
Organizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessOrganizational Structure Running A Successful Business
Organizational Structure Running A Successful Business
 
Corporate Profile 47Billion Information Technology
Corporate Profile 47Billion Information TechnologyCorporate Profile 47Billion Information Technology
Corporate Profile 47Billion Information Technology
 
Japan IT Week 2024 Brochure by 47Billion (English)
Japan IT Week 2024 Brochure by 47Billion (English)Japan IT Week 2024 Brochure by 47Billion (English)
Japan IT Week 2024 Brochure by 47Billion (English)
 
International Business Environments and Operations 16th Global Edition test b...
International Business Environments and Operations 16th Global Edition test b...International Business Environments and Operations 16th Global Edition test b...
International Business Environments and Operations 16th Global Edition test b...
 
Guide Complete Set of Residential Architectural Drawings PDF
Guide Complete Set of Residential Architectural Drawings PDFGuide Complete Set of Residential Architectural Drawings PDF
Guide Complete Set of Residential Architectural Drawings PDF
 
Financial-Statement-Analysis-of-Coca-cola-Company.pptx
Financial-Statement-Analysis-of-Coca-cola-Company.pptxFinancial-Statement-Analysis-of-Coca-cola-Company.pptx
Financial-Statement-Analysis-of-Coca-cola-Company.pptx
 
Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024
 
Darshan Hiranandani [News About Next CEO].pdf
Darshan Hiranandani [News About Next CEO].pdfDarshan Hiranandani [News About Next CEO].pdf
Darshan Hiranandani [News About Next CEO].pdf
 
Innovation Conference 5th March 2024.pdf
Innovation Conference 5th March 2024.pdfInnovation Conference 5th March 2024.pdf
Innovation Conference 5th March 2024.pdf
 
Market Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 EditionMarket Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 Edition
 
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City GurgaonCall Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaon
 
Traction part 2 - EOS Model JAX Bridges.
Traction part 2 - EOS Model JAX Bridges.Traction part 2 - EOS Model JAX Bridges.
Traction part 2 - EOS Model JAX Bridges.
 
Church Building Grants To Assist With New Construction, Additions, And Restor...
Church Building Grants To Assist With New Construction, Additions, And Restor...Church Building Grants To Assist With New Construction, Additions, And Restor...
Church Building Grants To Assist With New Construction, Additions, And Restor...
 
Unlocking the Future: Explore Web 3.0 Workshop to Start Earning Today!
Unlocking the Future: Explore Web 3.0 Workshop to Start Earning Today!Unlocking the Future: Explore Web 3.0 Workshop to Start Earning Today!
Unlocking the Future: Explore Web 3.0 Workshop to Start Earning Today!
 
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
 
1911 Gold Corporate Presentation Apr 2024.pdf
1911 Gold Corporate Presentation Apr 2024.pdf1911 Gold Corporate Presentation Apr 2024.pdf
1911 Gold Corporate Presentation Apr 2024.pdf
 
FULL ENJOY Call girls in Paharganj Delhi | 8377087607
FULL ENJOY Call girls in Paharganj Delhi | 8377087607FULL ENJOY Call girls in Paharganj Delhi | 8377087607
FULL ENJOY Call girls in Paharganj Delhi | 8377087607
 
No-1 Call Girls In Goa 93193 VIP 73153 Escort service In North Goa Panaji, Ca...
No-1 Call Girls In Goa 93193 VIP 73153 Escort service In North Goa Panaji, Ca...No-1 Call Girls In Goa 93193 VIP 73153 Escort service In North Goa Panaji, Ca...
No-1 Call Girls In Goa 93193 VIP 73153 Escort service In North Goa Panaji, Ca...
 
Investment in The Coconut Industry by Nancy Cheruiyot
Investment in The Coconut Industry by Nancy CheruiyotInvestment in The Coconut Industry by Nancy Cheruiyot
Investment in The Coconut Industry by Nancy Cheruiyot
 

Automated source and post-edit analysis improves MT quality

  • 1. Better translations through automated source and post-edit analysis David Landan Welocalize
  • 2. Background • MT is here to stay – Better MT = less PE effort = higher throughput for less money • MT quality depends on training data quantity, quality, and relevance – Selecting in-domain data increases BLEU scores by 10-20 BLEU over generic engines • LSPs have less control over quantity, so we need to focus on quality & relevance
  • 3. A data-driven approach • Analytics at each step Training MT Production • • Perplexity Evaluator • Candidate Scorer • Source Content Profiler (joint project w/CNGL) • StyleScorer • Number checking StyleScorer • UGC Normalization Post-Editing • WeScore • StyleScorer
  • 4. Candidate Scorer • Uses corpus of known “difficult” text • Compares part of speech (POS) n-grams – Generates per-sentence scores
  • 5.
  • 6. Perplexity (PPL) Evaluator • Build language models (LMs) from multiple corpora – Known “good” sentences for MT – Known “bad” sentences for MT – Client-specific in-domain data • Each document gets a PPL score against each LM
  • 7.
  • 8. StyleScorer • Combines PPL ratio, dissimilarity score, and classification score – Each document receives a score from 0-4 – Higher score indicates better match to style established by client’s documents – Does not require parallel data • Source scored for training/tuning suitability
  • 9. Source Content Profiler • CNGL project (beta) – Classification of docs into profiles – Features based on: • • • • • • • Word & sent. length Readability score Syntactic structure Terminology Tag ratios Do Not Translate lists Glossary matches
  • 10. Does it work? en-USnl-NL en-USpl-PL en-UShu-HU Plain vanilla 21.26 16.88 17.31 Domain match 36.39 37.07 38.36 Plain + target 44.07 34.61 30.43 Domain + target 64.40 54.55 49.53 Engine
  • 11. A data-driven approach • Analytics at each step Training MT Production • • Perplexity Evaluator • Candidate Scorer • Source Content Profiler (joint project w/CNGL) • StyleScorer • Number checking StyleScorer • UGC Normalization Post-Editing • WeScore • StyleScorer
  • 12. UGC normalization • Make substitutions in source for known MT pain points before translating – Frequent misspellings – “teh”, “mroe”, etc. – Abbreviations – “imho”, “tyvm”, etc. – Missing punctuation – “cant”, “theyll”, etc. – Emoticons – Spelling variants/slang – “cuz”, “usu”, etc.
  • 13. Number checking • Verify that numeric MT output is localized correctly – Currency – “$1B” vs “1 млрд. $” – Dates – “2/28/2014” vs “28/2/2014” – Time – “2pm” vs “14h00” – Separator & radix – “1,234.5” vs “1 234,5”
  • 14. StyleScorer revisited • MT output is compared to client’s historical (in-domain) PE data – Treat each target segment as a document – Lower scores indicate segments likely to require greater PE effort
  • 15. A data-driven approach • Analytics at each step Training MT Production • • Perplexity Evaluator • Candidate Scorer • Source Content Profiler (joint project w/CNGL) • StyleScorer • Number checking StyleScorer • UGC Normalization Post-Editing • WeScore • StyleScorer
  • 16. WeScore • Dashboard for viewing MT metrics – Tokenizes input from variety of formats & runs several scoring algorithms in parallel – Exports detailed analysis to spreadsheet for sentence-by-sentence review
  • 19. StyleScorer III • PE output is compared to client’s historical (in-domain) data – Treat each PE segment as a document – Lower score indicates possible deviation from established style
  • 20. Feedback loop • Data collected and lessons learned – Update client-specific data for future engine training – Mine data for generalizable patterns in problem areas – Work with post-editors to understand how to make a better system & how to improve PE experience and throughput

Editor's Notes

  1. Trained on ~200k TUs, tuning size: ~2k TUs