This document discusses an approach to improving machine translation quality through automated analysis of source texts, translation candidates, and post-edited outputs. It involves profiling source content, evaluating translation candidates for perplexity and style, normalizing user-generated content, checking numbers and dates, and scoring post-edits for consistency. Metrics collected at each stage then feed back into the system to refine machine translation and better support the post-editing process.
2. Background
• MT is here to stay
– Better MT = less PE effort = higher throughput for less money
• MT quality depends on training data quantity, quality, and relevance
– Selecting in-domain data increases scores by 10-20 BLEU points over generic engines
• LSPs have less control over quantity, so we need to focus on quality & relevance
3. A data-driven approach
• Analytics at each step
• Training
– Perplexity Evaluator
– Candidate Scorer
– Source Content Profiler (joint project w/CNGL)
– StyleScorer
• MT Production
– Number checking
– StyleScorer
– UGC Normalization
• Post-Editing
– WeScore
– StyleScorer
4. Candidate Scorer
• Uses corpus of known “difficult” text
• Compares part of speech (POS) n-grams
– Generates per-sentence scores (see the sketch below)
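The deck doesn't name a toolkit or the exact comparison metric, so the sketch below is only an assumption of the shape of such a scorer: NLTK for tagging, and an overlap score defined as the fraction of a sentence's POS n-grams that also occur in the "difficult" profile.

```python
# Hypothetical POS n-gram candidate scorer; NLTK is an assumption here.
# One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk
from collections import Counter

def pos_ngrams(sentence: str, n: int = 3) -> Counter:
    """Extract POS-tag n-grams from one sentence."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def difficulty_profile(difficult_corpus: list[str], n: int = 3) -> Counter:
    """Aggregate POS n-gram counts over the known-'difficult' corpus."""
    profile = Counter()
    for sent in difficult_corpus:
        profile += pos_ngrams(sent, n)
    return profile

def score(sentence: str, profile: Counter, n: int = 3) -> float:
    """Per-sentence score: share of the sentence's POS n-grams that also
    appear in the difficult-text profile (higher = more 'difficult')."""
    grams = pos_ngrams(sentence, n)
    total = sum(grams.values())
    if total == 0:
        return 0.0
    return sum(c for g, c in grams.items() if g in profile) / total
```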
6. Perplexity (PPL) Evaluator
• Build language models (LMs) from multiple corpora
– Known “good” sentences for MT
– Known “bad” sentences for MT
– Client-specific in-domain data
• Each document gets a PPL score against each LM (toy example below)
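As a minimal illustration of one-PPL-score-per-LM, here is a toy add-one-smoothed bigram model; a production setup would use a toolkit such as KenLM or SRILM, and the corpora below are placeholders, not real data.

```python
import math
from collections import Counter

class BigramLM:
    """Toy add-one-smoothed bigram LM (stand-in for KenLM/SRILM)."""
    def __init__(self, sentences):
        self.uni, self.bi = Counter(), Counter()
        for toks in sentences:
            toks = ["<s>"] + toks + ["</s>"]
            self.uni.update(toks[:-1])           # history counts
            self.bi.update(zip(toks, toks[1:]))
        self.v = len(self.uni) + 1               # smoothing vocabulary size

    def perplexity(self, sentences):
        lp = n = 0
        for toks in sentences:
            toks = ["<s>"] + toks + ["</s>"]
            for w1, w2 in zip(toks, toks[1:]):
                lp += math.log((self.bi[(w1, w2)] + 1) / (self.uni[w1] + self.v))
                n += 1
        return math.exp(-lp / n)

# Placeholder corpora standing in for the known-good, known-bad,
# and client in-domain sets from the slide.
corpora = {
    "good":   [["click", "save", "to", "continue"]],
    "bad":    [["lol", "idk", "brb"]],
    "client": [["select", "save", "from", "the", "menu"]],
}
doc = [["click", "save"]]
scores = {name: BigramLM(sents).perplexity(doc) for name, sents in corpora.items()}
```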
8. StyleScorer
• Combines PPL ratio, dissimilarity score, and classification score (illustrative sketch below)
– Each document receives a score from 0-4
– Higher score indicates better match to style established by client’s documents
– Does not require parallel data
• Source scored for training/tuning suitability
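The slides name the three components but not the combination formula, so the following is only a guess at the shape of such a scorer: normalize each component into [0, 1], average, and scale to the 0-4 range.

```python
# Assumed combination formula; the real StyleScorer weighting is not
# given in the deck.
def style_score(ppl_ratio: float, dissimilarity: float, clf_prob: float) -> float:
    """ppl_ratio: PPL(reference LM)/PPL(client LM), higher = closer to client style;
    dissimilarity: 0-1, lower = closer; clf_prob: classifier P(client style)."""
    ppl_component = min(ppl_ratio / 2.0, 1.0)               # squash into [0, 1] (assumed)
    sim_component = 1.0 - max(0.0, min(dissimilarity, 1.0))
    combined = (ppl_component + sim_component + clf_prob) / 3.0
    return round(4.0 * combined, 2)                         # map to the 0-4 range
```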
9. Source Content Profiler
• CNGL project (beta)
– Classification of docs into profiles
– Features based on (rough sketch after this list):
• Word & sentence length
• Readability score
• Syntactic structure
• Terminology
• Tag ratios
• Do Not Translate lists
• Glossary matches
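A rough sketch of the kind of features listed above; the regexes and the syllable heuristic behind the readability score are assumptions, not the CNGL implementation, and the syntactic-structure and terminology features are omitted for brevity.

```python
import re

def profile_features(doc: str, glossary: set[str], dnt: set[str]) -> dict:
    """Extract simple profiling features from one document (illustrative only)."""
    sents = [s for s in re.split(r"[.!?]+\s*", doc) if s]
    words = re.findall(r"[A-Za-z']+", doc)
    tags = re.findall(r"<[^>]+>|\{\d+\}", doc)   # markup / placeholder tags
    # Crude syllable count: runs of vowels per word, minimum one.
    syll = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    n_w, n_s = max(len(words), 1), max(len(sents), 1)
    return {
        "avg_word_len": sum(map(len, words)) / n_w,
        "avg_sent_len": n_w / n_s,
        # Flesch reading ease: 206.835 - 1.015(words/sent) - 84.6(syll/word)
        "flesch": 206.835 - 1.015 * (n_w / n_s) - 84.6 * (syll / n_w),
        "tag_ratio": len(tags) / n_w,
        "glossary_hits": sum(w.lower() in glossary for w in words),
        "dnt_hits": sum(w.lower() in dnt for w in words),
    }
```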
12. UGC normalization
• Make substitutions in source for known MT pain points before translating (sketch after the examples)
– Frequent misspellings – “teh”, “mroe”, etc.
– Abbreviations – “imho”, “tyvm”, etc.
– Missing punctuation – “cant”, “theyll”, etc.
– Emoticons
– Spelling variants/slang – “cuz”, “usu”, etc.
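A minimal dictionary-driven pass of this kind might look as follows; the mapping mixes the slide's examples with assumed expansions, and dropping emoticons is an assumed policy, not the production rule.

```python
import re

# Slide examples plus assumed expansions; not the production mapping.
SUBS = {
    "teh": "the", "mroe": "more",                 # frequent misspellings
    "imho": "in my humble opinion",               # abbreviations
    "tyvm": "thank you very much",
    "cant": "can't", "theyll": "they'll",         # missing punctuation
    "cuz": "because", "usu": "usually",           # slang / spelling variants
    ":)": "", ";)": "",                           # emoticons (dropped here, by assumption)
}

def _rx(key: str) -> str:
    # Word boundaries for word-like keys; literal match for emoticons.
    return rf"\b{re.escape(key)}\b" if key[0].isalnum() else re.escape(key)

_PATTERN = re.compile("|".join(_rx(k) for k in sorted(SUBS, key=len, reverse=True)),
                      re.IGNORECASE)

def normalize(text: str) -> str:
    return _PATTERN.sub(lambda m: SUBS[m.group(0).lower()], text).strip()

print(normalize("imho teh app cant load mroe pages cuz of caching :)"))
# -> in my humble opinion the app can't load more pages because of caching
```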
13. Number checking
• Verify that numeric MT output is localized correctly (sketch below)
– Currency – “$1B” vs “1 млрд. $”
– Dates – “2/28/2014” vs “28/2/2014”
– Time – “2pm” vs “14h00”
– Separator & radix – “1,234.5” vs “1 234,5”
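One simple way to implement such a check: extract number-like tokens from source and MT output, strip grouping and radix characters, and compare the resulting digit sets. The patterns below are illustrative assumptions, not the production rules; note this variant deliberately accepts legitimate reorderings such as “2/28/2014” vs “28/2/2014” and only flags digits that changed or disappeared.

```python
import re

def digits(text: str) -> list[str]:
    """Pull out number-like tokens and strip grouping/radix characters so
    '1,234.5', '1 234,5', and '1.234,5' all compare as '12345'."""
    tokens = re.findall(r"\d[\d .,\u00a0]*\d|\d", text)
    return sorted(re.sub(r"[ .,\u00a0]", "", t) for t in tokens)

def numbers_match(source: str, target: str) -> bool:
    return digits(source) == digits(target)

assert numbers_match("Total: $1,234.50", "Итого: 1 234,50 $")
assert not numbers_match("Due 2/28/2014", "Срок: 28/3/2014")
```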
14. StyleScorer revisited
• MT output is compared to client’s historical (in-domain) PE data
– Treat each target segment as a document
– Lower scores indicate segments likely to require greater PE effort
16. WeScore
• Dashboard for viewing MT metrics
– Tokenizes input from a variety of formats & runs several scoring algorithms in parallel
– Exports detailed analysis to spreadsheet for sentence-by-sentence review (approximate sketch below)
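WeScore itself isn't public, but the described behavior (several scorers in parallel, spreadsheet export) can be approximated with open-source metrics; the sketch below assumes sacrebleu is installed and uses BLEU, chrF, and TER as stand-in algorithms.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from sacrebleu.metrics import BLEU, CHRF, TER

METRICS = {"bleu": BLEU(effective_order=True), "chrf": CHRF(), "ter": TER()}

def _score_all(name, hyps, refs):
    m = METRICS[name]
    return name, [m.sentence_score(h, [r]).score for h, r in zip(hyps, refs)]

def export_scores(hyps, refs, path="scores.csv"):
    # Run the scoring algorithms in parallel, one thread per metric.
    with ThreadPoolExecutor() as pool:
        results = dict(pool.map(lambda n: _score_all(n, hyps, refs), METRICS))
    # Export one row per sentence for side-by-side review.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["hypothesis", "reference", *METRICS])
        for i, (h, r) in enumerate(zip(hyps, refs)):
            writer.writerow([h, r, *(results[n][i] for n in METRICS)])

export_scores(["the cat sat on the mat"], ["the cat sat on a mat"])
```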
19. StyleScorer III
• PE output is compared to client’s historical (in-domain) data
– Treat each PE segment as a document
– Lower score indicates possible deviation from established style
20. Feedback loop
• Data collected and lessons learned
– Update client-specific data for future engine training
– Mine data for generalizable patterns in problem areas
– Work with post-editors to understand how to make a better system & how to improve PE experience and throughput