One of the main problems in the translation industry today is the lack of benchmarking. The output of MT engines cannot be compared against industry averages or standards because these do not yet exist, and automated scores are meaningless outside the “laboratory”. At the same time, buyers of translation services are increasingly interested in translated content at different quality levels: they want to save on some content and invest more in other content. They also want to know how different engines perform on different content types and in different language pairs. How can we be sure MT providers deliver what they are paid for? Benchmarking MT engines and creating a library of MT use cases is one way to move forward. Using industry benchmarking based on evaluation and productivity data is another option. One way or another, buyers need to be able to compare and benchmark MT solutions to make informed decisions.
Major competitions like WMT (Europe), OpenMT (US), CWMT (China), and WAT (Japan, running since 2014)
* (only used in WAT in Japan, but correlates with human judgment significantly better than BLEU, especially when word order differs greatly between the source and target languages)
** (measures how much effort is needed to correct the actual MT output)
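One common way to operationalize the effort-based note above is a word-level edit-distance score between the raw MT output and its post-edited version, in the spirit of TER/HTER. Below is a minimal sketch, assuming simple whitespace tokenization and normalization by post-edit length; the example segments are invented for illustration and this is not a TAUS-defined metric.

```python
# Minimal sketch: word-level edit distance between raw MT output and its
# post-edited version, normalized by post-edit length (TER/HTER-style).
# The example strings below are invented for illustration only.

def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    rows, cols = len(hyp_tokens) + 1, len(ref_tokens) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[-1][-1]

def post_edit_effort(mt_output, post_edited):
    """Edits needed to turn the MT output into the post-edited text,
    normalized by the post-edit length (lower = less effort)."""
    hyp, ref = mt_output.split(), post_edited.split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)

if __name__ == "__main__":
    mt = "the contract must been signed by both the parties"
    pe = "the contract must be signed by both parties"
    print(f"post-editing effort: {post_edit_effort(mt, pe):.2f}")  # -> 0.25
```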
(commercially interesting domains for MT Buyers; TAUS can survey its users to come up with a short list)
(https://evaluate.taus.net/academy/best-practices/evaluate-best-practices/adequacy-fluency-guidelines)
(https://evaluate.taus.net/academy/best-practices/evaluate-best-practices/error-typology-guidelines)
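To make the human-evaluation step concrete, below is a minimal sketch of how per-engine adequacy and fluency ratings could be aggregated. The 1-4 rating scale follows the linked TAUS adequacy/fluency guidelines; the engine names, data layout, and sample ratings are assumptions for illustration, not a prescribed data model.

```python
# Minimal sketch: averaging adequacy/fluency ratings per MT engine.
# Assumes a 1-4 rating scale as in the TAUS adequacy/fluency guidelines;
# engine names and ratings below are invented for illustration.
from collections import defaultdict
from statistics import mean

# One row per evaluator judgment: (engine, segment_id, adequacy, fluency)
ratings = [
    ("engine_a", 1, 4, 3),
    ("engine_a", 2, 3, 4),
    ("engine_b", 1, 2, 3),
    ("engine_b", 2, 3, 2),
]

scores = defaultdict(lambda: {"adequacy": [], "fluency": []})
for engine, _segment, adequacy, fluency in ratings:
    scores[engine]["adequacy"].append(adequacy)
    scores[engine]["fluency"].append(fluency)

for engine, vals in sorted(scores.items()):
    print(f"{engine}: adequacy={mean(vals['adequacy']):.2f} "
          f"fluency={mean(vals['fluency']):.2f}")
```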
Step 3: This step requires resource time and therefore has an associated cost:
Two potential solutions for funding quality Human Evaluations (see the evaluator-consistency sketch after this list):
Human Evaluation costs are paid by participating MT Vendors
Human Evaluation costs are offset by Paid Access to the results of MT Benchmarking for TAUS Members/Non-Members
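Whichever funding model is chosen, a quick agreement check between evaluators is one common way to verify that the human evaluation itself is reliable. The sketch below computes Cohen's kappa for two evaluators who rated the same segments; the two-rater setup, the 1-4 scale, and the sample ratings are assumptions for illustration, not part of the proposal itself.

```python
# Minimal sketch: Cohen's kappa for two evaluators who rated the same
# segments on the same scale (e.g., 1-4 adequacy). Ratings are invented.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

if __name__ == "__main__":
    evaluator_1 = [4, 3, 3, 2, 4, 1, 3, 2]
    evaluator_2 = [4, 3, 2, 2, 4, 2, 3, 2]
    print(f"Cohen's kappa: {cohens_kappa(evaluator_1, evaluator_2):.2f}")
```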
Submission Period:
Much like in an academic competition, I think we should set an open period for submissions; once that period has ended, only the engines submitted during it will be benchmarked and the results published.
Engines submitted later will have to wait for the next open submission period to be benchmarked.