
Evaluating Domain-Adaptive Neural Machine Translation (IMUG March 2019)

Choosing the right Machine Translation engine for a project is hard: the performance of every engine varies across domains and language pairs, and changes often. With domain-adaptive NMT it's even harder, as engines also differ in training corpus requirements and come with complex cost-of-ownership models.

In this talk, we describe an approach to address these issues. Large-scale reference-based scoring filters candidate MT engines and identifies a "decisive" set of segments for human LQA. A specific LQA procedure is tuned to choosing the best engine rather than merely accepting one. We also look at the learning gradient to estimate whether the same engine will remain suitable as more data becomes available.

  1. 1. EVALUATING DOMAIN-ADAPTIVE NEURAL MACHINE TRANSLATION Konstantin Savenkov, CEO, Intento, Inc. © Intento, Inc. IMUG @ Adobe, March 2019, San Jose, CA
  2. 2. Intento AGENDA 1 INTRO & DISCLAIMERS 2 DOMAIN-ADAPTIVE NMT 3 AVAILABLE SOLUTIONS 4 EVALUATION PROCESS 2© Intento, Inc. / March 2019
  3. 3. Intento 3 INTENTO Discover, evaluate and use best-of-breed AI models © Intento, Inc. / March 2019
  4. 4. Intento 1 HOW WE STARTED TO EVALUATE MT Universal API Middleware — Noticed a 300x price difference for Machine Translation — Are price & language support the only differences? — Evaluated stock MT engines: performance differs as well! — Also, it changes fast! 4 © Intento, Inc. / March 2019
  5. 5. Intento MT QUALITY LANDSCAPE IS WILD And changes fast! 5© Intento, Inc. / March 2019
  6. 6. Intento 2 DOMAIN-ADAPTIVE NMT 6 Timeline: 1949 (I) Memorandum on Translation; 1996 (II) Affordable stock RBMT; 2006 (III) Affordable stock SMT; 2016 (IV) Affordable stock NMT; (V) Affordable custom NMT ("You Are Here") © Intento, Inc. / March 2019
  7. 7. Intento Affordable Custom NMT — Domain-Adaptive Models 7 — builds upon: "Custom NMT" (from scratch) uses open or proprietary NMT frameworks vs. Domain-Adaptive NMT uses baseline models or datasets as a service — training data size: 1M…10M segments vs. 1K…1M segments — training process: heavily curated vs. automatic — main cost drivers: licenses, human services ($$$$-$$$$$) vs. computing time ($$-$$$) — "Custom NMT" from scratch is super expensive just to try © Intento, Inc. / March 2019
  8. 8. Intento 2018 in Machine Translation: Rise of Domain-Adaptive NMT* 8 Launches between Sep 2017 and Oct 2018: Globalese Custom NMT, Lilt Adaptive NMT, IBM Custom NMT, Microsoft Custom Translator, Google AutoML Translation, SDL ETS 8.0, ModernMT Enterprise, Systran PNMT * Neural Machine Translation with automated customisation using domain-specific corpora, also known as domain adaptation.
  9. 9. Intento WHY EVALUATE? Different learning curves 9 Starting dataset sizes vary — Learning curves vary — Depends on language pair and domain © Intento, Inc. / March 2019
  10. 10. Intento WHY EVALUATE? The right choice drives ROI (English-to-German, Life Sciences) 10© Intento, Inc. / March 2019
  11. 11. Intento 4 EVALUATION PROCESS 4.1 Identify the goals (PE vs Raw vs IR, etc.) — 4.2 Define the set of projects — 4.3 Select candidate engines (stock and adaptive) — 4.4 Large-scale automatic scoring — 4.5 Human evaluation — 4.6 Identify complementary engines — 4.7 Set up the workflow, train people, track performance (PE or conversions) 11 © Intento, Inc. / March 2019
  12. 12. Intento 4.1 IDENTIFY THE GOALS Use case: PEMT, Raw MT or Information Retrieval — Mission critical? (inbound vs outbound) — How to calculate ROI? — Expected gains? — BATNA? 12© Intento, Inc. / March 2019
  13. 13. Intento 4.2 IDENTIFY DISTINCT PROJECTS One TM = One Project — The fewer the better (an error matrix will help) — Track productivity to identify new projects 13 © Intento, Inc. / March 2019
  14. 14. Intento 4.3 PROJECT ATTRIBUTES Language pair — Domain — Availability of reference data — Availability and volume of training data — Availability of glossary 14© Intento, Inc. / March 2019 Data Protection requirements — Data Locality requirements — Contracting requirements — Deployment requirements — Usage scenario (monthly and peak volume, frequency of updates)
  15. 15. Intento 4.4 SELECTING CANDIDATE ENGINES Language support — Domain- and dialect-specific baseline models — Minimal training data requirements — Glossary support 15 © Intento, Inc. / March 2019 Data Protection — Deployment options (regions) — Contracting jurisdictions, payment options — Deployment options (cloud, on-premise, etc.) — TCO model
  16. 16. Intento 3 AVAILABLE MT SOLUTIONS 16© Intento, Inc. / March 2019
  17. 17. November 2018 © Intento, Inc. Globalese Custom NMT — Language Pairs: "all of them" — Customization: parallel corpus — Deployment: cloud, on-premise — Cost to train*: - — Cost to maintain*: $58 per month — Cost to translate*: - (limited volume) 17 * base pricing tier
  18. 18. October 2018 © Intento, Inc. Google Cloud AutoML Translation β — Language Pairs: 50 — Customization: parallel corpus — Deployment: cloud — Cost to train*: $76 per hour of training — Cost to maintain*: - — Cost to translate*: $80 per 1M symbols 18 * base pricing tier
  19. 19. October 2018 © Intento, Inc. IBM Cloud Language Translator v3 — Language Pairs: 48 — Customization: parallel corpus, glossary — Deployment: cloud** — Cost to train*: free — Cost to maintain*: $15 per month — Cost to translate*: $100 per 1M symbols 19 * advanced pricing tier ** with optional no-trace
  20. 20. October 2018 © Intento, Inc. Microsoft Custom Translator β (API v3) — Language Pairs: 74*** — Customization: parallel corpus, glossary*** — Deployment: cloud** — Cost to train*: $10 per 1M symbols of training data (capped at $300) — Cost to maintain*: $10 per month — Cost to translate*: $40 per 1M symbols * base pricing tier ** no trace *** since October 25, 2018 20
  21. 21. October 2018 © Intento, Inc. ModernMT Enterprise Edition — Language Pairs: 45 — Customization: parallel corpus — Deployment: cloud**, on-premise — Cost to train*: free — Cost to maintain*: free — Cost to translate*: 4 EUR per 1000 words (~$960 per 1M symbols) * base pricing tier ** claims non-exclusive rights on content 21
  22. 22. October 2018 © Intento, Inc. Tilde Custom Machine Translation — Language Pairs: ?* — Customization: parallel corpus — Deployment: cloud, on-premise — Cost to train**: N/A — Cost to maintain**: N/A — Cost to translate**: N/A * language pair support provided on demand ** no public pricing available 22
  23. 23. Intento OTHER DIMENSIONS Pricing - Stock Engines 23 (chart: stock MT engine pricing, USD per 1M symbols) © Intento, Inc. / March 2019
  24. 24. Intento OTHER DIMENSIONS Pricing - Total Cost of Ownership 24© Intento, Inc. / March 2019
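To make the total-cost-of-ownership comparison concrete, here is a minimal sketch (not Intento's actual model) that combines the three cost components from the vendor slides above: one-off training, monthly hosting, and per-symbol translation. The monthly volume and planning horizon are hypothetical inputs.

```python
# Minimal TCO sketch (illustrative only): one-off training cost, monthly
# hosting fee, and per-symbol translation cost over a planning horizon.
def tco(train_cost, monthly_fee, price_per_1m_symbols, monthly_symbols, months):
    translation = price_per_1m_symbols * monthly_symbols / 1_000_000 * months
    return train_cost + monthly_fee * months + translation

# Hypothetical volume: 5M symbols per month over 12 months, base pricing tiers
# as listed on the slides above (Microsoft training cost capped at $300).
volume, horizon = 5_000_000, 12
print("Microsoft Custom Translator:", tco(300, 10, 40, volume, horizon))
print("IBM Language Translator v3: ", tco(0, 15, 100, volume, horizon))
```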
  25. 25. Intento OTHER DIMENSIONS Different learning curves 25 Some engines start low, but improve fast — Others start high and improve slowly — Also, some show unusual behavior — Depends on language pair and domain — Worth exploring to plan the engine update & re-evaluation strategy © Intento, Inc. / March 2019
  26. 26. Intento OTHER DIMENSIONS Language Support 26 All stock engines combined support 14290 language pairs out of 29070 possible (45%) * https://w3techs.com/technologies/overview/content_language/all © Intento, Inc. / March 2019
  27. 27. Intento Language Support by MT engines 27 (bar chart, log scale, total vs. unique language pairs per engine: Google, Yandex, Microsoft NMT, Microsoft SMT, Baidu, Tencent, Systran, Systran PNMT, PROMT, SDL Language Cloud, Youdao, SAP, ModernMT, DeepL, IBM NMT, Amazon, IBM SMT, Alibaba, GTCom) https://bit.ly/mt_jul2018 © Intento, Inc. / March 2019
  28. 28. Intento MT QUALITY EVALUATION Automatic Scores vs. Expert Judgement 28© Intento, Inc. / March 2019
  29. 29. Intento MT QUALITY EVALUATION Getting the best of both worlds 29 1. Run automatic scoring at scale 2. Filter out outlier MT engines 3. Extract meaningful segments for LQA 4. Run LQA © Intento, Inc. / March 2019
  30. 30. Intento MT QUALITY EVALUATION Reference-Based Scores 30© Intento, Inc. / March 2019
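As an illustration of the reference-based scoring step, here is a minimal sketch using the sacrebleu library with corpus-level BLEU and chrF; the variable names and data layout are placeholder assumptions, and any reference-based metric can be substituted.

```python
# A sketch of corpus-level reference-based scoring for several candidate
# engines. Assumes each engine's output is aligned line by line with the
# human reference. Requires: pip install sacrebleu
import sacrebleu

def score_engine(hypotheses, references):
    """Return corpus-level BLEU and chrF for one engine."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score
    return bleu, chrf

# engine_outputs = {"engine_a": [...], "engine_b": [...]}  # hypothetical data
# reference = [...]                                        # human translations
# scores = {name: score_engine(hyp, reference) for name, hyp in engine_outputs.items()}
```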
  31. 31. Intento MT QUALITY EVALUATION Filter outlier engines 31© Intento, Inc. / March 2019
  32. 32. Intento MT QUALITY EVALUATION Focus on segments that matter 32© Intento, Inc. / March 2019
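One possible way to pick the segments that matter is sketched below: score each segment per engine and keep the segments where the candidate engines disagree the most, since those are the ones a human reviewer can actually use to tell the engines apart. The sentence-level chrF metric and the spread-based selection rule are illustrative assumptions, not the exact procedure from the talk.

```python
# Sketch: select a "decisive" set of segments for human LQA by ranking
# segments by how much the candidate engines disagree on them.
import sacrebleu

def decisive_segments(engine_outputs, references, top_n=200):
    """engine_outputs: dict of engine name -> list of translations, aligned with references."""
    spreads = []
    for i, ref in enumerate(references):
        seg_scores = [sacrebleu.sentence_chrf(outputs[i], [ref]).score
                      for outputs in engine_outputs.values()]
        spreads.append((max(seg_scores) - min(seg_scores), i))
    # The larger the spread, the more the engine choice matters for this segment.
    return [i for _, i in sorted(spreads, reverse=True)[:top_n]]
```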
  33. 33. Intento (N)MT quirks 33 Source: "It's not a story the Jedi would tell you." MT output: "" (empty) "Star Wars franchise is overrated" A good NMT engine, in the top 5% for a couple of language pairs © Intento, Inc. / March 2019
  34. 34. Intento (N)MT quirks 34 Source: "Unisex Nylon Laptop Backpack School Travel Rucksack Satchel Shoulder Bag" MT output: "рюкзак" (just "a backpack") "Author is an idiot. I will fix it!" Otherwise a good NMT from a famous brand © Intento, Inc. / March 2019
  35. 35. Intento (N)MT quirks 35 Source: "hello, world!" MT output: "" (empty) "Are you kidding me?!" A good new NMT engine, best at some language pairs © Intento, Inc. / March 2019
  36. 36. Intento (N)MT quirks 36 Source: "Revenue X, profit Y." MT output: "Выручка X, прибыль Y, EBITDA Z" "I see you love numbers, here you are!" A good new NMT engine, best at some language pairs © Intento, Inc. / March 2019
  37. 37. Intento (N)MT quirks 37 Source: "3.29346" MT output: "3.29" "Let me round it up for you…" A good new NMT engine, best at some language pairs © Intento, Inc. / March 2019
  38. 38. Intento (N)MT quirks 38 Source: "And here you may buy our new product X" MT output: "And here you may buy our new product X https://zzz.com/SKUXXX" "Let me google it for you…" A good new NMT engine, best at some language pairs © Intento, Inc. / March 2019
  39. 39. Intento (N)MT quirks 39 Source: "Please, one more time" MT output: "Поёалуйста, ещж раз" (note the garbled characters) "Let's play 'find the difference'?" A good popular NMT engine © Intento, Inc. / March 2019
  40. 40. Intento MT QUALITY EVALUATION Expert judgement 40 A. Human LQA – error type and severity labeled by linguists or subject matter experts B. Post-editing effort – zero-edits, keystrokes and time spent © Intento, Inc. / March 2019
  41. 41. November 2018 © Intento, Inc. Human Linguistic Quality Analysis Blind within-subjects review An expert receives a source segment and all translations (including the human reference) without labels. 45 segments are distributed across 5 experts. — Human LQA was conducted by Logrus IT using their issue type and issue severity metrics (see Slides 26-27) — Each expert records all issues and their estimated severity for every translated segment. Segments without errors are considered "perfect". — Typically, for LQA of a single text, the weighted normalized sum of errors is used (errors per word). For the engine selection problem we aggregated errors at the segment level (see slide 30 for details). 41
  42. 42. November 2018 © Intento, Inc. Human Linguistic Quality Analysis Issue types (© Logrus IT) 42 — Adequacy: how closely the translation follows the meaning of and the message conveyed by the source text. Problems with adequacy can reflect additions and omissions, mistranslation, partially translated or untranslated pieces, or pieces that should not have been translated at all. — Readability: how easy it is to read/consume and understand the target text/content. — Language: all deviations from formal language rules and requirements; includes grammar, spelling and typography issues. — Terminology: a term (domain-specific word) is translated with a term other than the one expected for the domain or otherwise specified in a terminology glossary or client requirements; alternatively, terminology can be correct but inconsistent throughout the content. — Locale: refers only to whether the text is given the proper mechanical form for the locale, not whether the content applies to the locale. — Style: compliance with existing formal style requirements as well as language, cultural and regional specifics. © Logrus IT, 2018
  43. 43. November 2018 © Intento, Inc. Human Linguistic Quality Analysis Issue severity (© Logrus IT) 43 — Critical: showstopper-level errors that have the biggest, sometimes dramatic impact on the reader's perception. Showstoppers are errors that can result in dire consequences for the publisher, including causing life or health risks, equipment damage, violating local or international laws, unintentionally offending large groups of people, risks of misinformation and/or incorrect or dangerous user behavior, etc. Typically the content should not be published without fixing all showstopper-level problems first. — Major: the issue is serious and has a noticeable effect on the overall text perception. Typical examples include locale errors (such as incorrect date, numeric or currency format) as well as problems with translation adequacy or readability for a particular sentence or string. — Medium: the issue has a noticeable but moderate effect on the overall text perception. Typical examples include incorrect capitalization, wrong markup and regular spelling errors. While somewhat annoying and more noticeable, medium-severity issues still do not result in misinformation and do not seriously affect the reader's perception. — Minor: the issue has minimal effect on the overall text perception. Typical examples include dual spaces or incorrect typography (as long as it is not misleading and does not change the meaning). Formally a dual space in place of a single one or a redundant comma is an error, but its effect on the reader's perception is minimal. — Preferential: use this severity for recommendations and preferential issues that should not affect the rating. © Logrus IT, 2018
  44. 44. November 2018 © Intento, Inc. Linguistic Quality Analysis Dealing with reviewer disagreement 44 For LQA, accurate counting of errors may not work. Human reviewers disagree in several situations: • Stop counting errors after discovering a critical one vs. count all errors in a segment (including multiple critical ones). • Major vs. critical severity (e.g. if a dependent clause is missing). • Counting several consecutive errors as one vs. many. — Weighted-average ranking is not stable in the presence of such disagreement. — Mitigating the disagreement: • Count critical errors as major (in both cases post-editing is likely to take as much time as translating from scratch). • Count not individual errors, but the number of segments at each severity level (including zero errors). — NMT customization should have the most impact on terminology. 84% of terminology errors have "Medium" severity, hence we rank engines by the number of segments with severity below "Medium" (see the next slide).
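A minimal sketch of the ranking rule described on this slide, assuming LQA results arrive as flat issue records; the record layout is a hypothetical assumption. Critical issues are folded into major, and engines are ranked by the number of segments whose worst issue is below Medium severity.

```python
# Sketch: rank engines by the number of segments with no issue at Medium
# severity or worse, folding Critical into Major as described above.
SEVERITY_RANK = {"Preferential": 0, "Minor": 1, "Medium": 2, "Major": 3, "Critical": 3}

def rank_engines(lqa_issues, segment_ids, engines):
    """lqa_issues: iterable of {"engine", "segment", "severity"} records (hypothetical layout)."""
    worst = {(e, s): 0 for e in engines for s in segment_ids}  # 0 = no issues recorded
    for issue in lqa_issues:
        key = (issue["engine"], issue["segment"])
        worst[key] = max(worst[key], SEVERITY_RANK[issue["severity"]])
    good = {e: sum(1 for s in segment_ids if worst[(e, s)] < SEVERITY_RANK["Medium"])
            for e in engines}
    return sorted(good.items(), key=lambda kv: kv[1], reverse=True)
```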
  45. 45. Intento MT QUALITY EVALUATION Human LQA 45© Intento, Inc. / March 2019
  46. 46. Intento MT QUALITY EVALUATION Do good engines make mistakes in the same sentences? 46© Intento, Inc. / March 2019
  47. 47. Intento MT QUALITY EVALUATION Looking for complementary engines 47© Intento, Inc. / March 2019
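As a sketch of the complementarity check, the snippet below compares the sets of segments on which each pair of good engines produced noticeable errors; the lower the overlap, the more the engines complement each other, for example as a fallback or for per-segment routing. The input format is a hypothetical assumption.

```python
# Sketch: do good engines fail on the same segments? Compare error-segment
# sets pairwise; low overlap means the engines are complementary.
from itertools import combinations

def complementarity(bad_segments, total_segments):
    """bad_segments: dict of engine name -> set of segment ids with Medium+ issues."""
    for a, b in combinations(bad_segments, 2):
        both_bad = bad_segments[a] & bad_segments[b]
        union = bad_segments[a] | bad_segments[b]
        jaccard = len(both_bad) / len(union) if union else 0.0
        at_least_one_ok = 1 - len(both_bad) / total_segments
        print(f"{a} + {b}: error overlap {jaccard:.0%}, "
              f"at least one clean translation on {at_least_one_ok:.0%} of segments")
```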
  48. 48. November 2018 © Intento, Inc. Post-Editing Effort "Hard evidence" The post-editor accepts or corrects MT output. — Things to watch: zero-edits, keystrokes, time to edit. — Highly dependent on tools, workflow (simple or complex task) and PE training (up to 4x difference across reviewers) — Requires PE tracking, e.g. PET, but it is better to use production tools — Results may be surprising (e.g. one engine produces 2x more perfect segments, while another is 4x easier to edit). 48
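For the post-editing metrics mentioned above, a rough sketch is shown below. It assumes only the raw MT output and the final post-edited text are available (keystrokes and editing time require a PE tool such as PET or the production CAT tool), and uses Python's standard difflib as a simple stand-in for an edit-distance measure.

```python
# Sketch: zero-edit rate and average character-level similarity between the
# raw MT output and the post-edited result for one engine.
from difflib import SequenceMatcher

def pe_effort(mt_outputs, post_edited):
    pairs = list(zip(mt_outputs, post_edited))
    zero_edit_rate = sum(1 for mt, pe in pairs if mt == pe) / len(pairs)
    avg_similarity = sum(SequenceMatcher(None, mt, pe).ratio() for mt, pe in pairs) / len(pairs)
    return zero_edit_rate, avg_similarity
```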
  49. 49. Intento EVALUATION IN USE SMALL PEMT PROJECTS 1. Get a list of stock MT engines for your language pair — 2. Select 4-5 candidate MT engines — 3a. Enable them in your CAT tool — 3b. Translate everything with the 4-5 candidate engines and upload the output as a TM in your CAT tool — 4. Choose per segment as you translate 49 © Intento, Inc. / March 2019
  50. 50. Intento EVALUATION IN USE MEDIUM / ONGOING / MT-FIRST 1. Prepare a reference translation (1,500-2,000 segments) — 2. Get a list of stock MT engines for your language pair — 3. Translate the sample with appropriate engines — 4. Calculate a reference-based score for the MT results — 5. Evaluate top-performing engines manually — 6. Translate everything with the winning engine 50 © Intento, Inc. / March 2019
  51. 51. Intento EVALUATION IN USE LARGE PROJECTS / MT ONLY 1. Evaluate stock engines and get a baseline quality score — 2. Prepare a term base and a domain adaptation corpus (from 10K segments) — 3. Train appropriate custom NMT engines — 4. Evaluate custom NMT to see if it works better than stock MT — 5. Update and re-evaluate the winning model as you collect more post-edited content 51© Intento, Inc. / March 2019
  52. 52. THANK YOU! Konstantin Savenkov ks@inten.to (415) 429-0021 2150 Shattuck Ave Berkeley CA 94705 52
