State of the Domain-Adaptive Machine Translation by Intento (November 2018)

In this report, we evaluate six modern domain-adaptive NMT engines on a biomedical dataset (English to German): ModernMT, Globalese, Google AutoML, IBM Custom NMT, Microsoft Custom Translate, and Tilde. We explore how they compare on performance (using reference-based scores, linguistic quality analysis, and automatic quality estimation), total cost of ownership, dataset size requirements, training time, and data protection policy, and how to start using this advanced technology.


  1. 1. STATE OF THE DOMAIN-ADAPTIVE MACHINE TRANSLATION by Intento November 2018
  2. 2. November 2018© Intento, Inc. DISCLAIMER 2 The systems used in this report were trained and evaluated from Oct 15 to Nov 15, 2018. They may have been changed many times since then. — This report demonstrates performance of those systems exclusively on the dataset used for this report (English-German, UFAL corpus (Biomedical). — We run multiple evaluations for our clients in the same domain for other language pairs and observed different rankings of the MT systems. — There’s no “best” MT system. Performance depends on how your data is similar to what they used to train their baseline models and on their algorithms. — Don’t jump to conclusions. Do your homework.
  3. 3. November 2018© Intento, Inc. About At Intento (https://inten.to), we make Cloud Cognitive AI easy to discover, access, and evaluate for a specific use. — We have been evaluating stock Machine Translation models since May 2017 (most recent report: https://bit.ly/mt_jul2018). — In Summer 2018, a number of MT vendors launched a new breed of domain-adaptive Neural Machine Translation engines. This report answers the most important questions about this new technology. — We deliver this overview report for FREE. To evaluate on your own dataset, reach us at hello@inten.to 3
  4. 4. November 2018© Intento, Inc. Intento MT Gateway - that’s how we run such evaluations Vendor-agnostic API Sync and async modes CLI tools and SDKs Works with files of any size 10-20x faster due to multithreading Get your API key at inten.to 4
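To illustrate what a vendor-agnostic gateway call looks like in practice, here is a minimal sketch in Python. The endpoint URL, payload shape and header name are assumptions for illustration, not the documented Intento API surface; check the docs at inten.to before relying on them.

```python
# Minimal sketch of a vendor-agnostic MT gateway call (assumed API shape).
import requests

API_KEY = "your-api-key"  # obtained at inten.to

def translate(text, to_lang, provider=None):
    payload = {"context": {"text": text, "to": to_lang}}
    if provider:
        # Pin a specific MT vendor; otherwise let the gateway route the request.
        payload["service"] = {"provider": provider}
    resp = requests.post(
        "https://api.inten.to/ai/text/translate",  # assumed endpoint
        json=payload,
        headers={"apikey": API_KEY},
    )
    resp.raise_for_status()
    return resp.json()

print(translate("We evaluated six domain-adaptive NMT engines.", "de"))
```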
  5. 5. November 2018© Intento, Inc. OVERVIEW 1 WHAT IS DOMAIN-ADAPTIVE NMT? 2 HOW GOOD IS IT? 3 HOW MUCH DOES IT COST? 4 HOW LONG DOES IT TRAIN? 5 HOW MUCH DATA DO I NEED? 6 HOW SAFE IS MY DATA? 7 HOW TO USE IT? Scope: 1 domain (Biomedical), 1 language pair (en-de), 6 domain-adaptive NMT engines, stock engines (11 NMT, 3 SMT). 5
  6. 6. November 2018© Intento, Inc. 2018 in Machine Translation A New Chapter 6 [Timeline: I: 1949, Memorandum on Translation; II: 1996, affordable stock RBMT; III: 2006, affordable stock SMT; IV: 2016, affordable stock NMT; V: 2018, affordable custom NMT ("You Are Here").]
  7. 7. November 2018© Intento, Inc. Affordable Custom NMT — Domain-Adaptive Models 7 “Custom NMT” (from scratch) vs. Domain-Adaptive NMT:
  • builds upon: open or proprietary NMT frameworks vs. baseline models or datasets as a service
  • training data size: 1M…10M segments vs. 10K…1M segments
  • training process: heavily curated vs. automatic
  • main cost drivers: licenses and human services ($$$$-$$$$$) vs. computing time ($$-$$$)
  8. 8. November 2018© Intento, Inc. 2018 in Machine Translation Rise of Domain-Adaptive NMT* 8 [Timeline, Sep 2017 to Oct 2018: Globalese Custom NMT, Lilt Adaptive NMT, IBM Custom NMT, Systran PNMT, Microsoft Custom Translate, Google AutoML Translation, SDL ETS 8.0, ModernMT Enterprise.] * Neural Machine Translation with automated customisation using domain-specific corpora, also known as domain adaptation.
  9. 9. November 2018© Intento, Inc. Machine Translation Engines Evaluated 9
  Custom NMT: Google Cloud AutoML Translation, IBM Cloud Language Translator v3, Microsoft Custom Translate v3, ModernMT Enterprise API, Tilde Custom MT, Globalese Custom NMT.
  Stock NMT (baseline): Amazon Translate, Baidu Translate API, DeepL API, Google Cloud Translation API, IBM Cloud Language Translator v3, Microsoft Translator Text API v3, ModernMT Enterprise API, Systran PNMT (stock), Yandex Translate API.
  Stock SMT: PROMT REST API, SDL Language Cloud (SMT), Systran REST (stock).
  10. 10. November 2018© Intento, Inc. Globalese Custom NMT 10 Language pairs: “all of them”. Customization: parallel corpus. Deployment: cloud, on-premise. Cost to train*: -. Cost to maintain*: $58 per month. Cost to translate*: - (limited volume). * base pricing tier
  11. 11. October 2018© Intento, Inc. Google Cloud AutoML Translation β 11 Language pairs: 50. Customization: parallel corpus. Deployment: cloud. Cost to train*: $76 per hour of training. Cost to maintain*: -. Cost to translate*: $80 per 1M symbols. * base pricing tier
  12. 12. October 2018© Intento, Inc. IBM Cloud Language Translator v3 12 Language pairs: 48. Customization: parallel corpus, glossary. Deployment: cloud**. Cost to train*: free. Cost to maintain*: $15 per month. Cost to translate*: $100 per 1M symbols. * advanced pricing tier ** with optional no-trace
  13. 13. October 2018© Intento, Inc. Microsoft Custom Translator β APIv3 13 Language pairs: 74***. Customization: parallel corpus, glossary***. Deployment: cloud**. Cost to train*: $10 per 1M symbols of training data. Cost to maintain*: $10 per month. Cost to translate*: $10 per 1M symbols. * base pricing tier ** no trace *** since October 25, 2018
  14. 14. October 2018© Intento, Inc. ModernMT Enterprise Edition 14 Language pairs: 9. Customization: parallel corpus. Deployment: cloud**, on-premise. Cost to train*: free. Cost to maintain*: free. Cost to translate*: $960 per 1M symbols. * base pricing tier ** claims non-exclusive rights on content
  15. 15. October 2018© Intento, Inc. Tilde Custom Machine Translation 15 Language pairs: ?*. Customization: parallel corpus. Deployment: cloud, on-premise. Cost to train**: N/A. Cost to maintain**: N/A. Cost to translate**: N/A. * language pair support provided on demand ** no public pricing available
  16. 16. November 2018© Intento, Inc. 2 HOW GOOD IS IT? 2.1 Evaluation Methodology 2.2 Reference-Based Scores 2.3 Human Linguistic Quality Analysis 2.4 MT Fusion Analysis 2.5 Automated Quality Estimation 16
  17. 17. November 2018© Intento, Inc. 2.1 Evaluation methodology Pick a biomedical dataset from UFAL, randomly extract 2000 segments as a test set, and use the rest for training (Slide 18). — Compute the hLEPOR score between the reference translations and the MT output of the stock and customised engines on the 2000 test segments. Identify the top-performing engines (Slide 21). — Choose a set of segments that exposes differences between the top engines. Perform human expert LQA of the MT translations for these segments (Slide 24). — Analyse how different MT systems may be combined to achieve better performance than any individual MT engine (Slide 33). — Evaluate low-level quality using the automatic evaluation tool LexiQA (Slide 37). 17
  18. 18. November 2018© Intento, Inc. Dataset English-German, Biomedical UFAL Medical Corpus v. 1.0 18
  • download the en-de dataset (37,814,533 records)
  • keep only “medical_corpus” records
  • remove records with “Subtitles” type (2,958,644 records)
  • remove records without words (2,627,817 records)
  • remove duplicates and shuffle (2,351,277 records)
  • extract 2000 records as a test set*
  • prepare four training sets: 10K, 100K, 500K and 1M records**
  * tail -2000 ** head -X for X = 10,000, 100,000, 500,000 and 1,000,000
  An example of the long, complex sentences in this corpus: Plasmid vector for expression in Caenorhabditis elegans and in Escherichia colia comprising in the 5' to 3' direction of transcription operably linked to each other a heat shock promoter nucleotide sequence, a synthetic intron nucleotide sequence containing a Shine-Dalgarno sequence, optionally a nucleotide sequence coding for a nuclear localisation signal or a secretion signal, a nucleotide sequence coding for a recognisable tag, optionally a nucleotide sequence coding for a fluorescent protein, a nucleotide sequence coding for a protease cleavage site, a multiple cloning site containing a nucleotide sequence coding for an eukaryotic, such as human, protein or a nucleic acid molecule, and a nucleotide sequence coding for termination of translation.
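A minimal sketch of that split in Python, mirroring the tail/head footnotes. The file names and the tab-separated one-pair-per-line layout are illustrative assumptions about the filtered corpus.

```python
# Minimal sketch of the corpus split above, mirroring the tail/head
# footnotes. Assumes one "source<TAB>target" segment per line; file
# names are placeholders for the filtered UFAL en-de corpus.
import random

with open("ufal_medical_en_de.filtered.tsv", encoding="utf-8") as f:
    segments = [line.rstrip("\n") for line in f]

segments = list(dict.fromkeys(segments))   # remove duplicates, keep order
random.seed(42)                            # shuffle reproducibly
random.shuffle(segments)

with open("test_2000.tsv", "w", encoding="utf-8") as out:   # tail -2000
    out.write("\n".join(segments[-2000:]) + "\n")

for size in (10_000, 100_000, 500_000, 1_000_000):          # head -X
    with open(f"train_{size}.tsv", "w", encoding="utf-8") as out:
        out.write("\n".join(segments[:size]) + "\n")
```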
  19. 19. November 2018© Intento, Inc. hLEPOR score LEPOR: an automatic machine translation evaluation metric combining an enhanced Length Penalty, an n-gram Position difference Penalty and Recall. — In our evaluation, we used hLEPORA v.3.1 (the best-performing metric in the ACL-WMT 2013 contest). https://www.slideshare.net/AaronHanLiFeng/lepor-an-augmented-machine-translation-evaluation-metric-thesis-ppt https://github.com/aaronlifenghan/aaron-project-lepor LIKE BLEU, BUT BETTER 19
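For orientation, the overall shape of the metric, following the LEPOR materials linked above. This is a simplified sketch: the component weights and the exact n-gram matching of hLEPORA v.3.1 are omitted.

\[
\mathrm{LEPOR} = LP \cdot \mathrm{NPosPenal} \cdot \mathrm{Harmonic}(P, R),
\qquad
\mathrm{hLEPOR} = \frac{w_{LP} + w_{NPP} + w_{HPR}}{\frac{w_{LP}}{LP} + \frac{w_{NPP}}{\mathrm{NPosPenal}} + \frac{w_{HPR}}{\mathrm{HPR}}}
\]

where \(LP = e^{1 - r/c}\) if \(c < r\), \(e^{1 - c/r}\) if \(c > r\), and \(1\) otherwise (\(c\), \(r\) are candidate and reference lengths); \(\mathrm{NPosPenal} = e^{-\mathrm{NPD}}\) penalises word-order differences; and \(\mathrm{HPR}\) is the weighted harmonic mean of n-gram precision and recall.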
  20. 20. November 2018© Intento, Inc. LEPOR Convergence We used 2000 sentences per language pair. The metric stabilises, and adding more sentences from the same domain won’t change the outcome. [Chart: average hLEPOR score vs. number of sentences, English-German biomedical, showing the mean score and confidence interval.] 20
  21. 21. November 2018© Intento, Inc. 2.2 Reference-Based Scores Average Values 21 [Chart: average hLEPOR for the customised models (performance boost achieved by customisation with up to 1M segments) and the stock (pre-trained) models, where provided.] * ModernMT gets a non-exclusive license on all data processed via their API to improve their models. ** Some sentences were not translated due to their length.
  22. 22. November 2018© Intento, Inc. Reference-Based Scores Score distributions with median values 22 [Chart: hLEPOR score distributions for Baidu, PROMT, Systran SMT, Systran PNMT stock, SDL SMT, Amazon, Tilde 1M, DeepL, Microsoft stock, Microsoft 1M, Yandex, IBM stock, IBM 500K, Google stock, Google 500K, Globalese 1M, ModernMT stock and ModernMT 1M; annotation: may indicate fuzzy matches with the UFAL corpus.]
  23. 23. November 2018© Intento, Inc. Reference-Based Scores Discussion Based on the reference-based scores, the best-performing engine (closest to the human translation, rank 1) is the ModernMT model trained on 1M segments. However, given the fast training time and the cooperative TM used by ModernMT, this may be a result of our test dataset (UFAL) being available as a TM on their side. Also note that ModernMT gets non-exclusive rights to all content submitted to their system, to improve their MT. — The next group (ranks 2-4) is Globalese, Google AutoML and IBM Custom NMT (the latter two trained on 500K segments*); our test dataset shows no significant difference between them. — The next group (ranks 5-8) is Yandex (stock), Microsoft Custom Translate** (trained on 1M segments), DeepL (stock) and Tilde (trained on 1M segments). 23 * For IBM v3 Custom, 1M segments do not fit into the training corpus limits; for Google AutoML, we observed no improvement from 100K to 500K training segments and decided not to proceed with 1M. ** We used the Medicine model as the baseline (there is also a Healthcare model, which we didn’t try).
  24. 24. November 2018© Intento, Inc. 2.3 Human Linguistic Quality Analysis We needed to pick 45 of the 2000 segments for human LQA, and focused on segments that are translated differently by the top engines*. — We calculated the average hLEPOR score of each test segment across all top-performing engines and selected 45 segments with an average hLEPOR close to the median (0.71) and the maximal difference (stddev) across the top-performing MT engines (see the sketch below). — We selected segments of substantial length (>100 characters) to give the experts enough context. — These segments were analysed by human linguists with expertise in the biomedical domain and in both English and German. 24 * Please note this approach does not count mistakes made by the majority of the engines. It may increase the impact of near-exact TM-like matches (ModernMT, Globalese). Average hLEPOR scores for this sample are presented in Appendix A.
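A minimal sketch of that selection, assuming per-segment hLEPOR scores for the top engines are already computed. The data layout and the 0.05 tolerance are illustrative assumptions.

```python
# Minimal sketch of the 45-segment selection: average hLEPOR across the
# top engines near the median, maximal cross-engine stddev, length > 100.
# scores[seg_id][engine] holds precomputed hLEPOR values (assumed layout).
from statistics import mean, median, stdev

def pick_segments(scores, texts, n=45, min_len=100, tolerance=0.05):
    rows = []
    for seg_id, per_engine in scores.items():
        if len(texts[seg_id]) < min_len:   # enough context for the experts
            continue
        vals = list(per_engine.values())
        rows.append((seg_id, mean(vals), stdev(vals)))
    med = median(avg for _, avg, _ in rows)            # ~0.71 in this study
    candidates = [r for r in rows if abs(r[1] - med) <= tolerance]
    candidates.sort(key=lambda r: r[2], reverse=True)  # most disagreement first
    return [seg_id for seg_id, _, _ in candidates[:n]]
```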
  25. 25. November 2018© Intento, Inc. Human Linguistic Quality Analysis Blind within-subjects review An expert receives a source segment and all translations (including the human reference) without labels. The 45 segments are distributed across 5 experts. — Human LQA was conducted by Logrus IT using their issue-type and issue-severity metrics (see Slides 26-27). — Each expert records all issues and their estimated severities for every translated segment. Segments without errors are considered “perfect”. — Typically, for LQA of a single text, a weighted normalized sum of errors is used (errors per word). For the engine-selection problem we aggregated errors at the segment level (see Slide 30 for details). 25
  26. 26. November 2018© Intento, Inc. Human Linguistic Quality Analysis Issue types (© Logrus IT) 26
  • Adequacy: how closely the translation follows the meaning of and the message conveyed by the source text. Problems with adequacy can reflect additions and omissions, mistranslation, partially translated or untranslated pieces, or pieces that should not have been translated at all.
  • Readability: how easy the target text/content is to read/consume and understand.
  • Language: all deviations from formal language rules and requirements, including grammar, spelling and typography issues.
  • Terminology: a term (domain-specific word) is translated with a term other than the one expected for the domain or otherwise specified in a terminology glossary or client requirements; alternatively, terminology is correct but inconsistent throughout the content.
  • Locale: whether the text is given the proper mechanical form for the locale, not whether the content applies to the locale.
  • Style: compliance with existing formal style requirements as well as language, cultural and regional specifics.
  © Logrus IT, 2018
  27. 27. November 2018© Intento, Inc. Human Linguistic Quality Analysis Issue severity (© Logrus IT) 27
  • Critical: showstopper-level errors with the biggest, sometimes dramatic impact on the reader’s perception. Showstoppers can result in dire consequences for the publisher, including life or health risks, equipment damage, violation of local or international laws, unintentionally offending large groups of people, and risks of misinformation and/or incorrect or dangerous user behavior. Typically the content should not be published without fixing all showstopper-level problems first.
  • Major: serious issues with a noticeable effect on the overall text perception. Typical examples include locale errors (like incorrect date, numeric or currency format) as well as problems with translation adequacy or readability for a particular sentence or string.
  • Medium: noticeable but moderate effect on the overall text perception. Typical examples include incorrect capitalization, wrong markup and regular spelling errors. While somewhat annoying, medium-severity issues do not result in misinformation and do not affect the reader’s perception seriously.
  • Minor: minimal effect on the overall text perception. Typical examples include double spaces or incorrect typography (as far as it is not misleading and does not change the meaning). Formally a double space in place of a single one or a redundant comma is an error, but its effect on the reader’s perception is minimal.
  • Preferential: use this severity for recommendations and preferential issues that should not affect the rating.
  © Logrus IT, 2018
  28. 28. November 2018© Intento, Inc. Linguistic Quality Analysis Issue Type Results 28 [Chart: issue counts by type for each MT engine and the Human Reference.]
  29. 29. November 2018© Intento, Inc. Linguistic Quality Analysis Issue Severity Results 29 [Chart: issue counts by severity for each MT engine and the Human Reference.]
  30. 30. November 2018© Intento, Inc. Linguistic Quality Analysis Dealing with reviewer disagreement 30 For LQA, accurate counting of errors may not work, because human reviewers disagree in several situations: • whether to stop counting errors after discovering a critical one, or to count all errors in a segment (including multiple critical ones); • major vs. critical severity (e.g. when a dependent clause is missing); • whether several consecutive errors count as one or as many. — Weighted-average ranking is not stable in the presence of such disagreement. — Mitigating the disagreement: • count critical errors as major (in both cases post-editing is likely to take as much time as translating from scratch); • count not individual errors but the number of segments at each severity level (including zero errors). — NMT customization should have the most impact on terminology. 84% of terminology errors have “Medium” severity, hence we rank engines by the number of segments with severity below “Medium” (next slide; see the sketch below).
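A minimal sketch of this segment-level aggregation, assuming LQA results are available as lists of issue-severity labels per segment; the label names and data layout are illustrative.

```python
# Minimal sketch of segment-level aggregation: fold critical into major,
# then count segments whose worst issue is below "Medium".
# lqa[engine][segment_id] = list of issue severity labels (assumed layout).
SEVERITY_RANK = {"preferential": 0, "minor": 1, "medium": 2,
                 "major": 3, "critical": 3}  # critical counted as major

def rank_engines(lqa):
    counts = {}
    for engine, segments in lqa.items():
        good = sum(
            1 for issues in segments.values()
            if max((SEVERITY_RANK[s] for s in issues), default=0)
               < SEVERITY_RANK["medium"]
        )
        counts[engine] = good  # segments below Medium severity (incl. perfect)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```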
  31. 31. November 2018© Intento, Inc. Linguistic Quality Analysis MT model ranking 31 [Chart: per engine and the Human Reference, the number of perfect segments and of segments with minor, medium and major severity. * 4 segments were not translated due to their length.]
  32. 32. November 2018© Intento, Inc. Linguistic Quality Analysis Discussion ModernMT is in first place, possibly due to the public corpus we used and fuzzy matching (see the hLEPOR score distributions and the discussion). Google AutoML and IBM Custom are very close. — Globalese scored significantly lower in the LQA due to omissions of large dependent clauses, which do not contribute much to hLEPOR (long sentences!) but are considered major to critical by human experts. — Microsoft Custom scored significantly lower in the LQA due to wrong word order in complex segment structures and a high level of terminology and capitalisation issues, which likewise do not contribute much to hLEPOR but are considered major to critical by human experts. — The human reference is far from first place, as the corpus quality is indeed far from ideal. We are actually happy about that, as it is exactly what we see in real-life evaluations. Corpus cleaning is hard; automatic cleaning before training is one of the decisive factors. — DeepL is one of the best by the number of “perfect” translations for segments translated with errors by other engines. However, in longer segments it also makes critical omissions. In such cases, combining several engines may reduce the number of segments with medium and major issues. Let’s explore that further! 32
  33. 33. November 2018© Intento, Inc. 2.4 MT Fusion Analysis Segment severity by MT model 33 [Chart: severity per segment id (perfect, minor, medium, major) for each MT model and the Human Reference. * 4 segments were not translated due to their length.]
  34. 34. November 2018© Intento, Inc. MT Fusion Analysis MT model ranking as a choice process 34 Let’s simulate a scenario where a translator selects one of the MT results to correct. — We assume they will select the translation with less severe errors and, if the maximal severity is the same, the one with fewer errors (see the sketch below). — Which two MT models are the right ones to combine, and how far does this reduce the number of segments with medium and major issues? — We decided not to consider ModernMT engines in these combinations, due to the TM-matching concerns mentioned above. — Also, as custom NMT is quite expensive, we decided to focus on pairs of one custom and one stock MT model.
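A minimal sketch of the simulated choice process, assuming numeric severity grades per issue (higher is worse); the data layout is an illustrative assumption.

```python
# Minimal sketch of the simulated choice between two engines.
# lqa[engine][segment_id] is a list of numeric issue severities
# (0 = none ... 3 = major); the layout is an illustrative assumption.
def fuse(lqa, engine_a, engine_b):
    def badness(issues):
        # worse maximal severity first, then the number of issues
        return (max(issues, default=0), len(issues))

    chosen = {}
    for seg in lqa[engine_a]:
        a, b = lqa[engine_a][seg], lqa[engine_b][seg]
        chosen[seg] = engine_a if badness(a) <= badness(b) else engine_b
    return chosen  # which engine the post-editor would pick per segment
```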
  35. 35. November 2018© Intento, Inc. MT Fusion Analysis Google AutoML + DeepL* 35 [Chart: perfect/minor/medium/major segment counts for the fused output and the Human Reference.] * The same number of medium and major errors is also provided by “IBM Custom + Google stock”, “Google AutoML + IBM stock” and “Globalese + IBM stock”, but those pairs produce significantly more segments with major severity.
  36. 36. November 2018© Intento, Inc. MT Fusion Analysis Discussion Adding the second MT result from the stock DeepL model to Google AutoML would increase the MT costs by ~30% (which is ~0.2% of the human translation price), while increasing the number of perfect segments by 34% and reducing the number of segments with major issues by 50%. — On the full test corpus, the hLEPOR score for the fusion of the same model pair is 0.71, well above the upper bound (0.70) of the confidence interval of the hLEPOR score for Google AutoML alone (0.69). — This performance boost may be explained by the fact that while custom NMT engines make fewer terminology errors, even domain-specific texts contain some general-domain sentences. DeepL was ranked the best for English to German in our study of stock MT engines. Combining them provides the best of both worlds (for this language pair and dataset). 36
  37. 37. November 2018© Intento, Inc. 2.5 Automated Quality Estimation with LexiQA 37 Low-level quality issues (punctuation, number formatting, etc.) are not captured by reference-based metrics and may fly under the radar in human LQA in the presence of more important mistakes. — Also, many of these errors can be corrected by textual pre-processing and are therefore not decisive for MT engine selection. However, we got curious. — We used the automatic quality evaluation tool LexiQA with its default rule set to check what’s there (for all 2000 test segments).
  38. 38. November 2018© Intento, Inc. Automated Quality Estimation Error types 38 Inconsistencies: a serious inconsistency between source and target, including a missing translation. — Numbers: the set of numbers in the translation differs from the source. — Punctuation: the punctuation is different or wrong for the target locale. — Spaces: something is wrong with the number of spaces. — SpecialCharDetect: the set of special characters in the translation differs from the source. — Spelling: spelling errors detected. — URLs: a URL is changed, has disappeared, or has emerged from nowhere.
  39. 39. November 2018© Intento, Inc. Automated Quality Estimation Total number of errors per MT system 39 [Chart: total errors per MT system, with the human reference level marked.]
  40. 40. November 2018© Intento, Inc. Automated Quality Estimation Conclusions 40 LexiQA counts errors in both source and target; apparently, some MT engines are good at correcting errors in the source. — The source text and the human translation also contained numerous low-level errors; some of them are false positives. — The higher level of inconsistencies for ModernMT, Tilde and Amazon is due to untranslated segments (length exceeded the MT engine limits). — Many “Numbers” errors are due to different or inconsistent use of comma and period signs, but there are also dropped numbers and digits. The good news is that this is easy to check and fix automatically (see the sketch below). — Some engines add bizarre URLs where there were none, probably the ones trained on crawled data. Easy to fix.
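As an illustration of such an automatic check, a minimal sketch that compares the numbers in source and target while unifying decimal commas and grouping spaces. The regex and the normalisation are simplifying assumptions, not LexiQA's rules.

```python
# Minimal sketch of an automatic "Numbers" check: compare the multiset of
# numbers in source and target, unifying decimal commas and grouping spaces.
import re
from collections import Counter

def numbers_mismatch(source, target):
    def nums(text):
        found = re.findall(r"\d[\d., ]*\d|\d", text)
        return Counter(n.replace(" ", "").replace(",", ".") for n in found)
    return nums(source) != nums(target)

assert not numbers_mismatch("Dose: 2.5 mg", "Dosis: 2,5 mg")  # same number
assert numbers_mismatch("Dose: 2.5 mg", "Dosis: 25 mg")       # dropped point
```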
  41. 41. November 2018© Intento, Inc. 3 HOW MUCH DOES IT COST? 3.1 Price Comparison - Training 3.2 Price Comparison - Maintenance 3.3 Price Comparison - Translation 3.4 Total Cost of Ownership 3.5 Discussion 41
  42. 42. November 2018© Intento, Inc. 3.1 Price Comparison - Training Cost to train1 3rd-party Machine Translation engines, USD 42 [Chart: training cost per engine and training set size.] 1 One engine re-training per month, different training set sizes, based on the public prices provided on vendor websites. 2 Segment length in symbols based on our benchmark: 508 symbols per segment (long segments). 3 The price will become effective once Microsoft Custom Translate is launched in production (as of October 2018 it is in preview and free). 4 Based on the actual training time we observed and a list price of $76 per hour of training.
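To make the footnotes concrete, a worked example under those assumptions: at 508 symbols per segment, a 1M-segment training corpus is about 508M symbols, so one re-training of Microsoft Custom Translate at $10 per 1M symbols of training data would cost roughly 508 × $10 ≈ $5,080.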
  43. 43. November 2018© Intento, Inc. 3.2 Price Comparison - Maintenance Cost to maintain1 3rd-party Machine Translation engines, USD per month 43 [Chart: monthly maintenance cost per engine.] 1 Based on the public prices provided on vendor websites. 2 The price will become effective once Microsoft Custom Translate is launched in production (as of October 2018 it is in preview and free). 3 Depends on the subscription tier.
  44. 44. November 2018© Intento, Inc. 3.3 Price Comparison - Translation Cost to translate1 with 3rd-party custom NMT engines 44 [Chart: translation cost per engine.] 1 Based on the public prices provided on vendor websites. 2 Subject to low-volume limits. 3 Based on 4.79 symbols per word.
  45. 45. November 2018© Intento, Inc. 3.4 Total Cost of Ownership Several possible scenarios1 45 [Chart: TCO under several usage scenarios.] 1 All estimates are based on public pricing, may be inaccurate, and do not account for the cost of the human labor needed to implement the solutions.
  46. 46. November 2018© Intento, Inc. 3.5 Discussion All vendors have different pricing models, tailored to different use cases. — It’s possible to keep the evaluation cost reasonable. — When evaluating domain-adaptive MT, TCO analysis is a must. 46
  47. 47. November 2018© Intento, Inc. 4 HOW LONG DOES IT TRAIN? 47 Time to train custom NMT engines, hours
  48. 48. November 2018© Intento, Inc. 5 HOW MUCH DATA DO I NEED? 5.1 Experimental Setting 5.2 Learning curves by MT engine 5.3 Discussion 48
  49. 49. November 2018© Intento, Inc. 5.1 Experimental Setting One test set and four training sets, prepared as described on Slide 17 (10K, 100K, 500K and 1M parallel segments). — Evaluate how the average hLEPOR score changes as the training set grows. 49
  50. 50. November 2018© Intento, Inc. 5.2 Learning Curves by MT engine 50 [Chart: average hLEPOR score vs. number of training samples, per engine.] 1 Does not provide stock models. 2 Did not train on 1M segments, as we observed a plateau. 3 Did not train on 1M segments due to technical limitations of the system. 4 Did not translate 44 sentences due to their length.
  51. 51. November 2018© Intento, Inc. 5.3 Discussion The learning curves are steep for most of the engines; performance grows more slowly after the 100K dataset size. Adjusting the baseline engine with as little as 10K parallel segments delivers most of the performance boost. — ModernMT and Globalese demonstrate rapid performance growth with the number of training samples, without a plateau within the 1M sample range. This may indicate that the data we used for training is in their baseline dataset, with either over-fitting or a smart TM-matching phase in the translation process. Check this on your own data before jumping to conclusions. — Engines from Google and IBM start high but improve little with training. This may mean they have already seen similar data during pre-training, and domain adaptation is more a refocusing than training on new data. — The performance of Microsoft Translator’s stock model on en-de biomedical texts is much lower than on en-de general texts (0.56 vs. 0.62 in our stock MT evaluation), but it learns pretty fast. This may mean the stock model hasn’t seen similar data and actual training happens. — Our observations suggest that domain-adaptive NMT engines differ a lot in the data they used for training, in how they treat this data in the translation process, and in how they train over the baseline model. We expect the performance landscape to be significantly different for other language pairs and domains. Always do your homework and check it for your projects. 51
  52. 52. November 2018© Intento, Inc. 6 HOW SAFE IS MY DATA? 52 Data protected by ToS: Google (link), IBM (link), Microsoft (link), Globalese (link), Tilde (link) — ModernMT claims non-exclusive rights on the content sent to the API to improve their engine (link). — Some of the NMT engines support on-premise deployment: Globalese, ModernMT, Tilde
  53. 53. November 2018© Intento, Inc. 7 HOW TO USE IT? 53 7.1 Preparing the Training Data 7.2 Training the Model 7.3 Using the Model
  54. 54. November 2018© Intento, Inc. 7.1 Preparing the Training Data — things to watch for 54 All domain-adaptive NMT engines accept the TMX format, which is easy to export from any CAT/TMS system. — Segment alignment (hint: some MT platforms provide auto-alignment, e.g. Microsoft Custom Translate). — Duplicates: while not entirely useless, having duplicates across training and test datasets spoils the model. — Noise: sentence and word fragments, segments without words, et cetera. — Translation errors: obvious translation errors (use the word ratio between source and target and other heuristics; see the sketch below).
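A minimal sketch of the last two checks, with illustrative thresholds; a real pipeline would add language-specific heuristics.

```python
# Minimal sketch of the noise and translation-error checks above.
# The 3.0 ratio threshold is illustrative, not a recommended value.
import re

def keep_pair(src, tgt, max_ratio=3.0):
    src_words = re.findall(r"\w+", src)
    tgt_words = re.findall(r"\w+", tgt)
    if not src_words or not tgt_words:   # noise: segment without words
        return False
    ratio = len(src_words) / len(tgt_words)
    # a suspicious length ratio often signals misalignment or mistranslation
    return 1 / max_ratio <= ratio <= max_ratio

pairs = [("Two tablets daily.", "Zweimal täglich eine Tablette."),
         ("!!!", "???")]
clean = [p for p in pairs if keep_pair(*p)]   # keeps only the first pair
```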
  55. 55. November 2018© Intento, Inc. 7.2 Training the Model 55 All NMT engines support training via API; all but ModernMT and IBM also support it via a web interface. — The web interface may have a limit on the number of training sentences, while the API does not. — When there’s a limit on the TMX size, compressing the TMX file helps. — Only the IBM and Microsoft solutions support a user-defined glossary; the others use only a parallel corpus / TM. — Many NMT engines produce sporadic errors while training the model. If a training run fails, keep trying (see the retry sketch below).
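A minimal sketch of that retry advice; `start_training` is a placeholder for whatever vendor-specific call submits the training job.

```python
# Minimal sketch of "if failed, keep trying": retry a training submission
# with exponential backoff.
import time

def train_with_retries(start_training, max_attempts=5, base_delay=60):
    for attempt in range(1, max_attempts + 1):
        try:
            return start_training()
        except Exception as exc:          # vendor APIs fail sporadically
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)
```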
  56. 56. November 2018© Intento, Inc. 7.3 Using the Model 56 All NMT engines provide different APIs to access the trained model, with different limits and quotas. We built vendor-agnostic middleware which unifies them. — The Google AutoML API has a somewhat complicated authorisation procedure; we simplify that as well. — Globalese and Tilde provide web applications for translating files; we have a universal web application to translate files with any NMT engine. — Check which connectors are available for your CAT tool. We have universal plugins for MemoQ, with Matecat and SDL Trados coming soon.
  57. 57. November 2018© Intento, Inc. Intento Professional Services MT Evaluation and Integration Training and statistically significant evaluation of NMT engines (see Slide 9), which may bring the most cost and time reduction at the post-editing stage. — Identifying a subset of MT results for fast and affordable manual inspection (see Slide 24). — LQA and HTER are also available via our LSP partners. — MT Integration: SDK and connectors to open platforms and in-house software. — Reach us at hello@inten.to 57
  58. 58. November 2018© Intento, Inc. Intento Web-Tools Human-friendly UI working directly with the Intento API — A quick way to try every MT engine and translate large files without API integration. — Available in preview at no added cost to the Intento API until January 2019. 58 SIGN UP at https://console.inten.to
  59. 59. November 2018© Intento, Inc. Intento MT Gateway - that’s how we run such evaluations Vendor-agnostic API Sync and async modes CLI tools and SDKs Works with files of any size 10-20x faster due to multithreading Get your API key at inten.to 59 MAY BE DEPLOYED AT PRIVATE CLOUD
  60. 60. by Intento (https://inten.to) November 2018 Konstantin Savenkov ks@inten.to 2150 Shattuck Ave Berkeley CA 94705 60 STATE OF THE DOMAIN-ADAPTIVE MACHINE TRANSLATION
  61. 61. November 2018© Intento, Inc. A Average Reference-Based Scores For the LQA subsample (45 segments) 61 [Chart: average hLEPOR for the customised models (performance boost achieved by customisation with up to 1M segments) and the stock (pre-trained) models, where provided. * 4 segments were not translated due to their length.]
