
Pangeanic Cor-ActivaTM-Neural machine translation Taus Tokyo 2017


Presentation of Pangeanic language technologies as a result of EU and national R&D: Cor for web crawling and website translation, linked to Elastic Search-based ActivaTM and NeuralMT



  1. The Web, The Database and The Neural. Manuel Herranz, CEO, Pangeanic. TAUS Tokyo, April 2017. What changes in EN-JP?
  2. The Aim. After building thousands of MT systems for different purposes and clients, we identified shortcomings in several areas: existing tools were "locked", lacked innovation, or were too inflexible. We needed systems that talked to each other, yet remained independent. This is the result of an EU research project (ActivaTM) and a national project in Spain (Cor).
  3. The Web: Cor. Eases estimation in any translation format (document or web). National research project with EU funding. Full platform, used by Pangeanic, LSPs and third parties. CMS-agnostic: extracts text and converts it to XLIFF (document or web).
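The CMS-agnostic extraction step described above (text in, XLIFF out) can be sketched in a few lines. This is a minimal illustration, not Cor's actual pipeline; the `to_xliff` helper and its plain segment-list input are assumptions:

```python
import xml.etree.ElementTree as ET

def to_xliff(segments, src_lang="en", tgt_lang="ja"):
    """Wrap extracted text segments in a minimal XLIFF 1.2 document."""
    xliff = ET.Element("xliff", version="1.2")
    file_el = ET.SubElement(xliff, "file", {
        "source-language": src_lang,
        "target-language": tgt_lang,
        "datatype": "plaintext",
        "original": "web",
    })
    body = ET.SubElement(file_el, "body")
    for i, seg in enumerate(segments, 1):
        tu = ET.SubElement(body, "trans-unit", id=str(i))
        ET.SubElement(tu, "source").text = seg
        ET.SubElement(tu, "target")  # empty target, to be filled by TM/MT
    return ET.tostring(xliff, encoding="unicode")
```

A real converter would also escape inline markup into XLIFF tags (`<bpt>`/`<ept>`); this sketch only shows the document shape.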
  4. The Web: Cor. Translates selected sections of a website only (in batches). Detects new content, or content that has been removed, to keep language versions up to date.
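Detecting new and removed content between two crawls reduces to a set difference over the extracted segments. A minimal sketch under that assumption (function name hypothetical):

```python
def diff_content(old_segments, new_segments):
    """Return (added, removed) segments between two crawls of a site."""
    old, new = set(old_segments), set(new_segments)
    added = sorted(new - old)      # needs translation into each language version
    removed = sorted(old - new)    # should be retired from language versions
    return added, removed
```

In practice segments would be keyed by page URL and position so that moved content is not mistaken for an add/remove pair.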
  5. The Web. Eases estimation in any translation format (document or web). Documents, too.
  6. The Database: ActivaTM. Elasticsearch-based. All language assets in one database, irrespective of the tool that created them. Deep learning for tag handling. CAT-tool agnostic (solves interoperability issues). Automatic fuzzy-match repair. More powerful (stricter) fuzzy matching than traditional CAT tools. Subsegment splitting.
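The fuzzy-matching idea on this slide can be illustrated with a toy retriever. ActivaTM uses Elasticsearch for this; the sketch below substitutes `difflib.SequenceMatcher` as a stand-in similarity measure, and the function name and TM format are assumptions:

```python
import difflib

def fuzzy_match(query, tm_segments, threshold=0.75):
    """Return (score, source, target) of the best TM match above threshold, else None.

    tm_segments is a list of (source_text, target_text) pairs.
    """
    best = None
    for src, tgt in tm_segments:
        score = difflib.SequenceMatcher(None, query, src).ratio()
        if best is None or score > best[0]:
            best = (score, src, tgt)
    return best if best and best[0] >= threshold else None
```

A stricter matcher, as the slide suggests, would also penalize differences in tags and numbers rather than treating them as ordinary characters.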
  7. The Database. Matrix (triangulates existing pairs to create new language pairs). Statistics on all segment units, words and domains. Remote access via API. Pre-filtering prior to MT (TM + MT).
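Triangulation joins two language pairs through a shared pivot language: if the TM holds EN-JA and EN-ES, matching on the EN side yields JA-ES pairs. A minimal sketch (function name and dict-based TM format are assumptions, not ActivaTM's schema):

```python
def triangulate(tm_pivot_b, tm_pivot_c):
    """Create a new language pair B-C from two TMs sharing pivot language A.

    tm_pivot_b maps pivot sentences to language-B translations;
    tm_pivot_c maps pivot sentences to language-C translations.
    """
    return [(b, tm_pivot_c[a]) for a, b in tm_pivot_b.items() if a in tm_pivot_c]
```

Production triangulation would score the resulting pairs, since pivot ambiguity (one EN sentence with several valid translations) can produce mismatched pairs.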
  8. The Neural. Artificial neural networks for SMT. History of ANN-based machine translation and language modelling for SMT: 1997 [Castano & Casacuberta 97] (Universitat Jaume I & U. Politecnica): machine translation using neural networks and finite-state models (PangeaMT: areas/mt-showcase). 2007 [Schwenk & Costa-jussa 07]: smooth bilingual n-gram translation. 2012 [Le & Allauzen 12, Schwenk 12]: continuous-space translation models with neural networks. 2014 [Devlin & Zbib 14]: fast and robust neural networks for SMT. Conventional SMT: the use of statistics has been controversial in computational linguistics. Chomsky 1969: "... the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." This was long considered true by most experts in (rule-based) natural language processing and artificial intelligence. History of the statistical approach to MT: 1989-94, IBM's pioneering work; from 1996, only a few teams favoured SMT (U. Politecnica de Valencia, RWTH Aachen, HKUST, CMU); 2006/2007, Google Translate; 2006-2012, EuroMatrix; 2009, PangeaMT.
  9. The Neural. Training data: TAUS data for Electronics Computer Hardware (ECH) plus SOFT (IT): 4.6M sentences / 56M words (EN). EN and JA tokenized (tokenizer.perl and MeCab, respectively). Seemingly not such a big difference. Results EN->JA:
  10. The Neural. BLEU: higher is better. TER: lower is better. WER: lower is better. BLEU measures precision over n-grams. TER: derived from the Levenshtein distance, working at the character level. WER: derived from the Levenshtein distance, working at the word level. Results EN->JA:
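The Levenshtein-based metrics described above are easy to reproduce. A sketch of word-level WER and a character-level error rate matching the slide's character-level definition of TER (standard TER additionally allows block shifts, which this sketch omits):

```python
def levenshtein(a, b):
    """Edit distance between two sequences via classic dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def wer(hyp, ref):
    """Word error rate: word-level edit distance over reference length."""
    h, r = hyp.split(), ref.split()
    return levenshtein(h, r) / len(r)

def char_error_rate(hyp, ref):
    """Character-level edit rate (the slide's character-level 'TER')."""
    return levenshtein(hyp, ref) / len(ref)
```

Since both scores are normalized by reference length, lower is better, consistent with the slide.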
  11. The Neural. Results EN->JA by sentence length: In short sentences (0-10 words), our SMT system scores better on BLEU, but looking at TER and WER we see that at character and word level NMT does better, which means less post-editing effort. In medium sentences (11-25 words), NMT always scores better on BLEU, WER and TER. In long sentences (26+ words), NMT tends to score about the same as PangeaMT. [Charts: BLEU, TER and WER by sentence length]
  12. The Neural. A: very good; perfect or needs very light post-editing. B: OK, but needs light post-editing. C: not good, but some meaning can be understood. D: not good at all; needs human translation. Do we need new metrics? BLEU does not seem to correlate well with the perception that NMT is much better.
  13. The Neural. Tests in FIGS (French, Italian, German, Spanish), RU and PT point to a very strong preference for NMT (results to be published in May). On average, from a set of 250 sentences, around 60-65% were rated good or very good (A or B). ES/PT/IT results were similar to FR. Evaluators: translation companies and professional freelance translators.
  14. Questions. Is NMT scary? Is it almost there (as good as human)? Is it just a matter of time (data and connectors) before NMT becomes ubiquitous? Where will we be in 3 years, 5 years? Do translation companies need to change their business model and become something else?
  15. Thank you!