9. Manuel Herranz (Pangeanic) Hybrid Solutions for Translation

Published in: Business, Technology


  1. 1. PangeaMT: Sharing Experiences on MT System, Data Management, Hybridation. Alex Helle / Manuel Herranz
  2. 2. Intro • Brief history • Pangea system introduction / features for EXPERT • Hybridation experiences at Pangeanic (+ future work)
  3. 3. Intro Brief history • “1-2 million words an hour” • “quite adequate speed to cope with the whole output of the Soviet Union in a week… a few hours computer time a week” • [full scale production] “if our experiments go well, within 5 years or so” http://youtu.be/K-HfpsHPmvw
  4. 4. What is PangeaMT? • The first commercial application of open-source Moses (AMTA 2010, http://euromatrixplus.net/moses) • A development overcoming Moses limitations for the localization industry, presented at the Association for MT in the Americas: "PangeaMT putting open standards to work... well", AMTA 2010, http://bit.ly/uM8x6V • 06/2011: PangeaMT launches the DIY solution to machine translate independently and flexibly like never before, http://bit.ly/kSd3wC • 07/2011: MT experiences at Sony Europe, http://slidesha.re/oxZmBS • 07/2011: A harness that eases re-training and updating • DIY SMT as presented at TAUS Barcelona 2011, http://slidesha.re/nEe5mU • 02/2012: API for hosted solutions
  5. 5. What is PangeaMT? 2007 and before • Rule-based (RB) tests with commercial software • Insufficiently good output • Only internal production 2007/08 • V1: small data sets (2-5M words), automotive & electronics (ES), then FR/IT/DE in other fields • EU Post-Editing Award 2009/10 • Division born • Hundreds of engine trials and language combinations • Open source to commercial 2011/12 • DIY SMT • Automated retraining • API v1 • Glossary • Automated re-training • Transfer architecture and know-how to users • Compatibility with commercial formats (ttx, sdlxliff, docx, odt) • TMX / XLIFF workflows • Powerful API v2 for live translation • Confidence scores • Compatibility with more commercial formats 2013
  6. 6. SMT at work Unrest is continuing in Cairo as protesters set up their demand for Egypt’s military rulers to resign + specific language rules + job or client glossary + hybrid technologies
  7. 7. Data? best clean, thank you Cleaning <tu srclang="en-GB"> <tuv xml:lang="EN-GB"> <seg>A system for recovering the methane that is emitted from the manure so that it does not leak into the atmosphere.</seg> </tuv> <tuv xml:lang="FR-FR"> <seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg> </tuv> Cleaning <tu creationdate="20090817T114430Z" creationid="APIACCESS" changedate="20110617T141159Z" changeid=“pat"> <tuv xml:lang="EN-US"> <seg>Overall heigtht –<bpt i="1">{f43 </bpt> <ept i="1">}</ept>25&quot;; width – <bpt i="2">{f43 </bpt> <ept i="2">}</ept>20.1&quot;.</seg> </tuv> <tuv xml:lang="ES-EM"> <seg><bpt i="1">{f2 </bpt>Altura total - 25&quot;; anchura <ept i="1">}</ept>– <bpt i="2">{f43 </bpt> <ept i="2">}</ept><bpt i="3">{f2 </bpt>20,1&quot;.<ept i="3">}</ept></seg> </tuv> </tu> More cleaning <tuv xml:lang=“EN-US"> <seg>On 22nd May we decided not to join the group.</seg> <tuv xml:lang=“DE-DE"> <seg>Am 22. </seg>
  8. 8. Data? best clean, thank you Cleaning <tu srclang="en-GB"> <tuv xml:lang="EN-GB"> <seg>The President of the United States visited Costa Rica.</seg> </tuv> <tuv xml:lang=“ES-ES"> <seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora Michelle, visitaron Costa Rica el pasado sábado.</seg> </tuv> Cleaning <tuv xml:lang=“JP"> <seg>同書は「通訳・翻訳キャリアガイド」の2011-2012年度版。 英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅 力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道 すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg> <tuv xml:lang=“EN-US"> <seg>It is a journalistic point of view and strengths of the Englishlanguage newspaper Japan Times. It includes a description of the exciting and rewarding work of translation and interpretation, as well as the introduction of consciousness and how to acquire the required professional skills. The road to becoming a translator and interpreter also down to the actual work site, a comprehensive guide to interpreting the reality of today'stranslation industry. </seg> More cleaning
  9. 9. Data? Best clean, thank you • Parallel text extraction / translation input / post-edited material (TMX): this often comes from CAT tools, document alignments or crawling • Data cleaning (in-lines): remove all non-translation data • Data cleaning modules: remove any “suspects”: sentences that are too long, mismatches (of many kinds!), terminological inaccuracies, non-useful segments, etc. • Human approval: some of this material may actually be OK for training; it is then input in the training set • Engine training with clean data: having approved, terminologically sound, clean data improves engine accuracy and performance, even with small sets of data
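The cleaning steps on slides 7 to 9 can be illustrated with a minimal Python sketch. It is not PangeaMT's cleaning module: the function names, the tag list, and the thresholds (`max_words`, `max_ratio`) are assumptions chosen for illustration. It strips TMX inline elements such as `<bpt>`/`<ept>` and flags "suspect" pairs that are empty, too long, or badly mismatched in length.

```python
import html
import re

# Matches TMX inline elements such as <bpt i="1">{f43 </bpt> or <ept i="1">}</ept>,
# including the formatting codes they wrap (tag list is an illustrative assumption).
INLINE_TAG = re.compile(r"<(bpt|ept|ph|it)\b[^>]*>.*?</\1>", re.DOTALL)


def strip_inlines(segment: str) -> str:
    """Remove inline markup, decode entities like &quot;, collapse whitespace."""
    text = INLINE_TAG.sub("", segment)
    return " ".join(html.unescape(text).split())


def is_suspect(source: str, target: str,
               max_words: int = 80, max_ratio: float = 2.0) -> bool:
    """Flag pairs that are empty, too long, or mismatched in length.

    Thresholds are hypothetical defaults, not values from the slides.
    """
    src_len, tgt_len = len(source.split()), len(target.split())
    if src_len == 0 or tgt_len == 0:
        return True
    if max(src_len, tgt_len) > max_words:
        return True
    return max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio


def clean_pairs(pairs):
    """Split (source, target) pairs into (kept, suspects) after tag stripping."""
    kept, suspects = [], []
    for src, tgt in pairs:
        src, tgt = strip_inlines(src), strip_inlines(tgt)
        (suspects if is_suspect(src, tgt) else kept).append((src, tgt))
    return kept, suspects


SAMPLE = [
    ("The system recovers the methane.", "Le système récupère le méthane."),
    ("On 22nd May we decided not to join the group.", "Am 22."),
    ("The President of the United States visited Costa Rica.",
     "El Presidente de los Estados Unidos, el señor Obama y su esposa la señora "
     "Michelle, visitaron Costa Rica el pasado sábado."),
]
kept, suspects = clean_pairs(SAMPLE)
```

With these thresholds, the truncated German segment and the over-translated Spanish segment from the slides land in `suspects`, where they could go to human approval. A whitespace word count is only meaningful for languages written with spaces; Japanese or Chinese text would need segmentation first.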
  10. 10. System features – For EXPERT Cleaning
  11. 11. System features – For EXPERT Domain
  12. 12. System features – For EXPERT Engine Creation
  13. 13. System features – For EXPERT Engine Training
  14. 14. System features – For EXPERT • Typically a 5-gram model, DL (distortion limit), table • Unrest is continuing in Cairo as protesters set up their demand for Egypt’s military rulers to resign • specific language rules • job / client glossary • hybrid technologies • good BLEU tracking, ideal for experimentation
  15. 15. Different MT Systems for Different Lang Pairs? • Related languages: SMT, with accurate n-gram training and in-domain data (typically 5-grams, distortion limit, weights and fine-tuning) • Morphology-rich languages: data is not enough and the casuistry is too large (Baltic languages like Latvian are extreme; Turkish is regular but has too many suffixes). SMT cannot cope: rule-based or hybrid • Syntactically distant languages: need additional information; this is where different HYBRID TECHNIQUES come into play. NO “ONE SIZE FITS ALL”
  16. 16. Hybridation Experiences at Pangeanic • Rationale: when the syntactic distance between languages is very large (unrelated languages), patterns are lost (or not found) → monotone TR • [Diagram: Linguistic Information / Language Knowledge / Data / Output Translation]
  17. 17. Hybridation Experiences at Pangeanic • TWO OPTIONS: SYNTAX-BASED / HYBRID SMT • Altaic languages ↔ English • Arabic ↔ European languages • Agglutinative ↔ non-agglutinative • RE-ORDERING: Toshiba / Mecab benchmarking, EN ↔ JP • [Diagram: Linguistic Information / Language Knowledge / Data / Output Translation]
  18. 18. Hybridation Experiences at Pangeanic • TWO METHODS • CHALLENGES • SVO vs SOV • Tokenization: no spaces between words; Mecab/KyTea for JP, Peterson Segmentor for ZH • RBMT systems have traditionally worked with linguistic & morphological analyzers, so “units” were segmented • SMT can’t, so we need to tokenize to leave a similar number of “words” on both sides • Giza++ can then relate words and groups
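To make the tokenization challenge concrete, here is a toy Python sketch of greedy longest-match segmentation. It is only a stand-in for real analyzers such as Mecab or KyTea, which use trained models and full dictionaries; the `LEXICON` below is a hypothetical handful of entries picked to cover the Japanese sentence from slide 22.

```python
# Hypothetical mini-lexicon; a real system would rely on Mecab/KyTea instead.
LEXICON = {
    "発売", "時", "には", "同社", "は", "次", "の",
    "バージョン", "を", "提供", "する", "予定", "です",
}


def segment(text, lexicon):
    """Greedy longest-match segmentation for text written without spaces.

    At each position, take the longest lexicon entry that matches;
    unknown characters (e.g. punctuation) become single-character tokens.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


SENTENCE = "発売時には、同社は次のバージョンを提供する予定です。"
TOKENS = segment(SENTENCE, LEXICON)
```

Once the Japanese side is split into word-like units, its token count becomes comparable to the English side, which is what a word aligner such as Giza++ needs in order to relate words and groups.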
  19. 19. Hybridation Experiences at Pangeanic TWO OPTIONS CHALLENGES  SVO vs SOV
  20. 20. Hybridation Experiences at Pangeanic • TWO METHODS • CHALLENGES • SVO vs SOV • Re-ordering? • Phrase-based or hierarchical models (syntactical)? • Example: “Continue to press the button to scroll through the components of the program until the display shows the desired current selection.” Proper Japanese word order would be: “the display the desired current selection shows until the components the program of through to scroll the button to press continue.”
  21. 21. Hybridation Experiences at Pangeanic • Syntax-based analysis & re-ordering rules • SYNTAX-BASED (TREE) FOR HYBRID SMT • Tree depth: 10 • Calculation time: +59%!
  22. 22. Hybridation Experiences at Pangeanic • Syntax-based analysis & re-ordering rules • SYNTAX-BASED RULES FOR HYBRID SMT • Source: 発売 時 には、 同社は 次の バージョンを 提供する 予定 です 。 • Translation & cleaning: available When , the company the following : plans to offer : • Nipponization module: (Cond clause), (Subject) (VBPt) (to) (Predicate) • (ADV) (ADJ) (Punct) (DET) (NNSing) (VBPt3) (to) (VBinf) (DET) (NN) • Output: When available, the company plans to offer the following:
  23. 23. Hybridation Experiences at Pangeanic • TWO OPTIONS • TOSHIBA vs MECAB • Toshiba’s The Honyaku is an established RB system (+30 years) • Lacks flexibility; rules contradict each other • Proposal: re-arrange the whole EN corpus for JP with Toshiba’s rules, but this meant dependency on a proprietary system for future inputs
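The SVO vs SOV re-ordering idea can be sketched with a toy Python example. This is not the Nipponization module or Toshiba's rule set: the chunk roles (`S`, `V`, `O`, and `X` for anything else) and the `reorder` function are hypothetical, and a real system would obtain the chunks from a parser or morphological analyzer.

```python
from typing import List, Tuple

Chunk = Tuple[str, str]  # (role, text); roles are hypothetical labels


def reorder(chunks: List[Chunk], order: str) -> List[Chunk]:
    """Rearrange the chunks whose role appears in `order` (e.g. "SOV").

    Chunks with any other role keep their original positions.
    """
    roles = list(order)
    slots = [i for i, (role, _) in enumerate(chunks) if role in roles]
    moved = sorted((chunks[i] for i in slots), key=lambda c: roles.index(c[0]))
    result = list(chunks)
    for slot, chunk in zip(slots, moved):
        result[slot] = chunk
    return result


ENGLISH = [
    ("X", "When available,"),
    ("S", "the company"),
    ("V", "plans to offer"),
    ("O", "the following:"),
]

# English (SVO) rearranged into a Japanese-like SOV order before training...
JAPANESE_LIKE = reorder(ENGLISH, "SOV")
# ...and an SOV gloss rearranged back into English order after translation.
RESTORED = reorder(JAPANESE_LIKE, "SVO")
```

Here `JAPANESE_LIKE` reads "When available, the company the following: plans to offer", close to the intermediate gloss on slide 22. Putting both sides of the corpus into a similar order is what lets a phrase-based system find patterns that are otherwise lost between syntactically distant languages.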
  24. 24. Hybridation Experiences at Pangeanic TWO OPTIONS TOSHIBA vs MECAB – LESSONS LEARNT Mecab re-ordering produced higher BLEU than Toshiba’s 5-fold structure
  25. 25. Hybridation Experiences at Pangeanic • TWO OPTIONS • TOSHIBA vs MECAB – LESSONS LEARNT • Mecab re-ordering produced higher BLEU than Toshiba’s • Paper published December 2011 (AAMT): “Going Hybrid: Pangeanic’s and Toshiba’s First Steps Toward ENJP MT Hybridation”
  26. 26. Hybridation Experiences at Pangeanic • TWO OPTIONS • TOSHIBA vs MECAB – LESSONS LEARNT • Mecab re-ordering produced higher BLEU than Toshiba’s • Paper published December 2011 (AAMT): “Going Hybrid: Pangeanic’s and Toshiba’s First Steps Toward ENJP MT Hybridation”
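Since the Toshiba vs Mecab comparison rests on BLEU, a minimal Python sketch of the metric may help. It computes a single-reference, sentence-level BLEU (clipped n-gram precisions up to 4-grams, geometric mean, brevity penalty) without smoothing; it is a simplified illustration, not the scoring script used for the results on these slides, and the example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU on whitespace tokens, in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped counts: an n-gram is credited at most as often as it
        # occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


REFERENCE = "the company plans to offer the following version"
CLOSE = "the company plans to offer the next version"
SCRAMBLED = "the company the following version plans to offer"
```

Because higher-order n-grams only match when word order is right, a candidate in the wrong order (such as `SCRAMBLED`) scores far below one with a single wrong word (`CLOSE`), which is why BLEU responds to the choice of re-ordering method.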
  27. 27. Future (current) Work on Hybrids • Morphology-rich langs: RU in particular. Improve DE • Distant languages: re-ordering for AR? • Agglutinative langs: TK – new paradigm
  28. 28. Brief history • Intro • Pangea system introduction / features for EXPERT • Hybridation experiences at Pangeanic (+ future work)
  29. 29. Questions? m.herranz@pangeanic.com #manuelhrrnz #pangeanic pangeanic
