TAUS MT SHOWCASE, MT for Southeast Asian Languages, Ai Ti Aw, Institute for Infocomm, 10 April 2013


This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit.

MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme.

For the latest updates, follow us on Twitter - #MosesCore

Published in: Technology, Business

  1. TAUS Machine Translation Showcase: MT for Southeast Asian Languages. 14:00 – 14:20, Wednesday, 10 April 2013. Ai Ti Aw, Institute for Infocomm, Singapore
  2. Southeast Asian Language Machine Translation. Ms Ai Ti AW, Human Language Technology Department, Institute for Infocomm Research, Singapore
  3. Agenda
       1. Machine Translation
       2. Southeast Asian Languages
       3. Institute for Infocomm Research (I2R)
       4. Challenges for Southeast Asian Language Translation
       5. Machine Translation Applications
     Localization World, Singapore, 10-12 Apr 2013
  4. The Tower of Babel, Pieter Brueghel the Elder (1563) (Wiki)
  5. Languages of the World. Each dot represents the geographic center of the 6,912 living languages in the Ethnologue database. Gordon, Raymond G., Jr. (ed.), 2005. Ethnologue: Languages of the World, Fifteenth edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com/.
  6. Father of Translation
       • Xuanzang (玄奘, 602-664): First Translator in China. http://baike.baidu.com
       • St. Jerome (347-420): Translation of Bible into Latin. http://mb-soft.com/believe/txn/jerome.htm
  7. Pioneer of Machine Translation. Warren Weaver (1894-1978): Decoding. “When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’” (1949) http://en.wikipedia.org/wiki/Warren_Weaver
  8. Translation Jokes
  9. Machine Translation
       • Translation Unit: Word, Phrase, Tree
       • Linguistic Complexity: Lexical, POS, Syntax
       • Expert knowledge; Translation examples
       • Translation Model, Language Model, Decoding Algorithm
  10. The Vauquois Triangle: interlingua, semantic transfer, syntactic transfer, direct
  11. Translation Methodology: Word-to-Word Translation, Phrase-based Translation, Syntax-based Translation
      [Figure: aligned source and target parse trees. Source tree labels: S, VBA, VO, P, NG, VG, R, WJ. Target tree labels: VBP, DT, NN, TO, PRP, PUNC., NP, PP, VP, S.]
      Ts: 把 钢笔 给 我 。 (NULL) (pen) (give) (me) (.)
      A: Give the pen to me .
  12. Rule-based Approach
      [Figure: analysis, transfer and generation pipeline driven by a Lingware Interpreter. Labels: lexical transfer, structural transfer, structural analysis, structural generation; Bilingual Dictionary, Structural Mapping Rules, Parsing Rules, Structure Generation, Morphological Rules, Morph Dictionary, Language Model.]
      Example: The story is interesting . → Cerita menarik .
  13. Statistical-based Approach
       • TRAINING: Parallel corpus → Word alignment → Statistical modeling → Translation model (TM) and Re-ordering model (RM); Target language corpus → Language modeling → Language model (LM)
       • TESTING: Source language input f → Statistical decoding → Target language output e
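The slide names the models but not how the decoder combines them. A minimal sketch, assuming the standard noisy-channel reading in which the output e maximizes P(f|e) · P(e): the translation model supplies P(f|e) and the language model supplies P(e). The re-ordering model is left out for brevity, and every candidate and probability below is hypothetical.

```python
import math

# Source sentence f, taken from the syntax example on the Translation Methodology slide.
f = "把 钢笔 给 我 。"

# Hypothetical translation model scores P(f | e) for three candidate outputs e.
TM = {"give me the pen": 0.20, "give the pen to me": 0.25, "pen give me": 0.30}
# Hypothetical language model scores P(e): how fluent each candidate is.
LM = {"give me the pen": 0.05, "give the pen to me": 0.06, "pen give me": 0.0003}

def score(e):
    """Noisy-channel score in log space: log P(f|e) + log P(e)."""
    return math.log(TM[e]) + math.log(LM[e])

def decode(candidates):
    """Return the candidate that maximizes the combined score."""
    return max(candidates, key=score)

best = decode(list(TM))
print(f, "->", best)
```

The word-salad candidate has the highest translation model score on its own, but the language model pulls the decision toward a fluent output, which is the division of labour the diagram suggests.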
  14. Southeast Asian Languages: English, Chinese, Malay, Lao, Khmer, Filipino, Thai, Indonesian, Myanmar, Vietnamese
  15. Characteristics of Southeast Asian Languages
      Language    Tone  Affix  Inflection  Reduplication  Word Segmentation  Sentence Concept
      Chinese     Yes   No     No          No             Yes                Yes
      Filipino    No    Yes    Yes         Yes            No                 Yes
      Indonesian  No    Yes    No          Yes            No                 Yes
      Khmer       No    No     No          Yes            Yes                Yes
      Lao         Yes   No     No          No             Yes                Yes
      Malay       No    Yes    No          Yes            No                 Yes
      Myanmar     Yes   No     Yes         No             Yes                Yes
      Thai        Yes   No     No          No             Yes                No
      Vietnamese  Yes   No     No          No             Yes                Yes
      Contributed by the ASEAN-MT Project
  16. Language Processing Tools
      Language (Country)      Morphological Analysis  Word Segmentation  Sentence Boundary Detection
      Chinese (Singapore)     NA                      Available          NA
      Filipino (Philippine)   Available               NA                 NA
      Indonesian (Indonesia)  Available               NA                 NA
      Khmer (Cambodian)       NA                      Available          NA
      Lao (Laos)              NA                      Available          NA
      Malaysian (Malaysia)    Available               NA                 NA
      Myanmar (Myanmar)       Available               Available          NA
      Thai (Thailand)         NA                      Available          Available
      Vietnamese (Vietnam)    NA                      Available          NA
      Contributed by the ASEAN-MT Project
  17. Research Institutes and Companies
  20. Machine Translation Research
       • 1989: Initiated R&D in English→Chinese MT
       • 1990: Awarded S$2m IBM English→Chinese MT project
       • 1992: Developed in-house English↔Malay MT
       • 1993: Set up MT Service Unit
       • 1997: Spin-off AsiaRain Automated Translation
       • 2000: Commercialized MT technology: Chinese→English MT, Indonesian↔English MT, English→Thai MT
       • 2004: Enhance and construct lexical resources, machine learning techniques in source text analysis
       • 2005: Started Statistical Machine Translation
       • 2007: Vietnamese→English MT
       • 2010: Hybrid MT
       • 2012: Malay→Chinese MT, Vietnamese→Chinese MT
  21. Phrase-based SMT: Learning Heuristics
      1) Source Phrase Segmentation 2) Phrase Translation 3) Target Phrase Reordering
       • Discover effective heuristics from a limited dataset
       • Phrase Segmentation Model: 中国的/经济/发展, 中国的/经济发展, 中国的经济/发展, …
       • From Word to Pseudo-Word and then to Phrase: “想” and “would like to”; “多少 钱” and “how much is it”
       • Hypothesis Regeneration with System Combination: generating new hypotheses from translation results (one or more systems); combining results and re-scoring
      Xiangyu Duan, Min Zhang and Haizhou Li. Pseudo-word for Phrase-based Machine Translation. ACL-2010
      Boxing Chen, Min Zhang and Aiti Aw. Two-Stage Hypotheses Generation for Spoken Language Translation. ACM TALP 8(1) (2009)
      Deyi Xiong, Min Zhang and Haizhou Li. Learning Translation Boundaries for Phrase-Based Decoding. NAACL-HLT 2010
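The segmentation example on this slide lists a few of the ways the same three words can be grouped into phrases. As a rough illustration only (this is not the segmentation model from the cited papers), the sketch below enumerates every contiguous segmentation of a word sequence; a sequence of n words has 2^(n-1) of them, which is the search space a segmentation model has to rank. The function name is hypothetical and the input is assumed to be non-empty.

```python
from itertools import product

def segmentations(words):
    """Yield every way to split a non-empty word list into contiguous phrases."""
    # Each of the len(words) - 1 gaps between words is either a boundary or not.
    for cuts in product([True, False], repeat=len(words) - 1):
        phrases, current = [], [words[0]]
        for word, cut in zip(words[1:], cuts):
            if cut:
                # Close the current phrase and start a new one at this word.
                phrases.append("".join(current))
                current = [word]
            else:
                current.append(word)
        phrases.append("".join(current))
        yield phrases

words = ["中国的", "经济", "发展"]
all_segs = ["/".join(s) for s in segmentations(words)]
for seg in all_segs:
    print(seg)
```

For the slide's three-word example this prints four segmentations: the three shown on the slide plus the single-phrase reading 中国的经济发展.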
  22. Linguistic Syntax-based SMT
       • Tree Sequence-based SMT. [Bar chart: BLEU-4 on NIST 05 (trained on FBIS Corpus) for SCFG, Moses, Ours: STSG and Ours: STSSG; y-axis 0.21 to 0.27.]
         Min Zhang, Hongfei Jiang, Aiti Aw and Haizhou Li. A Tree Sequence Alignment-based Tree-to-Tree Translation Model. ACL-2008:HLT
       • Forest-based SMT. [Bar chart: BLEU-4 on NIST 05 (trained on FBIS Corpus) for Moses and Ours: TT2S, TTS2S, FT2S, FTS2S; y-axis 0.24 to 0.3.]
         Hui Zhang, Min Zhang, Haizhou Li, Aiti Aw and Chew Lim Tan. Forest-based Tree Sequence to String Translation Model. ACL-IJCNLP-2009
         Hui Zhang, Min Zhang, Haizhou Li and Chew Lim Tan. Fast Translation Rule Matching for Syntax-based Statistical Machine Translation. EMNLP-2009
  23. Exploring Semantics in Phrase-based SMT: Predicate Translation & Argument Reordering
      Deyi Xiong, Min Zhang, Haizhou Li. Modeling the Translation of Predicate-Argument Structure for SMT. ACL 2012.
  24. Discourse-based SMT (Topic Model)
      Xinyan XIAO, Deyi XIONG, Min ZHANG, Qun LIU and Shouxun LIN. A Topic Similarity Model for Hierarchical Phrase-based Translation. ACL-2012
  25. Discourse-based SMT (Document Cache Model)
       • Use document-level information to choose translation candidates
      Zhengxian GONG and Min ZHANG. Cache-based Document-level Statistical Machine Translation. EMNLP-2011
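The slide states only the idea of using document-level information to choose translation candidates. The sketch below is one hypothetical way to picture it, not the model of the cited paper: a cache remembers which target phrase was chosen for a source phrase earlier in the same document, and later sentences get a bonus for staying consistent. The class, function names, placeholder tokens (`f1`, `e_a`, `e_b`), scores and the bonus weight are all assumptions made for illustration.

```python
from collections import Counter

class DocumentCache:
    """Remembers which target phrase was chosen for each source phrase so far."""

    def __init__(self):
        self.counts = Counter()

    def add(self, source, target):
        self.counts[(source, target)] += 1

    def bonus(self, source, target, weight=0.5):
        # Hypothetical feature: a fixed bonus per earlier use in this document.
        return weight * self.counts[(source, target)]

def choose(source, candidates, cache):
    """Pick the candidate with the best sentence-level score plus cache bonus."""
    best = max(candidates, key=lambda t: candidates[t] + cache.bonus(source, t))
    cache.add(source, best)
    return best

cache = DocumentCache()
# Sentence 1: the local context clearly prefers e_a (scores are log-probabilities).
first = choose("f1", {"e_a": -0.2, "e_b": -2.0}, cache)
# Sentence 2: locally e_b is slightly better, but the cache keeps the document consistent.
second = choose("f1", {"e_a": -1.0, "e_b": -0.8}, cache)
print(first, second)
```

Without the cache the second sentence would pick `e_b`; with it, the earlier choice tips the decision back to `e_a`.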
  26. Challenge: Overcome Low Resources
       1. How to build a system with limited language resources?
       2. How to leverage human translation knowledge for SMT?
       3. How to improve the system when large language resources are available?
  27. Approach
       1. Given limited statistics, consider using prior linguistic knowledge to improve the statistical model
       2. When we are able to craft rules, consider using a statistical approach to improve productivity
  28. Lexical Pattern: Term Translation
       • Term: a phrase whose structure as a whole carries a specific meaning
       • Term Identification and Translation: domain specific
           English: Skills Upgrading and Resilience Programme; SPUR
           Malay: Program Kemahiran bagi Peningkatan dan Ketahanan; SPUR
           Chinese: 技能提升与应变计划; 策马扬鞭
       • Tedious and time consuming to acquire them manually for a new domain
  29. Mining Bilingual Terms
      [Figure: for each language, Mono Corpus → Monolingual Term Extraction → Mono Terms; the two corpora go through Document Alignment → Aligned Doc; Bilingual Term Alignment/Extraction then produces Bi-Terms.]
       • Ways of acquiring bilingual terms
           - Alignment on parallel sentences
           - Using web data to search for translation candidates
           - Mining from comparable corpora
           - Manual coding/analysis of new MWEs
       • Our approach
           - Automatic mining of bilingual terms from comparable corpora
           - Unavailability of large parallel text
           - Easy accessibility of monolingual corpus
      Lianhau Lee, Aiti Aw, Thuy Vu, Sharifah Aljunied Mahani, Min Zhang and Haizhou Li. “MARS: Multilingual Access and Retrieval System with Enhanced Query Translation and Document Retrieval”, ACL-IJCNLP 2009.
      Lianhau Lee, Aiti Aw, Min Zhang and Haizhou Li. “EM-based Hybrid Model for Bilingual Terminology Extraction from Comparable Corpora”, COLING 2010
  30. Parallel Sentence Extraction: Document Alignment
      [Two charts, x-axis 1 to 91: distributions for the terms Bank Dunia / World Bank / 世界银行 (y-axis 0 to 0.2) and Dunia / World / 世界 (y-axis 0 to 0.03).]
      Thuy Vu, Ai Ti Aw, Min Zhang. 2009. Feature-based Method for Document Alignment in Comparable News Corpora. In 12th EACL 2009, Athens, Greece
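The charts on this slide plot a term next to its translations, which suggests that translation-equivalent terms tend to have similarly shaped distributions. The transcript does not say what the x-axis bins are or which similarity measure the cited paper uses, so the following is only a sketch under assumptions: each term is represented by hypothetical per-bin counts, and the shapes are compared with cosine similarity.

```python
import math

def cosine(p, q):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

# Hypothetical per-bin counts; the real charts use bins 1 to 91.
bank_dunia = [0, 3, 5, 1, 0, 0]   # Malay term
world_bank = [0, 2, 6, 1, 0, 0]   # English term with a similar shape
unrelated = [4, 0, 0, 0, 1, 5]    # a term that peaks in different bins

sim_pair = cosine(bank_dunia, world_bank)
sim_other = cosine(bank_dunia, unrelated)
print(round(sim_pair, 3), round(sim_other, 3))
```

A score like this could serve as one feature among several when deciding whether two documents, or two terms, belong together; on its own it says nothing about meaning.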
  31. Document Alignment: Example
      English: New Changi Hospital will be health-care hub for eastern S'pore
       • Author: Allison Lim, 28/11/1996.
       • THE New Changi Hospital will be Singapore's first purpose-built regional hospital, said Health Minister George Yeo. It will cater for up to 750,000 people who live in the east and northeast regions. [To reach out to them, it has been designed to be a meeting place …]
       • Brigadier-General (NS) Yeo, who is also Minister for Information and the Arts, said that the hospital will have a birthing centre for young couples living in the region. It will be run as a satellite of the Kandang Kerbau Women's and Children's Hospital. In addition, there will be satellite facilities for psychiatry, rehabilitation medicine and other medical specialities. The whole idea is a whole range of medical facilities in a hospital that will also serve as a health-care hub for the entire region, he said of the $480-million hospital. [The regional hospital concept ….]
       • The minister, who was accompanied by senior officials from the Health Ministry, later planted a Chengai sapling, near the hospital entrance. Senior Minister of State (Health and Education) Aline Wong planted a Tampines sapling. [Health care will remain affordable …]
       • BG Yeo said that later on, a community hospital will be built next to the New Changi Hospital, between it and the Pan-Island Expressway. In fact, plans are already being drawn up and the St Andrew's Mission Hospital will run this new community hospital which will have more than 200 beds. So in this way we will provide, close to the housing estates here, a full range of medical facilities, he said. It should be ready by 2000. [He said that the regional hospital ….]
       • The new regional hospital will replace Toa Payoh Hospital, which will become a community hospital, and the existing Changi Hospital. [The latter's site will be returned ….]
      Malay: Hospital Changi Baru dibuka mulai bulan depan
       • Author: Nazry Mokhtar, 28/11/1996
       • [Kemudahan $312 juta dijangka jadi …] Selain kemudahan perubatan penuh, ia akan mempunyai wad bersalin dan klinik bagi merawat bayi - sama seperti Hospital Kandang Kerbau. Sebuah hospital masyarakat baru juga akan dibina berdekatan hospital tersebut untuk menjadikan NCH sebagai pusat perubatan terunggul di kawasan timur Singapura yang mampu memenuhi keperluan sekitar 750,000 penduduk di situ. Ini menjadikannya sebagai hospital daerah pertama di sini yang dibangunkan khusus bagi memenuhi pelbagai keperluan perubatan penduduk di sesuatu daerah.
       • [Menteri Kesihatan, Brigedier-Jeneral (Kerahan) George Yeo, berkata demikian …] Antara kemudahannya termasuk kemudahan bersalin yang dikelolakan oleh Hospital Kandang Kerbau dan kemudahan bagi rawatan psikiatri dan pemulihan. Hospital baru itu menggantikan Hospital Toa Payoh dan Hospital Changi.
       • BG Yeo, yang juga Menteri Penerangan dan Kesenian, berkata: “Rancangan hospital ini ialah menawarkan kemudahan perubatan lengkap sejajar dengan matlamat menjadikannya sebuah pusat perubatan terunggul di daerah timur Singapura.” Mengenai hospital masyarakat yang bakal dibina berdekatan hospital baru itu, beliau berkata ia akan melengkapi kemudahan NCH. Hospital masyarakat dengan 200 katil pesakit itu akan diuruskan oleh Hospital St Andrews Mission dan dijangka siap menjelang tahun 2000. [BG Yeo selanjutnya berkata …]
       • Dalam lawatan semalam, BG Yeo yang ditemani Menteri Negara Kanan (Pendidikan dan Kesihatan), Dr Aline Wong, masing-masing menanam sebatang pokok di luar lobi hospital itu.
  32. Document Alignment: Example
      English: MAS profit falls 68% to $1.22b on higher rates, stronger S$
       • Author: Ericia Tay, 21/07/2006.
       • [Central bank's total assets up …] The futures market suggests that oil prices could stay at around US$80 a barrel, and while the world economy has so far been resilient, the risks of a sharper slowdown due to supply disruptions have gone up, noted Mr Heng.
       • Nevertheless, inflationary pressures at home should be fairly well contained, even though the indirect effects of higher oil prices on energy-related consumer items and business costs are expected to strengthen. The MAS stuck to its earlier prediction that Singapore's economic growth this year is likely to be between 5 per cent and 7 per cent, barring unexpected shocks in the rest of the year.
       • Although global IT demand growth may be capped somewhat by potentially slower growth in the United States in the second half of 2006, the prospects for continued economic growth in the quarters ahead appear intact, said Mr Heng of the outlook for Singapore. The MAS also kept its inflation forecast of between 1 per cent and 2 per cent for the whole of this year. These macroeconomic projections are based on the assumption that crude oil prices average US$68 to US$75 a barrel.
       • In the first half of this year, Singapore's gross domestic product (GDP) grew by an estimated 9 per cent from the same period last year. Taking into account Singapore's GDP growth and inflation prospects, the central bank said its policy stance on the Singdollar - a modest and gradual strengthening of the currency - remains appropriate. [Unlike many central banks which use interest rates as a policy tool…]
      Chinese: 金管局看好未来数季度增长
       • Author: 罗文燕, 21/07/2006
       • [ 在中东紧张局势升温。。。] 金融管理局董事经理王瑞杰昨天在发表常年报告书的记者会上说,高油价转嫁到能源相关消费物品和商业营运成本的程度预料会提高,但整体国内通货膨胀压力应该会受到相当好的控制。尽管油价升高,金管局保持对我国今年的通胀率将介于1%到2%的预测。[ 王瑞杰说 。。。]
       • 根据贸工部上星期发表的预估数据,我国经济今年上半年强劲增长了9.1%。不过,下半年的。王瑞杰说: 美国经济增长可能在下半年放缓,这或许会抑制全球资讯科技需求的增长,但(新加坡)今后几个季度持续保持经济增长的前景似乎没变。
       • 因此,排除地缘政治风险激增等无法预见的外来冲击,金管局预期全年的经济增长率多数会保持在5%到7%。然而,王瑞杰指出: 石油供应被中断以致经济更急速放缓的风险现在增加了。显然的,地缘政治跟油价。。。
       • 中东紧张局势最近升温,已导致油价进一步升高。王瑞杰说,从期货市场的走势来看,油价预料会保持在每桶80美元左右的高水平。他说,金管局对通胀和经济的预测,有考虑到平均油价可能处于每桶65美元到78美元的价位。
       • 在考虑到我国的增长和通胀前景后,王瑞杰表示,金管局认为当局目前让新元汇率继续适度及逐步增值的政策立场仍然适合。当局下一次将在10月发表半年一次的货币政策声明。
  33. Hybrid System
       • Source: Beliau juga berterima kasih kepada MAS dan AirAsia kerana menyediakan penerbangan terus ke Macau, yang memudahkan MGTO untuk mempromosikan bandar itu.
       • SMT: He was also grateful to mas and airasia for providing direct flights to macau, which facilitate promoting the MGTO to.
       • MEMT: He also is thankful for MAS and Airasia for preparing flight directly to Macau, which facilitates MGTO to promote the town.
       • SMT+MEMT: He was also grateful to MAS and Airasia for providing direct flights to macau, which facilitates the MGTO to promote the city.
      BLEU: SMT 0.4062; MEMT 0.2725; SMT+MEMT 0.4165
  34. Scientific Achievements
       • Papers in leading journals
           - IEEE Transactions on Audio, Speech and Language Processing
           - ACM Transaction on Asian Language Information Processing
           - Information Processing and Management
           - Computational Linguistics
       • Papers in leading conferences
           - The Annual Meeting of The Association for Computational Linguistics (ACL)
           - Conference on Empirical Methods in Natural Language Processing (EMNLP)
           - International Conference on Computational Linguistics (COLING)
  35. Baidu-I2R Research Centre
      Baidu's Box Computing: Beating Google At Its Own Game. March 27, 2012, Seeking Alpha
      “… According to Baidu, 60% of search results are produced by Box Computing, which delivers interactive, relevant, and intuitive search experience that makes Baidu a clear leader in China's online search market. Unfortunately, Google has yet to catch up with Baidu on semantic search.”
      “…Recently, Baidu formed a partnership with Agency for Science, Technology and Research (A*STAR) to establish an R&D center in Singapore that focuses on developing South Asian language processing technology. The joint research lab will initially focus on Vietnamese and Thai.”
  36. Network-based Speech to Speech Translation Service
  37. Malay-English S2S Mobile Translation (MALAY ↔ ENGLISH)
       • No existing commercial Malay speech recognition.
       • Small footprint: compact models, can run on small devices.
       • Usable in many contexts: humanitarian and disaster relief efforts, tourist travel
  38. Document Translation
  39. Multilingual Chat Messaging
      [Figure: two chat clients connected through a chat server; a web service server hosts a normalization bot and a translation bot, backed by a default dictionary and user-defined dictionaries.]
       1. Chat message normalized by normalization bot.
       2. Chat message sent to chat server.
       3. Chat message sent to the recipient.
       4. Chat message translated by the translation bot.
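The four numbered steps on this slide can be pictured as a small pipeline. The sketch below is a toy stand-in under stated assumptions: the dictionary entries, function names and language codes are hypothetical, the chat server is reduced to passing the message along, and the translation bot is a stub that only tags the message where a real system would call the MT web service.

```python
# Hypothetical default dictionary of chat shorthand; not taken from the slides.
DEFAULT_DICT = {"u": "you", "r": "are", "2moro": "tomorrow"}

def normalize(message, user_dict=None):
    """Normalization bot: replace chat shorthand token by token."""
    # User-defined entries override the default dictionary.
    lookup = {**DEFAULT_DICT, **(user_dict or {})}
    return " ".join(lookup.get(tok.lower(), tok) for tok in message.split())

def translate(message, target_lang):
    """Translation bot stub: tags the message instead of translating it."""
    return f"[{target_lang}] {message}"

def deliver(message, recipient_lang, user_dict=None):
    normalized = normalize(message, user_dict)   # step 1: normalization bot
    on_server = normalized                       # step 2: sent to chat server
    received = on_server                         # step 3: sent to the recipient
    return translate(received, recipient_lang)   # step 4: translation bot

print(deliver("r u free 2moro", "ms"))
print(deliver("c u", "zh", {"c": "see"}))
```

Normalizing before translation matters because an MT system trained on standard text is unlikely to have seen shorthand such as "2moro"; the user-defined dictionary lets each user extend the shorthand list.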