Towards OpenLogos Hybrid Machine Translation - Anabela Barreiro


OpenLogos open source machine translation - the ideal platform for a hybrid machine translation solution

Published in: Technology

  1. 1. Towards OpenLogos Hybrid Translation, Anabela Barreiro
  2. 2. Introduction with Contextual Information
     Research goals
       – OpenLogos: the first hybrid open source machine translation solution
       – Hybridization of the OpenLogos system consists of embedding linguistic knowledge into statistical machine translation (SMT)
     The timing is just right
       – SMT researchers and developers recognize the need to integrate linguistic knowledge into machine translation (MT) systems
       – Cloud computing, big data and advanced alignment techniques make the development of new language pairs easier and faster
       – Crowdsourcing support can be used to increase MT quality
  3. 3. Introduction with Contextual Information
     The ideal platform for hybrid translation
       – Logos legacy: one of the first rule-based MT (RBMT) systems (1970)
       – Logos Corporation was one of the longest-running commercial MT companies in the world (in business for over 30 years)
       – The Logos MT product put its emphasis on semantic understanding
       – The Logos approach used linguistic analysis of English to render it in a form that was “understood” by the computing system
       – To a certain extent, the Logos approach is similar in spirit to the SMT approach, and complements SMT by providing answers that help overcome statistical weaknesses
  4. 4. Introduction with Contextual Information
     The open source initiative
       – OpenLogos is publicly available as open source software
       – It has enthusiastic advocates and fervent supporters in different parts of the world who believe that:
         • OpenLogos will be used as the rule-based component of a new linguistically enhanced hybrid translation system
         • The open source components of OpenLogos will help the NLP/CL research community make scientific advances
  5. 5. Presentation Outline
       – Background on OpenLogos
       – MT system pipeline architecture
       – SAL representation language
       – Classic problems with rule-driven systems
       – How SAL benefits translation
       – Advantages of the OpenLogos architecture
       – Uniqueness of the OpenLogos MT system
       – Exploiting OpenLogos resources for new applications
       – Availability of OpenLogos free resources
  6. 6. Background to OpenLogos
     Open source copy of the Logos system (1970-2001), adapted by DFKI
       – Developed in the US, Germany and Italy
       – 25-100 development staff for 30 years
       – Investment of over 80 million US dollars
     8 language pairs: EN-GE, EN-FR, EN-ES, EN-IT, EN-PT, GE-EN, GE-FR, GE-IT
     The commercial product was considered high quality
     Industrial-strength MT, used successfully in 12 countries
     Users included Ericsson of Sweden, the Canadian Secretary of State, SAP, Siemens-Nixdorf, Oce Netherlands, and Union Fenosa
  7. 7. OpenLogos Characteristics
     Multi-target system
       – One source language analysis can generate any number of targets
     Pipeline architecture
     Language-neutral software
       – All linguistic knowledge is in data files, stored in a relational database
     Semantico-Syntactic Abstraction Language (SAL representation)
       – Taxonomy-ontology
       – NL sentences entering the system are immediately converted into SAL sentences
       – SAL is the driving force of the OpenLogos process
     Semantic processing
       – Semantic Table (SEMTAB) containing thousands of transformation rules
  8. 8. OpenLogos Pipeline Architecture
     [Diagram: Input, Format, RES1, RES2, P1, P2, P3, P4 (with SAL rules and SEMTAB), S, T1, T2, T3, T4, GEN (with target rules and SEMTAB, repeated for each target), Format, Output]
       – Highly modular
       – Incremental processing
       – Multi-target system
       – Bottom-up analysis
       – Deterministic parse
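The multi-target idea on this slide (one source analysis, any number of target branches) can be illustrated with a minimal sketch. Everything here is hypothetical: the stages only record their names instead of doing linguistic work, and reading the diagram's T1-T4 as TRAN1-TRAN4 is an assumption.

```python
from typing import Dict, List

# Stage names follow the diagram labels; TRAN1-TRAN4 for "T1-T4" is an assumption.
SOURCE_STAGES = ["RES1", "RES2", "PARSE1", "PARSE2", "PARSE3", "PARSE4"]
TARGET_STAGES = ["TRAN1", "TRAN2", "TRAN3", "TRAN4", "GEN"]


def run_stages(trace: List[str], stages: List[str]) -> List[str]:
    """Hypothetical stand-in for real processing: each stage appends its name."""
    result = list(trace)
    for stage in stages:
        result.append(stage)
    return result


def translate(sentence: str, targets: List[str]) -> Dict[str, List[str]]:
    """Analyse the source once, then run one target branch per language."""
    analysis = run_stages([f"input:{sentence}"], SOURCE_STAGES)  # shared by all targets
    return {
        target: run_stages(analysis, [f"{stage}[{target}]" for stage in TARGET_STAGES])
        for target in targets
    }


traces = translate("raise the rent", ["PT", "FR"])
print(traces["PT"])
```

Both target traces share the same source-analysis prefix, which is the point of the design: adding a target language adds a branch, not a second analysis.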
  9. 9. Incremental Source Analysis - 1
     [Diagram: Enter Pipeline, Format, RES1, RES2, with SAL rules and SEMTAB]
       – Clause segmentation
       – Homograph resolution
           ways of cooking lentils: cooking = V
           types of [cooking utensils]: cooking = ADJ
     Deterministic parsing requires that all ambiguous PoS be resolved (98% precision)
  10. 10. Incremental Source Analysis - 2
      [Diagram: Parse1, Parse2, Parse3, Parse4, leading to S, with SAL rules and Semtab]
        – Parse1: simple NP; semantic resolution
        – Parse2: NP Prep NP; relative clauses; semantic resolution
        – Parse3: verb semantics; complex NP; simple clauses; semantic resolution
        – Parse4: order in complex sentences; semantic resolution
      E.g.: a book on the presidency (on = about; concerning) ≠ a book on the table (on = over)
  11. 11. SAL Representation Language
      SAL: Semantico-syntactic Abstraction Language
      SAL taxonomy: 3 levels organized hierarchically
        – Supersets / Sets / Subsets
      Semantico-syntactic continuum from NL word to word class
        – Literal word: airport
        – Head morph: port
        – SAL Subset: Agfunc (agentive functional location)
        – SAL Set: func (functional location)
        – SAL Superset: PL (place)
        – Word Class: N
      Both the pipeline input stream and the rulebases are expressed in SAL
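The continuum on this slide can be pictured as a small record, sketched below. The class and field names are hypothetical, and the idea that a pattern element may name any level of the continuum is an illustrative assumption, not a description of the actual OpenLogos data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SalEntry:
    """One word with its levels, from literal word up to word class."""
    word: str
    head_morph: str
    subset: str
    set_: str
    superset: str
    word_class: str

    def continuum(self) -> List[str]:
        """Levels ordered from most specific to most general."""
        return [self.word, self.head_morph, self.subset,
                self.set_, self.superset, self.word_class]


# Values taken from the slide's "airport" example.
AIRPORT = SalEntry("airport", "port", "Agfunc", "func", "PL", "N")


def matches(entry: SalEntry, level: str) -> bool:
    """Assumption: a pattern element naming any level matches the entry."""
    return level in entry.continuum()


print(" -> ".join(AIRPORT.continuum()))
```

Under this assumption a rule written against PL (place) would apply to airport without mentioning the word itself, while a rule written against an unrelated superset would not.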
  12. 12. SAL Noun Supersets
      Developed:
        – inductively
        – by trial and error
        – over a period of years
        – by the development team
      E.g.: two pieces of cake
      The NP parse must have:
        – the plural morphology of pieces
        – the semantics of cake
  13. 13. Abstract Noun Taxonomy
      Abstract Noun Superset
        – Non-verbal Abstract Set → Non-verbal Subsets (e.g. Classifications)
        – Verbal Abstract Set → Verbal Subsets (e.g. Methods / Procedures)
  14. 14. Use of SAL Codes to Resolve Homographs
      Is the word cooking a verb or an adjective?
        – ways of cooking lentils
        – types of cooking utensils
      ways → N(AB/method) → parser verb bias
      types → N(AB/class) → non-verb bias
      The SAL code N(AB/method) in the rule matches a similar code in the SAL input stream. The effect of such a match is to resolve cooking as a verb.
      SAL contributes to the resolution of the homograph
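The bias mechanism on this slide can be sketched as a toy lookup. The two SAL codes and their biases come from the slide; the function, the "N of X-ing" context check, and the mapping of "non-verb" to ADJ (as in the slide 9 example) are simplifying assumptions.

```python
from typing import List, Optional

# SAL codes for the two head nouns, as given on the slide.
NOUN_SAL = {"ways": "N(AB/method)", "types": "N(AB/class)"}

# Bias each code gives to a following -ing homograph.
# "non-verb" is rendered as ADJ here, following the slide 9 example.
BIAS = {"N(AB/method)": "V", "N(AB/class)": "ADJ"}


def resolve_ing(tokens: List[str], index: int) -> Optional[str]:
    """Resolve tokens[index] (an -ing form) from the noun in 'N of X-ing'."""
    if index >= 2 and tokens[index - 1] == "of":
        code = NOUN_SAL.get(tokens[index - 2])
        if code in BIAS:
            return BIAS[code]
    return None  # unresolved: this toy has no other rules


print(resolve_ing("ways of cooking lentils".split(), 2))
print(resolve_ing("types of cooking utensils".split(), 2))
```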
  15. 15. What SAL Rules Look Like
      Rules have five components:
        – SAL pattern
            PARSE2 example: N(IN/data;u) Prep(“on”;u) N(u;u) (a book on the presidency)
        – Constraints
            Match only if conditions are true or false
        – Source actions
            RES rulebase: resolves syntactic ambiguity
            PARSE rulebase: creates parse tree
            SEMTAB rules: effect semantic disambiguation
        – Target action (optional)
            Effects syntactic and/or semantic transfer
        – Comment line
            PARSE2 example: NP(info) Prep(“on”) NP → N1 “about” N2
            E.g., book on political satire → book about ....
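The five components can be mirrored in a small data structure, sketched below under stated assumptions: the `Rule` class, its field types, the simplified tags, and the wording of the source action are all hypothetical, and only the pattern and comment line are modelled on the PARSE2 example from the slide.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str]  # (surface word, simplified SAL-like tag)


@dataclass
class Rule:
    pattern: List[str]                                           # 1. SAL pattern
    constraint: Callable[[List[Token]], bool]                    # 2. constraints
    source_action: str                                           # 3. source action (description only)
    target_action: Optional[Callable[[List[Token]], List[str]]]  # 4. optional target action
    comment: str                                                 # 5. comment line

    def apply(self, tokens: List[Token]) -> Optional[List[str]]:
        """Return the transferred words, or None if the rule does not match."""
        tags = [tag for _, tag in tokens]
        if tags != self.pattern or not self.constraint(tokens):
            return None
        if self.target_action is None:
            return [word for word, _ in tokens]
        return self.target_action(tokens)


BOOK_ON = Rule(
    pattern=["N(IN/data)", 'Prep("on")', "N"],
    constraint=lambda tokens: True,  # hypothetical: no extra conditions
    source_action="attach the prepositional phrase to the first noun",
    target_action=lambda tokens: [tokens[0][0], "about", tokens[2][0]],
    comment='NP(info) Prep("on") NP -> N1 "about" N2',
)

print(BOOK_ON.apply([("book", "N(IN/data)"), ("on", 'Prep("on")'), ("satire", "N")]))
```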
  16. 16. Classic Problem of RBMT
      Complexity
        – Logic saturation
        – Rulebase grows too large
        – Performance degradation
        – Difficult maintainability
        – System improvability stasis
      Ambiguity
        – Quality/accuracy of output depends on effective disambiguation
        – Effective disambiguation causes rulebase growth
      Classic dilemma of the developer
        – Reducing rulebase size to relieve complexity weakens disambiguation
        – Increasing rulebase size to address ambiguities increases complexity
  17. 17. How OpenLogos Addresses Complexity and Ambiguity
      Complexity
        – Rules and input stream are both expressed as SAL patterns
        – Homogeneous ‘apples-to-apples’ matching
        – Rules are SAL patterns stored and organized in an indexed pattern dictionary
        – The SAL input stream serves as the search argument to the SAL rulebase
        – No limit on rule size and no impact on performance
        – Rules are self-organizing
        – Rulebase is easy to maintain
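The "input stream as search argument" idea can be illustrated with an ordinary hash map keyed by pattern. This is a deliberately simplified stand-in, not the actual OpenLogos index: the rule names are hypothetical, matching is exact rather than hierarchical, and trying the longest pattern first is an assumption.

```python
from typing import Dict, List, Optional, Tuple

Pattern = Tuple[str, ...]

# Hypothetical rulebase: SAL-like patterns mapped to rule names.
RULEBASE: Dict[Pattern, str] = {
    ("N",): "default-noun",
    ("N", "Prep", "N"): "np-prep-np",
    ("N(IN/data)", 'Prep("on")', "N"): "info-on-topic",
}
MAX_LEN = max(len(pattern) for pattern in RULEBASE)


def find_rule(stream: List[str], start: int) -> Optional[str]:
    """Use the input stream itself as the search key, longest pattern first."""
    for length in range(min(MAX_LEN, len(stream) - start), 0, -1):
        key = tuple(stream[start:start + length])
        if key in RULEBASE:  # hash lookup: only candidate keys are examined
            return RULEBASE[key]
    return None


print(find_rule(["N(IN/data)", 'Prep("on")', "N"], 0))
```

Because each lookup touches at most `MAX_LEN` keys, adding rules to the dictionary does not add to the work done per input position, which is the property the slide attributes to the indexed pattern dictionary.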
  18. 18. How Rules Are Applied
      Metaphor: biological neural net
      As the analysis progresses:
        1- cells become fewer (abstract nature of the parse)
        2- vectors become lighter (semantic disambiguation)
      – Vectors labeled V1-V6 = SAL input stream of the pipeline
      – Cells in input vectors = SAL elements/words to which the NL input stream has been converted
      – In this network, R1 through P4 = hidden layers containing SAL rules
      – R1 represents RES1, P1 represents Parse1, and so on
      – Each hidden layer contains between 2 and 4 thousand rules, organized by their SAL pattern, as in a dictionary
  19. 19. How Rules Are Applied
      Metaphor: biological neural net
      Chief similarity
        – Efficient interaction between the SAL input stream and the rules of the hidden layers
        – Only those rules which should be looked at are accessed
        – The developer does not need to develop metarules or discrimination networks to achieve efficiency in rule matching
        – Efficiency in rule matching is an automatic by-product of system design
  20. 20. How OpenLogos Addresses Complexity and Ambiguity
      Ambiguity
        – Syntactic homograph resolution
        – Scoping of adjectives, prepositions
        – Polysemy
  21. 21. Resolution of Polysemy in OpenLogos
      SAL representation language in interaction with SEMTAB
      SEMTAB provides a transfer that overrides the default dictionary transfer for the verb “raise”

      NL String         SEMTAB Rule              Portuguese Transfer
      raise a child     V(‘raise’) N(ANdes)      criar ...
      raise corn        V(‘raise’) N(MAedib)     cultivar ...
      raise the rent    V(‘raise’) N(MEabs)      aumentar ...
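The override behaviour in this table can be sketched as a two-level lookup. The three SEMTAB entries and the object codes come from the slide; the default transfer "levantar", the tiny object lexicon, and the function itself are hypothetical illustrations, not OpenLogos data.

```python
from typing import Dict, Tuple

# Hypothetical default dictionary transfer (not given on the slide).
DEFAULT_TRANSFER: Dict[str, str] = {"raise": "levantar"}

# SEMTAB-style overrides keyed by (verb, SAL code of the object), from the slide.
SEMTAB: Dict[Tuple[str, str], str] = {
    ("raise", "ANdes"): "criar",
    ("raise", "MAedib"): "cultivar",
    ("raise", "MEabs"): "aumentar",
}

# SAL codes of the example objects, as implied by the slide's rows.
OBJECT_SAL: Dict[str, str] = {"child": "ANdes", "corn": "MAedib", "rent": "MEabs"}


def transfer(verb: str, obj: str) -> str:
    """Prefer a SEMTAB override; otherwise fall back to the default transfer."""
    code = OBJECT_SAL.get(obj, "")
    return SEMTAB.get((verb, code), DEFAULT_TRANSFER.get(verb, verb))


for noun in ("child", "corn", "rent"):
    print(noun, transfer("raise", noun))
```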
  22. 22. Deep Structure Rules of SEMTAB
      A single deep-structure rule matches multiple surface structures and produces the correct target transfers

      English                      Portuguese                   Surface structure
      he raised the rent           ele aumentou a renda         V + Object
      the raising of the rent      o aumento da renda           Gerund
      the rent, raised by …        a renda, aumentada por…      Part. ADJ
      a rent raise                 um aumento de renda          Noun
  23. 23. How SAL Benefits Translation
      Examples showing voice transformations: EN passive voice >>> FR active voice
        – The situation was alluded to by my friend in his letter
          Mon ami a fait allusion à la situation dans sa lettre
        – The situation was alluded to in their letter
          On a fait allusion à la situation dans leur lettre
      Voice transformations are possible due to:
        • incremental pipeline approach
        • strong semantic sensitivity
  24. 24. Advantages of OpenLogos Machine Translation Architecture
      Creation of systems involving small or neglected/endangered languages
        – not targeted by commercial programs
        – to fulfil the goals of administrations and NGOs dealing with these languages, contributing to their promotion and/or revival
      Freely available
        – any user can access the technology
      Customizable: institutions or businesses adopting open-source MT can customize the system to their needs in many ways
        – developing new linguistic data (vocabularies, rules, corpora)
        – integrating system/data with other packages
        – etc.
  25. 25. OpenLogos Uniqueness
      Extensible dictionaries with an underlying semantic foundation
      Analyses whole source sentences, considering:
        – Morphology
        – Meaning (semantics)
        – Grammatical structure and function
      Semantico-Syntactic Abstraction Language (SAL)
        – with SAL, the parser is able to achieve better results than syntactic analysis alone would allow
      Parsing is source language specific only; generation is target language specific
      Originally a transfer approach, it evolved into the present system, which has inherent interlingual features
  26. 26. OpenLogos Uniqueness
      OpenLogos' comprehensive analysis makes it possible to construct a complete and idiomatically correct translation in the target language
      OpenLogos is suitable for research and academic use
        – make OpenLogos the standard MT platform for universities, education and other governmental institutions
        – bring new life into a dormant technology (Phoenix rising metaphor)
      OpenLogos linguistic data representation can be established as the foundation
        – freely available for private and commercial use
        – there is still a need for linguistic and technical services and/or customer support on a fee basis
        – packaging OpenLogos with the top five Linux distributions will generate a constant revenue stream
      OpenLogos is an ideal platform for a hybrid MT solution
  27. 27. Contribution of OpenLogos Resources for New NLP Applications
      Initially, OpenLogos EN-PT dictionary data were adapted and enhanced with new properties (derivational, etc.) to create a new resource: Port4NooJ. ReEscreve uses Port4NooJ.
      SPIDER: System for Paraphrasing In Document Editing and Revision
        – Based on NooJ’s technology
        – Publicly available at:
        – Designed to help with writing optimization, but its applicability extends to MT pre-editing
        – 1st version: ReEscreve (for Portuguese) and ReWriter (for English)
        – 2nd version: eSPERTo (Portuguese: the smart/clever one; expert), designed for integration in a cyber school project within the scope of an educational program to teach students how to improve their writing skills in the Portuguese language
        – EXPERT (prototype): to assist the writing of domain-specific texts
  28. 28. Contribution of OpenLogos Resources for New NLP Applications
      ParaMT
        – Bilingual/multilingual paraphraser (translator prototype)
        – Uses a methodology similar to that employed by SPIDER
        – Uses bilingual data
        – Directly applicable to MT
      Corpógrafo
        – Multilingual corpora management tool
        – Available at:
  29. 29. Uses of SPIDER
        – Authoring aid (word processing applications)
        – Language composition tool
        – Text production and style editor
        – Empirical testbed for linguistic quality assurance
        – Text (pre-)editing (machine translation)
        – “Revision memory” tool (≈ “translation memory”)
        – Applicable to general and technical language
      When terminologies are integrated, it helps with writing in technical domains (e.g. student texts with ReWriter or legal texts with EXPERT)
  30. 30. ReEscreve: Suggestions for Text Rewriting
      Paraphrases of support verb constructions (SVC) presented by ReEscreve’s paraphrasing system
  31. 31. ReEscreve: a Rewritten Text
      Text rewritten based on the user’s preferences
      Users can suggest new expressions!
  32. 32. Suggestions for Text ReWriting
      Suggestions for general language linguistic phenomena
        – Compound adverbs > single adverbs
        – Relatives > participial adjectives
        – Support verb constructions > single verbs
  33. 33. Selection of paraphrasing grammars for specific linguistic phenomena
      Users can select among general and technical dictionaries (more than one selection allowed) and grammars for specific linguistic transformations (one, several or all grammars can be selected). The interface provides sample texts for testing.
      [Screenshot labels: informative details about the linguistic resources selected; sample LEGAL text]
  34. 34. Selection of a Domain Dictionary
      Identification of legal terms in the text
      Suggestions for the term “breach of law”
      Users can select one term from the list of suggestions or provide a new suggestion
  35. 35. Suggestions provided and user’s capability to add new rewriting options
      The user can suggest new words or expressions (synonyms or paraphrases)
      It is possible to go back and change the user option as many times as necessary
      Text rewritten
        • In red, the expressions in the source text
        • In green, suggestions provided by SPIDER and selected by the user
  36. 36. ParaMT: a Paraphraser Applicable to MT
      PT support verb construction > EN verbs
      MACHINE TRANSLATION
      Recognition of Portuguese SVC and translation into English verbs
  37. 37. Selected Publications on Paraphrasing Applications
        – Anabela Barreiro. "SPIDER: a System for Paraphrasing In Document Editing and Revision - Applicability in Machine Translation Pre-Editing". Computational Linguistics and Intelligent Text Processing. Proceedings of the 12th International Conference, Part II. Lecture Notes in Computer Science 6609 (2011), pp. 365-376. Springer. ISSN: 0302-9743. e-ISSN: 1611-3349. DOI: 10.1007/978-3-642-19400-9.
        – Anabela Barreiro. "ParaMT: a Paraphraser for Machine Translation". In António Teixeira, Vera Lúcia Strube de Lima, Luís Caldas de Oliveira & Paulo Quaresma (eds.), Computational Processing of the Portuguese Language, 8th International Conference, Proceedings (PROPOR 2008), Vol. 5190 (Aveiro, Portugal, 8-10 September 2008). Springer Verlag. Lecture Notes in Computer Science, pp. 202-211.
        – Anabela Barreiro & Luís Miguel Cabral. "ReEscreve: a translator-friendly multi-purpose paraphrasing software tool". In Marie-Josée Goulet, Christiane Melançon, Alain Désilets & Elliott Macklovitch (eds.), Proceedings of the Workshop Beyond Translation Memories: New Tools for Translators, The Twelfth Machine Translation Summit (Château Laurier, Ottawa, Ontario, Canada, 29 August 2009), pp. 1-8.
  38. 38. OpenLogos for Indian Languages
      Anusaaraka group at LTRC, IIIT-Hyderabad
        – Integrating OpenLogos in their English to Hindi language accessor
        – An OpenLogos-based English-Hindi MT prototype is already functional, but needs refinement before release
        – Chaudhury, S.; Rao, A.; Sharma, D. M. (2010). "Anusaaraka: An Expert System based Machine Translation System". In Proceedings of 2010 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE2010), Beijing, China, Aug 21-23, 2010.
      Kalinga Institute of Industrial Technology (KIIT)
        – Setting up a research lab with MT based on OpenLogos technology
  39. 39. Other Efforts with OpenLogos
      Department of Political, Social and Communication Sciences, University of Salerno
        – PhD dissertation in which the OpenLogos English-Italian SEMTAB rules methodology was applied, supported by the NooJ NLP environment, to represent the theoretical and methodological principles of Lexicon-Grammar Theory
        – Monti, Johanna (2013). Multi-word unit processing in Machine Translation. Developing and using linguistic resources for multi-word unit processing in Machine Translation
      Main Southern African universities
        – Initial efforts to establish OpenLogos as an MT platform for translation between English and the African languages (scarce resources, lack of parallel corpora, etc.), in an initiative similar to the one undertaken for Indian languages
  40. 40. OpenLogos Resources at DFKI
      The Language Technology Lab of DFKI has adapted OpenLogos from the commercial Logos System
      Also available at SourceForge under a GPL license
      OpenLogos employs only open source components:
        – Use of open source development tools and compilers, such as GCC
        – Replacement of non-open code and libraries
        – Use of open source databases instead of a commercial database; all language-specific resources have been converted to PostgreSQL
        – Use of open standards instead of vendor-specific protocols
        – As a proof of concept for the software migration, Linux is used as the target platform for the first open source release of Logos
  41. 41. OpenLogos Components
      Core code libraries of the server-side system and basic executables to start and run the system (APITest, logos_batch)
      Resources, such as analysis (RES) and transfer (TRAN) grammars for source and target languages, and a multi-language dictionary database
      Tools: LogosTermBuilder, user administration (LogosAdmin), command line tools (APITest, openlogos), and a multi-user GUI for initiating and inspecting translation jobs and results (LogosTransCenter)
  42. 42. DFKI User Assistance with OpenLogos
      DFKI hosts an open OpenLogos mailing list dedicated to discussion and exchange of information concerning OpenLogos developments and problems at:
      LinkedIn Discussion Group on OpenLogos Machine Translation
      OpenLogos Facebook page
  43. 43. Selected Publications
      A few publications and technical papers are available describing the SAL representation language, the system architecture and the workflow:
        – Anabela Barreiro, Bernard Scott, Walter Kasper and Bernd Kiefer. OpenLogos Rule-Based Machine Translation: Philosophy, Model, Resources, and Customization. In Machine Translation, volume 25, number 2, pp. 107-126, Springer, Heidelberg, 2011. ISSN: 0922-6567. DOI: 10.1007/s10590-011-9091-z
        – Bernard Scott and Anabela Barreiro. OpenLogos MT and the SAL Representation Language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation. Edited by Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Francis M. Tyers. Alicante, Spain: Universidad de Alicante, Departamento de Lenguajes y Sistemas Informáticos. 2–3 November 2009, pp. 19–26
        – Bernard Scott. The Logos Model: an Historical Perspective. In Machine Translation, vol. 18 (2003), pp. 1–72.
  44. 44. Towards OpenLogos Hybrid Translation, Anabela Barreiro