Open-source machine translation for Icelandic: the Apertium platform as an opportunity


  1. Outline: Concepts; Opportunities from open-source MT systems; Challenges; The Apertium platform; Apertium for Icelandic?; Concluding remarks
     Open-source machine translation for Icelandic: the Apertium platform as an opportunity
     Mikel L. Forcada (1, 2)
     (1) Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, E-03071 Alacant (Spain)
     (2) Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)
     April 18, 2008: Icelandic Language Technology Conference
     Running footer: Mikel L. Forcada, Open-source MT for Icelandic
  2. Contents
     1. Concepts
     2. Opportunities from open-source MT systems
     3. Challenges
     4. The Apertium platform
     5. Apertium for Icelandic?
     6. Concluding remarks
  3. Open-source and free software
     Open-source software is also called free software:
     0. anyone can use it for any purpose
     1. anyone can examine it to see how it works and modify it for any new purpose
     2. anyone can freely distribute it
     3. anyone may release an improved version so that everyone benefits
     For conditions 1 and 3 to be met, anyone should be able to access the source code, hence the name open source.
  4. Machine translation software/1
     MT is special: it strongly depends on data.
     • rule-based MT (RBMT): dictionaries, rules
     • corpus-based MT (CBMT): sentence-aligned parallel text, monolingual corpora
     Three components in every MT system:
     • The engine (also called decoder, recombinator, ...)
     • Data (linguistic data, corpora)
     • Tools to maintain these data and convert them to the format used by the engine
  5. Machine translation software/2
     I will focus on RBMT. Reasons:
     • CBMT requires massive amounts of sentence-aligned parallel text (is there such a resource for Icelandic?).
     • RBMT may use linguistic data elicited by speakers without access to existing machine-readable resources.
     • RBMT is more transparent: errors are easier to diagnose and debug.
     • I am more familiar with RBMT!
  6. MT software/3: commercial machine translation
     • Most commercial MT systems are RBMT (exceptions: LanguageWeaver, Google Labs).
     • They use proprietary technologies which are not disclosed (perceived as their main competitive advantage).
     • Only partial modification (customization) of linguistic data is allowed.
  7. MT software/4: open-source machine translation
     For MT to be open-source, the engine, the data and the tools must all be open-source. In the case of CBMT this means that corpora must also be open.
  8. Commercial MT systems and small languages: limited opportunities
     The main MT companies target major world languages. Not Icelandic...
     Some closed-source systems:
     • TranExp’s InterTran offers en↔is “interactive translation” (with limited lexical coverage): test at http://www.translation-guide.com/free_online_translators.php?from=Icelandic&to=English
     • Stefán Briem’s prototypes for is↔en or is↔da may be tested at tungutorg.is.
     • A company named ESTeam (www.esteam.gr) is also listed as offering MT for Icelandic.
     It is very hard to adapt closed, commercial MT systems to small languages.
  9. Opportunities from open-source MT systems
     Even if reasonable-quality closed-source MT is available, the development and use of open-source MT systems provides additional opportunities:
     • Increases language expertise and resources
     • Increases technological independence
  10. Increasing expertise and language resources
     When building an open-source MT system for a small language, a variety of situations may occur. All of them involve building small-language expertise and resources through
     • reflection about the small language
     • elicitation of linguistic (monolingual and bilingual) knowledge about it
     • subsequent encoding of this knowledge
     The open-source setting makes new expertise and resources naturally available to the community. Three scenarios may occur:
  11. Building data for an existing MT engine from scratch
     One needs:
     • A freely available (open-source or not) MT engine
     • Freely available (open-source or not) tools to manage linguistic data
     • Complete documentation on how to build linguistic data for use with the engine and tools
     This is a very unfavourable setting. Many decisions have to be made, e.g., defining the set of lexical categories and inflection indicators. The blank sheet syndrome may paralyze the project. If overcome, the expertise acquired and the resulting open-source data could be improved or used for other purposes: a positive effect on the small language.
  12. Building data for an existing MT engine from existing language-pair data
     If free tools and engine and open-source data are available for another pair with a similar or related language, the blank sheet syndrome is drastically reduced. One could, for example:
     • use the same set of lexical categories and inflection indicators
     • build inflection paradigms on top of existing ones
  13. Adapting a new open-source engine or tools for a new language pair
     If source code is available for the engine and tools, experts could enhance or adapt them to address new features of the small language not dealt with adequately by the current code:
     • character sets
     • structural transfer not powerful enough, etc.
     This is more challenging than building new data, but programmers do not need to have full command of the small language (abstract management of linguistic issues is possible). Code rewriting would add expertise and resources to the language community.
  14. Increasing technological independence
     Having an open-source engine, tools and data makes users of the small language less dependent on a single commercial, closed-source provider. This has an analogous effect, not only on machine translation, but also on other language technologies.
  15. Organizing community development/1
     Assume we are just developing linguistic data. Open-source makes it possible for a small-language community to collaboratively develop machine translation for it. Some small languages have people with good linguistic and translation skills (this is the case of Icelandic). However, the availability of human resources with language and translation skills is necessary but not sufficient.
  16. Organizing community development/2
     Some structure is necessary. Ideally:
     • A coordinating team mastering the engine and tools used is needed to lead the effort, including:
       • code coordinators (installation, maintenance, modifications to the code)
       • linguistic coordinators (linguistic data maintenance)
     • A project web server
       • to distribute the latest version of the system
       • to execute it online
       • for developers to contribute new linguistic data or code
     • A group of skilled developers, certified in some sense by the coordinating team.
  17. Eliciting linguistic knowledge
     Existing linguistic knowledge has to be made explicit (elicited) to contribute it to the system. Elicitation of lexical knowledge is possible through well-designed web form interfaces:
     • to provide the lemmas of the source and target word
     • to select the inflection paradigm of the source and target word
     • to establish the scope of the equivalence (bidirectional, left-to-right, right-to-left).
     Elicitation of other knowledge (e.g., structural transfer rules) is harder (a subject of research indeed).
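The web-form idea above can be made concrete with a small sketch. It turns the fields such a form might collect (two lemmas, a lexical category, and the scope of the equivalence) into one XML entry. The element and attribute names (`e`, `p`, `l`, `r`, `s`, and the `r="LR"`/`r="RL"` restriction) are modelled on the general shape of Apertium bilingual dictionaries, but they are assumptions here, as are the function name `make_entry` and the scope labels. Paradigm selection, which belongs in the monolingual dictionaries, is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the form's "scope" choice to a direction
# restriction; None means the entry applies in both directions.
SCOPE_TO_RESTRICTION = {
    "bidirectional": None,
    "left-to-right": "LR",
    "right-to-left": "RL",
}


def make_entry(sl_lemma, tl_lemma, category, scope="bidirectional"):
    """Build one bilingual entry from elicited form fields (illustrative)."""
    entry = ET.Element("e")
    restriction = SCOPE_TO_RESTRICTION[scope]
    if restriction is not None:
        entry.set("r", restriction)
    pair = ET.SubElement(entry, "p")
    # Left side holds the source lemma, right side the target lemma;
    # each carries a symbol element naming its lexical category.
    for side, lemma in (("l", sl_lemma), ("r", tl_lemma)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        ET.SubElement(node, "s", {"n": category})
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    print(make_entry("hestur", "horse", "n", "left-to-right"))
```

A contributor never sees the XML; the form validates the fields and the coordinating team reviews the generated entries before they are merged.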
  18. Simplicity of linguistic knowledge needed
     To encourage and ease collaborative development, the level of linguistic knowledge necessary to start building a new MT system should be kept to a minimum (basic high-school grammar skills and concepts). This is rather easy in shallow-transfer MT systems, but very difficult (if not impossible) for deep-transfer systems. Well-written documentation may be very helpful. Having someone available online to answer questions is even better.
  19. Standardization and documentation of linguistic data formats
     Adequate documentation of the format of linguistic data is crucial. The way: using XML. Why?
     • Each data item is explicitly labeled with a descriptive, named tag with a clear meaning attached
     • The structure of documents may easily be validated against DTDs or schemas
     • Many technologies exist for XML (converting from and to XML, interoperability).
  20. Modularity
     The emphasis of open-source is the reusability of code and linguistic data to build new MT systems or other language-technology applications. For that objective modularity is a must. A modular engine induces modularity in its data. For example, having an independent morphological analyser and an independent morphological dictionary
     • makes it easier to build an MT system for a different target language
     • may be used to build an intelligent search engine (inflection-independent search)
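As an illustration of the second reuse, here is a minimal sketch of inflection-independent search built on top of a stand-alone analyser. The tiny `ANALYSES` table is a stand-in for a real morphological dictionary, and the names `lemma_of` and `search` are hypothetical.

```python
# Toy stand-in for a morphological analyser: surface form -> lemma.
# The entries cover a few inflected forms of Icelandic "hestur" (horse).
ANALYSES = {
    "hestur": "hestur",
    "hest": "hestur",
    "hesti": "hestur",
    "hests": "hestur",
    "hestar": "hestur",
}


def lemma_of(word):
    """Return the lemma of a word, or the lowercased word if unknown."""
    return ANALYSES.get(word.lower(), word.lower())


def search(query, documents):
    """Return the documents containing any inflected form of the query."""
    target = lemma_of(query)
    return [
        doc
        for doc in documents
        if any(lemma_of(token) == target for token in doc.split())
    ]


if __name__ == "__main__":
    docs = ["Ég sá hest", "Ég sá hund"]
    print(search("hestur", docs))
```

Because the search only calls the analyser, the same dictionary serves both the MT system and the search application without modification.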
  21. Background
     Apertium is based on the technologies developed by the Transducens group at the Universitat d’Alacant during the development of two existing systems:
     • interNOSTRUM (interNOSTRUM.com, Spanish–Catalan)
     • Tradutor Universia (tradutor.universia.net, Spanish–Portuguese)
     These technologies, initially designed for related-language pairs, have been extended to handle language pairs which are not so related.
  22. Rationale/1
     To generate translations which are reasonably intelligible and easy to correct between related languages such as Spanish (es) and Catalan (ca) or Portuguese (pt), etc., or Nynorsk (nn), Bokmål (no) and Icelandic (is), one can just augment word-for-word translation with
     • robust lexical processing (including multi-word units)
     • lexical categorial disambiguation (part-of-speech tagging)
     • local structural processing based on simple and well-formulated rules for frequent structural transformations (reordering, agreement)
  23. Rationale/2
     For harder, not so related, language pairs:
     • It should be possible to build on that simple model.
     • It should be possible to generalize its concepts so that complexity is kept as low as possible.
  24. Rationale/3
     It should be possible to generate the whole system from linguistic data (monolingual and bilingual dictionaries, grammar rules) specified in a declarative way. This information should be provided in an interoperable format ⇒ XML. These are the different types of data:
     • (language-independent) rules to treat text formats
     • specification of the part-of-speech tagger
     • morphological and bilingual dictionaries and dictionaries of orthographical transformation rules
     • structural transfer rules
  25. Rationale/4
     • It should be possible to have a single generic (language-independent) engine reading language-pair data (“separation of algorithms and data”).
     • Language-pair data should be preprocessed so that the system is fast (>10,000 words per second) and compact; for example, lexical transformations are performed by minimized finite-state transducers (FSTs).
  26. Rationale/5
     Reasons for the open-source development of Apertium:
     • To give everyone free, unlimited access to the best possible machine-translation technologies.
     • To establish a modular, documented, open platform for shallow-transfer machine translation and other human language processing tasks.
     • To favour the interchange and reuse of existing linguistic data.
     • To make integration with other open-source technologies easier.
  27. Rationale/6
     More reasons for open-source development of Apertium:
     • To benefit from collaborative development
       • of the machine translation engine
       • of language-pair data for currently existing or new language pairs
       by industry, academia and small-language support organizations.
     • To help shift MT business from the obsolescent licence-centered model to a service-centered model.
     • To radically guarantee the reproducibility of machine translation and natural language processing research.
     • Because it does not make sense to use public funds to develop non-free, closed-source software.
  28. The Apertium platform
     Apertium is an open-source machine translation platform (http://www.apertium.org) providing:
     1. An open-source modular shallow-transfer machine translation engine with:
        • text format management
        • finite-state lexical processing
        • statistical lexical disambiguation
        • shallow transfer based on finite-state pattern matching
     2. Open-source linguistic data in well-specified XML formats for a variety of language pairs
     3. Open-source tools: compilers to turn linguistic data into a fast and compact form used by the engine, and software to learn disambiguation or structural transfer rules.
  29. The Apertium engine/1
     SL text
       → De-formatter
       ↓ Morphological analyser [← FST]
       ↓ Categorial disambiguator [← FST + stat.]
       ↓ [rules →] Structural transfer ↔ Lexical transfer [← FST]
       ↓ Morphological generator [← FST]
       ↓ Post-generator [← FST]
       ↓ Re-formatter
       → TL text
  30. The Apertium engine/2
     Communication between modules: text (Unix “pipelines”). Advantages:
     • Simplifies diagnosis and debugging
     • Allows the modification of data between two modules using, e.g., filters
     • Makes it easy to insert alternative modules (interesting for research and development purposes)
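The pipeline design can be sketched in a few lines, assuming each module is simply a function from text to text. The three stand-in modules below (`analyse`, `transfer`, `generate`) are hypothetical placeholders, not Apertium's programs; the point is only that the intermediate text can be inspected after every stage and that any stage can be swapped or a filter inserted.

```python
def analyse(text):
    """Stand-in analyser: wrap every token as ^token$."""
    return " ".join("^" + word + "$" for word in text.split())


def transfer(text):
    """Stand-in transfer: replace one known unit."""
    return text.replace("^hestur$", "^horse$")


def generate(text):
    """Stand-in generator: strip the unit delimiters."""
    return text.replace("^", "").replace("$", "")


def run_pipeline(text, modules, trace=None):
    """Feed text through the modules in order.

    If a list is passed as trace, the output of every stage is recorded,
    which mirrors inspecting the text flowing between two piped programs.
    """
    for module in modules:
        text = module(text)
        if trace is not None:
            trace.append((module.__name__, text))
    return text


if __name__ == "__main__":
    stages = []
    print(run_pipeline("hestur", [analyse, transfer, generate], trace=stages))
    for name, output in stages:
        print(name, "->", output)
```

Inserting an alternative module is a matter of changing the list passed to `run_pipeline`, for instance replacing `transfer` with an experimental function that has the same text-in, text-out signature.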
  31. De-formatter
     • Separates text from format information.
     • Currently available for ISO-8859 or UTF-8 plain text, HTML, RTF, ODF, OpenOffice.org .sxw, etc.
     • Based on finite-state techniques.
     • Most of these filters are generated (using an XSLT stylesheet) from an XML de-formatter specification file.
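A minimal sketch of the separation step for HTML, assuming format information is set aside by wrapping each tag in square brackets so that later modules can skip it. This regex-based toy is not how the generated filters work internally, it ignores literal brackets in the text (a real filter would have to escape them), and the function names are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")
FORMAT_BLOCK = re.compile(r"(\[[^\]]*\])")


def deformat(html):
    """Wrap every HTML tag in square brackets, leaving the text untouched."""
    return TAG.sub(lambda match: "[" + match.group(0) + "]", html)


def process_text(stream, function):
    """Apply function to the text between format blocks only."""
    parts = FORMAT_BLOCK.split(stream)
    # With a capturing group, re.split puts the format blocks at odd indices.
    return "".join(
        part if index % 2 else function(part)
        for index, part in enumerate(parts)
    )


def reformat(stream):
    """Undo deformat: unwrap the bracketed tags."""
    return re.sub(r"\[(<[^>]+>)\]", r"\1", stream)


if __name__ == "__main__":
    stream = deformat("<b>hestur</b> og <i>hundur</i>")
    print(stream)
    print(reformat(process_text(stream, str.upper)))
```

Here `str.upper` stands in for the whole translation chain: the tags pass through unchanged and are restored around the processed text at the end.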
  32. Morphological analyser
     • segments the source text in surface forms (SFs),
     • assigns to each SF one or more lexical forms (LFs), each one with:
       • lemma
       • lexical category (part-of-speech)
       • morphological inflection information
     • processes contractions (en: can’t = can + not; is: talarðu = talar + þú, ertu = ert + þú) and multi-word units, which may be invariable (is: með öðrum orðum, við hliðina á) or variable (is: brjóta af sér → braut af sér).
     • reads finite-state transducers generated from a morphological dictionary in XML (using a compiler).
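The analyser's job can be illustrated with a toy lookup, assuming a plain Python dictionary in place of the compiled finite-state transducer. The output notation (`^surface/analysis/analysis$`, unknown words marked with `*`) is modelled on the text stream Apertium modules exchange, but the tag names, the lexicon entries and the way the contraction is split into two units are all illustrative assumptions.

```python
# Toy lexicon: surface form -> list of lexical forms (lemma plus tags).
# "á" is ambiguous: preposition, noun ("river") or a form of "eiga" (to own).
LEXICON = {
    "ert": ["vera<vblex><pri><p2><sg>"],
    "þú": ["þú<prn><p2><sg><nom>"],
    "á": ["á<pr>", "á<n><f><sg><nom>", "eiga<vblex><pri><p3><sg>"],
}

# Contractions are expanded into their parts before lookup.
CONTRACTIONS = {"ertu": ["ert", "þú"]}


def analyse(text):
    """Return one ^surface/analyses$ unit per surface form."""
    units = []
    for token in text.lower().split():
        for surface in CONTRACTIONS.get(token, [token]):
            # Unknown words keep their surface form, flagged with an asterisk.
            forms = LEXICON.get(surface, ["*" + surface])
            units.append("^" + surface + "/" + "/".join(forms) + "$")
    return " ".join(units)


if __name__ == "__main__":
    print(analyse("ertu"))
    print(analyse("á"))
```

The three analyses returned for "á" are exactly the kind of ambiguity the next module, the categorial disambiguator, has to resolve.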
  33. Categorial disambiguator (part-of-speech tagger)
     • picks one of the LFs corresponding to each ambiguous SF (about 30% of them) according to context
     • uses hidden Markov models and hand-written constraint rules
     • is trained using representative corpora for the source language (manually disambiguated or not) or, recently, using statistical models for the TL
     • its behavior is completely specified by an XML file
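How context picks one analysis can be sketched with a stripped-down Viterbi search. This is a simplification, not the tagger described above: it scores only tag-to-tag transitions among the candidate tags of each word (emission probabilities are treated as uniform and no constraint rules are applied), and the probabilities in `START` and `TRANS` are invented for the example.

```python
def viterbi(candidates, start, trans, default=1e-6):
    """Pick the most probable tag sequence.

    candidates: one list of possible tags per word.
    start: tag -> probability of starting a sentence with that tag.
    trans: (previous_tag, tag) -> transition probability.
    Unseen events get the small probability `default`.
    """
    # best maps each tag of the current word to (probability, path so far).
    best = {tag: (start.get(tag, default), [tag]) for tag in candidates[0]}
    for tags in candidates[1:]:
        best = {
            tag: max(
                (
                    (prob * trans.get((prev, tag), default), path + [tag])
                    for prev, (prob, path) in best.items()
                ),
                key=lambda item: item[0],
            )
            for tag in tags
        }
    return max(best.values(), key=lambda item: item[0])[1]


# Invented numbers for "hann á hest" (he owns a horse), where "á" could be
# a preposition, a noun or a verb form.
START = {"prn": 0.3}
TRANS = {
    ("prn", "vblex"): 0.6,
    ("prn", "pr"): 0.2,
    ("prn", "n"): 0.05,
    ("vblex", "n"): 0.5,
    ("pr", "n"): 0.6,
    ("n", "n"): 0.1,
}

if __name__ == "__main__":
    print(viterbi([["prn"], ["pr", "n", "vblex"], ["n"]], START, TRANS))
```

With these numbers the verb reading of "á" wins because the path pronoun, verb, noun has the highest product of transition probabilities.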
  34. Structural transfer/1
     • It is based on finite-state techniques (finite-state recognizers).
     • The XML transfer-rule file is preprocessed for faster interpreting.
     • Rules have a pattern–action form.
     • It detects LF patterns to be processed using a left-to-right, longest-match strategy.
     • It executes the actions associated to each pattern in the rule file to generate the corresponding LF pattern for the TL.
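The left-to-right, longest-match strategy can be sketched as follows, assuming lexical forms are reduced to (lemma, category) pairs and rules are a dictionary from category patterns to action functions. The two reordering rules are invented for illustration, and a real rule file would be compiled to a finite-state recognizer rather than tried length by length as here.

```python
def transfer(units, rules):
    """Apply pattern-action rules left to right, preferring the longest match.

    units: list of (lemma, category) pairs.
    rules: dict mapping a tuple of categories to a function that takes the
           matched units and returns the units to output.
    """
    longest = max((len(pattern) for pattern in rules), default=0)
    output, position = [], 0
    while position < len(units):
        # Try the longest window first, then shrink it.
        for length in range(min(longest, len(units) - position), 0, -1):
            window = units[position:position + length]
            pattern = tuple(category for _, category in window)
            if pattern in rules:
                output.extend(rules[pattern](window))
                position += length
                break
        else:
            # No rule matched here: copy one unit and move on.
            output.append(units[position])
            position += 1
    return output


# Hypothetical rules moving the adjective after the noun.
RULES = {
    ("adj", "n"): lambda w: [w[1], w[0]],
    ("det", "adj", "n"): lambda w: [w[0], w[2], w[1]],
}

if __name__ == "__main__":
    sentence = [("the", "det"), ("white", "adj"), ("horse", "n"), ("runs", "vblex")]
    print(transfer(sentence, RULES))
```

In the example the three-unit pattern is chosen at the first position even though the two-unit pattern would also match one position later, which is what longest-match means in practice.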
  35. Structural transfer/2
     For “harder” language pairs, a three-stage structural transfer is available:
     1. Patterns of LFs (chunks) are detected, processed and marked.
     2. Patterns of chunks are detected and processed: this interchunk processing allows for longer-range (“inter-chunk”) syntactic transformations.
     3. The output chunks are finished and the resulting LFs are written.
  36. Lexical transfer module
     • reads each SL LF and generates the corresponding TL LF
     • reads finite-state transducers generated from bilingual dictionaries in XML (using a compiler)
     • is invoked by the structural transfer module
  37. Morphological generator
     • Generates, from each TL LF, a TL SF, after adequately inflecting it.
     • It reads finite-state transducers generated from a morphological dictionary in XML (using a compiler).
  38. Post-generator
     • Performs some TL orthographical transformations, such as contractions (ca: de + els → dels; en: can + not → cannot), inserting apostrophes (ca: de + amics → d’amics), etc.
     • It is based on finite-state transducers generated from a post-generation rule dictionary (using a compiler).
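The two Catalan transformations mentioned above can be sketched with regular expressions, as a stand-in for the compiled transducers. Only these two lowercase cases are covered (elision before initial "h", capitalisation and other contractions are ignored), the straight apostrophe is used, and the function name is hypothetical.

```python
import re

VOWELS = "aeiouàèéíòóú"


def postgenerate(text):
    """Apply two Catalan orthographical rules to generated text."""
    # Contraction first, so that "de els" does not become "d'els".
    text = re.sub(r"\bde els\b", "dels", text)
    # Elision of "de" before a word starting with a vowel.
    text = re.sub(rf"\bde (?=[{VOWELS}])", "d'", text)
    return text


if __name__ == "__main__":
    print(postgenerate("la casa de els amics"))
    print(postgenerate("un grup de amics"))
```

The order of the two substitutions matters, which is one reason such rules are easier to maintain declaratively in a rule dictionary than as ad hoc code.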
  39. Re-formatter. Integrates format information (plain ISO-8859 or UTF-8 text, HTML, RTF, ODT, OpenOffice.org .sxw, etc.) into the translated text. Also used to modify URLs in links for translate-as-you-surf. It is based on finite-state techniques and is generated (using an XSLT stylesheet) from an XML de-formatter specification file.
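The idea of setting format information aside and restoring it around the translated text can be shown with a small Python sketch for HTML-like input. It is an assumption-laden simplification: a single regular expression plays the role of both the de-formatter and the re-formatter, and `translate` is a hypothetical callable standing in for the rest of the pipeline.

```python
import re

# Anything that looks like a tag is treated as format information.
TAG = re.compile(r"(<[^>]+>)")


def translate_preserving_format(html, translate):
    """Translate only the text between tags and put the tags back in place."""
    parts = TAG.split(html)  # the capturing group keeps the tags in the result
    pieces = []
    for part in parts:
        if TAG.fullmatch(part) or not part:
            pieces.append(part)             # format information: left untouched
        else:
            pieces.append(translate(part))  # text: sent through the translator
    return "".join(pieces)


# str.upper is used here as a stand-in "translator".
demo = translate_preserving_format("<p>hello <b>world</b></p>", str.upper)
```

Keeping the format out of the translator's sight means the linguistic modules only ever deal with text, whatever the document format.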
  40. Language-pair data. The Apertium project hosts the development of a large number of language pairs. Stable language pairs include: es↔ca, es↔gl, es↔pt, en↔ca, en↔es, es↔fr, ca↔oc, ro→es, es→eo, ca→eo. There is also a growing number of language pairs under development; some include Scandinavian languages (da, sv, nn, nb).
  41. Project funding. Funded by: the Ministry of Industry, Tourism and Commerce of Spain (also, the Ministries of Education and Science and of Science and Technology of Spain); the Secretariat for Technology and the Information Society of the Government of Catalonia; the Ministry of Foreign Affairs of Romania; the Universitat d’Alacant; companies: Prompsit Language Engineering, ABC Enciklopedioj, etc.
  42. The Apertium community /1. Not the ideal community development situation, but close. In addition to the original (funded) developers, a community has formed around the platform (instigated by Francis Tyers). More than 60 developers at sourceforge.net/projects/apertium/, many outside the original group; the code is updated very frequently, with hundreds of SVN commits per month. A collectively maintained wiki shows the current development and offers tips for people building new language pairs or code.
  43. The Apertium community /2. Externally developed tools and code: the graphical user interface apertium-tolk and the diagnostic tool apertium-view; plugins for OpenOffice.org and for the Pidgin (previously Gaim) messaging program; Windows ports, etc. Many people gather and interact in the #apertium IRC channel (at freenode.net). Stable packages have been ported to Debian GNU/Linux (and the next Ubuntu).
  44. Apertium for Icelandic /1. To build, for instance, a GPL apertium-is-en prototype: one could reuse the en dictionaries in apertium-en-ca or apertium-en-es (analysis and generation), and the part-of-speech taggers too; one should build an is dictionary by (a) getting some inspiration from existing (incomplete) data in Apertium for sv, da, fo...; (b) using Wiktionary [an experiment by Francis Tyers: http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/incubator/apertium-fo-is.is.dix?view=markup]; (c) convincing the authors of icemorphy or tungutorg to release (part of) their data under the GPL license.
  45. Apertium for Icelandic /2. One could train an is part-of-speech tagger, perhaps with some help from icetagger or tungutorg. One should build a bilingual is–en dictionary, for instance by completing the English and Icelandic dictionaries in Ergane, or by modifying bilingual dictionaries learned from a sentence-aligned bilingual corpus using Caseli et al.’s ReTraTos (sf.net/projects/retratos). One could then use Sánchez-Martínez and Forcada’s method to learn an initial set of structural transfer rules using the same or a different corpus, and then refine it. A prototype would be available in 1 person·year! Who dares?
  46. Apertium for Icelandic /3. Is the time right? The Government of Iceland has agreed on a “Policy on Free and Open-source Software” (“Stefna um frjálsan og opinn hugbúnað”, Mar. 11, 2008). “Giving access to the source code expands the opportunities for adapting and examining security aspects of the software, in addition to allowing for its further development if the producers discontinue it for some reason.” “There is a great need to increase the return on public body investments in software design. [...] Once software has been prepared, it is important that it has the potential of being reused [...] Reusability can be achieved by [...] ensuring that it is free and open-source.”
  47. Concluding remarks /1. Icelandic, like any other living language, however small, needs machine translation and has the right to it! The development of open-source MT for Icelandic can have specific, additional effects (increasing expertise, contributing reusable resources, reducing technological dependency). Apertium eases this task. Development of MT for a small language faces a number of challenges: elicitation of linguistic knowledge, need for standard formats, modularity. Apertium offers the last two. Of course, I will be happy to discuss these conclusions!
  48. Takk fyrir! (Thank you!) Thanks to Hrafn Loftsson and the rest of the colleagues at Reykjavík University and the University of Iceland for inviting me to this conference and making me feel at home. Thank you all for your attention.
  49. I should practice what I preach... This work may be distributed under the terms of the Creative Commons Attribution–Share Alike license (http://creativecommons.org/licenses/by-sa/3.0/) or the GNU GPL v. 3.0 License (http://www.gnu.org/licenses/gpl.html). Dual license! E-mail me to get the sources: mlf@ua.es
