Concepts
    Opportunities from open-source MT systems
                                    Challenges
                         The Apertium platform
                         Apertium for Icelandic?
                           Concluding remarks




Open-source machine translation for Icelandic:
  the Apertium platform as an opportunity

                                  Mikel L. Forcada1,2
   1 Departament     de Llenguatges i Sistemes Informàtics, Universitat d’Alacant,
                              E-03071 Alacant (Spain)
 2 Prompsit   Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)


April 18, 2008: Icelandic Language Technology Conference


                               Mikel L. Forcada    Open-source MT for Icelandic


Contents

  1   Concepts

  2   Opportunities from open-source MT systems

  3   Challenges

  4   The Apertium platform

  5   Apertium for Icelandic?

  6   Concluding remarks




Open-source and free software

  Open-source software is also called free software:
   0   anyone can use it for any purpose
   1   anyone can examine it to see how it works and modify it for
       any new purpose
   2   anyone can freely distribute it
   3   anyone may release an improved version so that everyone
       benefits
  For conditions 1 and 3 to be met, anyone should be able to
  access the source code, hence the name open source.




Machine translation software/1


     MT is special: it strongly depends on data
            rule-based MT (RBMT): dictionaries, rules
            corpus-based MT (CBMT): sentence-aligned parallel text,
            monolingual corpora
     Three components in every MT system:
            The engine (also decoder, recombinator, ...)
            Data (linguistic data, corpora)
            Tools to maintain these data and convert them to the format
            used by the engine





Machine translation software/2


  I will focus on RBMT. Reasons:
      CBMT requires massive amounts of sentence-aligned
      parallel text (is there such a resource for Icelandic?).
      RBMT may use linguistic data elicited by speakers without
      access to existing machine-readable resources.
      RBMT is more transparent: errors are easier to diagnose
      and debug.
      I am more familiar with RBMT!





MT software/3 : commercial machine translation



     Most commercial MT systems are RBMT (but:
     LanguageWeaver, Google Labs).
     They use proprietary technologies which are not disclosed
     (perceived as their main competitive advantage).
     Only partial modification (customization) of linguistic data
     is allowed.






MT software/4: open-source machine translation



     For MT to be open-source, the engine, the data and the
     tools must all be open-source.
     In the case of CBMT this means that corpora must also be
     open.






Commercial MT systems and small languages: limited
opportunities
     The main MT companies target major world languages.
     Not Icelandic... Some closed-source systems:
            TranExp’s InterTran offers en↔is “interactive translation”
            (with limited lexical coverage): test at
            http://www.translation-guide.com/free_online_translators.php?from=Icelandic&to=English
            Stefán Briem’s prototypes for is↔en or is↔da may be
            tested at tungutorg.is.
            A company named ESTeam (www.esteam.gr) is also
            listed as offering MT for Icelandic.
     It is very hard to adapt closed, commercial MT systems to
     small languages


Opportunities from open-source MT systems



     Even if reasonable-quality closed-source MT is available,
     the development and use of open-source MT systems
     provides additional opportunities:
            Increases language expertise and resources
            Increases technological independence






Increasing expertise and language resources

     When building an open-source MT system for a small
     language, a variety of situations may occur.
     All of them involve building small-language expertise and
     resources through
            reflection about the small language
            elicitation of linguistic (monolingual and bilingual)
            knowledge about it
            subsequent encoding of this knowledge
     The open-source setting makes new expertise and
     resources naturally available to the community.
     Three scenarios may occur:



Building data for an existing MT engine from scratch
     One needs:
            A freely available (open-source or not) MT engine
            Freely available (open-source or not) tools to manage
            linguistic data
            Complete documentation on how to build linguistic data for
            use with the engine and tools
     This is a very unfavourable setting. Many decisions have to
     be made, e.g., defining the set of lexical categories and
     inflection indicators.
     The blank sheet syndrome may paralyze the project.
     If overcome, the expertise acquired and the resulting
     open-source data could be improved or used for other
     purposes: positive effect on the small language.


Building data for an existing MT engine from existing
language-pair data


      If free tools and engine and open-source data are available
      for another pair with a similar or related language, the
      blank sheet syndrome is drastically reduced. One could,
      for example:
            use the same set of lexical categories and inflection
            indicators
            build inflection paradigms on top of existing ones
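
      As a rough illustration of building paradigms on top of existing
      ones, the sketch below derives a new paradigm by overriding one
      ending of an existing one. The paradigm names, the tag names and
      the dictionary layout are assumptions made for this example; they
      are not the format of any particular MT engine.

```python
# Hypothetical paradigm table: paradigm name -> {inflection tag: suffix}.
PARADIGMS = {
    "noun_s": {"sg": "", "pl": "s"},  # house / houses
}


def derive_paradigm(base_name, new_name, overrides):
    """Register a new paradigm that copies an existing one and changes some endings."""
    PARADIGMS[new_name] = {**PARADIGMS[base_name], **overrides}
    return PARADIGMS[new_name]


def inflect(stem, paradigm_name):
    """Return every inflected form of a stem under the given paradigm."""
    return {tag: stem + suffix for tag, suffix in PARADIGMS[paradigm_name].items()}


# Reuse the existing paradigm and change only the plural ending.
derive_paradigm("noun_s", "noun_es", {"pl": "es"})
```

      Here inflect("box", "noun_es") reuses the singular ending of the
      existing paradigm and only the plural ending had to be specified.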






Adapting a new open-source engine or tools for a new
language pair
      If source code is available for the engine and tools, experts
      could enhance or adapt them to address new features of
      the small language not dealt with adequately by the current
      code:
             character sets
             structural transfer not powerful enough, etc.
      More challenging than building new data
      But programmers do not need to have full command of the
      small language (abstract management of linguistic issues
      possible).
  Code rewriting would add expertise and resources to the
  language community.


Increasing technological independence



     Having an open-source engine, tools and data makes
     users of the small language less dependent on a single
     commercial, closed-source provider.
     This has an analogous effect, not only on machine
     translation, but also on other language technologies.






Organizing community development/1


     Assume we are just developing linguistic data.
     Open-source makes it possible for a small-language
     community to collaboratively develop machine translation
     for it.
     Some small languages have people with good linguistic
     and translation skills (this is the case of Icelandic).
     However, having people with language and translation
     skills is necessary but not sufficient.





Organizing community development/2
  Some structure is necessary. Ideally:
     A coordinating team mastering the engine and tools used
     is needed to lead the effort, including:
             code coordinators (installation, maintenance, modifications
             to the code)
             linguistic coordinators (linguistic data maintenance)
      A project web server
             to distribute the latest version of the system
             to execute it online
             for developers to contribute new linguistic data or code
      A group of skilled developers, certified in some sense by
      the coordinating team.



Eliciting linguistic knowledge

      Existing linguistic knowledge has to be made explicit (elicited)
      to contribute it to the system.
      Elicitation of lexical knowledge is possible through
      well-designed web form interfaces:
             to provide the lemmas of the source and target word
             to select the inflection paradigm of the source and target
             word
             to establish the scope of the equivalence (bidirectional,
             left-to-right, right-to-left).
      Elicitation of other knowledge (e.g., structural transfer
      rules) is harder (a subject of research indeed).
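
      The three items a web form would collect can be pictured as one
      record per submission. The sketch below is only an illustration:
      the class, the field names, the direction codes and the paradigm
      names are all hypothetical and do not describe any existing tool.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    """Scope of the equivalence between the two words (codes are made up)."""
    BIDIRECTIONAL = "both"
    LEFT_TO_RIGHT = "LR"
    RIGHT_TO_LEFT = "RL"


@dataclass(frozen=True)
class ElicitedEntry:
    """One submission from a hypothetical lexical elicitation form."""
    source_lemma: str
    source_paradigm: str
    target_lemma: str
    target_paradigm: str
    direction: Direction = Direction.BIDIRECTIONAL

    def validate(self, known_source_paradigms, known_target_paradigms):
        """Return a list of problems; an empty list means the entry is usable."""
        problems = []
        if not self.source_lemma.strip():
            problems.append("missing source lemma")
        if not self.target_lemma.strip():
            problems.append("missing target lemma")
        if self.source_paradigm not in known_source_paradigms:
            problems.append(f"unknown source paradigm: {self.source_paradigm}")
        if self.target_paradigm not in known_target_paradigms:
            problems.append(f"unknown target paradigm: {self.target_paradigm}")
        return problems


entry = ElicitedEntry("house", "noun_regular", "casa", "noun_feminine")
```

      Restricting the paradigm fields to a known set is what lets a
      contributor pick from a list instead of writing morphology by hand.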



Simplicity of linguistic knowledge needed

  To encourage and ease collaborative development, the level of
  linguistic knowledge necessary to start building a new MT system
  should be kept to a minimum (basic high-school grammar skills
  and concepts).
      This is rather easy in shallow-transfer MT systems.
      But it is very difficult (if not impossible) for deep transfer
      systems.
  Well-written documentation may be very helpful. Having
  someone available online to answer questions is even better.




Standardization and documentation of linguistic data
formats

     An adequate documentation of the format of linguistic data
     is crucial.
     The way: using XML. Why?
            Each data item is explicitly labeled with a descriptive,
            named tag with a clear meaning attached
            The structure of documents may easily be validated against
            DTDs or schemas
            Many technologies exist for XML (converting from and to
            XML, interoperability).
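
     To make the first two points concrete, here is a small sketch of
     explicitly labeled data and a structural check. The element and
     attribute names are invented for this example and are not an
     actual dictionary format. The Python standard library does not
     validate against DTDs, so a hand-written check stands in for
     real DTD or schema validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical bilingual entry: every data item carries a descriptive label.
SAMPLE = """\
<dictionary>
  <entry direction="both">
    <source lemma="house" paradigm="noun_regular"/>
    <target lemma="casa" paradigm="noun_feminine"/>
  </entry>
</dictionary>
"""


def check_dictionary(xml_text):
    """Return a list of structural errors (empty if the document looks well-formed
    according to the invented format above)."""
    root = ET.fromstring(xml_text)
    errors = []
    if root.tag != "dictionary":
        errors.append(f"unexpected root element: {root.tag}")
    for i, entry in enumerate(root.findall("entry"), start=1):
        for side in ("source", "target"):
            node = entry.find(side)
            if node is None:
                errors.append(f"entry {i}: missing <{side}>")
            elif not node.get("lemma"):
                errors.append(f"entry {i}: <{side}> has no lemma")
    return errors
```

     With a real DTD or schema, a validating parser would perform
     this kind of check without any hand-written code.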




Modularity

     The emphasis of open-source is the reusability of code
     and linguistic data to build new MT systems or other
     language-technology applications.
     For that objective, modularity is a must.
     A modular engine induces modularity in its data.
     For example, having an independent morphological
     analyser and an independent morphological dictionary
            Makes it easier to build an MT system for a different target
            language
            May be used to build an intelligent search engine
            (inflection-independent search)
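
     The search-engine case can be sketched in a few lines: the same
     morphological dictionary that an MT system would use maps each
     inflected form to its lemma, and the search compares lemmas
     instead of surface forms. The tiny English dictionary and the
     function names are assumptions made for illustration only.

```python
# Toy morphological dictionary: surface form -> lemma (illustrative entries only).
MORPH_DICT = {
    "house": "house", "houses": "house",
    "translate": "translate", "translates": "translate",
    "translated": "translate", "translating": "translate",
}


def analyse(word):
    """Return the lemma of a word, or the lowercased word itself if unknown."""
    return MORPH_DICT.get(word.lower(), word.lower())


def search(query, documents):
    """Return the documents containing any inflected form of the query word."""
    target = analyse(query)
    return [
        doc for doc in documents
        if any(analyse(token.strip(".,;:!?")) == target for token in doc.split())
    ]


DOCS = ["She translated the letter.", "Two houses were sold.", "Nothing relevant."]
```

     A query for "translating" finds the first document even though
     that exact form never occurs in it.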



Background

  Apertium is based on the technologies developed by the
  Transducens group at the Universitat d’Alacant during the
  development of two existing systems:
      interNOSTRUM (interNOSTRUM.com, Spanish–Catalan)
      Tradutor Universia (tradutor.universia.net,
      Spanish–Portuguese)
  These technologies, initially designed for related-language
  pairs, have been extended to handle language pairs which are
  not so related.




Rationale /1
  To generate translations which are
      reasonably intelligible and
      easy to correct
  between related languages such as Spanish (es) and Catalan
  (ca) or Portuguese (pt), etc., or Nynorsk (nn), Bokmål (no)
  and Icelandic (is), one can just augment word-for-word
  translation with
       robust lexical processing (including multi-word units)
       lexical categorial disambiguation (part-of-speech tagging)
       local structural processing based on simple and
       well-formulated rules for frequent structural
       transformations (reordering, agreement)
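
  A deliberately tiny sketch of this idea follows: word-for-word
  lookup that also yields a part of speech, plus one local
  reordering rule. The English to Spanish lexicon is a toy chosen
  for readability, agreement is hard-coded in the entries rather
  than computed, and none of this reflects actual engine code.

```python
# Toy lexicon: source word -> (part of speech, target word).
# Simplification: gender and number agreement is baked into the entries.
LEXICON = {
    "the": ("det", "la"),
    "white": ("adj", "blanca"),
    "house": ("noun", "casa"),
}


def translate(sentence):
    """Word-for-word translation plus one local rule: adj noun -> noun adj."""
    # Lexical processing and (trivial) tagging: unknown words pass through.
    units = [LEXICON.get(word, ("unk", word)) for word in sentence.lower().split()]

    output = []
    i = 0
    while i < len(units):
        is_adj_noun = (
            i + 1 < len(units)
            and units[i][0] == "adj"
            and units[i + 1][0] == "noun"
        )
        if is_adj_noun:
            # Structural rule: swap the adjective and the noun.
            output.extend([units[i + 1][1], units[i][1]])
            i += 2
        else:
            output.append(units[i][1])
            i += 1
    return " ".join(output)
```

  For example, translate("the white house") gives "la casa blanca",
  whereas plain word-for-word lookup would give "la blanca casa".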


Rationale /2



  For harder, not so related, language pairs:
      It should be possible to build on that simple model.
      It should be possible to generalize its concepts so that
      complexity is kept as low as possible.






Rationale /3


     It should be possible to generate the whole system from
     linguistic data (monolingual and bilingual dictionaries,
     grammar rules) specified in a declarative way.
     This information should be provided in an interoperable
     format ⇒ XML. These are the different types of data:
            (language-independent) rules to treat text formats
            specification of the part-of-speech tagger
            morphological and bilingual dictionaries and dictionaries of
            orthographical transformation rules
            structural transfer rules




Rationale /4


     It should be possible to have a single generic
     (language-independent) engine reading language-pair
     data (“separation of algorithms and data”).
     Language-pair data should be preprocessed so that the
     system is fast (>10,000 words per second) and compact;
     for example, lexical transformations are performed by
     minimized finite-state transducers (FSTs).
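
     To give a feel for why compiled lexical data is fast, the sketch
     below compiles a word list into a letter trie, so that looking a
     word up costs one step per letter regardless of dictionary size.
     This is a simplification: a trie only shares prefixes and is not
     a minimized transducer, and the angle-bracket tag notation in
     the analyses is used here purely for illustration.

```python
def build_trie(entries):
    """Compile (surface form, analysis) pairs into a nested-dict letter trie."""
    root = {}
    for surface, analysis in entries:
        node = root
        for letter in surface:
            node = node.setdefault(letter, {})
        # The empty-string key marks a final state and stores its outputs.
        node.setdefault("", []).append(analysis)
    return root


def lookup(trie, word):
    """Follow the word letter by letter; return its analyses (possibly none)."""
    node = trie
    for letter in word:
        node = node.get(letter)
        if node is None:
            return []
    return node.get("", [])


TRIE = build_trie([
    ("house", "house<n><sg>"),
    ("houses", "house<n><pl>"),
    ("house", "house<vblex><inf>"),
])
```

     Note that "house" and "houses" share every state up to the final
     "s", and that an ambiguous form simply yields several analyses.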






Rationale /5

  Reasons for the open-source development of Apertium:
      To give everyone free, unlimited access to the best
      possible machine-translation technologies.
      To establish a modular, documented, open platform for
      shallow-transfer machine translation and other human
      language processing tasks.
      To favour the interchange and reuse of existing linguistic
      data.
      To make integration with other open-source technologies
      easier.



Rationale /6
  More reasons for open-source development of Apertium:
      To benefit from collaborative development
             of the machine translation engine
             of language-pair data for currently existing or new language
             pairs
      from industries, academia and small-language support
      organizations.
      To help shift MT business from the obsolescent
      licence-centered model to a service-centered model.
      To radically guarantee the reproducibility of machine
      translation and natural language processing research.
      Because it does not make sense to use public funds to
      develop non-free, closed-source software.


The Apertium platform
  Apertium is an open-source machine translation platform
  (http://www.apertium.org) providing:
    1 An open-source modular shallow-transfer machine
      translation engine with:
             text format management
             finite-state lexical processing
             statistical lexical disambiguation
             shallow transfer based on finite-state pattern matching
   2   Open-source linguistic data in well-specified XML formats
       for a variety of language pairs
   3   Open-source tools: compilers to turn linguistic data into a
       fast and compact form used by the engine and software to
       learn disambiguation or structural transfer rules.


The Apertium engine/1
   SL text→               De-formatter
                                ↓
                     Morphological analyser                             [←FST]
                                ↓
                    Categorial disambiguator                        [←FST+stat.]
                                ↓
   [rules→]            Structural transfer                     ↔ Lexical transfer   [←FST]
                                ↓
                    Morphological generator                             [←FST]
                                ↓
                        Post-generator                                  [←FST]
                                ↓
                          Re-formatter                                 →TL text

The Apertium engine/2


  Communication between modules: text (Unix “pipelines”).
  Advantages:
      Simplifies diagnosis and debugging
      Allows the modification of data between two modules
      using, e.g., filters
      Makes it easy to insert alternative modules (interesting for
      research and development purposes)
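
  A minimal Python sketch of this design, assuming each module is simply a function from text to text. The stage names and their toy behaviour are hypothetical stand-ins, not Apertium's actual programs; the point is only that an extra filter can be dropped between two stages without touching either of them.

```python
from functools import reduce
from typing import Callable, List

Stage = Callable[[str], str]

def run_pipeline(stages: List[Stage], text: str) -> str:
    """Feed the text through each stage in order, like a Unix pipe."""
    return reduce(lambda data, stage: stage(data), stages, text)

# Hypothetical stand-ins for real modules: each one only reads and writes text.
def lowercase(text: str) -> str:
    return text.lower()

def tokenize(text: str) -> str:
    return " | ".join(text.split())

def debug_tap(text: str) -> str:
    # An inserted filter: inspect the intermediate data without changing it.
    print("intermediate:", text)
    return text

pipeline = [lowercase, debug_tap, tokenize]
result = run_pipeline(pipeline, "Takk FYRIR")
```

  Removing `debug_tap` from the list, or replacing `tokenize` with an alternative implementation, leaves the other stages unchanged.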





De-formatter



     Separates text from format information.
     Currently available for ISO-8859 or UTF-8 plain text,
     HTML, RTF, ODF, OpenOffice.org .sxw, etc.
     Based on finite-state techniques.
     Most of these filters are generated (using an XSLT
     stylesheet) from an XML de-formatter specification file.
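
     The idea of separating text from format can be sketched in Python as follows. This is a toy illustration for HTML-like tags only, assuming that format is hidden inside square brackets so that later stages can skip it; the function names are hypothetical, literal square brackets in the text are not escaped, and the real filters are generated finite-state programs, not regular-expression calls.

```python
import re
from typing import Callable

TAG = re.compile(r"<[^>]+>")
WRAPPED_TAG = re.compile(r"(\[<[^>]+>\])")

def deformat(html: str) -> str:
    """Wrap each tag in square brackets so later stages can skip it."""
    return TAG.sub(lambda m: "[" + m.group(0) + "]", html)

def reformat(text: str) -> str:
    """Undo deformat(): unwrap the bracketed tags."""
    return re.sub(r"\[(<[^>]+>)\]", r"\1", text)

def translate_text_only(text: str, translate: Callable[[str], str]) -> str:
    """Apply `translate` only to the stretches between format blocks."""
    parts = WRAPPED_TAG.split(text)
    # With one capturing group, odd positions hold the bracketed tags.
    return "".join(
        part if i % 2 == 1 else translate(part)
        for i, part in enumerate(parts)
    )

src = "<b>hello</b> world"
marked = deformat(src)
restored = reformat(translate_text_only(marked, str.upper))
```

     Here `str.upper` stands in for the whole translation engine: the tags survive untouched while the text between them is transformed.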





Morphological analyser
     segments the source text into surface forms (SFs),
     assigns to each SF one or more lexical forms (LFs), each
     one with:
            lemma
            lexical category (part-of-speech)
            morphological inflection information
     processes contractions (en: can’t=can+not; is:
     talarðu=talar+þú, ertu=ert+þú) and multi-word units, which
     may be invariable (is: með öðrum orðum, við hlíðina á) or
     variable (is: brjóta af sér → braut af sér).
     reads finite-state transducers generated from a
     morphological dictionary in XML (using a compiler).
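
     A toy Python sketch of what the analyser produces, assuming a plain lookup table in place of the compiled finite-state transducers. The lexicon entries, the tag names and the `^surface/analysis$` output notation are illustrative assumptions, and English is used to keep the example small.

```python
from typing import Dict, List

# Hypothetical toy dictionary: surface form -> list of lexical forms.
# Each lexical form is lemma + category + inflection tags.
LEXICON: Dict[str, List[str]] = {
    "the": ["the<det>"],
    "books": ["book<n><pl>", "book<vblex><pri><p3><sg>"],
    "can't": ["can<vaux>+not<adv>"],  # contraction split into two LFs
}

def analyse(text: str) -> str:
    """Segment on whitespace and attach every known lexical form."""
    out = []
    for sf in text.lower().split():
        # Unknown words are passed through, marked with '*'.
        lfs = LEXICON.get(sf, ["*" + sf])
        out.append("^" + "/".join([sf] + lfs) + "$")
    return " ".join(out)

ambiguous = analyse("The books")
unknown = analyse("can't fly")
```

     Note that "books" receives two lexical forms (noun and verb); choosing between them is left to the next module.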

Categorial disambiguator (part-of-speech tagger)


     picks one of the LFs corresponding to each ambiguous SF
     (about 30% of them) according to context
     uses hidden Markov models and hand-written constraint
     rules
     is trained using representative corpora for the source
     language (manually disambiguated or not) or, recently,
     using statistical models for the TL
     its behavior is completely specified by an XML file
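
     How a hidden Markov model picks one tag per word can be sketched with a small Viterbi search. This is a simplified illustration, not the actual tagger: the tag set and all probabilities are invented, every candidate tag of a word is treated as an equally likely emission, and constraint rules are left out.

```python
from typing import Dict, List, Tuple

# Invented start and tag-to-tag transition probabilities.
START: Dict[str, float] = {"DET": 0.6, "N": 0.3, "V": 0.1}
TRANS: Dict[Tuple[str, str], float] = {
    ("DET", "N"): 0.9, ("DET", "V"): 0.1,
    ("N", "V"): 0.6, ("N", "N"): 0.3, ("N", "DET"): 0.1,
    ("V", "DET"): 0.6, ("V", "N"): 0.3, ("V", "V"): 0.1,
}

def disambiguate(candidates: List[List[str]]) -> List[str]:
    """Viterbi search: most probable tag path through the candidates.

    candidates[i] lists the tags the analyser allows for word i.
    """
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (START.get(t, 0.0), [t]) for t in candidates[0]}
    for tags in candidates[1:]:
        new_best = {}
        for t in tags:
            new_best[t] = max(
                ((p * TRANS.get((prev, t), 0.0), path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda item: item[0],
            )
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# "the" is unambiguous; the next two words could each be noun or verb.
tags = disambiguate([["DET"], ["N", "V"], ["N", "V"]])
```

     With these invented numbers the search prefers a noun after a determiner and a verb after a noun.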




Structural transfer /1

      It is based on finite-state techniques (finite-state
      recognizers).
      The XML transfer-rule file is preprocessed for faster
      interpretation.
      Rules have a pattern–action form.
      It detects LF patterns to be processed using a left-to-right,
      longest-match strategy.
      It executes the actions associated with each pattern in the
      rule file to generate the corresponding LF pattern for the
      TL.
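
      The left-to-right, longest-match strategy with pattern–action rules can be sketched as follows. This is an illustrative Python toy, not the XML rule format: the two rules, the category names and the helper functions are assumptions, and lemmas are left untranslated because lexical transfer is a separate module.

```python
from typing import Callable, Dict, List, Tuple

LF = Tuple[str, str]  # (lemma, lexical category)
Action = Callable[[List[LF]], List[LF]]

def swap_adj_noun(lfs: List[LF]) -> List[LF]:
    # adj n -> n adj
    return [lfs[1], lfs[0]]

def det_noun_adj(lfs: List[LF]) -> List[LF]:
    # det adj n -> det n adj
    return [lfs[0], lfs[2], lfs[1]]

# Hypothetical rule table: pattern of categories -> action.
RULES: Dict[Tuple[str, ...], Action] = {
    ("adj", "n"): swap_adj_noun,
    ("det", "adj", "n"): det_noun_adj,
}

def transfer(lfs: List[LF]) -> List[LF]:
    out: List[LF] = []
    patterns = sorted(RULES, key=len, reverse=True)  # longest first
    i = 0
    while i < len(lfs):
        for pattern in patterns:
            window = lfs[i:i + len(pattern)]
            if tuple(cat for _, cat in window) == pattern:
                out.extend(RULES[pattern](window))
                i += len(pattern)
                break
        else:
            # No rule matches here: copy one LF and move on.
            out.append(lfs[i])
            i += 1
    return out

sentence = [("the", "det"), ("red", "adj"), ("house", "n"), ("fall", "vblex")]
reordered = transfer(sentence)
```

      At position 0 both rules could eventually apply, but the three-word pattern wins because it is the longest match starting there.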


Structural transfer /2

  For “harder” language pairs, a three-stage structural transfer is
  available:
      Patterns of LFs (chunks) are detected, processed and
      marked
      Patterns of chunks are detected and processed: this
      interchunk processing allows for longer-range
      (“inter-chunk”) syntactic transformations
      The output chunks are finished and the resulting LFs are
      written.



Lexical transfer module



     reads each SL LF and generates the corresponding TL LF
     reads finite-state transducers generated from bilingual
     dictionaries in XML (using a compiler)
     is invoked by the structural transfer module
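
     A toy Python sketch of this lookup, assuming lexical forms are written as a lemma followed by angle-bracket tags. The three is→en entries, the tag names and the `@` marker for untranslated lemmas are assumptions made for illustration; tags are copied unchanged here, whereas a real bilingual dictionary may also change them.

```python
import re
from typing import Dict, Tuple

# Hypothetical is->en entries keyed by (lemma, lexical category).
BIDIX: Dict[Tuple[str, str], str] = {
    ("hús", "n"): "house",
    ("hestur", "n"): "horse",
    ("bók", "n"): "book",
}

# A lemma followed by one or more <tag> groups.
LF_RE = re.compile(r"^([^<]+)((?:<[^>]+>)+)$")

def lexical_transfer(sl_lf: str) -> str:
    """Translate the lemma of one lexical form, copying its tags."""
    m = LF_RE.match(sl_lf)
    if m is None:
        return "@" + sl_lf  # not a well-formed lexical form
    lemma, tags = m.group(1), m.group(2)
    category = tags[1:tags.index(">")]  # the first tag is the category
    tl_lemma = BIDIX.get((lemma, category))
    return tl_lemma + tags if tl_lemma else "@" + sl_lf

known = lexical_transfer("hús<n><sg>")
missing = lexical_transfer("köttur<n><sg>")
```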





Morphological generator



     Generates a TL SF from each TL LF by inflecting it
     appropriately.
     It reads finite-state transducers generated from a
     morphological dictionary in XML (using a compiler).





Post-generator



     Performs some TL orthographical transformations, such as
     contractions (ca: de + els → dels; en: can + not →
     cannot), inserting apostrophes (ca: de + amics →
     d’amics), etc.
     It is based on finite-state transducers generated from a
     post-generation rule dictionary (using a compiler).
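
     The examples above can be sketched as a small Python rewriting pass over adjacent words. The rule table and vowel list are toy assumptions covering only the cases quoted on this slide; the real module uses compiled finite-state transducers, and a straight apostrophe is used in the code for simplicity.

```python
from typing import Dict, List, Tuple

# Toy rules taken from the examples above; a real rule set is far larger.
CONTRACTIONS: Dict[Tuple[str, str], str] = {
    ("de", "els"): "dels",
    ("can", "not"): "cannot",
}
VOWELS = "aeiouàèéíòóú"

def post_generate(words: List[str]) -> List[str]:
    out: List[str] = []
    i = 0
    while i < len(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        if nxt is not None and (words[i], nxt) in CONTRACTIONS:
            # Fixed contraction of two adjacent words.
            out.append(CONTRACTIONS[(words[i], nxt)])
            i += 2
        elif words[i] == "de" and nxt and nxt[0] in VOWELS:
            # Apostrophe before a vowel-initial word.
            out.append("d'" + nxt)
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

contracted = post_generate(["de", "els", "amics"])
elided = post_generate(["de", "amics"])
```

     The contraction table is checked before the apostrophe rule, so "de els" becomes "dels" even though "els" starts with a vowel.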





Re-formatter


     Integrates format information (plain ISO-8859 or UTF-8
     text, HTML, RTF, ODT, OpenOffice.org .sxw, etc.) into the
     translated text.
     Also used to modify URLs in links for translate-as-you-surf.
     It is based on finite-state techniques.
     It is generated (using an XSLT stylesheet) from an XML
     de-formatter specification file.





Language-pair data


  The Apertium project hosts the development of a large number
  of language pairs:
      Stable language pairs include: es↔ca, es↔gl, es↔pt,
      en↔ca, en↔es, es↔fr, ca↔oc, ro→es, es→eo,
      ca→eo.
      There is also a growing number of language pairs under
      development. Some include Scandinavian languages (da,
      sv, nn, nb).




Project funding

  Funded by
     The Ministry of Industry, Tourism and Commerce of Spain
     (also, the Ministries of Education and Science and of
     Science and Technology of Spain)
     The Secretariat for Technology and the Information Society
     of the Government of Catalonia
     The Ministry of Foreign Affairs of Romania
     The Universitat d’Alacant
     Companies: Prompsit Language Engineering, ABC
     Enciklopedioj, etc.


The Apertium community/1

  Not the ideal community development situation, but close.
  In addition to the original (funded) developers, a community has
  formed around the platform (instigated by Francis Tyers).
      More than 60 developers in
      sourceforge.net/projects/apertium/, many
      outside the original group; code updated very frequently,
      hundreds of monthly SVN commits.
      A collectively-maintained wiki shows the current
      development and tips for people building new language
      pairs or code.



The Apertium community/2

     Externally developed tools and code:
           a graphical user interface apertium-tolk, and the
           diagnostic tool apertium-view
           plugins for OpenOffice.org or the Pidgin (previously Gaim)
           messaging program
           Windows ports, etc.
     Many people gather and interact in the #apertium IRC
     channel (at freenode.net).
     Stable packages ported to Debian GNU/Linux (and the
     next Ubuntu).



Apertium for Icelandic /1
  To build, for instance, a GPL apertium-is-en prototype:
      one could reuse the en dictionaries in apertium-en-ca
      or apertium-en-es (analysis and generation) and the
      part-of-speech taggers too
      one should build an is dictionary:
             getting some inspiration from existing (incomplete) data in
             Apertium for sv, da, fo. . .
             using Wiktionary [an experiment by Francis Tyers:
             http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/incubator/apertium-fo-is.is.dix?view=markup]
             convincing the authors of icemorphy or tungutorg to
             release (part of) their data under the GPL license.

Apertium for Icelandic /2

      one could train an is part-of-speech tagger, perhaps with
      some help from icetagger or tungutorg
      one should build a bilingual is–en dictionary, for instance:
             by completing the English and Icelandic dictionaries in
             Ergane
             by modifying bilingual dictionaries learned from a
             sentence-aligned bilingual corpus using Caseli et al.’s
             ReTraTos (sf.net/projects/retratos)
      one could then use Sánchez-Martínez and Forcada’s
      method to learn an initial set of structural transfer rules
      using the same or a different corpus, and then refine it.
  A prototype would be available in 1 person·year! Who dares?

Apertium for Icelandic /3
  Is the time right? The Government of Iceland has agreed on a
  “Policy on Free and Open-source Software” (“Stefna um
  frjálsan og opinn hugbúnað”, Mar. 11, 2008).
      “Giving access to the source code expands the
      opportunities for adapting and examining security aspects
      of the software, in addition to allowing for its further
      development if the producers discontinue it for some
      reason.”
      “There is a great need to increase the return on public body
      investments in software design. [...] Once software has
      been prepared, it is important that it has the potential of
      being reused [...] Reusability can be achieved by [...]
      ensuring that it is free and open-source.”


Concluding remarks /1

      Icelandic, like any other living language, however small,
      needs machine translation and has the right to it!
      The development of open-source MT for Icelandic can
      have specific, additional effects (increasing expertise,
      contributing reusable resources, reducing technological
      dependency). Apertium eases this task.
      Development of MT for a small language faces a number of
      challenges: elicitation of linguistic knowledge, need for
      standard formats, modularity. Apertium offers the last two.
  Of course, I will be happy to discuss these conclusions!


Takk fyrir!



      Thanks, Hrafn Loftsson, and the rest of the colleagues at
      Reykjavík University and the University of Iceland for
      inviting me to this conference and making me feel at home.
      Thank you all for your attention.





I should practice what I preach. . .


  This work may be distributed under the terms of
      the Creative Commons Attribution–Share Alike license:
      http://creativecommons.org/licenses/by-sa/3.0/
      the GNU GPL v. 3.0 License:
      http://www.gnu.org/licenses/gpl.html
  Dual license! E-mail me to get the sources: mlf@ua.es




                                  Mikel L. Forcada    Open-source MT for Icelandic

More Related Content

What's hot

Programming languages in bioinformatics by dr. jayarama reddy
Programming languages in bioinformatics by dr. jayarama reddyProgramming languages in bioinformatics by dr. jayarama reddy
Programming languages in bioinformatics by dr. jayarama reddyDr. Jayarama Reddy
 
IRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation SystemIRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation SystemIRJET Journal
 
Language Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianLanguage Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianWaqas Tariq
 
Python – The Fastest Growing Programming Language
Python – The Fastest Growing Programming LanguagePython – The Fastest Growing Programming Language
Python – The Fastest Growing Programming LanguageIRJET Journal
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTGuy De Pauw
 
Introducing cat tools
Introducing cat toolsIntroducing cat tools
Introducing cat toolsAdrian Brand
 
Generations of programming language
Generations of programming languageGenerations of programming language
Generations of programming languageJAIDEVPAUL
 
Python an-intro youtube-livestream-day1
Python an-intro youtube-livestream-day1Python an-intro youtube-livestream-day1
Python an-intro youtube-livestream-day1MAHALAKSHMI P
 
Php vs Python: The Comparison You Should Know
Php vs Python: The Comparison You Should KnowPhp vs Python: The Comparison You Should Know
Php vs Python: The Comparison You Should Knowcalltutors
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003butest
 
Programing paradigm & implementation
Programing paradigm & implementationPrograming paradigm & implementation
Programing paradigm & implementationBilal Maqbool ツ
 
Portuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowPortuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowValeria de Paiva
 
INTERPRETER AND APPLIED DEVELOPMENT ENVIRONMENT FOR LEARNING CONCEPTS OF OBJE...
INTERPRETER AND APPLIED DEVELOPMENT ENVIRONMENT FOR LEARNING CONCEPTS OF OBJE...INTERPRETER AND APPLIED DEVELOPMENT ENVIRONMENT FOR LEARNING CONCEPTS OF OBJE...
INTERPRETER AND APPLIED DEVELOPMENT ENVIRONMENT FOR LEARNING CONCEPTS OF OBJE...ijpla
 

What's hot (19)

Programming languages in bioinformatics by dr. jayarama reddy
Programming languages in bioinformatics by dr. jayarama reddyProgramming languages in bioinformatics by dr. jayarama reddy
Programming languages in bioinformatics by dr. jayarama reddy
 
IRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation SystemIRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation System
 
Tools of translation
Tools of translationTools of translation
Tools of translation
 
Python
PythonPython
Python
 
Language Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianLanguage Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and Persian
 
Python – The Fastest Growing Programming Language
Python – The Fastest Growing Programming LanguagePython – The Fastest Growing Programming Language
Python – The Fastest Growing Programming Language
 
Cmpe202 01 Research
Cmpe202 01 ResearchCmpe202 01 Research
Cmpe202 01 Research
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFST
 
Introducing cat tools
Introducing cat toolsIntroducing cat tools
Introducing cat tools
 
thrift-20070401
thrift-20070401thrift-20070401
thrift-20070401
 
Generations of programming language
Generations of programming languageGenerations of programming language
Generations of programming language
 
Python overview
Python overviewPython overview
Python overview
 
Python an-intro youtube-livestream-day1
Python an-intro youtube-livestream-day1Python an-intro youtube-livestream-day1
Python an-intro youtube-livestream-day1
 
Php vs Python: The Comparison You Should Know
Php vs Python: The Comparison You Should KnowPhp vs Python: The Comparison You Should Know
Php vs Python: The Comparison You Should Know
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003
 
01 python introduction
01 python introduction 01 python introduction
01 python introduction
 
Programing paradigm & implementation
Programing paradigm & implementationPrograming paradigm & implementation
Programing paradigm & implementation
 
Portuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowPortuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and How
 
INTERPRETER AND APPLIED DEVELOPMENT ENVIRONMENT FOR LEARNING CONCEPTS OF OBJE...
INTERPRETER AND APPLIED DEVELOPMENT ENVIRONMENT FOR LEARNING CONCEPTS OF OBJE...INTERPRETER AND APPLIED DEVELOPMENT ENVIRONMENT FOR LEARNING CONCEPTS OF OBJE...
INTERPRETER AND APPLIED DEVELOPMENT ENVIRONMENT FOR LEARNING CONCEPTS OF OBJE...
 

Similar to Open-source machine translation for Icelandic: the Apertium platform as an opportunity

Apertium: Free/open-source rule-based machine translation and language proces...
Apertium: Free/open-source rule-based machine translation and language proces...Apertium: Free/open-source rule-based machine translation and language proces...
Apertium: Free/open-source rule-based machine translation and language proces...TAUS - The Language Data Network
 
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana
 
Programming language design and implemenation
Programming language design and implemenationProgramming language design and implemenation
Programming language design and implemenationAshwini Awatare
 
8th Ethiopian ICT Conference Bazaar and Exhibition.pptx
8th Ethiopian ICT Conference Bazaar and Exhibition.pptx8th Ethiopian ICT Conference Bazaar and Exhibition.pptx
8th Ethiopian ICT Conference Bazaar and Exhibition.pptxssusera032bc
 
Python Programming Unit1_Aditya College of Engg & Tech
Python Programming Unit1_Aditya College of Engg & TechPython Programming Unit1_Aditya College of Engg & Tech
Python Programming Unit1_Aditya College of Engg & TechRamanamurthy Banda
 
Speech To Speech Translation
Speech To Speech TranslationSpeech To Speech Translation
Speech To Speech TranslationIRJET Journal
 
THE ULTIMATE GUIDE ON PYTHON
THE ULTIMATE GUIDE ON PYTHONTHE ULTIMATE GUIDE ON PYTHON
THE ULTIMATE GUIDE ON PYTHONrobinkumar70125
 
2015 bioinformatics python_introduction_wim_vancriekinge_vfinal
2015 bioinformatics python_introduction_wim_vancriekinge_vfinal2015 bioinformatics python_introduction_wim_vancriekinge_vfinal
2015 bioinformatics python_introduction_wim_vancriekinge_vfinalProf. Wim Van Criekinge
 
Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...Prompsit Language Engineering
 
Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...Gema Ramirez-Sanchez
 
Benefits of Python Courses
Benefits of Python Courses Benefits of Python Courses
Benefits of Python Courses Vimalkrishna11
 
Promoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language TechnologyPromoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language Technologytechiaith
 
2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT IntroductionRIILP
 
Python | What is Python | History of Python | Python Tutorial
Python | What is Python | History of Python | Python TutorialPython | What is Python | History of Python | Python Tutorial
Python | What is Python | History of Python | Python TutorialQA TrainingHub
 

Open-source machine translation for Icelandic: the Apertium platform as an opportunity

  • 1. Open-source machine translation for Icelandic: the Apertium platform as an opportunity
      Mikel L. Forcada (1, 2)
      (1) Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, E-03071 Alacant (Spain)
      (2) Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)
      April 18, 2008: Icelandic Language Technology Conference
      Outline: Concepts; Opportunities from open-source MT systems; Challenges; The Apertium platform; Apertium for Icelandic?; Concluding remarks
      Mikel L. Forcada, Open-source MT for Icelandic
  • 2. Contents
      1. Concepts
      2. Opportunities from open-source MT systems
      3. Challenges
      4. The Apertium platform
      5. Apertium for Icelandic?
      6. Concluding remarks
  • 3. Open-source and free software
      Open-source software is also called free software:
      0. anyone can use it for any purpose
      1. anyone can examine it to see how it works and modify it for any new purpose
      2. anyone can freely distribute it
      3. anyone may release an improved version so that everyone benefits
      For conditions 1 and 3 to be met, anyone should be able to access the source code, hence the name open source.
  • 4. Machine translation software/1
      MT is special: it strongly depends on data.
      • rule-based MT (RBMT): dictionaries, rules
      • corpus-based MT (CBMT): sentence-aligned parallel text, monolingual corpora
      Three components in every MT system:
      • The engine (also decoder, recombinator...)
      • Data (linguistic data, corpora)
      • Tools to maintain these data and convert them to the format used by the engine
  • 5. Machine translation software/2
      I will focus on RBMT. Reasons:
      • CBMT requires massive amounts of sentence-aligned parallel text (is there such a resource for Icelandic?).
      • RBMT may use linguistic data elicited by speakers without access to existing machine-readable resources.
      • RBMT is more transparent: errors are easier to diagnose and debug.
      • I am more familiar with RBMT!
  • 6. MT software/3: commercial machine translation
      • Most commercial MT systems are RBMT (but: LanguageWeaver, Google Labs).
      • They use proprietary technologies which are not disclosed (perceived as their main competitive advantage).
      • Only partial modification (customization) of linguistic data is allowed.
  • 7. MT software/4: open-source machine translation
      For MT to be open-source, the engine, the data and the tools must all be open-source. In the case of CBMT this means that corpora must also be open.
  • 8. Commercial MT systems and small languages: limited opportunities
      The main MT companies target major world languages. Not Icelandic...
      Some closed-source systems:
      • TranExp’s InterTran offers en↔is “interactive translation” (with limited lexical coverage): test at http://www.translation-guide.com/free_online_translators.php?from=Icelandic&to=English
      • Stefán Briem’s prototypes for is↔en or is↔da may be tested at tungutorg.is.
      • A company named ESTeam (www.esteam.gr) is also listed as offering MT for Icelandic.
      It is very hard to adapt closed, commercial MT systems to small languages.
  • 9. Opportunities from open-source MT systems
      Even if reasonable-quality closed-source MT is available, the development and use of open-source MT systems provides additional opportunities:
      • Increases language expertise and resources
      • Increases technological independence
  • 10. Increasing expertise and language resources
      When building an open-source MT system for a small language, a variety of situations may occur. All of them involve building small-language expertise and resources through
      • reflection about the small language
      • elicitation of linguistic (monolingual and bilingual) knowledge about it
      • subsequent encoding of this knowledge
      The open-source setting makes new expertise and resources naturally available to the community. Three scenarios may occur:
  • 11. Building data for an existing MT engine from scratch
      One needs:
      • A freely available (open-source or not) MT engine
      • Freely available (open-source or not) tools to manage linguistic data
      • Complete documentation on how to build linguistic data for use with the engine and tools
      This is a very unfavourable setting. Many decisions have to be made, e.g., defining the set of lexical categories and inflection indicators. The blank sheet syndrome may paralyze the project.
      If overcome, the expertise acquired and the resulting open-source data could be improved or used for other purposes: positive effect on the small language.
  • 12. Building data for an existing MT engine from existing language-pair data
      If free tools and engine and open-source data are available for another pair with a similar or related language, the blank sheet syndrome is drastically reduced. One could, for example:
      • use the same set of lexical categories and inflection indicators
      • build inflection paradigms on top of existing ones
  • 13. Adapting a new open-source engine or tools for a new language pair
      If source code is available for the engine and tools, experts could enhance or adapt them to address new features of the small language not dealt with adequately by the current code:
      • character sets
      • structural transfer not powerful enough, etc.
      This is more challenging than building new data, but programmers do not need to have full command of the small language (abstract management of linguistic issues is possible).
      Code rewriting would add expertise and resources to the language community.
  • 14. Increasing technological independence
      Having an open-source engine, tools and data makes users of the small language less dependent on a single commercial, closed-source provider.
      This has an analogous effect, not only on machine translation, but also on other language technologies.
  • 15. Organizing community development/1
      Assume we are just developing linguistic data.
      • Open-source makes it possible for a small-language community to collaboratively develop machine translation for it.
      • Some small languages have people with good linguistic and translation skills (this is the case of Icelandic).
      • However, the availability of human resources with language and translation skills is necessary but not sufficient.
  • 16. Organizing community development/2
      Some structure is necessary. Ideally:
      • A coordinating team mastering the engine and tools used is needed to lead the effort, including:
          • code coordinators (installation, maintenance, modifications to the code)
          • linguistic coordinators (linguistic data maintenance)
      • A project web server
          • to distribute the latest version of the system
          • to execute it online
          • for developers to contribute new linguistic data or code
      • A group of skilled developers, certified in some sense by the coordinating team.
  • 17. Eliciting linguistic knowledge
      Existing linguistic knowledge has to be made explicit (elicited) to contribute it to the system.
      Elicitation of lexical knowledge is possible through well-designed web form interfaces:
      • to provide the lemmas of the source and target word
      • to select the inflection paradigm of the source and target word
      • to establish the scope of the equivalence (bidirectional, left-to-right, right-to-left).
      Elicitation of other knowledge (e.g., structural transfer rules) is harder (a subject of research indeed).
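The three pieces of information such a web form collects can be pictured as a small record with a validation step. The following is only a sketch: the class, field, and paradigm names are hypothetical and do not come from the slides or from any existing Apertium tool.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    """Scope of the equivalence, as listed on the slide."""
    BOTH = "bidirectional"
    LEFT_TO_RIGHT = "left-to-right"
    RIGHT_TO_LEFT = "right-to-left"


@dataclass(frozen=True)
class LexicalEntry:
    """One submission of a (hypothetical) elicitation form."""
    source_lemma: str
    source_paradigm: str
    target_lemma: str
    target_paradigm: str
    direction: Direction = Direction.BOTH


def validate(entry, known_source_paradigms, known_target_paradigms):
    """Return a list of problems; an empty list means the entry is acceptable."""
    problems = []
    if not entry.source_lemma.strip():
        problems.append("missing source lemma")
    if not entry.target_lemma.strip():
        problems.append("missing target lemma")
    if entry.source_paradigm not in known_source_paradigms:
        problems.append(f"unknown source paradigm: {entry.source_paradigm}")
    if entry.target_paradigm not in known_target_paradigms:
        problems.append(f"unknown target paradigm: {entry.target_paradigm}")
    return problems


# Paradigm labels below are invented for illustration.
IS_PARADIGMS = {"noun-masc-1", "noun-fem-1"}
EN_PARADIGMS = {"noun-regular"}

good = LexicalEntry("hestur", "noun-masc-1", "horse", "noun-regular")
bad = LexicalEntry("hestur", "noun-neut-9", "", "noun-regular", Direction.LEFT_TO_RIGHT)
```

Restricting paradigms to a closed set is what would let a contributor pick from a drop-down list rather than type free text.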
  • 18. Simplicity of linguistic knowledge needed
      To encourage and ease collaborative development, the level of linguistic knowledge necessary to start building a new MT system should be kept to a minimum (basic high-school grammar skills and concepts).
      This is rather easy in shallow-transfer MT systems, but it is very difficult (if not impossible) for deep-transfer systems.
      Well-written documentation may be very helpful. Having someone available online to ask questions to is even better.
  • 19. Standardization and documentation of linguistic data formats
      An adequate documentation of the format of linguistic data is crucial. The way: using XML. Why?
      • Each data item is explicitly labeled with a descriptive, named tag with a clear meaning attached
      • The structure of documents may easily be validated against DTDs or schemas
      • Many technologies exist for XML (converting from and to XML, interoperability).
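To make the point about explicitly labeled data concrete, here is a minimal sketch that reads a tiny XML dictionary and checks its structure by hand. The element and attribute names are made up for illustration and are not Apertium's actual format. Python's standard `xml.etree.ElementTree` only checks well-formedness, so the structural check is written out explicitly here; real validation against a DTD or schema would need a validating parser.

```python
import xml.etree.ElementTree as ET

# A made-up dictionary fragment; the second entry is deliberately incomplete.
SAMPLE = """<dictionary>
  <entry direction="bidirectional">
    <source lemma="hestur" paradigm="noun-masc-1"/>
    <target lemma="horse" paradigm="noun-regular"/>
  </entry>
  <entry>
    <source lemma="bók" paradigm="noun-fem-1"/>
  </entry>
</dictionary>"""


def check_entries(xml_text):
    """Report entries lacking a <source> or <target> element with a lemma."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    errors = []
    for i, entry in enumerate(root.findall("entry"), start=1):
        for side in ("source", "target"):
            node = entry.find(side)
            if node is None:
                errors.append(f"entry {i}: missing <{side}>")
            elif not node.get("lemma"):
                errors.append(f"entry {i}: <{side}> has no lemma")
    return errors


print(check_entries(SAMPLE))
```

Because every item carries a named tag, the error messages can point at the exact element that is missing.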
  • 20. Modularity
      The emphasis of open-source is the reusability of code and linguistic data to build new MT systems or other language-technology applications. For that objective modularity is a must.
      A modular engine induces modularity in its data. For example, having an independent morphological analyser and an independent morphological dictionary
      • Makes it easier to build an MT system for a different target language
      • May be used to build an intelligent search engine (inflection-independent search)
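The inflection-independent search mentioned above can be sketched by reusing a morphological dictionary outside of translation. This is a toy illustration: the form-to-lemma table covers a single Icelandic noun and the function names are hypothetical; a real analyser would be compiled from a full dictionary.

```python
# Toy morphological dictionary: inflected forms of "hestur" (horse) -> lemma.
ANALYSES = {
    "hestur": "hestur", "hest": "hestur", "hesti": "hestur", "hests": "hestur",
    "hestar": "hestur", "hesta": "hestur", "hestum": "hestur",
}


def lemma_of(word):
    """Map a surface form to its lemma; unknown words map to themselves."""
    word = word.lower()
    return ANALYSES.get(word, word)


def search(query, documents):
    """Return the documents containing any inflected form of the query's lemma."""
    target = lemma_of(query)
    return [
        doc for doc in documents
        if any(lemma_of(token) == target for token in doc.split())
    ]


DOCS = ["Ég sá hest", "Hestar eru fallegir", "Ég á bók"]
print(search("hestur", DOCS))
```

The search code never mentions translation: the same dictionary module serves both applications, which is the reuse the slide argues for.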
  • 21. Background
      Apertium is based on the technologies developed by the Transducens group at the Universitat d’Alacant during the development of two existing systems:
      • interNOSTRUM (interNOSTRUM.com, Spanish–Catalan)
      • Tradutor Universia (tradutor.universia.net, Spanish–Portuguese)
      These technologies, initially designed for related-language pairs, have been extended to handle language pairs which are not so related.
  • 22. Rationale/1
      To generate translations which are reasonably intelligible and easy to correct between related languages such as Spanish (es) and Catalan (ca) or Portuguese (pt), etc., or Nynorsk (nn), Bokmål (no) and Icelandic (is), one can just augment word-for-word translation with
      • robust lexical processing (including multi-word units)
      • lexical categorial disambiguation (part-of-speech tagging)
      • local structural processing based on simple and well-formulated rules for frequent structural transformations (reordering, agreement)
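The idea of word-for-word translation augmented with one local structural rule can be sketched as follows. This is a deliberately tiny illustration, not Apertium's implementation: the Spanish–English word list, the tag names, and the convention of marking unknown words with an asterisk are all assumptions of the sketch, and it omits multi-word units, part-of-speech disambiguation, and agreement.

```python
# Toy bilingual dictionary: source word -> (target word, part of speech).
BILINGUAL = {
    "el": ("the", "det"),
    "gato": ("cat", "n"),
    "negro": ("black", "adj"),
    "duerme": ("sleeps", "v"),
}


def lexical_transfer(words):
    """Translate word for word; unknown words pass through marked with '*'."""
    return [BILINGUAL.get(w, ("*" + w, "unk")) for w in words]


def reorder_noun_adj(tagged):
    """Local structural rule: noun + adjective becomes adjective + noun."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "n" and out[i + 1][1] == "adj":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair just reordered
        else:
            i += 1
    return out


def translate(sentence):
    tagged = lexical_transfer(sentence.lower().split())
    return " ".join(word for word, _ in reorder_noun_adj(tagged))


print(translate("El gato negro duerme"))  # prints "the black cat sleeps"
```

The rule only looks at two adjacent words, which is what makes this kind of processing "local" and cheap compared with full syntactic analysis.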
  • 23. Rationale /2: For harder, not so related, language pairs: it should be possible to build on that simple model, and it should be possible to generalize its concepts so that complexity is kept as low as possible.
  • 24. Rationale /3: It should be possible to generate the whole system from linguistic data (monolingual and bilingual dictionaries, grammar rules) specified in a declarative way. This information should be provided in an interoperable format ⇒ XML. These are the different types of data: (language-independent) rules to treat text formats; the specification of the part-of-speech tagger; morphological and bilingual dictionaries, and dictionaries of orthographical transformation rules; structural transfer rules.
  • 25. Rationale /4: It should be possible to have a single generic (language-independent) engine reading language-pair data (“separation of algorithms and data”). Language-pair data should be preprocessed so that the system is fast (>10,000 words per second) and compact; for example, lexical transformations are performed by minimized finite-state transducers (FSTs).
  • 26. Rationale /5: Reasons for the open-source development of Apertium: to give everyone free, unlimited access to the best possible machine-translation technologies; to establish a modular, documented, open platform for shallow-transfer machine translation and other human language processing tasks; to favour the interchange and reuse of existing linguistic data; to make integration with other open-source technologies easier.
  • 27. Rationale /6: More reasons for the open-source development of Apertium: to benefit from collaborative development, both of the machine translation engine and of language-pair data for currently existing or new language pairs, by industry, academia and small-language support organizations; to help shift the MT business from the obsolescent licence-centered model to a service-centered model; to radically guarantee the reproducibility of machine translation and natural language processing research; and because it does not make sense to use public funds to develop non-free, closed-source software.
  • 28. The Apertium platform: Apertium is an open-source machine translation platform (http://www.apertium.org) providing: (1) an open-source, modular, shallow-transfer machine translation engine with text format management, finite-state lexical processing, statistical lexical disambiguation, and shallow transfer based on finite-state pattern matching; (2) open-source linguistic data in well-specified XML formats for a variety of language pairs; (3) open-source tools: compilers to turn linguistic data into a fast and compact form used by the engine, and software to learn disambiguation or structural transfer rules.
  • 29. The Apertium engine /1: [Diagram of the module pipeline] SL text → De-formatter → Morphological analyser [FST] → Categorial disambiguator [FST + statistics] → Structural transfer [rules], which exchanges data with Lexical transfer [FST] → Morphological generator [FST] → Post-generator [FST] → Re-formatter → TL text.
  • 30. The Apertium engine /2: Communication between modules: text (Unix “pipelines”). Advantages: it simplifies diagnosis and debugging; it allows the modification of data between two modules using, e.g., filters; it makes it easy to insert alternative modules (interesting for research and development purposes).
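The text-pipeline design described in the two engine slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three stage functions are hypothetical stand-ins, not Apertium modules, and the `^...$` delimiters are used here purely as a toy marker for lexical units. The point is that when every module maps text to text, inserting a filter between two modules needs no change to either of them.

```python
from functools import reduce
from typing import Callable, List

# Every module is modelled as a function from text to text,
# which is what makes the stages freely composable.
Stage = Callable[[str], str]


def run_pipeline(stages: List[Stage], text: str) -> str:
    """Feed text through each stage in order, like `a | b | c` in a shell."""
    return reduce(lambda data, stage: stage(data), stages, text)


def toy_analyser(text: str) -> str:
    """Hypothetical stand-in for an analyser: wrap each word in delimiters."""
    return " ".join(f"^{word}$" for word in text.split())


def toy_generator(text: str) -> str:
    """Hypothetical stand-in for a generator: remove the delimiters again."""
    return " ".join(unit.strip("^$") for unit in text.split())


def uppercase_filter(text: str) -> str:
    """A filter inserted between two modules; neither module is modified."""
    return text.upper()


plain = run_pipeline([toy_analyser, toy_generator], "takk fyrir")
filtered = run_pipeline(
    [toy_analyser, uppercase_filter, toy_generator], "takk fyrir"
)
```

Because the intermediate data is plain text, the output of any prefix of the stage list can also be printed and inspected, which is the debugging advantage the slide mentions.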
  • 31. De-formatter: Separates text from format information. Currently available for ISO-8859 or UTF-8 plain text, HTML, RTF, ODF, OpenOffice.org .sxw, etc. Based on finite-state techniques. Most of these filters are generated (using an XSLT stylesheet) from an XML de-formatter specification file.
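As a rough illustration of separating text from format information, the following Python sketch wraps HTML tags in square brackets so that later stages can pass them through untouched, and then restores them. It is a toy under stated assumptions: the bracket convention and both function names are chosen for this example, the regular expressions stand in for the generated finite-state filters, and literal brackets in the input are not escaped.

```python
import re

# Matches a single HTML tag such as <b> or </b>.
TAG = re.compile(r"<[^>]+>")


def deformat(html: str) -> str:
    """Wrap each tag in [ ] so that translation stages can skip over it."""
    return TAG.sub(lambda match: "[" + match.group(0) + "]", html)


def reformat(stream: str) -> str:
    """Inverse step: unwrap the bracketed tags to restore the markup."""
    return re.sub(r"\[(<[^>]+>)\]", r"\1", stream)


source = "<b>Takk</b> fyrir"
protected = deformat(source)   # tags are now marked as format information
restored = reformat(protected)  # round trip gives back the original markup
```

In this sketch only the material outside the brackets would be handed to the linguistic modules; the bracketed spans travel through the pipeline unchanged.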
  • 32. Morphological analyser: Segments the source text into surface forms (SFs). Assigns to each SF one or more lexical forms (LFs), each one with a lemma, a lexical category (part of speech) and morphological inflection information. Processes contractions (en: can’t = can + not; is: talarðu = talar + þú, ertu = ert + þú) and multi-word units, which may be invariable (is: með öðrum orðum, við hlíðina á) or variable (is: brjóta af sér → braut af sér). Reads finite-state transducers generated from a morphological dictionary in XML (using a compiler).
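The mapping from surface forms to lexical forms can be pictured with a small dictionary lookup. The sketch below is an assumption-laden toy: the dictionary entries, the tag names and the `^surface/analysis$` output notation are illustrative choices for this example, and a Python dict stands in for the compiled finite-state transducer.

```python
from typing import Dict, List

# Toy morphological dictionary: surface form -> list of lexical forms.
# Each lexical form is written as lemma plus illustrative tags.
TOY_DICTIONARY: Dict[str, List[str]] = {
    "books": ["book<n><pl>", "book<vblex><pri><p3><sg>"],  # ambiguous
    "can't": ["can<vaux><pres>+not<adv>"],                 # contraction
    "the": ["the<det><def>"],
}


def analyse(surface_form: str) -> str:
    """Return the surface form followed by all of its lexical forms."""
    analyses = TOY_DICTIONARY.get(surface_form.lower())
    if analyses is None:
        # Unknown word: keep it, marked with an asterisk.
        return f"^{surface_form}/*{surface_form}$"
    return "^" + "/".join([surface_form] + analyses) + "$"


def analyse_text(text: str) -> str:
    """Analyse a whitespace-separated text, one unit per surface form."""
    return " ".join(analyse(token) for token in text.split())
```

Here `analyse("books")` yields two lexical forms, which is exactly the kind of ambiguity a later disambiguation step has to resolve, while `analyse("can't")` shows one surface form expanding into two lexical units.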
  • 33. Categorial disambiguator (part-of-speech tagger): Picks one of the LFs corresponding to each ambiguous SF (about 30% of them) according to context. Uses hidden Markov models and hand-written constraint rules. Is trained using representative corpora for the source language (manually disambiguated or not) or, recently, using statistical models for the TL. Its behaviour is completely specified by an XML file.
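To make the idea of choosing a category from context concrete, here is a heavily simplified Viterbi-style sketch in Python. It is not the tagger described in the slide: it assumes the candidate categories for each word are already known, uses only start and transition probabilities (no emission model, no constraint rules), and all probabilities and category names are invented for the example.

```python
from typing import Dict, List, Tuple


def pick_tags(
    candidates: List[List[str]],
    start: Dict[str, float],
    trans: Dict[str, Dict[str, float]],
) -> List[str]:
    """Choose the most probable tag sequence given per-word candidate tags."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best: Dict[str, Tuple[float, List[str]]] = {
        tag: (start.get(tag, 0.0), [tag]) for tag in candidates[0]
    }
    for tags in candidates[1:]:
        new_best: Dict[str, Tuple[float, List[str]]] = {}
        for tag in tags:
            # Extend every surviving path by `tag` and keep the best one.
            prob, path = max(
                (
                    (p * trans.get(prev, {}).get(tag, 0.0), prev_path)
                    for prev, (p, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[tag] = (prob, path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Invented toy model: a determiner is usually followed by a noun,
# a pronoun is usually followed by a verb.
START = {"det": 0.6, "prn": 0.2, "n": 0.1, "vblex": 0.1}
TRANS = {
    "det": {"n": 0.9, "vblex": 0.05},
    "prn": {"n": 0.1, "vblex": 0.8},
}

after_determiner = pick_tags([["det"], ["n", "vblex"]], START, TRANS)
after_pronoun = pick_tags([["prn"], ["n", "vblex"]], START, TRANS)
```

The same ambiguous second word receives a different category depending on what precedes it, which is the contextual behaviour the slide attributes to the disambiguator.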
  • 34. Structural transfer /1: It is based on finite-state techniques (finite-state recognizers). The XML transfer-rule file is preprocessed for faster interpreting. Rules have a pattern–action form. It detects LF patterns to be processed using a left-to-right, longest-match strategy. It executes the actions associated with each pattern in the rule file to generate the corresponding LF pattern for the TL.
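The left-to-right, longest-match, pattern–action strategy can be sketched as follows. This is a toy in Python under stated assumptions: lexical forms are reduced to (lemma, category) pairs, the two reordering rules are invented for the example, lexical transfer is left out, and a dictionary lookup over category sequences stands in for the finite-state pattern recognizer.

```python
from typing import Callable, Dict, List, Tuple

LF = Tuple[str, str]  # (lemma, category)
Action = Callable[[List[LF]], List[LF]]


def swap_adj_noun(lfs: List[LF]) -> List[LF]:
    """Action for the pattern adj n: output n adj."""
    return [lfs[1], lfs[0]]


def det_adj_noun(lfs: List[LF]) -> List[LF]:
    """Action for the pattern det adj n: output det n adj."""
    return [lfs[0], lfs[2], lfs[1]]


# Hypothetical rule table: category pattern -> action.
RULES: Dict[Tuple[str, ...], Action] = {
    ("adj", "n"): swap_adj_noun,
    ("det", "adj", "n"): det_adj_noun,
}


def transfer(lfs: List[LF]) -> List[LF]:
    """Scan left to right; at each position apply the longest matching rule."""
    output: List[LF] = []
    longest = max(len(pattern) for pattern in RULES)
    i = 0
    while i < len(lfs):
        for length in range(min(longest, len(lfs) - i), 0, -1):
            window = lfs[i:i + length]
            pattern = tuple(category for _, category in window)
            if pattern in RULES:
                output.extend(RULES[pattern](window))
                i += length
                break
        else:
            # No rule matches here: copy the lexical form unchanged.
            output.append(lfs[i])
            i += 1
    return output


sentence = [("the", "det"), ("red", "adj"), ("car", "n"), ("run", "vblex")]
result = transfer(sentence)
```

At the first position both rules could eventually apply to part of the input, but the three-unit pattern wins because it is tried first; the verb matches no pattern and is copied through.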
  • 35. Structural transfer /2: For “harder” language pairs, a three-stage structural transfer is available: (1) patterns of LFs (chunks) are detected, processed and marked; (2) patterns of chunks are detected and processed: this interchunk processing allows for longer-range (“inter-chunk”) syntactic transformations; (3) the output chunks are finished and the resulting LFs are written.
  • 36. Lexical transfer module: Reads each SL LF and generates the corresponding TL LF. Reads finite-state transducers generated from bilingual dictionaries in XML (using a compiler). It is invoked by the structural transfer module.
  • 37. Morphological generator: Generates a TL SF from each TL LF by inflecting it appropriately. It reads finite-state transducers generated from a morphological dictionary in XML (using a compiler).
  • 38. Post-generator: Performs some TL orthographical transformations, such as contractions (ca: de + els → dels; en: can + not → cannot), inserting apostrophes (ca: de + amics → d’amics), etc. It is based on finite-state transducers generated from a post-generation rule dictionary (using a compiler).
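The two Catalan transformations named in the slide can be mimicked with a pair of rewrite rules. The Python sketch below is only an approximation under stated assumptions: regular expressions stand in for the compiled finite-state transducers, the rule set covers just these two cases in lowercase text, and the vowel list is a simplification of Catalan orthography.

```python
import re


def post_generate(text: str) -> str:
    """Apply two toy orthographical rules to generated Catalan text."""
    # Contraction: de + els -> dels. This must run before the apostrophe
    # rule, because "els" also begins with a vowel.
    text = re.sub(r"\bde els\b", "dels", text)
    # Apostrophe: de + vowel-initial word -> d' + word.
    text = re.sub(r"\bde (?=[aeiouàèéíòóú])", "d'", text)
    return text


contracted = post_generate("els llibres de els amics")
apostrophised = post_generate("la casa de amics")
```

The ordering of the rules matters in this sketch: applying the apostrophe rule first would turn "de els" into "d'els" instead of "dels".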
  • 39. Re-formatter: Integrates format information (plain ISO-8859 or UTF-8 text, HTML, RTF, ODT, OpenOffice.org .sxw, etc.) into the translated text. Also used to modify URLs in links for translate-as-you-surf. It is based on finite-state techniques. It is generated (using an XSLT stylesheet) from an XML de-formatter specification file.
  • 40. Language-pair data: The Apertium project hosts the development of a large number of language pairs. Stable language pairs include: es↔ca, es↔gl, es↔pt, en↔ca, en↔es, es↔fr, ca↔oc, ro→es, es→eo, ca→eo. There is also a growing number of language pairs under development. Some include Scandinavian languages (da, sv, nn, nb).
  • 41. Project funding: Funded by: the Ministry of Industry, Tourism and Commerce of Spain (also, the Ministries of Education and Science and of Science and Technology of Spain); the Secretariat for Technology and the Information Society of the Government of Catalonia; the Ministry of Foreign Affairs of Romania; the Universitat d’Alacant; companies: Prompsit Language Engineering, ABC Enciklopedioj, etc.
  • 42. The Apertium community /1: Not the ideal community development situation, but close. In addition to the original (funded) developers, a community has formed around the platform (instigated by Francis Tyers). There are more than 60 developers at sourceforge.net/projects/apertium/, many outside the original group; the code is updated very frequently, with hundreds of SVN commits per month. A collectively maintained wiki shows the current state of development and offers tips for people building new language pairs or code.
  • 43. The Apertium community /2: Externally developed tools and code: a graphical user interface, apertium-tolk, and the diagnostic tool apertium-view; plugins for OpenOffice.org or the Pidgin (previously Gaim) messaging program; Windows ports, etc. Many people gather and interact in the #apertium IRC channel (at freenode.net). Stable packages have been ported to Debian GNU/Linux (and the next Ubuntu).
  • 44. Apertium for Icelandic /1: To build, for instance, a GPL apertium-is-en prototype: one could reuse the en dictionaries in apertium-en-ca or apertium-en-es (analysis and generation), and the part-of-speech taggers too; one should build an is dictionary by getting some inspiration from existing (incomplete) data in Apertium for sv, da, fo..., by using Wiktionary [an experiment by Francis Tyers: http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/incubator/apertium-fo-is.is.dix?view=markup], or by convincing the authors of icemorphy or tungutorg to release (part of) their data under the GPL license.
  • 45. Apertium for Icelandic /2: One could train an is part-of-speech tagger, perhaps with some help from icetagger or tungutorg. One should build a bilingual is–en dictionary, for instance by completing the English and Icelandic dictionaries in Ergane, or by modifying bilingual dictionaries learned from a sentence-aligned bilingual corpus using Caseli et al.’s ReTraTos (sf.net/projects/retratos). One could then use Sánchez-Martínez and Forcada’s method to learn an initial set of structural transfer rules using the same or a different corpus, and then refine it. A prototype would be available in 1 person·year! Who dares?
  • 46. Apertium for Icelandic /3: Is the time right? The Government of Iceland has agreed on a “Policy on Free and Open-source Software” (“Stefna um frjálsan og opinn hugbúnað”, Mar. 11, 2008). “Giving access to the source code expands the opportunities for adapting and examining security aspects of the software, in addition to allowing for its further development if the producers discontinue it for some reason.” “There is a great need to increase the return on public body investments in software design. [...] Once software has been prepared, it is important that it has the potential of being reused [...] Reusability can be achieved by [...] ensuring that it is free and open-source.”
  • 47. Concluding remarks /1: Icelandic, like any other living language, however small, needs machine translation and has the right to it! The development of open-source MT for Icelandic can have specific, additional effects (increasing expertise, contributing reusable resources, reducing technological dependency). Apertium eases this task. Development of MT for a small language faces a number of challenges: elicitation of linguistic knowledge, need for standard formats, modularity. Apertium offers the last two. Of course, I will be happy to discuss these conclusions!
  • 48. Takk fyrir! Thanks, Hrafn Loftsson, and the rest of the colleagues at Reykjavík University and the University of Iceland for inviting me to this conference and making me feel at home. Thank you all for your attention.
  • 49. I should practice what I preach... This work may be distributed under the terms of the Creative Commons Attribution–Share Alike license (http://creativecommons.org/licenses/by-sa/3.0/) or the GNU GPL v. 3.0 license (http://www.gnu.org/licenses/gpl.html). Dual license! E-mail me to get the sources: mlf@ua.es