Open-source machine translation for Icelandic:
the Apertium platform as an opportunity
Mikel L. Forcada1,2
1 Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant,
E-03071 Alacant (Spain)
2 Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)
April 18, 2008: Icelandic Language Technology Conference
Mikel L. Forcada Open-source MT for Icelandic
Contents
1 Concepts
2 Opportunities from open-source MT systems
3 Challenges
4 The Apertium platform
5 Apertium for Icelandic?
6 Concluding remarks
Open-source and free software
Open-source software is also called free software:
0 anyone can use it for any purpose
1 anyone can examine it to see how it works and modify it for
any new purpose
2 anyone can freely distribute it
3 anyone may release an improved version so that everyone
benefits
For conditions 1 and 3 to be met, anyone should be able to
access the source code, hence the name open source.
Machine translation software/1
MT is special: it strongly depends on data
rule-based MT (RBMT): dictionaries, rules
corpus-based MT (CBMT): sentence-aligned parallel text,
monolingual corpora
Three components in every MT system:
The engine (also called decoder, recombinator, ...)
Data (linguistic data, corpora)
Tools to maintain these data and convert them to the format
used by the engine
Machine translation software/2
I will focus on RBMT. Reasons:
CBMT requires massive amounts of sentence-aligned
parallel text (is there such a resource for Icelandic?).
RBMT may use linguistic data elicited by speakers without
access to existing machine-readable resources.
RBMT is more transparent: errors are easier to diagnose
and debug.
I am more familiar with RBMT!
MT software/3 : commercial machine translation
Most commercial MT systems are RBMT (but:
LanguageWeaver, Google Labs).
They use proprietary technologies which are not disclosed
(perceived as their main competitive advantage).
Only partial modification (customization) of linguistic data
is allowed.
MT software/4: open-source machine translation
For MT to be open-source, the engine, the data and the
tools must all be open-source.
In the case of CBMT this means that corpora must also be
open.
Commercial MT systems and small languages: limited
opportunities
The main MT companies target major world languages.
Not Icelandic... Some closed-source systems:
TranExp’s InterTran offers en↔is “interactive translation”
(with limited lexical coverage): test at
http://www.translation-guide.com/free_online_translators.php?from=Icelandic&to=English
Stefán Briem’s prototypes for is↔en or is↔da may be
tested at tungutorg.is.
A company named ESTeam (www.esteam.gr) is also
listed as offering MT for Icelandic.
It is very hard to adapt closed, commercial MT systems to
small languages
Opportunities from open-source MT systems
Even if reasonable-quality closed-source MT is available,
the development and use of open-source MT systems
provides additional opportunities:
Increases language expertise and resources
Increases technological independence
Increasing expertise and language resources
When building an open-source MT system for a small
language, a variety of situations may occur.
All of them involve building small-language expertise and
resources through
reflection about the small language
elicitation of linguistic (monolingual and bilingual)
knowledge about it
subsequent encoding of this knowledge
The open-source setting makes new expertise and
resources naturally available to the community.
Three scenarios may occur:
Building data for an existing MT engine from scratch
One needs:
A freely available (open-source or not) MT engine
Freely available (open-source or not) tools to manage
linguistic data
Complete documentation on how to build linguistic data for
use with the engine and tools
This is a very unfavourable setting. Many decisions have to
be made, e.g., defining the set of lexical categories and
inflection indicators.
The blank sheet syndrome may paralyze the project.
If overcome, the expertise acquired and the resulting
open-source data could be improved or used for other
purposes: positive effect on the small language.
Building data for an existing MT engine from existing
language-pair data
If free tools and engine and open-source data are available
for another pair with a similar or related language, the
blank sheet syndrome is drastically reduced. One could,
for example:
use the same set of lexical categories and inflection
indicators
build inflection paradigms on top of existing ones
Adapting a new open-source engine or tools for a new
language pair
If source code is available for the engine and tools, experts
could enhance or adapt them to address new features of
the small language not dealt with adequately by the current
code:
character sets
structural transfer not powerful enough, etc.
More challenging than building new data
But programmers do not need to have full command of the
small language (abstract management of linguistic issues
possible).
Code rewriting would add expertise and resources to the
language community.
Increasing technological independence
Having an open-source engine, tools and data makes
users of the small language less dependent on a single
commercial, closed-source provider.
This has an analogous effect, not only on machine
translation, but also on other language technologies.
Organizing community development/1
Assume we are just developing linguistic data.
Open-source makes it possible for a small-language
community to collaboratively develop machine translation
for it.
Some small languages have people with good linguistic
and translation skills (this is the case of Icelandic).
However, having people with language and
translation skills is necessary but not sufficient.
Organizing community development/2
Some structure is necessary. Ideally:
A coordinating team mastering the engine and tools used
is needed to lead the effort, including:
code coordinators (installation, maintenance, modifications
to the code)
linguistic coordinators (linguistic data maintenance)
A project web server
to distribute the latest version of the system
to execute it online
for developers to contribute new linguistic data or code
A group of skilled developers, certified in some sense by
the coordinating team.
Eliciting linguistic knowledge
Existing linguistic knowledge has to be made explicit (elicited)
to contribute it to the system.
Elicitation of lexical knowledge is possible through
well-designed web form interfaces:
to provide the lemmas of the source and target word
to select the inflection paradigm of the source and target
word
to establish the scope of the equivalence (bidirectional,
left-to-right, right-to-left).
Elicitation of other knowledge (e.g., structural transfer
rules) is harder (a subject of research indeed).
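As an illustration of what such a web form could produce, here is a minimal Python sketch that turns the three form fields above (lemmas, paradigm, direction of the equivalence) into XML dictionary entries. The element and attribute names are modelled loosely on Apertium's dictionary XML and should be checked against the actual DTDs; the paradigm name and the sample words are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def bilingual_entry(sl_lemma, tl_lemma, category, direction="both"):
    """Build one bilingual entry from form fields.

    direction: "both", "LR" (left-to-right only) or "RL" (right-to-left only).
    """
    if direction not in ("both", "LR", "RL"):
        raise ValueError(f"unknown direction: {direction}")
    entry = ET.Element("e")
    if direction != "both":
        entry.set("r", direction)  # restrict the equivalence to one direction
    pair = ET.SubElement(entry, "p")
    for tag, lemma in (("l", sl_lemma), ("r", tl_lemma)):
        side = ET.SubElement(pair, tag)
        side.text = lemma
        ET.SubElement(side, "s", n=category)  # lexical-category symbol
    return entry


def monolingual_entry(lemma, stem, paradigm):
    """Build one monolingual entry: a stem plus the selected inflection paradigm."""
    entry = ET.Element("e", lm=lemma)
    ET.SubElement(entry, "i").text = stem
    ET.SubElement(entry, "par", n=paradigm)
    return entry


if __name__ == "__main__":
    # "hest/ur__n" is a hypothetical paradigm name, not taken from real data.
    print(ET.tostring(monolingual_entry("hestur", "hest", "hest/ur__n"), encoding="unicode"))
    print(ET.tostring(bilingual_entry("hestur", "horse", "n", "LR"), encoding="unicode"))
```

The contributor only chooses values from the form; the XML is generated for them, which keeps the required linguistic and technical knowledge low.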
Simplicity of linguistic knowledge needed
To encourage and ease collaborative development, the level of
linguistic knowledge necessary to start building a new MT system
should be kept to a minimum (basic high-school grammar skills
and concepts).
This is rather easy in shallow-transfer MT systems.
But it is very difficult (if not impossible) for deep-transfer
systems.
Well-written documentation may be very helpful. Having
someone available online to ask questions to is even better.
Standardization and documentation of linguistic data
formats
An adequate documentation of the format of linguistic data
is crucial.
The way: using XML. Why?
Each data item is explicitly labeled with a descriptive,
named tag with a clear meaning attached
The structure of documents may easily be validated against
DTDs or schemas
Many technologies exist for XML (converting from and to
XML, interoperability).
Modularity
The emphasis of open-source is the reusability of code
and linguistic data to build new MT systems or other
language-technology applications.
For that objective modularity is a must.
A modular engine induces modularity in its data.
For example, having an independent morphological
analyser and an independent morphological dictionary
Makes it easier to build an MT system for a different target
language
May be used to build an intelligent search engine
(inflection-independent search)
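A minimal Python sketch of the inflection-independent search idea: the same morphological data that would serve an MT system is reused to index documents by lemma. The tiny form-to-lemma table and the sample documents are assumptions made for the example; a real system would query the compiled morphological dictionary instead.

```python
# Toy morphological data: surface form -> lemma (hypothetical sample, one noun only).
ANALYSES = {
    "hestur": "hestur", "hest": "hestur", "hesti": "hestur", "hests": "hestur",
    "hestar": "hestur", "hesta": "hestur", "hestum": "hestur",
}


def lemmatize(word):
    """Return the lemma of a word, or the lower-cased word itself if unknown."""
    word = word.lower()
    return ANALYSES.get(word, word)


def build_index(documents):
    """Map each lemma to the set of document ids in which any of its forms occurs."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.split():
            lemma = lemmatize(word.strip(".,;:!?"))
            index.setdefault(lemma, set()).add(doc_id)
    return index


def search(index, query):
    """Find documents containing any inflected form of the query word."""
    return sorted(index.get(lemmatize(query), set()))


if __name__ == "__main__":
    docs = {"d1": "Ég sá hest.", "d2": "Hestar eru dýr.", "d3": "Hún á kött."}
    print(search(build_index(docs), "hestum"))  # ['d1', 'd2']
```

The search code knows nothing about Icelandic; replacing the table is enough to reuse it for another language.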
Background
Apertium is based on the technologies developed by the
Transducens group at the Universitat d’Alacant during the
development of two existing systems:
interNOSTRUM (interNOSTRUM.com, Spanish–Catalan)
Tradutor Universia (tradutor.universia.net,
Spanish–Portuguese)
These technologies, initially designed for related-language
pairs, have been extended to handle language pairs which are
not so related.
Rationale /1
To generate translations which are
reasonably intelligible and
easy to correct
between related languages such as Spanish (es) and Catalan
(ca) or Portuguese (pt), etc., or Nynorsk (nn), Bokmål (no)
and Icelandic (is), one can just augment word for word
translation with
robust lexical processing (including multi-word units)
lexical categorial disambiguation (part-of-speech tagging)
local structural processing based on simple and
well-formulated rules for frequent structural
transformations (reordering, agreement)
Rationale /2
For harder, not so related, language pairs:
It should be possible to build on that simple model.
It should be possible to generalize its concepts so that
complexity is kept as low as possible.
Rationale /3
It should be possible to generate the whole system from
linguistic data (monolingual and bilingual dictionaries,
grammar rules) specified in a declarative way.
This information should be provided in an interoperable
format ⇒ XML. These are the different types of data:
(language-independent) rules to treat text formats
specification of the part-of-speech tagger
morphological and bilingual dictionaries and dictionaries of
orthographical transformation rules
structural transfer rules
Rationale /4
It should be possible to have a single generic
(language-independent) engine reading language-pair
data (“separation of algorithms and data”).
Language-pair data should be preprocessed so that the
system is fast (>10,000 words per second) and compact;
for example, lexical transformations are performed by
minimized finite-state transducers (FSTs).
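The separation of algorithms and data can be sketched in Python as follows: a compiler turns dictionary entries into a letter trie once, and a generic lookup routine then reads it. A trie is only a simple stand-in here; it is not a minimized finite-state transducer, and the tag names in the sample entries are assumptions made for the example.

```python
def compile_dictionary(entries):
    """Compile (surface form, analysis) pairs into a letter trie of nested dicts."""
    root = {}
    for surface, analysis in entries:
        node = root
        for ch in surface:
            node = node.setdefault(ch, {})
        # The key None marks a final state and stores the analyses found there.
        node.setdefault(None, []).append(analysis)
    return root


def lookup(trie, surface):
    """Language-independent engine: follow one arc per letter, return analyses."""
    node = trie
    for ch in surface:
        node = node.get(ch)
        if node is None:
            return []  # no path for this word
    return node.get(None, [])


if __name__ == "__main__":
    # Hypothetical sample entries; the tags are illustrative only.
    trie = compile_dictionary([
        ("hestur", "hestur<n><nom>"),
        ("hest", "hestur<n><acc>"),
    ])
    print(lookup(trie, "hest"))  # ['hestur<n><acc>']
```

Lookup time depends on the length of the word, not on the size of the dictionary, which is the property that makes precompiled data fast.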
Rationale /5
Reasons for the open-source development of Apertium:
To give everyone free, unlimited access to the best
possible machine-translation technologies.
To establish a modular, documented, open platform for
shallow-transfer machine translation and other human
language processing tasks.
To favour the interchange and reuse of existing linguistic
data.
To make integration with other open-source technologies
easier.
Rationale /6
More reasons for open-source development of Apertium:
To benefit from collaborative development
of the machine translation engine
of language-pair data for currently existing or new language
pairs
from industries, academia and small-language support
organizations.
To help shift MT business from the obsolescent
licence-centered model to a service-centered model.
To radically guarantee the reproducibility of machine
translation and natural language processing research.
Because it does not make sense to use public funds to
develop non-free, closed-source software.
The Apertium platform
Apertium is an open-source machine translation platform
(http://www.apertium.org) providing:
1 An open-source modular shallow-transfer machine
translation engine with:
text format management
finite-state lexical processing
statistical lexical disambiguation
shallow transfer based on finite-state pattern matching
2 Open-source linguistic data in well-specified XML formats
for a variety of language pairs
3 Open-source tools: compilers to turn linguistic data into a
fast and compact form used by the engine and software to
learn disambiguation or structural transfer rules.
The Apertium engine/1
SL text→ De-formatter
↓
Morphological analyser [←FST]
↓
Categorial disambiguator [←FST+stat.]
↓
[rules→] Structural transfer ↔ Lexical transfer [←FST]
↓
Morphological generator [←FST]
↓
Post-generator [←FST]
↓
Re-formatter →TL text
The Apertium engine/2
Communication between modules: text (Unix “pipelines”).
Advantages:
Simplifies diagnosis and debugging
Allows the modification of data between two modules
using, e.g., filters
Makes it easy to insert alternative modules (interesting for
research and development purposes)
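The advantages listed above can be illustrated with a small Python sketch in which every module is a function from text to text. The two toy modules and the tracing filter are assumptions made for the example and do not reproduce the real Apertium programs.

```python
from functools import reduce


def run_pipeline(modules, text):
    """Feed text through each module in order, like a Unix pipeline."""
    return reduce(lambda data, module: module(data), modules, text)


def toy_analyser(text):
    """Stand-in for an analysis module: wrap each word in ^...$ delimiters."""
    return " ".join(f"^{word}$" for word in text.split())


def toy_generator(text):
    """Stand-in for a generation module: remove the ^...$ delimiters again."""
    return " ".join(token.strip("^$") for token in text.split())


def trace(label, log):
    """Build a filter that records the intermediate text and passes it on unchanged."""
    def module(text):
        log.append((label, text))
        return text
    return module


if __name__ == "__main__":
    log = []
    modules = [toy_analyser, trace("after analysis", log), toy_generator]
    print(run_pipeline(modules, "hello world"))  # hello world
    print(log)  # [('after analysis', '^hello$ ^world$')]
```

Because modules only exchange text, a diagnostic filter or an alternative module can be inserted by editing the list, without touching the other modules.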
De-formatter
Separates text from format information.
Currently available for ISO-8859 or UTF-8 plain text,
HTML, RTF, ODF, OpenOffice.org .sxw, etc.
Based on finite-state techniques.
Most of these filters are generated (using an XSLT
stylesheet) from an XML de-formatter specification file.
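A minimal Python sketch of separating text from format information, assuming HTML-like input in which anything between angle brackets is format. The regular expression and the segment representation are assumptions made for the example; the real filters are generated from the XML specification mentioned above.

```python
import re

# Anything that looks like <...> is treated as format information (simplification).
TAG = re.compile(r"(<[^>]+>)")


def deformat(html):
    """Split the input into ('format', chunk) and ('text', chunk) segments."""
    segments = []
    for chunk in TAG.split(html):
        if chunk:
            kind = "format" if TAG.fullmatch(chunk) else "text"
            segments.append((kind, chunk))
    return segments


def reformat(segments, translate):
    """Put format back around the text, translating only the text segments."""
    return "".join(translate(chunk) if kind == "text" else chunk
                   for kind, chunk in segments)


if __name__ == "__main__":
    segments = deformat("<p>Hello <b>world</b></p>")
    # str.upper stands in for the translation engine.
    print(reformat(segments, str.upper))  # <p>HELLO <b>WORLD</b></p>
```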
Morphological analyser
segments the source text into surface forms (SFs),
assigns to each SF one or more lexical forms (LFs), each
one with:
lemma
lexical category (part-of-speech)
morphological inflection information
processes contractions (en: can’t=can+not; is:
talarðu=talar+þú, ertu=ert+þú) and multi-word units which
may be invariable (is: með öðrum orðum, við hlíðina á) or
variable (is: brjóta af sér → braut af sér).
reads finite-state transducers generated from a
morphological dictionary in XML (using a compiler).
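The behaviour described above can be sketched in Python with a plain lookup table and a left-to-right, longest-match scan over whitespace-separated words. The lexicon, the tag names and the stream notation (`^surface/analysis$`, with `*` marking unknown words) are assumptions made for the example; the real module reads compiled finite-state transducers.

```python
# Hypothetical lexicon: tuple of words -> list of lexical forms (LFs).
LEXICON = {
    ("talarðu",): ["tala<vblex>+þú<prn>"],              # contraction: two LFs joined
    ("með", "öðrum", "orðum"): ["með öðrum orðum<adv>"],  # invariable multi-word unit
    ("með",): ["með<pr>"],
}


def analyse(text, lexicon=LEXICON):
    """Segment text into surface forms and attach the lexical forms of each."""
    words = text.lower().split()
    max_len = max(len(key) for key in lexicon)
    units, i = [], 0
    while i < len(words):
        # Try the longest candidate first, so multi-word units win over single words.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in lexicon:
                units.append((" ".join(key), lexicon[key]))
                i += n
                break
        else:
            units.append((words[i], ["*" + words[i]]))  # unknown word
            i += 1
    return units


def to_stream(units):
    """Serialize the units as text for the next module."""
    return " ".join("^" + "/".join([sf] + lfs) + "$" for sf, lfs in units)


if __name__ == "__main__":
    print(to_stream(analyse("Með öðrum orðum talarðu")))
```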
Categorial disambiguator (part-of-speech tagger)
picks one of the LFs corresponding to each ambiguous SF
(about 30% of them) according to context
uses hidden Markov models and hand-written constraint
rules
is trained using representative corpora for the source
language (manually disambiguated or not) or, recently,
using statistical models for the TL
its behavior is completely specified by an XML file
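To make the hidden-Markov-model step concrete, here is a minimal Python sketch of Viterbi decoding over the candidate categories of each word. It is a simplification under stated assumptions: emission probabilities are taken as uniform over each word's candidates, the hand-written constraint rules are left out, and the tags and probabilities are invented for the example.

```python
import math


def viterbi(candidates, trans, start, floor=1e-6):
    """Pick the most likely tag sequence.

    candidates: one list of possible tags per word.
    trans: {(previous_tag, tag): probability}; start: {tag: probability}.
    Missing probabilities fall back to a small floor value.
    """
    # best maps each tag of the current word to (log-probability, best path so far).
    best = {t: (math.log(start.get(t, floor)), [t]) for t in candidates[0]}
    for tags in candidates[1:]:
        step = {}
        for t in tags:
            score, prev_path = max(
                (s + math.log(trans.get((p, t), floor)), path)
                for p, (s, path) in best.items()
            )
            step[t] = (score, prev_path + [t])
        best = step
    return max(best.values())[1]


if __name__ == "__main__":
    # Three words; the middle one is ambiguous between noun and verb.
    candidates = [["det"], ["n", "vb"], ["vb"]]
    trans = {("det", "n"): 0.9, ("det", "vb"): 0.05,
             ("n", "vb"): 0.8, ("vb", "vb"): 0.1}
    print(viterbi(candidates, trans, {"det": 1.0}))  # ['det', 'n', 'vb']
```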
Structural transfer /1
It is based on finite-state techniques (finite-state
recognizers).
The XML transfer-rule file is preprocessed for faster
interpreting.
Rules have a pattern–action form.
It detects LF patterns to be processed using a left-to-right,
longest-match strategy.
It executes the actions associated with each pattern in the
rule file to generate the corresponding LF pattern for the
TL.
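A minimal Python sketch of pattern-action rules applied with a left-to-right, longest-match strategy. Lexical forms are represented as dictionaries and rules as Python functions, both assumptions made for the example; in Apertium the rules are written in XML. The sample rules reorder an adjective-noun sequence and copy the noun's gender, as one might do when translating into Spanish.

```python
def transfer(lfs, rules):
    """Scan the LFs left to right and apply the longest matching rule at each position."""
    max_len = max(len(pattern) for pattern in rules)
    out, i = [], 0
    while i < len(lfs):
        for n in range(min(max_len, len(lfs) - i), 0, -1):
            pattern = tuple(lf["cat"] for lf in lfs[i:i + n])
            if pattern in rules:
                out.extend(rules[pattern](lfs[i:i + n]))  # run the action
                i += n
                break
        else:
            out.append(lfs[i])  # no rule matches: copy the LF unchanged
            i += 1
    return out


def det_adj_noun(lfs):
    """Action: reorder to det-noun-adj and make det and adj agree with the noun."""
    det, adj, noun = lfs
    gender = noun["gender"]
    return [dict(det, gender=gender), noun, dict(adj, gender=gender)]


def adj_noun(lfs):
    """Action: reorder to noun-adj and make the adjective agree with the noun."""
    adj, noun = lfs
    return [noun, dict(adj, gender=noun["gender"])]


RULES = {("det", "adj", "n"): det_adj_noun, ("adj", "n"): adj_noun}

if __name__ == "__main__":
    phrase = [{"lemma": "el", "cat": "det"},
              {"lemma": "rojo", "cat": "adj"},
              {"lemma": "casa", "cat": "n", "gender": "f"}]
    print([(lf["lemma"], lf["gender"]) for lf in transfer(phrase, RULES)])
    # [('el', 'f'), ('casa', 'f'), ('rojo', 'f')]
```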
Structural transfer /2
For “harder” language pairs, a three-stage structural transfer is
available:
Patterns of LFs (chunks) are detected, processed and
marked
Patterns of chunks are detected and processed: this
interchunk processing allows for longer-range
(“inter-chunk”) syntactic transformations
The output chunks are finished and the resulting LFs are
written.
Lexical transfer module
reads each SL LF and generates the corresponding TL LF
reads finite-state transducers generated from bilingual
dictionaries in XML (using a compiler).
invoked by the structural transfer module
Morphological generator
Generates a TL SF from each TL LF, after inflecting it
adequately
It reads finite-state transducers generated from a
morphological dictionary in XML (using a compiler)
Post-generator
Performs some TL orthographical transformations, such as
contractions (ca: de +els → dels; en: can + not →
cannot), inserting apostrophes (ca: de + amics →
d’amics), etc.
It is based on finite-state transducers generated from a
post-generation rule dictionary (using a compiler).
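The two Catalan transformations above can be sketched in Python with ordered rewrite rules. Regular expressions stand in for the compiled finite-state transducers, the rule list is an assumption made for the example, and it covers only lower-case input and only these two cases (an ASCII apostrophe is used in the output).

```python
import re

# Ordered rules: the contraction must be tried before the apostrophe rule,
# otherwise "de els" would become "d'els" instead of "dels".
CATALAN_RULES = [
    (re.compile(r"\bde els\b"), "dels"),
    (re.compile(r"\bde ([aeiouàèéíòóú]\w*)"), r"d'\1"),
]


def postgenerate(text, rules=CATALAN_RULES):
    """Apply each orthographic rule in order to the generated text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(postgenerate("la casa de els amics"))  # la casa dels amics
    print(postgenerate("un grup de amics"))      # un grup d'amics
```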
Re-formatter
Integrates format information (plain ISO-8859 or UTF-8
text, HTML, RTF, ODT, OpenOffice.org .sxw, etc.) into the
translated text.
Also used to modify URLs in links for translate-as-you-surf.
It is based on finite-state techniques.
It is generated (using an XSLT stylesheet) from an XML
de-formatter specification file.
Language-pair data
The Apertium project hosts the development of a large number
of language pairs:
Stable language pairs include: es↔ca, es↔gl, es↔pt,
en↔ca, en↔es, es↔fr, ca↔oc, ro→es, es→eo,
ca→eo.
There is also a growing number of language pairs under
development. Some include Scandinavian languages (da,
sv, nn, nb).
Project funding
Funded by
The Ministry of Industry, Tourism and Commerce of Spain
(also, the Ministries of Education and Science and of
Science and Technology of Spain)
The Secretariat for Technology and the Information Society
of the Government of Catalonia
The Ministry of Foreign Affairs of Romania
The Universitat d’Alacant
Companies: Prompsit Language Engineering, ABC
Enciklopedioj, etc.
The Apertium community/1
Not the ideal community development situation, but close.
In addition to the original (funded) developers, a community has
formed around the platform (instigated by Francis Tyers).
More than 60 developers in
sourceforge.net/projects/apertium/, many
outside the original group; code updated very frequently,
hundreds of monthly SVN commits.
A collectively-maintained wiki shows the current
development and tips for people building new language
pairs or code.
The Apertium community/2
Externally developed tools and code:
a graphical user interface apertium-tolk, and the
diagnostic tool apertium-view
plugins for OpenOffice.org or the Pidgin (previously Gaim)
messaging program
Windows ports, etc.
Many people gather and interact in the #apertium IRC
channel (at freenode.net).
Stable packages ported to Debian GNU/Linux (and the
next Ubuntu).
Apertium for Icelandic /1
To build, for instance, a GPL apertium-is-en prototype:
one could reuse the en dictionaries in apertium-en-ca
or apertium-en-es (analysis and generation) and the
part-of-speech taggers too
one should build an is dictionary:
getting some inspiration from existing (incomplete) data in
Apertium for sv, da, fo. . .
using Wiktionary [an experiment by Francis Tyers:
http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/incubator/apertium-fo-is.is.dix?view=markup]
convincing the authors of icemorphy or tungutorg to
release (part of) their data under the GPL license.
Apertium for Icelandic /2
one could train an is part-of-speech tagger, perhaps with
some help from icetagger or tungutorg
one should build a bilingual is–en dictionary, for instance:
by completing the English and Icelandic dictionaries in
Ergane
by modifying bilingual dictionaries learned from a
sentence-aligned bilingual corpus using Caseli et al.’s
ReTraTos (sf.net/projects/retratos)
one could then use Sánchez-Martínez and Forcada’s
method to learn an initial set of structural transfer rules
using the same or a different corpus, and then refine it.
A prototype would be available in 1 person·year! Who dares?
Apertium for Icelandic /3
Is the time right? The Government of Iceland has agreed on a
“Policy on Free and Open-source Software” (“Stefna um
frjálsan og opinn hugbúnað”, Mar. 11, 2008).
“Giving access to the source code expands the
opportunities for adapting and examining security aspects
of the software, in addition to allowing for its further
development if the producers discontinue it for some
reason.”
“There is a great need to increase the return on public body
investments in software design. [...] Once software has
been prepared, it is important that it has the potential of
being reused [...] Reusability can be achieved by [...]
ensuring that it is free and open-source.”
Concluding remarks /1
Icelandic, as any other living language, however small,
needs machine translation and has the right to it!
The development of open-source MT for Icelandic can
have specific, additional effects (increasing expertise,
contributing reusable resources, reducing technological
dependency). Apertium eases this task.
Development of MT for a small language faces a number of
challenges: elicitation of linguistic knowledge, need for
standard formats, modularity. Apertium offers the last two.
Of course, I will be happy to discuss these conclusions!
Takk fyrir!
Thanks, Hrafn Loftsson, and the rest of the colleagues at
Reykjavík University and the University of Iceland for
inviting me to this conference and making me feel at home.
Thank you all for your attention.
I should practice what I preach. . .
This work may be distributed under the terms of
the Creative Commons Attribution–Share Alike license:
http://creativecommons.org/licenses/by-sa/3.0/
the GNU GPL v. 3.0 License:
http://www.gnu.org/licenses/gpl.html
Dual license! E-mail me to get the sources: mlf@ua.es