Apertium was launched in 2005 as a platform for rule-based machine translation (RBMT). After a decade of creating MT systems (more than 40 stable language pairs are currently available), it is no longer just a platform for MT. Its open-source nature has turned Apertium into an extensive language resource base, shared under free licenses, that is downloaded hundreds of times a week and receives thousands of contributions a year. In this presentation we will list and quantify what can currently be found in Apertium and the benefits that the open-source model for data sharing has brought. We will also discuss what we lack and what is on the platform's near-term roadmap.
Apertium: an extensive and shared language resource base for MT and much more, Gema Ramirez Sanchez, Prompsit Language Engineering
1. Apertium: an extensive
and shared LR base for
MT and much more
Gema Ramírez Sánchez (a proud Apertium activist) –
Prompsit Language Engineering – gramirez@prompsit.com
TAUS Roundtable – Barcelona, 12 May 2016
2. What is this talk about?
• This is not a talk about Apertium
• although it will be mentioned in every slide
• This is not a talk about MT
• although Apertium is a platform for MT
• This is a talk about data and linguistic tools
inside Apertium that are useful for MT:
• all MT approaches can benefit: SMT, NMT, RBMT…
• but this is not a talk about corpora!!!
6. Where do Apertium downloads go?
Apertium web application
Portals, third party tools
Corporate servers (offline)
Public bodies servers
Apertium Android apps
(offline!!)
Personal PCs (users,
developers, researchers)
Apertium community!!
7. Apertium community: mostly
data contributors
Community in SourceForge (April 2016)
Contributors: 7 admins, 400 developers
Contributions: 67,896 commits
8. A brief history of data
in Apertium
Year | Milestone | Language pairs
2004 | The Spanish Ministry of Industry funds a consortium to build FOSS MT for the languages of Spain | -
2005 | Apertium RBMT platform is launched, providing engine, tools and data under free licenses | 3 pairs: es-ca, es-gl and es-pt
2005-2009 | Language pair-driven innovation, still very European-focused language pairs | +19: fr, en, eo, ro, eu, oc, cy, nn, nb, sv, da, is, mk, bg, ast, br
2010 | Five years on! | 22 pairs!!!
2011-2015 | Consolidated community, support for non-European languages, new tools and reorganisation of data | +19: af, nl, hr, sr, mt, sl, ara, sme, urd, hin, kaz, tat, id, ms, ar
2016 | Eleven years on! | 41 pairs!!!
11. A language pair in Apertium
Language pair organisation
1 pack = 1 language pair
2 monodixes, 1 bidix
2 sets of rules (levels 1/3)
2 tagsets + probabilities
2 plain/tagged corpora
2 post-dixes
New language pair organisation
2 monolingual packs
1 monodix
1 tagset + probabilities
1 plain/tagged corpus
1 post-dix
1 bilingual pack
1 bidix
2 sets of rules (levels 1/3)
Format: all data are xml-based files
Size:
Monodixes: 5k-90k lemmata=37k-13M surface forms; coverage: 85-97%
Bidixes: 8k-90k bilingual entries (lemmata)
Rules: 100 (one level) - 300 (three levels) per translation direction
License: GNU General Public License
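Since all data are XML-based files, they can be inspected with any standard XML library. The following is a minimal sketch, assuming a heavily trimmed monodix in the usual .dix layout (symbol definitions in `sdef`, entries in `e` with the lemma in the `lm` attribute). The entries, the paradigm names and the helper functions are hypothetical and only serve as illustration; real dictionaries also define the paradigms themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed monolingual dictionary in .dix style.
# Paradigm definitions (pardefs) are omitted; the paradigm names are made up.
MONODIX = """
<dictionary>
  <sdefs>
    <sdef n="n"/>
    <sdef n="adv"/>
  </sdefs>
  <section id="main" type="standard">
    <e lm="gema"><i>gema</i><par n="casa__n"/></e>
    <e lm="ahora mismo"><i>ahora<b/>mismo</i><par n="ahora__adv"/></e>
  </section>
</dictionary>
"""


def list_lemmata(dix_xml):
    """Return the lemma of every entry that declares one."""
    root = ET.fromstring(dix_xml)
    return [e.get("lm") for e in root.iter("e") if e.get("lm")]


def list_tags(dix_xml):
    """Return the symbols (tags) declared in the dictionary."""
    root = ET.fromstring(dix_xml)
    return [s.get("n") for s in root.iter("sdef")]


print(list_lemmata(MONODIX))
print(list_tags(MONODIX))
```

Counting the entries returned by `list_lemmata` on a real monodix gives the kind of lemma figures quoted above.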
12. Linguistic tools in Apertium
[Diagram: each data resource feeds a tool]
Monodix → Morphological analyser
Tagset+prob → PoS tagger
Bidix → Lexical transfer
Rules → Structural transfer
Monodix → Morphological generator
Post-dix → Post-generator
All tools chained together → Full MT
What are these tools useful for?
13. Morphological analyser:
“Gema lo explica ahora mismo”
= expanded-morphology text
Gema.np.f.sg | gema.n.f.sg
lo.detnt | lo.prn.pro.m.sg
explicar.vb.pri.3.sg | explicar.vb.imp.3.sg
ahora mismo.adv
= lemmatised text
Gema/gema, el/lo, explicar, ahora mismo
= smart tokenized text
^Gema$ ^lo$ ^explica$ ^ahora_mismo$
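The three views above (expanded morphology, lemmatised text, tokens) can all be derived from the analyser output. The sketch below works on the simplified dotted notation used on this slide, not on Apertium's actual stream format, and the function names are hypothetical. It assumes lemmata contain no periods.

```python
def parse_analysis(analysis):
    """Split 'lemma.tag1.tag2' into (lemma, [tags])."""
    lemma, *tags = analysis.strip().split(".")
    return lemma, tags


def parse_ambiguous(line):
    """Split 'a.x.y | b.z' into one (lemma, tags) pair per analysis."""
    return [parse_analysis(a) for a in line.split("|")]


def lemmata(line):
    """Return the distinct lemmata of an ambiguous token, in order."""
    seen = []
    for lemma, _ in parse_ambiguous(line):
        if lemma not in seen:
            seen.append(lemma)
    return seen


print(parse_ambiguous("Gema.np.f.sg | gema.n.f.sg"))
print(lemmata("explicar.vb.pri.3.sg | explicar.vb.imp.3.sg"))
```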
14. Part-of-speech tagger:
“Gema lo explica ahora mismo”
= factored text
Gema.np.f.sg
lo.prn.pro.m.sg
explicar.vb.pri.3.sg
ahora_mismo.adv
= truecase text
^Gema$ ^lo$ ^explica$ ^ahora_mismo$
^gema$ : ^piedra_preciosa$
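Truecasing follows from the analysis the tagger selects: a proper noun keeps its capital, while a sentence-initial common noun would be lowercased. This is a minimal sketch of that idea, assuming the tagger output is already available as a list of tags; `truecase` is a hypothetical helper, not an Apertium function.

```python
def truecase(surface, tags):
    """Keep capitalisation for proper nouns (tag 'np'), lowercase otherwise."""
    return surface if "np" in tags else surface.lower()


print(truecase("Gema", ["np", "f", "sg"]))  # tagged as a proper noun
print(truecase("Gema", ["n", "f", "sg"]))   # tagged as a common noun
```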
15. Lexical transfer:
“Gema lo explica ahora mismo”
= bilingual correspondences (lemmata) by PoS
Gema - Gema
lo - it
explicar - explain
ahora mismo - right now
gema - gemstone
lo - the
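Because correspondences are stored by part of speech, the same lemma can translate differently depending on the tag chosen upstream. A minimal sketch, assuming the bidix has been loaded into a plain dictionary keyed by (lemma, PoS); the table contents come from the slides, while the `transfer` helper and the asterisk used to flag unknown words are illustrative choices.

```python
# Toy bilingual table keyed by (source lemma, PoS tag), taken from the slide.
BIDIX = {
    ("Gema", "np"): "Gema",
    ("gema", "n"): "gemstone",
    ("lo", "prn"): "it",
    ("lo", "detnt"): "the",
    ("explicar", "vb"): "explain",
    ("ahora mismo", "adv"): "right now",
}


def transfer(lemma, pos):
    """Look up the target lemma; flag unknown entries with a leading '*'."""
    return BIDIX.get((lemma, pos), "*" + lemma)


tagged = [("Gema", "np"), ("lo", "prn"), ("explicar", "vb"), ("ahora mismo", "adv")]
print([transfer(lemma, pos) for lemma, pos in tagged])
```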
17. Morphological generator:
“Gema lo explica ahora mismo”
= from expanded text to surface forms:
Gema.np.f.sg
lo.detnt
explicar.vb.pri.3.sg
ahora_mismo.adv
= “recaser” of a factored text:
Gema lo explica ahora_mismo
Gema
lo
explica
ahora mismo
18. Post-generator:
“Gema lo explica ahora mismo”
= nice rendering of a text
Gema
le
explique
maintenant
ALL THIS TAKING INTO ACCOUNT THE FORMAT!!!
Gema
l’explique
maintenant
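The French example shows the kind of orthographic repair a post-generator performs: "le" followed by a vowel becomes "l'". The sketch below reproduces only that one elision with a regular expression. It is an illustration, not how Apertium's post-dix rules are implemented, and it ignores accented vowels, mute h and other elidable words.

```python
import re

# 'le' or 'Le' as a whole word, followed by whitespace and a plain vowel.
ELISION = re.compile(r"\b([Ll])e\s+(?=[aeiouAEIOU])")


def post_generate(text):
    """Contract 'le' + vowel-initial word into l' + word."""
    return ELISION.sub(r"\1'", text)


print(post_generate("Gema le explique maintenant"))
```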
19. How was the data blossoming
possible in Apertium?
Standards
Documentation
Availability
Support
Community building
Funding
And a free/open-source sharing model
20. Next data in Apertium?
Spanish-Italian by the end of the summer
students wanted (EAMT internship)
More Sami, Kazakh, Tatar, Irish (ongoing projects)
New pairs: Sardinian-Italian, Macedonian-Albanian,
Gujarati-Hindi, Kazakh-English, Sicilian-Spanish,
Belarusian-Russian, Polish-Russian, Kurmanji-
English (Google Summer of Code)
More monolingual language packs (Catalan and
Spanish soon!)
Thanks!!!