Apertium: an extensive and shared language resource base for MT and much more, Gema Ramirez Sanchez, Prompsit Language Engineering

Apertium: an extensive
and shared LR base for
MT and much more
Gema Ramírez Sánchez (a proud Apertium activist) –
Prompsit Language Engineering – gramirez@prompsit.com
TAUS Roundtable – Barcelona, 12 May 2016

What is this talk about?
• This is not a talk about Apertium
  • although it will be mentioned on every slide
• This is not a talk about MT
  • although Apertium is a platform for MT
• This is a talk about data and linguistic tools inside Apertium that are useful for MT:
  • all MT approaches can benefit: SMT, NMT, RBMT…
  • but this is not a talk about corpora!!!

Apertium downloads in Sourceforge
April 2016

Apertium downloads in Sourceforge
2005-2015 by country

Downloaded Apertiums go to…
• Apertium web application
• Portals, third-party tools
• Corporate servers
• Public bodies' servers
• Apertium apps (offline!!!)
• Personal PCs (users, developers, researchers)

Downloaded Apertiums go to…
• Apertium web application
• Portals, third-party tools
• Corporate servers (offline)
• Public bodies' servers
• Apertium Android apps (offline!!)
• Personal PCs (users, developers, researchers)
Apertium community!!

Apertium community: mostly
data contributors
Community in Sourceforge (April 2016)
• Contributors: 7 admins, 400 developers
• Contributions: 67,896 commits

A brief history of data in Apertium
Year | Milestone | Language pairs
2004 | The Spanish Ministry of Industry funds a consortium to build FOSS MT for the languages of Spain | -
2005 | Apertium RBMT platform is launched, providing engine, tools and data under free licenses | 3 pairs: es-ca, es-gl and es-pt
2005-2009 | Language pair-driven innovation, still very European-focused language pairs | +19: fr, en, eo, ro, eu, oc, cy, nn, nb, sv, da, is, mk, bg, ast, br
2010 | Five years on! | 22 pairs!!!
2011-2015 | Consolidated community, support for non-European languages, new tools and reorganisation of data | +19: af, nl, hr, sr, mt, sl, ara, sme, urd, hin, kaz, tat, id, ms, ar
2016 | Eleven years on! | 41 pairs!!!

Data in Apertium became big!
2010: 22 pairs

Stable language pairs
in Apertium: 2016 - 41 pairs

A language pair in Apertium
Language pair organisation
1 pack = 1 language pair
• 2 monodixes, 1 bidix
• 2 sets of rules (levels 1/3)
• 2 tagsets + probabilities
• 2 plain/tagged corpora
• 2 post-dixes
New language pair organisation
2 monolingual packs, each with:
• 1 monodix
• 1 tagset + probabilities
• 1 plain/tagged corpora
• 1 post-dix
1 bilingual pack
• 1 bidix
• 2 sets of rules (levels 1/3)
Format: all data are XML-based files
Size:
• Monodixes: 5k-90k lemmata = 37k-13M surface forms; coverage: 85-97%
• Bidixes: 8k-90k bilingual lemma entries
• Rules: 100 (one level) to 300 (three levels) per translation direction
License: GNU General Public License
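
Because all data are XML-based, the dictionaries can be read with any XML parser. The following is a minimal sketch, assuming a heavily simplified bidix-style fragment: the entries are invented for illustration, the tag names reuse the notation of these slides, and real Apertium dictionaries contain more elements than shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified bidix fragment (not copied from a real language pair).
BIDIX_XML = """
<dictionary>
  <section id="main" type="standard">
    <e><p><l>gema<s n="n"/></l><r>gemstone<s n="n"/></r></p></e>
    <e><p><l>explicar<s n="vb"/></l><r>explain<s n="vb"/></r></p></e>
  </section>
</dictionary>
"""

def read_side(node):
    """Return (lemma, [tags]) for one <l> or <r> element."""
    lemma = node.text or ""
    tags = [s.get("n") for s in node.findall("s")]
    return lemma, tags

def read_bidix(xml_text):
    """Yield ((left lemma, tags), (right lemma, tags)) for every <p> pair."""
    root = ET.fromstring(xml_text)
    for pair in root.iter("p"):
        yield read_side(pair.find("l")), read_side(pair.find("r"))

entries = list(read_bidix(BIDIX_XML))
for left, right in entries:
    print(left, "->", right)
```

The same loop would work on a monodix-like file only after adapting the element names, which is left out of this sketch.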

Linguistic tools in Apertium
Data – tool built on it:
Monodix – Morphological analyser
Tagset+prob – PoS tagger
Bidix – Lexical transfer
Rules – Structural transfer
Monodix – Morphological generator
Post-dix – Post-generator
All tools chained – Full MT
What are these tools useful for?

Morphological analyser:
“Gema lo explica ahora mismo”
= expanded-morphology text
Gema.np.f.sg | gema.n.f.sg
lo.detnt | lo.prn.pro.m.sg
explicar.vb.pri.3.sg | explicar.vb.imp.3.sg
ahora mismo.adv
= lemmatised text
Gema/gema, el/lo, explicar, ahora mismo
= smart tokenized text
^Gema$ ^lo$ ^explica$ ^ahora_mismo$
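
To make the analyser output easier to reuse, the dotted notation on this slide can be split into lemma and tags. This is a minimal sketch that assumes the slide's `lemma.tag.tag | lemma.tag` notation (not Apertium's own stream format) and assumes lemmas contain no full stops; the helper name `parse_analyses` is hypothetical.

```python
def parse_analyses(token):
    """Split 'Gema.np.f.sg | gema.n.f.sg' into [(lemma, [tags]), ...]."""
    analyses = []
    for reading in token.split("|"):
        lemma, *tags = reading.strip().split(".")
        analyses.append((lemma, tags))
    return analyses

# Analyser output copied from the slide, one (possibly ambiguous) token per line.
ANALYSED = [
    "Gema.np.f.sg | gema.n.f.sg",
    "lo.detnt | lo.prn.pro.m.sg",
    "explicar.vb.pri.3.sg | explicar.vb.imp.3.sg",
    "ahora mismo.adv",
]

parsed = [parse_analyses(tok) for tok in ANALYSED]

# Lemmatised text: the distinct lemmas offered for each token, in order.
lemmatised = [
    "/".join(dict.fromkeys(lemma for lemma, _ in readings))
    for readings in parsed
]
print(lemmatised)
```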

Part-of-speech tagger:
= factored text
Gema.np.f.sg
lo.prn.pro.m.sg
explicar.vb.pri.3.sg
ahora_mismo.adv
= truecase text
^Gema$ ^lo$ ^explica$ ^ahora_mismo$
^gema$ : ^piedra_preciosa$
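
The tagger's job is to keep one reading per token. The slides only say that it relies on a tagset plus probabilities, so the sketch below is a stand-in rather than Apertium's algorithm: it scores every combination of readings with invented tag-bigram probabilities and keeps the best one. All numbers, and the helpers `category` and `score`, are assumptions made for illustration.

```python
from itertools import product

# Ambiguous readings per token, in the notation of the slides.
SENTENCE = [
    ["Gema.np.f.sg", "gema.n.f.sg"],
    ["lo.detnt", "lo.prn.pro.m.sg"],
    ["explicar.vb.pri.3.sg", "explicar.vb.imp.3.sg"],
    ["ahora_mismo.adv"],
]

# Invented tag-bigram probabilities; a real tagger estimates these from corpora.
BIGRAM = {
    ("<s>", "np"): 0.6, ("<s>", "n"): 0.4,
    ("np", "prn"): 0.5, ("np", "detnt"): 0.2,
    ("n", "prn"): 0.3, ("n", "detnt"): 0.3,
    ("prn", "vb.pri"): 0.7, ("prn", "vb.imp"): 0.1,
    ("detnt", "vb.pri"): 0.1, ("detnt", "vb.imp"): 0.1,
    ("vb.pri", "adv"): 0.5, ("vb.imp", "adv"): 0.5,
}
UNSEEN = 0.01  # fallback for tag pairs missing from the table

def category(reading):
    """Coarse tag: first tag after the lemma, plus the tense for verbs."""
    tags = reading.split(".")[1:]
    return ".".join(tags[:2]) if tags[0] == "vb" else tags[0]

def score(path):
    """Product of bigram probabilities along one candidate tag sequence."""
    total, previous = 1.0, "<s>"
    for reading in path:
        current = category(reading)
        total *= BIGRAM.get((previous, current), UNSEEN)
        previous = current
    return total

# Brute force is fine for four tokens; it does not scale to long sentences.
best = max(product(*SENTENCE), key=score)
print(best)
```

With these made-up numbers the selected readings coincide with the factored text shown on the slide.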

Lexical transfer:
= bilingual correspondences (lemmata) by PoS
Gema - Gema
lo - it
explicar - explain
ahora mismo – right now
gema – gemstone
lo – the
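
Lexical transfer can be pictured as a dictionary lookup keyed by lemma and part of speech. This is a minimal sketch using the correspondences listed on this slide; the PoS label attached to each pair is my reading of the slide, and the `*` prefix for unknown words is just an illustrative convention here.

```python
# Bilingual correspondences from the slide, keyed by (source lemma, PoS).
BIDIX = {
    ("Gema", "np"): "Gema",
    ("gema", "n"): "gemstone",
    ("lo", "prn"): "it",
    ("lo", "detnt"): "the",
    ("explicar", "vb"): "explain",
    ("ahora mismo", "adv"): "right now",
}

def lexical_transfer(tagged):
    """Translate each (lemma, [tags]) unit, leaving the tags untouched."""
    out = []
    for lemma, tags in tagged:
        # Fall back to the source lemma, marked with '*', when no entry exists.
        target = BIDIX.get((lemma, tags[0]), "*" + lemma)
        out.append((target, tags))
    return out

# Disambiguated input, as produced by the tagging step.
tagged = [
    ("Gema", ["np", "f", "sg"]),
    ("lo", ["prn", "pro", "m", "sg"]),
    ("explicar", ["vb", "pri", "3", "sg"]),
    ("ahora mismo", ["adv"]),
]
transferred = lexical_transfer(tagged)
print([lemma for lemma, _ in transferred])
```

Keying on the PoS as well as the lemma is what lets "lo" map to "it" as a pronoun and to "the" as a determiner.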

Structural transfer:
= grammar rules (reorderings, changes, agreement)
NP- [Gema] – NP-[Gema]
Pro-V[lo explica] – V-Pro-[explains it]
ADV-[ahora mismo] – ADV-[right now]
= linguistically motivated phrases based on rules
[Gema] – [Gema]
[lo explica] – [explains it]
[ahora mismo] – [right now]
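
The Pro-V to V-Pro reordering above can be sketched as a single pattern-matching rule. This is a toy illustration, not Apertium's rule formalism; the function name `apply_pro_v_rule` and the (lemma, PoS) representation are assumptions, and agreement ("explain" becoming "explains") is deliberately left to the generation step.

```python
def apply_pro_v_rule(units):
    """Swap every pronoun + verb sequence into verb + pronoun order."""
    out, i = [], 0
    while i < len(units):
        is_pro_v = (
            i + 1 < len(units)
            and units[i][1] == "prn"
            and units[i + 1][1] == "vb"
        )
        if is_pro_v:
            out.extend([units[i + 1], units[i]])  # emit verb first, then pronoun
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out

# Target-language lemmas with their PoS, after lexical transfer.
units = [("Gema", "np"), ("it", "prn"), ("explain", "vb"), ("right now", "adv")]
reordered = apply_pro_v_rule(units)
print(" ".join(lemma for lemma, _ in reordered))
```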

Morphological generator:
= from expanded-morphology text to surface forms:
Gema.np.f.sg – Gema
lo.detnt – lo
explicar.vb.pri.3.sg – explica
ahora_mismo.adv – ahora mismo
= “recaser” of a factored text:
Gema lo explica ahora_mismo
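
Generation is the inverse of analysis: a lemma plus its tags is mapped back to a surface form. This is a minimal sketch, assuming a hand-written lookup table in place of a compiled monodix; the table contents come from this slide, while the `#` marker for missing forms is only an illustrative choice.

```python
# Hypothetical generation table: (lemma, tags) -> surface form.
GENERATION = {
    ("Gema", ("np", "f", "sg")): "Gema",
    ("lo", ("detnt",)): "lo",
    ("explicar", ("vb", "pri", "3", "sg")): "explica",
    ("ahora_mismo", ("adv",)): "ahora mismo",
}

def generate(unit):
    """Turn 'explicar.vb.pri.3.sg' into its surface form, or '#lemma' if unknown."""
    lemma, *tags = unit.split(".")
    return GENERATION.get((lemma, tuple(tags)), "#" + lemma)

units = ["Gema.np.f.sg", "lo.detnt", "explicar.vb.pri.3.sg", "ahora_mismo.adv"]
surface = " ".join(generate(u) for u in units)
print(surface)
```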

Post-generator:
= nice rendering of a text
Gema le explique maintenant – Gema l’explique maintenant
ALL THIS TAKING INTO ACCOUNT THE FORMAT!!!
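
The French example above is an elision: "le" contracts before a word starting with a vowel. The sketch below shows that one contraction only, under simplifying assumptions (a fixed vowel list, a plain ASCII apostrophe, non-empty tokens, no handling of formatting); it is not how Apertium's post-dix rules are written.

```python
VOWELS = "aeiouéèêàâîôû"      # simplified; ignores mute 'h' and other cases
ELIDABLE = {"le", "la"}       # only the two clitics needed for this example

def post_generate(tokens):
    """Contract 'le'/'la' with a following vowel-initial word."""
    out = []
    for tok in tokens:
        if out and out[-1].lower() in ELIDABLE and tok[0].lower() in VOWELS:
            # Keep the clitic's first letter and attach the next word to it.
            out[-1] = out[-1][0] + "'" + tok
        else:
            out.append(tok)
    return " ".join(out)

rendered = post_generate(["Gema", "le", "explique", "maintenant"])
print(rendered)
```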

How did data blossom
in Apertium?
Standards
Documentation
Availability
Support
Community building
Funding
And a free/open-source sharing model

Next data in Apertium?
• Spanish-Italian by the end of the summer; students wanted (EAMT internship)
• More Sami, Kazakh, Tatar, Irish (ongoing projects)
• New pairs: Sardinian-Italian, Macedonian-Albanian, Gujarati-Hindi, Kazakh-English, Sicilian-Spanish, Belarusian-Russian, Polish-Russian, Kurmanji-English (Google Summer of Code)
• More monolingual language packs (Catalan and Spanish soon!)

Thanks!!!
