#LocWorld34
Apertium: a Unique
Free/Open-Source MT System
for Related Languages
[but not only]
Gema Ramírez Sánchez (1), Mikel L. Forcada (1, 2)
(1) Prompsit Language Engineering, Elx, Spain
(2) Universitat d’Alacant, Alacant, Spain
Outline
● Apertium components
● Ready-to-use Apertium products
● Machine translation — but not only!
● Licensing — free/open-source
● The Apertium community
● Research and business with Apertium
● Languages and language pairs
● Success cases
● Funding
Apertium components
Since 2005, Apertium has provided the three key
components of machine translation:
● An engine
● Data
● Tools
Apertium components: the engine /1
● A fast, free/open-source, modular,
shallow-transfer, language-independent
machine translation engine with:
○ text format management,
○ translation memory querying,
○ finite-state lexical processing,
○ statistical and constraint-based lexical
disambiguation, and
○ shallow structural transfer based on
finite-state pattern matching
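To make the modular design concrete, here is a minimal Python sketch of a pipeline in which each module is a function and translation is their composition. It is not Apertium code: the three toy dictionaries, the function names and the convention of prefixing unknown words with `*` in this sketch are all illustrative assumptions, and disambiguation and structural transfer are left out.

```python
# Toy, hypothetical language data (not taken from any Apertium package).
ANALYSES = {  # source surface form -> (lemma, tags)
    "completa": ("completo", ("adj", "f", "sg")),
    "documentación": ("documentación", ("n", "f", "sg")),
}
BIDIX = {  # source lemma -> target lemma
    "completo": "completo",
    "documentación": "documentação",
}
GENERATION = {  # (target lemma, tags) -> target surface form
    ("completo", ("adj", "f", "sg")): "completa",
    ("documentação", ("n", "f", "sg")): "documentação",
}


def analyse(words):
    """Morphological analysis: surface forms to (lemma, tags) units."""
    return [ANALYSES.get(w, (w, ("unknown",))) for w in words]


def lexical_transfer(units):
    """Replace each source lemma by its target lemma, keeping the tags."""
    return [(BIDIX.get(lemma, lemma), tags) for lemma, tags in units]


def generate(units):
    """Morphological generation; units that cannot be generated get a '*' mark."""
    return [GENERATION.get(unit, "*" + unit[0]) for unit in units]


def translate(text):
    """Run the modules one after another, as in a pipeline."""
    data = text.lower().split()
    for stage in (analyse, lexical_transfer, generate):
        data = stage(data)
    return " ".join(data)
```

Because every stage only consumes the previous stage's output, a single stage (for example `analyse`) can be reused on its own.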
Apertium components: the engine /2
● Most of the engine was developed inside the
Apertium project but some external
technologies are used:
○ Helsinki Finite-State toolkit (for some
morphologically-rich languages),
○ VISL CG-3 (constraint grammars for
rule-based lexical disambiguation).
Apertium components: the data
● Free/open-source language data in
well-specified XML formats for a variety of
languages and language pairs.
Apertium components: the data. A typical language pair
Language pair organization:
● 2 monolingual packages (A, B), each with:
○ 1 monolingual dictionary (monodix)
○ 1 tagset + probabilities
○ 1 plain/tagged corpus
○ 1 postgeneration “dictionary”
● 1 bilingual package (A–B) with:
○ 1 bilingual dictionary (bidix)
○ 2 sets of structural transfer (grammar) rules (levels 1–3)
Format: typically XML-based (sometimes text-based) files
Sizes:
● Monodixes: 10k–90k lemmata; 100k–23M surface forms; 85–97% coverage
● Bidixes: 8k–90k bilingual lemma correspondences
● Rules: 100 (one level) to 300 (three levels) per translation direction
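The gap between lemma counts and surface-form counts comes from inflection: one dictionary entry combined with an inflection model yields many surface forms. The Python sketch below illustrates the idea with invented data. The paradigm name, the entries and the `expand` function are hypothetical and do not reproduce the actual monodix format or compiler.

```python
from typing import Dict, List, Tuple

Analysis = Tuple[str, Tuple[str, ...]]  # (lemma, tags)

# Hypothetical inflection models: name -> list of (suffix, tags).
PARADIGMS = {
    "casa__n": [("", ("n", "f", "sg")), ("s", ("n", "f", "pl"))],
}

# Hypothetical dictionary entries: (stem, lemma, inflection model).
ENTRIES = [
    ("casa", "casa", "casa__n"),
    ("mesa", "mesa", "casa__n"),
]


def expand(entries, paradigms) -> Dict[str, List[Analysis]]:
    """Map every surface form to the analyses that produce it."""
    forms: Dict[str, List[Analysis]] = {}
    for stem, lemma, paradigm in entries:
        for suffix, tags in paradigms[paradigm]:
            forms.setdefault(stem + suffix, []).append((lemma, tags))
    return forms


SURFACE_FORMS = expand(ENTRIES, PARADIGMS)
```

Here two entries sharing one inflection model give four surface forms; with verbs and richer models the ratio grows quickly.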
Apertium components: the tools
● Free/open-source tools:
○ compilers to turn linguistic data into a fast
and compact form used by the engine
and
○ software to learn disambiguation or
translation rules from corpora.
Ready-to-use Apertium products
● A stand-alone Java application for the
desktop: apertium-caffeine.
● An Android version for handhelds.
● A stand-alone version (Apertium Simpleton)
for Windows and MacOS.
● Plug-ins and support for CAT platforms:
OmegaT, MateCat, MemoQ, Trados Studio.
● Available as a PPA repository for GNU/Linux
users.
Apertium extras: mobile app
● Full offline mode!
● Over 60 translation directions!
● On Android!
No need to install: web access
www.apertium.org
● Text box: short plain texts
● Document translation:
○ plain text
○ HTML, XML (.xliff)
○ OpenDocument (.odt, .odp, .ods)
○ Office “-x” formats: .docx, .xlsx, .pptx
○ LaTeX
● A nice feature: with/without marks for
unknown words
No need to install: web/API access
● Other portals with all Apertium languages:
○ Prompsit’s portal: + TMX +
navigate&translate
○ iTranslate4.eu portal: multiengine
● Other portals with some Apertium languages:
○ UOC, UPV, UA (+ TMX + terminology support
+ more formats)
○ GiellaTekno portal
○ etc.
● API access and connectors to translation
tools are also marketed
Machine translation — but not
only! /1
Machine translation — but
not only! /2
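[Diagram: the tools compile the language data (monodix, tagset + probabilities, bidix, rules, monodix, post-dix) into the pipeline modules: morphological analyser, PoS tagger, lexical transfer, structural transfer, morphological generator and post-generator. Chained together, the modules give full MT.]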
Machine translation — but
not only! /3
● Apertium is a rule-based machine translation
system but the pipeline contains many
monolingual modules that can be used for other
human-language technology tasks (such as
anonymization or factored output)
● Most modules are based on finite-state
technology; HMMs are used for part-of-speech
tagging and an interpreted language is used to
write structural transfer rules.
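As an illustration of HMM-based part-of-speech tagging, the following Python sketch implements the textbook Viterbi algorithm over a tiny invented model. It is not the Apertium tagger: the tag set, the probabilities and the `viterbi` function are assumptions made only for this example.

```python
def viterbi(words, tags, start, trans, emit):
    """Return the most probable tag sequence for `words` under a toy HMM."""
    # best[i][t]: probability of the best tag path ending in tag t at word i.
    best = [{t: start[t] * emit[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Pick the previous tag that maximises the path probability.
            prev, score = max(
                ((p, best[i - 1][p] * trans[p][t]) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score * emit[t].get(words[i], 0.0)
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Invented model: "run" is ambiguous between noun and verb.
TAGS = ["det", "n", "v"]
START = {"det": 0.6, "n": 0.2, "v": 0.2}
TRANS = {
    "det": {"det": 0.05, "n": 0.9, "v": 0.05},
    "n": {"det": 0.1, "n": 0.3, "v": 0.6},
    "v": {"det": 0.5, "n": 0.4, "v": 0.1},
}
EMIT = {
    "det": {"the": 1.0},
    "n": {"run": 0.5, "dogs": 0.5},
    "v": {"run": 1.0},
}
```

With this model, the context decides the reading of "run": after a determiner it is tagged as a noun, after a noun as a verb.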
Licensing: free/open-source /1
Apertium language data and code are both
licensed under the GNU General Public License:
● a free/open-source license allowing free
distribution of unmodified and modified
versions
● a copylefted license: it avoids private
appropriation and encourages giving
improvements back to the project (it creates
a software commons).
Licensing: free/open-source /2
● The free/open-source model creates a
community which effectively connects
researchers, developers, vendors, and users
in a continuum.
The Apertium community
● Very active group of hundreds of developers
● Contributions to Apertium at Sourceforge
● Wiki documentation (wiki.apertium.org)
● Easy entry: Apertium linguistic modelling is
simple, no need to program.
● IRC channel #apertium on freenode.net
● Mailing lists: apertium-stuff@lists.sf.net and
other lists
The Apertium community
[A search for Apertium faces in Google Images]
This is Francis Tyers (spectie)!
The Apertium community
Community in Sourceforge (May 2017):
● Contributors: 7 admins, 428 developers
● Contributions: +10k from May ‘16 to May ‘17, +78k commits altogether
The Apertium community: activities
● President and project management
committee election according to bylaws
● Support: mail, chat, online meetings
● Maintenance: pairs, web, mobile app
● Manuals & documentation: wiki, manuals,
how-to’s, training materials
● Organization of Google Summer of Code and
Google Code-In activity
● Outreach activities: conferences, workshops
● Language-related groups
Research and business with Apertium
Apertium is already an active research and
business platform:
● Research: 40+ publications, 2 PhD theses, 4
master's theses.
● Business: companies (Prompsit, Eleka,
Imaxin Software, etc.) offering services to
customers such as Autodesk, Adobe, the
Government of Catalonia, 2 daily newspapers
in Spain, freelancers and LSPs
Languages and language pairs /1
● Language data is encoded mostly in XML,
but some language pairs contain data
encoded in other text-based formats.
● There are currently more than 40 stable
language pairs (bilingual data).
Languages and language pairs /2
Languages and language pairs /3
Languages and language pairs /4
Year | Milestone | Language pairs
2004 | The Spanish Ministry of Industry funds a consortium to build FOSS MT for the languages of Spain | -
2005 | Apertium RBMT platform is launched, providing engine, tools and data under free licenses | 3 pairs: es–ca, es–gl and es–pt
2005–2009 | Language pair-driven innovation, still very European-focused language pairs | +19: fr, en, eo, ro, eu, oc, cy, nn, nb, sv, da, is, mk, bg, ast, br
2010 | Five years on! | 22 pairs!!!
2011–2015 | Consolidated community, support for non-European languages, new tools and reorganisation of data | +19: af, nl, hr, sr, mt, sl, arg, sme, urd, hin, kaz, tat, id, ms, ar
2017 | Twelve years on! | 43 pairs!!!
Apertium loves small languages
● Breton→French
● Aragonese↔Spanish/Catalan
● Occitan↔Catalan/Spanish
● Italian→Sardinian
● North Sámi↔Norwegian
● Icelandic↔Swedish
● Spanish→Spanish Sign Language
Language pairs with approx. 95% text coverage
Language | Lemmata | Inflection models | Surface forms
HBS (Serbo-Croatian) | 97,445 | 1,429 | 23,348,650
English | 60,543 | 312 | 108,119
Spanish | 46,003 | 442 | 4,737,777
Catalan | 41,116 | 559 | 7,088,585
Galician | 29,818 | 333 | 14,247,591
Asturian | 46,550 | 443 | 18,541,752
Occitan | 21,602 | 527 | 6,084,575
Aragonese | 26,068 | 544 | 12,870,976
Portuguese | 14,436 | 316 | 10,514,672
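Text coverage can be read as the share of running words in a corpus that the dictionary recognises. The Python sketch below computes it under that assumption; the `coverage` function, the lowercasing step and the sample data are illustrative and do not describe how the figures above were measured.

```python
from typing import Iterable, Set


def coverage(tokens: Iterable[str], known_forms: Set[str]) -> float:
    """Fraction of tokens whose lowercased form is in the dictionary."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("cannot compute coverage of an empty text")
    known = sum(1 for token in tokens if token.lower() in known_forms)
    return known / len(tokens)


# Hypothetical dictionary and text.
KNOWN = {"la", "casa", "es"}
TEXT = ["La", "casa", "xyzzy", "es"]
SCORE = coverage(TEXT, KNOWN)  # 3 of the 4 tokens are known
```

Coverage is counted over tokens, not over distinct words, so frequent words weigh more than rare ones.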
Apertium language-pair life cycles
● For new pairs:
○ resource compilation
○ basic system creation (85% coverage, most
frequent structural phenomena)
○ evaluation
○ typically takes 3–6 months
● For existing pairs:
○ testing, enhancement, evaluation
○ typically takes 1–3 months
Performance of a related-language pair: apertium-es-pt
From Masselot et al., 2010 (Using the Apertium
Spanish–Brazilian Portuguese MT system for
localization):
● Post-editing effort (word error rate): 20%
● Post-editing speed: average 4,500 words/day
Updated 2017 (also for software localisation):
● Post-editing effort (word error rate): 14%
● Post-editing speed: average 6,500 words/day
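Word error rate is commonly computed as the word-level edit distance between the MT output and the post-edited text, divided by the length of the post-edited text. The Python sketch below follows that common definition; whether the figures above were obtained with exactly this tokenisation and normalisation is an assumption.

```python
def word_error_rate(mt_output: str, post_edited: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    hyp, ref = mt_output.split(), post_edited.split()
    if not ref:
        raise ValueError("the post-edited reference must not be empty")
    # prev[j]: edit distance between the hypothesis words seen so far
    # and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(
                prev[j] + 1,             # delete a hypothesis word
                cur[j - 1] + 1,          # insert a reference word
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Under this definition, a rate of 20% means roughly one word edit for every five words of final text.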
Related language-pair post-editing experience /1
● Original Spanish: Completa documentación 2D.
● MT output: Completa documentação 2D.
● Portuguese final: Documentação 2D abrangente.
Related language-pair post-editing experience /2
● Original Spanish: Cree documentación y dibujos 2D con un completo conjunto de herramientas de dibujo, edición y anotación.
● Apertium output: Crê documentação e desenhos 2D com um completo conjunto de ferramentas de desenho, edição e anotação.
● Portuguese final: Produza desenhos e documentação 2D com um conjunto abrangente de ferramentas de desenho, edição e anotação.
Apertium output for closely-related languages is:
● Easy and fast to post-edit
● Rather mechanical, but reliable
● Predictable
Some success cases /1
Nearby LocWorld Barcelona...
● Apertium makes two daily newspapers
bilingual: Levante (Catalan) and La Voz de
Galicia (Galician).
● Universities in the Catalan-speaking area
use Apertium to help generate
courseware and academic information.
● Apertium is used in PLATA, the Spanish
government platform for webpage
translation.
Some success cases /2
Also by-products:
● Same-language machine translation for
local flavours/flavors: AltLang.net (based on Apertium)
○ available for English, Spanish, French and
Portuguese varieties.
○ performs spelling, lexical, grammar and
style changes.
Some other success cases /3
In Wikimedia Content Translation,
Apertium translates Wikipedia content.
Wikimedia Content Translation into Norwegian Nynorsk
[Chart annotation: start of a co-funded project on MT for Scandinavian languages, including community outreach]
● Most of the translations are from Norwegian Bokmål.
● 85% are done using Apertium.
Before Content Translation: main
use for Bokmål–Nynorsk was
“homework”
Other success cases involving interaction with other communities
● Translators Without Borders develop
crisis-specific, portable machine translation
from English to Kurdish languages (Kurmanji,
Sorani) on Apertium.
● Apertium and language experts help promote a
unified standard for Occitan by defining and
selecting it for Spanish→Occitan and
Catalan→Occitan MT.
Funding /1
● The Ministry of Industry, Tourism and
Commerce of Spain (also, the Ministries of
Education and Science and of Science and
Technology of Spain)
● The Secretariat for Technology and the
Information Society of the Government of
Catalonia
● The European Commission (DGT training and
Abu-Matran project)
● The Ministry of Foreign Affairs of Romania
Funding /2
● Universitat d'Alacant and Universitat Oberta
de Catalunya
● Ofis Publik ar Brezhoneg (Breton Language
Board)
● Ministry of Education and Science of the
Republic of Kazakhstan
● Google Summer of Code scholarships
(2009–2014, 2016, 2017) and Google Code-In
donations (2010–2016).
● And many other private companies
You can be part of it!
● If you want to build, integrate, or customize
fast, reliable, predictable machine translation for
your application.
● If you’d rather understand application-oriented
dictionaries and rules than deal with the
“magic” of embeddings, decoders, phrase tables,
convolutions, or probabilities.
● If there’s no way you can amass and curate
millions of translated words to train a system
for your language or application.
Then come and talk to us
(we are at booth 121).
© 2017 Mikel L. Forcada and Gema Ramírez-Sánchez
This work may be distributed under the terms of
either of these two licenses:
● Creative Commons Attribution–Share Alike:
http://creativecommons.org/licenses/by-sa/3.0/deed.en
● GNU GPL v. 3.0: http://www.gnu.org/licenses/gpl.html
  • 1. #LocWorld34 Apertium: a Unique Free/Open-Source MT System for Related Languages [but not only] Gema Ramírez Sánchez1 Mikel L. Forcada1,2 1 Prompsit Language Engineering, Elx, Spain 1,2 Universitat d’Alacant, Alacant, Spain
  • 2. #LocWorld34 Outline ● Apertium components ● Ready-to-use Apertium products ● Machine translation — but not only! ● Licensing — free/open-source ● The Apertium community ● Research and business with Apertium ● Languages and language pairs ● Success cases ● Funding
  • 3. #LocWorld34 Apertium components Since 2005, Apertium provides the three key components of machine translation: ● An engine ● Data ● Tools
  • 4. #LocWorld34 Apertium components: the engine /1 ● A fast, free/open-source, modular, shallow-transfer, language-independent machine translation engine with: ○ text format management, ○ translation memory querying, ○ finite-state lexical processing, ○ statistical and constraint-based lexical disambiguation, and ○ shallow structural transfer based on finite-state pattern matching
  • 5. #LocWorld34 Apertium components: the engine /2 ● Most of the engine was developed inside the Apertium project but some external technologies are used: ○ Helsinki Finite-State toolkit (for some morphologically-rich languages), ○ VISL CG-3 (constraint grammars for rule-based lexical disambiguation).
  • 6. #LocWorld34 Apertium components: the data ● Free/open-source language data in well-specified XML formats for a variety of languages and language pairs.
  • 7. #LocWorld34 Apertium components: the data. A typical language pair Language pair organization 2 monolingual packages (A, B) ▪ 1 monolingual dictionary (monodix) ▪ 1 tagset + probabilities ▪ 1 plain/tagged corpus ▪ 1 postgeneration “dictionary” 1 bilingual package (A–B) ▪ 1 bilingual dictionary (bidix) ▪ 2 sets of structural transfer (grammar) rules (levels 1–3) Format: typically XML-based (sometimes text-based) files Sizes: Monodixes: 10k–90k lemmata; 100k–23M surf. forms, 85–97% cover. Bidixes: 8k--–90k bilingual lema correspondences Rules: 100 (one level) – 300 (3 level) per translation direction
  • 8. #LocWorld34 Apertium components: the tools ● Free/open-source tools: ○ compilers to turn linguistic data into a fast and compact form used by the engine and ○ software to learn disambiguation or translation rules from corpora.
  • 9. #LocWorld34 Ready-to-use Apertium products ● A stand-alone Java application for the desktop: apertium-caffeine. ● An Android version for handhelds. ● A stand-alone version (Apertium Simpleton) for Windows and MacOS. ● Plug-ins and support for CAT platforms: OmegaT, MateCat, MemoQ, Trados Studio. ● Available as a PPA repository for GNU/Linux users.
  • 10. #LocWorld34 Apertium extras: mobile app Full offline mode! Over 60 translation directions! On Android!
  • 11. #LocWorld34 No need to install: web access www.apertium.org
  • 12. #LocWorld34 No need to install: web access www.apertium.org ● Text box: short plain texts ● Document translation: ○ plain text ○ HTML, XML (.xliff) ○ OpenDocument (.odt, .odp, .ods) ○ Office “-x” formats: .docx, .xlsx, .pptx ○ LaTeX ● A nice feature: with/without marks for unknown words
  • 13. #LocWorld34 No need to install: web/API access ● Other portals with all Apertium languages: ○ Prompsit’s portal: + TMX + navigate&translate ○ iTranslate4.eu portal: multiengine ● Other portals with some Apertium languages: ○ UOC, UPV, UA (+ TMX + terminology support + more formats) ○ GiellaTekno portal ○ etc. ● Also API access and connectors to translation tools are marketed
  • 15. #LocWorld34 Machine translation — but not only! /2 [Pipeline diagram. Modules (full MT): Morphological analyser → PoS tagger → Lexical transfer → Structural transfer → Morphological generator → Post-generator. Data (processed by tools): Monodix, Tagset+prob, Bidix, Rules, Monodix, Post-dix]
  • 16. #LocWorld34 Machine translation — but not only! /3 ● Apertium is a rule-based machine translation system but the pipeline contains many monolingual modules that can be used for other human-language technology tasks (such as anonymization or factored output) ● Most modules are based on finite-state technology; HMMs are used for part-of-speech tagging and an interpreted language is used to write structural transfer rules.
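The modular pipeline described above can be illustrated with a minimal Python sketch. It is a toy model, not Apertium code: the dictionaries `MONODIX_ES`, `BIDIX` and `GEN_PT`, the tag strings and all function names are hypothetical, the "tagger" simply picks the first analysis instead of using an HMM, and structural transfer and post-generation are left out. Unknown words are marked with `*`, one possible convention for the unknown-word marks mentioned on slide 12.

```python
# Toy stand-ins for compiled dictionaries (hypothetical data, not Apertium formats).
MONODIX_ES = {
    "completa": [("completo", "adj.f.sg")],
    "documentación": [("documentación", "n.f.sg")],
}
BIDIX = {
    ("completo", "adj.f.sg"): ("completo", "adj.f.sg"),
    ("documentación", "n.f.sg"): ("documentação", "n.f.sg"),
}
GEN_PT = {
    ("completo", "adj.f.sg"): "completa",
    ("documentação", "n.f.sg"): "documentação",
}


def analyse(text):
    """Morphological analysis: attach all known analyses to each word."""
    return [(word, MONODIX_ES.get(word.lower(), [])) for word in text.split()]


def disambiguate(tokens):
    """Stand-in for the PoS tagger: keep the first analysis, if any."""
    return [(word, analyses[0] if analyses else None) for word, analyses in tokens]


def lexical_transfer(tokens):
    """Look up each source analysis in the bilingual dictionary."""
    return [(word, BIDIX.get(a) if a else None) for word, a in tokens]


def generate(tokens):
    """Morphological generation; unknown words are passed through with a mark."""
    output = []
    for word, analysis in tokens:
        form = GEN_PT.get(analysis) if analysis else None
        output.append(form if form is not None else "*" + word)
    return " ".join(output)


def translate(text):
    """Chain the modules in pipeline order."""
    return generate(lexical_transfer(disambiguate(analyse(text))))


print(translate("Completa documentación 2D"))
```

Because every stage is an ordinary function over a token list, the monolingual stages (`analyse`, `disambiguate`) can be reused on their own, which is the point the slide makes about non-MT uses.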
  • 17. #LocWorld34 Licensing: free/open-source /1 Apertium language data and code are both licensed under the GNU General Public License: ● a free/open-source license allowing free distribution of unmodified and modified versions ● a copylefted license: it avoids private appropriation and encourages giving improvements back to the project (it creates a software commons).
  • 18. #LocWorld34 Licensing: free/open-source /2 ● The free/open-source model creates a community which effectively connects researchers, developers, vendors, and users in a continuum.
  • 19. #LocWorld34 The Apertium community ● Very active group of hundreds of developers ● Contributions to Apertium at Sourceforge ● Wiki documentation (wiki.apertium.org) ● Easy entry: Apertium linguistic modelling is simple, no need to program. ● IRC channel #apertium in freenode.net ● Mailing lists: apertium-stuff@lists.sf.net and other lists
  • 20. #LocWorld34 The Apertium community [A search for Apertium faces in Google Images]
  • 21. #LocWorld34 The Apertium community [A search for Apertium faces in Google Images] This is Francis Tyers (spectie)!
  • 22. #LocWorld34 The Apertium community Community on SourceForge (May 2017): Contributors: 7 admins, 428 developers. Contributions: +10k from May ‘16 to May ‘17; +78k commits altogether.
  • 23. #LocWorld34 The Apertium community: activities ● President and project management committee election according to bylaws ● Support: mail, chat, online meetings ● Maintenance: pairs, web, mobile app ● Manuals & documentation: wiki, manuals, how-to’s, training materials ● Organization of Google Summer of Code and Google Code-In activity ● Outreach activities: conferences, workshops ● Language-related groups
  • 24. #LocWorld34 Research and business with Apertium Apertium is already an active research and business platform: ● Research: 40+ publications, 2 PhD theses, 4 master's theses. ● Business: companies (Prompsit, Eleka, Imaxin Software, etc.) offering services to customers such as Autodesk, Adobe, the Government of Catalonia, 2 daily newspapers in Spain, freelancers and LSPs
  • 25. #LocWorld34 Languages and language pairs /1 ● Language data is encoded mostly in XML, but some language pairs contain data encoded in other text-based formats. ● Stable language pairs (bilingual data) are currently more than 40.
  • 28. #LocWorld34 Languages and language pairs /4 (Year: milestone; language pairs) ● 2004: The Spanish Ministry of Industry funds a consortium to build FOSS MT for the languages of Spain ● 2005: Apertium RBMT platform is launched providing engine, tools and data under free licenses; 3 pairs: es–ca, es–gl and es–pt ● 2005–2009: Language pair-driven innovation, still very European-focused language pairs; +19: fr, en, eo, ro, eu, oc, cy, nn, nb, sv, da, is, mk, bg, ast, br ● 2010: Five years on! 22 pairs!!! ● 2011–2015: Consolidated community, support for non-European languages, new tools and reorganisation of data; +19: af, nl, hr, sr, mt, sl, arg, sme, urd, hin, kaz, tat, id, ms, ar ● 2017: Twelve years on! 43 pairs!!!
  • 29. #LocWorld34 Apertium loves small languages ● Breton→French ● Aragonese↔Spanish/Catalan ● Occitan↔Catalan/Spanish ● Italian→Sardinian ● North Sámi↔Norwegian ● Icelandic↔Swedish ● Spanish→Spanish Sign Language
  • 30. #LocWorld34 Language pairs with approx. 95% text coverage (Language: lemmata; inflection models; surface forms) ● HBS: 97,445; 1,429; 23,348,650 ● English: 60,543; 312; 108,119 ● Spanish: 46,003; 442; 4,737,777 ● Catalan: 41,116; 559; 7,088,585 ● Galician: 29,818; 333; 14,247,591 ● Asturian: 46,550; 443; 18,541,752 ● Occitan: 21,602; 527; 6,084,575 ● Aragonese: 26,068; 544; 12,870,976 ● Portuguese: 14,436; 316; 10,514,672
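The gap between lemma counts and surface-form counts in this table comes from inflection models: each lemma is linked to a model that lists its endings, and expanding them yields the surface forms. The Python sketch below illustrates that idea only. The names `PARADIGMS`, `ENTRIES` and `expand`, the paradigm labels and the tag notation are assumptions made for the example and do not reproduce Apertium's XML dictionary format.

```python
from collections import defaultdict

# Hypothetical inflection models: paradigm name -> list of (ending, tags).
PARADIGMS = {
    "gat/o__n": [
        ("o", "<n><m><sg>"),
        ("a", "<n><f><sg>"),
        ("os", "<n><m><pl>"),
        ("as", "<n><f><pl>"),
    ],
    "cas/a__n": [
        ("a", "<n><f><sg>"),
        ("as", "<n><f><pl>"),
    ],
}

# Hypothetical dictionary entries: (lemma, stem, paradigm name).
ENTRIES = [
    ("gato", "gat", "gat/o__n"),
    ("perro", "perr", "gat/o__n"),
    ("casa", "cas", "cas/a__n"),
]


def expand(entries, paradigms):
    """Map every surface form to the list of analyses that produce it."""
    forms = defaultdict(list)
    for lemma, stem, paradigm_name in entries:
        for ending, tags in paradigms[paradigm_name]:
            forms[stem + ending].append(lemma + tags)
    return dict(forms)


surface_forms = expand(ENTRIES, PARADIGMS)
print(len(ENTRIES), "lemmata ->", len(surface_forms), "surface forms")
```

In this toy, three lemmata and two inflection models already produce ten surface forms; languages with richer morphology multiply much faster, which is consistent with the table's ratios.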
  • 31. #LocWorld34 Apertium language-pair life cycles ● For new pairs: ○ resource compilation ○ basic system creation (85% coverage, most frequent structural phenomena) ○ evaluation ○ typically takes 3–6 months ● For existing pairs: ○ testing, enhancement, evaluation ○ typically takes 1–3 months
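The 85% coverage target for a basic system can be read as the share of running words in a text that the dictionary recognises. The following Python sketch computes that share under simplifying assumptions: whitespace tokenisation, case-insensitive lookup, and a plain set of known surface forms. The function name `coverage` and the sample data are hypothetical, and this is not how the Apertium tools themselves measure coverage.

```python
def coverage(tokens, known_forms):
    """Return the fraction of tokens whose lowercased form is in known_forms."""
    if not tokens:
        raise ValueError("cannot compute coverage of an empty text")
    known = sum(1 for token in tokens if token.lower() in known_forms)
    return known / len(tokens)


# Hypothetical sample: three of the four tokens are in the dictionary.
known = {"cree", "documentación", "y"}
sample = "Cree documentación y dibujos".split()
print(f"coverage: {coverage(sample, known):.0%}")
```

Measured on a representative corpus, a figure like this gives a quick signal of when a new pair moves from the "basic system" stage to the enhancement stage.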
  • 32. #LocWorld34 A related-languages pair performance: apertium-es-pt From Masselot et al., 2010 (Using the Apertium Spanish–Brazilian Portuguese MT system for localization): ● Post-editing effort (word error rate): 20% ● Post-editing speed: average 4,500 words/day Updated 2017 (also for software localisation): ● Post-editing effort (word error rate): 14% ● Post-editing speed: average 6,500 words/day
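Word error rate, used above as the measure of post-editing effort, is commonly computed as the word-level edit distance between the MT output and the post-edited text, divided by the length of the post-edited text. The Python sketch below follows that common definition. Whitespace tokenisation and the function name `word_error_rate` are assumptions for illustration, and the exact procedure used by Masselot et al. may differ.

```python
def word_error_rate(mt_output: str, post_edited: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    hyp = mt_output.split()
    ref = post_edited.split()
    if not ref:
        raise ValueError("the post-edited reference must not be empty")

    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j words of the MT output.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            missing_in_mt = prev[j] + 1   # reference word absent from MT output
            extra_in_mt = curr[j - 1] + 1  # MT word absent from reference
            curr.append(min(substitution, missing_in_mt, extra_in_mt))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("um completo conjunto de ferramentas",
                      "um conjunto abrangente de ferramentas"))
```

A rate of 14% then means that, on average, about 14 word edits are needed per 100 words of final text.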
  • 33. #LocWorld34 Related language-pair post-editing experience /1 Original Spanish: Completa documentación 2D. MT output: Completa documentação 2D. Portuguese final: Documentação 2D abrangente.
  • 34. #LocWorld34 Related language-pair post-editing experience /2 Original Spanish: Cree documentación y dibujos 2D con un completo conjunto de herramientas de dibujo, edición y anotación. Apertium output: Crê documentação e desenhos 2D com um completo conjunto de ferramentas de desenho, edição e anotação. Portuguese final: Produza desenhos e documentação 2D com um conjunto abrangente de ferramentas de desenho, edição e anotação. Apertium output for closely-related languages is: ● Easy and fast to post-edit ● Rather mechanical, but reliable ● Predictable
  • 35. #LocWorld34 Some success cases /1 Near LocWorld Barcelona... ● Apertium makes two daily newspapers bilingual: Levante (Catalan) and La Voz de Galicia (Galician). ● Universities in the Catalan-speaking area use Apertium to help generate courseware and academic information. ● Apertium is used in PLATA, the Spanish government platform for webpage translation.
  • 36. #LocWorld34 Some success cases /2 Also by-products: ● Same-language machine translation for local flavours/flavors: AltLang.net ○ available for English, Spanish, French and Portuguese varieties. ○ performs spelling, lexical, grammar and style changes. Based on Apertium
  • 37. #LocWorld34 Some other success cases /3 In Wikimedia Content Translation, Apertium translates Wikipedia content
  • 38. #LocWorld34 Wikimedia Content Translation into Norwegian Nynorsk Co-funded project on MT for Scandinavian languages including community outreach starts Most of the translations are from Norwegian Bokmål. 85% are done using Apertium.
  • 39. #LocWorld34 Before Content Translation: main use for Bokmål–Nynorsk was “homework”
  • 40. #LocWorld34 Other success cases involving interaction with other communities ● Translators Without Borders develop crisis-specific, portable machine translation from English to Kurdish languages (Kurmanji, Sorani) on Apertium. ● Apertium and language experts help promote a unified standard for Occitan by defining and selecting it for Spanish→Occitan and Catalan→Occitan MT
  • 41. #LocWorld34 Funding /1 ● The Ministry of Industry, Tourism and Commerce of Spain (also, the Ministries of Education and Science and of Science and Technology of Spain) ● The Secretariat for Technology and the Information Society of the Government of Catalonia ● The European Commission (DGT training and Abu-Matran project) ● The Ministry of Foreign Affairs of Romania
  • 42. #LocWorld34 Funding /2 ● Universitat d'Alacant and Universitat Oberta de Catalunya ● Ofis Publik ar Brezhoneg (Breton Language Board) ● Ministry of Education and Science of the Republic of Kazakhstan ● Google Summer of Code scholarships (2009–2014, 2016, 2017) and Google Code-In donations (2010–2016). ● And many other private companies
  • 43. #LocWorld34 ● If you want to build, integrate, or customize fast, reliable, predictable machine translation for your application. ● If you’d rather understand application-oriented dictionaries and rules than deal with the “magic” of embeddings, decoders, phrase tables, convolutions, or probabilities. ● If there’s no way you can amass and curate millions of translated words to train a system for your language or application. Then come and talk to us (we are at booth 121). You can be part of it!
  • 44. #LocWorld34 © 2017 Mikel L. Forcada and Gema Ramírez-Sánchez This work may be distributed under the terms of either of these two licenses: ● Creative Commons Attribution–Share Alike: http://creativecommons.org/licenses/by-sa/3.0/deed.en ● GNU GPL v. 3.0: http://www.gnu.org/licenses/gpl.html Sharing