Apertium: an extensive
and shared LR base for
MT and much more
Gema Ramírez Sánchez (a proud Apertium activist) –
Prompsit Language Engineering – gramirez@prompsit.com
TAUS Roundtable – Barcelona, 12 May 2016
What is this talk about?
• This is not a talk about Apertium
• although it will be mentioned in every slide
• This is not a talk about MT
• although Apertium is a platform for MT
• This is a talk about data and linguistic tools
inside Apertium that are useful for MT:
• all MT approaches can benefit: SMT, NMT, RBMT…
• but this is not a talk about corpora!!!
Apertium downloads in Sourceforge
April 2016
Apertium downloads in Sourceforge
2005-2015 by country
Downloaded copies of Apertium go to…
Apertium web application
Portals, third-party tools
Corporate servers
Public bodies' servers
Apertium apps (offline!!!)
Personal PCs (users, developers, researchers)
Downloaded copies of Apertium go to…
• Apertium web application
• Portals, third-party tools
• Corporate servers (offline)
• Public bodies' servers
• Apertium Android apps (offline!!)
• Personal PCs (users, developers, researchers)
Apertium community!!
Apertium community: mostly
data contributors
Community in Sourceforge (April 2016)
Contributors: 7 admins, 400 developers
Contributions: 67,896 commits
A brief history of data in Apertium
Year | Milestone | Language pairs
2004 | The Spanish Ministry of Industry funds a consortium to build FOSS MT for the languages of Spain | -
2005 | Apertium RBMT platform is launched, providing engine, tools and data under free licenses | 3 pairs: es-ca, es-gl and es-pt
2005-2009 | Language pair-driven innovation, still very European-focused language pairs | +19: fr, en, eo, ro, eu, oc, cy, nn, nb, sv, da, is, mk, bg, ast, br
2010 | Five years on! | 22 pairs!!!
2011-2015 | Consolidated community, support for non-European languages, new tools and reorganisation of data | +19: af, nl, hr, sr, mt, sl, ara, sme, urd, hin, kaz, tat, id, ms, ar
2016 | Eleven years on! | 41 pairs!!!
Data in Apertium became big!
2010: 22 pairs
Stable language pairs
in Apertium: 2016 - 41 pairs
A language pair in Apertium
Language pair organisation
1 pack = 1 language pair
• 2 monodixes, 1 bidix
• 2 sets of rules (levels 1/3)
• 2 tagsets + probabilities
• 2 plain/tagged corpora
• 2 post-dixes
New language pair organisation
2 monolingual packs
• 1 monodix
• 1 tagset + probabilities
• 1 plain/tagged corpus
• 1 post-dix
1 bilingual pack
• 1 bidix
• 2 sets of rules (levels 1/3)
Format: all data are XML-based files
Size:
Monodixes: 5k-90k lemmata = 37k-13M surface forms; coverage: 85-97%
Bidixes: 8k-90k bilingual entries (lemmata)
Rules: 100 (one level) to 300 (three levels) per translation direction
License: GNU General Public License
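To make the "lemmata = surface forms" figure concrete, here is a minimal sketch of how a few dictionary entries plus one inflection paradigm expand into several surface forms. The XML fragment is a simplified, hypothetical one written in the style of a monodix (the entries, the paradigm name and the helper `expand` are assumptions made for illustration), not an excerpt from a released language pack.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified monodix-style fragment:
# one paradigm (singular/plural endings) shared by two lemmata.
TOY_MONODIX = """
<dictionary>
  <pardefs>
    <pardef n="gem/a__n">
      <e><p><l>a</l><r>a<s n="n"/><s n="f"/><s n="sg"/></r></p></e>
      <e><p><l>as</l><r>a<s n="n"/><s n="f"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="gema"><i>gem</i><par n="gem/a__n"/></e>
    <e lm="casa"><i>cas</i><par n="gem/a__n"/></e>
  </section>
</dictionary>
"""


def expand(xml_text):
    """Map every surface form to 'lemma.tags' by combining stems and paradigms."""
    root = ET.fromstring(xml_text.strip())

    # Collect each paradigm as a list of (surface ending, lemma ending, tags).
    paradigms = {}
    for pardef in root.iter("pardef"):
        endings = []
        for entry in pardef.findall("e"):
            left = entry.find("p/l")
            right = entry.find("p/r")
            tags = ".".join(s.get("n") for s in right.findall("s"))
            endings.append((left.text or "", right.text or "", tags))
        paradigms[pardef.get("n")] = endings

    # Attach every ending of the referenced paradigm to each stem.
    forms = {}
    for section in root.iter("section"):
        for entry in section.findall("e"):
            stem = entry.findtext("i") or ""
            par_name = entry.find("par").get("n")
            for surface_end, lemma_end, tags in paradigms[par_name]:
                forms[stem + surface_end] = stem + lemma_end + "." + tags
    return forms


forms = expand(TOY_MONODIX)
for surface, analysis in sorted(forms.items()):
    print(surface, "=", analysis)
```

Two lemmata become four surface forms here; with richer paradigms (verbs, clitics) the ratio grows quickly, which is how a few thousand lemmata can cover a far larger number of surface forms.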
Linguistic tools in Apertium
Data → tools (the whole chain = full MT):
• Monodix → Morphological analyser
• Tagset+prob → PoS tagger
• Bidix → Lexical transfer
• Rules → Structural transfer
• Monodix → Morphological generator
• Post-dix → Post-generator
What are these tools useful for?
Morphological analyser:
“Gema lo explica ahora mismo”
= expanded-morphology text
Gema.np.f.sg | gema.n.f.sg
lo.detnt | lo.prn.pro.m.sg
explicar.vb.pri.3.sg | explicar.vb.imp.3.sg
ahora_mismo.adv
= lemmatised text
Gema/gema, el/lo, explicar, ahora mismo
= smart tokenized text
^Gema$ ^lo$ ^explica$ ^ahora_mismo$
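The three outputs on this slide can be illustrated with a toy dictionary lookup. This is only a sketch: the lexicon is copied from the slide, the function names are invented here, and the longest-match loop stands in for the real analyser, which is not reproduced.

```python
# Toy lexicon taken from the slide: lower-cased surface form -> analyses.
TOY_LEXICON = {
    "gema": ["Gema.np.f.sg", "gema.n.f.sg"],
    "lo": ["lo.detnt", "lo.prn.pro.m.sg"],
    "explica": ["explicar.vb.pri.3.sg", "explicar.vb.imp.3.sg"],
    "ahora mismo": ["ahora_mismo.adv"],
}
MAX_WORDS = 2  # longest multiword entry in the toy lexicon


def analyse(sentence):
    """Return (surface, analyses) pairs, preferring the longest lexicon match."""
    words = sentence.split()
    result = []
    i = 0
    while i < len(words):
        for span in range(min(MAX_WORDS, len(words) - i), 0, -1):
            surface = " ".join(words[i:i + span])
            analyses = TOY_LEXICON.get(surface.lower())
            if analyses:
                result.append((surface, analyses))
                i += span
                break
        else:
            # Unknown word: keep it, flagged with '*' (a choice made for this sketch).
            result.append((words[i], ["*" + words[i]]))
            i += 1
    return result


def lemmas(analysed):
    """Lemmatised text: every candidate lemma for each token."""
    return [[a.split(".")[0] for a in analyses] for _, analyses in analysed]


def tokens(analysed):
    """Smart tokenised text: multiword units are kept as one token."""
    return " ".join("^" + s.replace(" ", "_") + "$" for s, _ in analysed)


analysed = analyse("Gema lo explica ahora mismo")
print(analysed)          # expanded-morphology text
print(lemmas(analysed))  # lemmatised text
print(tokens(analysed))  # smart tokenised text
```

The point of the longest-match step is that "ahora mismo" comes out as a single unit, which a whitespace tokeniser would split in two.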
Part-of-speech tagger:
“Gema lo explica ahora mismo”
= factored text
Gema.np.f.sg
lo.prn.pro.m.sg
explicar.vb.pri.3.sg
ahora_mismo.adv
= truecase text
^Gema$ ^lo$ ^explica$ ^ahora_mismo$
^gema$ : ^piedra_preciosa$
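As a rough illustration of what "tagset + probabilities" buys, the sketch below picks one analysis per token using scores conditioned on the previous part of speech. Everything numeric here is invented for the example, and the greedy left-to-right choice is a simplification assumed for brevity, not the algorithm of Apertium's tagger.

```python
# Invented scores: (previous PoS, candidate tag string) -> score.
TOY_SCORES = {
    ("<s>", "np.f.sg"): 0.6,
    ("<s>", "n.f.sg"): 0.4,
    ("np", "prn.pro.m.sg"): 0.7,
    ("np", "detnt"): 0.3,
    ("prn", "vb.pri.3.sg"): 0.8,
    ("prn", "vb.imp.3.sg"): 0.2,
    ("vb", "adv"): 1.0,
}


def disambiguate(ambiguous):
    """Greedily keep the best-scoring analysis for each token."""
    previous_pos = "<s>"  # sentence-start marker
    chosen = []
    for candidates in ambiguous:
        best = max(
            candidates,
            key=lambda a: TOY_SCORES.get((previous_pos, a.split(".", 1)[1]), 0.0),
        )
        chosen.append(best)
        previous_pos = best.split(".")[1]  # first tag after the lemma
    return chosen


ambiguous = [
    ["Gema.np.f.sg", "gema.n.f.sg"],
    ["lo.detnt", "lo.prn.pro.m.sg"],
    ["explicar.vb.pri.3.sg", "explicar.vb.imp.3.sg"],
    ["ahora_mismo.adv"],
]
factored = disambiguate(ambiguous)
print(factored)
```

Once "Gema" is resolved to a proper noun rather than the common noun "gema", the capital letter is known to be part of the word, which is the truecasing use mentioned on the slide.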
Lexical transfer:
“Gema lo explica ahora mismo”
= bilingual correspondences (lemmata) by PoS
Gema - Gema
lo - it
explicar - explain
ahora mismo – right now
gema – gemstone
lo – the
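The "by PoS" part is what separates "lo – it" from "lo – the". A minimal sketch, assuming a toy bilingual dictionary keyed by (lemma, PoS) with the entries from the slide; the `transfer` helper and the `@` marker for missing entries are choices made for this example.

```python
# Toy bilingual dictionary from the slide: (source lemma, PoS) -> target lemma.
TOY_BIDIX = {
    ("Gema", "np"): "Gema",
    ("gema", "n"): "gemstone",
    ("lo", "prn"): "it",
    ("lo", "detnt"): "the",
    ("explicar", "vb"): "explain",
    ("ahora_mismo", "adv"): "right_now",
}


def transfer(analysis):
    """Replace the source lemma with its target equivalent, keeping the tags."""
    lemma, _, tags = analysis.partition(".")
    pos = tags.split(".")[0]
    target = TOY_BIDIX.get((lemma, pos))
    if target is None:
        return "@" + analysis  # no bilingual entry: flag it and pass it through
    return target + "." + tags if tags else target


for item in ["Gema.np.f.sg", "lo.prn.pro.m.sg", "lo.detnt", "gema.n.f.sg"]:
    print(item, "->", transfer(item))
```

Real bilingual entries can also change tags (gender, for instance); this sketch copies them unchanged to stay short.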
Structural transfer:
“Gema lo explica ahora mismo”
= grammar rules (reorderings, changes, agreement)
NP-[Gema] – NP-[Gema]
Pro-V-[lo explica] – V-Pro-[explains it]
ADV-[ahora mismo] – ADV-[right now]
= linguistically motivated phrases based on rules
[Gema] – [Gema]
[lo explica] – [explains it]
[ahora mismo] – [right now]
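The Pro-V to V-Pro reordering can be sketched as a pattern over PoS tags plus a target order. The rule table and helper names below are hypothetical and cover only this one pattern; actual transfer rules also handle agreement and tag changes.

```python
# Hypothetical rule table: (source PoS pattern, target order as indices).
TOY_RULES = [
    (("prn", "vb"), (1, 0)),  # Pro-V -> V-Pro
]


def pos_of(analysis):
    """First tag after the lemma, or '' if the item carries no tags."""
    parts = analysis.split(".")
    return parts[1] if len(parts) > 1 else ""


def apply_rules(analyses):
    """Scan left to right and reorder every span that matches a rule pattern."""
    out = []
    i = 0
    while i < len(analyses):
        for pattern, order in TOY_RULES:
            window = analyses[i:i + len(pattern)]
            if len(window) == len(pattern) and tuple(map(pos_of, window)) == pattern:
                out.extend(window[j] for j in order)
                i += len(pattern)
                break
        else:
            # No rule matched at this position: copy the item unchanged.
            out.append(analyses[i])
            i += 1
    return out


source = ["Gema.np.f.sg", "it.prn.pro.m.sg", "explain.vb.pri.3.sg", "right_now.adv"]
reordered = apply_rules(source)
print(reordered)
```

The matched spans are also the "linguistically motivated phrases" of the slide: each rule application yields a source chunk and its reordered target chunk.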
Morphological generator:
“Gema lo explica ahora mismo”
= from expanded text to surface forms:
Gema.np.f.sg
lo.detnt
explicar.vb.pri.3.sg
ahora_mismo.adv
= “recaser” of a factored text:
Gema lo explica ahora_mismo
Gema
lo
explica
ahora mismo
Post-generator:
“Gema lo explica ahora mismo”
= nice rendering of a text
Gema le explique maintenant → Gema l’explique maintenant
ALL THIS TAKING INTO ACCOUNT THE FORMAT!!!
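The "le explique" to "l’explique" step is a contact rule between adjacent words. Below is a toy version, assuming a tiny list of eliding words and a naive vowel test; both are simplifications made for this sketch and fall well short of a full account of French elision.

```python
APOSTROPHE = "\u2019"            # typographic apostrophe, as on the slide
ELIDING = {"le", "la"}           # toy list of words that lose their vowel
VOWELS = set("aeiouéèêàâîôû")    # naive vowel test


def post_generate(words):
    """Merge an eliding word with a following vowel-initial word."""
    out = []
    i = 0
    while i < len(words):
        word = words[i]
        nxt = words[i + 1] if i + 1 < len(words) else ""
        if word.lower() in ELIDING and nxt and nxt[0].lower() in VOWELS:
            out.append(word[:-1] + APOSTROPHE + nxt)
            i += 2
        else:
            out.append(word)
            i += 1
    return out


rendered = " ".join(post_generate("Gema le explique maintenant".split()))
print(rendered)
```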
How did data blossom in Apertium?
Standards
Documentation
Availability
Support
Community building
Funding
And a free/open-source sharing model
Next data in Apertium?
• Spanish-Italian by the end of the summer: students wanted (EAMT internship)
• More Sami, Kazakh, Tatar, Irish (ongoing projects)
• New pairs: Sardinian-Italian, Macedonian-Albanian, Gujarati-Hindi, Kazakh-English, Sicilian-Spanish, Belarusian-Russian, Polish-Russian, Kurmanji-English (Google Summer of Code)
• More monolingual language packs (Catalan and Spanish soon!)
Thanks!!!
