SlideShare a Scribd company logo
.
               Compiling Apertium dictionaries with HFST
    leveraging generalised compilation formulas to get more and better end
                 applications with fewer language description
.

                                  Tommi A Pirinen, Francis Tyers
                                  tommi.pirinen@helsinki.fi

                                 University of Helsinki, Universitat d’Alacant


                                              May 22, 2012




                                                                      .      .   .      .     .     .
    Tommi A Pirinen (Helsinki)        Compiling apertium monodix with HFST           May 22, 2012       1/8
Outline




.
1    Introduction


.
2    Benefits of this work


.
3    Conclusion




                                                                 .      .   .      .     .     .
    Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       2/8
Finite-state automata and HFST and apertium

     Finite-state automata are one efficient way to encode dictionaries,
     morphological analysers etc.
     HFST stands for Helsinki Finite-State Technology— consisting of a
     library working as a compatibility layer between different open-source
     finite-state implementations,
            SFST
            OpenFST
            Foma
     Also a set of finite-state tools built on top of the library, and set of
     end products using the automata in real-world applications (sold
     separately)
     HFST is still a research project in a computational linguistics’ research
     group—not computer science or engineering
     apertium is a machine-translation platform that uses finite-state
     dictionaries
                                                              .      .   .      .     .     .
 Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       3/8
Compiling apertium dictionaries with HFST—rationale

     “just an engineering exercise”
     getting all language descriptions to compile natively in HFST (as
     opposed to converting compiled automata)
     using existing (and future) HFST algorithms to improve the resulting
     automata
     using bits of linguistic information to get better auxiliary automata for
     HFST end applications — data that may not be possible to induct
     from converted compiled automata
     possibility to integrate more complex features in of finite-state
     morphology in apertium dictionaries—morphophonetics, reduplication
     etc. that may be supported by other HFST tools
     this paper fits nicely in my PhD thesis under “State of the art of in
     language models”

                                                              .      .   .      .     .     .
 Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       4/8
Examples of immediate benefits to dictionary writers




     A lot of current work in building NLP software involves management
     of huge amounts of lexical data
     ...like generating different language models in different morphology
     programming formalisms: apertium, hunspell, xerox tools
     getting native and uniform compilation formulas for all lets you write
     dictionaries once and use everywhere
     or pick and mix tools and features from different formalisms




                                                              .      .   .      .     .     .
 Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       5/8
Examples of additional applications that can be generated
from apertium dictionaries with this work



     Spell-checkers! A basic spell-checker with generic edit distance
     suggestion generator can be automatically generated—and used in
     majority of current open-source software without any extra effort
     Predictive text entry, for mobiles, such T9, XT9, possibly swype and
     keyboard as well
     Morphological analysers, lemmatisers, segmenters, tokenisers, etc.,
     obviously




                                                              .      .   .      .     .     .
 Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       6/8
Examples of benefits that come for free—automatic
optimisation


     depending on library / end format you choose for compiled
     dictionaries, you get speed–space tradeoffs (or improvements in both)
     This is work-in-progress, but once done it can be used in all
     dictionaries without modifications to sources
     automatic flag diacritic induction
     hyperminimisation
     all this can be based on things like finding homomorphic components
     from the finite-state automaton
     the linguistic concepts present in source code but missing from the
     compiled automaton should prove very useful here!


                                                              .      .   .      .     .     .
 Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       7/8
What now?

     The reference material for the article is in our svn http:
     //hfst.svn.sf.net/svnroot/hfst/trunk/lrec-2011-apertium,
     includes compilation of spell-checkers for most apertium dictionaries
     what do we do to remove duplicate work, duplicate versions of
     dictionaries, conversion scripts. . .
     more compilers? Conversion scripts? New programming languages?
     New “standards” that everyone will use?
     I’ll throw you this: I need more linguistic data and less engineering in
     the language model implementations to compile more applications
     from one source dictionary. Example: LR/RL concept in apertium or
     asymmetric flags in Xerox FSM is engineering hack POV; had the
     description called it substandard or dialectal word form it would
     already be usable in all applications!

                                                              .      .   .      .     .     .
 Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       8/8

More Related Content

What's hot

Using translation memory_to_speed_up_tra
Using translation memory_to_speed_up_traUsing translation memory_to_speed_up_tra
Using translation memory_to_speed_up_tra
CamillaTonanzi
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
RIILP
 
Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...
ijnlc
 
Types of machine translation
Types of machine translationTypes of machine translation
Types of machine translation
Rushdi Shams
 
Machine translation
Machine translationMachine translation
Machine translation
mohamed hassan
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...
John Tinsley
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
Marcis Pinnis
 
Lec 15,16,17 NLP.machine translation
Lec 15,16,17  NLP.machine translationLec 15,16,17  NLP.machine translation
Lec 15,16,17 NLP.machine translation
guest873a50
 
Chat adapted pos tagger for romanian language
Chat adapted pos tagger for romanian languageChat adapted pos tagger for romanian language
Chat adapted pos tagger for romanian language
University Politehnica Bucharest
 
A tutorial on Machine Translation
A tutorial on Machine TranslationA tutorial on Machine Translation
A tutorial on Machine Translation
Jaganadh Gopinadhan
 
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
ijnlc
 
Using ontology for natural language processing
Using ontology for natural language processingUsing ontology for natural language processing
Using ontology for natural language processing
cracaoanu constantin sergiu
 
Use of ontologies in natural language processing
Use of ontologies in natural language processingUse of ontologies in natural language processing
Use of ontologies in natural language processing
ATHMAN HAJ-HAMOU
 
Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficulties
ijtsrd
 
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
Editor IJARCET
 
Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic
Jaganadh Gopinadhan
 
Machine Translation
Machine TranslationMachine Translation
Machine Translation
Skilrock Technologies
 
Deep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigmDeep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigm
MeetupDataScienceRoma
 

What's hot (18)

Using translation memory_to_speed_up_tra
Using translation memory_to_speed_up_traUsing translation memory_to_speed_up_tra
Using translation memory_to_speed_up_tra
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
 
Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...
 
Types of machine translation
Types of machine translationTypes of machine translation
Types of machine translation
 
Machine translation
Machine translationMachine translation
Machine translation
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
 
Lec 15,16,17 NLP.machine translation
Lec 15,16,17  NLP.machine translationLec 15,16,17  NLP.machine translation
Lec 15,16,17 NLP.machine translation
 
Chat adapted pos tagger for romanian language
Chat adapted pos tagger for romanian languageChat adapted pos tagger for romanian language
Chat adapted pos tagger for romanian language
 
A tutorial on Machine Translation
A tutorial on Machine TranslationA tutorial on Machine Translation
A tutorial on Machine Translation
 
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
 
Using ontology for natural language processing
Using ontology for natural language processingUsing ontology for natural language processing
Using ontology for natural language processing
 
Use of ontologies in natural language processing
Use of ontologies in natural language processingUse of ontologies in natural language processing
Use of ontologies in natural language processing
 
Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficulties
 
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
 
Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic
 
Machine Translation
Machine TranslationMachine Translation
Machine Translation
 
Deep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigmDeep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigm
 

Viewers also liked

Power To People
Power To PeoplePower To People
Power To People
Pradipta Chatterjee
 
Lezione corso running
Lezione corso runningLezione corso running
Lezione corso runningReti
 
Prepostales
PrepostalesPrepostales
Prepostales
yinalis
 
2.2.crisostomo inovacao no semiarido
2.2.crisostomo inovacao no semiarido2.2.crisostomo inovacao no semiarido
2.2.crisostomo inovacao no semiarido
smtpinov
 
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облакаЭволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
SQALab
 
Meetingin finland
Meetingin finlandMeetingin finland
Meetingin finland
anglimo
 

Viewers also liked (8)

Power To People
Power To PeoplePower To People
Power To People
 
Lezione corso running
Lezione corso runningLezione corso running
Lezione corso running
 
Prepostales
PrepostalesPrepostales
Prepostales
 
2.2.crisostomo inovacao no semiarido
2.2.crisostomo inovacao no semiarido2.2.crisostomo inovacao no semiarido
2.2.crisostomo inovacao no semiarido
 
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облакаЭволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
 
Meetingin finland
Meetingin finlandMeetingin finland
Meetingin finland
 
Cactus
CactusCactus
Cactus
 
Castilla La Mancha
Castilla La ManchaCastilla La Mancha
Castilla La Mancha
 

Similar to Compiling Apertium Dictionaries with HFST

Tlf2016
Tlf2016Tlf2016
Tlf2016
Dru Lavigne
 
Lfnw2016
Lfnw2016Lfnw2016
Lfnw2016
Dru Lavigne
 
G2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian languageG2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian language
ijnlc
 
MOLTO Annual Report 2011
MOLTO Annual Report 2011MOLTO Annual Report 2011
MOLTO Annual Report 2011
Olga Caprotti
 
Lexigraf - a multilingual lexicography DTP engine
Lexigraf - a multilingual lexicography DTP engineLexigraf - a multilingual lexicography DTP engine
Lexigraf - a multilingual lexicography DTP engine
Yiannis Hatzopoulos
 
Olf2016
Olf2016Olf2016
Olf2016
Dru Lavigne
 
Concordances
Concordances Concordances
Concordances
ZahraShafiee1001
 
Lit mtap
Lit mtapLit mtap
The Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
The Distributed Ontology Language (DOL): Use Cases, Syntax, and ExtensibilityThe Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
The Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
Christoph Lange
 
A proposal for technique to use common terms among multiple systems
A proposal for technique to use common terms among multiple systemsA proposal for technique to use common terms among multiple systems
A proposal for technique to use common terms among multiple systems
Mahara Hui
 
Pandoc: a universal document converter
Pandoc: a universal document converterPandoc: a universal document converter
Pandoc: a universal document converter
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
MOLTO poster for META Forum, Brussels 2010, Belgium.
MOLTO poster for META Forum, Brussels 2010, Belgium.MOLTO poster for META Forum, Brussels 2010, Belgium.
MOLTO poster for META Forum, Brussels 2010, Belgium.
Olga Caprotti
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
kevig
 
Call for Papers - April Issue - International Journal on Natural Language Com...
Call for Papers - April Issue - International Journal on Natural Language Com...Call for Papers - April Issue - International Journal on Natural Language Com...
Call for Papers - April Issue - International Journal on Natural Language Com...
kevig
 
call for papers - International Journal on Natural Language Computing (IJNLC)
call for papers - International Journal on Natural Language Computing (IJNLC)call for papers - International Journal on Natural Language Computing (IJNLC)
call for papers - International Journal on Natural Language Computing (IJNLC)
kevig
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
kevig
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
kevig
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
kevig
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
kevig
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
kevig
 

Similar to Compiling Apertium Dictionaries with HFST (20)

Tlf2016
Tlf2016Tlf2016
Tlf2016
 
Lfnw2016
Lfnw2016Lfnw2016
Lfnw2016
 
G2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian languageG2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian language
 
MOLTO Annual Report 2011
MOLTO Annual Report 2011MOLTO Annual Report 2011
MOLTO Annual Report 2011
 
Lexigraf - a multilingual lexicography DTP engine
Lexigraf - a multilingual lexicography DTP engineLexigraf - a multilingual lexicography DTP engine
Lexigraf - a multilingual lexicography DTP engine
 
Olf2016
Olf2016Olf2016
Olf2016
 
Concordances
Concordances Concordances
Concordances
 
Lit mtap
Lit mtapLit mtap
Lit mtap
 
The Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
The Distributed Ontology Language (DOL): Use Cases, Syntax, and ExtensibilityThe Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
The Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
 
A proposal for technique to use common terms among multiple systems
A proposal for technique to use common terms among multiple systemsA proposal for technique to use common terms among multiple systems
A proposal for technique to use common terms among multiple systems
 
Pandoc: a universal document converter
Pandoc: a universal document converterPandoc: a universal document converter
Pandoc: a universal document converter
 
MOLTO poster for META Forum, Brussels 2010, Belgium.
MOLTO poster for META Forum, Brussels 2010, Belgium.MOLTO poster for META Forum, Brussels 2010, Belgium.
MOLTO poster for META Forum, Brussels 2010, Belgium.
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
Call for Papers - April Issue - International Journal on Natural Language Com...
Call for Papers - April Issue - International Journal on Natural Language Com...Call for Papers - April Issue - International Journal on Natural Language Com...
Call for Papers - April Issue - International Journal on Natural Language Com...
 
call for papers - International Journal on Natural Language Computing (IJNLC)
call for papers - International Journal on Natural Language Computing (IJNLC)call for papers - International Journal on Natural Language Computing (IJNLC)
call for papers - International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 

More from Guy De Pauw

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...
Guy De Pauw
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...
Guy De Pauw
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech Tagging
Guy De Pauw
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh Language
Guy De Pauw
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik Language
Guy De Pauw
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)
Guy De Pauw
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Guy De Pauw
 
Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News Corpus
Guy De Pauw
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of Santome
Guy De Pauw
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Guy De Pauw
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic Inflection
Guy De Pauw
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Guy De Pauw
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
Guy De Pauw
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
Guy De Pauw
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
Guy De Pauw
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
Guy De Pauw
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
Guy De Pauw
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation System
Guy De Pauw
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
Guy De Pauw
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Guy De Pauw
 

More from Guy De Pauw (20)

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech Tagging
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh Language
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik Language
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
 
Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News Corpus
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of Santome
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic Inflection
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation System
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
 

Recently uploaded

Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 

Recently uploaded (20)

Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 

Compiling Apertium Dictionaries with HFST

  • 1. . Compiling Apertium dictionaries with HFST leveraging generalised compilation formulas to get more and better end applications with fewer language description . Tommi A Pirinen, Francis Tyers tommi.pirinen@helsinki.fi University of Helsinki, Universitat d’Alacant May 22, 2012 . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 1/8
  • 2. Outline . 1 Introduction . 2 Benefits of this work . 3 Conclusion . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 2/8
  • 3. Finite-state automata and HFST and apertium Finite-state automata are one efficient way to encode dictionaries, morphological analysers etc. HFST stands for Helsinki Finite-State Technology— consisting of a library working as a compatibility layer between different open-source finite-state implementations, SFST OpenFST Foma Also a set of finite-state tools built on top of the library, and set of end products using the automata in real-world applications (sold separately) HFST is still a research project in a computational linguistics’ research group—not computer science or engineering apertium is a machine-translation platform that uses finite-state dictionaries . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 3/8
  • 4. Compiling apertium dictionaries with HFST—rationale “just an engineering exercise” getting all language descriptions to compile natively in HFST (as opposed to converting compiled automata) using existing (and future) HFST algorithms to improve the resulting automata using bits of linguistic information to get better auxiliary automata for HFST end applications — data that may not be possible to induct from converted compiled automata possibility to integrate more complex features in of finite-state morphology in apertium dictionaries—morphophonetics, reduplication etc. that may be supported by other HFST tools this paper fits nicely in my PhD thesis under “State of the art of in language models” . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 4/8
  • 5. Examples of immediate benefits to dictionary writers A lot of current work in building NLP software involves management of huge amounts of lexical data ...like generating different language models in different morphology programming formalisms: apertium, hunspell, xerox tools getting native and uniform compilation formulas for all lets you write dictionaries once and use everywhere or pick and mix tools and features from different formalisms . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 5/8
  • 6. Examples of additional applications that can be generated from apertium dictionaries with this work Spell-checkers! A basic spell-checker with generic edit distance suggestion generator can be automatically generated—and used in majority of current open-source software without any extra effort Predictive text entry, for mobiles, such T9, XT9, possibly swype and keyboard as well Morphological analysers, lemmatisers, segmenters, tokenisers, etc., obviously . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 6/8
  • 7. Examples of benefits that come for free—automatic optimisation depending on library / end format you choose for compiled dictionaries, you get speed–space tradeoffs (or improvements in both) This is work-in-progress, but once done it can be used in all dictionaries without modifications to sources automatic flag diacritic induction hyperminimisation all this can be based on things like finding homomorphic components from the finite-state automaton the linguistic concepts present in source code but missing from the compiled automaton should prove very useful here! . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 7/8
  • 8. What now? The reference material for the article is in our svn http: //hfst.svn.sf.net/svnroot/hfst/trunk/lrec-2011-apertium, includes compilation of spell-checkers for most apertium dictionaries what do we do to remove duplicate work, duplicate versions of dictionaries, conversion scripts. . . more compilers? Conversion scripts? New programming languages? New “standards” that everyone will use? I’ll throw you this: I need more linguistic data and less engineering in the language model implementations to compile more applications from one source dictionary. Example: LR/RL concept in apertium or asymmetric flags in Xerox FSM is engineering hack POV; had the description called it substandard or dialectal word form it would already be usable in all applications! . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 8/8