SlideShare a Scribd company logo
1 of 8
Download to read offline
.
               Compiling Apertium dictionaries with HFST
    leveraging generalised compilation formulas to get more and better end
                 applications with fewer language description
.

                                  Tommi A Pirinen, Francis Tyers
                                  tommi.pirinen@helsinki.fi

                                 University of Helsinki, Universitat d’Alacant


                                              May 22, 2012




                                                                      .      .   .      .     .     .
    Tommi A Pirinen (Helsinki)        Compiling apertium monodix with HFST           May 22, 2012       1/8
Outline




.
1    Introduction


.
2    Benefits of this work


.
3    Conclusion




                                                                 .      .   .      .     .     .
    Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       2/8
Finite-state automata and HFST and apertium

     Finite-state automata are one efficient way to encode dictionaries,
     morphological analysers etc.
     HFST stands for Helsinki Finite-State Technology— consisting of a
     library working as a compatibility layer between different open-source
     finite-state implementations,
            SFST
            OpenFST
            Foma
     Also a set of finite-state tools built on top of the library, and set of
     end products using the automata in real-world applications (sold
     separately)
     HFST is still a research project in a computational linguistics’ research
     group—not computer science or engineering
     apertium is a machine-translation platform that uses finite-state
     dictionaries
                                                              .      .   .      .     .     .
 Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       3/8
Compiling apertium dictionaries with HFST—rationale

     “just an engineering exercise”
     getting all language descriptions to compile natively in HFST (as
     opposed to converting compiled automata)
     using existing (and future) HFST algorithms to improve the resulting
     automata
     using bits of linguistic information to get better auxiliary automata for
     HFST end applications — data that may not be possible to induct
     from converted compiled automata
     possibility to integrate more complex features in of finite-state
     morphology in apertium dictionaries—morphophonetics, reduplication
     etc. that may be supported by other HFST tools
     this paper fits nicely in my PhD thesis under “State of the art of in
     language models”

                                                              .      .   .      .     .     .
 Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       4/8
Examples of immediate benefits to dictionary writers




     A lot of current work in building NLP software involves management
     of huge amounts of lexical data
     ...like generating different language models in different morphology
     programming formalisms: apertium, hunspell, xerox tools
     getting native and uniform compilation formulas for all lets you write
     dictionaries once and use everywhere
     or pick and mix tools and features from different formalisms




                                                              .      .   .      .     .     .
 Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       5/8
Examples of additional applications that can be generated
from apertium dictionaries with this work



     Spell-checkers! A basic spell-checker with generic edit distance
     suggestion generator can be automatically generated—and used in
     majority of current open-source software without any extra effort
     Predictive text entry, for mobiles, such T9, XT9, possibly swype and
     keyboard as well
     Morphological analysers, lemmatisers, segmenters, tokenisers, etc.,
     obviously




                                                              .      .   .      .     .     .
 Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       6/8
Examples of benefits that come for free—automatic
optimisation


     depending on library / end format you choose for compiled
     dictionaries, you get speed–space tradeoffs (or improvements in both)
     This is work-in-progress, but once done it can be used in all
     dictionaries without modifications to sources
     automatic flag diacritic induction
     hyperminimisation
     all this can be based on things like finding homomorphic components
     from the finite-state automaton
     the linguistic concepts present in source code but missing from the
     compiled automaton should prove very useful here!


                                                              .      .   .      .     .     .
 Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       7/8
What now?

     The reference material for the article is in our svn http:
     //hfst.svn.sf.net/svnroot/hfst/trunk/lrec-2011-apertium,
     includes compilation of spell-checkers for most apertium dictionaries
     what do we do to remove duplicate work, duplicate versions of
     dictionaries, conversion scripts. . .
     more compilers? Conversion scripts? New programming languages?
     New “standards” that everyone will use?
     I’ll throw you this: I need more linguistic data and less engineering in
     the language model implementations to compile more applications
     from one source dictionary. Example: LR/RL concept in apertium or
     asymmetric flags in Xerox FSM is engineering hack POV; had the
     description called it substandard or dialectal word form it would
     already be usable in all applications!

                                                              .      .   .      .     .     .
 Tommi A Pirinen (Helsinki)   Compiling apertium monodix with HFST           May 22, 2012       8/8

More Related Content

What's hot

13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
RIILP
 
Types of machine translation
Types of machine translationTypes of machine translation
Types of machine translation
Rushdi Shams
 
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
ijnlc
 
Use of ontologies in natural language processing
Use of ontologies in natural language processingUse of ontologies in natural language processing
Use of ontologies in natural language processing
ATHMAN HAJ-HAMOU
 
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
Editor IJARCET
 
Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic
Jaganadh Gopinadhan
 

What's hot (18)

Using translation memory_to_speed_up_tra
Using translation memory_to_speed_up_traUsing translation memory_to_speed_up_tra
Using translation memory_to_speed_up_tra
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
 
Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...
 
Types of machine translation
Types of machine translationTypes of machine translation
Types of machine translation
 
Machine translation
Machine translationMachine translation
Machine translation
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
 
Lec 15,16,17 NLP.machine translation
Lec 15,16,17  NLP.machine translationLec 15,16,17  NLP.machine translation
Lec 15,16,17 NLP.machine translation
 
Chat adapted pos tagger for romanian language
Chat adapted pos tagger for romanian languageChat adapted pos tagger for romanian language
Chat adapted pos tagger for romanian language
 
A tutorial on Machine Translation
A tutorial on Machine TranslationA tutorial on Machine Translation
A tutorial on Machine Translation
 
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
 
Using ontology for natural language processing
Using ontology for natural language processingUsing ontology for natural language processing
Using ontology for natural language processing
 
Use of ontologies in natural language processing
Use of ontologies in natural language processingUse of ontologies in natural language processing
Use of ontologies in natural language processing
 
Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficulties
 
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
 
Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic
 
Machine Translation
Machine TranslationMachine Translation
Machine Translation
 
Deep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigmDeep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigm
 

Viewers also liked (8)

Power To People
Power To PeoplePower To People
Power To People
 
Lezione corso running
Lezione corso runningLezione corso running
Lezione corso running
 
Prepostales
PrepostalesPrepostales
Prepostales
 
2.2.crisostomo inovacao no semiarido
2.2.crisostomo inovacao no semiarido2.2.crisostomo inovacao no semiarido
2.2.crisostomo inovacao no semiarido
 
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облакаЭволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
Эволюция ускорения юнит-тестов в Badoo - от баш-скриптов до облака
 
Meetingin finland
Meetingin finlandMeetingin finland
Meetingin finland
 
Cactus
CactusCactus
Cactus
 
Castilla La Mancha
Castilla La ManchaCastilla La Mancha
Castilla La Mancha
 

Similar to Compiling Apertium Dictionaries with HFST

Similar to Compiling Apertium Dictionaries with HFST (20)

Tlf2016
Tlf2016Tlf2016
Tlf2016
 
Lfnw2016
Lfnw2016Lfnw2016
Lfnw2016
 
G2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian languageG2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian language
 
MOLTO Annual Report 2011
MOLTO Annual Report 2011MOLTO Annual Report 2011
MOLTO Annual Report 2011
 
Lexigraf - a multilingual lexicography DTP engine
Lexigraf - a multilingual lexicography DTP engineLexigraf - a multilingual lexicography DTP engine
Lexigraf - a multilingual lexicography DTP engine
 
Olf2016
Olf2016Olf2016
Olf2016
 
Concordances
Concordances Concordances
Concordances
 
Lit mtap
Lit mtapLit mtap
Lit mtap
 
The Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
The Distributed Ontology Language (DOL): Use Cases, Syntax, and ExtensibilityThe Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
The Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
 
A proposal for technique to use common terms among multiple systems
A proposal for technique to use common terms among multiple systemsA proposal for technique to use common terms among multiple systems
A proposal for technique to use common terms among multiple systems
 
Pandoc: a universal document converter
Pandoc: a universal document converterPandoc: a universal document converter
Pandoc: a universal document converter
 
MOLTO poster for META Forum, Brussels 2010, Belgium.
MOLTO poster for META Forum, Brussels 2010, Belgium.MOLTO poster for META Forum, Brussels 2010, Belgium.
MOLTO poster for META Forum, Brussels 2010, Belgium.
 
Call for Papers - April Issue - International Journal on Natural Language Com...
Call for Papers - April Issue - International Journal on Natural Language Com...Call for Papers - April Issue - International Journal on Natural Language Com...
Call for Papers - April Issue - International Journal on Natural Language Com...
 
call for papers - International Journal on Natural Language Computing (IJNLC)
call for papers - International Journal on Natural Language Computing (IJNLC)call for papers - International Journal on Natural Language Computing (IJNLC)
call for papers - International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 
International Journal on Natural Language Computing(IJNLC)
 International Journal on Natural Language Computing(IJNLC) International Journal on Natural Language Computing(IJNLC)
International Journal on Natural Language Computing(IJNLC)
 
International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)International Journal on Natural Language Computing (IJNLC)
International Journal on Natural Language Computing (IJNLC)
 

More from Guy De Pauw

The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
Guy De Pauw
 

More from Guy De Pauw (20)

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech Tagging
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh Language
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik Language
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
 
Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News Corpus
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of Santome
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic Inflection
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation System
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
 

Recently uploaded

Recently uploaded (20)

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

Compiling Apertium Dictionaries with HFST

  • 1. . Compiling Apertium dictionaries with HFST leveraging generalised compilation formulas to get more and better end applications with fewer language description . Tommi A Pirinen, Francis Tyers tommi.pirinen@helsinki.fi University of Helsinki, Universitat d’Alacant May 22, 2012 . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 1/8
  • 2. Outline . 1 Introduction . 2 Benefits of this work . 3 Conclusion . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 2/8
  • 3. Finite-state automata and HFST and apertium Finite-state automata are one efficient way to encode dictionaries, morphological analysers etc. HFST stands for Helsinki Finite-State Technology— consisting of a library working as a compatibility layer between different open-source finite-state implementations, SFST OpenFST Foma Also a set of finite-state tools built on top of the library, and set of end products using the automata in real-world applications (sold separately) HFST is still a research project in a computational linguistics’ research group—not computer science or engineering apertium is a machine-translation platform that uses finite-state dictionaries . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 3/8
  • 4. Compiling apertium dictionaries with HFST—rationale “just an engineering exercise” getting all language descriptions to compile natively in HFST (as opposed to converting compiled automata) using existing (and future) HFST algorithms to improve the resulting automata using bits of linguistic information to get better auxiliary automata for HFST end applications — data that may not be possible to induct from converted compiled automata possibility to integrate more complex features in of finite-state morphology in apertium dictionaries—morphophonetics, reduplication etc. that may be supported by other HFST tools this paper fits nicely in my PhD thesis under “State of the art of in language models” . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 4/8
  • 5. Examples of immediate benefits to dictionary writers A lot of current work in building NLP software involves management of huge amounts of lexical data ...like generating different language models in different morphology programming formalisms: apertium, hunspell, xerox tools getting native and uniform compilation formulas for all lets you write dictionaries once and use everywhere or pick and mix tools and features from different formalisms . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 5/8
  • 6. Examples of additional applications that can be generated from apertium dictionaries with this work Spell-checkers! A basic spell-checker with generic edit distance suggestion generator can be automatically generated—and used in majority of current open-source software without any extra effort Predictive text entry, for mobiles, such T9, XT9, possibly swype and keyboard as well Morphological analysers, lemmatisers, segmenters, tokenisers, etc., obviously . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 6/8
  • 7. Examples of benefits that come for free—automatic optimisation depending on library / end format you choose for compiled dictionaries, you get speed–space tradeoffs (or improvements in both) This is work-in-progress, but once done it can be used in all dictionaries without modifications to sources automatic flag diacritic induction hyperminimisation all this can be based on things like finding homomorphic components from the finite-state automaton the linguistic concepts present in source code but missing from the compiled automaton should prove very useful here! . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 7/8
  • 8. What now? The reference material for the article is in our svn http: //hfst.svn.sf.net/svnroot/hfst/trunk/lrec-2011-apertium, includes compilation of spell-checkers for most apertium dictionaries what do we do to remove duplicate work, duplicate versions of dictionaries, conversion scripts. . . more compilers? Conversion scripts? New programming languages? New “standards” that everyone will use? I’ll throw you this: I need more linguistic data and less engineering in the language model implementations to compile more applications from one source dictionary. Example: LR/RL concept in apertium or asymmetric flags in Xerox FSM is engineering hack POV; had the description called it substandard or dialectal word form it would already be usable in all applications! . . . . . . Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012 8/8