SlideShare a Scribd company logo
1 of 1
Download to read offline
LABORATOIRE
         D'INFORMATIQUE                              Describing Morphologically-rich Languages using
                                                        Metagrammars: a Look at Verbs in Ikota
                                                                                        1                                                         2                                          1
                                                            Denys Duchier , Brunelle Magnana Ekoukou , Yannick Parmentier , Simon
         FONDAMENTALE
           D'ORLEANS

                                                                                         1                 2
                                                                                Petitjean , Emmanuel Schang
                                                                          1
                                                                           LIFO, Université d’Orléans - 6, rue Léonard de Vinci 45067 Orléans Cedex 2 – France
                                                                         {denys.duchier,yannick.parmentier,simon.petitjean}@univ-orleans.fr
                                                                             2
                                                                               LLL, Université d’Orléans - 10, rue de Tours 45067 Orléans Cedex 2 – France
                                                                              {brunelle.magnana-ekoukou,emmanuel.schang}@univ-orleans.fr

                                                                                                                             Purpose

                                                                  • Provide a formal description of the morphology of verbs in Ikota
                                                                  • Automatically derive from this description a lexicon of inflected forms

   •   Lexicalized wide-coverage tree-grammars for natural languages are very large and extremely resource intensive to develop and maintain
   •   They can be automatically produced by software from a highly modular formal description called a metagrammar, introduced by [1]
   •   Metagrammars are much easier to develop and to maintain
   •   We propose to adopt a similar strategy to capture morphological generalizations over verbs in Ikota

                               Ikota language                                                                                                            eXtensible MetaGrammar
       Ikota is a Bantu language of Gabon and the Democratic                                  XMG is normally used to describe lexicalized tree grammars. In other words, an XMG specification is a declarative description
       Republic of Congo. It is threatened with extinction in                                 of the tree-structures composing a grammar. This description relies on four main concepts:
       Gabon, mainly because of its abandon for French. It                                     • abstraction: the ability to associate a content with a name
       shares many grammatical features with the Bantu lan-                                    • contribution: the ability to accumulate information in any level of linguistic description
       guages. Ikota is a tonal language with two registers (High                              • conjunction: the ability to combine pieces of information
       and Low). It has ten noun classes and has a widespread                                  • disjunction: the ability to non-deterministically select pieces of information
       agreement in the NP.
                              Table: Ikota’s noun classes
                                                                                                                          Formally, one can define an XMG specification as follows:
                       Noun class      prefix        allomorphs
                       CL 1            mò-, Ø-           `
                                                    mw-, n-                                                                  Rule :=          Name → Content
                       CL 2            bà-          b-                                                                    Content :=          Contribution | Name | Content ∨ Content | Content ∧ Content
                       CL 3            mò-, Ø-           `
                                                    mw-, n-
                       CL 4            mè-
                       CL 5            ì-, Ã-       dy-
                                                                                                                                          Metagrammar of Ikota verbal morphology
                       CL 6            mà-          m-
                       CL 7            è-
                       CL 8            bè-                                                                   Subj    →      1 ← m ∨ 1 ← o ∨ ...
                                                                                                                                           `
                       CL 9            Ø-                                                                                   p=1        p=2
                       CL 14           ò-, bò-      bw-                                                                     n = sg n = sg
                                                                                                            Tense    →          2 ← ´e   ∨     2 ← ´e    ∨     2 ← a
                                                                                                                                                                   `    ∨      2 ← a`    ∨      2 ← ab´
                                                                                                                                                                                                     ´ ı
                               Verbs in Ikota                                                                               tense = past tense = future tense = present tense = past       tense = future
                                                                                                                            proxi = near                                   proxi = ¬near proxi = imminent
       Verbs consist of a lexical root (VR) and several affixes                                              Active    →        5 ← À ∨ 5 ← Á ∨ 4 ← ´bw`  e E
       distributed on each side of the VR.                                                                                  active = + active = + active = -
                                                                                                                              prog = -   prog = +
                                Table: Verb formation                                                                            4 ← ÀK
                                                                                                          Aspect     →                      ∨
            Subj- Tense- VR -(Aspect) -Active -(Proximal)                                                                   tense = future tense = ¬future
                                                                                                                             prog = -         prog = +
                      Table: Verbal forms of bòÃákà "to eat"                                            Proximal     →        6 ← nÁ ∨ 6 ← sÁ ∨                        ∨
       Subj. Tense VR Aspect Active Prox.       Value                                                                       proxi = day proxi = far proxi = none ∨ near proxi = imminent
                                                                                                                                                                          tense = future
        m-    à- Ã             -á              present                                                    To-Eat     →         3 ← Ã
        m-    à- Ã             -á    -ná past, yesterday                                                                    vclass = g1
        m-    à- Ã             -á    -sá    distant past                                                 To-Give     →        3 ← w
        m-    é-    Ã          -á           recent past                                                                     vclass = g2
        m-    é-    Ã  -àk     -à          medium future                                                      VR     → To-Eat ∨ To-Give
        m-    é-    Ã  -àk     -à    -ná future, tomorrow                                                    Verb    → Subj ∧ Tense ∧ VR ∧ Aspect ∧ Active ∧ Proximal
        m-    é-    Ã  -àk     -à    -sá   distant future
        m- ábí- Ã -àk          -à         imminent future
                                                                                                                            Figure: A successful derivation for : o+´+Ã+ÀK+À+nÁ → o+´+Ã+ak+a+n´ → oÃ`k`n´
                                                                                                                                                                  ` e             ` e   ` ` a     ´ a a a
       The infinitival suffix aka at the lexical level can be re-
       alized at the surface level by ákà, ´Ù`, ´k`, depending on
                                           E E O O
                                                                                                              Verb     → Subj ∧ Tense ∧ VR ∧ Aspect ∧ Active ∧ Proximal
       the verb class.
                                                                                                                       → 1 ← o ∧`          2 ← ´
                                                                                                                                               e       ∧ 3 ← Ã ∧           4 ← ÀK ∧ 5 ← À ∧ 6 ← nÁ
                                                                                                                         p=2         tense = future vclass = g1 tense = future active = + proxi = day
                                  References                                                                             n = sg                                         prog = -    prog = -
                                                                                                                       → 1 ← o 2 ← ´ 3 ← Ã 4 ← ÀK 5 ← À 6 ← nÁ
                                                                                                                                `        e
       [1] Marie Candito, A Principle-Based Hierarchical Representation of                                                p=2         prog = - tense = future vclass = g1
           LTAGs, Proceedings of the 16th International Conference on Com-                                                n = sg active = + proxi = day
           putational Linguistics (COLING’96)
       [2] Magnana Ekoukou, Brunelle, Morphologie nominale de l’ikota                                                                            Figure: A failed derivation: clashes on tense and on prog
           (B25): inventaire des classes nominales Mémoire de Master 2,
           Université d’Orléans,2010
                                                                                                             Verb     → Subj ∧ Tense ∧ VR ∧ Aspect ∧ Active ∧ Proximal
       [3] Piron, Pascale, Éléments de description du kota, langue bantoue
           du Gabon, mémoire de licence spéciale africaine, Université Libre                                          →     1 ← o ∧
                                                                                                                                 `       2 ← ´e   ∧ 3 ← Ã ∧                   ∧ 5 ← À ∧ 6 ← nÁ
           de Bruxelles, 1990                                                                                               p=2     tense = future vclass = g1 tense = ¬future active = + proxi = day
                                                                                                                            n = sg                              prog = +        prog = -
                                                                                                                      → failure!


                                                                                                                          Conclusion
   •   We proposed a formal, albeit preliminary, declarative description of verbal morphology in Ikota, an arguably minority African language
   •   From this formal description, using XMG, we are able to automatically produce a lexicon of fully inflected verb forms with morphosyntactic features.
   •   Methodology : XMG helps to express ideas, test them quickly, and then validate the results against the available data
   •   Using the same tool, we will be able to also describe the syntax of the language using e.g. tree-adjoining grammars (the topic of an ongoing PhD thesis)
D. Duchier, B. Magnana Ekoukou, Y. Parmentier, S. Petitjean, E. Schang Describing Morphologically-rich Languages using Metagrammars: a Look at Verbs in Ikota. In: LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced languages”

More Related Content

More from Guy De Pauw

Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Guy De Pauw
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTGuy De Pauw
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionGuy De Pauw
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingGuy De Pauw
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishGuy De Pauw
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsGuy De Pauw
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersGuy De Pauw
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentGuy De Pauw
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersGuy De Pauw
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemGuy De Pauw
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemGuy De Pauw
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Guy De Pauw
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersGuy De Pauw
 
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...Guy De Pauw
 
Amharic document clustering
Amharic document clusteringAmharic document clustering
Amharic document clusteringGuy De Pauw
 
Modeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for AmharicModeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for AmharicGuy De Pauw
 
Ge'ez Verbs Morphology and Declaration Model
Ge'ez Verbs Morphology and Declaration ModelGe'ez Verbs Morphology and Declaration Model
Ge'ez Verbs Morphology and Declaration ModelGuy De Pauw
 
Do we need linguistic knowledge for speech technology applications in African...
Do we need linguistic knowledge for speech technology applications in African...Do we need linguistic knowledge for speech technology applications in African...
Do we need linguistic knowledge for speech technology applications in African...Guy De Pauw
 
Towards the implementation of a refined data model for a Zulu machine-readabl...
Towards the implementation of a refined data model for a Zulu machine-readabl...Towards the implementation of a refined data model for a Zulu machine-readabl...
Towards the implementation of a refined data model for a Zulu machine-readabl...Guy De Pauw
 
Capturing Word-level Dependencies in Morpheme-based Language Modeling
Capturing Word-level Dependencies in Morpheme-based Language ModelingCapturing Word-level Dependencies in Morpheme-based Language Modeling
Capturing Word-level Dependencies in Morpheme-based Language ModelingGuy De Pauw
 

More from Guy De Pauw (20)

Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFST
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic Inflection
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation System
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
 
Amharic document clustering
Amharic document clusteringAmharic document clustering
Amharic document clustering
 
Modeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for AmharicModeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for Amharic
 
Ge'ez Verbs Morphology and Declaration Model
Ge'ez Verbs Morphology and Declaration ModelGe'ez Verbs Morphology and Declaration Model
Ge'ez Verbs Morphology and Declaration Model
 
Do we need linguistic knowledge for speech technology applications in African...
Do we need linguistic knowledge for speech technology applications in African...Do we need linguistic knowledge for speech technology applications in African...
Do we need linguistic knowledge for speech technology applications in African...
 
Towards the implementation of a refined data model for a Zulu machine-readabl...
Towards the implementation of a refined data model for a Zulu machine-readabl...Towards the implementation of a refined data model for a Zulu machine-readabl...
Towards the implementation of a refined data model for a Zulu machine-readabl...
 
Capturing Word-level Dependencies in Morpheme-based Language Modeling
Capturing Word-level Dependencies in Morpheme-based Language ModelingCapturing Word-level Dependencies in Morpheme-based Language Modeling
Capturing Word-level Dependencies in Morpheme-based Language Modeling
 

Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs in Ikota

  • 1. LABORATOIRE D'INFORMATIQUE Describing Morphologically-rich Languages using Metagrammars: a Look at Verbs in Ikota 1 2 1 Denys Duchier , Brunelle Magnana Ekoukou , Yannick Parmentier , Simon FONDAMENTALE D'ORLEANS 1 2 Petitjean , Emmanuel Schang 1 LIFO, Université d’Orléans - 6, rue Léonard de Vinci 45067 Orléans Cedex 2 – France {denys.duchier,yannick.parmentier,simon.petitjean}@univ-orleans.fr 2 LLL, Université d’Orléans - 10, rue de Tours 45067 Orléans Cedex 2 – France {brunelle.magnana-ekoukou,emmanuel.schang}@univ-orleans.fr Purpose • Provide a formal description of the morphology of verbs in Ikota • Automatically derive from this description a lexicon of inflected forms • Lexicalized wide-coverage tree-grammars for natural languages are very large and extremely resource intensive to develop and maintain • They can be automatically produced by software from a highly modular formal description called a metagrammar, introduced by [1] • Metagrammars are much easier to develop and to maintain • We propose to adopt a similar strategy to capture morphological generalizations over verbs in Ikota Ikota language eXtensible MetaGrammar Ikota is a Bantu language of Gabon and the Democratic XMG is normally used to describe lexicalized tree grammars. In other words, an XMG specification is a declarative description Republic of Congo. It is threatened with extinction in of the tree-structures composing a grammar. This description relies on four main concepts: Gabon, mainly because of its abandon for French. It • abstraction: the ability to associate a content with a name shares many grammatical features with the Bantu lan- • contribution: the ability to accumulate information in any level of linguistic description guages. Ikota is a tonal language with two registers (High • conjunction: the ability to combine pieces of information and Low). It has ten noun classes and has a widespread • disjunction: the ability to non-deterministically select pieces of information agreement in the NP. Table: Ikota’s noun classes Formally, one can define an XMG specification as follows: Noun class prefix allomorphs CL 1 mò-, Ø- ` mw-, n- Rule := Name → Content CL 2 bà- b- Content := Contribution | Name | Content ∨ Content | Content ∧ Content CL 3 mò-, Ø- ` mw-, n- CL 4 mè- CL 5 ì-, Ã- dy- Metagrammar of Ikota verbal morphology CL 6 mà- m- CL 7 è- CL 8 bè- Subj → 1 ← m ∨ 1 ← o ∨ ... ` CL 9 Ø- p=1 p=2 CL 14 ò-, bò- bw- n = sg n = sg Tense → 2 ← ´e ∨ 2 ← ´e ∨ 2 ← a ` ∨ 2 ← a` ∨ 2 ← ab´ ´ ı Verbs in Ikota tense = past tense = future tense = present tense = past tense = future proxi = near proxi = ¬near proxi = imminent Verbs consist of a lexical root (VR) and several affixes Active → 5 ← À ∨ 5 ← Á ∨ 4 ← ´bw` e E distributed on each side of the VR. active = + active = + active = - prog = - prog = + Table: Verb formation 4 ← ÀK Aspect → ∨ Subj- Tense- VR -(Aspect) -Active -(Proximal) tense = future tense = ¬future prog = - prog = + Table: Verbal forms of bòÃákà "to eat" Proximal → 6 ← nÁ ∨ 6 ← sÁ ∨ ∨ Subj. Tense VR Aspect Active Prox. Value proxi = day proxi = far proxi = none ∨ near proxi = imminent tense = future m- à- Ã -á present To-Eat → 3 ← Ã m- à- Ã -á -ná past, yesterday vclass = g1 m- à- Ã -á -sá distant past To-Give → 3 ← w m- é- Ã -á recent past vclass = g2 m- é- Ã -àk -à medium future VR → To-Eat ∨ To-Give m- é- Ã -àk -à -ná future, tomorrow Verb → Subj ∧ Tense ∧ VR ∧ Aspect ∧ Active ∧ Proximal m- é- Ã -àk -à -sá distant future m- ábí- Ã -àk -à imminent future Figure: A successful derivation for : o+´+Ã+ÀK+À+nÁ → o+´+Ã+ak+a+n´ → oÃ`k`n´ ` e ` e ` ` a ´ a a a The infinitival suffix aka at the lexical level can be re- alized at the surface level by ákà, ´Ù`, ´k`, depending on E E O O Verb → Subj ∧ Tense ∧ VR ∧ Aspect ∧ Active ∧ Proximal the verb class. → 1 ← o ∧` 2 ← ´ e ∧ 3 ← Ã ∧ 4 ← ÀK ∧ 5 ← À ∧ 6 ← nÁ p=2 tense = future vclass = g1 tense = future active = + proxi = day References n = sg prog = - prog = - → 1 ← o 2 ← ´ 3 ← Ã 4 ← ÀK 5 ← À 6 ← nÁ ` e [1] Marie Candito, A Principle-Based Hierarchical Representation of p=2 prog = - tense = future vclass = g1 LTAGs, Proceedings of the 16th International Conference on Com- n = sg active = + proxi = day putational Linguistics (COLING’96) [2] Magnana Ekoukou, Brunelle, Morphologie nominale de l’ikota Figure: A failed derivation: clashes on tense and on prog (B25): inventaire des classes nominales Mémoire de Master 2, Université d’Orléans,2010 Verb → Subj ∧ Tense ∧ VR ∧ Aspect ∧ Active ∧ Proximal [3] Piron, Pascale, Éléments de description du kota, langue bantoue du Gabon, mémoire de licence spéciale africaine, Université Libre → 1 ← o ∧ ` 2 ← ´e ∧ 3 ← Ã ∧ ∧ 5 ← À ∧ 6 ← nÁ de Bruxelles, 1990 p=2 tense = future vclass = g1 tense = ¬future active = + proxi = day n = sg prog = + prog = - → failure! Conclusion • We proposed a formal, albeit preliminary, declarative description of verbal morphology in Ikota, an arguably minority African language • From this formal description, using XMG, we are able to automatically produce a lexicon of fully inflected verb forms with morphosyntactic features. • Methodology : XMG helps to express ideas, test them quickly, and then validate the results against the available data • Using the same tool, we will be able to also describe the syntax of the language using e.g. tree-adjoining grammars (the topic of an ongoing PhD thesis) D. Duchier, B. Magnana Ekoukou, Y. Parmentier, S. Petitjean, E. Schang Describing Morphologically-rich Languages using Metagrammars: a Look at Verbs in Ikota. In: LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced languages”