Morphological segmentation is a fundamental task in language processing. Some languages, such as
Arabic and Tigrinya,have words packed with very rich morphological information.Therefore, unpacking
this information becomes a necessary taskfor many downstream natural language processing tasks. This
paper presents the first morphological segmentation research forTigrinya. We constructed a new
morphologically segmented corpus with 45,127 manually segmented tokens. Conditional random fields
(CRF) and window-based longshort-term memory (LSTM) neural networkswere employed separately to
develop our boundary detection models. We appliedlanguage-independent character and substring features
for the CRFand character embeddings for the LSTM networks. Experimentswere performed with four
variants of the Begin-Inside-Outside (BIO) chunk annotation scheme. We achieved 94.67% F1 scoreusing
bidirectional LSTMs with fixed-sizewindow approach to morphemeboundary detection.
Corpus-based part-of-speech disambiguation of PersianIDES Editor
In this paper we introduce a method for part-ofspeech
disambiguation of Persian texts, which uses word class
probabilities in a relatively small training corpus in order to
automatically tag unrestricted Persian texts. The experiment
has been carried out in two levels as unigram and bi-gram
genotypes disambiguation. Comparing the results gained from
the two levels, we show that using immediate right context to
which a given word belongs can increase the accuracy rate of
the system to a high degree
This document presents two new approaches for aligning sentences in parallel English-Arabic corpora: mathematical regression (MR) and genetic algorithm (GA) classifiers. Feature vectors containing text features like length, punctuation score, and cognate score are extracted from sentence pairs and used to train the MR and GA models on manually aligned training data. The trained models are then tested on additional sentence pairs, achieving better results than a baseline length-based approach. The methods can be applied to any language pair by modifying the feature vector.
SYNTACTIC ANALYSIS BASED ON MORPHOLOGICAL CHARACTERISTIC FEATURES OF THE ROMA...kevig
This paper gives complete guidelines for authors submitting papers for the AIRCC Journals. This paper refers to the syntactic analysis of phrases in Romanian, as an important process of natural language processing. We will suggest a real-time solution, based on the idea of using some words or groups of words that indicate grammatical category; and some specific endings of some parts of sentence. Our idea is based on some characteristics of the Romanian language, where some prepositions, adverbs or some specific endings can provide a lot of information about the structure of a complex sentence. Such characteristics can be found in other languages, too, such as French. Using a special grammar, we developed a system (DIASEXP) that can perform a dialogue in natural language with assertive and interogative sentences about a “story” (a set of sentences describing some events from the real life).
Arabic morphology encapsulates many valuable features such as word’s root. Arabic roots are beingutilized for many tasks; the process of extracting a word’s root is referred to as stemming. Stemming is anessential part of most Natural Language Processing tasks, especially for derivative languages such asArabic. However, stemming is faced with the problem of ambiguity, where two or more roots could beextracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence
model. It captures the meaning of a word based on its context. In this paper, a distributional semantics
model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate itseffectiveness on the stemming analysis task. It showed an accuracy of 81.5%, with a at least 9.4%improvement over other stemmers.
This document presents machine learning techniques to identify verb subcategorization frames from Czech language corpora. It compares three statistical techniques and shows they can discover new subcategorization frames and label dependents as arguments or adjuncts with 88% accuracy. It describes the task, relevant properties of Czech including free word order and case marking, and how the techniques are applied without assuming a predefined frame set.
This document discusses Lexical Functional Grammar (LFG) and Generalized Phrase Structure Grammar (GPSG). LFG was developed in the 1970s and emphasizes analyzing phenomena in lexical and functional terms. It uses two levels of structure: c-structure, which is a tree structure, and f-structure, which captures grammatical functions. GPSG was developed in 1985 and is confined to context-free phrase structure rules. It uses immediate dominance and linear precedence rules.
Natural Language Processing (NLP) involves using computational techniques to analyze and understand human languages. Key techniques in NLP include sentiment analysis to classify emotions in text, text classification to categorize text into predefined tags or categories, and tokenization which breaks text into discrete words and punctuation. NLP is used to teach machines how to read and understand human languages by identifying relationships between words and entities. Other areas of NLP include parts of speech tagging, constituent structure analysis, and analysis of pronunciation, morphology, syntax, semantics, and pragmatics.
Corpus-based part-of-speech disambiguation of PersianIDES Editor
In this paper we introduce a method for part-ofspeech
disambiguation of Persian texts, which uses word class
probabilities in a relatively small training corpus in order to
automatically tag unrestricted Persian texts. The experiment
has been carried out in two levels as unigram and bi-gram
genotypes disambiguation. Comparing the results gained from
the two levels, we show that using immediate right context to
which a given word belongs can increase the accuracy rate of
the system to a high degree
This document presents two new approaches for aligning sentences in parallel English-Arabic corpora: mathematical regression (MR) and genetic algorithm (GA) classifiers. Feature vectors containing text features like length, punctuation score, and cognate score are extracted from sentence pairs and used to train the MR and GA models on manually aligned training data. The trained models are then tested on additional sentence pairs, achieving better results than a baseline length-based approach. The methods can be applied to any language pair by modifying the feature vector.
SYNTACTIC ANALYSIS BASED ON MORPHOLOGICAL CHARACTERISTIC FEATURES OF THE ROMA...kevig
This paper gives complete guidelines for authors submitting papers for the AIRCC Journals. This paper refers to the syntactic analysis of phrases in Romanian, as an important process of natural language processing. We will suggest a real-time solution, based on the idea of using some words or groups of words that indicate grammatical category; and some specific endings of some parts of sentence. Our idea is based on some characteristics of the Romanian language, where some prepositions, adverbs or some specific endings can provide a lot of information about the structure of a complex sentence. Such characteristics can be found in other languages, too, such as French. Using a special grammar, we developed a system (DIASEXP) that can perform a dialogue in natural language with assertive and interogative sentences about a “story” (a set of sentences describing some events from the real life).
Arabic morphology encapsulates many valuable features such as word’s root. Arabic roots are beingutilized for many tasks; the process of extracting a word’s root is referred to as stemming. Stemming is anessential part of most Natural Language Processing tasks, especially for derivative languages such asArabic. However, stemming is faced with the problem of ambiguity, where two or more roots could beextracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence
model. It captures the meaning of a word based on its context. In this paper, a distributional semantics
model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate itseffectiveness on the stemming analysis task. It showed an accuracy of 81.5%, with a at least 9.4%improvement over other stemmers.
This document presents machine learning techniques to identify verb subcategorization frames from Czech language corpora. It compares three statistical techniques and shows they can discover new subcategorization frames and label dependents as arguments or adjuncts with 88% accuracy. It describes the task, relevant properties of Czech including free word order and case marking, and how the techniques are applied without assuming a predefined frame set.
This document discusses Lexical Functional Grammar (LFG) and Generalized Phrase Structure Grammar (GPSG). LFG was developed in the 1970s and emphasizes analyzing phenomena in lexical and functional terms. It uses two levels of structure: c-structure, which is a tree structure, and f-structure, which captures grammatical functions. GPSG was developed in 1985 and is confined to context-free phrase structure rules. It uses immediate dominance and linear precedence rules.
Natural Language Processing (NLP) involves using computational techniques to analyze and understand human languages. Key techniques in NLP include sentiment analysis to classify emotions in text, text classification to categorize text into predefined tags or categories, and tokenization which breaks text into discrete words and punctuation. NLP is used to teach machines how to read and understand human languages by identifying relationships between words and entities. Other areas of NLP include parts of speech tagging, constituent structure analysis, and analysis of pronunciation, morphology, syntax, semantics, and pragmatics.
Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web...Svetlin Nakov
Scientific paper: False friends are pairs of words in two languages
that are perceived as similar, but have different
meanings, e.g., Gift in German means poison in
English. In this paper, we present several unsupervised
algorithms for acquiring such pairs
from a sentence-aligned bi-text. First, we try different
ways of exploiting simple statistics about
monolingual word occurrences and cross-lingual
word co-occurrences in the bi-text. Second, using
methods from statistical machine translation, we
induce word alignments in an unsupervised way,
from which we estimate lexical translation probabilities,
which we use to measure cross-lingual
semantic similarity. Third, we experiment with
a semantic similarity measure that uses the Web
as a corpus to extract local contexts from text
snippets returned by a search engine, and a bilingual
glossary of known word translation pairs,
used as “bridges”. Finally, all measures are combined
and applied to the task of identifying likely
false friends. The evaluation for Russian and
Bulgarian shows a significant improvement over
previously-proposed algorithms.
The document discusses Performance Grammar (PG), a psycholinguistically motivated grammar formalism that aims to describe syntactic processing phenomena in language production and comprehension. PG uses a two-stage process to generate sentences, first assembling an unordered hierarchical structure and then applying linear precedence rules to order constituents from left to right. However, a purely precedence-based approach cannot account for incremental sentence production. To address this, PG supplements precedence rules with rules that assign constituents to absolute positions within a constituent's spatial layout.
Unifying Logical Form and The Linguistic Level of LFHady Ba
This document summarizes a talk given by Mouhamadou El Hady Ba on unifying logical form between linguistics and logic. The talk discusses (1) the generative linguistic framework and the linguistic level of logical form (LF), (2) logicians' view of logical form as translating natural language into an artificial logical language, and (3) how generalized quantification allows for a unification of the two views of logical form by treating noun phrases as generalized quantifiers. The unification shows that linguists' LF and logicians' logical form are equivalent and reveals the true logical structure of natural language sentences. However, problems remain in fully identifying the two views of logical form and determining whether grammatical rules are truly
The document discusses phonetic form (PF) and logical form (LF) as interface levels between language and other cognitive systems. PF and LF connect the computational system of grammar to physical sound realization and semantic meaning. The document also discusses syntactic movement and how it relates deep structure to surface structure through traces. Binding theory deals with how referring expressions like pronouns relate to other noun phrases in a sentence.
This document describes a statistical approach to part-of-speech tagging and chunking for the Malayalam language. It uses a Hidden Markov Model with the Viterbi algorithm for tagging. The system was trained on a manually tagged corpus containing over 15,000 tokens. For testing, the system achieved a 90.5% accuracy for part-of-speech tagging and 92% accuracy for chunking on a test set of 200 tokens. The statistical approach performed well for the highly inflected Malayalam language. The document also outlines the tagsets used for part-of-speech tagging and chunking in Malayalam.
This paper performs a detailed analysis on the alignment of Portuguese contractions, based on a previously aligned bilingual corpus. The alignment task was performed manually in a subset of the English-Portuguese CLUE4Translation Alignment Collection. The initial parallel corpus was pre-processed and a decision was made as to whether the contraction should be maintained or decomposed in the alignment. Decomposition was required in the cases in which the two words that have been concatenated, i.e., the preposition and the determiner or pronoun, go in two separate translation alignment pairs (PT- [no seio de] [a União Europeia] EN- [within] [the European Union]). Most contractions required decomposition in contexts where they are positioned at the end of a multiword unit. On the other hand, contractions tend to be maintained when they occur at the beginning or in the middle of the multiword unit, i.e., in the frozen part of the multiword (PT- [no que diz respeito a] EN- [with regard to] or PT- [além disso] EN-[in addition]. A correct alignment of multiwords and phrasal units containing contractions is instrumental for machine translation, paraphrasing, and variety adaptation.
Learning phoneme mappings for transliteration without parallel dataAttaporn Ninsuwan
The document presents a method for learning cross-language phoneme mappings without parallel data by framing transliteration as a decipherment problem and using monolingual resources to learn mappings between English and Japanese phonemes. It compares this unsupervised approach to a supervised approach using parallel data and finds the unsupervised method achieves 40% accuracy on a name transliteration task, similar to the supervised approach. The goal is to develop transliteration systems that do not require parallel resources for any language pair.
Lecture 2: From Semantics To Semantic-Oriented ApplicationsMarina Santini
From the "Natural Language Processing" LinkedIn group:
John Kontos, Professor of Artificial Intelligence
I wonder whether translating into formal logic is nothing more than transliteration which simply isolates the part of the text that can be reasoned upon using the simple inference mechanism of formal logic. The real problem I think lies with the part of text that CANNOT be translated one the one hand and the one that changes its meaning due to civilization advances. My own proposal is to leave NL text alone and try building inference mechanisms for the UNTRANSLATED text depending on the task requirements.
All the best
John"
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...Guy De Pauw
This document discusses tokenization and computational verb morphology for Setswana, a Bantu language with a disjunctive orthography. It presents an approach that combines two tokenization transducers and a morphological analyzer to effectively tokenize Setswana text. The approach was tested on a short Setswana text and achieved 93.6% accuracy between the automatically and hand-tokenized texts. While mostly successful, some issues remained around longest matches that were not valid tokens or did not allow morphological analysis. Overall, the approach demonstrated that a precise tokenizer and morphological analyzer can largely resolve the challenges of Setswana's disjunctive writing system.
Natural Language processing Parts of speech tagging, its classes, and how to ...Rajnish Raj
Part of speech (POS) tagging is the process of assigning a part of speech tag like noun, verb, adjective to each word in a sentence. It involves determining the most likely tag sequence given the probabilities of tags occurring before or after other tags, and words occurring with certain tags. POS tagging is the first step in many NLP applications and helps determine the grammatical role of words. It involves calculating bigram and lexical probabilities from annotated corpora to find the tag sequence with the highest joint probability.
Dependency Analysis of Abstract Universal Structures in Korean and EnglishJinho Choi
This thesis gives two contributions in the form of lexical resourcesto (1) dependency parsing in Korean and (2) semantic parsing in English. First,we describe our methodology for building three dependency treebanks in Korean derived from existing treebanks and pseudo-annotated according to the latest guidelines from the Universal Dependencies (UD). The original Google Korean UD Treebank is re-tokenized to ensure morpheme-level annotation consistency with other corpora while maintaining linguistic validity of the revised tokens. Phrase structure trees in the Penn Korean Treebank and the Kaist Treebank are automatically converted into UD dependency trees by applying head-percolation rules and linguistically motivated heuristics. A total of 38K+ dependency trees are generated.To the best of our knowledge, this is the first time that the three Korean treebanks are converted into UD dependency treebanks following the latest annotation guidelines. Second, we introduce an on-going project for augmenting the OntoNotes phrase structure treebank with semantic features found in the Abstract Meaning Representation (AMR), as part of an effort to build an accurate AMR parser. We propose a novel technique for AMR parsing that first trains a dependency parser on the OntoNotes corpus augmented with numbered arguments in the Proposition Bank (PropBank), and then does a transfer learning of the trained dependency parser for the AMR parsing task. A preliminary step is to prepare dependency data by performing an automatic replacement of dependencies that define predicate argument structure with their corresponding PropBank argument labels during constituent-to-dependency conversion. To the best of our knowledge, this is the first time that the PropBank labels are directly inserted into dependency structure, producing a new dependency corpus with rich syntactic information as well as semantic role information provided by PropBank that fully describes the predicate-argument structure, making it an ideal resource for AMR parsing and, broadly, semantic parsing.
This document provides an overview of the Minimalist Program (MP) proposed by Chomsky in 1993. It discusses the redundant and necessary levels of representation, including Logical Form and Phonetic Form. Principles like economy of derivation and economy of representation are explained. The document also covers topics like phrase structure, movements, feature checking, and the Full Interpretation Principle in MP. The conclusion states that MP aims to minimize theoretical concepts in syntax to achieve universality of grammar.
The document discusses constructive description logics and provides three options for constructing description logics constructively:
1) Translating description logic syntax into intuitionistic first-order logic (IFOL) to obtain the logic IALC.
2) Translating description logic syntax into intuitionistic modal logic (IK) to obtain the logic iALC.
3) Translating description logic syntax into constructive modal logic (CK) to obtain the logic cALC.
The talk outlines the translation approaches and discusses some pros and cons of the different constructive description logics, but notes that the work is preliminary and more criteria are needed to identify the best constructive system(s).
Finite-state morphological parsing uses finite-state transducers to parse words into their morphological components like stems and affixes. It requires a lexicon of stems and affixes, morphotactic rules describing valid morpheme combinations, and orthographic rules for spelling changes. The parser is built as a cascade of finite-state automata representing the lexicon, morphotactics and spelling rules. It maps surface word forms onto their underlying lexical representations including stems and morphological features. This allows morphological analysis of both regular and irregular forms.
Neural machine translation of rare words with subword unitsTae Hwan Jung
This paper proposes using subword units generated by byte-pair encoding (BPE) to address the open-vocabulary problem in neural machine translation. The paper finds that BPE segmentation outperforms a back-off dictionary baseline on two translation tasks, improving BLEU by up to 1.1 and CHRF by up to 1.3. BPE learns a joint encoding between source and target languages which increases consistency in segmentation compared to language-specific encodings, further improving translation of rare and unseen words.
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015RIILP
The document discusses using syntactic preordering models to delimit the morphosyntactic search space for machine translation of morphologically rich languages. It explores preordering dependency trees of the source language to reduce word order variations and predicting morphological attributes on the source side to inform target language word selection. Experimental results show that non-local features and jointly learning which attributes to predict can improve translation performance over baselines. The work aims to combine preordering and morphology prediction to better exploit interactions between syntactic structure and inflectional properties.
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMERijnlc
The document describes a deep BiGRU-CNN-CRF neural network model that jointly performs word segmentation, stemming, and named entity recognition for the Myanmar language. The model uses character-level and syllable-level representations as input. It was trained on a manually annotated Myanmar corpus. Evaluation results showed the neural sequence labeling architecture achieved state-of-the-art performance for the joint tasks of word segmentation, stemming, and named entity recognition in Myanmar text.
Natural Language Processing (NLP) involves developing computational techniques to analyze and understand human languages. Key techniques in NLP include sentiment analysis to classify emotions in text, text classification to categorize text, and tokenization to break text into discrete units like words. NLP is used to teach machines how to read and understand human language by identifying relationships between words and other linguistic elements.
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...gerogepatton
This document describes a study that used machine learning models to measure the complexity of pronouncing different languages. The study trained a character-level transformer model on a grapheme-to-phoneme transliteration task for 22 languages. It found that languages with a more direct grapheme-to-phoneme mapping, like Esperanto and Malay, were easier for the model to learn than languages with a more complex mapping, like Cantonese and Japanese. The complexity of a language's pronunciation was found to correlate with how direct or simple its grapheme-to-phoneme mapping is. The study also noted that comparing languages fairly requires considering differences in available training data per language.
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...IJITE
Machine learning models allow us to compare languages by showing how hard a task in each language might be to learn and perform well on. Following this line of investigation, we explore what makes a language “hard to pronounce” by modelling the task of grapheme-to-phoneme (g2p) transliteration. By training a character-level transformer model on this task across 22 languages and measuring the model’s proficiency against its grapheme and phoneme
inventories, we show that certain characteristics emerge that separate easier and harder
languages with respect to learning to pronounce. Namely the complexity of a language's
pronunciation from its orthography is due to the expressive or simplicity of its grapheme-tophoneme mapping. Further discussion illustrates how future studies should consider relative data sparsity per language to design fairer cross-lingual comparison tasks.
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...ijrap
Machine learning models allow us to compare languages by showing how hard a task in each language might be to learn and perform well on. Following this line of investigation, we explore what makes a language “hard to pronounce” by modelling the task of grapheme-to-phoneme (g2p) transliteration. By training a character-level transformer model on this task across 22 languages and measuring the model’s proficiency against its grapheme and phoneme inventories, we show that certain characteristics emerge that separate easier and harder languages with respect to learning to pronounce. Namely the complexity of a language's pronunciation from its orthography is due to the expressive or simplicity of its grapheme-to phoneme mapping. Further discussion illustrates how future studies should consider relative data sparsity per language to design fairer cross-lingual comparison tasks.
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMERkevig
Morphological stemming becomes a critical step toward natural language processing. The process of
stemming is to reduce alternative forms to a common morphological root. Word segmentation for
Myanmar Language, like for most Asian Languages, is an important task and extensively-studied
sequence labelling problem. Named entity detection is one of the issues in Asian Language that has
traditionally required a large amount of feature engineering to achieve high performance. The new
approach is integrating them that would benefit in all these processes. In recent years, end-to-end
sequence labelling models with deep learning are widely used. This paper introduces a deep BiGRUCNN-CRF network that jointly learns word segmentation, stemming and named entity recognition tasks.
We trained the model using manually annotated corpora. State-of-the-art named entity recognition
systems rely heavily on handcrafted feature built in our new approach, we introduce the joint model that
relies on two sources of information: character level representation and syllable level representation.
Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web...Svetlin Nakov
Scientific paper: False friends are pairs of words in two languages
that are perceived as similar, but have different
meanings, e.g., Gift in German means poison in
English. In this paper, we present several unsupervised
algorithms for acquiring such pairs
from a sentence-aligned bi-text. First, we try different
ways of exploiting simple statistics about
monolingual word occurrences and cross-lingual
word co-occurrences in the bi-text. Second, using
methods from statistical machine translation, we
induce word alignments in an unsupervised way,
from which we estimate lexical translation probabilities,
which we use to measure cross-lingual
semantic similarity. Third, we experiment with
a semantic similarity measure that uses the Web
as a corpus to extract local contexts from text
snippets returned by a search engine, and a bilingual
glossary of known word translation pairs,
used as “bridges”. Finally, all measures are combined
and applied to the task of identifying likely
false friends. The evaluation for Russian and
Bulgarian shows a significant improvement over
previously-proposed algorithms.
The document discusses Performance Grammar (PG), a psycholinguistically motivated grammar formalism that aims to describe syntactic processing phenomena in language production and comprehension. PG uses a two-stage process to generate sentences, first assembling an unordered hierarchical structure and then applying linear precedence rules to order constituents from left to right. However, a purely precedence-based approach cannot account for incremental sentence production. To address this, PG supplements precedence rules with rules that assign constituents to absolute positions within a constituent's spatial layout.
Unifying Logical Form and The Linguistic Level of LFHady Ba
This document summarizes a talk given by Mouhamadou El Hady Ba on unifying logical form between linguistics and logic. The talk discusses (1) the generative linguistic framework and the linguistic level of logical form (LF), (2) logicians' view of logical form as translating natural language into an artificial logical language, and (3) how generalized quantification allows for a unification of the two views of logical form by treating noun phrases as generalized quantifiers. The unification shows that linguists' LF and logicians' logical form are equivalent and reveals the true logical structure of natural language sentences. However, problems remain in fully identifying the two views of logical form and determining whether grammatical rules are truly
The document discusses phonetic form (PF) and logical form (LF) as interface levels between language and other cognitive systems. PF and LF connect the computational system of grammar to physical sound realization and semantic meaning. The document also discusses syntactic movement and how it relates deep structure to surface structure through traces. Binding theory deals with how referring expressions like pronouns relate to other noun phrases in a sentence.
This document describes a statistical approach to part-of-speech tagging and chunking for the Malayalam language. It uses a Hidden Markov Model with the Viterbi algorithm for tagging. The system was trained on a manually tagged corpus containing over 15,000 tokens. For testing, the system achieved a 90.5% accuracy for part-of-speech tagging and 92% accuracy for chunking on a test set of 200 tokens. The statistical approach performed well for the highly inflected Malayalam language. The document also outlines the tagsets used for part-of-speech tagging and chunking in Malayalam.
This paper performs a detailed analysis on the alignment of Portuguese contractions, based on a previously aligned bilingual corpus. The alignment task was performed manually in a subset of the English-Portuguese CLUE4Translation Alignment Collection. The initial parallel corpus was pre-processed and a decision was made as to whether the contraction should be maintained or decomposed in the alignment. Decomposition was required in the cases in which the two words that have been concatenated, i.e., the preposition and the determiner or pronoun, go in two separate translation alignment pairs (PT- [no seio de] [a União Europeia] EN- [within] [the European Union]). Most contractions required decomposition in contexts where they are positioned at the end of a multiword unit. On the other hand, contractions tend to be maintained when they occur at the beginning or in the middle of the multiword unit, i.e., in the frozen part of the multiword (PT- [no que diz respeito a] EN- [with regard to] or PT- [além disso] EN-[in addition]. A correct alignment of multiwords and phrasal units containing contractions is instrumental for machine translation, paraphrasing, and variety adaptation.
Learning phoneme mappings for transliteration without parallel dataAttaporn Ninsuwan
The document presents a method for learning cross-language phoneme mappings without parallel data by framing transliteration as a decipherment problem and using monolingual resources to learn mappings between English and Japanese phonemes. It compares this unsupervised approach to a supervised approach using parallel data and finds the unsupervised method achieves 40% accuracy on a name transliteration task, similar to the supervised approach. The goal is to develop transliteration systems that do not require parallel resources for any language pair.
Lecture 2: From Semantics To Semantic-Oriented ApplicationsMarina Santini
From the "Natural Language Processing" LinkedIn group:
John Kontos, Professor of Artificial Intelligence
I wonder whether translating into formal logic is nothing more than transliteration which simply isolates the part of the text that can be reasoned upon using the simple inference mechanism of formal logic. The real problem I think lies with the part of text that CANNOT be translated one the one hand and the one that changes its meaning due to civilization advances. My own proposal is to leave NL text alone and try building inference mechanisms for the UNTRANSLATED text depending on the task requirements.
All the best
John"
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...Guy De Pauw
This document discusses tokenization and computational verb morphology for Setswana, a Bantu language with a disjunctive orthography. It presents an approach that combines two tokenization transducers and a morphological analyzer to effectively tokenize Setswana text. The approach was tested on a short Setswana text and achieved 93.6% accuracy between the automatically and hand-tokenized texts. While mostly successful, some issues remained around longest matches that were not valid tokens or did not allow morphological analysis. Overall, the approach demonstrated that a precise tokenizer and morphological analyzer can largely resolve the challenges of Setswana's disjunctive writing system.
Natural Language processing Parts of speech tagging, its classes, and how to ...Rajnish Raj
Part of speech (POS) tagging is the process of assigning a part of speech tag like noun, verb, adjective to each word in a sentence. It involves determining the most likely tag sequence given the probabilities of tags occurring before or after other tags, and words occurring with certain tags. POS tagging is the first step in many NLP applications and helps determine the grammatical role of words. It involves calculating bigram and lexical probabilities from annotated corpora to find the tag sequence with the highest joint probability.
Dependency Analysis of Abstract Universal Structures in Korean and EnglishJinho Choi
This thesis gives two contributions in the form of lexical resourcesto (1) dependency parsing in Korean and (2) semantic parsing in English. First,we describe our methodology for building three dependency treebanks in Korean derived from existing treebanks and pseudo-annotated according to the latest guidelines from the Universal Dependencies (UD). The original Google Korean UD Treebank is re-tokenized to ensure morpheme-level annotation consistency with other corpora while maintaining linguistic validity of the revised tokens. Phrase structure trees in the Penn Korean Treebank and the Kaist Treebank are automatically converted into UD dependency trees by applying head-percolation rules and linguistically motivated heuristics. A total of 38K+ dependency trees are generated.To the best of our knowledge, this is the first time that the three Korean treebanks are converted into UD dependency treebanks following the latest annotation guidelines. Second, we introduce an on-going project for augmenting the OntoNotes phrase structure treebank with semantic features found in the Abstract Meaning Representation (AMR), as part of an effort to build an accurate AMR parser. We propose a novel technique for AMR parsing that first trains a dependency parser on the OntoNotes corpus augmented with numbered arguments in the Proposition Bank (PropBank), and then does a transfer learning of the trained dependency parser for the AMR parsing task. A preliminary step is to prepare dependency data by performing an automatic replacement of dependencies that define predicate argument structure with their corresponding PropBank argument labels during constituent-to-dependency conversion. To the best of our knowledge, this is the first time that the PropBank labels are directly inserted into dependency structure, producing a new dependency corpus with rich syntactic information as well as semantic role information provided by PropBank that fully describes the predicate-argument structure, making it an ideal resource for AMR parsing and, broadly, semantic parsing.
This document provides an overview of the Minimalist Program (MP) proposed by Chomsky in 1993. It discusses the redundant and necessary levels of representation, including Logical Form and Phonetic Form. Principles like economy of derivation and economy of representation are explained. The document also covers topics like phrase structure, movements, feature checking, and the Full Interpretation Principle in MP. The conclusion states that MP aims to minimize theoretical concepts in syntax to achieve universality of grammar.
The document discusses constructive description logics and provides three options for constructing description logics constructively:
1) Translating description logic syntax into intuitionistic first-order logic (IFOL) to obtain the logic IALC.
2) Translating description logic syntax into intuitionistic modal logic (IK) to obtain the logic iALC.
3) Translating description logic syntax into constructive modal logic (CK) to obtain the logic cALC.
The talk outlines the translation approaches and discusses some pros and cons of the different constructive description logics, but notes that the work is preliminary and more criteria are needed to identify the best constructive system(s).
Finite-state morphological parsing uses finite-state transducers to parse words into their morphological components like stems and affixes. It requires a lexicon of stems and affixes, morphotactic rules describing valid morpheme combinations, and orthographic rules for spelling changes. The parser is built as a cascade of finite-state automata representing the lexicon, morphotactics and spelling rules. It maps surface word forms onto their underlying lexical representations including stems and morphological features. This allows morphological analysis of both regular and irregular forms.
Neural machine translation of rare words with subword unitsTae Hwan Jung
This paper proposes using subword units generated by byte-pair encoding (BPE) to address the open-vocabulary problem in neural machine translation. The paper finds that BPE segmentation outperforms a back-off dictionary baseline on two translation tasks, improving BLEU by up to 1.1 and CHRF by up to 1.3. BPE learns a joint encoding between source and target languages which increases consistency in segmentation compared to language-specific encodings, further improving translation of rare and unseen words.
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015RIILP
The document discusses using syntactic preordering models to delimit the morphosyntactic search space for machine translation of morphologically rich languages. It explores preordering dependency trees of the source language to reduce word order variations and predicting morphological attributes on the source side to inform target language word selection. Experimental results show that non-local features and jointly learning which attributes to predict can improve translation performance over baselines. The work aims to combine preordering and morphology prediction to better exploit interactions between syntactic structure and inflectional properties.
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMERijnlc
The document describes a deep BiGRU-CNN-CRF neural network model that jointly performs word segmentation, stemming, and named entity recognition for the Myanmar language. The model uses character-level and syllable-level representations as input. It was trained on a manually annotated Myanmar corpus. Evaluation results showed the neural sequence labeling architecture achieved state-of-the-art performance for the joint tasks of word segmentation, stemming, and named entity recognition in Myanmar text.
Natural Language Processing (NLP) involves developing computational techniques to analyze and understand human languages. Key techniques in NLP include sentiment analysis to classify emotions in text, text classification to categorize text, and tokenization to break text into discrete units like words. NLP is used to teach machines how to read and understand human language by identifying relationships between words and other linguistic elements.
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...gerogepatton
This document describes a study that used machine learning models to measure the complexity of pronouncing different languages. The study trained a character-level transformer model on a grapheme-to-phoneme transliteration task for 22 languages. It found that languages with a more direct grapheme-to-phoneme mapping, like Esperanto and Malay, were easier for the model to learn than languages with a more complex mapping, like Cantonese and Japanese. The complexity of a language's pronunciation was found to correlate with how direct or simple its grapheme-to-phoneme mapping is. The study also noted that comparing languages fairly requires considering differences in available training data per language.
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...IJITE
Machine learning models allow us to compare languages by showing how hard a task in each language might be to learn and perform well on. Following this line of investigation, we explore what makes a language “hard to pronounce” by modelling the task of grapheme-to-phoneme (g2p) transliteration. By training a character-level transformer model on this task across 22 languages and measuring the model’s proficiency against its grapheme and phoneme
inventories, we show that certain characteristics emerge that separate easier and harder
languages with respect to learning to pronounce. Namely the complexity of a language's
pronunciation from its orthography is due to the expressive or simplicity of its grapheme-tophoneme mapping. Further discussion illustrates how future studies should consider relative data sparsity per language to design fairer cross-lingual comparison tasks.
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...ijrap
Machine learning models allow us to compare languages by showing how hard a task in each language might be to learn and perform well on. Following this line of investigation, we explore what makes a language “hard to pronounce” by modelling the task of grapheme-to-phoneme (g2p) transliteration. By training a character-level transformer model on this task across 22 languages and measuring the model’s proficiency against its grapheme and phoneme inventories, we show that certain characteristics emerge that separate easier and harder languages with respect to learning to pronounce. Namely the complexity of a language's pronunciation from its orthography is due to the expressive or simplicity of its grapheme-to phoneme mapping. Further discussion illustrates how future studies should consider relative data sparsity per language to design fairer cross-lingual comparison tasks.
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMERkevig
Morphological stemming becomes a critical step toward natural language processing. The process of
stemming is to reduce alternative forms to a common morphological root. Word segmentation for
Myanmar Language, like for most Asian Languages, is an important task and extensively-studied
sequence labelling problem. Named entity detection is one of the issues in Asian Language that has
traditionally required a large amount of feature engineering to achieve high performance. The new
approach is integrating them that would benefit in all these processes. In recent years, end-to-end
sequence labelling models with deep learning are widely used. This paper introduces a deep BiGRUCNN-CRF network that jointly learns word segmentation, stemming and named entity recognition tasks.
We trained the model using manually annotated corpora. State-of-the-art named entity recognition
systems rely heavily on handcrafted feature built in our new approach, we introduce the joint model that
relies on two sources of information: character level representation and syllable level representation.
Part of speech tagging is one of the basic steps in natural language processing. Although it has been
investigated for many languages around the world, very little has been done for Setswana language.
Setswana language is written disjunctively and some words play multiple functions in a sentence. These
features make part of speech tagging more challenging. This paper presents a finite state method for
identifying one of the compound parts of speech, the relative. Results show an 82% identification rate
which is lower than for other languages. The results also show that the model can identify the start of a
relative 97% of the time but fail to identify where it stops 13% of the time. The model fails due to the
limitations of the morphological analyser and due to more complex sentences not accounted for in the
model.
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
call for paper 2012, hard copy of journal, research paper publishing, where to publish research paper,
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals
Cross lingual similarity discrimination with translation characteristicsijaia
This document summarizes a research paper on cross-lingual similarity discrimination using translation characteristics. The paper proposes a discriminative model trained on bilingual corpora to classify sentences in a target language as similar or dissimilar to a given sentence in a source language. Features used in the model include translation characteristics like sentence length ratios, word alignments, and polarity. The model is trained on various sampling methods to address the imbalanced data of having many more negative samples than positive translations. Experiments on 1500 English-Chinese sentence pairs show the model achieves satisfactory performance according to three evaluation metrics, outperforming a baseline system.
Annotated text corpora are an important resource for natural language processing research and technologies. Corpora can be annotated with linguistic information like parts of speech, morphology, syntax, and semantics through a layered approach. This involves manually or automatically tagging words, sentences, and texts with linguistic metadata. Well-annotated corpora are essential for tasks like morphological analysis, part-of-speech tagging, parsing, and machine translation model training.
PARSING ARABIC VERB PHRASES USING PREGROUP GRAMMARSijnlc
Parsing of Arabic phrases is a crucial requirement for many applications such as question answering and machine translation. The calculus of pregroup introduced by Lambek as an algebraic computational machinery for the grammatical analysis of natural languages. Pregroup grammar used to analyse sentence structure in many European languages such as English, and non-European languages such as Japanese. In Arabic language, Lambek employed the notions of pregroup to analyse some grammatical structures such as conjugating the verb, tense modifiers and equational sentences. This work attempts to develop an initial phase of an efficient automatic pregroup grammar parser by using linear approach to analysethe verbal phrases of Modern Standard Arabic (MSA). The proposed system starts building Arabic lexicon contains all possible categories of Arabic verbs, then analysing the input verbal Arabic phrase to check if it is wellformed by using linear parsing algorithm based on pregroup grammar rules.
Building of Database for English-Azerbaijani Machine Translation Expert SystemWaqas Tariq
In the article the results of development of machine translation expert system is presented. The approach of translation correspondences defining is suggested as a background for creation of data base and knowledge base of the system. Methods of transformation rules compiling applied for linguistic knowledge base of the expert system are based on the defining of translation correspondences between Azerbaijani and English languages.
In this paper, a novel hierarchical Persian Stemming approach based on the Part-Of-Speech (POS) of the
word in a sentence is presented. The implemented stemmer includes hash tables and several deterministic
finite automata (DFA) in its different levels of hierarchy for removing the prefixes and suffixes of the
words. We had two intentions in using hash tables in our method. The first one is that the DFA don’t
support some special words, so hash table can partly solve the addressed problem. And the second goal is
to speed up the implemented stemmer with omitting the time that DFA need. Because of the hierarchical
organization, this method is fast and flexible enough. Our experiments on test sets from Hamshahri
Collection and Security News from ICTna.ir Site show that our method has the average accuracy of
95.37% which is even improved in using the method on a test set with common topics.
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORMijnlc
This article describes the morphological analysis of a standard Arabic natural language processing, as a
part of an electronic dictionary-constricting phase. A fully 3-lettered inflected verbs model are formalized
based on a linguistic classification, using NOOJ platform, the classification gives certain representative
verbs that will considered as lemmas, this verbs form our dictionary entries, they are also conjugated
according to our inflection paradigm relying on certain specific morphological properties. This dictionary
will be considered as an Arabic resource, which will help NLP applications and NOOJ platform to analyse
sophisticated Arabic corpora.
This document provides an overview of basic probability concepts and statistical methods. It discusses probability as it relates to outcomes and events, and as the tool used in statistics to make inferences from samples. It then covers specific probability concepts like n-gram models, which use the previous n-1 words to predict the next word. The document also summarizes part-of-speech tagging methods, including rule-based, supervised stochastic, and unsupervised approaches. Freely available POS taggers for various languages are also listed.
G2 pil a grapheme to-phoneme conversion tool for the italian languageijnlc
This paper presents a knowledge-based approach for the grapheme to-phoneme conversion (G2P) of isolated words of the Italian language. With more than 7,000 languages in the world, the biggest challenge today is to rapidly port speech processing systems to new languages with low human effort and at reasonable cost. This includes the creation of qualified pronunciation dictionaries. The dictionaries provide the mapping from the orthographic form of a word to its pronunciation, which is useful in both speech synthesis and automatic speech recognition (ASR) systems. For training the acoustic models we need an automatic routine that maps the spelling of training set to a string of phonetic symbols representing the pronunciation.
The document describes a machine learning approach for language identification, named entity recognition, and transliteration on query words. It discusses:
1) Using supervised machine learning classifiers like random forest, decision trees, and SVMs along with contextual, character n-gram, and gazetteer features for language identification of Hindi-English and Bangla-English words.
2) Applying an IOB tagging scheme and features like character n-grams, context words, and typographic properties for named entity recognition and classification.
3) A statistical machine transliteration model that segments, aligns, and maps source and target language transliteration units based on context and probabilities learned from parallel training data.
This document discusses Lexical Functional Grammar (LFG) and Generalized Phrase Structure Grammar (GPSG). LFG was developed in the 1970s and emphasizes analyzing phenomena in lexical and functional terms. It uses two levels of structure: c-structure, which is a tree structure, and f-structure, which captures grammatical functions in an attribute-value matrix. GPSG was developed in 1985 and is confined to context-free phrase structure rules. It uses immediate dominance and linear precedence rules.
This document discusses different types of human knowledge and how language acquisition works. It begins by distinguishing between linguistic knowledge, which is knowledge about language, and non-linguistic knowledge. It notes that language has its own principles and rules that are different from other cognitive systems. Two approaches to language acquisition are described - one that views language as developing through general intellectual development, and one that argues language may require special genetic programming. The document then discusses how linguistic intuitions of native speakers can provide insights into their linguistic knowledge and syntactic judgments. It defines the difference between competence, a speaker's innate linguistic knowledge, and performance, their actual language use. Finally, it notes both universal aspects of language shared across languages but also areas of variation between
International Refereed Journal of Engineering and Science (IRJES)irjes
The core of the vision IRJES is to disseminate new knowledge and technology for the benefit of all, ranging from academic research and professional communities to industry professionals in a range of topics in computer science and engineering. It also provides a place for high-caliber researchers, practitioners and PhD students to present ongoing research and development in these areas.
Hidden markov model based part of speech tagger for sinhala languageijnlc
In this paper we present a fundamental lexical semantics of Sinhala language and a Hidden Markov Model (HMM) based Part of Speech (POS) Tagger for Sinhala language. In any Natural Language processing task, Part of Speech is a very vital topic, which involves analysing of the construction, behaviour and the dynamics of the language, which the knowledge could utilized in computational linguistics analysis and automation applications. Though Sinhala is a morphologically rich and agglutinative language, in which words are inflected with various grammatical features, tagging is very essential for further analysis of the language. Our research is based on statistical based approach, in which the tagging process is done by computing the tag sequence probability and the word-likelihood probability from the given corpus, where the linguistic knowledge is automatically extracted from the annotated corpus. The current tagger could reach more than 90% of accuracy for known words.
Similar to MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA (20)
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
How to Fix the Import Error in the Odoo 17Celine George
An import error occurs when a program fails to import a module or library, disrupting its execution. In languages like Python, this issue arises when the specified module cannot be found or accessed, hindering the program's functionality. Resolving import errors is crucial for maintaining smooth software operation and uninterrupted development processes.
This slide is special for master students (MIBS & MIFB) in UUM. Also useful for readers who are interested in the topic of contemporary Islamic banking.
বাংলাদেশের অর্থনৈতিক সমীক্ষা ২০২৪ [Bangladesh Economic Review 2024 Bangla.pdf] কম্পিউটার , ট্যাব ও স্মার্ট ফোন ভার্সন সহ সম্পূর্ণ বাংলা ই-বুক বা pdf বই " সুচিপত্র ...বুকমার্ক মেনু 🔖 ও হাইপার লিংক মেনু 📝👆 যুক্ত ..
আমাদের সবার জন্য খুব খুব গুরুত্বপূর্ণ একটি বই ..বিসিএস, ব্যাংক, ইউনিভার্সিটি ভর্তি ও যে কোন প্রতিযোগিতা মূলক পরীক্ষার জন্য এর খুব ইম্পরট্যান্ট একটি বিষয় ...তাছাড়া বাংলাদেশের সাম্প্রতিক যে কোন ডাটা বা তথ্য এই বইতে পাবেন ...
তাই একজন নাগরিক হিসাবে এই তথ্য গুলো আপনার জানা প্রয়োজন ...।
বিসিএস ও ব্যাংক এর লিখিত পরীক্ষা ...+এছাড়া মাধ্যমিক ও উচ্চমাধ্যমিকের স্টুডেন্টদের জন্য অনেক কাজে আসবে ...
A review of the growth of the Israel Genealogy Research Association Database Collection for the last 12 months. Our collection is now passed the 3 million mark and still growing. See which archives have contributed the most. See the different types of records we have, and which years have had records added. You can also see what we have for the future.
How to Manage Your Lost Opportunities in Odoo 17 CRMCeline George
Odoo 17 CRM allows us to track why we lose sales opportunities with "Lost Reasons." This helps analyze our sales process and identify areas for improvement. Here's how to configure lost reasons in Odoo 17 CRM
it describes the bony anatomy including the femoral head , acetabulum, labrum . also discusses the capsule , ligaments . muscle that act on the hip joint and the range of motion are outlined. factors affecting hip joint stability and weight transmission through the joint are summarized.
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA
1. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
DOI: 10.5121/ijnlc.2018.7203 29
MORPHOLOGICAL SEGMENTATION WITH LSTM
NEURAL NETWORKS FOR TIGRINYA
Yemane Tedla and Kazuhide Yamamoto
Natural Language Processing Lab, Nagaoka University of Technology,
Nagaoka City, Japan
ABSTRACT
Morphological segmentation is a fundamental task in language processing. Some languages, such as
Arabic and Tigrinya,have words packed with very rich morphological information.Therefore, unpacking
this information becomes a necessary taskfor many downstream natural language processing tasks. This
paper presents the first morphological segmentation research forTigrinya. We constructed a new
morphologically segmented corpus with 45,127 manually segmented tokens. Conditional random fields
(CRF) and window-based longshort-term memory (LSTM) neural networkswere employed separately to
develop our boundary detection models. We appliedlanguage-independent character and substring features
for the CRFand character embeddings for the LSTM networks. Experimentswere performed with four
variants of the Begin-Inside-Outside (BIO) chunk annotation scheme. We achieved 94.67% F1 scoreusing
bidirectional LSTMs with fixed-sizewindow approach to morphemeboundary detection.
KEYWORDS
Morphological segmentation, Tigrinyalanguage,LSTM neural networks, CRF
1. INTRODUCTION
Morphological processing at the word level is usually the initial step in the different stages of
natural language processing (NLP). Morphemes constitute the minimal meaning-bearing units in
a language [1]. In this paper, we focus on the task of detecting morphological boundaries, which
is alsoreferredto as morphological segmentation. This task involves the breaking down of words
into their component morphemes. For example, the English word “reads”can be segmented into
“read”and “s”, where “read”is the stem and “-s”is an inflectional morpheme, marking third person
singular verb.
Morphological segmentation is useful for several downstream NLP tasks, such as morphological
analysis, POS tagging, stemming, and lemmatization. Segmentation is also applied as an
important preprocessing phase in a number of systems including machine translation, information
retrieval, and speech recognition. Segmentation is mainly performed using rule-based approaches
or machine learning approaches. Rule-based approaches can be quite expensive and language-
dependent because the morphemes and all the affixation rules need to be identified to
disambiguate segmentation boundaries. Machine learning approach, on the other hand, is data-
driven wherein the underlying structure is automatically extracted from the data. In this paper, we
present supervised morphological segmentation based on CRFs [2] and LSTM neural networks
[3]. Since morphemes are sequences of characters, we address the problem as a sequence tagging
task and propose a fixed-size window approach for modeling contextual information of
characters. CRFs are well-suited for this kind of sequence aware classification tasks. We also
exploit the long-distance memory capabilities of LSTMs for modeling boundaries of morphemes.
Our main contributions are the following.
1. We constructed the first morphologically segmented corpus for Tigrinya. This corpus is
annotated with boundaries that identify prefix, stem, and suffix morphemes.
2. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
30
2. We present the first supervised morphological segmentation research for Tigrinya.
Extensive experiments were performed for different annotation schemes and learning
approaches exploiting character embeddings as the sole features for LSTMs. We also
compare our results with feature-rich CRF-based segmentation employing language-
agnostic character and substring features.
3. Tigrinya is an understudied language. Therefore, through this fundamental research, we
hope to contribute in bettering the understanding of the properties and language processing
challenges of the language and encourage further research.
This paper is organized as follows. In section 2, the Tigrinya morphology will be briefly
introduced. In section 3, relevant previous works will be discussed. Section 4 describes the CRF
and LSTM-based methods employed in this research. In the sections that follow, the experimental
settings will be explained and the results discussed. Finally, concluding remarks will be provided.
2. TIGRINYA LANGUAGE
2.1. WRITING SYSTEM
Tigrinya is one of the few African languages that still use an indigenous writing system for
education and daily communication. The writing system, known as the Ge’ez script, is adopted
from the ancient Ge’ez language, which is currently used as a liturgical language. The Ge’ez
script is an abugida system in which each letter (alphabet) represents a consonant-vowel (CV)
syllable. The Tigrinya alphabet chart, known as “Fidel”, comprises of about 275symbols.
Gemination is not explicit in the Ge’ez script; however, this limitation does not seem to pose a
problem for native speakers.In this paper, Tigrinya words are transliterated to Latin characters
according to the SERA scheme with the addition of “I”for the explicit marking of the epenthetic
vowel known as “SadIsI”.The SERA transliteration scheme is available at
ftp://ftp.geez.org/pub/sera-docs/sera-faq.txt. We directly apply labeling to the Latin
transliterations of Tigrinya words and not the Ge’ez script. The Ge’ez script is syllabic, and, in
many cases, the boundary has fusional properties resulting in alterations of characters at the
boundary. For example, the word “sebere”(He broke) would be segmented as “seber”+
“e”because the morpheme “e”represents grammatical features. However, this morpheme cannot
be isolated from the Ge’ez script because the last characters “re”forms a single symbol in the
Ge’ez script and segmenting “sebere”as “sebe”+ “re”is not a correct analysis.
2.2. TIGRINYA MORPHOLOGY
Tigrinya is a Semitic language spoken by over 7 million people in Eritrea and Ethiopia. The
morphology of Semitic languages, known as “root-and-pattern”morphology, has distinct non-
concatenative properties that intercalate consonantal roots and vowel patterns [4]. For example, in
Tigrinya, the words “sebere”(he broke) and “sebira”(she broke) share a common tri-consonantal
root or radical “s-b-r”but have different sequences of vowel patterns (“e-e-e”and “e-i-a”) that
are inserted in-between the radicals. In addition to such a unique infixation, words are formed by
affixing morphemes of prefix, suffix, as well as circumfix. These morphemes represent
morphological features including gender, person, number, tense, aspect, mood, voice, and so on.
Furthermore, there are clitics of mostly prepositions and conjunctions that can be affixed in other
words. These components are arranged in the following manner;
(proclitics)(prefix/circumfix)(root-with-infix)(circumfix/suffixes)(enclitics)
Specifically, the order of morpheme slots is defined by [5] as follows.
(prep|conj)(rel)(neg)sbjSTEMsbj(obj)(neg)(conj)
3. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
31
The slots “prep”and “conj”are affixes of prepositions or conjunctions attached before or after the
word. The “rel”indicates a relativizer (the prefix “zI”) corresponding in function to the English
demonstratives like that, which, and who. The “sbj”on either side ofthe STEM are a prefix and/or
suffix of the four verb types namely; perfective, imperfective, imperative, and gerundive. As
shown in examples 1-4, the perfective and gerundive verbs conjugate only on suffixes (examples
1 and 4) while imperfective verbs undergo both the prefix and suffix inflections (example 2). The
imperatives show the suffix only conjugations or change prefix as well (example 3). In addition to
the verb type, these fusional morphemes convey gender, person, and number information.
1. seber + u (they broke)
2. yI + sebIr + u (they break)
3. yI + sIber + u (break/let them break)
4.sebir + omI(they broke)
Moreover, Tigrinya independent pronouns have a pronominal suffix (SUF) of gender, person, and
number as shown in examples 5 and 6.
5.nIsu (he) →nIs + u/SUF
6. nIsa(she) →nIs + a/SUF
The word order typology is normally subject-object-verb (SOV), though there are cases in which
this sequence may not apply strictly [6, 7]. Changes in “sbj”verb affixes, along with pronoun
inflections, enforce subject-verb agreements. One aspect of the non-concatenative morphology in
Tigrinya is the circumfixing of negation morpheme in the structure “ayI-STEM-nI”[5]. In
addition to the slots defined by [5], some enclitics such as question particles“do; ke;Ke”can also
be found in Tigrinya orthography as free or bound suffix morphemes. For example,
KeyIdudo?→keyId + u + do (did he go?)
nIsuKe? →nIs + u + Ke (what about him?)
The pronominal object marker “obj”is always suffixed to the verbas shown in examples 7-10.
According to [6], Tigrinyasuffixes of object pronoun can be categorized into two constructs.The
first is described in relation to verbs (examples 7, 8 and 9) and another indicates the semantic role
of applicative cases by inflecting for prepositional enclitic “lI”+ a pronominal suffix as in
example 10.
7. beliOI + wo/obj (he ate [something])
8. hibu + wo /obj (he gave [something] to him)
9. hibu + ni /obj(he gave me [something])
10. beliOI + lu /obj (he ate for him/he ate using [it])
Tigrinya words are also generated by derivational morphology. There are up to eight derivational
categories that can originate from a single verb [8]. For example, the passive form of perfective
(example11) and gerundive verbs (example 12) is constructed by prefixing “te”to the main verb.
Furthermore, adverbs can be derived from nouns by prefixing “bI-”which has similar
functionality as the English “-ly”suffix (example 13).
11. zekere (he remembered) →te + zekere (he was remembered)
12. zekiru(he remembered) →te + zekiru (he was remembered)
13. bIHaqi(truly)→bI + Haqi
In this work, we deal with boundaries of prefix, stem, and suffix morphemes. Inflections related
to infixes (internal stem changes) are not feasible for this type of segmentation.
4. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
32
In conclusion, all the morphological and derivational processes generate a large number of
complex word forms. According to [8], Tigrinya verb inflections, derivations, and combinations
of the slot order can produce more than 100,000 word forms from a single verb root. The
ambiguity and difficulty in relation to segmentation are briefly discussed in the following section.
2.3. MORPHOLOGICAL AMBIGUITY
In segmentation, ambiguity may occur at the word-level or due to sentential context. In Tigrinya,
a major source of ambiguity is when certain character sequences of morphemes are natively
present as part of words. In this case, the characters do not represent grammatical features and
hence segmentation should not be applied.Consider the words “bIrIhanI”(light) and
“bIHayli”(by force). The same prefix “bI-” which is an inseparable part of the noun “bIrIhanI”,
represents an adverb of manner (by) in the second word. Moreover, morphemes may also appear
as constituents of other morphemes. For instance, the noun suffix “-netI”(example 16) contains
the sub-morph “-etI”that can have the role of a suffix for conjugation of third person, feminine,
singular attributes as in example15. Note that “-netI”in example 14 is not a morpheme.
14. genetI→genetI/NOUN_paradise
15. wesenetI→wesen/STEM_decided + etI/SUF_she
16. naxInetI→naxI/STEM_independence + netI/NOUN-SUF
Moreover, the lack of gemination marking may introduce segmentation ambiguity. For example,
the word “medere”can be interpreted as the noun “speech”or the phrase “he gave a speech”if the
“de”in “medere”is geminated. A computational system must discern both cases such that the
phrase is segmented while the noun is left intact. Resolving such cases may require more context
or additional linguistic information such as part-of-speech. In this work, we would like to avoid
resorting to any language-specific knowledge. Therefore, in the LSTM approach, we use
character embeddings to capture contextual dependencies of characters.
As explained earlier, Tigrinya words have multiple consecutive morpheme slots. This pattern
causes under-segmentation confusion due to the numerous intermediate splits comprising atomic
and composite morphemes. Table 1 lists some of the possible segmentations for the word token
“InItezeyIHatetIkayomI”(if you did not ask them).
Furthermore, in Tigrinya, compound words are often written attached. For example, “betI(house)
megIbi(food)”collectively translates to “restaurant”in English. In the orthography, these words
can be found either separate or attached. We queried for “betI-” starting words in a text corpus
containing about 5 million words extracted from the HaddasErtra newspaper,which is published
in Eritrea. The previous word, for instance, occurs delimited by a space in about 95% of the
response. Another compound word “betISIHIfetI” (office) was found attached in about 24% of
the response, which is not a small portion. Although the words arenot grammatical morphemes,
we believe segmenting (normalizing) these words would be useful for practical reasons such as
mitigating data sparseness.
Table 1. Examples of intermediate splits caused by under-segmentation. The italicized second row is the
expected segmentation.
InItezeyIHatetIkayomI
InIte-zeyI-HatetI-ka-yomI
InIte-zeyIHatetIkayomI
InIte-zeyI-HatetIkayomI
InIte-zeyI-HatetI-kayomI
InItezeyIHatetIka-yomI
InItezeyIHatetI-kayomI
InItezeyI-HatetIkayomI
5. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
33
3. RELATED WORKS
The work of [1] introduced the unsupervised discovery of morphemes based on minimum
description length (MDL) and likelihood optimization. This method lacks the handling of
representing contextual dependencies, such as stem and affix orders. Although several other
unsupervised segmentation approaches have been proposed, it was shown by [9] that minimally
supervised approaches provided better performance compared to solely unsupervised methods
applied on large unlabeled datasets. For example, unsupervised experiments for Estonian, that
achieved 73.3% F1 score with 3.9 million words, was outperformed by a supervised CRF that
attained 82.1% with just 1000 word forms. This work also showed that semi-supervised
approaches that use both annotated and unannotated data can be leveraged to improve upon
simply using completely unsupervised methods. In related works for Semitic languages of
Hebrew and Arabic, [10] uses a probabilistic model where segmentation and morpheme
alignmentsareinferred from the shared structure between both languages using parallel corpus
with and without annotation. Recent works on neural models rely on the use of some form of
embedding for extracting relevant features. [11] demonstrated a generic window-based neural
architecture that is applied to several NLP tasks while avoiding explicit use of feature
engineering. Their system trains on large unlabeled data to learn internal representations. We
have also adopted a similar window-based approach although with the input of character
sequence window instead of words as in [11]. In [12], state-of-the-art results were achieved with a
neural model that learns POS related features only from character-level and sub-word
embeddings. Other research, however, argue that enriching embeddings with additional
morphological information boost performance. For example, [13] demonstrates this by using the
results of a morphological analyzer to further improve candidate ranking in a morphological
disambiguation task for Arabic. In a research for Burmese word segmentation, [14] address the
problem by employing binary classification with classifiers such as CRFs. The tagset restriction
to the binary was mainly due to data size. Our data is similarly small size corpus;however, we
report experiments with several schemes to investigate the effect of using simple to more
expressive tags in the Tigrinya morpheme segmentation.
Amharic and Tigrinya are closely related languages. These languages share a number of
grammatical features and vocabularies. [16]presentedmorphological rule learning and
segmentation based on inductive logic programming where rules or affixes were learned by
exposing easy-to-complex examples incrementally to an intelligent teacher. Their system for affix
segmentation achieved a performance of 0.94 precision and 0.97 recall measures. Tigrinya
remains an under-studied and under-resourced language from the NLP perspective. However, as
regards to morphological processing, [8] employed finite state transducers (FSTs) to develop,
“HornMorpho”, a morphological analysis and generation system for Tigrinya, Amharic, and
Oromo languages. The FST empowered by feature structures was effectively adopted to process
the unusual non-concatenative root-and-pattern morphology. The Tigrinya module of
HornMorph2.5 performs the full analysis of Tigrinya verb classes [5]. The system was tested on
randomly selected 400 word forms with 329 of them being non-ambiguous. The analysis
revealedremarkably accurate results with a few errors due to unhandled FSTs, which can be
integrated. Our approach is different from HornMorph in at least two aspects. First, our task is
limited to identifying morphological boundaries. The results amount to partially analyzed
segments although these segments are not explicitly annotated for grammatical feature. The
annotation could be pursued with further processing of the output or training on morphologically
annotated data, which is currently missing for Tigrinya. Second, the use of the FSTs relies heavily
on the linguistic knowledge of the language in question. This would require time-consuming
manual construction of language-specific rules, which is more challenging for root-and-pattern
morphology. In contrast, we follow a data-driven (machine learning) approach to automatically
extract features of the language from a relatively small boundary annotated data. Moreover, on
the limitations of HornMorph, [5] notedthat analysis incurs a “considerable time”to exhaust all
6. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
34
options before the system responds. Besides HornMorph, there was an attempt to use affix based
shallow segmentation in the pre-processing phase for improving word alignment in the English-
Tigrinyastatistical machine translation [16].
4. METHOD
4.1. MORPHOLOGICALLY SEGMENTED CORPUS
There is no publicly available morphologically segmented resource for Tigrinya. Therefore, we
based our studies on a new morphologically segmented corpus developed in-house. The first
version of this corpus comprisesover 45,000 tokens derived from randomly selected 2774
sentencesofthe part-of-speech tagged Nagaoka Tigrinya Corpus (NTC) available
athttp://eng.jnlp.org/yemane/ntigcorpus.
For the purpose of boundary detection, we employed character-based BIO chunking scheme,
which allows us to exploit character dependencies and alleviate out-of-vocabulary (OOV)
problems by reducing the morpheme vocabulary to about 60 Latin characters that cover the
transliteration mapping we adopted.
4.2. TAGGING SCHEMES
The popular IOB tagging scheme is used to annotate the Beginning (B), Inside (I), and outside
(O) of chunks in tasks such as base phrase chunking and named entity recognition. Similarly, we
address morpheme boundary detection by annotating every character in morpheme chunks with
the appropriate IOB label. There are different variants of the IOB scheme including BIO, IOB,
BIE, IOBES, and so on. There is no general consensus as to which variant performs best. [11]
usedthe IOBES format as it encodedmore information whereas extensive evaluations by [17]
showed that the BIO has superior results compared to the IOB scheme. Furthermore, the BIES
scheme gave better results for Japanese word segmentation [18].
Table 2.Annotating with differenttagging schemes
Scheme
tImaliInItezeyIHatete
(if he did not ask yesterday)
Word t I m a l i I n I t e z e y I H a t e t e
BIE B I I I I E B I I I E B I I E B I I I I E
BIES B I I I I E B I I I E B I I E B I I I I S
BIO O O O O O O B I I I I B I I I B I I I I I
BIOES O O O O O O B I I I E B I I E B I I I E S
We experimented with four schemes, namely, BIE, BIES, BIO, and BIOES to explore the
modeling of morpheme segmentation in a low-resource setting with morphologically rich
language Tigrinya. In our case, the B, I, and E tags marks Begin, Inside, and End of multi-
character morphemes. The S tag annotates single character morpheme and the O tag assigns
character sequences outside of morpheme chunks. An example of using these annotations is
presented in Table 2.
4.3.CHARACTER EMBEDDINGS
Morphological segmentation is primarily character-level analysis. For example, in Tigrinya, the
characters “zI”have a grammatical role when used as a relativizer that is prefixed only to the
perfective (example 17) or imperfective verbs (example 18). Therefore, all other occurrences of
“zI”, such as the noun “zInabI”(rain) should not be segmented. Consequently, recognizing the
shape of relativized perfectives and imperfectives becomes crucial.
7. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
35
17. zIsebere‘that broke’ →zI + sebere/PERFECTIVE
18. zIsebIrI‘that breaks’ →→→→zI + sebIrI/IMPERFECTIVE
19. *zI + yIsIberI/IMPERATIVE – invalid
20. zInabI‘rain’ →→→→*zI + nabI- incorrect segmentation
Character-level information must be extracted to learn such informative features useful for
identifying boundaries. We generated past and future fixed-width character context for each
character. The concatenated contextual characters form a single feature vector for the central
target character. For example, the features of the word “selamI”(peace, hello) for each character
are generated as depicted in Table 3. We also showed the corresponding label of the central
character in the BIE scheme. The Boldface character represents the central character with its left
(past) and right (future) context of width five characters. Characters are padded with underscores
(_) to complete the remaining slots depending on the window size.
Table 3. An example of generated fixed window character sequences with assigned label
Window Label
_ _ _ _ _ s e l a m I B
_ _ _ _ s e l a m I _ I
_ _ _ s e l a m I _ _ I
_ _ s e l a m I _ _ _ I
_ s e l a m I _ _ _ _ I
s e l a m I _ _ _ _ _ E
We settledfor a window size of five as increasing it beyond five did not result in significant
improvements. Moreover, the dimension of optimal character embedding was decided by hyper-
parameter tuning experiments. We initialized the embedding layer from a lookup table for integer
(index) representations of the fixed-width character vectors. The embedding is then fed to an
LSTM after passing through a dropout layer.
4.4. CRF
CRFs are probabilistic approaches capable of modeling context-dependent sequence labeling [2].
In a morphological segmentation, the model is trained to predict a sequence of output tags (IOB
labels) = , , … , from a sequence of feature characters = , , … , . The training
task is to maximize the log probabilitylog ( ( | )) of the valid label sequence. The conditional
probability is computed as:
( | ) =
( , )
∑ ( , ) (1)
The equation for the scoring function score is given as:
( , ) = , + ! ,
"
#
"
#$
(2)
%&',&' (
denotes the emission probability of the state change from label ) to label *while +,,- is the
transition probability denoting the score of the*./
label of the)./
word.
8. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
36
4.5. LSTM NEURAL NETWORK
Recurrent Neural Networks (RNNs) are feed forward neural networks with feedback cycles to
capture time dynamics using back-propagationthrough time. This recurrent connection allows
thenetwork to employ the current inputs as well as previously encountered states. Given the input
vector 0., ( in our case, a window offive characters left and right of the target character), the
hiddenstate ℎ. at each time step 2, can be expressed by equation 4.
0. = 3 .45, .4(54 ), … , ., … , .6(54), .657 (3)
ℎ. = 8(9:/ . + 9//ℎ.4 + ;/) (4)
where the 9terms represent the weight matrices, ; is a bias vector and 8 is the activation
function. However, due to the gradient vanishing (very small weight changes) or exploding
problems (very large weight changes), long distance dependencies are not properly propagated in
RNNs. [3] introduced LSTMs to overcome this gradient updating problem. The neurons of
LSTMs called memory blocks are composed of three gates (forget, input and output) that control
the flow of information and a memory cell with self-recurrent connection. The formulae of these
four components are given in equations 5 to 9.In these equations, the input, forget, output and cell
activation vectors are denoted by ), <, =and > respectively; 8 and ? represent sigmoid and tanh
functions respectively; ⨀operation is the Hadamard product; 9 stands for the weight matrices
and ; is the bias.
). = 8(9,: . + 9,/ℎ.4 + ;,)
<. = 8A9B: . + 9B/ℎ.4 + ;BC
>5 = ?(9D: . + 9D/ℎ.4 + ;D)
>. = <. ⨀ >.4 + ). ⨀ >5
=. = 8(9E: . + 9E/ℎ.4 + ;E)
ℎ. = =. ⨀ ? (>. )
(5)
(6)
(7)
(8)
(9)
(10)
The LSTM architecture we propose to use is illustrated in Figure 1. Character-level features
(embeddings) generated by the embedding layer are fed to the forward LSTM network. Dropout
layers are adjusted before and after the LSTM layer to regularize the model and prevent
overfitting. The output of the network is then processed by a fully connected (Dense) layer and
finally, tag probability distributions over all candidates are computed via the softmax classifier.
Figure1: The LSTM network
9. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
37
Due to the capacity to work well with long-distance dependencies, LSTMs have achieved state-
of-the-art performance in window-based approaches as well as sequence labeling tasks including
morphological segmentation [19], part-of-speech [20] and named entity recognition [21].
Figure2: The BiLSTM neural network
4.6. BIDIRECTIONAL LSTM
LSTMs encode the input representation in the forward pass. In Bidirectional LSTM network
(BiLSTM),the input vector is processed in both forward and backward passesand presented
separately to hidden states. The mechanism is usefulin encoding past (forward) and future
(backward) information ofthe input vector. We illustrate the architecture of the BiLSTM
wepropose to use in Figure 2. Both the forward and backward layeroutputs are calculated by
using the standard LSTM updating equations5 to 9. Accordingly, for every time step 2 the
forward andbackward networks take the layer input 0. and then outputℎ.. The celltakes the input
state >5and the cell output >.as well as the previous cell output =.for training and updating
parameters.Like in LSTMs, the output of each network in BiLSTM canbe expressed as:
. = 8 A9/&ℎ. + ;&C (11)
where9 is the weight matrix, ; is the bias vector of the output layer and 8 is the activation
function of the output layer. Boththe forward and backward networks operate on the input state of
0.and generate the forward outputℎ.
FFFG and the backward outputℎ.
HFFF. Thenthe final output, . from
both networks is combined using operationssuch as multiplication, summation or simple
concatenation asgiven in equation 12. In our case, 8 is a concatenating function.
. = 8(ℎ.
FFFG, ℎ.
HFFF) (12)
This context-dependent representation from both networks ispassed along to the fully-connected
hidden layer (Dense) whichthen propagates it to the output layer.
Embedding
Backward LSTM
Forward LSTM
||
Dense
Softmax
_ _ s e l a m I _ _ _
B | I | O
10. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
38
5. EXPERIMENTS
In this section, the used data and the experimental settings are reported. Initially, we set training
epochs to hundred and configured early stopping if training continuedwithout loss improvement
for 15 consecutive epochs. We usedKeras (https://www.keras.io/) to develop and
Hyperas(https://github.com/maxpumperla/hyperas) to tune our deep neural networks. Hyperas, in
turn,applieshyperopt for optimization. The algorithm used is Tree-structured Parzen Estimator
approach (TPE) over five evaluation trails (https://papers.nips.cc/paper/4443-algorithms-for-
hyperparameter-optimization.pdf).
5.1. SETTINGS
Datasets. The text used in this research is extracted from the NTC corpus. The NTC consists of
around 72,000 POS tagged tokens collected from newspaper articles. Our corpus contains 45,127
tokens, of which 13,336 tokens are unique words. Training is performed with ten-fold cross-
validation, where about 10% of the data in every training iteration is used for development. We
further split the development data into two equal halves allotting one set for validation during
training and the other half for testing the model’s skill.
Table 4. Training data splits as per the count of the fixed-width character windows (vectors) used as the
actual input sequences
Data size
All Training Validation Test
100% 90% 5% 5%
289,158 260,242 14,458 14,458
The evaluation reports in this paper are based on the final test set results. For every target
character in a word, a fixed-width character window is generated as shown in Table 3. The data
statistics according to the generated character windows is presented in Table 4.
Hyper-parameters.
CRF: We performed preliminary experiments for deciding the regularization algorithm (L1 or
L2) andthe C-parameter values of [0.5, 1, 1.5]. Consequently, L1 regularization and C value of
1.5 were selected as these values gave better results. The character and substring features we
considered spanned a window size of five. The full list of the character features is given as
follows.
a. Left context: -5 to 0, -4 to 0, -3 to 0, -2 to 0, -1 to 0
b. Right context: 0 to 5, 0 to 4, 0 to 3, 0 to 2, 0 to 1
c. Left+Right context: a + b
d. N-grams: bi-grams
LSTM: In order to search for the parameters that yield optimal performance, we explored hyper
parameters that include embedding size, batch size, dropouts, hidden layer neurons, and
optimizers. The complete list of the selected parameters for LSTM is summarized in Table 5. We
achieved similar results for the BIE and BIES tunings as shown in the table. However, the BIO
and BIOES schemes showeddifference in the tuning results and, therefore, these were used
separately.
Embeddings: We tested several embedding dimensions from the set {50, 60, 100, 150, 200, 250}.
Separate runs were made for all types of tag sets.
11. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
39
Table 5. LSTM Hyperparameters selected by tuning
Parameter BIE/BIES BIO BIOES
Window 5 5 5
Character embedding 150 150 250
hidden layer size 32 128 128
Batch size 32 256 256
optimizer adam adam RMSProp
Dropouts: Randomly selecting and dropping-out nodes have proven effective at mitigating
overfitting and regularizing the model [22]. We applied dropout on the character embedding as
well as on the inputs and outputs of the LSTM/BiLSTM layer. For example, the selected dropouts
for the BIO-based tunings are 0.07 and 0.5 for the embedding and LSTM output layers
respectively. The dropout probabilities are selected from a uniform distribution over the interval
[0, 0.6].
Batch size: We ran tuning for the batch size of the set {16, 32, 64, 128, 256}.
Hidden layer size:The hidden layer’s size is searched from the set {64, 128, 256}.
Optimizers and learning rate: We investigated the stochastic gradient descent(SGD) with
stepwise learning rate and other more sophisticated algorithms such as AdaDelta[23], Adam [24],
and RMSProp[25]. The SGD learning rate was initialized to 0.1, a momentum of 0.9 with rate
updates for every 10 epochs at a drop rate of 0.5. However, the SGD setting did not result in
significant gains compared to the automatic gradient update methods.
5.2. BASELINE
The baseline considered in this study is the CRF model trained on only character features with a
window size of five. The window size is kept the same with the neural LSTMs to compare the
models subject to the same information. The baseline experiments for the CRF achieved an F1
score of around 60% and 63% using the BIOES and BIO tags respectively. As for the BIE tags,
the performance reachedabout 75.7%. All the subsequent CRF-based experiments employed two
auxiliary features, which were the contextual and n-gram characters listed in section 5.1. As
depicted in Table 6, significant performance enhancements were achieved with the auxiliary
features. However, compared to the LSTM based experiments, the window five based CRF
results were still suboptimal.
5.3. EVALUATION
We report boundary precision(+), boundary recall (I) and boundary F1 scores (J1) as given by
equation 13 to 15. Precision evaluates the percentageof correctly predicted boundaries with
respect to the predictedboundaries and recall measures the percentage of correctlypredicted
boundaries with respect to the true boundaries. F1 scoreis the harmonic mean of precision and
recall and can be interpretedas their weighted average.
+ =
>=LLM>2N LMO)>2MO ;=PQORL)MS
LMO)>2MO ;=PQORL)MS
I =
>=LLM>2N LMO)>2MO ;=PQORL)MS
2LPM ;=PQORL)MS
J1 =
2 ∗ ( LM>)S)=Q ∗ LM>RNN)
LM>)S)=Q + LM>RNN
(13)
(14)
(15)
The correct predictions are the count of true positives while the actual (true) boundaries are these
true positives added with the false negatives. The system proposed predictions can be found from
12. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
40
the sum of the true positives and the false positives.The number of I or O classes is found more
than double of theB classes. Therefore, to account for any class imbalance, we computedweighted
macro-averages, in which each class contributionis weighted by the number of samples of that
class.
The macro-averaged precision and recall are calculated by takingthe average of precision and
recall of the individual classes asshown in equations 16 and 17, where Q denotes total number
ofclasses.
VR>L= − RXM. + =
+(>NRSS 1) + +(>NRSS 2) + … + +(>NRSS Q)
Q
VR>L= − RXM. I =
I(>NRSS 1) + I(>NRSS 2) + … + I(>NRSS Q)
Q
(16)
(17)
6. RESULTS
We experimented and compared the performance of three models trained with four different
tagging strategies as explained earlier. The resultsof ten-fold cross-validation are summarized in
Table 6 with P, R,F1 representing precision, recall, and F1 score respectively.
Table 6.Results of CRF, LSTM and BiLSTM experiments with four BIO schemes.
Tagset CRF LSTM BiLSTM
P R F1 P R F1 P R F1
BIE 92.62 92.62 92.62 94.38 94.37 94.37 94.68 94.68 94.67
BIES 92.44 92.44 92.44 94.22 94.20 94.20 94.60 94.59 94.59
BIO 84.88 84.88 84.88 90.91 89.96 89.96 90.16 90.11 90.11
BIOES 83.60 83.60 83.60 88.26 88.21 88.21 88.45 88.39 88.39
Generally, we observed the choice of segmentation schemes affecting the model performance.
Overall, using the BIE scheme resulted in the best performance in all tests. Although the BIOES
tagset is the most expressive, since the corpus is rather small, the simpler tagsets showed better
generalization over the dataset.The BIE-CRF model achieved 92.62% in the F1 score while the
performance of the BIO and BIOES-based CRFsfellto about 84.88% and 83.6% respectively. The
CRF model using the BIO scheme outperformed the CRF model using the BIOES by 1.28%
absolute change. A majority of the errors are associated with under-segmentation and confusion
between I (inside) and O (outside) tags. Masking these differences by dropping the O tag has
proved useful in our experiments. As a result, the CRF model using the BIE scheme outperformed
the BIO-based CRF by around 7.7percentage point. We compared the regular LSTM with its
bidirectional extension. The overall results showed that the BiLSTMs performed slightly better
than the regular LSTMs sharing the same scheme. In the window-based approach, the past and
future context is partly encoded in the feature vector. This window was fed to the regular LSTM
at training time making the information available for both networks quite similar. This may be the
reason for not seeing much variation between the two models. Nevertheless, the additional design
in BiLSTMs to reverse-process the input sequence allows the network to learn more detailed
structures. Therefore, we sawslightly improved results with the BiLSTM network over the regular
LSTM. As with CRFs, a significant increase in the F1 score was achieved when changing from
the BIOES to the BIE scheme. In both LSTM models, we observed again in performance of over
6 percentage point. Overall, in these low-resource experiments, the model generalizedbetter when
the data was tagged with the BIE scheme as this reducedthe model complexity introduced by the
O tags. The BiLSTM model using the BIE scheme performed superior to all others scoring
94.67% in the F1. This result was achieved by employing only character embeddings to extract
morphological features. On the other hand, the CRF model which useda rich set of character + bi-
13. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
41
gram + substring features scored around 92.62%. For a fair comparison, we used features of the
CRF model that spanned the same window of characters as the LSTM models. In other words, we
avoided heavy linguistic engineering on the CRF side. Although the features were not
linguistically motivated, achieving an optimal result this way still requires many trials and better
design of features. It is to be noted that additional hand-crafted features and wider windows for
the CRF model may produce better results than what has beenreported. However, from the LSTM
results, we sawthat forgoing feature engineering and extracting features using character
embeddings couldsufficiently encode the morphological information desired for high-
performance boundary detection.
While the results may not be directly comparable due to the difference in settings and evaluation,
two related previous works of[26] and [19] on the Semitic languages of Arabic and Hebrew
arereferred for comparison. We adopted similar LSTM-based fixed-window approach presented
in [19] that investigated Arabic andHebrew morphological segmentation with the S & B dataset
containing6,192 word types [27]. The BiLSTM-based best results reportedare an F1 score of
95.2% on Hebrew and 95.5% on Arabic. [19] employedmore advanced multi-window features
and an ensemble of 10modelsbecause the performance of character window features was not
satisfactory for the small data size used. In comparison,we achieved similar results (see Table 6)
with no further featureengineering other than simple character window embeddings,however, the
dataset available to us is larger with 13, 336 wordtypes.
Prior to the work by [19], the state-of-the-art has been a CRF-based morpheme segmentation
relying on hand-crafted feature setsthat include all left and right substrings of a target character
andan additional bias component [26]. Training on the same S and B dataset, [26] achieved F1
score of 94.5% and 97.8% on Hebrewand Arabic respectively. We notice that the LSTM-based
workby [19] on Arabic dataset, does not outperform the CRF-based method of [26]. We achieved
around 92.62% F1 score with CRF-basedTigrinya experiment that was trained on the feature set
listedin section 5.1 with constraints to span only five left and right charactersmatching the
information exposed to our LSTM settings.
6.1. ERROR ANALYSIS
The models using the BIE/BIES or the BIO/BIOES tagsets achieved comparable results. We
analyzed the effect of the tagset choice by comparingthe BiLSTM models of the BIE and BIO
tagging schemes. For easiercomparison, the confusion matrices (in percentage) of both
experiments are presented jointly in Table 7. The values are paired in X/Yformat where X is a
value from the set {B, I, E}in the BIE schemeand Y is one of {B, I, O}in the BIO scheme. As
mentioned earlier, the number of Ior O classes is about two-fold greater than the B class.
Therefore, there is more confusion stemming from those tags. The B tag represents the morpheme
segmentation boundaries; therefore, the under-segmentation and over-segmentation errors can be
explained in relation to the B tags. Looking at the BIE figures, around 91% of thepredictions for
the B tags are correct. In other words, 9% of the true Btags were confused for the I and O tags,
which together amounts to theunder-segmentation errors. These types of errors are almost
doubledwith the use of the BIO scheme. On the other hand, the over-segmentationerrors are about
5.9% and 7.4% in the BIE and BIO scheme respectively. Over-segmentation occurs when true I,O,
or E tags are confused with the B tag. In this case, morpheme boundaries are predicted while the
characters actually belong to the inside, outside, or end of the morpheme. These results show that
the under-segmentation errors contribute more compared to the over-segmentation errors.
In the BIE scheme results, we observe that the I and B tags are correctly predicted with relatively
higher accuracies than with the BIO scheme. Specifically, the prediction accuracy of the B tag
(boundaries) increased from 82.6% to 91.2% (8.6% absolute change). The main reason is the
considerable reduction of errors introduced due to the confusions of tag I for O or vice-versa in
the BIO scheme. This comparison suggests that improved results can be achieved by denoising
14. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
42
the data from the O tag. Therefore, for a low-resource scenario of boundary detection, starting
with the BIE tags may be a better choice for improving model generalization compared to more
complex schemes that are feasible for larger datasets.
Table 7.BiLSTM Confusion matrix comparison for BIE (boldfaced) and BIO schemes in %. Rows and
columns represent predicted and true values respectively.
Predicted
B/B I/I E/O
True
B/B 91.20/82.60 4.54/9.94 4.26/7.46
I/I 2.83/3.21 93.84/90.74 3.34/6.05
E/O 3.09/4.19 9.65/10.83 87.26/84.98
The following sample sentence extracted from the Bible is segmented with all four models. Note
that the training corpus does not include text from the Bible.
Tigrinya: “amIlaKI kea mIrIayuzEbIhIgImIbIlaOuzITIOumIkWluomIabImIdIriabIqWele”
English: “And out of the earth the Lord made every tree to come, delighting the eye and good for
food”.
The expected segmentation is given under the column labeled “True”. We observe that the BIE
model segmentation is the nearest to the reference segmentation. The other models showed under-
segmentation errors for the word “ZEbIhIgI”and “kWlu”. The verbal noun prefix “mI-”is
correctly segmented by all models whereas the models failed to recognize the boundary of the
causative prefix “a-”in the last word “abIqWele”.
Table 8. Sample of segmentation result; the underscore character marks segmentation boundaries and the
boldfaced characters denote segmentation errors.
Sentence True BIE BIES BIO BIEOS
amIlaKI amIlaKI amIlaKI amIlaKI amIlaKI amIlaKI
Kea kea kea kea kea Kea
mIrIayu mI_rIay_u mI_rIay_u mI_rIay_u mI_rIay_u mI_rIay_u
zEbIhIgI zE_bIhIgI zE_bIhIgI zEbIhIgI zEbIhIgI zEbIhIgI
mIbIlaOu mI_bIlaO_u mI_bIlaO_u mI_bIlaO_u mI_bIlaO_u mI_bIlaO_u
zITIOumI zI_TIOumI zI_TIOumI zI_TIOumI zI_TIOumI zI_TIOumI
kWlu kWl_u kWl_u kWlu kWlu kWlu
omI omI omI omI omI omI
abI abI abI abI abI abI
mIdIri mIdIri mIdIri mIdIri mIdIri mIdIri
abIqWele a_bIqWel_e abIqWel_e abIqWel_e abIqWel_e abIqWel_e
7. CONCLUSION AND FUTURE WORK
In this work, we presented the first morphological segmentation research for the morphologically
rich Tigrinya language. The research was performed based on a new manually segmented corpus
comprising over 45,000 words. Four variants of the BIO chunk annotation scheme were
employed to train three different morphological segmentation models. The first model wasbased
on CRFs with language-independent features of characters, n-grams, and substrings. The other
two were based on LSTM and BiLSTM deep neural architectures leveraging character
embeddings of a fixed-size window for extracting the morpheme boundary features. We
evaluated the BIE, BIES, BIO, and BIOES tagging schemes for morphological boundary
15. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
43
detection. Although the size of the corpus is relatively small, we achieved 94.67% F1 score in
boundary detection using the BIE chunking scheme.
In the future, we plan to improve and enlarge the corpus by including vocabularies from different
domains. This will allow the inclusion of potentially unseen patterns as the current orthographic
style is limited to news domain. We also plan to extend our experiments to use embeddings with
character and substring concatenated features for the BIO and BIOES schemes, which currently
have lower performance. We are also interested in integrating minimally supervised approaches
to make use of large unlabeled datasets. Segmented words have proved useful in mitigating the
adverse effects of data sparseness in Semitic language processing [28]. We would like to explore
the use of our segmentation output in machine translation, full-fledged morphological analysis,
stemming, part of speech tagging, and other downstream NLP tasks.
ACKNOWLEDGEMENT
We would like to thank Dr. Chenchen Ding, from NICT Japan, for his constructive comments.
REFERENCES
[1] M. Creutz and K. Lagus, “Unsupervised discovery of morphemes,” in Proceedings of the ACL02
workshop on morphological and phonological learning-Volume 6.Association for Computational
Linguistics, 2002, pp. 21–30.
[2] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models
for segmenting and labeling sequence data,” in Proceedings of the Eighteenth International
Conference on MachineLearning, ser. ICML ’01, San Francisco, USA, 2001, pp. 282–289.
[3] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” in Neural computation, vol. 9, pp.
1735–80, 12 1997.
[4] S. Wintner, Morphological processing of Semitic languages. Berlin, Heidelberg: Springer Berlin
Heidelberg, 2014, pp. 43–66.
[5] M. Gasser, “Hornmorpho 2.5 user’s guide,” Indiana University, Indiana, 2012.
[6] N. A. Kifle, “Differential object marking and topicality in Tigrinya,” pp. 4–25, 2007.
[7] A. Sahle, A comprehensive Tigrinya grammar, Lawrenceville NJ: Red Sea Press, Inc., 1998, pp. 71–
72.
[8] M. Gasser, “Semitic morphological analysis and generation using finite state transducers with feature
structures,” in Proceedings of the 12th Conferenceof the European Chapter of the Association for
Computational Linguistics. Association for Computational Linguistics, 2009, pp. 309–317.
[9] T. Ruokolainen, O. Kohonen, K. Sirts, S.-A.Grnroos, M. Kurimo, and S. Virpioja, “A comparative
study of minimally supervised morphological segmentation,” Computational Linguistics, vol. 42, no.
1, pp. 91–120, 2016.
[10] B. Snyder and R. Barzilay, “Unsupervised multilingual learning for morphological segmentation,”
Proceedings of ACL-08: HLT, pp. 737–745, 2008.
[11] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural Language
Processing (almost) from scratch,” Journalof Machine Learning Research, vol. 12, pp. 2493–2537,
2011.
[12] C. D. Santos and B. Zadrozny, “Learning character-level representations for part-of-speech tagging,”
in Proceedings of the 31st International Conferenceon Machine Learning, vol. 32, no. 2, Bejing,
China, 22–24 Jun 2014, pp. 1818–1826.
[13] N. Zalmout and N. Habash, “Don’t throw those morphological analyzers away just yet: Neural
morphological disambiguation for Arabic,” in Proceedingsof the 2017 Conference on Empirical
Methods in Natural LanguageProcessing, 2017, pp. 704–713.
16. International Journal on Natural Language Computing (IJNLC) Vol. 7, No.2, April 2018
44
[14] C. Ding, Y. K. Thu, M.Utiyama, and E. Sumita, “Word segmentation for Burmese (Myanmar),” ACM
Transactions on Asian and Low-ResourceLanguage Information Processing (TALLIP), vol. 15, no. 4,
p. 22, 2016
[15] W. Mulugeta, M. Gasser, and B. Yimam, “Incremental learning of affix segmentation,” Proceedings
of COLING 2012, pp. 1901–1914, 2012.
[16] Y. Tedla and K. Yamamoto, “The effect of shallow segmentation on English-Tigrinya statistical
machine translation,” in 2016 InternationalConference on Asian Language Processing (IALP), Nov
2016, pp. 79–82.
[17] N. Reimers and I. Gurevych, “Optimal hyperparameters for deep LSTM networks for sequence
labeling tasks,” arXiv preprint arXiv:1707.06799, 2017.
[18] Y. Kitagawa and M. Komachi, “Long short-term memory for Japanese word segmentation,” arXiv
preprint arXiv:1709.08011, 2017.
[19] L. Wang, Z. Cao, Y. Xia, and G. de Melo, “Morphological segmentation with window LSTM neural
networks.” in Proceedings of the Thirtieth AAAIConference on Artificial Intelligence (AAAI-16)
Morphological. Association for the Advancement of Artificial Intelligence, 2016.
[20] B. Plank, A. Søgaard, and Y. Goldberg, “Multilingual part-of-speech tagging with bidirectional long
short term memory models and auxiliary loss,” arXiv preprint arXiv:1604.05529, 2016.
[21] X. Ma and E. H. Hovy, “End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF,” CoRR,
vol. abs/1603.01354, 2016.
[22] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way
to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no.
1, pp. 1929–1958, 2014.
[23] M. D. Zeiler, “Adadelta: An adaptive learning rate method,” CoRR, vol. abs/1212.5701, 2012.
[24] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980,
2014.
[25] Y. Dauphin, H. de Vries, J. Chung, and Y. Bengio, “RMSProp and equilibrated adaptive learning
rates for non-convex optimization,” CoRR, vol. abs/1502.04390, 2015.
[26] T. Ruokolainen, O. Kohonen, S. Virpioja, and M. Kurimo, “Supervised morphological segmentation
in a low-resource learning setting using conditional random fields,” in Proceedings of the Seventeenth
Conference on Computational Natural Language Learning.Association for Computational Linguistics,
August 2013, pp. 29–37.
[27] S. Benjamin and B. Regina, “Cross-lingual propagation for morphological analysis,” in Proceedings
of the Twenty-Third AAAI Conference on Artificial Intelligence. Association for the Advancement of
Artificial Intelligence, 2008, pp. 848–854.
[28] H. Al-Haj and A. Lavie, “The impact of Arabic morphological segmentation on broad-coverage
English-to Arabic statistical machine translation,” Machine Translation, vol. 26, no. 1, pp. 3–24, Mar
2012.