In this paper we report on an experimental syntactically and morphologically driven rule-based Arabic tagger, the MTE Tagger. The tagger is developed from Arabic grammatical and structural rules and requires no pre-tagged text; its only lexical resources are a primitive set of lexicon items, simple lists of closed word classes such as demonstrative nouns, pronouns, and certain particles. To evaluate tagging accuracy, a set of Arabic text was manually prepared and annotated: both the MTE tagger and the Stanford tagger were run on 1226 sentences containing close to 20,000 tokens, human-annotated and verified to serve as a testbed. The results were very encouraging: in both test runs the MTE tagger outperformed the Stanford tagger in accuracy, 87.88% versus 86.67%, while in speed of execution the difference was huge, with the ratio averaging 1:50 in the MTE tagger's favor. Further accuracy improvements are possible in future work as the rule set is further optimized and integrated and as more Arabic language properties, such as word-final diacritization, are exploited.
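The rule-plus-closed-class-lexicon design described above can be sketched as follows. The transliterated lexicon entries and the three affix rules are toy assumptions for illustration, not the actual MTE rule set (which operates on Arabic script):

```python
# Minimal sketch of a rule-based tagger: closed-class lexicon lookup
# first, then morphological (affix) rules, with UNK left for later rules.
CLOSED_CLASS = {
    "hatha": "DEM",    # demonstrative (transliterated for readability)
    "huwa": "PRON",    # pronoun
    "fi": "PREP",      # particle/preposition
}

def tag_token(token: str) -> str:
    """Return a POS tag using lexicon lookup first, then affix rules."""
    if token in CLOSED_CLASS:
        return CLOSED_CLASS[token]
    # Illustrative morphological cues (real rules work on Arabic script).
    if token.startswith("al"):      # definite article -> likely noun
        return "NOUN"
    if token.endswith("un"):        # nominative tanween -> noun
        return "NOUN"
    if token.startswith("ya"):      # imperfect prefix -> verb
        return "VERB"
    return "UNK"                    # left for later rules or context

def tag_sentence(tokens):
    return [(t, tag_token(t)) for t in tokens]

print(tag_sentence(["hatha", "alkitab", "yaqra"]))
# [('hatha', 'DEM'), ('alkitab', 'NOUN'), ('yaqra', 'VERB')]
```

Because lookup precedes the affix rules, closed-class words can never be misclassified by a coincidental affix match, which is one reason such taggers need no pre-tagged training text.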
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
Most languages, especially in Africa, have few or no established part-of-speech (POS) tagged corpora. However, a POS tagged corpus is essential for natural language processing (NLP) to support advanced research such as machine translation, speech recognition, etc. Even where no POS tagged corpus exists, parallel texts are available online for some languages. The task of POS tagging a new language corpus with a new tagset usually faces a bootstrapping problem at the initial stages of the annotation process: with no automatic tagger to help the human annotator, it appears infeasible to quickly produce adequate amounts of POS tagged corpus for advanced NLP research and for training taggers. In this paper, we demonstrate the efficacy of a POS annotation method that employs two automatic approaches to assist POS tagged corpus creation for a language new to NLP. The two approaches are cross-lingual and monolingual POS tag projection. We used cross-lingual projection to automatically create an initial 'errorful' tagged corpus for a target language via word alignment, with the resources for creating it derived from a source language rich in NLP resources. A monolingual method is then applied to clean the noise induced by the alignment process and to transform the source language tags into target language tags. We used English and Igbo as our case study: parallel texts exist between English and Igbo, and the source language, English, has NLP resources available. The results of the experiment show a steady improvement in accuracy and in the rate of tag transformation, with scores ranging from 6.13% to 83.79% and from 8.67% to 98.37% respectively. The rate of tag transformation measures the rate at which source language tags are translated to target language tags.
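The cross-lingual projection step described above can be sketched as copying source-side tags across alignment links; the toy sentence, tags, and links below are illustrative assumptions (a real pipeline would obtain the links from an aligner such as GIZA++ or fast_align):

```python
# Cross-lingual POS projection: source-language tags are copied across
# word-alignment links to produce an initial "errorful" target tagging.
def project_tags(src_tags, alignment, tgt_len):
    """src_tags: list of (word, tag) for the source sentence.
    alignment: list of (src_idx, tgt_idx) word-alignment links.
    Returns a target-side tag list; unaligned tokens stay 'UNK'."""
    tgt_tags = ["UNK"] * tgt_len
    for s, t in alignment:
        tgt_tags[t] = src_tags[s][1]
    return tgt_tags

english = [("the", "DET"), ("man", "NOUN"), ("came", "VERB")]
links = [(1, 0), (2, 1)]     # toy alignment to a two-token target sentence
print(project_tags(english, links, 2))   # ['NOUN', 'VERB']
```

Tokens left as 'UNK' (and tags projected through noisy links) are exactly what the monolingual cleaning stage must repair.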
XMODEL: An XML-based Morphological Analyzer for Arabic Language
Morphological analysis is an essential stage in language engineering applications. For Arabic, this stage is not easy to develop because the language has particularities such as agglutination and a great deal of morphological ambiguity. These make the design of a morphological analyzer for Arabic difficult and demand many additional tools and treatments. The size of the lexicon is another major problem of Arabic morphological analysis, directly affecting the analysis process. In this paper we present a morphological analyzer for Modern Standard Arabic based on the Arabic Morphological Automaton technique, using a new and innovative language (XMODEL) to represent Arabic morphological knowledge in an optimal way. Both the morphological analyzer and the morphological automaton are implemented in Java and use XML technology. The Buckwalter Arabic Morphological Analyzer and Xerox Arabic Finite State Morphology are two of the best-known morphological analyzers for Modern Standard Arabic, and both are available and documented. Our morphological analyzer can be exploited by Natural Language Processing (NLP) applications such as machine translation, orthographical correction, information retrieval, and both syntactic and semantic analyzers. Finally, an evaluation of the Xerox system and ours is presented.
Ara-CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Recent advancements in natural language processing have markedly enhanced the capability of machines to comprehend human language. However, as language models progress, they require continuous architectural enhancements and different approaches to text processing. One significant challenge stems from the rich diversity of languages, each characterized by its distinctive grammar, resulting in decreased accuracy of language models for specific languages, especially low-resource ones. This limitation is exacerbated by the reliance of existing NLP models on rigid tokenization methods, rendering them susceptible to issues with previously unseen or infrequent words. Additionally, models based on word and subword tokenization are vulnerable to minor typographical errors, whether they occur naturally or result from adversarial misspellings. To address these challenges, this paper utilizes a recently proposed tokenization-free method, CANINE, to enhance the comprehension of natural language. Specifically, we employ this method to develop a tokenization-free Arabic language model. We evaluate our model's performance across eight tasks from the Arabic Language Understanding Evaluation (ALUE) benchmark, and we conduct a comparative analysis pitting our tokenization-free model against existing Arabic language models that rely on sub-word tokenization. By making our pre-training and fine-tuning models accessible to the Arabic NLP community, we aim to facilitate the replication of our experiments and contribute to the advancement of Arabic language processing capabilities. To further support reproducibility and open-source collaboration, the complete source code and model checkpoints will be made publicly available on our Hugging Face page.
In conclusion, the results of our study demonstrate that the tokenization-free approach exhibits performance comparable to established Arabic language models that use sub-word tokenization techniques. Notably, in certain tasks our model surpasses some of these existing models. This evidence underscores the efficacy of tokenization-free processing of the Arabic language, particularly in specific linguistic contexts.
Stemming is the process of reducing words to their stems or roots. Due to the morphological richness and
complexity of the Arabic language, stemming is an essential part of most Natural Language Processing
(NLP) tasks for this language. In this paper, we study the impact of different stemming approaches on the
Named Entity Recognition (NER) task for Arabic and explore the merits, limitations and differences
between light stemming and root-extraction methods. Our experiments are evaluated on the standard
ANERCorp dataset as well as the AQMAR Arabic Wikipedia Named Entity Corpus.
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
The document presents a new hybrid stemming algorithm for Arabic text categorization. It combines three existing stemming approaches - root-based (Khoja stemmer), stem-based (Larkey stemmer), and statistical (N-gram). The hybrid approach first normalizes text, removes stop words and prefixes/suffixes. It then uses bigram similarity and the Dice measure to find valid roots by comparing word bigrams to a root dictionary. The algorithm is evaluated using naive Bayesian and SVM classifiers in an Arabic text categorization system, and is found to outperform the individual stemming methods. The proposed stemmer improves the performance of Arabic text categorization.
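The root-matching step above compares word bigrams to dictionary roots using the Dice measure. A minimal sketch, with transliterated toy data standing in for the real system's Arabic script (after normalization and affix stripping):

```python
# Dice coefficient on character-bigram sets, used to match a stripped
# stem against a root dictionary, as in the hybrid stemmer's root step.
def bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice coefficient: 2 * |A ∩ B| / (|A| + |B|) over bigram sets."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

root_dictionary = ["ktb", "drs", "qrA"]   # toy transliterated roots

def best_root(stem):
    """Pick the dictionary root most similar to the candidate stem."""
    return max(root_dictionary, key=lambda r: dice(stem, r))

print(dice("ktab", "ktb"))   # 0.4 -- one shared bigram 'kt'
print(best_root("ktab"))     # 'ktb'
```

A similarity threshold would normally be applied before accepting the best-scoring root as valid, so that words with no plausible root are left to the other stemming strategies.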
This document discusses different techniques for improving Arabic information retrieval from automatically transcribed Arabic speech, including normalization, stopword removal, light stemming, heavy stemming, and tokenization. It presents the results of an experiment applying these techniques to Arabic text extracted from television news videos. The techniques of normalization, stopword removal, and light stemming increased retrieval precision according to the experiment, while heavy stemming and trigrams had a negative effect. The choice of machine translation engine for the English queries was also shown to significantly impact retrieval effectiveness.
Hybrid Part-of-Speech Tagger for Non-Vocalized Arabic Text
The document presents a hybrid part-of-speech tagging method for Arabic text that combines rule-based and statistical approaches. Rule-based tagging alone can misclassify words and leave some untagged, so the method integrates it with a Hidden Markov Model tagger. The hybrid approach is evaluated on two Arabic corpora and achieves accuracy rates of 97.6% and 98%, outperforming the individual rule-based and HMM taggers.
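The hybrid idea above can be sketched as a backoff: run the rule-based tagger first, and let the statistical model decide only the tokens the rules left untagged. Both component taggers below are stand-in assumptions, not the paper's actual components:

```python
# Hybrid tagging sketch: rule-based first, statistical backoff second.
def rule_tagger(token):
    rules = {"fi": "PREP", "hatha": "DEM"}   # toy closed-class rules
    return rules.get(token)                  # None = left untagged

def hmm_tagger(token):
    # Stand-in for an HMM's most-likely tag; a real HMM would use
    # transition and emission probabilities over the whole sentence.
    return "NOUN"

def hybrid_tag(tokens):
    """Use the rule tagger's answer when it has one, else back off."""
    return [(t, rule_tagger(t) or hmm_tagger(t)) for t in tokens]

print(hybrid_tag(["fi", "albayt"]))   # [('fi', 'PREP'), ('albayt', 'NOUN')]
```

A full integration, as the paper's results suggest, would also let the statistical model override rule-based tags it considers unlikely in context, rather than only filling gaps.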
Bantu Spell Checker and Corrector using Modified Edit Distance Algorithm (MEDA)
Automatic spelling correction for a language is critical since the current world is almost entirely dependent on digital devices that employ electronic keyboards. Correct spelling adds to the accessibility and readability of textual documents, and many NLP applications, such as web search engines, text summarization, and sentiment analysis, rely on automatic spelling correction. A few efforts on automatic spelling correction in Bantu languages have been completed, but their number is insufficient. We propose a spell checker for typed words based on a Modified minimum Edit Distance Algorithm (MEDA) and a Syllable Error Detection Algorithm (SEDA). In this study, we adjusted the minimum edit distance algorithm by including a frequency score for letters and ordered operations. The SEDA identifies the component of the word and the position of the letter containing an error. For this research, the Setswana language was used for testing, and other languages related to Setswana can use this spell checker. Setswana is a Bantu language spoken mostly in Botswana, South Africa, and Namibia, and its automatic spelling correction is still in its early stages; it is Botswana's national language and is mostly used in schools and government offices. Accuracy was measured on 2,500 Setswana words. The SEDA discovered incorrect Setswana words with 99% accuracy. When evaluating MEDA, the plain edit distance algorithm was used as the baseline and achieved 52% accuracy; the edit distance algorithm with ordered operations reached 64%, and MEDA produced 92%. The model failed on closely related terms.
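An edit distance with per-letter substitution weights, in the spirit of MEDA's frequency-scored operations, can be sketched as below. The weight table is a toy assumption; the paper derives its scores from Setswana letter frequencies:

```python
# Dynamic-programming edit distance where substituting frequently
# confused letter pairs costs less than a full edit operation.
def weighted_edit_distance(a, b, sub_weight=None):
    """Edit distance with optional discounted substitution costs."""
    sub_weight = sub_weight or {}
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub_weight.get(
                (a[i - 1], b[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

# Assume 'e' and 'ê' are often confused, so swapping them is cheap.
weights = {("e", "ê"): 0.2, ("ê", "e"): 0.2}
print(weighted_edit_distance("sela", "sêla", weights))   # 0.2 (vs 1.0 plain)
```

Ranking correction candidates by this weighted distance pushes plausible diacritic confusions ahead of arbitrary one-letter edits, which is how frequency scoring lifts accuracy over the plain algorithm.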
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM
This article describes the morphological analysis of standard Arabic for natural language processing, as part of an electronic dictionary-construction phase. A fully inflected model of 3-lettered (triliteral) verbs is formalized based on a linguistic classification, using the NOOJ platform. The classification yields certain representative verbs that are considered as lemmas; these verbs form our dictionary entries, and they are conjugated according to our inflection paradigm, relying on certain specific morphological properties. This dictionary will serve as an Arabic resource that will help NLP applications and the NOOJ platform to analyse sophisticated Arabic corpora.
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
Stemming is the process of term conflation: it conflates all word variants to a common form called a stem. It plays a significant role in numerous Natural Language Processing (NLP) applications such as morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation, and information retrieval (IR). Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all of them. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and the existing stemmers for Indic languages.
COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMS
In the context of Information Retrieval, Arabic stemming has become a major research area. Many researchers have developed algorithms to solve the stemming problem, but each proposes his own methodology and measurements to test the performance and compute the accuracy of his algorithm, so accurate comparisons between these algorithms are difficult. In this paper, many generic conflation techniques and stemming algorithms are theoretically analyzed. Then the main characteristics of the Arabic language that need to be mentioned before discussing Arabic stemmers are summarized. The evaluation of the algorithms shows that Arabic stemming remains one of the major challenges of information retrieval. This paper aims to compare the most commonly used light stemmers in terms of affix lists, algorithms, main ideas, and information retrieval performance. The results show that the light10 stemmer outperformed the other stemmers. Finally, recommendations for future research on developing a standard Arabic stemmer are presented.
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...
Phonetization is the transcription from written text into sounds. It is used in many natural language processing tasks, such as speech processing, speech synthesis, and computer-aided pronunciation assessment. A common phonetization approach is the use of letter-to-sound rules developed by linguists for the transcription from grapheme to sound. In this paper, we address the problem of rule-based phonetization of standard Arabic. The paper's contributions can be summarized as follows: 1) discussion of the transcription rules of standard Arabic used in the literature at the phonemic and phonetic levels; 2) improvements to existing rules and new rules, together with a comprehensive algorithm covering the phenomenon of pharyngealization in standard Arabic, with the resulting rule set tested on large datasets; 3) a reliable automatic phonetic transcription of standard Arabic at five levels: phoneme, allophone, syllable, word, and sentence. An encoding covering all sounds of standard Arabic is proposed, and several pronunciation dictionaries have been automatically generated. These dictionaries have been manually verified, yielding an accuracy higher than 99% for standard Arabic texts that do not contain dates, numbers, acronyms, abbreviations, or special symbols. The dictionaries are available for research purposes.
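Letter-to-sound rules of the kind discussed above can be sketched as an ordered list of rewrites applied left-to-right to a grapheme string, with longer patterns tried first. The transliterated rules below are toy assumptions, not the paper's actual Arabic rule set:

```python
# Ordered letter-to-sound rules: first (longest) matching pattern wins.
RULES = [
    ("sh", "ʃ"),    # digraphs before single letters
    ("th", "θ"),
    ("aa", "aː"),   # long vowel
    ("s", "s"),
    ("a", "a"),
    ("m", "m"),
]

def phonetize(graphemes: str) -> str:
    """Greedy left-to-right application of the ordered rule list."""
    out, i = [], 0
    while i < len(graphemes):
        for pattern, phone in RULES:
            if graphemes.startswith(pattern, i):
                out.append(phone)
                i += len(pattern)
                break
        else:
            out.append(graphemes[i])   # pass unknown letters through
            i += 1
    return " ".join(out)

print(phonetize("shaam"))   # ʃ aː m
```

Rule ordering is what makes digraphs like 'sh' win over a bare 's'; context-sensitive effects such as pharyngealization would add left/right-context conditions to each rule.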
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
This article introduces a methodology for analyzing sentiment in Arabic text using a foreign lexical source. Our method leverages a resource available in another language, the English SentiWordNet, to supplement the limited language resources of Arabic. The knowledge taken from the external resource is injected into the feature model while the machine-learning-based classifier is trained. The first step of our method builds the bag-of-words (BOW) model of the Arabic text. The second step calculates polarity scores using machine translation and the English SentiWordNet; the scores for each text are added to the model as three features: objective, positive, and negative. The last step trains the ML classifier on that model to predict the sentiment of the Arabic text. Our method increases performance compared with the baseline BOW model in most cases. In addition, it appears to be a viable approach to sentiment analysis of Arabic text where available resources are limited.
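The feature-injection step above appends three lexicon-derived scores to the bag-of-words vector before training. A minimal sketch, where the toy transliterated vocabulary and scores stand in for the translated SentiWordNet lookups:

```python
# Build a BOW vector and append (objective, positive, negative) scores.
from collections import Counter

VOCAB = ["film", "jamil", "sayyi"]   # toy transliterated Arabic vocabulary

# Toy (objective, positive, negative) scores per word; the real system
# obtains these by machine-translating each word and querying SentiWordNet.
LEXICON = {"jamil": (0.1, 0.8, 0.1), "sayyi": (0.2, 0.0, 0.8)}

def featurize(tokens):
    counts = Counter(tokens)
    bow = [counts[w] for w in VOCAB]
    obj = sum(LEXICON.get(t, (0, 0, 0))[0] for t in tokens)
    pos = sum(LEXICON.get(t, (0, 0, 0))[1] for t in tokens)
    neg = sum(LEXICON.get(t, (0, 0, 0))[2] for t in tokens)
    return bow + [obj, pos, neg]     # BOW plus the three polarity features

print(featurize(["film", "jamil"]))   # [1, 1, 0, 0.1, 0.8, 0.1]
```

Any classifier trained on these vectors then sees both surface word counts and aggregate polarity evidence, which is exactly what lets the external lexicon improve on the plain BOW baseline.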
Grapheme-To-Phoneme Tools for the Marathi Speech Synthesis
We describe in detail a Grapheme-to-Phoneme (G2P) converter required for the development of a good-quality Marathi Text-to-Speech (TTS) system. The Festival and Festvox framework is chosen for developing the Marathi TTS system. Since Festival does not provide complete language processing support specific to various languages, it needs to be augmented to facilitate the development of TTS systems for certain new languages. For this reason, a generic G2P converter has been developed. In the customized Marathi G2P converter, we have handled schwa deletion and compound word extraction. In experiments carried out to test the Marathi G2P on a text segment of 2485 words, 91.47% word phonetisation accuracy is obtained. This Marathi G2P has been used for phonetising large text corpora, which in turn are used in designing an inventory of phonetically rich sentences. The sentences ensured good coverage of the phonetically valid di-phones using only 1.3% of the complete text corpora.
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models
Displaying a document in Middle Eastern languages requires contextual analysis because each character of the alphabet has different presentational forms. The words of the document are formed by joining the correct positional glyphs representing the corresponding presentational forms of the characters. A set of rules defines the joining of the glyphs; as usual, these rules vary from language to language and are subject to interpretation by software developers.
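The joining rules above can be sketched as a contextual-form selector: each letter belongs to a joining class, and its glyph (isolated, initial, medial, or final) depends on whether its neighbours can connect to it. The three transliterated letters and their classes below are a simplified toy assumption, not a full Arabic shaping implementation:

```python
# 'dual' letters join on both sides; 'right' letters join only to the
# preceding (right-hand, in Arabic reading order) letter.
JOIN_CLASS = {"b": "dual", "s": "dual", "a": "right"}

def glyph_forms(word):
    """Pick the positional form of each letter from its neighbours."""
    forms = []
    for i, ch in enumerate(word):
        prev_joins = i > 0 and JOIN_CLASS.get(word[i - 1]) == "dual"
        next_joins = (i < len(word) - 1 and
                      JOIN_CLASS.get(ch) == "dual" and
                      JOIN_CLASS.get(word[i + 1]) is not None)
        if prev_joins and next_joins:
            forms.append((ch, "medial"))
        elif prev_joins:
            forms.append((ch, "final"))
        elif next_joins:
            forms.append((ch, "initial"))
        else:
            forms.append((ch, "isolated"))
    return forms

print(glyph_forms("bas"))
# [('b', 'initial'), ('a', 'final'), ('s', 'isolated')]
```

Note how the right-joining letter breaks the connection: the letter after it must fall back to its isolated or initial form, which is the kind of context dependence the HMM-based analysis models.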
Building of Database for English-Azerbaijani Machine Translation Expert System
In this article the results of developing a machine translation expert system are presented. An approach based on defining translation correspondences is suggested as the background for creating the system's database and knowledge base. The methods for compiling transformation rules, applied to the linguistic knowledge base of the expert system, are based on defining translation correspondences between the Azerbaijani and English languages.
Natural Language Processing with Python and Amharic Syntax Parse Tree
Natural Language Processing is an interrelated discipline that adds the capability of communicating like human beings to the computer world. The Amharic language has seen much improvement over time thanks to researchers at the PhD and MSc levels at AAU. Here, I have tried to study and come up with a limited-scope solution that performs syntax parsing for the Amharic language and draws syntax parse trees using Python.
Hybrid approaches for automatic vowelization of Arabic texts
Hybrid approaches for automatic vowelization of Arabic texts are presented in this article. The process is made up of two modules. In the first, a morphological analysis of the text words is performed using the open-source morphological analyzer AlKhalil Morpho Sys; the outputs for each word, analyzed out of context, are its different possible vowelizations. Integrating this analyzer into our vowelization system required the addition of a lexical database containing the most frequent words in the Arabic language. Using a statistical approach based on two hidden Markov models (HMMs), the second module aims to eliminate the ambiguities. For the first HMM, the unvowelized Arabic words are the observed states and the vowelized words are the hidden states. The observed states of the second HMM are identical to those of the first, but its hidden states are the lists of possible diacritics of the word without its Arabic letters. Our system uses the Viterbi algorithm to select the optimal path among the solutions proposed by AlKhalil Morpho Sys. Our approach opens an important way to improve the performance of automatic vowelization of Arabic texts for other uses in automatic natural language processing.
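The disambiguation step above selects, via Viterbi, the most probable sequence of vowelized forms among the candidates the morphological analyzer proposes for each word. A generic sketch over toy candidates and probabilities (all words and numbers are illustrative assumptions):

```python
# Viterbi over per-word candidate sets: hidden states are the possible
# vowelizations of each observed unvowelized word.
def viterbi(observations, candidates, trans, emit):
    """candidates[w]: hidden states (vowelizations) possible for word w.
    trans[(s1, s2)], emit[(s, w)]: probabilities, defaulting to tiny."""
    best = {s: emit.get((s, observations[0]), 1e-6)
            for s in candidates[observations[0]]}
    path = {s: [s] for s in best}
    for w in observations[1:]:
        new_best, new_path = {}, {}
        for s in candidates[w]:
            prev = max(best, key=lambda p: best[p] * trans.get((p, s), 1e-6))
            new_best[s] = (best[prev] * trans.get((prev, s), 1e-6)
                           * emit.get((s, w), 1e-6))
            new_path[s] = path[prev] + [s]
        best, path = new_best, new_path
    return path[max(best, key=best.get)]

obs = ["ktb", "alwld"]                       # unvowelized words (observed)
cands = {"ktb": ["kataba", "kutub"], "alwld": ["alwaladu", "alwalada"]}
trans = {("kataba", "alwaladu"): 0.7, ("kutub", "alwaladu"): 0.1}
emit = {("kataba", "ktb"): 0.6, ("kutub", "ktb"): 0.4,
        ("alwaladu", "alwld"): 0.9, ("alwalada", "alwld"): 0.1}
print(viterbi(obs, cands, trans, emit))      # ['kataba', 'alwaladu']
```

Restricting each position's hidden states to the analyzer's candidate list is what keeps the search tractable compared with a full vocabulary-wide HMM.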
Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are being utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivational languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots could be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence
model: it captures the meaning of a word based on its context. In this paper, a distributional semantics
model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, at least a 9.4% improvement over other stemmers.
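A minimal sketch of how a smoothed PMI co-occurrence model might be computed (a generic formulation with a context-smoothing exponent alpha, not necessarily the paper's exact SPMI variant; the word/context pairs are invented transliterated examples):

```python
import math
from collections import Counter

def spmi_matrix(pairs, alpha=0.75):
    """Smoothed positive PMI over (word, context) co-occurrence pairs.
    Raising context counts to `alpha` damps PMI's bias toward rare contexts;
    negative values are clipped to zero (positive PMI)."""
    joint = Counter(pairs)
    words = Counter(w for w, _ in pairs)
    ctx = Counter(c for _, c in pairs)
    total = sum(joint.values())
    ctx_norm = sum(n ** alpha for n in ctx.values())
    scores = {}
    for (w, c), n in joint.items():
        p_wc = n / total
        p_w = words[w] / total
        p_c = ctx[c] ** alpha / ctx_norm       # smoothed context probability
        scores[(w, c)] = max(0.0, math.log(p_wc / (p_w * p_c)))
    return scores

# Toy co-occurrence pairs: (word, candidate-root context).
pairs = [("kitab", "ktb"), ("kitab", "ktb"), ("kitab", "qlm"), ("qalam", "qlm")]
scores = spmi_matrix(pairs)
```

A stemmer could then prefer, among a word's ambiguous candidate roots, the one with the highest SPMI association to the word's context.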
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...gerogepatton
Many automatic translation efforts have addressed major European language pairs by taking advantage of large-scale parallel corpora, but very little research has been conducted on the Amharic-Arabic language pair due to its parallel-data scarcity. Moreover, there is no benchmark parallel Amharic-Arabic text corpus available for the machine translation task. Therefore, a small parallel Quranic text corpus was constructed by aligning the existing monolingual Arabic text with its equivalent Amharic translation, both available from Tanzil. Experiments were carried out on two Neural Machine Translation (NMT) models, based on Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), using the attention-based encoder-decoder architecture adapted from the open-source OpenNMT system. The LSTM- and GRU-based NMT models and the Google Translation system were compared; the LSTM-based OpenNMT outperformed the GRU-based OpenNMT and Google Translation, with BLEU scores of 12%, 11%, and 6%, respectively.
A Survey of Arabic Text Classification Models IJECEIAES
There is a huge amount of Arabic text available online that requires organization. As a result, many natural language processing (NLP) applications are concerned with text organization; one of them is text classification (TC), which makes dealing with unorganized text easier by assigning each text to a suitable class or label. This paper is a survey of Arabic text classification. It also presents a comparison among different methods for classifying Arabic texts, where Arabic is a complex language due to its vocabulary. Arabic is one of the richest languages in the world, with many linguistic bases, yet research on Arabic language processing is scarce compared to English. These problems represent challenges in the classification and organization of Arabic text. TC provides access to documents that have already been assigned to one or more specific classes or categories. In addition, the classification of documents helps a search engine reduce the number of documents it must consider, making it easier to search and match against queries.
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...cscpconf
Source and target word segmentation and alignment is a primary step in the statistical learning of transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful for dealing with out-of-vocabulary words in English-Chinese in the presence of multiple target dialects, we asked whether this would hold for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected this syllable-like method to perform marginally better, but we found instead that even though our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model
somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language modeling approaches. Our experiments were conducted for English-to-Telugu transliteration (our method applies equally well to most written Indic languages). Our training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum-entropy-based transliteration learning system due to Kumaran and Kellner; our testing consisted of applying the same segmentation to a test English word, applying the model, and re-ranking the resulting top 10 Telugu words. We also report the dataset creation and selection, since standard datasets are not available.
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...gerogepatton
This document describes a study that used machine learning models to measure the complexity of pronouncing different languages. The study trained a character-level transformer model on a grapheme-to-phoneme transliteration task for 22 languages. It found that languages with a more direct grapheme-to-phoneme mapping, like Esperanto and Malay, were easier for the model to learn than languages with a more complex mapping, like Cantonese and Japanese. The complexity of a language's pronunciation was found to correlate with how direct or simple its grapheme-to-phoneme mapping is. The study also noted that comparing languages fairly requires considering differences in available training data per language.
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...IJITE
Machine learning models allow us to compare languages by showing how hard a task in each language might be to learn and perform well on. Following this line of investigation, we explore what makes a language “hard to pronounce” by modelling the task of grapheme-to-phoneme (g2p) transliteration. By training a character-level transformer model on this task across 22 languages and measuring the model’s proficiency against its grapheme and phoneme
inventories, we show that certain characteristics emerge that separate easier and harder
languages with respect to learning to pronounce. Namely the complexity of a language's
pronunciation from its orthography is due to the expressive or simplicity of its grapheme-tophoneme mapping. Further discussion illustrates how future studies should consider relative data sparsity per language to design fairer cross-lingual comparison tasks.
Investigation of the Effect of Obstacle Placed Near the Human Glottis on the ...kevig
Simulation of a human vocal tract model is necessary to understand the effect of different
conditions on speech generation. In natural calamities such as earthquakes, a person may swallow
soil due to the large amount of dust around. The effect of soil on the sound pressure in the human
vocal tract is investigated. Since the generation of speech starts at the glottis, the soil obstacle is
positioned near it. The whole model is designed and analyzed using FEM. As the speech signal
requires a bandwidth of about 4 kHz, the investigation was carried out to obtain the sound pressure
in the human vocal tract for sound signals from 100-5000 Hz.
Identification and Classification of Named Entities in Indian Languageskevig
The process of identifying Named Entities (NEs) in a given document and then classifying them into
different categories of NEs is referred to as Named Entity Recognition (NER). A great deal of effort is
needed to perform NER in Indian languages and achieve the same or higher accuracy than that obtained for
English and the European languages. In this paper, we present the results we achieved by performing NER
in Hindi, Bengali and Telugu using a Hidden Markov Model (HMM), along with performance metrics.
Similar to A GRAMMATICALLY AND STRUCTURALLY BASED PART OF SPEECH (POS) TAGGER FOR ARABIC LANGUAGE
Bantu Spell Checker and Corrector using Modified Edit Distance Algorithm (MEDA)jennifer steffan
Automatic spelling correction for a language is critical, since the current world is almost entirely dependent on digital devices that employ electronic keyboards. Correct spelling adds to the accessibility and readability of textual documents, and many NLP applications, such as web search engines, text summarization, and sentiment analysis, rely on automatic spelling correction. A few efforts on automatic spelling correction in Bantu languages have been completed; however, they are insufficient. We propose a spell checker for typed words based on the Modified Edit Distance Algorithm (MEDA) and the Syllable Error Detection Algorithm (SEDA). In this study, we adjusted the minimum edit distance algorithm by including a frequency score for letters and ordered operations. The SEDA identifies the component of the word and the position of the letter that contains an error. For this research, the Setswana language was used for testing; other languages related to Setswana can use this spell checker as well. Setswana is a Bantu language spoken mostly in Botswana, South Africa, and Namibia, and automatic spelling correction for it is still in its early stages. Setswana is Botswana's national language and is mostly used in schools and government offices. For assessment, accuracy was measured on 2500 Setswana words. The SEDA discovered incorrect Setswana words with 99% accuracy. When evaluating MEDA, the edit distance algorithm was used as the baseline and produced an accuracy of 52%; the edit distance algorithm with ordered operations provided 64% accuracy, and MEDA produced 92% accuracy. The model failed on closely related terms.
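The dynamic-programming core that MEDA modifies can be sketched as follows. The per-letter-pair substitution weighting (letter frequency, ordered operations) is the paper's contribution and is not specified here, so `sub_cost` is left as a pluggable assumption:

```python
def weighted_edit_distance(a, b, sub_cost=None):
    """Edit distance via dynamic programming. `sub_cost(x, y)` lets a caller
    weight substitutions per letter pair (e.g. by letter frequency);
    by default it is 0 for a match and 1 for a mismatch."""
    cost = sub_cost or (lambda x, y: 0 if x == y else 1)
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                              # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                              # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + cost(a[i - 1], b[j - 1]))  # substitution
    return d[m][n]

print(weighted_edit_distance("kitten", "sitting"))  # -> 3
```

A spell checker would rank dictionary words by this distance from the typed word and suggest the closest candidates.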
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORMijnlc
This article describes the morphological analysis of standard Arabic for natural language processing, as
part of an electronic dictionary-construction phase. A fully inflected model of three-letter verbs is
formalized based on a linguistic classification, using the NOOJ platform. The classification yields certain
representative verbs that will be considered as lemmas; these verbs form our dictionary entries and are
conjugated according to our inflection paradigm, relying on certain specific morphological properties. This
dictionary will be considered an Arabic resource, which will help NLP applications and the NOOJ platform
analyse sophisticated Arabic corpora.
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES ijnlc
Stemming is the process of term conflation: it conflates all the variants of a word to a common form called a stem. It plays a significant role in numerous Natural Language Processing (NLP) applications like morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation, and information retrieval (IR). Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all these applications. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and existing stemmers for Indic languages.
COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMSIJMIT JOURNAL
In the context of information retrieval, Arabic stemming algorithms have become a major research area.
Many researchers have developed algorithms to solve the stemming problem, each proposing his own
methodology and measurements to test the performance and compute the accuracy of his algorithm; thus,
accurate comparisons between these algorithms are difficult. Many generic conflation techniques and
stemming algorithms are theoretically analyzed in this paper. Then, the main characteristics of the Arabic
language that must be covered before discussing Arabic stemmers are summarized. The evaluation of the
algorithms in this paper shows that Arabic stemming is still one of the major information retrieval
challenges. This paper aims to compare the most commonly used light stemmers in terms of affix lists,
algorithms, main ideas, and information retrieval performance. The results show that the light10 stemmer
outperformed the other stemmers. Finally, recommendations for future research regarding the development
of a standard Arabic stemmer are presented.
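To make the light-stemming idea concrete, here is a minimal sketch in the spirit of light10; the affix lists below are abbreviated assumptions for illustration, not the stemmer's actual lists:

```python
# Abbreviated, assumed affix lists (the real light10 lists differ).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]

def light_stem(word, min_len=3):
    """Strip one leading article/conjunction and any trailing suffixes,
    never reducing the remainder below `min_len` letters."""
    for p in sorted(PREFIXES, key=len, reverse=True):   # longest prefix first
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    changed = True
    while changed:                                      # peel suffixes repeatedly
        changed = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) - len(s) >= min_len:
                word = word[:-len(s)]
                changed = True
                break
    return word

print(light_stem("والكتاب"))  # "and the book" -> كتاب
```

Unlike root stemming, this light approach only removes common clitics and inflections, which is why affix-list design dominates its retrieval performance.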
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...CSCJournals
Phonetization is the transcription of written text into sounds. It is used in many natural language processing tasks, such as speech processing, speech synthesis, and computer-aided pronunciation assessment. A common phonetization approach is the use of letter-to-sound rules developed by linguists for the transcription from grapheme to sound. In this paper, we address the problem of rule-based phonetization of standard Arabic. The paper's contributions can be summarized as follows: 1) Discussion of the transcription rules of standard Arabic used in the literature at the phonemic and phonetic levels. 2) Improvements of existing rules are suggested and new rules are introduced; moreover, a comprehensive algorithm covering the phenomenon of pharyngealization in standard Arabic is proposed, and the resulting rule set has been tested on large datasets. 3) We present a reliable automatic phonetic transcription of standard Arabic at five levels: phoneme, allophone, syllable, word, and sentence. An encoding that covers all sounds of standard Arabic is proposed, and several pronunciation dictionaries have been automatically generated. These dictionaries have been manually verified, yielding an accuracy higher than 99% for standard Arabic texts that do not contain dates, numbers, acronyms, abbreviations, and special symbols. The dictionaries are available for research purposes.
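A letter-to-sound rule pass of the kind discussed can be sketched as follows; the rule table is an invented toy over a Latin transliteration rather than Arabic script, and real systems apply context-sensitive rules (e.g. for pharyngealization and sun letters) rather than a flat lookup:

```python
# Toy grapheme-to-sound rules; longest match wins, as digraphs like "sh"
# must be matched before their component letters.
RULES = {"sh": "ʃ", "th": "θ", "a": "a", "b": "b", "t": "t"}

def phonetize(text):
    """Left-to-right, longest-match-first application of letter-to-sound rules."""
    phones, i = [], 0
    while i < len(text):
        if text[i:i + 2] in RULES:        # try a two-letter grapheme first
            phones.append(RULES[text[i:i + 2]])
            i += 2
        elif text[i] in RULES:            # fall back to a single letter
            phones.append(RULES[text[i]])
            i += 1
        else:                             # skip symbols with no rule
            i += 1
    return phones

print(phonetize("shab"))  # -> ['ʃ', 'a', 'b']
```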
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...ijnlc
This article introduces a methodology for analyzing sentiment in Arabic text using a foreign lexical
source. Our method leverages a resource available in another language, SentiWordNet in English, for the
resource-limited Arabic language. The knowledge taken from the external resource is injected into the
feature model while the machine-learning-based classifier is trained. The first step of our method is to
build the bag-of-words (BOW) model of the Arabic text. The second step calculates polarity scores using
machine translation and the English SentiWordNet; for each text, three scores are added to the model, for
objectivity, positivity, and negativity. The last step of our method involves training the ML classifier on
that model to predict the sentiment of the Arabic text. Our method increases performance compared with
the baseline BOW model in most cases. In addition, it appears to be a viable approach to sentiment
analysis in Arabic text, where available resources are limited.
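The feature-augmentation step can be sketched as follows; the lexicon entries here are invented placeholders standing in for scores that the paper obtains via machine translation and the English SentiWordNet:

```python
from collections import Counter

# Hypothetical polarity lexicon: word -> (positive, negative, objective) scores.
# In the described method these come from English SentiWordNet via machine
# translation; the entries below are invented for illustration.
LEXICON = {"jamil": (0.8, 0.0, 0.2), "sayyi": (0.0, 0.9, 0.1)}

def features(tokens):
    """Bag-of-words counts augmented with three summed polarity scores."""
    bow = dict(Counter(tokens))
    pos = sum(LEXICON.get(t, (0, 0, 0))[0] for t in tokens)
    neg = sum(LEXICON.get(t, (0, 0, 0))[1] for t in tokens)
    obj = sum(LEXICON.get(t, (0, 0, 0))[2] for t in tokens)
    bow.update({"__pos__": pos, "__neg__": neg, "__obj__": obj})
    return bow

print(features(["jamil", "jiddan"]))
```

The resulting feature dictionaries would then be fed to any standard classifier; the three extra features are what distinguish this model from the plain BOW baseline.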
Grapheme-To-Phoneme Tools for the Marathi Speech SynthesisIJERA Editor
We describe in detail a Grapheme-to-Phoneme (G2P) converter required for the development of a good quality
Marathi Text-to-Speech (TTS) system. The Festival and Festvox framework is chosen for developing the
Marathi TTS system. Since Festival does not provide complete language processing support specific to various
languages, it needs to be augmented to facilitate the development of TTS systems in certain new languages.
Because of this, a generic G2P converter has been developed. In the customized Marathi G2P converter, we
have handled schwa deletion and compound word extraction. In the experiments carried out to test the Marathi
G2P on a text segment of 2485 words, 91.47% word phonetisation accuracy is obtained. This Marathi G2P has
been used for phonetising large text corpora which in turn is used in designing an inventory of phonetically rich
sentences. The sentences ensured a good coverage of the phonetically valid di-phones using only 1.3% of the
complete text corpora.
Contextual Analysis for Middle Eastern Languages with Hidden Markov Modelsijnlc
Displaying a document in Middle Eastern languages requires contextual analysis due to different presentational forms for each character of the alphabet. The words of the document will be formed by the joining of the correct positional glyphs representing corresponding presentational forms of the
characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers.
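A minimal sketch of such joining rules follows. It is greatly simplified, assuming only dual-joining and non-joining letter classes, whereas real Arabic-script shaping must also handle right-joining-only letters and ligatures:

```python
def positional_forms(word, dual):
    """Choose a positional form (isolated/initial/medial/final) per letter.
    `dual` is the set of letters that can join on both sides; all other
    letters are treated as non-joining (a deliberate simplification)."""
    forms = []
    for i, ch in enumerate(word):
        prev_joins = ch in dual and i > 0 and word[i - 1] in dual
        next_joins = ch in dual and i + 1 < len(word) and word[i + 1] in dual
        if prev_joins and next_joins:
            forms.append("medial")
        elif prev_joins:
            forms.append("final")
        elif next_joins:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

# Three mutually joining letters -> initial, medial, final glyph forms.
print(positional_forms("btk", {"b", "t", "k"}))
```

A renderer would map each (letter, form) pair to the corresponding glyph; the variation across languages mentioned above lives in the joining classes and the glyph tables, not in this selection loop.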
Effect of Query Formation on Web Search Engine Resultskevig
Query in a search engine is generally based on natural language. A query can be expressed in more than
one way without changing its meaning as it depends on thinking of human being at a particular moment.
Aim of the searcher is to get most relevant results immaterial of how the query has been expressed. In the
present paper, we have examined the results of search engine for change in coverage and similarity of first
few results when a query is entered in two semantically same but in different formats. Searching has been
made through Google search engine. Fifteen pairs of queries have been chosen for the study. The t-test has
been used for the purpose and the results have been checked on the basis of total documents found,
similarity of first five and first ten documents found in the results of a query entered in two different
formats. It has been found that the total coverage is same but first few results are significantly different.
Investigations of the Distributions of Phonemic Durations in Hindi and Dogrikevig
Speech generation is one of the most important areas of research in speech signal processing which is now gaining a serious attention. Speech is a natural form of communication in all living things. Computers with the ability to understand speech and speak with a human like voice are expected to contribute to the development of more natural man-machine interface. However, in order to give those functions that are even closer to those of human beings, we must learn more about the mechanisms by which speech is produced and perceived, and develop speech information processing technologies that can generate a more natural sounding systems. The so described field of stud, also called speech synthesis and more prominently acknowledged as text-to-speech synthesis, originated in the mid eighties because of the emergence of DSP and the rapid advancement of VLSI techniques. To understand this field of speech, it is necessary to understand the basic theory of speech production. Every language has different phonetic alphabets and a different set of possible phonemes and their combinations.
For the analysis of the speech signal, we have carried out the recording of five speakers in Dogri (3 male and 5 females) and eight speakers in Hindi language (4 male and 4 female). For estimating the durational distributions, the mean of mean of ten instances of vowels of each speaker in both the languages has been calculated. Investigations have shown that the two durational distributions differ significantly with respect to mean and standard deviation. The duration of phoneme is speaker dependent. The whole investigation can be concluded with the end result that almost all the Dogri phonemes have shorter duration, in comparison to Hindi phonemes. The period in milli seconds of same phonemes when uttered in Hindi were found to be longer compared to when they were spoken by a person with Dogri as his mother tongue. There are many applications which are directly of indirectly related to the research being carried out. For instance the main application may be for transforming Dogri speech into Hindi and vice versa, and further utilizing this application, we can develop a speech aid to teach Dogri to children. The results may also be useful for synthesizing the phonemes of Dogri using the parameters of the phonemes of Hindi and for building large vocabulary speech recognition systems.
May 2024 - Top10 Cited Articles in Natural Language Computingkevig
Natural Language Processing is a programmed approach to analyze text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and build software that will analyze, understand, and generate languages that humans use naturally to address computers.
Effect of Singular Value Decomposition Based Processing on Speech Perceptionkevig
Speech is an important biological signal for primary mode of communication among human being and also the most natural and efficient form of exchanging information among human in speech. Speech processing is the most important aspect in signal processing. In this paper the theory of linear algebra called singular value decomposition (SVD) is applied to the speech signal. SVD is a technique for deriving important parameters of a signal. The parameters derived using SVD may further be reduced by perceptual evaluation of the synthesized speech using only perceptually important parameters, where the speech signal can be compressed so that the information can be transformed into compressed form without losing its quality. This technique finds wide applications in speech compression, speech recognition, and speech synthesis. The objective of this paper is to investigate the effect of SVD based feature selection of the input speech on the perception of the processed speech signal. The speech signal which is in the form of vowels \a\, \e\, \u\ were recorded from each of the six speakers (3 males and 3 females). The vowels for the six speakers were analyzed using SVD based processing and the effect of the reduction in singular values was investigated on the perception of the resynthesized vowels using reduced singular values. Investigations have shown that the number of singular values can be drastically reduced without significantly affecting the perception of the vowels.
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Modelskevig
Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have been conducted on tasks related to relevance judgment using Large Language Models (LLMs) such as GPT-4,
demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to
identify which specific terms in prompts positively or negatively impact relevance
evaluation with LLMs. We employed two types of prompts: those used in previous
research and generated automatically by LLMs. By comparing the performance of
these prompts in both few-shot and zero-shot settings, we analyze the influence of
specific terms in the prompts. We have observed two main findings from our study.
First, we discovered that prompts using the term ‘answer’ lead to more effective
relevance evaluations than those using ‘relevant.’ This indicates that a more direct
approach, focusing on answering the query, tends to enhance performance. Second,
we noted the importance of appropriately balancing the scope of ‘relevance.’ While
the term ‘relevant’ can extend the scope too broadly, resulting in less precise evaluations, an optimal balance in defining relevance is crucial for accurate assessments.
The inclusion of few-shot examples helps in more precisely defining this balance.
By providing clearer contexts for the term ‘relevance,’ few-shot examples contribute
to refine relevance criteria. In conclusion, our study highlights the significance of
carefully selecting terms in prompts for relevance evaluation with LLMs.
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Modelskevig
Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have been conducted on tasks related to relevance judgment using Large Language Models (LLMs) such as GPT-4, demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to identify which specific terms in prompts positively or negatively impact relevance evaluation with LLMs. We employed two types of prompts: those used in previous research and generated automatically by LLMs. By comparing the performance of these prompts in both few-shot and zero-shot settings, we analyze the influence of specific terms in the prompts. We have observed two main findings from our study. First, we discovered that prompts using the term ‘answer’ lead to more effective relevance evaluations than those using ‘relevant.’ This indicates that a more direct approach, focusing on answering the query, tends to enhance performance. Second, we noted the importance of appropriately balancing the scope of ‘relevance.’ While the term ‘relevant’ can extend the scope too broadly, resulting in less precise evaluations, an optimal balance in defining relevance is crucial for accurate assessments. The inclusion of few-shot examples helps in more precisely defining this balance. By providing clearer contexts for the term ‘relevance,’ few-shot examples contribute to refine relevance criteria. In conclusion, our study highlights the significance of carefully selecting terms in prompts for relevance evaluation with LLMs.
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word
sense disambiguator systems that, given a word appearing in a certain context, can identify the sense of
that word. In this paper we consider the problem of deciding whether same words contained in different
documents are related to the same meaning or are homonyms. Our goal is to improve the estimate of the
similarity of documents in which some words may be used with different meanings. We present three new
strategies for solving this problem, which are used to filter out homonyms from the similarity computation.
Two of them are intrinsically non-semantic, whereas the other one has a semantic flavor and can also be
applied to word sense disambiguation. The three strategies have been embedded in an article document
recommendation system that one of the most important Italian ad-serving companies offers to its customers.
Genetic Approach For Arabic Part Of Speech Taggingkevig
With the growing number of textual resources available, the ability to understand them becomes critical.
An essential first step in understanding these sources is the ability to identify the parts-of-speech in each
sentence. Arabic is a morphologically rich language, which presents a challenge for part of speech
tagging. In this paper, our goal is to propose, improve, and implement a part-of-speech tagger based on a
genetic algorithm. The accuracy obtained with this method is comparable to that of other probabilistic
approaches.
Rule Based Transliteration Scheme for English to Punjabikevig
Machine Transliteration has come out to be an emerging and a very important research area in the field of
machine translation. Transliteration basically aims to preserve the phonological structure of words. Proper
transliteration of name entities plays a very significant role in improving the quality of machine translation.
In this paper we are doing machine transliteration for English-Punjabi language pair using rule based
approach. We have constructed some rules for syllabification. Syllabification is the process to extract or
separate the syllable from the words. In this we are calculating the probabilities for name entities (Proper
names and location). For those words which do not come under the category of name entities, separate
probabilities are being calculated by using relative frequency through a statistical machine translation
toolkit known as MOSES. Using these probabilities we are transliterating our input text from English to
Punjabi.
Improving Dialogue Management Through Data Optimizationkevig
In task-oriented dialogue systems, the ability for users to effortlessly communicate with machines and computers through natural language stands as a critical advancement. Central to these systems is the dialogue manager, a pivotal component tasked with navigating the conversation to effectively meet user goals by selecting the most appropriate response. Traditionally, the development of sophisticated dialogue management has embraced a variety of methodologies, including rule-based systems, reinforcement learning, and supervised learning, all aimed at optimizing response selection in light of user inputs. This research casts a spotlight on the pivotal role of data quality in enhancing the performance of dialogue managers. Through a detailed examination of prevalent errors within acclaimed datasets, such as Multiwoz 2.1 and SGD, we introduce an innovative synthetic dialogue generator designed to control the introduction of errors precisely. Our comprehensive analysis underscores the critical impact of dataset imperfections, especially mislabeling, on the challenges inherent in refining dialogue management processes.
Document Author Classification using Parsed Language Structurekevig
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine authorship of all of The Federalist Papers. Such methods may be useful in more modern times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship using grammatical structural information extracted using a statistical natural language parser. This paper provides a proof of concept, testing author classification based on grammatical structure on a set of “proof texts,” The Federalist Papers and Sanditon which have been as test cases in previous authorship detection studies. Several features extracted from the statisticalnaturallanguage parserwere explored: all subtrees of some depth from any level; rooted subtrees of some depth, part of speech, and part of speech by level in the parse tree. It was found to be helpful to project the features into a lower dimensional space. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
Rag-Fusion: A New Take on Retrieval Augmented Generationkevig
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal scores and fusing the documents and scores. Through manually evaluating answers on accuracy, relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and comprehensive answers due to the generated queries contextualizing the original query from various perspectives. However, some answers strayed off topic when the generated queries' relevance to the original query is insufficient. This research marks significant progress in artificial intelligence (AI) and natural language processing (NLP) applications and demonstrates transformations in a global and multi-industry context.
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...kevig
The common practice in Machine Learning research is to evaluate the top-performing models based on their performance. However, this often leads to overlooking other crucial aspects that should be given careful consideration. In some cases, the performance differences between various approaches may be insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches (SVM-based) and more recent approaches such as LLM (BERT family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only performance parameters (standard indices), but also alternative measures such as timing, energy consumption and costs, which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and the in-production phases. These phases follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of complex models (LLM and generative models), consuming much less energy and requiring fewer resources. These findings suggest that companies should consider additional considerations when choosing machine learning (ML) solutions. The analysis also demonstrates that it is increasingly necessary for the scientific world to also begin to consider aspects of energy consumption in model evaluations, in order to be able to give real meaning to the results obtained using standard metrics (Precision, Recall, F1 and so on).
Evaluation of Medium-Sized Language Models in German and English Languagekevig
Large language models (LLMs) have garnered significant attention, but the definition of “large” lacks clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six billion parameters but less than 100 billion. The study evaluates MLMs regarding zero-shot generative question answering, which requires models to provide elaborate answers without external document retrieval. The paper introduces an own test dataset and presents results from human evaluation. Results show that combining the best answers from different MLMs yielded an overall correct answer rate of 82.7% which is better than the 60.9% of ChatGPT. The best MLM achieved 71.8% and has 33B parameters, which highlights the importance of using appropriate training data for fine-tuning rather than solely relying on the number of parameters. More fine-grained feedback should be used to further improve the quality of answers. The open source community is quickly closing the gap to the best commercial models.
IMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATIONkevig
In task-oriented dialogue systems, the ability for users to effortlessly communicate with machines and
computers through natural language stands as a critical advancement. Central to these systems is the
dialogue manager, a pivotal component tasked with navigating the conversation to effectively meet user
goals by selecting the most appropriate response. Traditionally, the development of sophisticated dialogue
management has embraced a variety of methodologies, including rule-based systems, reinforcement
learning, and supervised learning, all aimed at optimizing response selection in light of user inputs. This
research casts a spotlight on the pivotal role of data quality in enhancing the performance of dialogue
managers. Through a detailed examination of prevalent errors within acclaimed datasets, such as
Multiwoz 2.1 and SGD, we introduce an innovative synthetic dialogue generator designed to control the
introduction of errors precisely. Our comprehensive analysis underscores the critical impact of dataset
imperfections, especially mislabeling, on the challenges inherent in refining dialogue management
processes.
Document Author Classification Using Parsed Language Structurekevig
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the
text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used,
for example, to determine authorship of all of The Federalist Papers. Such methods may be useful in more modern
times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of
using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship
using grammatical structural information extracted using a statistical natural language parser. This paper provides a
proof of concept, testing author classification based on grammatical structure on a set of “proof texts,” The Federalist
Papers and Sanditon which have been as test cases in previous authorship detection studies. Several features extracted
of some depth, part of speech, and part of speech by level in the parse tree. It was found to be helpful to project the
features into a lower dimensional space. Statistical experiments on these documents demonstrate that information
from a statistical parser can, in fact, assist in distinguishing authors.
RAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATIONkevig
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product
information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots,
but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines
RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal
scores and fusing the documents and scores. Through manually evaluating answers on accuracy,
relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and
comprehensive answers due to the generated queries contextualizing the original query from various
perspectives. However, some answers strayed off topic when the generated queries' relevance to the
original query is insufficient. This research marks significant progress in artificial intelligence (AI) and
natural language processing (NLP) applications and demonstrates transformations in a global and multiindustry context
Performance, energy consumption and costs: a comparative analysis of automati...kevig
The common practice in Machine Learning research is to evaluate the top-performing models based on their
performance. However, this often leads to overlooking other crucial aspects that should be given careful
consideration. In some cases, the performance differences between various approaches may be insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into
account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches
(SVM-based) and more recent approaches such as LLM (BERT family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only performance
parameters (standard indices), but also alternative measures such as timing, energy consumption and costs,
which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and
the in-production phases. These phases follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of
complex models (LLM and generative models), consuming much less energy and requiring fewer resources.
These findings suggest that companies should consider additional considerations when choosing machine
learning (ML) solutions. The analysis also demonstrates that it is increasingly necessary for the scientific
world to also begin to consider aspects of energy consumption in model evaluations, in order to be able to
give real meaning to the results obtained using standard metrics (Precision, Recall, F1 and so on).
Accident detection system project report.pdfKamal Acharya
The Rapid growth of technology and infrastructure has made our lives easier. The
advent of technology has also increased the traffic hazards and the road accidents take place
frequently which causes huge loss of life and property because of the poor emergency facilities.
Many lives could have been saved if emergency service could get accident information and
reach in time. Our project will provide an optimum solution to this draw back. A piezo electric
sensor can be used as a crash or rollover detector of the vehicle during and after a crash. With
signals from a piezo electric sensor, a severe accident can be recognized. According to this
project when a vehicle meets with an accident immediately piezo electric sensor will detect the
signal or if a car rolls over. Then with the help of GSM module and GPS module, the location
will be sent to the emergency contact. Then after conforming the location necessary action will
be taken. If the person meets with a small accident or if there is no serious threat to anyone’s
life, then the alert message can be terminated by the driver by a switch provided in order to
avoid wasting the valuable time of the medical rescue team.
Applications of artificial Intelligence in Mechanical Engineering.pdfAtif Razi
Historically, mechanical engineering has relied heavily on human expertise and empirical methods to solve complex problems. With the introduction of computer-aided design (CAD) and finite element analysis (FEA), the field took its first steps towards digitization. These tools allowed engineers to simulate and analyze mechanical systems with greater accuracy and efficiency. However, the sheer volume of data generated by modern engineering systems and the increasing complexity of these systems have necessitated more advanced analytical tools, paving the way for AI.
AI offers the capability to process vast amounts of data, identify patterns, and make predictions with a level of speed and accuracy unattainable by traditional methods. This has profound implications for mechanical engineering, enabling more efficient design processes, predictive maintenance strategies, and optimized manufacturing operations. AI-driven tools can learn from historical data, adapt to new information, and continuously improve their performance, making them invaluable in tackling the multifaceted challenges of modern mechanical engineering.
Determination of Equivalent Circuit parameters and performance characteristic...pvpriya2
Includes the testing of induction motor to draw the circle diagram of induction motor with step wise procedure and calculation for the same. Also explains the working and application of Induction generator
Digital Twins Computer Networking Paper Presentation.pptxaryanpankaj78
A Digital Twin in computer networking is a virtual representation of a physical network, used to simulate, analyze, and optimize network performance and reliability. It leverages real-time data to enhance network management, predict issues, and improve decision-making processes.
This study Examines the Effectiveness of Talent Procurement through the Imple...DharmaBanothu
In the world with high technology and fast
forward mindset recruiters are walking/showing interest
towards E-Recruitment. Present most of the HRs of
many companies are choosing E-Recruitment as the best
choice for recruitment. E-Recruitment is being done
through many online platforms like Linkedin, Naukri,
Instagram , Facebook etc. Now with high technology E-
Recruitment has gone through next level by using
Artificial Intelligence too.
Key Words : Talent Management, Talent Acquisition , E-
Recruitment , Artificial Intelligence Introduction
Effectiveness of Talent Acquisition through E-
Recruitment in this topic we will discuss about 4important
and interlinked topics which are
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...PriyankaKilaniya
Energy efficiency has been important since the latter part of the last century. The main object of this survey is to determine the energy efficiency knowledge among consumers. Two separate districts in Bangladesh are selected to conduct the survey on households and showrooms about the energy and seller also. The survey uses the data to find some regression equations from which it is easy to predict energy efficiency knowledge. The data is analyzed and calculated based on five important criteria. The initial target was to find some factors that help predict a person's energy efficiency knowledge. From the survey, it is found that the energy efficiency awareness among the people of our country is very low. Relationships between household energy use behaviors are estimated using a unique dataset of about 40 households and 20 showrooms in Bangladesh's Chapainawabganj and Bagerhat districts. Knowledge of energy consumption and energy efficiency technology options is found to be associated with household use of energy conservation practices. Household characteristics also influence household energy use behavior. Younger household cohorts are more likely to adopt energy-efficient technologies and energy conservation practices and place primary importance on energy saving for environmental reasons. Education also influences attitudes toward energy conservation in Bangladesh. Low-education households indicate they primarily save electricity for the environment while high-education households indicate they are motivated by environmental concerns.
Open Channel Flow: fluid flow with a free surfaceIndrajeet sahu
Open Channel Flow: This topic focuses on fluid flow with a free surface, such as in rivers, canals, and drainage ditches. Key concepts include the classification of flow types (steady vs. unsteady, uniform vs. non-uniform), hydraulic radius, flow resistance, Manning's equation, critical flow conditions, and energy and momentum principles. It also covers flow measurement techniques, gradually varied flow analysis, and the design of open channels. Understanding these principles is vital for effective water resource management and engineering applications.
A GRAMMATICALLY AND STRUCTURALLY BASED PART OF SPEECH (POS) TAGGER FOR ARABIC LANGUAGE

International Journal on Natural Language Computing (IJNLC) Vol.11, No.5, October 2022
DOI: 10.5121/ijnlc.2022.11502

Mohamed Taybe Elhadi 1 and Ramadan Sayad Alfared 2

1 Software Engineering Department, Faculty of Information Technology, Zawia University, Az Zawiyah, Libya
2 Computer Technology Department, Faculty of Information Technology, Zawia University, Az Zawiyah, Libya
ABSTRACT
In this paper we report on an experimental, syntactically and morphologically driven rule-based Arabic
tagger. The tagger is built from Arabic grammatical rules and regulations; it requires no pre-tagged text
and relies only on a primitive set of lexicon items together with extensive grammatical and structural rules.
It was tested and compared to the Stanford tagger in terms of both accuracy and performance (speed). The
obtained results are quite comparable to those of the Stanford tagger, with a marginal difference in accuracy
favoring the developed tagger and a large difference in speed of execution. The newly developed tagger,
named the MTE Tagger, was evaluated for tagging accuracy on a set of Arabic text that was manually
prepared and annotated. The developed tagger makes use of no pre-annotated datasets, apart from a simple
lexicon consisting of lists of words representing closed word classes such as demonstrative nouns, pronouns,
and some particles. For the purpose of evaluation, the new tagger was run on multiple datasets and its results
were compared to those of the Stanford tagger. In particular, both taggers (the MTE and the Stanford) were
run on a set of 1,226 sentences with close to 20,000 tokens that was human annotated and verified to serve
as a testbed. The results were very encouraging: in both test runs, the MTE tagger outperformed the Stanford
tagger in accuracy, at 87.88% versus 86.67% for the Stanford tagger. In terms of tagging speed, the MTE
Tagger's running time relative to the Stanford tagger's was on average 1:50. Further accuracy improvements
are possible in future work as the set of rules is further optimized and integrated, and as more Arabic
language properties, such as end-of-word diacritization, are exploited.
KEYWORDS
Part of Speech Tagging, Rule-based POS, Arabic Text Processing.
1. INTRODUCTION
Natural Language Understanding (NLU) and Text Processing, with all of their related subfields, are part of what is termed computational linguistics. Computational linguistics is the combination of computing and linguistics that deals with the automatic processing of natural language using artificial intelligence and machine learning methods [1,2,3].
Part-of-speech (POS) tagging is an important component of NLP and a prerequisite for many text-processing subfields and applications. It is concerned with assigning to each language token a label representing its most appropriate grammatical or morpho-syntactic category. POS tagging is usually the first step in linguistic analysis and a very important intermediate step in many applications such as machine translation, parsing, information retrieval, and spell checkers and correctors [1]. POS tagging has matured considerably for English and many European languages. Unlike English, however, the Arabic language lacks NLP tools and resources, including POS taggers and their prerequisite resources, namely manually tagged corpora. When done on a large corpus, POS annotation is a labor-intensive and time-consuming task, and most POS processing is based on Markov models, which are statistical models trained on annotated datasets. Obtaining fast, high-quality tagging algorithms with high precision and accuracy therefore requires large manually annotated datasets. It is quite a disappointment for many language users, Arabic language users in particular, that there is very little, and hardly any freely available, annotated data for training and evaluation. With the exception of a very few tools, there are hardly any tools for Arabic part-of-speech tagging. A great many attempts have been made to produce POS taggers for Arabic using non-statistical alternatives such as rule-based and machine learning methods.
This paper reports on an experimental, syntactically and morphologically driven, rule-based Arabic tagger developed using Arabic grammatical rules and regulations, without the use of pre-tagged corpora and without the involvement of any language experts other than the authors, who are merely native Arabic speakers. It is developed using a primitive set of lexicon items along with extensive grammatical and structural rules. The tagger is tested and compared to a currently used tagger in terms of accuracy and performance (speed). The obtained results are quite comparable and in favour of the newly developed tagger, both in accuracy and, certainly more so, in speed of tagging.
This paper is organized as follows: Section 1 is an introduction; Section 2 is a review of Arabic POS tagging; Section 3 contains a description of the corpora and datasets used for testing and evaluation; Section 4 contains a detailed description and discussion of the new tagger (named the MTE Tagger); Section 5 describes the experiments conducted and the results obtained, along with discussion; finally, Section 6 contains concluding remarks, followed by the list of references used.
2. ARABIC LANGUAGE AND PART OF SPEECH TAGGING (POS)
Like many Semitic languages, Arabic has a history going back thousands of years [4]. The language is used as the language of journalism, media and education in the private sector as well as in public and governmental agencies. It is the medium of communication for close to 400 million Arabs in 21 states and for large communities all over the globe. It is also of interest to many other nations who share the religious beliefs of Islam, as it is the language of the Holy Quran. The Department of Cultural Affairs in the USA asserts that Arabic is one of the world's important languages, and the United Nations lists it as the sixth official language of the United Nations [5,6].
Modern Standard Arabic (MSA) is based on classical Arabic, on the Holy Quran and on Arabic literature. It is written and read in a right-to-left fashion using a set of twenty-eight letters. Arabic has three numbers (singular, dual and plural), two genders (feminine and masculine) and three grammatical cases (nominative, accusative and genitive). Words are grouped as nouns, verbs, adjectives, adverbs and particles [5,6,7].
Arabic has had limited access to technology, hindering research efforts in automation and utilization. With its relatively complex nature, Arabic itself poses quite a challenge for NLP [6]. Such challenges are further complicated by the existence of multiple parallel dialects that are similar in certain aspects but different in others [5].
It is stated in [8] that Arabic text is made up of 58% nouns, 31% particles and 11% verbs, with prepositions making up 14.1% of all words and 44.5% of all particles. Nouns in the genitive case make up 61.8% of the total, nouns in the accusative 9.6% and nouns in the nominative 18.5%. Nouns that occur after a preposition are thus more frequent than nominative and accusative nouns [9]. Such characteristics and challenges cause many ambiguities. Some of these ambiguities are inevitable and constitute an inherent part of the language. From the perspective of authors, particularly eloquent writers and poets, they are rather considered advantageous, as they introduce flexibility and expressiveness.
There are other challenges and limitations related to POS tagging that have to do with tag-set development and usage. Arabic's limited research on a standard tag set, its lack of resources, and its richly inflected and morphologically complex nature constitute the main reasons for such limited research. Arabic lacks manually tagged corpora to be used for training and evaluation, making it hard to develop tools such as POS taggers using a statistical approach [10-13].
Tag sets, an important part of POS research, had to be developed and researched. Some of the researched and used tag sets include the Brown tag set, which contains 226 tags; the LOB tag set, which contains 135 tags and is based on the tag set used for the Brown corpus; and the Penn Treebank tag set. Other Arabic tag sets used in the literature include the Khoja tag set, with 177 detailed tags [14], which was used for a semi-automatic tagger system. In [15] a tag set of 55 tags was used for an HMM tagger. In [1] a tag set containing 28 general tags and 161 detailed tags was used by the AMT tagger system [2].
2.1. Part of Speech Tagging (POS)
In natural languages, Arabic included, POS tagging is the process of assigning words to specific and representative linguistic or grammatical types. It is the assignment of POS tags to tokens taken from a sentence, a paragraph or a simple set of words. POS tagging is an important requirement for NLP applications, permitting the categorisation of words as adjectives, verbs, nouns or prepositions. Knowing the verbs in a sentence, for instance, can help us decide on the action(s) the sentence may contain and can help indicate sentential meaning. With POS tagging we are also able to chunk sentences and aid in figuring out functional components such as Subject, Object and Verb. A number of different approaches are used for POS tagging, with rule-based and stochastic methodologies [16,17] being the most common.
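For illustration only (a toy sketch with a hypothetical tagged sentence; the token list, tags and code are ours, not taken from any cited system), having POS tags makes such extraction trivial:

```python
# A tagged sentence as (token, tag) pairs; tags follow Penn-style conventions.
tagged = [("the", "DT"), ("tagger", "NN"), ("labels", "VBZ"),
          ("each", "DT"), ("token", "NN")]

# Verb tags start with "VB" (VB, VBD, VBP, VBZ, ...), so the sentence's
# action words can be pulled out with a single filter.
verbs = [tok for tok, tag in tagged if tag.startswith("VB")]
print(verbs)  # -> ['labels']
```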
2.1.1. Stochastic POS Models
Stochastic models use probabilistic methods based on first-order or second-order Markov models [18,19]. They build a trainable statistical language model and estimate its parameters from a previously tagged corpus. They make use of the probability of a word appearing with a specific tag and the probability of one tag following another. TreeTagger is an earlier example of such taggers, achieving an accuracy of up to 96.36% [19,20]. Stochastic systems require less work and cost than the rule-based approach and are considered more portable to other languages, provided that a large manually tagged corpus is available. On the other hand, they suffer from unknown words that cannot be tagged and from the lack of annotated corpora in certain languages. Some of the stochastic systems in use include CLAWS (Constituent-Likelihood Automatic Word-Tagging System), the PARTS system, and many other POS taggers [21-26].
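The bigram idea behind such stochastic taggers can be sketched as follows (a minimal Viterbi decoder over hypothetical toy probabilities; it is not the model of any cited system):

```python
# Toy transition P(tag_i | tag_{i-1}) and emission P(word | tag) tables,
# as if estimated from a tiny hand-tagged corpus (hypothetical numbers).
transitions = {("<s>", "DT"): 0.6, ("<s>", "NN"): 0.4,
               ("DT", "NN"): 0.9, ("DT", "JJ"): 0.1,
               ("JJ", "NN"): 0.8, ("JJ", "JJ"): 0.2,
               ("NN", "NN"): 0.3, ("NN", "DT"): 0.7}
emissions = {("DT", "the"): 0.7, ("NN", "tagger"): 0.4,
             ("NN", "rule"): 0.3, ("JJ", "fast"): 0.5}
TAGS = ["DT", "NN", "JJ"]

def viterbi(words):
    """Return the most probable tag sequence under the bigram HMM."""
    # best[tag] = (probability of best path ending in tag, that path)
    best = {t: (transitions.get(("<s>", t), 0.0) *
                emissions.get((t, words[0]), 0.0), [t]) for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            e = emissions.get((t, w), 0.0)
            # choose the predecessor maximizing path * transition * emission
            new[t] = max(((best[p][0] * transitions.get((p, t), 0.0) * e,
                           best[p][1] + [t]) for p in TAGS),
                        key=lambda x: x[0])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "fast", "tagger"]))  # -> ['DT', 'JJ', 'NN']
```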
2.1.2. Rule-based POS Models
Rule-based taggers go back to the 1960s and 1970s and use a set of linguistic rules during the tagging process. They are easy to maintain and can provide accurate and robust systems [27,28], but are very difficult to build. Some of the well-known rule-based systems are CGC (Computational Grammar Coder) [29,30], TAGGIT [32], the TBL (Transformation-Based error-driven Learning) system [33], the Fidditch system [34] and many others [35].
2.1.3. Hybrid and Other Systems
A number of systems with high accuracy have been produced using a combination of more than one model. Examples include [35,36] for European languages, with a reported accuracy of 98%; a POS tagger for the Hungarian language [12]; and the POS taggers developed by [37,38]. Many other POS taggers inspired by Artificial Intelligence have been developed, including machine learning, memory-based and neural network approaches [39-43]. Other methods used include Conditional Random Fields, Long Short-Term Memory (LSTM) and a variation on LSTM, the bidirectional LSTM (BiLSTM), in which the learning algorithm is fed with the original data from beginning to end, and then from end to beginning [44].
2.2. Arabic POS Tagging
An initial focal step in Arabic POS tagging was the adaptation of TreeTagger for Arabic, with a tag set that covers 22 different languages including Arabic [45,46]. Later on, the Stanford POS tagger was introduced, becoming one of the few taggers that supported Arabic [47]. Khoja developed an Arabic tagger that combines statistical and rule-based approaches, achieving an accuracy close to 90% [48]. Schmid reported on a language-independent tagger based on decision trees [49,50]. Algerian et al. reported an accuracy of 91% using a small manually annotated lexicon [51]. Yousif et al. used Support Vector Machines (SVM) and a tagged corpus, reporting an accuracy of 99.99% [52]. Labidi reported on similar work [53] using an augmented stately sliding window [54] based on a database of nearly 50,000 Arabic terms.
Al Shamsi et al. reported on a semi-automatic hybrid tagger that used statistical methods and morphological rules in the form of HMMs, achieving an accuracy of 90% [55]. Othmane et al. reported on an automatic Arabic POS tagger combining statistical and rule-based techniques, achieving an accuracy of 86% [56]. Mohamed et al. implemented an Arabic Brill's POS tagger using a manually created corpus [57]. Kučera et al. created a rule-based tagger, basing their work on the automatic annotation output produced by the morphological analyser of Tim Buckwalter, and reported an accuracy of 96% [58]. An Arabic POS tagger was developed using the support vector machine (SVM) method and LDC (Linguistic Data Consortium) data [59]. An HMM tagger for Arabic with an accuracy of 96% was reported in [60]. In [61], a tagger was developed that used rule-based and memory-based learning, achieving an accuracy of 86%.
In [62], the structure of the Arabic sentence was taken into account in developing an Arabic POS tagger for unvocalized text, with an accuracy of 97%. In [63], a POS tagger was developed using rule-based methods and an untagged, raw, partially vocalized corpus, making use of pattern-based, lexical and contextual rules; the system achieved an accuracy of 91%. [64] used a genetic algorithm and a reduced tag set to develop Arabic POS tagging. [65] considered the structure of the Arabic sentence, combining morphological analysis with Hidden Markov Models (HMMs); the recognition rate of this tagger reached 96%. [66,67] developed what they termed whole-word tagging and segmentation-based tagging. More research on Arabic POS tagging has been performed; most of this work uses hidden Markov models [68-70]. Unfortunately, most of the reported work is private, with limited availability.
3. CORPORA AND DATASETS
A corpus is normally looked at as a large collection of machine-readable text that is accessible and searchable. Its role is to provide POS systems with the linguistic knowledge needed to help resolve ambiguity. Well-known corpora for the English language include the Brown University corpus [71], the Bergen Corpus of London Teenage English (COLT) [72], the BNC (British National Corpus) [73], the Child Language Data Exchange System (CHILDES) [74], the TOSCA Corpus [65] and the Penn Treebank Corpus [75].
For the Arabic language, however, there is no free corpus available. A few corpora have been created for Arabic, including, but not limited to, the LDC Arabic Newswire corpus, the Hayat newspaper corpus, the An-Nahar Newspaper Text Corpus, the Buckwalter Arabic Corpus, the Nijmegen Corpus, the Penn Arabic Treebank Corpus and the Corpus of Contemporary Arabic [58,75,76,77].
As this work is based on the linguistic rules and regulations of the Arabic language, there was no need for corpora except for evaluation purposes. As will be discussed in Section 5, five datasets, referred to in this work as CNN-UTF8, Basel-Dataset, Arabic Discretized Books, Quranic Text Dataset and Annotated Dataset, were used for evaluation. They included two sets of news, one set from the Quran and one from books, along with the annotated set.
4. THE MTE TAGGER: THE PROPOSED APPROACH
The MTE Tagger is based entirely on readily available primitive data lists and a complex set of linguistic rules, both of which are highlighted next:

Figure 1. The overall process adopted.

1. An initial data collection using available net resources and datasets is performed. This step aims, with minimal work, to account for many of the closed Arabic word taxonomies: mostly short, limited, hand-formulated lists of closed word types of basic Arabic components such as pronouns, demonstrative nouns, etc., collected mainly from the web, plus a few non-exhaustive lists of things such as common names and geographical information like cities and countries. It resulted in the lists shown in Table 1 (mostly incomplete but quite useful).
[Figure 1 diagram: the lexicon lists and the rule set feed the MTE Tagger; the MTE Tagger and the Stanford Tagger are each run on Data Sets 1-3 and on the annotated data set, and their results are compared.]
Table 1. Sample Closed POS Categories Lists

Days; Weeks; Months; Seasons; Adjectives (صفات); God Attributes (صفات الله); Kana and Thanna verbs and their sisters (كان وظن واخواتها); Mausool Names (اسم الموصول); the Five Names (الأسماء الخمسة); Adjective Seq Numbers (ارقام تراتبية); Punctuation (ترقيم); Particles (حروف); Place and Time Adverbs (الظرف); Currencies (عملات); Tafdeel (تفضيل); Independent Pronouns (ضمائر منفصلة); Towns & Countries (مدن وبلدان); Exceptional Adverbs; Demonstrative Nouns (اسم الاشارة); Colors (الوان); Rabat Particles (حروف الربط); IN (Jar) Particles (حروف الجر); Numbers (ارقام); MSA Verbs (افعال); Particle Enna and its sisters (ان واخواتها); God Names (الأسماء الحسنى); People Names (أسماء الأشخاص)
2. A second set of very compact grammatical functions, or rules, representing the Arabic language rules for many syntactical and structural compositions. Table 2 lists the names of some of these rules. The tagger works sentence by sentence, first searching the set of lists (the primitive lexicon) to decide on the category of each word in the sentence. Next it iterates through the set of grammatical and structural rules, re-confirming existing categories and fixing new ones. At the end, a sentence may still have missing categories for some words; a number of experiments were tried, but the issue remains open for improvement.
Table 2. Grammatical, Morphological and Structural Procedures (Rules)

CkeckEsharPlus, CkeckFirstPOS, CkeckIfAdaadTratubiah, CkeckSplDarfI, CkeckTarkimPunI, CkeckTimesI, CkeckSplWordsI, CkeckCDI, CkeckHrfJarAndAftNNContextI, CkeckMudenBuldanI, EnaaGroupCTX, tamizeAlafadd, CkeckMaousoolNameCtxI, CkeckTafdealCtx1I, Emma, CkeckIfParticlesNotJar, CkeckIstithnaCtxI, EthandEthaPlus, CkeckNames, AtafGroupCTX, CkeckLemaCtx, CkeckDamerMunfaselI, CkeckStartWaAlifCtxI, BaseRuleSet1, CkeckWITHCtxI, CkeckMaaaCt, CkeckIfEThCtx, CanaZanaGroupCtx, CkeckMustathnaI, CkeckAlfLamCtxI, LaNafiaWaNahia, CkeckAlfFariqaCtxI, CkeckLELI, CkeckKADCtxI, ckeckSaYAVBCtx, setJJ
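The two-pass procedure described above (lexicon lookup, then contextual rules) can be sketched as follows; the lexicon entries and the single contextual rule are illustrative assumptions of ours, not the actual MTE lists or rules:

```python
# Pass 1 data: tiny closed-class word lists (hypothetical subset).
JAR_PARTICLES = {"في", "من", "الى"}      # prepositions (حروف الجر) -> IN
DEMONSTRATIVES = {"هذا", "هذه", "ذلك"}   # demonstrative nouns -> DT

def lexicon_pass(tokens):
    """Pass 1: assign categories from the closed-class lists."""
    tags = []
    for tok in tokens:
        if tok in JAR_PARTICLES:
            tags.append("IN")
        elif tok in DEMONSTRATIVES:
            tags.append("DT")
        else:
            tags.append(None)  # unknown; left for the rule pass
    return tags

def rule_pass(tokens, tags):
    """Pass 2: contextual rules confirm or fix categories. Illustrative
    rule: the word after a jar particle (IN) is a noun in the genitive,
    so tag it NN if still unknown."""
    for i in range(1, len(tokens)):
        if tags[i] is None and tags[i - 1] == "IN":
            tags[i] = "NN"
    return tags

tokens = ["هذا", "في", "البيت"]
print(rule_pass(tokens, lexicon_pass(tokens)))  # -> ['DT', 'IN', 'NN']
```

A real pass would iterate over the full rule set of Table 2 sentence by sentence, possibly still leaving some words untagged, as noted above.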
5. EXPERIMENTS, EVALUATIONS, AND DISCUSSIONS
The evaluation process consists of the following experiments, with results as shown in Table 3. Two sets of experiments were performed. The first set is made up of four runs on four different unannotated datasets, comparing the performance (accuracy and timing) of the new tagger to that of the Stanford Tagger.
Table 3. Accuracy and timing results comparison.

Dataset / Experiment        Accuracy (MTE / STF, %)   Stanford timing (sec)   MTE timing (sec)   MTE/STF ratio
CNN-UTF8                    74.23                     2954.5                  4.4                0.059
Basel-Dataset               75.2                      714.13                  0.245              0.0003
Arabic Discretized Books    75.45                     6052.57                 62.54              0.0103
Quranic Text Dataset        65.45                     3252.73                 4.1                0.0013
Average                     72.58                     3243.48                 71.29              0.022
Annotated Dataset           87.88 / 86.67             0.022                   0.020              0.91
The second set of experiments is based on a small selected dataset that was manually annotated. The two taggers were both run on this dataset, and the accuracy of tagging and speed of performance were noted and compared. Accuracy is the proportion of correctly tagged tokens, while performance is the speed of tagging. Because rule-based systems are expected to be much faster and more robust, the measurements taken are only indicative and lack the features of well-controlled experiments.
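Accuracy in this sense is simply the fraction of tokens whose tag matches the annotation; a trivial sketch (the tag lists below are hypothetical placeholders):

```python
def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold annotation."""
    assert len(predicted) == len(gold) and gold
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# e.g. 19 matching tokens out of 27, as in the worked sentence of Table 4:
print(round(accuracy(["NN"] * 19 + ["JJ"] * 8, ["NN"] * 27) * 100, 2))  # 70.37
```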
5.1. MTE Tagger vs. Stanford Tagger on Unannotated Datasets
This is a set of four experiments, each conducted on a different dataset.
5.1.1. Experiment-1: MTE Tagger vs. Stanford tagger on CNN-UTF8
This is a set of 5070 files of news articles taken from CNN, covering business, entertainment, Middle East, science, technology, sports and world news [1]. The whole set contains a total of 1,414,021 words. Comparing the MTE tagger to the Stanford tagger showed an overall agreement of 74.20%; that is, the taggers overlap on about three quarters of the tokens and differ on one quarter. In terms of timing, the results were 48 minutes and 44.5 seconds versus only 4.4 seconds, respectively; that is, the time taken by the MTE Tagger is only about 0.15% of that taken by the Stanford tagger.
5.1.2. Experiment-2: MTE Tagger vs. Stanford on Basel-Dataset
This is a set of 1000 hand-compiled files on a number of different subjects, namely science, politics, arts, sports and economics [78]. The set is made up of a total of 159,442 words. Comparing the MTE tagger to the Stanford tagger shows an overall agreement of 75.12%; again the numbers are similar to the previous set. In terms of timing, the results were 11 minutes and 54.13 seconds versus 0.245 seconds, which is only about 0.034%.
5.1.3. Experiment 3: MTE Tagger vs. Stanford on Arabic Discretized Books (Tashkeela)
This is a set of 20 files, each containing the text of a whole book [79]. The text is diacritized. The set is made up of a total of 298,416 words. Comparing the MTE tagger to the Stanford tagger shows an overall agreement of 75.43%; again the numbers are similar to the previous two sets. In terms of timing, the results were 1 hour, 40 minutes and 52.57 seconds versus 1 minute and 1.54 seconds, which is about 1%, albeit higher than the previous cases. This is probably due to the diacritics removal.
5.1.4. Experiment 4: Quranic Text Dataset
This a single file containing 214 chapters of the whole Quran. The set is made of 77,289 words. The
obtained results of comparing MTE tagger to Stanford tagger show and overall accuracy of
65.79%. Even though this lower than previous experiments results, still however within an ac-
ceptable range. In terms of timings, the results were 54 minutes and 12.73 seconds versus 4.1
seconds which is only 0.0013%. On average, there percentage accuracy is 88.89% and the speed is
2.2% faster in favor of MTE Tagger.
5.2. MTE Tagger vs. Stanford Tagger on Annotated Dataset:
This experiment is based on a manually tagged dataset consisting of a total of 17,485 words taken from set 1. The objective is to compare how the MTE tagger performs relative to the Stanford tagger. The results obtained were 87.89% for the MTE Tagger and 86.67% for the Stanford tagger. In terms of timing, the results showed that MTE took on average only 2.2% of the time taken by the Stanford tagger.
5.3. Discussions
The first set of experiments aimed at using the Stanford tagger on four datasets with variable tokens and contexts. The datasets included two sets of news, one from the Quran and one from books. These experiments, intended to validate the utility of the Stanford tagger and to compare its results to the MTE tagger, showed clearly that the MTE tagger did not match the Stanford tagger closely, with an average agreement of only 72.64%. This prompted us to study the data and examine the differences and agreements. It was clear that the difference was a result of disagreements in which, in many cases, Stanford failed to tag correctly, and vice versa.
The obtained results prompted us to annotate part of the datasets for use in evaluation. We studied the results, confirmed the correct tags and fixed the incorrect ones for a set of 1226 sentences. The manually annotated set was then used to compare the two taggers. As expected, better accuracies were obtained, with the MTE tagger having a marginally higher accuracy. The obtained results were very encouraging, and further refinement of the tagger, to include a more complete lexicon and more rules using further linguistic properties such as vocalization, will certainly make the tagger perform better. As for time performance, much better numbers were obtained, with MTE taking only 2.2% of the time taken by STF. Rule-based taggers are expected to be much faster, but we were surprised by the extent of the difference. Table 4 shows a sentence taken from the results.
Table 4. A sentence example: made of 27 tokens; the taggers match on 19 and mismatch on 8.

Word | MTE | STF | Agree? | Verify | 0/1
وبينت | VBD | VBD | Agree | Both right | 1
معهم | NN | JJ | Disagree | STF wrong | 1-0
الشرطة | DTNN | DTNN | Agree | Right | 1
ما | WP | WP | Agree | Right | 1
ان | IN | IN | Agree | Right | 1
يبدو | VBP | VBP | Agree | Right | 1
صلة | NN | NN | Agree | Right | 1
الفتة | NN | JJ | Disagree | Both wrong | 0-0
المهاجم | DTNN | DTNN | Agree | Right | 1
المحققين | DTNNS | DTNNS | Agree | Right | 1
تميل | VBP | VBP | Agree | Right | 1
تواصلوا | VBD | VBD | Agree | Right | 1
الي | IN | IN | Agree | Right | 1
مع | IN | NN | Disagree | STF wrong | 1-0
انه | NN | NN | Agree | Right | 1
زوجته | NN | NN | Agree | Right | 1
The calculated result is 19/27, indicating that both the percentage overlap and the tagging itself can still be made more accurate.
Table 5. Overall missed percentages for the different POS categories

Tag     DTNNPS  DTNNP  DTJJR  DT     CD     IN     NNS
Wrong   0       0      0      0      4      34     17
Right   50      35     16     126    647    1519   608
%       0       0      0      0      0.62   2.24   2.80

Tag     WP      CC     DTNNS  RP     NNP    JJR    RB
Wrong   7       8      13     4      15     3      32
Right   243     265    379    99     319    53     462
%       2.88    3.02   3.43   4.04   4.70   5.66   6.93

Tag     PRP     DTNN   VBD    NN     JJ     DTJJ   VBP
Wrong   6       532    88     1048   10     102    596
Right   84      5432   674    7014   56     537    916
%       7.14    9.79   13.06  14.94  17.86  18.99  65.07

Totals  Nouns   Verbs  Adjectives
Wrong   1625    684    112
Right   13837   1590   593
%       11.74   43.02  18.89
Looking at the overall success of tagging, we can see that adjectives (JJ) are among the least accurately tagged in MTE, and better rules will still have to be invented to improve the classification of JJs.
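The percentages in Table 5 can be reproduced from the raw counts; numerically, the reported figure is wrong/right rather than wrong/(wrong+right). A small verification sketch using counts taken from the table:

```python
# (wrong, right) counts for a few tags as listed in Table 5.
counts = {"PRP": (6, 84), "CC": (8, 265), "VBP": (596, 916)}

def missed_pct(wrong, right):
    """Percentage as reported in Table 5: wrong relative to right."""
    return round(wrong / right * 100, 2)

for tag, (w, r) in counts.items():
    print(tag, missed_pct(w, r))  # PRP 7.14, CC 3.02, VBP 65.07
```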
6. CONCLUSIONS
In this paper we reported on an experimental, syntactically and morphologically driven, rule-based Arabic tagger. The tagger is developed using Arabic grammatical rules and regulations. It requires no pre-tagged text and is built from a primitive set of lexicon items along with extensive grammatical and structural rules. It was tested and compared to the Stanford tagger in terms of both accuracy and performance (speed). The obtained results are quite comparable to the Stanford tagger's, with a marginal difference favoring the developed tagger in accuracy and a huge difference in performance. The newly developed tagger, named the MTE Tagger, has been tested and evaluated and was able to obtain an accuracy of 85% versus 82% for the Stanford tagger. The developed tagger makes use of no pre-annotated datasets, except for a simple lexicon consisting of lists of words representing closed word types such as demonstrative nouns, pronouns and some particles. For the purpose of evaluating the new tagger, it was run on multiple datasets and its results were compared to those of the Stanford tagger. In particular, both taggers (the MTE and the Stanford) were run on a set of 1226 sentences with close to 20,000 tokens that was human-annotated and verified to serve as a testbed. The results were very encouraging: in both test runs the MTE tagger outperformed the Stanford tagger, with accuracies of 87.88% versus 86.67% for the Stanford tagger. In terms of efficiency (speed of tagging), the MTE-to-Stanford ratio was about 1:50.
Table 4 (continued from above).

Word | MTE | STF | Agree? | Verify | 0/1
متاثر | NN | NN | Agree | Right | 1
دون | RB | NN | Disagree | STF wrong | 1-0
بالتنظيم | NN | NNP | Agree | Right | 1
تقديم | NN | NN | Agree | Right | 1
عوضا | NN | NN | Agree | Right | 1
تفاصيل | VBP | NN | Disagree | MTE wrong | 0-1
عن | IN | IN | Agree | Right | 1
اضافية | ----- | JJ | Disagree | MTE failed | 0-1
علي | IN | IN | Agree | Wrong | 0
تفاصيل | VBP | NN | Disagree | MTE wrong | 0-1
تواصل | VBP | NN | Disagree | MTE wrong | 0-1
اضافية | ----- | JJ | Disagree | MTE failed | 0-1
مباشر | NN | JJ | Disagree | MTE wrong | 0-1
Better accuracy is expected as the set of rules is optimized and other Arabic language properties, such as end-of-word diacritization, are used.
REFERENCES
[1] Abumalloh RA, Al-Sarhan HM, Ibrahim O, Abu-Ulbeh W. Arabic part-of-speech tagging. Journal of
Soft Computing and Decision Support Systems. 2016 Feb 25;3(2):45-52.
[2] Das BR, Sahoo S, Panda CS, Patnaik S. Part of speech tagging in odia using support vector machine.
Procedia Computer Science. 2015 Jan 1;48:507-12.
[3] Stolcke A, Ries K, Coccaro N, Shriberg E, Bates R, Jurafsky D, Taylor P, Martin R, Ess-Dykema
CV, Meteer M. Dialogue act modeling for automatic tagging and recognition of conversational
speech. Computational linguistics. 2000 Sep 1;26(3):339-73.
[4] Seikaly ZA. The Arabic Language: The Glue that Binds the Arab World. AMIDEAST, America-Mideast Educational and Training Services, Inc. 2007
[5] Elhadi MT, Alfared RA. Adopting Arabic Taggers to Annotate a Libyan Dialect Text with a Pre-Tagging Processing and Term Substitutions. International Journal of Science and Technology, Special Edition. Feb 2022.
[6] Elhadi MT. Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents.
Journal of Advances in Computer Engineering and Technology. 2019 May 1;5(2):117-28.
[7] Agirre E, Edmonds P, editors. Word sense disambiguation: Algorithms and applications. Springer
Science & Business Media; 2007 Nov 16.
[8] al-Khūlī, M.A. al-Tarākīb al-Shāiʽah fī al-Lughah al-ʽArabīyah: Dirasah Iḥṣā’īyah .ʻAmmām: Dār
al-Falāḥ lil-Nashr wa-al-Tawzīʽ. (1998).
[9] Husni R, Zaher A. Working with Arabic prepositions: Structures and functions. Routledge; 2020 Mar
2.
[10] Marsi E, Van Den Bosch A, Soudi A. Memory-based morphological analysis generation and part-of-speech tagging of Arabic. In Proceedings of the ACL workshop on computational approaches to Semitic languages 2005 Jun (pp. 1-8).
[11] Stokoe C, Oakes MP, Tait J. Word sense disambiguation in information retrieval revisited. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval 2003 Jul 28 (pp. 159-166).
[12] Atkins BT. Tools for computer-aided corpus lexicography: the Hector project. Acta Linguistica Hungarica. 1992 Jan 1;41(1/4):5-71.
[13] Jacquemin B, Brun C, Roux C. Enriching a text by semantic disambiguation for information extraction. arXiv preprint cs/0506048. 2005 Jun 12.
[14] Chiche A, Yitagesu B. Part of speech tagging: a systematic review of deep learning and machine
learning approaches. Journal of Big Data. 2022 Dec;9(1):1-25.
[15] Schneider, S. The biggest data challenges that you might not even know you have,
https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/
[16] Masand B, Linoff G, Waltz D. Classifying news stories using memory based reasoning. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval 1992 Jun 1 (pp. 59-65).
[17] Zaghouani W. Critical survey of the freely available Arabic corpora. arXiv preprint
arXiv:1702.07835. 2017 Feb 25.
[18] Bahl, L.R., Mercer, R.L.: Part-of-speech assignment by a statistical decision algorithm. In: Proceedings of the IEEE International Symposium on Information Theory, pp. 88-89. IEEE Computer Society Press, Los Alamitos (1976)
[19] Sanger N. The computational analysis of English: A Corpus-based approach: R. Garside; G. Leech;
and G. Sampson (Eds.). Longman, London and New York (1987).
[20] Marshall I. Choice of grammatical word-class without global syntactic analysis: tagging words in the
LOB corpus. Computers and the Humanities. 1983 Sep 1:139-50.
[21] Church KW. A stochastic parts program and noun phrase parser for unrestricted text. In International Conference on Acoustics, Speech, and Signal Processing, 1989 May 23 (pp. 695-698). IEEE.
[22] Cutting D, Kupiec J, Pedersen J, Sibun P. A practical part-of-speech tagger. In Third conference on applied natural language processing 1992 Mar (pp. 133-140).
[23] Kupiec J. Robust part-of-speech tagging using a hidden Markov model. Computer speech & lan-
guage. 1992 Jul 1;6(3):225-42.
[24] Weischedel R, Schwartz R, Palmucci J, Meteer M, Ramshaw L. Coping with ambiguity and unknown
words through probabilistic models. Using Large Corpora. 1994:323-6.
[25] Merialdo B. Tagging English text with a probabilistic model. Computational linguistics.
1994;20(2):155-71.
[26] Khoja S. APT: An automatic Arabic part-of-speech tagger (Doctoral dissertation, Lancaster Univer-
sity). 2003
[27] Alqrainy S. A morphological-syntactical analysis approach for Arabic textual tagging (Doctoral dis-
sertation, De Montfort University). 2008
[28] McEnery AM. Computational Linguistics: A handbook and toolbox for natural language processing.
Coronet Books Incorporated; 1992.
[29] Harris Z. String analysis of language structure. Mouton and Co., The Hague. 1962.
[30] Klein S, Simmons RF. A computational approach to grammatical coding of English words. Journal of
the ACM (JACM). 1963 Jul 1;10(3):334-47.
[31] Greene BB, Rubin GM. Automatic grammatical tagging of English. Department of Linguistics,
Brown University; 1971.
[32] Brill E. A simple rule-based part of speech tagger. Department of Computer and Information Science,
University of Pennsylvania; 1992 Jan 1.
[33] Abney S. Part-of-speech tagging and partial parsing. In Corpus-based methods in language and speech
processing 1997 (pp. 118-136). Springer, Dordrecht.
[34] Chanod JP, Tapanainen P. Tagging French--comparing a statistical and a constraint-based method.
arXiv preprint cmp-lg/9503003. 1995 Mar 2.
[35] Volk M, Schneider G. Comparing a statistical and a rule-based tagger for German. arXiv preprint
cs/9811016. 1998 Nov 11.
[36] Schmid H. Part-of-speech tagging with neural networks. arXiv preprint cmp-lg/9410018. 1994 Oct
24.
[37] Perez-Ortiz JA, Forcada ML. Part-of-speech tagging with recurrent neural networks. In IJCNN'01.
International Joint Conference on Neural Networks. Proceedings (Cat. No. 01CH37222) 2001 Jul 15
(Vol. 3, pp. 1588-1592). IEEE.
[38] Bosch AV, Marsi E, Soudi A. Memory-based morphological analysis and part-of-speech tagging of
Arabic. In Arabic Computational Morphology 2007 (pp. 201-217). Springer, Dordrecht.
[39] Cover T, Hart P. Nearest neighbor pattern classification. IEEE transactions on information theory.
1967 Jan;13(1):21-7.
[40] Zavrel J, Daelemans W. Recent advances in memory-based part-of-speech tagging. In VI Simposio
Internacional de Comunicacion Social 1999 (pp. 590-597).
[41] Ramani D. A short survey on memory based reinforcement learning. arXiv preprint
arXiv:1904.06736. 2019 Apr 14.
[42] Angle S, Mishra P, Sharma DM. Automated Error Correction and Validation for POS Tagging of
Hindi. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and computa-
tion 2018.
[43] Kanakaraddi SG, Nandyal SS. Survey on parts of speech tagger techniques. In 2018 International
Conference on Current Trends towards Converging Technologies (ICCTCT) 2018 Mar 1 (pp. 1-6).
IEEE.
[44] Katz G, Diab M, editors. Introduction to the Special Issue on Arabic Computational Linguistics.
ACM Transactions on Asian Language Information Processing (TALIP). 2011 Mar 1;10(1):1-4.
[45] Toutanova K, Manning CD. Enriching the knowledge sources used in a maximum entropy
part-of-speech tagger. In 2000 Joint SIGDAT conference on Empirical methods in natural language
processing and very large corpora 2000 Oct (pp. 63-70).
[46] Manning CD. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Interna-
tional conference on intelligent text processing and computational linguistics 2011 Feb 20 (pp.
171-189). Springer, Berlin, Heidelberg.
[47] Brill E. Transformation-based error-driven learning and natural language processing: A case study in
part-of-speech tagging. Computational linguistics. 1995;21(4):543-65.
[48] Khoja S. APT: Arabic part-of-speech tagger. In Proceedings of the Student Workshop at NAACL
2001 Jun (pp. 20-25).
[49] Schmid H. TreeTagger: a language independent part-of-speech tagger.
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/. 1994
[50] Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS,
Zhou ZH. Top 10 algorithms in data mining. Knowledge and information systems. 2008
Jan;14(1):1-37.
[51] Alqrainy S, AlSerhan HM, Ayesh A. Pattern-based algorithm for part-of-speech tagging. In Interna-
tional Conference on Computer Engineering and Systems, ICCES 2008 (pp. 119-124).
[52] Yousif JH, Sembok TM. Arabic part-of-speech tagger based on Support Vector Machines. In 2008
International Symposium on Information Technology 2008 Aug 26 (Vol. 3, pp. 1-7). IEEE.
[53] Labidi M. New Combined Method to Improve Arabic POS Tagging. Journal of Autonomous Intelli-
gence. 2019 Jan 8;1(2):23-8.
[54] Sánchez-Villamil E, Forcada ML, Carrasco RC. Unsupervised training of a finite-state sliding-window
part-of-speech tagger. In International Conference on Natural Language Processing (in Spain)
2004 Oct 20 (pp. 454-463). Springer, Berlin, Heidelberg.
[55] Elhadj YO. Statistical part-of-speech tagger for traditional Arabic texts. Journal of computer science.
2009;5(11):794.
[56] Diab M, Hacioglu K, Jurafsky D. Automatic tagging of Arabic text: From raw text to base phrase
chunks. In Proceedings of HLT-NAACL 2004: Short papers 2004 (pp. 149-152).
[57] Al Shamsi F, Guessoum A. A hidden Markov model-based POS tagger for Arabic. In Proceedings of
the 8th international conference on the statistical analysis of textual data, France 2006 (pp. 31-42).
[58] Alqrainy S, Muaidi H, Alkoffash MS. Context-free grammar analysis for Arabic sentences. Interna-
tional Journal of Computer Applications. 2012 Jan 1;53(3).
[59] Othmane CZ, Fraj FB, Limam I. POS-tagging Arabic texts: A novel approach based on ant colony.
Natural Language Engineering. 2017 May;23(3):419-39.
[60] Mohamed E, Kübler S. Arabic part of speech tagging. In Proceedings of the Seventh International
Conference on Language Resources and Evaluation (LREC'10) 2010 May.
[61] Alqrainy S. A morphological-syntactical analysis approach for Arabic textual tagging (Doctoral dis-
sertation, De Montfort University). 2008
[62] Kučera H, Francis WN. Computational analysis of present-day American English. Dartmouth; 1967.
[63] Al-Taani AT, Al-Rub SA. A rule-based approach for tagging non-vocalized Arabic words. Int.
Arab J. Inf. Technol. 2009 Jul 1;6(3):320-8.
[64] AbuZeina D, Al-Khatib W, Elshafei M, Al-Muhtaseb H. Toward enhanced Arabic speech recognition
using part of speech tagging. International Journal of Speech Technology. 2011 Dec;14(4):419-26.
[65] Tlili-Guiassa Y. Hybrid method for tagging Arabic text. Journal of Computer Science.
2006;2(3):245-8.
[66] Hadni M, Ouatik SA, Lachkar A, Meknassi M. Hybrid part-of-speech tagger for non-vocalized Ara-
bic text. Int. J. Nat. Lang. Comput. 2013 Dec;2(6):1-5.
[67] Calciu RH. Semantic change in the age of corpus linguistics. Journal of Humanistic and Social Stud-
ies. 2012;3(1):45-58.
[68] Ali BB, Jarray F. Genetic approach for Arabic part of speech tagging. arXiv preprint
arXiv:1307.3489. 2013 Jul 11.
[69] El Hadj Y, Al-Sughayeir I, Al-Ansari A. Arabic part-of-speech tagging using the sentence structure.
In Proceedings of the Second International Conference on Arabic Language Resources and Tools,
Cairo, Egypt 2009 Apr (pp. 241-5).
[70] Kübler S, Mohamed E. Part of speech tagging for Arabic. Natural Language Engineering. 2012
Oct;18(4):521-48.
[71] Bird S, Klein E, Loper E. Natural language processing with Python: analyzing text with the natural
language toolkit. O'Reilly Media, Inc.; 2009 Jun 12.
[72] Stenström AB, Breivik LE. The Bergen corpus of London teenager language (COLT). ICAME jour-
nal. 1993 Apr;17:128.
[73] MacWhinney B. Child Language Data Exchange System. Transcript Analysis. 1984 Aug;1(1):2-18.
[74] Aarts J, Van Halteren H, Oostdijk N. The linguistic annotation of corpora: The TOSCA analysis sys-
tem. International journal of corpus linguistics. 1998 Jan 1;3(2):189-210.
[75] Taylor A, Marcus M, Santorini B. The Penn treebank: an overview. Treebanks. 2003:5-22.
[76] Saad MK, Ashour W. OSAC: Open-Source Arabic Corpus. In 6th ArchEng International Symposiums,
EEECS'10, the 6th International Symposium on Electrical and Electronics Engineering and Computer
Science, European University of Lefke, Cyprus, 2010.
[77] Sklyarova DD. The History of Corpus Linguistics Development.
[78] Bani-Ismail B, Al-Rababah K, Shatnawi S. The effect of full word, stem, and root as index-term on
Arabic information retrieval. Global Journal of Computer Science and Technology. 2011.
[79] Zerrouki T, Balla A. Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization
systems. Data in brief. 2017 Apr 1;11:147-51.