The part-of-speech (POS) tagger plays an important role in natural language applications such as speech recognition, natural language parsing, information retrieval, and multi-word term extraction. This study proposes an efficient and accurate POS tagging technique for the Arabic language using a statistical approach. The Arabic rule-based method suffers from misclassified and unanalyzed words due to ambiguity. To overcome these two problems, we propose a Hidden Markov Model (HMM) integrated with the Arabic rule-based method. Our POS tagger generates a set of four POS tags: Noun, Verb, Particle, and Quranic Initial (INL). The proposed technique uses the contextual information of the words together with a variety of features that help predict the various POS classes. To evaluate its accuracy, the proposed method has been trained and tested on the Holy Quran Corpus, containing 77,430 terms of undiacritized Classical Arabic. The experimental results demonstrate the efficiency of our method for Arabic POS tagging: the obtained accuracies are 97.6% for our method and 94.4% for the rule-based tagger.
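The HMM component described above can be made concrete with a small sketch. The toy tag set, transition probabilities, and emission probabilities below are invented for illustration (they are not the paper's trained parameters); the Viterbi decoder itself is the standard dynamic-programming algorithm an HMM tagger relies on.

```python
# Minimal Viterbi decoder for an HMM POS tagger (illustrative sketch only;
# all probabilities below are hypothetical toy values).

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under the HMM."""
    # V[t][tag] = best probability of any tag path ending in `tag` at step t.
    # Unknown words get a tiny smoothing probability instead of zero.
    V = [{tag: start_p[tag] * emit_p[tag].get(words[0], 1e-6) for tag in tags}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for tag in tags:
            prev_best = max(tags, key=lambda p: V[t - 1][p] * trans_p[p][tag])
            V[t][tag] = (V[t - 1][prev_best] * trans_p[prev_best][tag]
                         * emit_p[tag].get(words[t], 1e-6))
            back[t][tag] = prev_best
    # Trace back the best path from the final column.
    last = max(tags, key=lambda tag: V[-1][tag])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

tags = ["NOUN", "VERB", "PART"]
start_p = {"NOUN": 0.5, "VERB": 0.3, "PART": 0.2}
trans_p = {
    "NOUN": {"NOUN": 0.2, "VERB": 0.6, "PART": 0.2},
    "VERB": {"NOUN": 0.6, "VERB": 0.1, "PART": 0.3},
    "PART": {"NOUN": 0.7, "VERB": 0.2, "PART": 0.1},
}
emit_p = {
    "NOUN": {"book": 0.4, "time": 0.3},
    "VERB": {"book": 0.1, "read": 0.5},
    "PART": {"in": 0.6},
}
print(viterbi(["book", "in", "time"], tags, start_p, trans_p, emit_p))
# ['NOUN', 'PART', 'NOUN']
```

In the hybrid setup the abstract describes, the rule-based tagger's output would constrain or seed such a decoder; here the decoder runs on the toy model alone.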
Hybrid part-of-speech tagger for non-vocalized Arabic text (IJNLC)
The document presents a hybrid part-of-speech tagging method for Arabic text that combines rule-based and statistical approaches. Rule-based tagging alone can misclassify words and leave some untagged, so the method integrates it with a Hidden Markov Model tagger. The hybrid approach is evaluated on two Arabic corpora and achieves accuracy rates of 97.6% and 98%, outperforming the individual rule-based and HMM taggers.
MULTI-WORD TERM EXTRACTION BASED ON A NEW HYBRID APPROACH FOR THE ARABIC LANGUAGE (csandit)
Arabic Multiword Terms (AMWTs) are relevant strings of words in text documents. Once they are automatically extracted, they can be used to increase the performance of text mining applications such as categorization, clustering, information retrieval, machine translation, and summarization. This paper introduces our proposed multiword term extraction system based on contextual information. We propose a new method based on a hybrid approach for Arabic multiword term extraction. Like other hybrid methods, our method is composed of two main steps: a linguistic one and a statistical one. In the first step, the linguistic approach uses a Part-of-Speech (POS) tagger (Taani's tagger) and the sequence identifier as patterns in order to extract candidate AMWTs. The second step, which includes our main contribution, incorporates contextual information through a newly proposed association measure based on termhood and unithood for AMWT extraction. To evaluate the efficiency of the proposed method, it has been tested and compared using three different association measures: the proposed one, named NTC-Value, as well as NC-Value and C-Value. Experimental results on Arabic texts taken from the environment domain show that our hybrid method outperforms the others in terms of precision; in addition, it can deal correctly with tri-gram Arabic multiword terms.
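One of the baseline measures named above, C-Value, can be sketched as follows. The candidate terms and frequencies are invented for illustration, and the paper's NTC-Value extension (which adds contextual termhood/unithood information) is not reproduced here.

```python
import math

# Illustrative C-Value computation for candidate multiword terms.
# C-Value promotes longer candidates and discounts a candidate's frequency
# by the average frequency of longer candidates that contain it.

def c_value(candidates):
    """candidates: {term_tuple: frequency}. Returns {term_tuple: score}."""
    scores = {}
    for term, freq in candidates.items():
        # Frequencies of longer candidates containing `term` as a
        # contiguous sub-sequence.
        containers = [f for other, f in candidates.items()
                      if len(other) > len(term)
                      and any(other[i:i + len(term)] == term
                              for i in range(len(other) - len(term) + 1))]
        length_weight = math.log2(len(term))
        if containers:
            scores[term] = length_weight * (freq - sum(containers) / len(containers))
        else:
            scores[term] = length_weight * freq
    return scores

cands = {
    ("information", "retrieval"): 10,          # hypothetical bi-gram candidate
    ("information", "retrieval", "system"): 4, # hypothetical tri-gram candidate
}
scores = c_value(cands)
print(scores[("information", "retrieval")])   # 6.0 = log2(2) * (10 - 4)
```

The nested bi-gram is discounted because part of its frequency is explained by the tri-gram that contains it, which is exactly the behaviour a termhood measure needs when moving beyond bi-grams.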
EFFECTIVE ARABIC STEMMER-BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION (IJDKP)
The document presents a new hybrid stemming algorithm for Arabic text categorization. It combines three existing stemming approaches: root-based (Khoja stemmer), stem-based (Larkey stemmer), and statistical (N-gram). The hybrid approach first normalizes the text and removes stop words and prefixes/suffixes. It then uses bigram similarity and the Dice measure to find valid roots by comparing word bigrams to a root dictionary. The algorithm is evaluated using naive Bayesian and SVM classifiers in an Arabic text categorization system, and is found to outperform the individual stemming methods. The proposed stemmer improves the performance of Arabic text categorization.
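The bigram-plus-Dice step can be sketched as below. For readability the strings are shown in Latin transliteration (the actual stemmer works on Arabic script), and "maktab" versus the roots "ktb" and "drs" are hypothetical dictionary entries, not taken from the paper.

```python
# Sketch of root selection via character bigrams and the Dice measure
# (hypothetical transliterated examples; the real stemmer uses Arabic script).

def bigrams(word):
    """Set of character bigrams, e.g. 'ktb' -> {'kt', 'tb'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice similarity between the bigram sets of two strings."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

# Pick the dictionary root whose bigrams best match the normalized word.
roots = ["ktb", "drs"]          # hypothetical root dictionary
word = "maktab"                 # hypothetical word after affix stripping
best = max(roots, key=lambda r: dice(word, r))
print(best)  # ktb
```

Sharing the bigram "kt" is enough to prefer "ktb" over "drs" here; in the full algorithm this comparison runs against the whole root dictionary after normalization and affix removal.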
Hybrid approaches for automatic vowelization of Arabic texts (IJNLC)
Hybrid approaches for automatic vowelization of Arabic texts are presented in this article. The process is made up of two modules. In the first, a morphological analysis of the text words is performed using the open-source morphological analyzer AlKhalil Morpho Sys. The outputs for each word, analyzed out of context, are its different possible vowelizations. Integrating this analyzer into our vowelization system required the addition of a lexical database containing the most frequent words in the Arabic language. Using a statistical approach based on two hidden Markov models (HMMs), the second module aims to eliminate the ambiguities. For the first HMM, the unvowelized Arabic words are the observed states and the vowelized words are the hidden states. The observed states of the second HMM are identical to those of the first, but the hidden states are the lists of possible diacritics of each word without its Arabic letters. Our system uses the Viterbi algorithm to select the optimal path among the solutions proposed by AlKhalil Morpho Sys. Our approach opens an important way to improve the performance of automatic vowelization of Arabic texts for other uses in automatic natural language processing.
A New Approach to Parts-of-Speech Tagging in Malayalam (IJCSIT)
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag indicates the word's usage in the sentence. Usually, these tags denote a syntactic classification such as noun or verb, and sometimes include additional information such as case markers (number, gender, etc.) and tense markers. A large number of current language processing systems use a parts-of-speech tagger for pre-processing.
There are mainly two approaches to parts-of-speech tagging: the rule-based approach and the stochastic approach. The rule-based approach uses predefined handwritten rules. It is the oldest approach and uses a lexicon or dictionary for reference. The stochastic approach uses probabilistic and statistical information to assign tags to words. It uses a large corpus, so its time and space complexity are high, whereas the rule-based approach has lower complexity for both. The stochastic approach is the more widely used one nowadays because of its accuracy.
Malayalam belongs to the Dravidian family of languages; it is inflectional, with suffixes attached to root word forms. The currently used algorithms are efficient machine learning algorithms, but they are not built for Malayalam, which affects the accuracy of Malayalam POS tagging.
The proposed approach uses dictionary entries along with adjacent-tag information. The algorithm uses multithreading. Tagging is done using the probability of occurrence of the sentence structure along with the dictionary entry.
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM (IJNLC)
The document describes a machine transliteration system that transliterates Hindi and Marathi names and words to English using support vector machines (SVM). It segments source language names into phonetic units, and trains an SVM classifier using phonetic units and n-grams as features to label each unit with its English transliteration. The system achieves good accuracy for Hindi-English and Marathi-English transliteration.
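A common way to set up such a per-unit SVM labeller is to describe each phonetic unit by itself plus a window of neighbouring units. The segmentation and window size below are hypothetical, and the SVM training itself (done with any standard library) is omitted; only the feature extraction is sketched.

```python
# Sketch of context features for an SVM-based unit labeller
# (hypothetical segmentation; the paper's actual feature set may differ).

def unit_features(units, i, window=2):
    """Feature dict for the i-th phonetic unit: the unit plus its neighbours."""
    feats = {"unit": units[i]}
    for off in range(1, window + 1):
        # Boundary markers stand in for missing context at the edges.
        feats[f"prev{off}"] = units[i - off] if i - off >= 0 else "<s>"
        feats[f"next{off}"] = units[i + off] if i + off < len(units) else "</s>"
    return feats

units = ["sa", "chi", "n"]   # hypothetical phonetic segmentation of a name
f = unit_features(units, 1)
print(f)
```

Each such feature dict would be vectorized and paired with the unit's English transliteration as the class label; the classifier then labels unseen units one at a time.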
A New Concept Extraction Method for Ontology Construction From Arabic Text (CSCJournals)
Ontology is one of the most popular representation models used for knowledge representation, sharing, and reuse. The Arabic language has complex morphological, grammatical, and semantic aspects. Due to this complexity, automatic Arabic terminology extraction is difficult. In addition, concept extraction from Arabic documents is a challenging research area because, as opposed to term extraction, concept extraction is more domain-related and more selective. In this paper, we present a new concept extraction method for Arabic ontology construction, which is part of our ontology construction framework. A new method to extract domain-relevant single-word and multi-word concepts has been proposed, implemented, and evaluated. Our method combines linguistic information, statistical information, and domain knowledge. It first uses linguistic patterns based on POS tags to extract concept candidates, and then a stop-word filter is applied to remove unwanted strings. To determine the relevance of these candidates within the domain, different statistical measures and a new domain relevance measure are implemented for the first time for the Arabic language. To enhance the performance of concept extraction, domain knowledge is integrated into the module. The concept scores are calculated according to their statistical values and domain knowledge values. To evaluate the performance of the method, precision scores were calculated. The results show the high effectiveness of the proposed approach for extracting concepts for Arabic ontology construction.
Grapheme-to-Phoneme Tools for Marathi Speech Synthesis (IJERA)
We describe in detail a Grapheme-to-Phoneme (G2P) converter required for the development of a good quality
Marathi Text-to-Speech (TTS) system. The Festival and Festvox framework is chosen for developing the
Marathi TTS system. Since Festival does not provide complete language processing support specific to various
languages, it needs to be augmented to facilitate the development of TTS systems in certain new languages.
Because of this, a generic G2P converter has been developed. In the customized Marathi G2P converter, we
have handled schwa deletion and compound word extraction. In the experiments carried out to test the Marathi
G2P on a text segment of 2485 words, 91.47% word phonetisation accuracy is obtained. This Marathi G2P has
been used for phonetising large text corpora which in turn is used in designing an inventory of phonetically rich
sentences. The sentences ensured a good coverage of the phonetically valid di-phones using only 1.3% of the
complete text corpora.
Considerable interest has been given to multiword expression (MWE) identification and treatment. The identification of MWEs affects the quality of results in tasks heavily used in natural language processing (NLP), such as parsing and generation. Different approaches to MWE identification have been applied, such as statistical methods, which are employed as an inexpensive and language-independent way of finding co-occurrence patterns. Another approach relies on linguistic methods for identification, which employ information such as part-of-speech (POS) filters and lexical alignment between languages, and has produced more targeted candidate lists. This paper presents a framework for extracting Arabic MWEs (nominal or verbal) for bi-grams using a hybrid approach. The proposed approach starts by applying a statistical method and then utilizes linguistic rules to enhance the results by extracting only patterns that match a relevant language rule. The proposed hybrid approach outperforms other traditional approaches.
ATAR: Attention-based LSTM for Arabizi transliteration (IJECE)
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale "Arabizi to Arabic script" parallel corpus focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).
A novel method for Arabic multi-word term extraction (IJDMS)
Arabic Multiword Terms (AMWTs) are relevant strings of words in text documents. Once they are automatically extracted, they can be used to increase the performance of Arabic text mining applications such as categorization, clustering, information retrieval, machine translation, and summarization. The proposed methods for AMWT extraction can mainly be categorized into three approaches: linguistic-based, statistic-based, and hybrid-based. These methods present some drawbacks that limit their use: they can only deal with bi-gram terms, and they do not yield good accuracies. In this paper, to overcome these drawbacks, we propose a new and efficient method for AMWT extraction based on a hybrid approach. The latter is composed of two main filtering steps: a linguistic filter and a statistical one. The linguistic filter uses our proposed Part-of-Speech (POS) tagger and the sequence identifier as patterns in order to extract candidate AMWTs, while the statistical filter incorporates contextual information and a newly proposed association measure based on termhood and unithood estimation, named NTC-Value.
To evaluate and illustrate the efficiency of the proposed method for AMWT extraction, a comparative study has been conducted on the Kalimat Corpus using nine experiment schemes. In the linguistic filter, we used three POS taggers: Taani's rule-based method, an HMM-based statistical method, and our recently proposed hybrid tagger. In the statistical filter, we used three statistical measures: C-Value, NC-Value, and our proposed NTC-Value. The obtained results demonstrate the efficiency of our proposed method for AMWT extraction: it outperforms the others and can deal correctly with tri-gram terms.
This document describes a study on improving Tamil-English cross-language information retrieval through transliteration generation and mining techniques. The study achieved a peak Mean Average Precision of 0.5133 for monolingual English retrieval and 0.4145 for Tamil-English cross-language retrieval, representing an improvement over baselines without handling out-of-vocabulary terms. Transliteration mining performed better than generation at resolving out-of-vocabulary terms and boosting retrieval performance.
A Marathi Hidden-Markov Model Based Speech Synthesis System (iosrjce)
DICTIONARY-BASED AMHARIC-ARABIC CROSS-LANGUAGE INFORMATION RETRIEVAL (csandit)
The demand for multilingual information is becoming pressing as the number of internet users throughout the world escalates, which creates the problem of retrieving documents in one language by specifying the query in another language. This increasing demand can be addressed by designing automatic tools that accept the query in one language and retrieve the relevant documents in other languages. We have developed a prototype Amharic-Arabic Cross-Language Information Retrieval system by applying a dictionary-based approach that enables users to retrieve relevant documents from an Amharic-Arabic corpus by entering the query in Amharic and retrieving the relevant documents in both Amharic and Arabic.
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENTENCES (kevig)
This study proposes a method to develop neural models of the morphological analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not have word delimiters between words. Hiragana is a type of Japanese phonogramic characters, which is used for texts for children or people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information for dividing. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning using a model based on ordinary Japanese text and examined the influence of training data on texts of various genres.
Improving accuracy of part-of-speech (POS) tagging using hidden Markov model ... (IJECE)
In Natural Language Processing (NLP), word segmentation and Part-of-Speech (POS) tagging are fundamental tasks. POS information is also necessary in NLP preprocessing for applications such as machine translation (MT) and information retrieval (IR). Currently, there are many research efforts in word segmentation and POS tagging developed separately, with different methods, to obtain high performance and accuracy. For the Myanmar language, there are also separate word segmentors and POS taggers based on statistical approaches such as Neural Networks (NNs) and Hidden Markov Models (HMMs). But, due to the Myanmar language's complex morphological structure, the OOV problem still exists. To avoid errors and improve segmentation by utilizing POS data, segmentation and labeling should be performed at the same time. The main goal of developing a POS tagger for any language is to improve tagging accuracy and remove ambiguity in sentences due to the language structure. This paper focuses on developing word segmentation and POS tagging for the Myanmar language, and presents a comparison of separate word segmentation and POS tagging with joint word segmentation and POS tagging.
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITCHED SPEECH (kevig)
In this paper, phoneme sequences are used as language information to perform code-switched language
identification (LID). With the one-pass recognition system, the spoken sounds are converted into
phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple
languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity
among our target languages, we reported two methods of phoneme mapping. Statistical phoneme-based
bigram language models (LM) are integrated into speech decoding to eliminate possible phone
mismatches. The supervised support vector machine (SVM) is used to learn to recognize the phonetic
information of mixed-language speech based on recognized phone sequences. As the back-end decision is
taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to
classify language identity. The speech corpus was tested on Sepedi and English languages that are often
mixed. Our system is evaluated by measuring both the ASR performance and the LID performance
separately. The systems have obtained a promising ASR accuracy with data-driven phone merging
approach modelled using 16 Gaussian mixtures per state. In code-switched speech and monolingual
speech segments respectively, the proposed systems achieved an acceptable ASR and LID accuracy.
Compression-Based Parts-of-Speech Tagger for the Arabic Language (CSCJournals)
This paper explores the use of compression-based models to train a Part-of-Speech (POS) tagger for the Arabic language. The newly developed tagger is based on the Prediction-by-Partial-Matching (PPM) compression system, which has already been employed successfully in several NLP tasks. Several models were trained for the new tagger: the first models were trained using silver-standard data from two different Arabic POS taggers, and the second model utilised the BAAC corpus, a 50K-term manually annotated MSA corpus, on which the PPM tagger achieved an accuracy of 93.07%. The tag-based models were also utilised to evaluate the performance of the new tagger, by first tagging different Classical Arabic and Modern Standard Arabic corpora and then compressing the text using tag-based compression models. The results show that the use of silver-standard models led to a reduction in tag-based compression quality by an average of 0.43%, whereas the use of the gold-standard model increased tag-based compression quality by an average of 4.61% when used to tag Modern Standard Arabic text.
Brill's Rule-based Part-of-Speech Tagger for Kadazan (idescitation)
This paper presents a Part of Speech (POS) tagger for the Kadazan language implemented using Brill's approach, also known as Transformation-Based Error-Driven Learning. Kadazan was chosen because no POS tagger has yet been developed for this language. This study was therefore carried out to develop a POS tagger for Kadazan that can tag a Kadazan corpus systematically, help reduce the ambiguity problem, and at the same time serve as a language-learning tool. The main objective of this study is to automate the tagging process for the Kadazan language. Brill's approach is an enhanced version of the original rule-based approach in which tags are transformed according to a set of predefined rules: wrong tags in the corpus are rewritten into correct ones. To achieve this goal, several objectives were set: to create specific lexical and contextual rules for Kadazan by applying Brill's rule-based approach, and to evaluate the effectiveness of the resulting Kadazan POS tagger. The tagging process is divided into four main phases. In the first phase, a new untagged text is input into the system. In the second phase, the input text passes through the initial-state annotator, which tags every word in the corpus with its most likely tag and produces a temporary corpus. In the third phase, the temporary corpus is compared to the goal corpus to detect any errors. In the last phase, the rules are applied to reduce those errors and fix the temporary corpus. The tagger was trained on two Kadazan children's story books containing 2,069 words, and evaluated by comparing its output with manual tagging. The Kadazan Part of Speech Tagger achieved around 93% accuracy. This study has shown how Brill's tagging approach can be used to identify tags for the Kadazan language.
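The four phases above can be sketched in a few lines; the lexicon and the single rule template (retag a word based on the previous word's tag) are illustrative simplifications of Brill's full lexical and contextual rule sets:

```python
def brill_tag(words, lexicon, rules, default="NOUN"):
    """Sketch of Brill's two key phases: an initial-state annotator
    assigns each word its most likely tag from a lexicon, then
    contextual rules rewrite wrong tags into correct ones."""
    tags = [lexicon.get(w, default) for w in words]
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            # rule template: change from_tag to to_tag when the
            # preceding word's tag is prev_tag
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags
```

For example, with the rule ("VERB", "NOUN", "DET"), the word "run" initially tagged VERB would be retagged NOUN when it follows a determiner, mirroring how Brill's learned rules patch annotator errors.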
Improving performance of english hindi cross language information retrieval u...ijnlc
The main issue in Cross Language Information Retrieval (CLIR) is poor retrieval performance, in terms of average precision, compared to monolingual retrieval. The main reasons are mismatched query terms, lexical ambiguity, and untranslated query terms. These problems need to be addressed in order to increase the performance of CLIR systems. In this paper, we propose an algorithm for improving the performance of an English-Hindi CLIR system: we generate all possible combinations of the Hindi translated query using transliteration of the English query terms and choose the best query among them for document retrieval. The experiment is performed on the FIRE 2010 (Forum for Information Retrieval Evaluation) datasets. The experimental results show that the proposed approach improves the performance of English-Hindi CLIR, helps overcome the existing problems, and outperforms the existing English-Hindi CLIR system in terms of average precision.
A Novel Approach for Rule Based Translation of English to Marathiaciijournal
This paper presents the design of a rule-based machine translation system for the English-Marathi language pair. The system takes an English sentence as input and parses it with the Stanford parser, which handles the main source-side processing. An English-Marathi bilingual dictionary is created; the system takes the parsed output, separates the source text word by word, and looks up the corresponding target words in the bilingual dictionary. Hand-coded rules handle Marathi inflections, and reordering rules are also provided. After the reordering rules are applied, the English sentence is syntactically reordered to suit the Marathi language.
Design and Development of a Malayalam to English Translator- A Transfer Based...Waqas Tariq
This paper describes a transfer-based scheme for translating Malayalam, a Dravidian language, into English. The system takes Malayalam sentences as input and outputs equivalent English sentences. It comprises a preprocessor for splitting compound words, a morphological parser for context disambiguation and chunking, a syntactic-structure transfer module, and a bilingual dictionary. All modules are morpheme-based to reduce dictionary size. The system does not rely on a stochastic approach; it is built on a rule-based architecture along with various linguistic knowledge components of both Malayalam and English. It uses two sets of rules: rules for Malayalam morphology and rules for syntactic-structure transfer from Malayalam to English. The system is designed using artificial-intelligence techniques.
Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivative languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots can be extracted from the same word. Distributional semantics, on the other hand, is a powerful co-occurrence model that captures the meaning of a word based on its context. In this paper, a distributional-semantics model using Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming-analysis task. It achieved an accuracy of 81.5%, at least a 9.4% improvement over other stemmers.
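A minimal sketch of how Smoothed PMI can be computed over (word, context) co-occurrence pairs follows; the smoothing exponent alpha = 0.75 and the pair representation are assumptions, since the abstract does not specify the exact formulation:

```python
import math
from collections import Counter

def smoothed_pmi(pairs, alpha=0.75):
    """Build an SPMI scorer from (word, context) co-occurrence pairs.
    Context-distribution smoothing (alpha < 1) inflates the probability
    of rare contexts, damping PMI's bias toward them."""
    pair_c = Counter(pairs)
    word_c = Counter(w for w, _ in pairs)
    ctx_c = Counter(c for _, c in pairs)
    total = sum(pair_c.values())
    ctx_norm = sum(v ** alpha for v in ctx_c.values())

    def spmi(w, c):
        if pair_c[(w, c)] == 0:
            return float("-inf")
        p_wc = pair_c[(w, c)] / total           # joint probability
        p_w = word_c[w] / total                 # word marginal
        p_c = (ctx_c[c] ** alpha) / ctx_norm    # smoothed context marginal
        return math.log(p_wc / (p_w * p_c))

    return spmi
```

With alpha = 1 this reduces to plain PMI; with alpha < 1, scores for rare contexts are pulled down, which is the property the smoothed variant exploits.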
Survey on Indian CLIR and MT systems in Marathi LanguageEditor IJCATR
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from
the language of user’s query. This helps users to express the information need in their native languages. Machine translation based (MTbased)
approach of CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the
research work done in CLIR and MT systems for Marathi language in India.
Named Entity Recognition using Hidden Markov Model (HMM)kevig
Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP), a branch of artificial intelligence. It has many applications, mainly in machine translation, text-to-speech synthesis, natural language understanding, information extraction, information retrieval, question answering, etc. The aim of NER is to classify words into predefined categories such as location name, person name, organization name, date, and time. In this paper we describe in detail a Hidden Markov Model (HMM) based machine-learning approach to identifying named entities. The main motivation for using an HMM to build an NER system is that it is language independent, so the system can be applied to any language domain. In our NER system the states are not fixed; they are dynamic, and one can define them according to one's interest. The corpus used by our NER system is also not domain specific.
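The decoding step behind an HMM-based NER system is standard Viterbi search for the most likely tag sequence; the toy states and the transition and emission tables in the test are hypothetical:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-8):
    """Viterbi decoding: most likely hidden-state (entity-tag) sequence
    for an observed word sequence under an HMM, in log space."""
    # V[t][s] = (best log score ending in state s at step t, best predecessor)
    V = [{s: (math.log(start_p[s] * emit_p[s].get(obs[0], floor)), None)
          for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prev = max(states,
                       key=lambda p: V[t - 1][p][0] + math.log(trans_p[p][s]))
            score = (V[t - 1][prev][0] + math.log(trans_p[prev][s])
                     + math.log(emit_p[s].get(obs[t], floor)))
            V[t][s] = (score, prev)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))
```

Unseen words fall back to a small emission floor, which is one simple way an HMM tagger stays usable outside its training vocabulary.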
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...csandit
The growing population of elders in the society calls for a new approach in care giving. By
inferring what activities elderly are performing in their houses it is possible to determine their
physical and cognitive capabilities. In this paper we show the potential of important
discriminative classifiers namely the Soft-Support Vector Machines (C-SVM), Conditional
Random Fields (CRF) and k-Nearest Neighbors (k-NN) for recognizing activities from sensor
patterns in a smart home environment. We address also the class imbalance problem in activity
recognition field which has been known to hinder the learning performance of classifiers. Cost
sensitive learning is attractive under most imbalanced circumstances, but it is difficult to
determine the precise misclassification costs in practice. We introduce a new criterion for
selecting the suitable cost parameter C of the C-SVM method. Through our evaluation on four
real world imbalanced activity datasets, we demonstrate that C-SVM based on our proposed
criterion outperforms the state-of-the-art discriminative methods in activity recognition.
Considerable interest has been given to the identification and treatment of Multiword Expressions (MWEs). MWE identification heavily affects the quality of results in tasks widely used in natural language processing (NLP), such as parsing and generation. Different approaches to MWE identification have been applied: statistical methods are employed as an inexpensive, language-independent way of finding co-occurrence patterns, while linguistic methods use information such as part-of-speech (POS) filters and lexical alignment between languages and produce more targeted candidate lists. This paper presents a framework for extracting Arabic MWEs (nominal or verbal) at the bi-gram level using a hybrid approach. The proposed approach starts by applying a statistical method and then uses linguistic rules to enhance the results by keeping only patterns that match a relevant language rule. The proposed hybrid approach outperforms other traditional approaches.
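The statistical-then-linguistic pipeline for bi-grams can be sketched as follows; the frequency threshold and the allowed POS patterns are illustrative stand-ins for the paper's actual rules:

```python
from collections import Counter

def extract_mwe_bigrams(tagged_tokens, min_freq=2, patterns=None):
    """Two-pass hybrid extraction: a statistical pass keeps bigrams
    occurring at least min_freq times, then a linguistic pass keeps
    only those whose POS pattern matches an allowed rule."""
    if patterns is None:
        patterns = {("NOUN", "NOUN"), ("NOUN", "ADJ")}  # hypothetical rules
    counts = Counter(
        ((tagged_tokens[i][0], tagged_tokens[i + 1][0]),   # word bigram
         (tagged_tokens[i][1], tagged_tokens[i + 1][1]))   # POS bigram
        for i in range(len(tagged_tokens) - 1))
    return [words for (words, pos), freq in counts.items()
            if freq >= min_freq and pos in patterns]
```

Running the statistical filter first keeps the candidate list small before the (typically more expensive) linguistic matching is applied.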
ATAR: Attention-based LSTM for Arabizi transliterationIJECEIAES
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in that direction. The first is the collection and public release of the first large-scale "Arabizi to Arabic script" parallel corpus, focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).
A novel method for arabic multi word term extractionijdms
Arabic Multiword Terms (AMWTs) are relevant strings of words in text documents. Once automatically extracted, they can be used to increase the performance of any Arabic text-mining application, such as categorization, clustering, information retrieval, machine translation, and summarization. The proposed methods for AMWT extraction fall mainly into three approaches: linguistic-based, statistic-based, and hybrid. These methods present drawbacks that limit their use: they can only deal with bi-gram terms, and they do not yield good accuracy. In this paper, to overcome these drawbacks, we propose a new and efficient method for AMWT extraction based on a hybrid approach composed of two main filtering steps: a linguistic filter and a statistical one. The linguistic filter uses our proposed Part-of-Speech (POS) tagger and a sequence identifier as patterns in order to extract candidate AMWTs, while the statistical filter incorporates contextual information and a newly proposed association measure based on termhood and unithood estimation, named NTC-Value.
To evaluate and illustrate the efficiency of our proposed method for AMWT extraction, a comparative study was conducted on the Kalimat Corpus using nine experimental schemes: in the linguistic filter, we used three POS taggers (Taani's rule-based method, an HMM-based statistical method, and our recently proposed hybrid tagger), while in the statistical filter we used three statistical measures (C-Value, NC-Value, and our proposed NTC-Value). The obtained results demonstrate the efficiency of our proposed method for AMWT extraction: it outperforms the others and can deal correctly with tri-gram terms.
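For reference, the baseline C-Value measure that NTC-Value builds on can be computed as below. This is the classic formulation only, not the paper's NTC-Value, which additionally incorporates contextual information:

```python
import math

def c_value(candidates):
    """Classic C-value termhood. candidates maps a term (tuple of words,
    length >= 2) to its corpus frequency. Nested terms are penalized by
    the average frequency of the longer candidates containing them."""
    scores = {}
    for term, freq in candidates.items():
        # frequencies of longer candidate terms that contain this term
        containers = [f for other, f in candidates.items()
                      if len(other) > len(term) and any(
                          other[i:i + len(term)] == term
                          for i in range(len(other) - len(term) + 1))]
        weight = math.log2(len(term))
        if containers:
            scores[term] = weight * (freq - sum(containers) / len(containers))
        else:
            scores[term] = weight * freq
    return scores
```

The nesting penalty is what lets C-value prefer "real time system" over the embedded "real time" when the shorter string mostly occurs inside the longer one.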
This document describes a study on improving Tamil-English cross-language information retrieval through transliteration generation and mining techniques. The study achieved a peak Mean Average Precision of 0.5133 for monolingual English retrieval and 0.4145 for Tamil-English cross-language retrieval, representing an improvement over baselines without handling out-of-vocabulary terms. Transliteration mining performed better than generation at resolving out-of-vocabulary terms and boosting retrieval performance.
A Marathi Hidden-Markov Model Based Speech Synthesis Systemiosrjce
IOSR journal of VLSI and Signal Processing (IOSRJVSP) is a double blind peer reviewed International Journal that publishes articles which contribute new results in all areas of VLSI Design & Signal Processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI Design & Signal Processing concepts and establishing new collaborations in these areas.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVALcsandit
The demand for multilingual information is becoming pressing as the number of internet users throughout the world escalates, creating the problem of retrieving documents in one language by specifying a query in another. This increasing demand can be addressed by designing automatic tools that accept a query in one language and retrieve the relevant documents in other languages. We have developed a prototype Amharic-Arabic Cross Language Information Retrieval system using a dictionary-based approach that enables users to retrieve relevant documents from an Amharic-Arabic corpus by entering a query in Amharic and retrieving the relevant documents in both Amharic and Arabic.
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...kevig
This study proposes a method to develop neural morphological-analyzer models for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not place delimiters between words. Hiragana is a type of Japanese phonogramic character used in texts for children or for people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because less information is available for word division. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning a model based on ordinary Japanese text and examined the influence of training data drawn from texts of various genres.
Improving accuracy of part-of-speech (POS) tagging using hidden markov model ...IJECEIAES
In Natural Language Processing (NLP), word segmentation and Part-of-Speech (POS) tagging are fundamental tasks. POS information is also necessary in NLP preprocessing for applications such as machine translation (MT) and information retrieval (IR). Currently, many research efforts develop word segmentation and POS tagging separately, with different methods, to achieve high performance and accuracy. For the Myanmar language, there are also separate word segmenters and POS taggers based on statistical approaches such as Neural Networks (NN) and Hidden Markov Models (HMMs). But because of the Myanmar language's complex morphological structure, the out-of-vocabulary (OOV) problem still exists. To avoid errors and improve segmentation by utilizing POS data, segmentation and labeling should be performed at the same time. The main goal of developing a POS tagger for any language is to improve tagging accuracy and remove ambiguity in sentences due to language structure. This paper focuses on developing word segmentation and POS tagging for the Myanmar language, and presents a comparison of separate word segmentation and POS tagging with joint word segmentation and POS tagging.
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...kevig
In this paper, phoneme sequences are used as language information to perform code-switched language identification (LID). With a one-pass recognition system, the spoken sounds are converted into phonetically arranged sequences. The acoustic models are robust enough to handle multiple languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity among our target languages, we report two methods of phoneme mapping. Statistical phoneme-based bigram language models (LMs) are integrated into speech decoding to eliminate possible phone mismatches. A supervised support vector machine (SVM) learns to recognize the phonetic information of mixed-language speech from the recognized phone sequences. As the back-end decision is taken by the SVM, the likelihood scores of segments with monolingual phone occurrence are used to classify language identity. The speech corpus was tested on Sepedi and English, languages that are often mixed. Our system is evaluated by measuring the ASR performance and the LID performance separately. The systems obtained a promising ASR accuracy with a data-driven phone-merging approach modelled using 16 Gaussian mixtures per state. On code-switched speech and monolingual speech segments respectively, the proposed systems achieved acceptable ASR and LID accuracy.
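To show the role of the phone-bigram language models, here is a minimal sketch that replaces the SVM back-end with a simple maximum-likelihood decision over add-one-smoothed bigram LMs; the phone inventories in the test are invented toy data:

```python
import math
from collections import Counter

def train_phone_lm(seqs):
    """Add-one-smoothed phone-bigram language model from phone sequences."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for seq in seqs:
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    return bigrams, unigrams, vocab

def lm_score(lm, seq):
    """Log-likelihood of a phone sequence under an add-one bigram LM."""
    bigrams, unigrams, vocab = lm
    V = max(len(vocab), 1)
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
               for a, b in zip(seq, seq[1:]))

def identify_language(models, phone_seq):
    """Assign the language whose bigram LM scores the phone string highest."""
    return max(models, key=lambda lang: lm_score(models[lang], phone_seq))
```

A discriminative back-end such as the paper's SVM would replace the final max with a classifier trained on these per-language scores or on phone n-gram features.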
The document discusses the fauna of Colombia, which includes a diverse range of animal species from big cats like jaguars and pumas, to monkeys, birds, reptiles, and amphibians. Colombia has over 1,900 species of birds, making it one of the most biodiverse countries for avian wildlife. The Andean mountains and rainforests are home to much of Colombia's unique animal life.
Este documento analiza la seguridad ciudadana en el barrio El Carmen y la carrera cuarta de Valledupar. Busca identificar qué partes de estas áreas tienen mayores problemas de seguridad e investigar si la administración tiene suficientes puestos de control policial y si la policía realiza adecuadamente su labor de vigilancia. El objetivo general es mejorar las condiciones de seguridad para los habitantes a través de la coordinación entre dependencias.
Este documento describe los pasos para instalar y configurar un servidor DNS básico en Red Hat Enterprise Linux 6.2. Explica conceptos clave como resolución de nombres de dominio y el protocolo DNS. Luego detalla la configuración de archivos como ifcfg-eth0, named.conf y las zonas forward y reverse para mapear nombres de dominio a direcciones IP. Finalmente, verifica el funcionamiento del servidor DNS de forma local y remota desde un cliente Windows.
La realidad virtual es una ciencia que utiliza ordenadores y dispositivos para crear una apariencia de realidad en la que el usuario tiene la sensación de estar presente, a través del desarrollo de varias técnicas en un entorno virtual.
I had the best vacation of my life in Acapulco. Acapulco is where I spent my most memorable vacation. The short document discusses having an amazing vacation in the city of Acapulco.
Este documento describe la amistad como una relación afectiva entre dos o más personas que se desarrolla en diferentes etapas de la vida. La amistad surge cuando las personas encuentran intereses comunes, y puede darse entre personas y otros seres como animales.
This document is Mubarek's physics assignment covering the electromagnetic spectrum. It discusses the different types of electromagnetic radiation including radio waves, infrared, visible light, ultraviolet, X-rays, and gamma rays. For each type it describes common uses as well as potential health risks from overexposure, such as radio waves and cell phones possibly causing cancer, infrared causing dehydration and overheating, ultraviolet increasing cancer risk, X-rays posing risks to pregnancies, and gamma rays being able to cause mutations and cancer.
The document presents the organizational structure of the Universidad "Fermín Toro". The Directive Council sits at the top of the structure and comprises the Presidency, Internal Audit, the Directorate of Administration and Finance, the Directorate of Human Resources, and the Directorate of Sports and Performance. The latter directorate oversees Legal Counsel and the directorates of Sport for All, Sports Medicine, and General Services. The structure also includes several Venezuelan sports federations.
The document discusses NoSQL databases, including what they are, how they work, and some examples. NoSQL databases can scale horizontally by distributing data across numerous inexpensive systems or scale vertically on a single system. They are often open source and are used by companies such as Facebook. NoSQL databases fall into categories like key-value stores, column-family stores, document databases, and graph databases.
ACR Chicago/Virtual Mediation Lab Webinar on Online Mediation - November 13, ...virtualmediationlab
This is the 3rd of a 3-step ACR Chicago/Virtual Mediation Lab project to show what online mediation means, how it works, how it can be used, and how to integrate online and face-to-face mediation.
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...csandit
This paper presents an evaluation methodology to reveal the relationships between the
attributes of software products, practices applied during the development phase and the user
evaluation of the products. For the case study, the games sector has been chosen due to easy
access to the user evaluation of this type of software products. Product attributes and practices
applied during the development phase have been collected from the developers via
questionnaires. User evaluation results were collected from a group of independent evaluators.
Two bipartite networks were created using the gathered data. The first network maps software
products to the practices applied during the development phase and the second network maps
the products to the product attributes. Based on the links, similarities were determined and
subgroups of products were obtained according to selected development-phase practices. In
this way, the effect of the development phase on the user evaluation was investigated.
Agenda Colaborativa para o Fortalecimento do Sistema de Garantia de Direitos ...Aghata Gonsalves
This agenda is the result of a collective co-construction process that involved managers and professionals from the integral-protection policy field, as well as more than 100 children and adolescents from the municipality, who prepared a participatory diagnosis and proposed solutions for strengthening the policy and improving the quality of life of children and adolescents. This process took place within the framework of the Institutional Development Project promoted by the Instituto Comunitário Grande Florianópolis and the Esag Comunidade Extension Program.
IMPROVING RULE-BASED METHOD FOR ARABIC POS TAGGING USING HMM TECHNIQUEcscpconf
A part-of-speech (POS) tagger plays an important role in natural language applications such as speech recognition, natural language parsing, information retrieval, and multi-word term extraction. This study proposes building an efficient and accurate POS tagging technique for the Arabic language using a statistical approach. The Arabic rule-based method suffers from misclassified and unanalyzed words due to ambiguity. To overcome these two problems, we propose a Hidden Markov Model (HMM) integrated with the Arabic rule-based method. Our POS tagger generates a set of four POS tags: Noun, Verb, Particle, and Quranic Initial (INL). The proposed technique uses the contextual information of words together with a variety of features that help predict the various POS classes. To evaluate its accuracy, the proposed method was trained and tested on the Holy Quran corpus, containing 77 430 terms of undiacritized Classical Arabic. The experimental results demonstrate the efficiency of our method for Arabic POS tagging: the obtained accuracies are 97.6% for our method and 94.4% for the rule-based tagger.
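As a sketch of the statistical component, the Viterbi algorithm below decodes the most likely tag sequence over the paper's four-tag set. The transliterated words and all probabilities are illustrative placeholders, not values trained on the Quran corpus.

```python
# Minimal Viterbi decoder for an HMM POS tagger over the paper's tag set.
# Transition/emission probabilities here are illustrative, not trained values.
import math

TAGS = ["NOUN", "VERB", "PART", "INL"]

# Hypothetical log-probabilities (a real tagger estimates these from the corpus).
trans = {t: {u: math.log(0.25) for u in TAGS} for t in ["<s>"] + TAGS}
emit = {
    "NOUN": {"kitab": math.log(0.9), "qala": math.log(0.05)},
    "VERB": {"kitab": math.log(0.05), "qala": math.log(0.9)},
    "PART": {"kitab": math.log(0.025), "qala": math.log(0.025)},
    "INL":  {"kitab": math.log(0.025), "qala": math.log(0.025)},
}

def viterbi(words):
    # best[t] = (log-prob of best path ending in tag t, that path)
    best = {t: (trans["<s>"][t] + emit[t].get(words[0], math.log(1e-6)), [t])
            for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            p, path = max(
                ((best[u][0] + trans[u][t], best[u][1]) for u in TAGS),
                key=lambda x: x[0])
            new[t] = (p + emit[t].get(w, math.log(1e-6)), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["qala", "kitab"]))  # ['VERB', 'NOUN']
```

In the hybrid setup, emission probabilities for words the rule-based tagger handles confidently can be sharpened, while ambiguous or unanalyzed words fall back to the HMM's contextual evidence.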
Arabic morphological complexity is a core challenge of
Arabic information retrieval. This limitation makes the
Arabic language a difficult environment for
information retrieval specialists, because Arabic is
highly inflected. This paper explains the
importance of stemming in information retrieval
by developing a method to search for information needs in Arabic text. The newly developed
method is based on a light stemmer to increase keyword
matching between the query and the related
documents in the test collection.
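The light-stemming idea can be sketched as follows: strip one common prefix and one common suffix when enough of the word remains. The affix lists here are a small illustrative subset, not the stemmer actually used in the paper.

```python
# Toy Arabic light stemmer: strips a single common prefix and suffix,
# keeping at least three letters of the word. Illustrative affix subset only.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ب", "ل"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ية", "ه", "ة"]

def light_stem(word):
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("والمعلمون"))  # strips "وال" and "ون"
```

Indexing the stems of both query and documents lets inflected variants of the same keyword match, which is the effect the paper exploits.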
Customer Opinions Evaluation: A Case Study on Arabic Tweets gerogepatton
This paper presents an automatic method for extracting, processing, and analyzing customer opinions
on Arabic social media. We present a four-step approach for mining Arabic tweets. First, Natural
Language Processing (NLP) with different types of analyses is performed. Second, we present an
automatic and expandable lexicon of Arabic adjectives. The initial lexicon is built using 1350 adjectives
as seeds, drawn from processing different Arabic-language datasets. The lexicon is automatically expanded
by collecting synonyms and morphemes of each word through Arabic resources and Google Translate.
Third, emotional analysis is carried out by two different methods: machine learning (ML) and a rule-based method. Finally, feature selection (FS) is applied to enhance the mining results. The
experimental results reveal that the proposed method outperforms its counterparts with an improvement
margin of up to 4% in F-measure.
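The rule-based arm of such a pipeline can be sketched as a lexicon lookup with simple negation handling. The seed adjectives and negators below are hypothetical examples, not entries from the paper's lexicon.

```python
# Lexicon-based polarity scoring for a tokenized Arabic tweet.
# The lexicon entries and negators are illustrative placeholders.
LEXICON = {"رائع": 1, "ممتاز": 1, "سيء": -1, "بطيء": -1}
NEGATORS = {"ليس", "لا", "ما"}

def polarity(tokens):
    score, negate = 0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True        # flip the polarity of the next opinion word
            continue
        if tok in LEXICON:
            score += -LEXICON[tok] if negate else LEXICON[tok]
        negate = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity(["الجهاز", "رائع"]))         # positive
print(polarity(["الجهاز", "ليس", "رائع"]))  # negative
```

The paper's automatic expansion step would grow LEXICON with synonyms and morphological variants of each seed before scoring.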
COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMSIJMIT JOURNAL
In the context of information retrieval, Arabic stemming algorithms have become a major research area.
Many researchers have developed algorithms to address the stemming problem, each proposing
their own methodology and measurements to test the performance and compute the accuracy of
their algorithm; as a result, accurate comparisons between these algorithms are difficult.
Many generic conflation techniques and stemming algorithms are theoretically analyzed in this paper.
Then, the main characteristics of the Arabic language that must be covered before discussing
Arabic stemmers are summarized. The evaluation in this paper shows that Arabic
stemming remains one of the main information retrieval challenges. This paper aims to compare
the most commonly used light stemmers in terms of affix lists, algorithms, main ideas, and
information retrieval performance. The results show that the light10 stemmer outperformed the other
stemmers. Finally, recommendations for future research on the development of a standard Arabic
stemmer are presented.
GENDER AND AUTHORSHIP CATEGORISATION OF ARABIC TEXT FROM TWITTER USING PPMijcsit
In this paper we present gender and authorship categorisation using the Prediction by Partial Matching (PPM) compression scheme for Arabic text from Twitter. The PPMD variant of the compression scheme with different orders was used to perform the categorisation. We also applied different machine learning algorithms, such as Multinomial Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and an implementation of Support Vector Machine (LIBSVM), applying the same processing steps for all the algorithms. PPMD shows significantly better accuracy than all the other machine learning algorithms, with order-11 PPMD working best, achieving 90% and 96% accuracy for gender and authorship respectively.
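The intuition behind compression-based categorisation can be sketched with a general-purpose compressor: a document belongs to the class whose training text compresses it best. zlib (DEFLATE) stands in for PPM here, since PPM is not in the standard library, and the corpora are toy examples; the criterion is the same minimum-cross-entropy idea.

```python
# Compression-based categorisation: assign the class whose training text
# compresses the new document with the smallest size increase.
import zlib

def extra_bytes(corpus, doc):
    base = len(zlib.compress(corpus.encode()))
    return len(zlib.compress((corpus + doc).encode())) - base

def classify(doc, corpora):
    return min(corpora, key=lambda c: extra_bytes(corpora[c], doc))

corpora = {
    "sports": "match goal team score player league win cup " * 20,
    "tech":   "code data model server cloud network software " * 20,
}
print(classify("the player scored a goal for the team", corpora))  # sports
```

A PPM model trained per author or per gender class works the same way, choosing the class under which the unseen tweet has the lowest coding cost.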
A new keyphrases extraction method based on suffix tree data structure for ar...ijma
The document presents a new method for extracting keyphrases from Arabic documents using a suffix tree data structure. The method first cleans documents by removing stop words and special characters. It then generates a suffix tree document model to represent documents as sequences of words. Nodes in the suffix tree are scored based on factors like phrase length and term frequency-inverse document frequency. Highly scored nodes that are 1-3 words are selected as keyphrases. The keyphrases are used instead of full text for clustering Arabic documents with hierarchical clustering algorithms and various similarity measures. Experimental results showed the keyphrase extraction approach improved the quality of document clustering.
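The ranking idea can be illustrated without the suffix tree: enumerate candidate phrases of one to three words and score them by frequency weighted by length. The stop-word list and weighting are simplified assumptions; the paper scores suffix-tree nodes with TF-IDF.

```python
# Simplified keyphrase ranking: 1-3 word candidates, frequency * phrase length.
from collections import Counter

STOP = {"the", "of", "a", "and", "in", "is", "for"}

def keyphrases(text, top=3):
    words = [w for w in text.lower().split() if w not in STOP]
    counts = Counter()
    for n in (1, 2, 3):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    # weight longer phrases so multi-word terms can outrank their parts
    scored = {p: c * len(p.split()) for p, c in counts.items() if c > 1}
    return [p for p, _ in sorted(scored.items(), key=lambda x: -x[1])[:top]]

text = "suffix tree clustering uses a suffix tree model for document clustering"
print(keyphrases(text))  # ranks "suffix tree" first
```

Clustering then operates on these short keyphrases instead of the full text, which is what the experiments show improves cluster quality.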
Genetic Approach For Arabic Part Of Speech Taggingkevig
With the growing number of textual resources available, the ability to understand them becomes critical.
An essential first step in understanding these sources is the ability to identify the parts-of-speech in each
sentence. Arabic is a morphologically rich language, which presents a challenge for part of speech
tagging. In this paper, our goal is to propose, improve, and implement a part-of-speech tagger based on a
genetic algorithm. The accuracy obtained with this method is comparable to that of other probabilistic
approaches.
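A genetic tagger of this kind can be sketched as follows: chromosomes are tag sequences for the sentence, and fitness combines per-word lexical scores with a tag-bigram bonus. The transliterated lexicon, bonus values, and GA parameters are illustrative assumptions, not the paper's configuration.

```python
# Toy genetic-algorithm POS tagger: evolve tag sequences toward a fitness
# built from hypothetical lexical scores plus a tag-bigram bonus.
import random

TAGS = ["NOUN", "VERB", "PART"]
LEX = {"kataba": {"VERB": 0.9, "NOUN": 0.1},
       "aldars": {"NOUN": 0.9, "VERB": 0.1}}
BIGRAM_BONUS = {("VERB", "NOUN"): 0.5}  # VSO order is common in Arabic

def fitness(words, tags):
    score = sum(LEX.get(w, {}).get(t, 0.01) for w, t in zip(words, tags))
    score += sum(BIGRAM_BONUS.get(pair, 0) for pair in zip(tags, tags[1:]))
    return score

def evolve(words, pop_size=30, gens=40):
    random.seed(0)  # deterministic run for this sketch
    pop = [[random.choice(TAGS) for _ in words] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda c: -fitness(words, c))
        survivors = pop[:pop_size // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            child[random.randrange(len(words))] = random.choice(TAGS)  # point mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda c: fitness(words, c))

print(evolve(["kataba", "aldars"]))  # ['VERB', 'NOUN']
```

A real tagger would derive the lexical scores from a morphological analyzer and tune the bigram weights on annotated text.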
AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ...ijaia
With the growing number of electronic documents used in many applications, a fast and accurate text classification method is very important. Arabic text classification is one of the most challenging topics, probably because Arabic words have almost unlimited variation in meaning, in addition to problems that are specific to the Arabic language. Many studies have shown that the Naive Bayes (NB)
classifier is relatively robust, easy to implement, fast, and accurate in many fields, including text classification. However, non-linear classification problems and strong violations of its independence assumptions can lead to very poor NB performance. In this paper, we first preprocess
the Arabic documents to tokenize only the Arabic words. Second, we convert those words into vectors using the term frequency-inverse document frequency (TF-IDF) technique. Third, we propose an efficient approach based on a Kernel Naive Bayes (KNB) classifier to solve the non-linearity problem of
Arabic text classification. Finally, experimental results and a performance evaluation on our collected Arabic topic-mining corpus are presented, showing the effectiveness of the proposed KNB classifier against other baseline classifiers.
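The TF-IDF vectorization step described above can be sketched in a few lines; the documents are toy examples, and a real pipeline would feed these weighted vectors to the KNB classifier.

```python
# TF-IDF weighting: term frequency in the document scaled by how rare the
# term is across the corpus (log inverse document frequency).
import math
from collections import Counter

def tfidf(docs):
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    vectors = []
    for d in docs:
        tf = Counter(d.split())
        total = sum(tf.values())
        vectors.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

docs = ["economy market trade", "market goal match", "goal match team"]
vecs = tfidf(docs)
# "economy" appears in only one document, so it gets the highest weight there
print(max(vecs[0], key=vecs[0].get))
```

Terms shared by every document get weight log(1) = 0, so the vectors emphasize exactly the words that discriminate between topics.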
Phrase identification is one of the most critical and widely studied tasks in Natural Language Processing (NLP). Identifying verb phrases within a sentence is very useful for a variety of NLP applications, and one of the core enabling technologies they require is morphological analysis. This paper presents a Myanmar verb phrase identification and translation algorithm and develops a Markov model with morphological analysis. The system is based on a rule-based maximum matching approach. In machine translation, a large amount of information is needed to guide the translation process. Myanmar is an inflected language, and very few lexicons have been created or researched for Myanmar compared to other languages such as English, French, and Czech. Therefore, this system proposes a Myanmar verb phrase identification and translation model based on the syntactic structure and morphology of the Myanmar language, using a Myanmar-English bilingual lexicon. A Markov model is also used to reformulate the translation probability of phrase pairs. Experimental results showed that the proposed system can improve translation quality by applying morphological analysis to the Myanmar language.
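The rule-based maximum matching step can be sketched as greedy longest-prefix lookup against a bilingual lexicon. The romanized entries below are hypothetical stand-ins for Myanmar phrases, and reordering to English word order is a separate step.

```python
# Greedy longest-match phrase identification over unsegmented input,
# looked up in a (hypothetical, romanized) bilingual lexicon.
LEXICON = {"thu": "he", "thwa": "go", "thwade": "goes", "kyaung": "school"}

def max_match(text):
    phrases = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest candidate first
            if text[i:j] in LEXICON:
                phrases.append(LEXICON[text[i:j]])
                i = j
                break
        else:
            phrases.append(text[i])            # unknown character, pass through
            i += 1
    return phrases

print(max_match("thukyaungthwade"))  # ['he', 'school', 'goes']
```

Trying the longest entry first is what makes "thwade" win over its prefix "thwa"; the Markov model then rescores the resulting phrase pairs.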
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...IJCI JOURNAL
Recent advancements in the field of natural language processing have markedly enhanced the capability of machines to comprehend human language. However, as language models progress, they require continuous architectural enhancements and different approaches to text processing. One significant challenge stems from the rich diversity of languages, each characterized by its distinctive grammar, resulting in decreased accuracy of language models for specific languages, especially low-resource ones. This limitation is exacerbated by the reliance of existing NLP models on rigid tokenization methods, rendering them susceptible to issues with previously unseen or infrequent words. Additionally, models based on word and subword tokenization are vulnerable to minor typographical errors, whether they occur naturally or result from adversarial misspellings. To address these challenges, this paper utilizes a recently proposed free-tokenization method, CANINE, to enhance the comprehension of natural language. Specifically, we employ this method to develop an Arabic free-tokenization language model. In this research, we evaluate our model's performance across a range of eight tasks using the Arabic Language Understanding Evaluation (ALUE) benchmark. Furthermore, we conduct a comparative analysis, pitting our free-tokenization model against existing Arabic language models that rely on subword tokenization. By making our pre-training and fine-tuning models accessible to the Arabic NLP community, we aim to facilitate the replication of our experiments and contribute to the advancement of Arabic language processing capabilities. To further support reproducibility and open-source collaboration, the complete source code and model checkpoints will be made publicly available on Huggingface.
In conclusion, the results of our study demonstrate that the free-tokenization approach exhibits performance comparable to established Arabic language models that utilize subword tokenization. Notably, in certain tasks, our model surpasses some of these existing models. This evidence underscores the efficacy of free tokenization in processing the Arabic language, particularly in specific linguistic contexts.
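Character-level (free) tokenization of the kind CANINE-style models use can be sketched by mapping each Unicode codepoint to an integer id, so unseen or misspelled words never fall out of vocabulary. The fixed length and padding id here are illustrative choices, not the model's actual configuration.

```python
# Free tokenization sketch: one id per Unicode codepoint, no subword
# vocabulary, so there is no out-of-vocabulary token for rare words.
def char_ids(text, max_len=16, pad_id=0):
    ids = [ord(c) for c in text[:max_len]]
    return ids + [pad_id] * (max_len - len(ids))

print(char_ids("كتاب", max_len=6))  # Arabic codepoints, zero-padded
```

Because a one-letter typo changes only one id rather than splintering the whole token, this representation is inherently more robust to the misspellings the abstract describes.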
A GRAMMATICALLY AND STRUCTURALLY BASED PART OF SPEECH (POS) TAGGER FOR ARABIC...kevig
In this paper we report on an experimental syntactically and morphologically driven rule-based Arabic
tagger. The tagger is developed using Arabic grammatical rules and requires no pre-tagged text: it uses
only a primitive lexicon of closed word classes, such as demonstrative nouns, pronouns, and some
particles, along with extensive grammatical and structural rules. The newly developed tagger, named the
MTE tagger, was tested and compared to the Stanford tagger in terms of both accuracy and speed. For the
evaluation, a set of Arabic text was manually prepared and annotated; both taggers were run on multiple
datasets, in particular on a set of 1226 sentences with close to 20,000 tokens that was human-annotated
and verified to serve as a testbed. The results were very encouraging: in both test runs, the MTE tagger
outperformed the Stanford tagger in accuracy, at 87.88% versus 86.67%, while in terms of tagging speed
the ratio relative to the Stanford tagger was on average 1:50 in favor of the MTE tagger. Further
accuracy improvements are possible in future work as the rule set is optimized and integrated and more
Arabic language properties, such as end-of-word diacritization, are exploited.
Speech segmentation based on four articles in one Abebe Tora
This review compares four articles on speech segmentation in terms of their methods, findings, limitations, evaluation metrics, strengths, and contributions to the area. To obtain the articles, contact abebetora79@gmail.com.
An improved Arabic text classification method using word embeddingIJECEIAES
Feature selection (FS) is a widely used method for removing redundant or irrelevant features to improve classification accuracy and decrease a model's computational cost. In this paper, we present an improved method (referred to hereafter as RARF) for Arabic text classification (ATC) that employs term frequency-inverse document frequency (TF-IDF) and the Word2Vec embedding technique to identify words that have a particular semantic relationship. In addition, we compare our method with four benchmark FS methods, namely principal component analysis (PCA), linear discriminant analysis (LDA), chi-square, and mutual information (MI). Support vector machine (SVM), k-nearest neighbors (K-NN), and naive Bayes (NB) are the three machine learning algorithms used in this work. Two different Arabic datasets are utilized to perform a comparative analysis of these algorithms. This paper also evaluates the efficiency of our method for ATC using the performance metrics accuracy, precision, recall, and F-measure. Results reveal that the highest accuracy, 94.75%, was achieved by the SVM classifier on the Khaleej-2004 Arabic dataset, while the same classifier recorded an accuracy of 94.01% on the Watan-2004 Arabic dataset.
Named Entity Recognition System for Hindi Language: A Hybrid ApproachWaqas Tariq
Named Entity Recognition (NER) is a major early step in Natural Language Processing (NLP) tasks like machine translation, text-to-speech synthesis, and natural language understanding. It seeks to classify words that represent names in text into predefined categories like location, person name, organization, date, and time. In this paper we use a combination of machine learning and rule-based approaches to classify named entities, introducing a hybrid approach for NER. We experiment with statistical approaches, namely Conditional Random Fields (CRF) and Maximum Entropy (MaxEnt), and a rule-based approach built on a set of linguistic rules. The linguistic approach plays a vital role in overcoming the limitations of statistical models for a morphologically rich language like Hindi. The system also uses a voting method to improve NER performance. Keywords: NER, MaxEnt, CRF, Rule base, Voting, Hybrid approach
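The voting step can be sketched as a per-token majority vote over the systems' aligned predictions. The tag sequences below are toy examples, and the paper's exact voting scheme may differ (e.g., it could weight systems by confidence).

```python
# Majority voting over per-token tag predictions from several NER systems.
from collections import Counter

def vote(*predictions):
    # predictions: one tag sequence per system, aligned token by token
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*predictions)]

crf    = ["PER", "O", "LOC"]
maxent = ["PER", "O", "O"]
rules  = ["O",   "O", "LOC"]
print(vote(crf, maxent, rules))  # ['PER', 'O', 'LOC']
```

Voting lets the rule-based system override a statistical model exactly where linguistic rules are reliable, which is how the hybrid design recovers errors of either component.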
Similar to IMPROVING RULE-BASED METHOD FOR ARABIC POS TAGGING USING HMM TECHNIQUE
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e, located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
Computer Science & Information Technology (CS & IT)
adjective. This is done on the basis of the word's role, both individually and in the sentence.
Most words occurring in undiacritized Arabic text are ambiguous in terms of their part of
speech [4]. Take for example the term "ذهب": it can be treated as a noun ("gold") or a verb ("go").
There are three general approaches to the tagging problem: the rule-based approach, the
statistical approach, and the hybrid approach. The rule-based approach consists of developing a
knowledge base of rules established by linguists in order to define precisely how and where to
assign the various POS tags. The statistical approach consists of building a trainable model and
using a previously-tagged corpus to estimate its parameters. Once this is done, the model can be
used to determine the tags of other texts. Generally, successful statistical taggers are mainly
based on Hidden Markov Models (HMMs). Finally, the hybrid approach consists of combining
the rule-based approach with a statistical one. Most recent POS taggers use the latter
approach, as it gives better results.
Among the most recent works, we have favored the rule-based method proposed by Al-Taani [19]
over other methods for a number of reasons. First, it is simple to understand, accurate, and relies
on correct Arabic sentence structure using the metrics of syntactic patterns. Second, unlike
other taggers, which are generally developed for Modern Standard Arabic (MSA) and thus may
not be appropriate for Classical Arabic (CA), Taani's method can deal directly with
CA, which is the language of the Holy Quran.
However, this rule-based method [19] presents some weaknesses: it may misclassify some words
and leave others unanalyzed. For example, the term "ھدى" (i.e. "upon, right") is not analyzed, and
the word "موتكم" (i.e. "your death") is assigned the incorrect "verb" tag.
To overcome these problems, we propose an Arabic part-of-speech tagging method based on a
hybrid approach which combines the rule-based approach with a statistical approach that relies on
the Arabic sentence structure, improving Taani's POS tagging [19]. The rest of the paper is
organized as follows: In Section 2, we describe related work on POS tagging techniques for the
Arabic language. The rule-based tagger is presented in Section 3. Section 4 describes the
principles of the HMM tagger. Our proposed method is described in Section 5. Section 6 presents
experimental results. Finally, Section 7 concludes the paper and describes future work.
2. RELATED WORKS
Part-of-speech tagging consists of assigning to each word of a sentence a tag which indicates the
function of the word in a specific context. As we have mentioned previously, there are many
methods of POS tagging, which can be classified into three categories: the statistical approach, the rule-based approach, and the hybrid approach.
2.1. Statistical approach:
The statistical approach requires much less human effort, successful model during the last years
Hidden Markov Models and related techniques have focused on building probabilistic models of
tag transition sequences in sentence. This task is difficult for Arabic languages due to the lack of
annotated large corpus. So far, numerous POs tagging methods have been presented in Arabic
languages which are often statistical. Banko et al. [16] present a HMM tagger that exploits
context on both sides of a word to be tagged. It is evaluated in both the unsupervised and
supervised cases. Orumchian's tagger [10] is presented for Persian POS tagging which is follows
the TNT POS tagger. The TNT tagger is based on Hidden Markov Models theory. This system
uses 2.5 million tagged words as training data and the size of the tag-set is 38. Guessoum et al
[26] present a POS tagging system to resolves the ambiguity through the use of a statistical
language model developed from an Arabic corpus as a Hidden Markov Model (HMM). Albared et
al. [27] developed a Bigram Hidden Markov Model (HMM) to tackle the POS tagging problem for
the Arabic language. The HMM parameters are estimated from small training data. They have
studied the behaviour of the HMM applied to Arabic POS tagging using a small amount of data,
using different smoothing algorithms with the HMM model to overcome the data sparseness problem.
Elhadj et al. [28] propose a new part-of-speech tagging method that can be used for analyzing
and annotating traditional Arabic texts, especially the Holy Quran text. This approach combines
morphological analysis with Hidden Markov Models (HMMs) based on the Arabic sentence
structure.
2.2. Rule-based approach:
This approach has successfully been used in developing many natural language processing
systems. Systems that use rule-based transformations are based on a core of solid linguistic
knowledge. One such clue is the affix: some affixes are proper to verbs, some are proper to nouns,
and some others are used with both verbs and nouns. Another important sign in the Arabic language is the
pattern, which is an important guide in recognizing the word category. The approach is also used
for some specific tasks. Diab et al. [15] designed an automatic tagging system to tokenize and part-of-speech tag Arabic text. Habash et al. [24] proposed a morphological analyzer for tokenizing
and morphologically tagging Arabic words. Freeman [13] described an Arabic part-of-speech
tagging system based on the Brill tagging system, which is a machine learning system that can be
trained with a previously-tagged corpus. The author used a tag set containing 146 tags extracted
from the Brown corpus for the English language. Lee et al. [18] used a corpus of manually segmented
words which appears to be a subset of the first release of the ATB (110,000 words). They
obtained a list of prefixes and suffixes from this corpus, which is apparently augmented by a
manually derived list of other affixes. Maamouri et al. [14] presented a part-of-speech tagging
system for Arabic. The authors based their work on the output of Tim Buckwalter's
morphological analyzer. This tagging system is tested on a corpus consisting of 734 files extracted
from the "Agence France Presse" newswire [25].
2.3. Hybrid approach:
This approach consists of combining the rule-based and statistical methods to
assign the best tag to each word of the input text. Different Arabic taggers based on hybrid
methods have recently emerged. Among these studies, Khoja [12] combines statistical and rule-based techniques and uses a tag set of 131 tags basically derived from the BNC English tag set. Tlili-Guiassa [17] used a hybrid method combining rules with a memory-based learning method. One of
the most recent works on POS tagging is by Jabbari and Allison [7]. Their approach is
transformation-based and has previously been used for English by Brill and Hepple [8, 9]. The
construction of this tagger contains a trained machine learner which includes approximated rules.
In fact, they applied an implementation of Error-Driven Transformation-Based Learning.
Note that some of these taggers [14] apply a translation of the original Arabic text into English
and use tag sets derived from English, which is not appropriate for Arabic. Other taggers
[12] [13] [16] rely on a transliteration of the Arabic input text. Moreover, most taggers are
generally developed for Modern Standard Arabic (MSA) and thus may not be appropriate for
Classical Arabic, which is the language of the Holy Quran. Generally, the hybrid methods [7] [8]
[9] give the best results for POS tagging.
Among the most recent works, we have chosen the rule-based method presented by Al-Taani [19]
because it is accurate and relies on correct Arabic sentence structure using the metrics of syntactic
patterns. Moreover, it uses Classical Arabic (CA), which is the language of the Holy Quran.
However, this rule-based method [19] presents some problems: it may misclassify some words and
leave others unanalyzed. For example, the term "ھدى" (i.e. "upon" or "right") is not analyzed, and the
word "موتكم" (i.e. "your death") is assigned the incorrect "verb" tag. To resolve these problems,
this paper proposes an Arabic part-of-speech tagging method which combines the rule-based
approach [19] with the HMM technique. The latter is superior to other models in terms of training
time and is suitable for applications dealing with large amounts of text. In the following two
sections, both Taani's rule-based tagging and the HMM tagging method are presented in
detail.
3. TAANI'S RULE-BASED TAGGING METHOD
The rule-based tagging method [19] classifies the words of a non-vocalized Arabic text
into their tags. It consists of three main phases: the lexicon analyzer, the morphological
analyzer, and the syntax analyzer. Figure 1 shows the architecture of this system.
Lexicon Analyzer: In this step, a lexicon of stop lists in the Arabic language is defined. This lexicon
includes prepositions, adverbs, conjunctions, interrogative particles, exceptions and interjections.
All words have to pass through this phase. If a word is found in the lexicon, it is considered
tagged; otherwise, it passes to the next step.
Morphological analyzer: Each word which has not been tagged in the previous phase moves
to this phase. The affixes of each word are extracted; an affix may be a prefix,
suffix or infix. After that, these affixes and the relations between them are used in a set of rules to
tag the word into its class. Note that this phase is the core of the system, since it classifies the
major percentage of untagged words into nouns or verbs.
Syntax analyzer: This phase can help in tagging the words which the previous two phases failed
to tag. It consists of two rules: sentence context and reverse parsing. The sentence context rule is
based on the relation between untagged words and their adjacent words. The Arabic language has some
types of relations between adjacent words; for example, prepositions and interjections are
always followed by nouns. These relations may allow tagging words into their corresponding
classes. The reverse parsing rule is based on an Arabic context-free grammar. The authors propose a
set of rules which are used frequently in the Arabic language.
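The three-phase cascade can be sketched as follows; the tiny lexicon, affix lists and context rule below are illustrative stand-ins, not Taani's actual rule base:

```python
# Minimal sketch of the three-phase rule-based cascade. The lexicon, affix
# lists and the single context rule are illustrative examples only.

STOP_LEXICON = {"من": "P", "في": "P", "ثم": "P"}   # prepositions, particles, ...
NOUN_SUFFIXES = ("ون", "ات", "ة")                  # sample noun affixes
VERB_PREFIXES = ("سي", "ي", "ت")                   # sample verb affixes

def lexicon_analyzer(word):
    """Phase 1: tag the word if it appears in the stop-list lexicon."""
    return STOP_LEXICON.get(word)

def morphological_analyzer(word):
    """Phase 2: tag by affix rules (prefixes/suffixes)."""
    if word.endswith(NOUN_SUFFIXES):
        return "N"
    if word.startswith(VERB_PREFIXES):
        return "V"
    return None

def syntax_analyzer(word, prev_tag):
    """Phase 3: tag from sentence context, e.g. a noun follows a particle."""
    if prev_tag == "P":
        return "N"
    return None   # still untagged

def rule_based_tag(sentence):
    """Run each word through the three phases in order; '?' marks failures."""
    tags, prev = [], None
    for word in sentence:
        tag = (lexicon_analyzer(word)
               or morphological_analyzer(word)
               or syntax_analyzer(word, prev)
               or "?")
        tags.append(tag)
        prev = tag
    return tags
```

In the full system, only the words that remain tagged "?" after these three phases are handed to the HMM component.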
Figure 1. Architecture of the Rule-Based Arabic POS Tagger [19]
In the following section, we present the HMM model since it will be integrated in our method for
POS tagging Arabic text.
4. HIDDEN MARKOV MODEL
The use of a Hidden Markov Model (HMM) to do part-of-speech tagging can be seen as a special
case of Bayesian inference [20]. It can be formalized as follows: for a given sequence of words,
what is the best sequence of tags which corresponds to this sequence of words? If we represent the word sequence by $w_1^n$ and the tag sequence by $t_1^n$, the problem is to find

$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$$

By using the Bayesian rule and then eliminating the constant part $P(w_1^n)$, the equation can be
transformed to this new one:

$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)$$

where $P(t_1^n)$ represents the probability of the tag sequence (tag transition probabilities), and can
be computed using an N-gram model, as follows:

$$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})$$

However, it can happen that some trigrams (or bigrams) will never appear in the training set; so,
to avoid assigning null probabilities to unseen trigrams (bigrams), we used the deleted interpolation
developed by [20]:

$$P(t_i \mid t_{i-1}, t_{i-2}) = \lambda_1 \hat{P}(t_i \mid t_{i-1}, t_{i-2}) + \lambda_2 \hat{P}(t_i \mid t_{i-1}) + \lambda_3 \hat{P}(t_i)$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$. Then, for calculating the likelihood of the word sequence given the tag sequence, the probability of a
word appearing is generally supposed to be dependent only on its own part-of-speech tag. So, it
can be written as follows:

$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$$

In addition, a tagged training set has to be used for computing these probabilities from counts, as follows:

$$P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}, \qquad P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$$
Tag sequence probabilities and word likelihoods represent the HMM model parameters: transition
probabilities and emission (observation) probabilities. Once these parameters are set, the HMM
model can be used to find the best sequence of tags given a sequence of input words. The Viterbi
algorithm can be used to perform this task.
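A sketch of how these parameters can be estimated from a tagged corpus is shown below; the λ weights of the deleted interpolation are fixed constants here (an assumption; [20] learns them from held-out data):

```python
from collections import Counter

def estimate_hmm(tagged_sentences, lambdas=(0.6, 0.3, 0.1)):
    """Estimate HMM emission and smoothed trigram transition probabilities
    from sentences given as lists of (word, tag) pairs."""
    tag_ct, emit_ct = Counter(), Counter()
    uni_ctx, bi_ct, bi_ctx, tri_ct = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["<s>", "<s>"] + [t for _, t in sent]   # two start padding tags
        for w, t in sent:
            tag_ct[t] += 1
            emit_ct[(t, w)] += 1
        for i in range(2, len(tags)):
            uni_ctx[tags[i - 1]] += 1
            bi_ct[(tags[i - 1], tags[i])] += 1
            bi_ctx[(tags[i - 2], tags[i - 1])] += 1
            tri_ct[(tags[i - 2], tags[i - 1], tags[i])] += 1
    total = sum(tag_ct.values())
    l1, l2, l3 = lambdas

    def emission(word, tag):
        # P(word | tag) = C(tag, word) / C(tag)
        return emit_ct[(tag, word)] / tag_ct[tag] if tag_ct[tag] else 0.0

    def transition(t, prev, prev2):
        # Deleted interpolation: l1*P(t|prev2,prev) + l2*P(t|prev) + l3*P(t)
        p3 = tri_ct[(prev2, prev, t)] / bi_ctx[(prev2, prev)] if bi_ctx[(prev2, prev)] else 0.0
        p2 = bi_ct[(prev, t)] / uni_ctx[prev] if uni_ctx[prev] else 0.0
        p1 = tag_ct[t] / total if total else 0.0
        return l1 * p3 + l2 * p2 + l3 * p1

    return emission, transition
```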
5. PROPOSED METHOD FOR ARABIC POS TAGGING
The proposed method is based on a hybrid approach; it combines the rule-based method presented
by Taani [19] with an HMM model (see Figure 2). As we have mentioned, the rule-based
method is composed of three steps: the lexicon analyzer, the morphological analyzer and the syntax analyzer
(cf. Section 3).
Almost all words are recognized by the rule-based method. However, some terms are left unanalyzed or
misclassified. These remaining terms will be analyzed using the HMM model. The
states of this model correspond to part-of-speech tags and the observations correspond to words
(see Figure 3). The basic idea of our HMM model, which adopts supervised learning, is to
assign the most probable tag to each word of an input sentence. Two major steps are required: the
training step and the test step.
Figure 2. Architecture of our hybrid Arabic POS tagger
The training step is based on supervised learning. It learns the parameters of the HMM
model from the corpus by estimating the transition and emission probabilities. First, for each
term we compute the emission probability of each tag, i.e. P(word | tag)
(cf. Section 4). Second, for each tag we calculate the transition
probabilities, which represent the relation between a tag and the previous tag, i.e. P(tag | previous tag),
for the Hidden Markov Model. The results of this step are two matrices: the
matrix of transition probabilities (tag/tag) and the matrix of emission probabilities (word/tag).
Figure 3: The Architecture of HMM Tagger
The testing step aims to assign the most probable tag to each word that has been
misclassified or left unanalyzed during the rule-based process. First, we consider all possible tags for this
word. Then, we compute the probability of each tag of this word by using the transition
probabilities and emission probabilities. Finally, the Viterbi algorithm is used to calculate the most
probable path (best tag sequence) for the given word in its sentence.
To illustrate these steps, we consider the following sentence as an example:
Figure 4: Example of sentence including an Arabic word with unknown tag
The tag of the word "ھدى" is unknown to the rule-based method. First, we assign it the three
different tags N, V, and INL. Using the HMM model and the transition and emission probabilities
previously estimated in the training phase, we compute the probability of each tag for the
word "ھدى" as a function of the probabilities of the previous tags in the sentence:
P(ھدى = N) = P(ھدى | N) P(N | P) P(P | N) P(N | P) P(P | N) P(N | P)
P(ھدى = V) = P(ھدى | V) P(V | P) P(P | N) P(N | P) P(P | N) P(N | P)
P(ھدى = INL) = P(ھدى | INL) P(INL | P) P(P | N) P(N | P) P(P | N) P(N | P)
Figure 5: Example of sentence including an Arabic word with unknown tag
Finally, we calculate the most probable path (i.e. the best tag sequence) by using the Viterbi algorithm.
The latter is the most common decoding algorithm for HMMs and gives the most likely tag
sequence given a set of tags. It uses the following recursion, where $a_{ij}$ is the transition probability from tag $i$ to tag $j$ and $b_j(o_t)$ is the emission probability of the observed word $o_t$ given tag $j$:

$$v_t(j) = \max_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$$
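A compact sketch of this decoding step over the 4-tag set; the dictionary-based probability tables are an assumed interface (missing entries default to zero), not the ones trained on the corpus:

```python
def viterbi(words, tags, trans, emit, init):
    """Find the most likely tag sequence for `words`.
    trans[(prev, t)] = P(t | prev); emit[(t, w)] = P(w | t); init[t] = P(t at start)."""
    # v[t] = probability of the best path ending in tag t; back stores pointers
    v = {t: init.get(t, 0.0) * emit.get((t, words[0]), 0.0) for t in tags}
    back = []
    for w in words[1:]:
        nv, bp = {}, {}
        for t in tags:
            # best predecessor for tag t at this position
            best_prev = max(tags, key=lambda p: v[p] * trans.get((p, t), 0.0))
            nv[t] = v[best_prev] * trans.get((best_prev, t), 0.0) * emit.get((t, w), 0.0)
            bp[t] = best_prev
        v, back = nv, back + [bp]
    # Trace back from the best final tag
    last = max(tags, key=lambda t: v[t])
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))
```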
6. RESULTS AND DISCUSSIONS
In this section, we present and describe the corpus used to evaluate our proposed method for
Arabic POS tagging. Then, we present some pre-processing tasks done on the corpus and
describe the tag set that we used. Finally, experimental results are presented and discussed.
6.1. Corpus description
We use the Quranic Arabic Corpus [22], named "quranic-corpus-text-0.2". The
language form of the corpus is Classical Arabic (CA). The Arabic alphabet of the corpus consists
of the following letters: أ ب ت ث د ج ح خ ه ع غ ف ق ص ض ط ك م ن ل ي س ش ظ ز و ة ى ال ر ؤ ء ئ. The
phonetic transcription of the letters is shown in Table 1.
The Holy Quran corpus consists of 6236 sentences with a total of 77430 words and uses a 33-tag set
[23]. This tag set consists of the following: Noun (N), Proper Noun (PN), Number (NUM),
Adjective (ADJ), Imperative Verbal Noun (IMPN), Verb (V), Prohibition Particle (PRO), Negative
Particle (NEG), Accusative Particle (ACC), Conditional Particle (COND), Restriction Particle
(RES), Particle of Certainty (CERT), Interrogative Particle (INTG), Inceptive Particle (INC),
Vocative Particle (VOC), Retraction Particle (RET), Amendment Particle (AMD), Future Particle
(FUT), Exhortation Particle (EXH), Exceptive Particle (EXP), Explanation Particle (EXL),
Surprise Particle (SUR), Aversion Particle (AVR), Answer Particle (ANS), Coordinating
Conjunction (CONJ), Subordinating Conjunction (SUB), Time Adverb (T), Location Adverb
(LOC), Personal Pronoun (PRON), Relative Pronoun (REL), Demonstrative Pronoun (DEM),
Quranic Initial (INL).
Table 1. Phonetic Transcription for Arabic Letters used in the Holy Quran Corpus

Letter   Arabic   Transcription        Letter         Arabic   Transcription
Alif     ا        A                    Za             ظ        Z
Ba       ب        B                    Ayn            ع        '
Ta       ت        T                    Ghayn          غ        Gh
Tha      ث        Th                   Fa             ف        F
Jim      ج        J                    Qaf            ق        Q
Ha       ح        h                    Kaf            ك        K
Kha      خ        Kh                   Lam            ل        L
Dal      د        D                    Mim            م        M
Dhal     ذ        Dh                   Nun            ن        N
Ra       ر        R                    Ha             ه        H
Zay      ز        Z                    Waw            و        W
Sin      س        S                    Ya             ي        Y
Shin     ش        Sh                   Hamza          ء        '
Sad      ص        s                    Alif maksura   ى        A
Dad      ض        d                    Ta marbuta     ة        T
Ta       ط        t
The undiacritized form of the corpus is stored in a new file by removing the diacritics from all the
words of the corpus. It is important to note that the undiacritized form of the Holy Quran corpus
has been used in only a few NLP studies. Experimental results on undiacritized Arabic are useful because
Arabic script is mostly written without diacritics.
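Stripping the diacritics can be done with a small regular expression over the Arabic combining marks; the exact Unicode range covered here (the harakat block U+064B–U+0652 plus the superscript alif U+0670) is our assumption about which marks the corpus file contains:

```python
import re

# Arabic short vowels, tanwin, shadda and sukun (U+064B..U+0652),
# plus the superscript (dagger) alif U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def undiacritize(text):
    """Strip diacritic marks, keeping only the base letters."""
    return DIACRITICS.sub("", text)
```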
Table 2. Mapping from the 33-tag set to the 4-tag set for the Holy Quran corpus

Comprehensive tag set (33 tags)                                Simplified tag set (4 tags)
N, PN, NUM, ADJ, IMPN                                          N
V                                                              V
PRO, NEG, ACC, COND, RES, CERT, INTG, INC, VOC, RET, AMD,      P
FUT, EXH, EXP, EXL, SUR, AVR, ANS, CONJ, SUB, T, LOC,
PRON, REL, DEM
INL                                                            INL
Additional work was done within the NLTK toolkit [21] in order to process the Holy Quran corpus
files using a simplified tag set. The simplified tag set includes only 4 tags: Noun (N),
Verb (V), Particle (P) and Quranic Initial (INL). Table 2 presents the mapping criteria used to
convert the comprehensive tag set (33 tags) of the original corpus to the simplified one (4 tags).
We implemented new Python modules to integrate the newly created files of the Holy Quran corpus
into the NLTK toolkit in order to perform our experiments.
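The mapping of Table 2 reduces to a small lookup; a minimal sketch of the conversion applied to each tagged token:

```python
# Collapse the 33-tag Quranic tag set into the simplified 4-tag set of Table 2.
NOUN_TAGS = {"N", "PN", "NUM", "ADJ", "IMPN"}
PARTICLE_TAGS = {
    "PRO", "NEG", "ACC", "COND", "RES", "CERT", "INTG", "INC", "VOC",
    "RET", "AMD", "FUT", "EXH", "EXP", "EXL", "SUR", "AVR", "ANS",
    "CONJ", "SUB", "T", "LOC", "PRON", "REL", "DEM",
}

def simplify_tag(tag):
    """Map one comprehensive tag to N, V, P, or INL."""
    if tag in NOUN_TAGS:
        return "N"
    if tag == "V":
        return "V"
    if tag in PARTICLE_TAGS:
        return "P"
    return "INL"   # Quranic Initials
```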
6.2. Data-sets and evaluation
We conducted several experiments using the previous corpus. The experiments are based on
Classical Arabic in undiacritized form with the 4-tag set. Table 3 presents some results of
Arabic POS tagging using the two taggers: Taani's rule-based tagger and our proposed method.
In this table, for each sentence we consider only the ambiguous terms (misclassified and
unanalyzed words). For example, the following sentence consists of 15 words:

(3:59) "إن مثل عيسى عند ﷲ كمثل آدم خلقه من تراب ثم قال له كن فيكون"

"Indeed, the example of Jesus to Allah is like that of Adam. He created him from dust; then He
said to him, "Be," and he was."

It contains one word, "كن" (Be), which is misclassified by Taani's method, and two terms,
"عيسى" (Jesus) and "ءادم" (Adam), which are not resolved by Taani's method. However, these terms are
correctly treated by our method.
Table 3. Example of obtained results using the Rule-Based tagger and our Hybrid tagger

Term        Taani's method   Proposed method
عيسى        ?                N
ءادم        ?                N
كن          N                V
ھدى         ?                N
ءامنا       ?                V
بمؤمنين     V                N
أتجعل       ?                V
نسبح        ?                V
ءال         ?                N
يذبحون      ?                V
يستحيون     ?                V
موتكم       V                N
تشكرون      ?                V
Table 4 presents the obtained accuracy of Taani's rule-based method and of our proposed
method for different training-corpus percentages. From this table, we can see that our
proposed POS tagger outperforms the rule-based tagger in terms of accuracy. The size of the training
data is increased over different variations of the training corpus. Our method
achieves better performance: about 97% vs 94% for Taani's method.
Table 4: Obtained accuracies of the Rule-Based tagger and our Hybrid POS tagger for different values of
training

% training   Taani's method   Our method
30%          94.00%           97.00%
70%          94.40%           97.40%
80%          94.20%           97.60%
90%          94.40%           97.60%
When we train the tagger on a larger amount of data, we get more accurate tagging results. Thus, we
conclude that the results depend on the fraction of the data used to train the tagger. Considering
the size of the corpus used for the experiments, our tagger achieved remarkable
accuracy on the Holy Quran corpus compared to Taani's method.
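The evaluation loop behind Table 4 can be sketched as below; `train_and_tag` stands for training either tagger on the first fraction of the corpus and returning its tagging function (a hypothetical interface, not the actual experiment code):

```python
def accuracy(tag_fn, gold_sentences):
    """Token-level accuracy: fraction of words whose predicted tag matches gold."""
    correct = total = 0
    for sent in gold_sentences:
        words = [w for w, _ in sent]
        gold = [t for _, t in sent]
        pred = tag_fn(words)
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total

def evaluate_splits(corpus, train_and_tag, fractions=(0.3, 0.7, 0.8, 0.9)):
    """Train on the first fraction of the corpus, test on the remainder."""
    results = {}
    for f in fractions:
        cut = int(len(corpus) * f)
        tag_fn = train_and_tag(corpus[:cut])   # returns a tag(words) function
        results[f] = accuracy(tag_fn, corpus[cut:])
    return results
```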
7. CONCLUSION AND FUTURE WORK
In this paper, we have presented an efficient and accurate part-of-speech (POS) tagging technique
for the Arabic language using a statistical approach. The developed tagger employs an approach that
combines Taani's rule-based method with Hidden Markov Models (HMMs). A suitable
architecture of the HMM model was specified based on the sentence structure, which allows us to
deal correctly with the ambiguity behind the misclassified and unanalyzed words of the Arabic rule-based method. To evaluate the accuracy of the proposed POS tagger, a series of experiments was
conducted using the Holy Quran corpus containing 77 430 terms of undiacritized Classical
Arabic.
Further tests on more datasets should be conducted to evaluate the real performance of this
approach. The accuracy of about 97.6% represents a very good result for our method compared to
Taani's rule-based method. We note that the accuracy slightly increases with the number of
words in the training corpus. In the future, we plan to improve the tagging accuracy on unknown
words by using other training corpora, and to apply our POS tagger to the extraction of Multi-Word Terms.
REFERENCES
[1] http://en.wikipedia.org/wiki/Part-of-speech_tagging.
[2] L. Van Guilder (1995) "Automated Part of Speech Tagging: A Brief Overview", Handout for LING361, Georgetown University.
[3] H. Halteren, J. Zavrel & W. Daelemans (2001) "Improving Accuracy in NLP Through Combination of Machine Learning Systems", Computational Linguistics, 27(2): 199-229.
[4] S. J. DeRose (1990) "Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages", PhD Dissertation, Providence, RI: Brown University Department of Cognitive and Linguistic Sciences.
[5] N. Kumar, A. Dalal & U. Sawant (2006) "Hindi Part of Speech Tagging and Chunking", NLPAI Machine Learning Contest.
[6] M. Mohseni, H. Motalebi, B. Minaei-Bidgoli & M. Shokrollahi-Far (2008) "A Farsi Part-of-Speech Tagger Based on Markov Models", In the Proceedings of the ACM Symposium on Applied Computing, Brazil.
[7] S. Jabbari & B. Allison (2007) "Persian Part of Speech Tagging", In the Proceedings of the Workshop on Computational Approaches to Arabic Script-Based Languages (CAASL-2), USA.
[8] E. Brill (1995) "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging", Computational Linguistics, USA.
[9] M. Hepple (2000) "Independence and Commitment: Assumptions for Rapid Training and Execution of Rule-based Part-of-Speech Taggers", In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL), Hong Kong.
[10] T. Brants (2000) "TnT – a Statistical Part-of-Speech Tagger", In the Proceedings of the 6th Conference on Applied Natural Language Processing (ANLP), USA.
[11] K. Megerdoomian (2004) "Developing a Persian Part-of-Speech Tagger", In the Proceedings of the First Workshop on Persian Language and Computer, Iran.
[12] S. Khoja (2001) "APT: Arabic Part-of-Speech Tagger", Proceedings of the Student Workshop at the 2nd Meeting of the NAACL (NAACL'01), Carnegie Mellon University, Pennsylvania, pp. 1-6. http://zeus.cs.pacificu.edu/shereen/NAACL.pdf
[13] A. Freeman (2001) "Brill's POS Tagger and a Morphology Parser for Arabic", In ACL'01 Workshop on Arabic Language Processing.
[14] M. Maamouri & C. Cieri (2002) "Resources for Arabic Natural Language Processing at the LDC", Proceedings of the International Symposium on the Processing of Arabic, Tunisia, pp. 125-146.
[15] M. Diab, K. Hacioglu & D. Jurafsky (2004) "Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks", Proc. of HLT-NAACL'04: 149-152.
[16] M. Banko & R. C. Moore (2004) "Part of Speech Tagging in Context", Proc. of the 20th International Conference on Computational Linguistics, Switzerland.
[17] Y. Tlili-Guiassa (2006) "Hybrid Method for Tagging Arabic Text", Journal of Computer Science 2(3): 245-248.
[18] Y.-S. Lee, K. Papineni & S. Roukos (2003) "Language Model Based Arabic Word Segmentation", in Proceedings of the Annual Meeting of the Association for Computational Linguistics, Japan, pp. 399-406.
[19] A. T. Al-Taani & S. Abu-Al-Rub (2009) "A Rule-Based Approach for Tagging Non-Vocalized Arabic Words", The International Arab Journal of Information Technology, 6(3): 320-328.
[20] T. Brants (2000) "TnT: A Statistical Part of Speech Tagger", Proceedings of the 6th Conference on Applied Natural Language Processing, Apr. 29 - May 04, Association for Computational Linguistics, Morristown, New Jersey, USA, pp. 224-231.
[21] NLTK, Natural Language Toolkit. http://www.nltk.org/Home
[22] Quranic Arabic Corpus: http://corpus.quran.com
[23] Quran Tagset: http://corpus.quran.com/documentation/tagset.jsp
[24] N. Habash & O. Rambow (2005) "Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop", in Proceedings of the Annual Meeting of the Association for Computational Linguistics, Michigan, pp. 573-580.
[25] http://sibawayh.emi.ac.ma/web/s/?q=node/79
[26] F. Al Shamsi & A. Guessoum (2006) "A Hidden Markov Model-Based POS Tagger for Arabic", 8es Journées internationales d'Analyse statistique des Données Textuelles (JADT).
[27] M. Albared & O. Nazlia (2010) "Automatic Part of Speech Tagging for Arabic: An Experiment Using Bigram Hidden Markov Model", Springer-Verlag Berlin Heidelberg, LNAI 6401, pp. 361-370.
[28] Y. O. Mohamed Elhadj (2009) "Statistical Part-of-Speech Tagger for Traditional Arabic Texts", Journal of Computer Science 5(11): 794-800.
Authors
Meryeme Hadni is a PhD student in the Laboratory of Computer Science and Modelization, Faculty of
Sciences, Sidi Mohamed Ben Abdellah University (USMBA), Fez, Morocco. She has
presented papers at various national and international conferences.
Pr. Abdelmonaime Lachkar received his PhD degree from USMBA, Morocco, in
2004. He is a Professor and Computer Engineering Program Coordinator at E.N.S.A, Fez,
and the Head of the Systems Architecture and Multimedia Team (LSIS Laboratory) at Sidi
Mohamed Ben Abdellah University, Fez, Morocco. His current research interests include
Arabic Natural Language Processing (ANLP), Arabic web document clustering and
categorization, Arabic information retrieval systems, Arabic text summarization, Arabic
ontology development and usage, and Arabic Semantic Search Engines (SSEs).
Pr. Said Alaoui Ouatik is working as a Professor in the Department of Computer Science,
Faculty of Science Dhar El Mahraz (FSDM), Fez, Morocco. His research interests include
high-dimensional indexing and content-based retrieval, Arabic document categorization,
and 2D/3D shape indexing and retrieval in large 3D object databases.
Mohammed Meknassi received his PhD degree in computer science from the University of Montreal
in 1993. Since 1993, he has been a professor of computer science. He teaches and conducts
research in the following fields: parallel processing, distributed computing, operating
systems and image processing. He is a member of the Systems, Image and
Multimedia (SIM) research unit attached to the Computer Science, Statistics and Quality
(LISQ) laboratory. He is the head of the Computer Science Department at the Faculty of Sciences Dhar
El Mahraz, Fez.