Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivative languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots could be extracted from the same word. Distributional semantics, on the other hand, is a powerful co-occurrence model: it captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, at least a 9.4% improvement over other stemmers.
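The SPMI weighting mentioned above can be sketched as follows. The paper's exact smoothing scheme is not given here, so simple add-k (Laplace) smoothing of the co-occurrence counts is assumed for illustration:

```python
import math
from collections import Counter

def spmi(pairs, k=1.0):
    """Smoothed PMI over (word, context) co-occurrence pairs.
    Add-k smoothing keeps rare and unseen pairs from producing
    extreme scores, which is the usual motivation for smoothing PMI."""
    word_c, ctx_c, pair_c = Counter(), Counter(), Counter()
    for w, c in pairs:
        word_c[w] += 1
        ctx_c[c] += 1
        pair_c[(w, c)] += 1
    n = len(pairs)
    vw, vc = len(word_c), len(ctx_c)

    def score(w, c):
        p_wc = (pair_c[(w, c)] + k) / (n + k * vw * vc)
        p_w = (word_c[w] + k) / (n + k * vw)
        p_c = (ctx_c[c] + k) / (n + k * vc)
        return math.log(p_wc / (p_w * p_c))

    return score
```

A word-context pair that co-occurs often scores above one that rarely does, which is what lets the model pick the contextually plausible root among ambiguous candidates.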
This document summarizes current research on morphological analysis techniques for the Assamese language. It discusses prior work using rule-based and unsupervised methods for morphological analysis of several Indian languages, including Hindi, Bengali, Punjabi, Marathi, Tamil, Malayalam, Kannada, and Assamese. For Assamese specifically, it describes several studies that used suffix stripping and rule-based approaches to develop morphological analyzers, as well as some initial work on unsupervised techniques. The document concludes that while most existing work on Assamese has used supervised suffix stripping methods, unsupervised techniques show promise but have not been fully explored.
The document discusses text normalization, which involves segmenting and standardizing text for natural language processing. It describes tokenizing text into words and sentences, lemmatizing words into their root forms, and standardizing formats. Tokenization involves separating punctuation, normalizing word formats, and segmenting sentences. Lemmatization determines that words have the same root despite surface differences. Sentence segmentation identifies sentence boundaries, which can be ambiguous without context. Overall, text normalization prepares raw text for further natural language analysis.
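The tokenization and sentence-segmentation steps described above can be sketched with regular expressions. This is a toy illustration, not a production normalizer; real segmenters need context to handle cases like abbreviations:

```python
import re

def tokenize(text):
    # Separate punctuation from words: a word is a run of word
    # characters; anything else that is not whitespace is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def segment_sentences(text):
    # Naive boundary rule: split after . ! ? followed by whitespace.
    # Ambiguous cases ("Dr. Smith") would be wrongly split here,
    # which is exactly the ambiguity the text mentions.
    return re.split(r"(?<=[.!?])\s+", text.strip())
```

For example, `tokenize("Don't stop.")` separates the apostrophe and the final period into their own tokens, and `segment_sentences` splits a two-sentence string into two items.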
This document summarizes research that has been done on computational morphology for the Odia language. It begins with an abstract that outlines how morphological analysis, generation, and parsing are important tools for natural language processing. The document then reviews different works that have developed morphological analyzers and generators for Odia. It describes various methods that have been used, including suffix stripping, finite state transducers, two-level morphology, corpus-based approaches, and paradigm-based approaches. Finally, it outlines several applications of morphology like machine translation, spelling checking, and part-of-speech tagging.
Machine Translation (MT) refers to the use of computers for the task of translating
automatically from one language to another. The differences between languages and
especially the inherent ambiguity of language make MT a very difficult problem. Traditional
approaches to MT have relied on humans supplying linguistic knowledge in the form of rules
to transform text in one language to another. Given the vastness of language, this is a highly
knowledge intensive task. Statistical MT is a radically different approach that automatically
acquires knowledge from large amounts of training data. This knowledge, which is typically
in the form of probabilities of various language features, is used to guide the translation
process. This report provides an overview of MT techniques, and looks in detail at the basic
statistical model.
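The basic statistical model the report examines is commonly formulated as a noisy channel. Under the standard Bayes decomposition, translating a foreign sentence f into English e means searching for

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

where $P(f \mid e)$ is the translation model, whose probabilities are estimated from parallel training data, and $P(e)$ is the language model, estimated from monolingual data. This factoring is what lets the system acquire its knowledge automatically from corpora rather than from hand-written rules.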
Finite-state morphological parsing uses finite-state transducers to parse words into their morphological components like stems and affixes. It requires a lexicon of stems and affixes, morphotactic rules describing valid morpheme combinations, and orthographic rules for spelling changes. The parser is built as a cascade of finite-state automata representing the lexicon, morphotactics and spelling rules. It maps surface word forms onto their underlying lexical representations including stems and morphological features. This allows morphological analysis of both regular and irregular forms.
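The surface-to-lexical mapping described above can be approximated procedurally. The following sketch mimics the behavior of the cascade for a tiny hypothetical English noun lexicon; a real parser would compose lexicon, morphotactic, and spelling-rule transducers rather than test suffixes in code:

```python
# Toy lexicon (stems) and two morphotactic/orthographic patterns:
# plain plural "s" and the e-insertion spelling rule (fox + s -> foxes).
LEXICON = {"cat", "fox", "dog"}

def parse(word):
    """Map a surface form to (stem, features), or None if no analysis."""
    if word.endswith("es") and word[:-2] in LEXICON:
        return (word[:-2], "N+Pl")   # spelling rule applied
    if word.endswith("s") and word[:-1] in LEXICON:
        return (word[:-1], "N+Pl")
    if word in LEXICON:
        return (word, "N+Sg")
    return None
```

So "foxes" maps to the lexical form fox with features N+Pl, while an out-of-lexicon string gets no analysis, mirroring how the transducer cascade accepts only valid morpheme combinations.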
This document describes a rule-based machine translation system for translating English text to Telugu. It discusses the challenges of developing such a system, including differences in grammar between the two languages. An algorithm is proposed that uses rules, probabilities, and rough sets to classify sentences and select the best word translations. The system works by tokenizing English sentences, tagging the words with parts of speech, looking up word translations in a bilingual dictionary, and concatenating the Telugu words to form the output sentence.
Shallow parser for Hindi language with an input from a transliterator, by Shashank Shisodia
This document summarizes a student project to develop a shallow parser for Hindi language with input from a transliterator. The plan is to create a transliterator to convert Roman script to Devanagari, generate a lexicon from corpus analysis, develop a morphological analyzer using finite state transducers, and implement a shallow parser using context free grammar. The system architecture and flow chart are presented. In conclusion, the document notes that shallow parsing is needed to build full parsers for Hindi and transliteration is important for translating names and terms across languages with different alphabets.
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM (ijnlc)
This article describes the morphological analysis of standard Arabic for natural language processing, as part of an electronic dictionary-construction phase. A model of fully inflected 3-letter verbs is formalized based on a linguistic classification, using the NooJ platform. The classification gives certain representative verbs that are considered as lemmas; these verbs form our dictionary entries. They are also conjugated according to our inflection paradigm, relying on certain specific morphological properties. This dictionary will be considered an Arabic resource that will help NLP applications and the NooJ platform to analyse sophisticated Arabic corpora.
Experiments with Different Models of Statistical Machine Translation, by Khyati Gupta
This document summarizes an experiment conducted on statistical machine translation models. The experiment compared phrase-based, hierarchical, and syntax-based statistical machine translation models. The document outlines the process of data preparation including tokenization, alignment, and training on the Moses platform. It then describes how each model - phrase-based, hierarchical, and syntax-based - works, including rule extraction for the hierarchical model. The document concludes by discussing the advantage of the hierarchical model and how it was able to automatically annotate Hindi data.
This document discusses parsing with context-free grammars. It begins by introducing context-free grammars and their use in parsing sentences. It then discusses parsing as a search problem, and presents top-down and bottom-up parsing algorithms. Top-down parsing builds trees from the root node down, while bottom-up parsing builds trees from the leaves up. Both approaches have advantages and disadvantages related to efficiency. The document also introduces probabilistic context-free grammars, which augment grammars with rule probabilities, and discusses how these can be used for disambiguation.
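The disambiguation idea above can be made concrete: the probability of a parse tree is the product of the probabilities of the rules used to build it, and the parser prefers the tree with the higher product. A toy sketch with invented rule probabilities (lexical rule probabilities are omitted for brevity):

```python
from math import prod

# Hypothetical PCFG rule probabilities, P(rhs | lhs).
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V", "NP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.4,
}

def tree_prob(tree):
    """Probability of a parse tree, given as (label, child, child, ...).
    A node whose children are all strings is a lexical node; its
    probability is taken as 1.0 in this sketch."""
    label, *children = tree
    if all(isinstance(c, str) for c in children):
        return 1.0
    rhs = tuple(c[0] for c in children)
    return RULES[(label, rhs)] * prod(tree_prob(c) for c in children)
```

Comparing the two attachments of a prepositional phrase (to the VP versus inside the NP) then reduces to comparing two products of rule probabilities, which is how a PCFG disambiguates.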
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge..., by Guy De Pauw
This document discusses tokenization and computational verb morphology for Setswana, a Bantu language with a disjunctive orthography. It presents an approach that combines two tokenization transducers and a morphological analyzer to effectively tokenize Setswana text. The approach was tested on a short Setswana text and achieved 93.6% accuracy between the automatically and hand-tokenized texts. While mostly successful, some issues remained around longest matches that were not valid tokens or did not allow morphological analysis. Overall, the approach demonstrated that a precise tokenizer and morphological analyzer can largely resolve the challenges of Setswana's disjunctive writing system.
13. Constantin Orasan (UoW) Natural Language Processing for Translation (RIILP)
This document discusses how natural language processing (NLP) techniques can help improve machine translation (MT). It describes some of the linguistic challenges in MT, such as ambiguity at the lexical, syntactic, semantic and pragmatic levels. It then discusses how various NLP tasks, such as tokenization, word sense disambiguation, and handling of named entities could enhance MT systems. Several studies that have successfully integrated NLP techniques like word sense disambiguation into statistical machine translation systems are also summarized.
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE) (ijnlc)
Extracting key phrases from documents is a common task in many applications. In general, the Noun Phrase Extractor consists of three modules: tokenization, part-of-speech tagging, and noun phrase identification. These are used as the three main steps in building the new system, ANPE. This paper aims at picking Arabic Noun Phrases from a corpus of documents; the relevant criteria (Recall and Precision) are used as evaluation measures. On the one hand, when using NPs rather than single terms, the system yields more relevant documents among those retrieved; on the other hand, it gives low precision because the number of retrieved documents decreases. At the end, the researchers conclude and recommend improvements for more effective and efficient research in the future.
Reading 3.5: Word normalization in Twitter with finite-state transducers, by Matias Menendez
This document describes a system for normalizing Spanish tweets using finite-state transducers. The system consists of three main components: 1) An analyzer that tokenizes tweets and identifies standard words, 2) Transducers that generate candidate words for out-of-vocabulary tokens by implementing linguistic models of spelling variation, 3) A statistical language model that selects the most probable sequence of words. The transducers are based on weighted finite-state automata and implement models of texting features such as logograms, shortenings, and non-standard spellings. An evaluation shows the system is effective at normalizing Spanish tweets.
Machine translation systems can translate text from one language to another. Moses is an open-source statistical machine translation toolkit that is commonly used. It takes parallel text corpora to train models for translation. The Moses training process involves word alignment, phrase extraction, and language model building. The Moses decoder then translates new text using these statistical models.
Welcome to International Journal of Engineering Research and Development (IJERD), by IJERD Editor
In this paper, a novel hierarchical Persian stemming approach based on the part-of-speech (POS) of the word in a sentence is presented. The implemented stemmer includes hash tables and several deterministic finite automata (DFA) at the different levels of its hierarchy for removing the prefixes and suffixes of words. We had two intentions in using hash tables in our method. The first is that the DFA do not support some special words, so a hash table can partly solve this problem. The second is to speed up the implemented stemmer by omitting the time the DFA need. Because of its hierarchical organization, this method is fast and flexible. Our experiments on test sets from the Hamshahri Collection and Security News from the ICTna.ir site show that our method has an average accuracy of 95.37%, which improves further when the method is used on a test set with common topics.
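The two-level idea described above, an O(1) hash lookup for exceptional words followed by automaton-style affix stripping for regular ones, can be sketched as follows. The exception table and affix lists here are invented English placeholders purely for illustration, not the paper's actual Persian resources:

```python
# Level 1: hash table for words the automata cannot handle.
EXCEPTIONS = {"went": "go", "better": "good"}

# Level 2: affix stripping written as loops; in the paper this role
# is played by DFA walking the word character by character.
PREFIXES = ["un", "re"]
SUFFIXES = ["ing", "ed", "s"]

def stem(word):
    if word in EXCEPTIONS:          # constant-time exception lookup
        return EXCEPTIONS[word]
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) > 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) > 2:
            word = word[:-len(s)]
            break
    return word
```

The length guards keep the stripper from reducing a word below a minimal stem length, a common safeguard in hierarchical stemmers.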
Parts-of-speech can be divided into closed classes and open classes. Closed classes have a fixed set of members like prepositions, while open classes like nouns and verbs are continually changing with new words being created. Parts-of-speech tagging is the process of assigning a part-of-speech tag to each word using statistical models trained on tagged corpora. Hidden Markov Models are commonly used, where the goal is to find the most probable tag sequence given an input word sequence.
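Finding the most probable tag sequence under such an HMM is done with the Viterbi algorithm. A minimal sketch, with all probabilities passed in as plain dictionaries (the toy numbers in any usage are assumptions, not trained values):

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence under a bigram HMM.
    trans_p[t1][t2] = P(t2 | t1); emit_p[t][w] = P(w | t)."""
    # Each cell holds (best probability so far, best tag path so far).
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            p, path = max(
                (V[-1][pt][0] * trans_p[pt][t] * emit_p[t].get(w, 0.0),
                 V[-1][pt][1])
                for pt in tags
            )
            row[t] = (p, path + [t])
        V.append(row)
    return max(V[-1].values())[1]
```

The dynamic program keeps only the best path into each tag at each position, so the search over tag sequences is linear in sentence length rather than exponential.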
Developing an architecture for translation engine using ontologyAlexander Decker
This document proposes an architecture for machine translation that utilizes WordNet ontology and transition network grammars. It aims to improve translation accuracy by using WordNet as a syntactic guide during parsing and determining the grammatical structure of text. It then maps between the source and target languages. The architecture is described at a high level to provide a framework for future integration with other techniques to enhance machine translation.
This document discusses methods for evaluating language models, including intrinsic and extrinsic evaluation. Intrinsic evaluation involves measuring a model's performance on a test set using metrics like perplexity, which is based on how well the model predicts the test set. Extrinsic evaluation embeds the model in an application and measures the application's performance. The document also covers techniques for dealing with unknown words like replacing low-frequency words with <UNK> and estimating its probability from training data.
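Perplexity, as described, follows directly from the per-token probabilities the model assigns to the test set; a minimal sketch:

```python
import math

def perplexity(probs):
    """Perplexity of a test set from per-token model probabilities:
    PP = exp(-(1/N) * sum(log p_i)). Lower is better, i.e. the model
    is less 'surprised' by the test set."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)
```

For instance, a model that assigns every token probability 1/4 has perplexity 4, matching the intuition that perplexity is the effective branching factor.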
SURVEY ON MACHINE TRANSLITERATION AND MACHINE LEARNING MODELS (ijnlc)
Globalization and the growth of Internet users demand that almost all internet-based applications support local languages. Support for local languages can be provided in all internet-based applications by means of machine transliteration and machine translation. This paper provides a thorough survey of machine transliteration models and the machine learning approaches used for machine transliteration over a period of more than two decades, for internationally used languages as well as Indian languages. The survey shows that the linguistic approach provides better results for closely related languages, and probability-based statistical approaches are good when one of the languages is phonetic and the other is non-phonetic. Better accuracy can be achieved only by using hybrid and combined models.
This document discusses Lexical Functional Grammar (LFG) and Generalized Phrase Structure Grammar (GPSG). LFG was developed in the 1970s and emphasizes analyzing phenomena in lexical and functional terms. It uses two levels of structure: c-structure, which is a tree structure, and f-structure, which captures grammatical functions. GPSG was developed in 1985 and is confined to context-free phrase structure rules. It uses immediate dominance and linear precedence rules.
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION (kevig)
Phonetic typing using the English alphabet has become widely popular nowadays for social media and chat services. As a result, a text containing various English and Bangla words and phrases has become increasingly common. Existing transliteration tools display poor performance for such texts. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English words and phonetic typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based techniques. Experimental results confirm superiority of THT as it significantly outperforms the benchmark transliteration tool.
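The lookup-then-rules staging described above can be sketched as follows. The dictionary entries and character rules here are invented toy mappings for illustration, not THT's actual resources, and the exact staging of THT's three stages is not reproduced:

```python
# Stage 1: exact dictionary lookup for known words.
DICTIONARY = {"bhalo": "ভালো", "bangla": "বাংলা"}

# Stage 2: character-level rules, applied longest-match-first,
# for phonetic typed words missing from the dictionary.
RULES = [("kh", "খ"), ("b", "ব"), ("a", "া"), ("l", "ল"), ("o", "ো")]

def transliterate(word):
    if word in DICTIONARY:
        return DICTIONARY[word]
    out, i = "", 0
    while i < len(word):
        for src, tgt in sorted(RULES, key=lambda r: -len(r[0])):
            if word.startswith(src, i):
                out += tgt
                i += len(src)
                break
        else:
            out += word[i]   # pass unknown characters through unchanged
            i += 1
    return out
```

The dictionary stage keeps frequent words accurate, while the rule stage gives graceful coverage of novel phonetic spellings, which is the motivation for hybrid over purely rule-based transliteration.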
The document discusses context-free grammars for modeling English syntax. It introduces key concepts like constituency, grammatical relations, and subcategorization. Context-free grammars use rules and symbols to generate sentences. They consist of terminal symbols (words), non-terminal symbols (phrases), and rules to expand non-terminals. Context-free grammars can model syntactic knowledge and generate sentences in both a top-down and bottom-up manner through parsing.
Hybrid part-of-speech tagger for non-vocalized Arabic text (ijnlc)
The document presents a hybrid part-of-speech tagging method for Arabic text that combines rule-based and statistical approaches. Rule-based tagging alone can misclassify words and leave some untagged, so the method integrates it with a Hidden Markov Model tagger. The hybrid approach is evaluated on two Arabic corpora and achieves accuracy rates of 97.6% and 98%, outperforming the individual rule-based and HMM taggers.
Natural Language Processing: Parts of speech tagging, its classes, and how to ..., by Rajnish Raj
Part of speech (POS) tagging is the process of assigning a part of speech tag like noun, verb, adjective to each word in a sentence. It involves determining the most likely tag sequence given the probabilities of tags occurring before or after other tags, and words occurring with certain tags. POS tagging is the first step in many NLP applications and helps determine the grammatical role of words. It involves calculating bigram and lexical probabilities from annotated corpora to find the tag sequence with the highest joint probability.
Statistically-Enhanced New Word Identification, by Andi Wu
This document discusses a method for identifying new words in Chinese text using a combination of rule-based and statistical approaches. Candidate character strings are selected as potential new words based on their independent word probability being below a threshold. Parts of speech are then assigned to candidate strings by examining the part of speech patterns of their component characters and comparing them to existing words in a dictionary to determine the most likely part of speech based on word formation patterns in Chinese. This hybrid approach avoids the overgeneration of rule-based systems and data sparsity issues of purely statistical approaches.
Machine Translation System: Chhattisgarhi to Hindi, by Padma Metta
The document discusses trends in machine translation, including different approaches such as direct machine translation, rule-based machine translation, and corpus-based machine translation. It provides examples of challenges in machine translation like word order, word sense disambiguation, and idioms. The document also describes the proposed methodology for Chhattisgarhi to Hindi machine translation, including lexical analysis, syntactic analysis, and rule-based conversion between the languages.
MACHINE TRANSLATION DEVELOPMENT FOR INDIAN LANGUAGES AND ITS APPROA... (ijnlc)
This paper presents a survey of machine translation systems for Indian regional languages. Machine translation is one of the central areas of Natural Language Processing (NLP). Machine translation (henceforth referred to as MT) is important for breaking the language barrier and facilitating inter-lingual communication. For a multilingual country like India, the largest democracy in the world, there is a great need for automatic machine translation systems. With the advent of Information Technology, many documents and web pages are appearing in local languages, so good MT systems are needed to address these issues, to establish proper communication between state and union governments, and to exchange information among the people of different states. This paper focuses on the different machine translation projects carried out in India, along with their features and domains.
TURN SEGMENTATION INTO UTTERANCES FOR ARABIC SPONTANEOUS DIALOGUES ... (ijnlc)
Text segmentation is an essential processing task for many Natural Language Processing (NLP) tasks, such as text summarization, text translation, and dialogue language understanding, among others. Turn segmentation is considered the key player in the dialogue understanding task for building automatic Human-Computer systems. In this paper, we introduce a novel approach to turn segmentation into utterances for Egyptian spontaneous dialogues and Instant Messages (IM), using a Machine Learning (ML) approach, as part of the task of automatically understanding Egyptian spontaneous dialogues and IM. Due to the lack of an Egyptian dialect dialogue corpus, the system is evaluated on our corpus, which includes 3001 turns collected, segmented, and annotated manually from Egyptian call centers. The system achieves an F1 score of 90.74% and an accuracy of 95.98%.
Experiments with Different Models of Statistcial Machine Translationkhyati gupta
This document summarizes an experiment conducted on statistical machine translation models. The experiment compared phrase-based, hierarchical, and syntax-based statistical machine translation models. The document outlines the process of data preparation including tokenization, alignment, and training on the Moses platform. It then describes how each model - phrase-based, hierarchical, and syntax-based - works, including rule extraction for the hierarchical model. The document concludes by discussing the advantage of the hierarchical model and how it was able to automatically annotate Hindi data.
This document discusses parsing with context-free grammars. It begins by introducing context-free grammars and their use in parsing sentences. It then discusses parsing as a search problem, and presents top-down and bottom-up parsing algorithms. Top-down parsing builds trees from the root node down, while bottom-up parsing builds trees from the leaves up. Both approaches have advantages and disadvantages related to efficiency. The document also introduces probabilistic context-free grammars, which augment grammars with rule probabilities, and discusses how these can be used for disambiguation.
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...Guy De Pauw
This document discusses tokenization and computational verb morphology for Setswana, a Bantu language with a disjunctive orthography. It presents an approach that combines two tokenization transducers and a morphological analyzer to effectively tokenize Setswana text. The approach was tested on a short Setswana text and achieved 93.6% accuracy between the automatically and hand-tokenized texts. While mostly successful, some issues remained around longest matches that were not valid tokens or did not allow morphological analysis. Overall, the approach demonstrated that a precise tokenizer and morphological analyzer can largely resolve the challenges of Setswana's disjunctive writing system.
13. Constantin Orasan (UoW) Natural Language Processing for TranslationRIILP
This document discusses how natural language processing (NLP) techniques can help improve machine translation (MT). It describes some of the linguistic challenges in MT, such as ambiguity at the lexical, syntactic, semantic and pragmatic levels. It then discusses how various NLP tasks, such as tokenization, word sense disambiguation, and handling of named entities could enhance MT systems. Several studies that have successfully integrated NLP techniques like word sense disambiguation into statistical machine translation systems are also summarized.
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)ijnlc
Extracting key phrases from documents is a common task in many applications. In general: The Noun
Phrase Extractor consists of three modules: tokenization; part-of-speech tagging; noun phrase
identification. These will be used as three main steps in building the new system ANPE, This paper aims at
picking Arabic Noun Phrases from a corpus of documents, Relevant criteria (Recall and Precision), will be
used as evaluation measure. On the one hand, when using NPs rather than using single terms, the system
yields more relevant documents from the retrieved ones, on the other hand, it gave low precision because
number of the retrieved documents will be decreased. At the researchers conclude and recommend
improvements for more effective and efficient research in the future.
Lectura 3.5 word normalizationintwitter finitestate_transducersMatias Menendez
This document describes a system for normalizing Spanish tweets using finite-state transducers. The system consists of three main components: 1) An analyzer that tokenizes tweets and identifies standard words, 2) Transducers that generate candidate words for out-of-vocabulary tokens by implementing linguistic models of spelling variation, 3) A statistical language model that selects the most probable sequence of words. The transducers are based on weighted finite-state automata and implement models of texting features such as logograms, shortenings, and non-standard spellings. An evaluation shows the system is effective at normalizing Spanish tweets.
Machine translation systems can translate text from one language to another. Moses is an open-source statistical machine translation toolkit that is commonly used. It takes parallel text corpora to train models for translation. The Moses training process involves word alignment, phrase extraction, and language model building. The Moses decoder then translates new text using these statistical models.
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
call for paper 2012, hard copy of journal, research paper publishing, where to publish research paper,
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals
In this paper, a novel hierarchical Persian stemming approach based on the part-of-speech (POS) of the word in a sentence is presented. The implemented stemmer uses hash tables and several deterministic finite automata (DFA) at different levels of its hierarchy to remove the prefixes and suffixes of words. We had two reasons for using hash tables: first, the DFAs do not handle some special words, and hash tables partly solve this problem; second, they speed up the stemmer by avoiding the time the DFAs require. Because of its hierarchical organization, the method is fast and flexible. Our experiments on test sets from the Hamshahri Collection and security news from the ICTna.ir site show that the method has an average accuracy of 95.37%, which improves further on a test set with common topics.
Parts-of-speech can be divided into closed classes and open classes. Closed classes have a fixed set of members like prepositions, while open classes like nouns and verbs are continually changing with new words being created. Parts-of-speech tagging is the process of assigning a part-of-speech tag to each word using statistical models trained on tagged corpora. Hidden Markov Models are commonly used, where the goal is to find the most probable tag sequence given an input word sequence.
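HMM tagging of the kind described above is typically decoded with the Viterbi algorithm, which finds the tag sequence maximizing the product of transition and emission probabilities. A minimal sketch follows; the tag set and all probabilities are toy values invented for illustration, not taken from any tagged corpus.

```python
# Viterbi decoding for a toy HMM POS tagger: find the tag sequence that
# maximizes P(tags) * P(words | tags). All probabilities below are
# invented for illustration only.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words`."""
    # best[i][t] = probability of the best path ending in tag t at position i
    best = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (best[i - 1][p] * trans_p[p][t] * emit_p[t].get(words[i], 1e-6), p)
                for p in tags
            )
            best[i][t], back[i][t] = prob, prev
    # Trace back from the best final tag.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

tags = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"dogs": 0.4, "fish": 0.6}, "V": {"fish": 0.3, "swim": 0.7}}

print(viterbi(["dogs", "fish"], tags, start_p, trans_p, emit_p))  # ['N', 'V']
```

The small constant 1e-6 stands in for proper smoothing of unseen word/tag pairs; a real tagger would estimate these probabilities from a tagged corpus.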
Developing an architecture for translation engine using ontology Alexander Decker
This document proposes an architecture for machine translation that utilizes WordNet ontology and transition network grammars. It aims to improve translation accuracy by using WordNet as a syntactic guide during parsing and determining the grammatical structure of text. It then maps between the source and target languages. The architecture is described at a high level to provide a framework for future integration with other techniques to enhance machine translation.
This document discusses methods for evaluating language models, including intrinsic and extrinsic evaluation. Intrinsic evaluation involves measuring a model's performance on a test set using metrics like perplexity, which is based on how well the model predicts the test set. Extrinsic evaluation embeds the model in an application and measures the application's performance. The document also covers techniques for dealing with unknown words like replacing low-frequency words with <UNK> and estimating its probability from training data.
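Perplexity, the intrinsic metric mentioned above, is the exponential of the average negative log-probability the model assigns to the test tokens. A minimal sketch for a unigram model, with toy probabilities invented for illustration and unseen words mapped to <UNK> as described:

```python
import math

# Perplexity = exp(-1/N * sum(log p(w))) over the N test tokens.
# Toy unigram probabilities; unseen words fall back to <UNK>.

def perplexity(test_tokens, probs):
    n = len(test_tokens)
    log_sum = sum(math.log(probs.get(w, probs["<UNK>"])) for w in test_tokens)
    return math.exp(-log_sum / n)

probs = {"the": 0.5, "cat": 0.25, "<UNK>": 0.25}
print(round(perplexity(["the", "cat", "the", "zebra"], probs), 3))  # 2.828
```

Lower perplexity means the model predicts the test set better; here "zebra" is unseen and is scored with the <UNK> probability estimated at training time.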
SURVEY ON MACHINE TRANSLITERATION AND MACHINE LEARNING MODELS ijnlc
Globalization and the growth of Internet users demand that almost all internet-based applications support local languages. Local-language support can be provided in internet-based applications by means of machine transliteration and machine translation. This paper provides a thorough survey of machine transliteration models and the machine learning approaches used for machine transliteration over more than two decades, for internationally used languages as well as Indian languages. The survey shows that linguistic approaches provide better results for closely related languages, while probability-based statistical approaches are good when one of the languages is phonetic and the other is non-phonetic. Better accuracy can be achieved only by using hybrid and combined models.
This document discusses Lexical Functional Grammar (LFG) and Generalized Phrase Structure Grammar (GPSG). LFG was developed in the 1970s and emphasizes analyzing phenomena in lexical and functional terms. It uses two levels of structure: c-structure, which is a tree structure, and f-structure, which captures grammatical functions. GPSG was developed in 1985 and is confined to context-free phrase structure rules. It uses immediate dominance and linear precedence rules.
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION kevig
Phonetic typing using the English alphabet has become widely popular nowadays for social media and chat services. As a result, a text containing various English and Bangla words and phrases has become increasingly common. Existing transliteration tools display poor performance for such texts. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English words and phonetic typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based techniques. Experimental results confirm superiority of THT as it significantly outperforms the benchmark transliteration tool.
The document discusses context-free grammars for modeling English syntax. It introduces key concepts like constituency, grammatical relations, and subcategorization. Context-free grammars use rules and symbols to generate sentences. They consist of terminal symbols (words), non-terminal symbols (phrases), and rules to expand non-terminals. Context-free grammars can model syntactic knowledge and generate sentences in both a top-down and bottom-up manner through parsing.
Hybrid part-of-speech tagger for non-vocalized Arabic text ijnlc
The document presents a hybrid part-of-speech tagging method for Arabic text that combines rule-based and statistical approaches. Rule-based tagging alone can misclassify words and leave some untagged, so the method integrates it with a Hidden Markov Model tagger. The hybrid approach is evaluated on two Arabic corpora and achieves accuracy rates of 97.6% and 98%, outperforming the individual rule-based and HMM taggers.
Natural Language Processing: parts of speech tagging, its classes, and how to ... Rajnish Raj
Part of speech (POS) tagging is the process of assigning a part of speech tag like noun, verb, adjective to each word in a sentence. It involves determining the most likely tag sequence given the probabilities of tags occurring before or after other tags, and words occurring with certain tags. POS tagging is the first step in many NLP applications and helps determine the grammatical role of words. It involves calculating bigram and lexical probabilities from annotated corpora to find the tag sequence with the highest joint probability.
Statistically-Enhanced New Word Identification Andi Wu
This document discusses a method for identifying new words in Chinese text using a combination of rule-based and statistical approaches. Candidate character strings are selected as potential new words based on their independent word probability being below a threshold. Parts of speech are then assigned to candidate strings by examining the part of speech patterns of their component characters and comparing them to existing words in a dictionary to determine the most likely part of speech based on word formation patterns in Chinese. This hybrid approach avoids the overgeneration of rule-based systems and data sparsity issues of purely statistical approaches.
Machine Translation System: Chhattisgarhi to Hindi Padma Metta
The document discusses trends in machine translation, including different approaches such as direct machine translation, rule-based machine translation, and corpus-based machine translation. It provides examples of challenges in machine translation like word order, word sense disambiguation, and idioms. The document also describes the proposed methodology for Chhattisgarhi to Hindi machine translation, including lexical analysis, syntactic analysis, and rule-based conversion between the languages.
MACHINE TRANSLATION DEVELOPMENT FOR INDIAN LANGUAGES AND ITS APPROA... ijnlc
This paper presents a survey of machine translation systems for Indian regional languages. Machine translation is one of the central areas of natural language processing (NLP). Machine translation (henceforth referred to as MT) is important for breaking the language barrier and facilitating inter-lingual communication. For a multilingual country like India, the largest democracy in the world, there is a great need for automatic machine translation systems. With the advent of information technology, many documents and web pages are appearing in local languages, so good MT systems are needed to address these issues, in order to establish proper communication between state and union governments and to exchange information among the people of different states. This paper focuses on the different machine translation projects carried out in India, along with their features and domains.
TURN SEGMENTATION INTO UTTERANCES FOR ARABIC SPONTANEOUS DIALOGUES ... ijnlc
Text segmentation is an essential processing task for many natural language processing (NLP) applications, such as text summarization, text translation, and dialogue language understanding, among others. Turn segmentation is considered the key player in the dialogue understanding task for building automatic human-computer systems. In this paper, we introduce a novel approach to segmenting turns into utterances for Egyptian spontaneous dialogues and instant messages (IM) using a machine learning (ML) approach, as part of the task of automatically understanding Egyptian spontaneous dialogues and IM. Due to the lack of an Egyptian-dialect dialogue corpus, the system is evaluated on our own corpus of 3001 turns, collected, segmented, and annotated manually from Egyptian call centers. The system achieves an F1 score of 90.74% and an accuracy of 95.98%.
A Novel Approach for Recognizing Text in Arabic Ancient Manuscripts ijnlc
In this paper a system for recognizing Arabic ancient manuscripts is presented. The system has been
divided into four parts. The first part is the image pre-processing where the text in the Arabic ancient
manuscript will be recognized as a collection of Arabic characters through three phases of processing. The
second part is the Arabic text analysis which consists of lexical analyzer; syntax analyzer; and semantic
analyzer. The output of this subsystem is an XML file format that represents the ancient manuscript text.
The third part is the intermediate text generation, in this part an intermediate presentation of the Arabic
text is generated from the XML text file. The fourth part of the system is the Arabic text generation, which
converts the generated text to a modern standard Arabic (MSA) language (this part has four phases: text
organizer; pre-optimizer; semantics generator; and post-optimizer).
AN HYBRID APPROACH TO WORD SENSE DISAMBIGUATION WITH AND WITH... ijnlc
Word sense disambiguation is the classification of the meaning of a word in a precise context, a tricky task in natural language processing; because of its semantic nature it is used in applications such as machine translation, information extraction and retrieval, and automatic or closed-domain question answering systems. Researchers have tried unsupervised and knowledge-based learning approaches, but such approaches have not proved very helpful. Various supervised learning algorithms have been developed, but in vain, as creating the training corpus, a tagged sense-marked corpus, is tricky. This paper presents a hybrid approach for resolving ambiguity in a sentence based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus, and the JAWS library (Java API for WordNet Searching) have been used for this purpose.
SENTIMENT ANALYSIS FOR MODERN STANDARD ARABIC AND COLLOQUIAL ijnlc
The rise of social media such as blogs and social networks has fueled interest in sentiment analysis. With the proliferation of reviews, ratings, recommendations and other forms of online expression, online opinion has turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities and manage their reputations, so many are now looking to the field of sentiment analysis. In this paper, we present a feature-based sentence-level approach for Arabic sentiment analysis. Our approach uses an Arabic idioms/sayings lexicon as a key resource for improving the detection of sentiment polarity in Arabic sentences, along with a novel and rich set of linguistically motivated features (contextual intensifiers, contextual shifters and negation handling) and syntactic features for conflicting phrases, which enhance sentiment classification accuracy. Furthermore, we introduce an automatically expandable, wide-coverage polarity lexicon of Arabic sentiment words. The lexicon is built with manually collected and annotated gold-standard sentiment words as a seed, and it expands and detects the sentiment orientation of new sentiment words automatically, using a synset aggregation technique and free online Arabic lexicons and thesauruses. Our data focus on modern standard Arabic (MSA) and Egyptian dialectal Arabic tweets and microblogs (hotel reservations, product reviews, etc.). The experimental results using our resources and techniques with an SVM classifier indicate high performance levels, with accuracies of over 95%.
International Journal on Natural Language Computing (IJNLC) Vol. 4, No. 2, Apri... ijnlc
Building dialogue systems has recently gained considerable attention, but most of the resources and systems built so far are tailored to English and other Indo-European languages. The need for designing systems for other languages, such as Arabic, is increasing. For this reason, there is growing interest in the Arabic dialogue act classification task, because it is a key player in Arabic language understanding for building such systems. This paper surveys different techniques for dialogue act classification for Arabic. We describe the main existing techniques for utterance segmentation and classification, annotation schemas, and test corpora for Arabic dialogue understanding that have been introduced in the literature.
CONSTRUCTION OF RESOURCES USING JAPANESE-SPANISH MEDICAL DATA ijnlc
In recent years, many NLP researchers have focused on constructing medical ontologies. This paper introduces a technique for extracting medical information from Wikipedia pages using a dictionary, which we then evaluate on a Japanese-Spanish SMT system. The study shows an increase in the BLEU score.
KAMBA PART OF SPEECH TAGGER USING MEMORY BASED APPROACH ijnlc
Part-of-speech tagging is very important and is the initial work towards machine translation and text manipulation. Though much has been done in this regard for the Indo-European and Asiatic languages, the development of part-of-speech tagging tools for African languages is wanting; as a result, these languages are classified as under-resourced languages. This paper presents a data-driven part-of-speech tagging tool for Kikamba, an under-resourced language spoken mostly in Machakos, Makueni and Kitui. The tool is built using the lazy learner called the Memory Based Tagger (MBT) with a corpus of approximately thirty thousand words. The corpus was collected, cleaned and formatted for MBT, and the experiment was run. Very encouraging performance is reported despite the small amount of corpus, which clearly shows that, using the state-of-the-art technology of data-driven methods, tools can be developed for under-resourced languages. We report a precision of 83%, a recall of 72% and an F-score of 75%; in terms of accuracy for known and unknown words, we report 94.65% and 71.93% respectively, with an overall accuracy of 90.68%. This suggests that, with a small corpus and a data-driven approach, we can generate tools for the under-resourced languages in Kenya.
Identification of prosodic features of Punjabi for enhancing the pronunciatio... ijnlc
Voice browsing requires a speech interface framework. Pronunciation Lexicon Specification (PLS) 1.0, a recommendation of the Voice Browser Working Group of the W3C (World Wide Web Consortium), is a machine-readable specification of pronunciation information which can be used for speech technology development. This global PLS standard is applicable across European and Asian languages, and the specification is extendable to all human languages. However, it currently does not cover morphological, syntactic and semantic information associated with pronunciations. In Indian languages, grammatical information is encoded largely in morphology rather than syntax, unlike English, where grammatical information is an integral part of syntax. In this paper, PLS 1.0 has been examined from the perspective of augmentation with prosodic features of Punjabi such as tone, gemination, etc.
Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from machine learning, natural language processing (NLP), data mining, information retrieval (IR), and knowledge management. Text mining involves the pre-processing of document collections, such as information extraction, term extraction, text categorization, and storage of intermediate representations, as well as the techniques used to analyse these intermediate representations, such as clustering, distribution analysis, association rules and visualisation of the results.
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models ijnlc
Displaying a document in Middle Eastern languages requires contextual analysis due to different presentational forms for each character of the alphabet. The words of the document will be formed by the joining of the correct positional glyphs representing corresponding presentational forms of the
characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers.
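A much-simplified sketch of the kind of joining rules described can be written as a table of positional forms plus a per-character decision. The table below models only two letters (beh and alef), using their real Unicode Arabic Presentation Forms-B codepoints; a real shaping engine covers the full alphabet plus ligatures, and this is an illustration rather than any particular implementation.

```python
# Simplified contextual shaping for Arabic display: pick a positional glyph
# (isolated / final / initial / medial) for each letter based on whether its
# neighbours join to it.

# (isolated, final, initial, medial) forms from Arabic Presentation Forms-B.
FORMS = {
    "\u0628": ("\uFE8F", "\uFE90", "\uFE91", "\uFE92"),  # beh
    "\u0627": ("\uFE8D", "\uFE8E", None, None),          # alef: never joins left
}

def joins_left(ch):
    """True if `ch` can connect to the following letter."""
    forms = FORMS.get(ch)
    return forms is not None and forms[2] is not None

def shape(text):
    out = []
    for i, ch in enumerate(text):
        forms = FORMS.get(ch)
        if forms is None:            # character we do not model: pass through
            out.append(ch)
            continue
        prev_joins = i > 0 and joins_left(text[i - 1])
        can_join_next = joins_left(ch) and i + 1 < len(text) and text[i + 1] in FORMS
        if prev_joins and can_join_next:
            out.append(forms[3])     # medial
        elif prev_joins:
            out.append(forms[1])     # final
        elif can_join_next:
            out.append(forms[2])     # initial
        else:
            out.append(forms[0])     # isolated
    return "".join(out)

# "bab" (beh, alef, beh): initial beh, final alef, then an isolated beh,
# because alef does not join to the letter that follows it.
print(shape("\u0628\u0627\u0628"))
```

As the text notes, such rules vary by language and by software interpretation; the decision logic here is the common four-form scheme shared by Arabic-script languages.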
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB... ijnlc
In this paper we present our work in studying the sublanguage of Arabic SMS-based classified ads. This study is presented from the developer's point of view. We use the corpus collected from an operational system, CATS, compare the SMS-based and the Web-based messages, and discuss some quantitative properties of the studied text.
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION ijnlc
Information extraction (IE) is a sub-discipline of artificial intelligence. IE identifies information in unstructured information sources that adheres to predefined semantics, i.e., people, locations, etc. Recognition of named entities (NEs) from computer-readable natural language text is a significant task of IE and natural language processing (NLP). Named entity (NE) extraction is an important step for processing unstructured content: unstructured data is computationally opaque, computers require computationally transparent data for processing, and IE adds meaning to raw data so that it can be easily processed by computers. Various approaches are applied for the extraction of entities from text. This paper elaborates the need for NE recognition for Marathi and discusses the issues and challenges involved in NE recognition tasks for the Marathi language. It also explores various methods and techniques that are useful for the creation of learning resources and lexicons, which are important for the extraction of NEs from natural language unstructured text.
Hybrid approaches for automatic vowelization of Arabic texts ijnlc
Hybrid approaches for automatic vowelization of Arabic texts are presented in this article. The process is made up of two modules. In the first, a morphological analysis of the text's words is performed using the open-source morphological analyzer AlKhalil Morpho Sys; its outputs for each word, analyzed out of context, are the word's different possible vowelizations. Integrating this analyzer into our vowelization system required the addition of a lexical database containing the most frequent words in the Arabic language. Using a statistical approach based on two hidden Markov models (HMMs), the second module aims to eliminate the ambiguities. In the first HMM, the unvowelized Arabic words are the observed states and the vowelized words are the hidden states. The observed states of the second HMM are identical to those of the first, but its hidden states are the lists of possible diacritics of each word without its Arabic letters. Our system uses the Viterbi algorithm to select the optimal path among the solutions proposed by AlKhalil Morpho Sys. Our approach opens an important way to improve the performance of automatic vowelization of Arabic texts for other uses in automatic natural language processing.
Monitoring and feedback in the process of language acquisition analysis and ... ijnlc
This document summarizes a paper that analyzes how native Japanese and English speakers monitor and provide feedback during language production. It uses mazes, algorithms, and Excel macros to simulate this process. Speakers of both languages check their speech and make corrections at the sentence level, seeking the most efficient path. Mazes show the learning process, while algorithms and spreadsheets model how speakers select words and structure sentences. For example, a Japanese speaker may change a verb to continue a thought, and an English speaker may replace one word with another after monitoring their speech. The analysis suggests this monitoring ability is important for both native and non-native language acquisition.
A survey of named entity recognition in Assamese and other Indian languages ijnlc
Named entity recognition is always important when dealing with major natural language processing tasks such as information extraction, question answering, machine translation, document summarization, etc., so in this paper we put forward a survey of named entities in Indian languages with particular reference to Assamese. There are various rule-based and machine learning approaches available for named entity recognition. At the start of the paper we give an idea of the available approaches for named entity recognition, and then we discuss the related research in this field. Assamese, like other Indian languages, is agglutinative and suffers from a lack of appropriate resources, as named entity recognition requires large data sets, gazetteer lists, dictionaries, etc., and some useful features, such as the capitalization found in English, are absent in Assamese. Apart from this, we also describe some of the issues faced in Assamese while doing named entity recognition.
Novel cochlear filter based cepstral coefficients for classification of unvoi... ijnlc
This document discusses novel cochlear filter based cepstral coefficients (CFCC) for classification of unvoiced fricatives. The authors propose using CFCC features derived from an auditory transform implemented as a bank of cochlear filters to model the human auditory system. Experimental results show CFCC performs better than MFCC for individual fricative classification, with an average 3.41% higher accuracy in clean conditions and lower error rates. CFCC also shows better noise robustness, with classification accuracy dropping less in noisy conditions compared to MFCC. The document provides background on previous work classifying fricatives, details of the proposed CFCC feature extraction method, and comparisons of auditory transforms to Fourier transforms.
An exhaustive font and size invariant classification scheme for OCR of Devana... ijnlc
The document presents a classification scheme for recognizing Devanagari characters that is invariant to font and size. It identifies the basic symbols that commonly appear in the middle zone of Devanagari text across different fonts and sizes. Through an analysis of over 465,000 words from various sources, it finds that 345 symbols account for 99.97% of text and aims to classify these into groups based on structural properties like the presence or absence of vertical bars. The proposed classification scheme is validated on 25 fonts and 3 sizes to demonstrate its font and size invariance.
Evaluation of subjective answers using GLSA enhanced with contextual synonymy ijnlc
Evaluation of subjective answers submitted in an exam is essential but is one of the most resource-consuming educational activities. This paper details experiments conducted under our project to build software that evaluates subjective answers of an informative nature in a given knowledge domain. The paper first summarizes the techniques, such as Generalized Latent Semantic Analysis (GLSA) and cosine similarity, that provide the basis for the proposed model. The further sections point out the areas of improvement in the previous work and describe our approach towards solutions to the same. We then discuss the implementation details of the project, followed by the findings that show the improvements achieved. Our approach focuses on comprehending the various forms of expressing the same entity, thereby capturing the subjectivity of text in objective parameters. The model is tested by evaluating answers submitted by 61 students of the Third Year B. Tech. CSE class of Walchand College of Engineering, Sangli, in a test on Database Engineering.
G2 pil: a grapheme-to-phoneme conversion tool for the Italian language ijnlc
This paper presents a knowledge-based approach for the grapheme-to-phoneme conversion (G2P) of isolated words of the Italian language. With more than 7,000 languages in the world, the biggest challenge today is to rapidly port speech processing systems to new languages with low human effort and at reasonable cost. This includes the creation of qualified pronunciation dictionaries, which provide the mapping from the orthographic form of a word to its pronunciation and are useful in both speech synthesis and automatic speech recognition (ASR) systems. For training the acoustic models, we need an automatic routine that maps the spelling of the training set to a string of phonetic symbols representing the pronunciation.
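As an illustration of rule-based G2P (not the paper's actual rule set), a toy longest-match converter for a few well-known Italian spelling rules can be sketched as follows; the rules and output alphabet here are a deliberately tiny subset chosen for the example.

```python
# Toy longest-match grapheme-to-phoneme rules for Italian. A handful of
# rules for illustration only; a real G2P system needs a far larger rule
# set (plus stress assignment and exception dictionaries).

def g2p(word):
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        if word[i:i + 2] == "ch":                    # "che"/"chi": hard /k/
            phones.append("k"); i += 2
        elif word[i:i + 2] == "gn":                  # "gn": palatal nasal
            phones.append("ɲ"); i += 2
        elif word[i] == "c" and word[i + 1:i + 2] in ("e", "i"):
            phones.append("tʃ"); i += 1              # soft c before e/i
        elif word[i] == "c":
            phones.append("k"); i += 1               # hard c elsewhere
        else:
            phones.append(word[i]); i += 1           # vowels etc. pass through
    return " ".join(phones)

print(g2p("cena"))   # tʃ e n a
print(g2p("chela"))  # k e l a
```

The longest-match loop is the essential point: context-sensitive digraphs like "ch" must be consumed before the single-letter rules are tried.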
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION IJDKP
The document presents a new hybrid stemming algorithm for Arabic text categorization. It combines three existing stemming approaches - root-based (Khoja stemmer), stem-based (Larkey stemmer), and statistical (N-gram). The hybrid approach first normalizes text, removes stop words and prefixes/suffixes. It then uses bigram similarity and the Dice measure to find valid roots by comparing word bigrams to a root dictionary. The algorithm is evaluated using naive Bayesian and SVM classifiers in an Arabic text categorization system, and is found to outperform the individual stemming methods. The proposed stemmer improves the performance of Arabic text categorization.
Arabic words stemming approach using Arabic WordNet IJDKP
The rapid growth of Arabic internet content in recent years has raised the need for effective stemming techniques for the Arabic language. Arabic stemming algorithms can be ranked, according to three categories, as root-based approaches (e.g., Khoja), stem-based approaches (e.g., Larkey), and statistical approaches (e.g., N-gram). However, no stemmer of this language is perfect: the existing stemmers have low efficiency. In this paper, we introduce a new stemming technique for Arabic words that also solves the problem of the plural form of irregular nouns in the Arabic language, called the broken plural. The proposed stem extractor provides very accurate results in comparison with other algorithms. Consequently, search effectiveness is improved.
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM kevig
This paper proposes an improved morphological analyser for the Arabic pronominal system using the finite-state method. The main advantage of the finite-state method is that it is very flexible, powerful and efficient. The most important results about FSAs relate the class of languages generated by finite-state automata to certain closure properties; this makes the theory of finite-state automata a very versatile and descriptive framework. The main contribution of this work is the full analysis and representation of the morphological analysis of all the inflections of pronoun forms in Arabic. In this paper we build a finite-state network for the inflectional forms of the root words, restricted to all the inflections and grammatical properties for generating the dependent and independent forms of pronouns in the Arabic language. The results show high accuracy in the output, with all the needed linguistic features; the output is evaluated using the F-score test, with achievement at the rate of 80% to 83%. The results also provide evidence that Arabic has strong concatenative word formations.
COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMS IJMIT JOURNAL
In the context of information retrieval, Arabic stemming algorithms have become a major research area. Many researchers have developed algorithms to solve the problem of stemming, each proposing his own methodology and measurements to test the performance and compute the accuracy of his algorithm; thus, nobody can make accurate comparisons between these algorithms. Many generic conflation techniques and stemming algorithms are theoretically analyzed in this paper. Then, the main Arabic language characteristics that need to be mentioned before discussing Arabic stemmers are summarized. The evaluation of the algorithms in this paper shows that Arabic stemming is still one of the major information retrieval challenges. This paper aims to compare most of the commonly used light stemmers in terms of affix lists, algorithms, main ideas, and information retrieval performance. The results show that the light10 stemmer outperformed the other stemmers. Finally, recommendations for future research regarding the development of a standard Arabic stemmer are presented.
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL... gerogepatton
Many automatic translation works have addressed major European language pairs by taking advantage of large-scale parallel corpora, but very few research works have been conducted on the Amharic-Arabic language pair due to its parallel data scarcity, and there is no benchmark parallel Amharic-Arabic text corpus available for the machine translation task. Therefore, a small parallel Quranic text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent Amharic translation available on Tanzile. Experiments are carried out on two Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) based Neural Machine Translation (NMT) models using an attention-based encoder-decoder architecture adapted from the open-source OpenNMT system. The LSTM and GRU based NMT models and the Google Translation system are compared, and the LSTM based OpenNMT is found to outperform the GRU based OpenNMT and the Google Translation system, with BLEU scores of 12%, 11%, and 6% respectively.
ANNOTATED GUIDELINES AND BUILDING REFERENCE CORPUS FOR MYANMAR-ENGLISH WORD ALIGNMENT
The document discusses guidelines and building a reference corpus for Myanmar-English word alignment. It presents annotated guidelines for aligning words between sentences in Myanmar and English texts. The guidelines cover various categories like nouns, verbs, punctuation etc. and provide examples of linking words with "sure" and "possible" labels. The researchers created a reference corpus of 500 aligned sentence pairs from an existing Myanmar-English parallel treebank, following the newly defined annotation guidelines. This corpus aims to help evaluate word alignment tasks and improve statistical machine translation between Myanmar and English.
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA
Morphological segmentation is a fundamental task in language processing. Some languages, such as Arabic and Tigrinya, have words packed with very rich morphological information. Therefore, unpacking this information becomes a necessary task for many downstream natural language processing tasks. This paper presents the first morphological segmentation research for Tigrinya. We constructed a new morphologically segmented corpus with 45,127 manually segmented tokens. Conditional random fields (CRF) and window-based long short-term memory (LSTM) neural networks were employed separately to develop our boundary detection models. We applied language-independent character and substring features for the CRF and character embeddings for the LSTM networks. Experiments were performed with four variants of the Begin-Inside-Outside (BIO) chunk annotation scheme. We achieved a 94.67% F1 score using bidirectional LSTMs with a fixed-size window approach to morpheme boundary detection.
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON
The availability of lexical resources has greatly accelerated and simplified sentiment analysis in English. In Arabic, there are few such resources, and they are not comprehensive. Most current research efforts for constructing an Arabic Sentiment Lexicon (ASL) depend on a large number of lexical entries. However, coverage of all Arabic sentiment expressions can be achieved using refined regular expressions rather than a large number of lexical entries. This paper presents an ASL that is more comprehensive than existing lexicons, covering many expressions in different dialects including Franco-Arabic, while at the same time being more compact. The paper also shows how to integrate different lexicons and refine them. To enrich the lexical entries with robust morphological and syntactic information, regular expressions, sentiment polarity weights, and n-gram terms have been added to each entry.
Building of Database for English-Azerbaijani Machine Translation Expert System
This article presents the results of developing a machine translation expert system. An approach to defining translation correspondences is suggested as the basis for creating the system's database and knowledge base. The methods for compiling transformation rules, applied to the expert system's linguistic knowledge base, are based on defining translation correspondences between the Azerbaijani and English languages.
This document discusses different techniques for improving Arabic information retrieval from automatically transcribed Arabic speech, including normalization, stopword removal, light stemming, heavy stemming, and tokenization. It presents the results of an experiment applying these techniques to Arabic text extracted from television news videos. The techniques of normalization, stopword removal, and light stemming increased retrieval precision according to the experiment, while heavy stemming and trigrams had a negative effect. The choice of machine translation engine for the English queries was also shown to significantly impact retrieval effectiveness.
Smart grammar: a dynamic spoken language understanding grammar for inflective ...
1. The document proposes SmartGrammar, a new method for developing spoken language understanding grammars for inflectional languages like Italian.
2. SmartGrammar uses a morphological analyzer to convert user utterances into their canonical forms before parsing, allowing the grammar to contain only canonical word forms rather than all possible inflections.
3. This significantly reduces the complexity and size of grammars for inflectional languages by representing many possible inflected forms with a single canonical form entry, making grammar development and management easier.
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
Phonetic typing using the English alphabet has become widely popular for social media and chat services. As a result, texts containing a mix of English and Bangla words and phrases have become increasingly common, and existing transliteration tools perform poorly on them. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can satisfactorily transliterate both English words and phonetically typed Bangla words. This is achieved by adopting a hybrid approach combining dictionary-based and rule-based techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the benchmark transliteration tool.
Using automated lexical resources in Arabic sentence subjectivity
This document describes research on developing an Arabic sentiment lexicon using an existing Arabic lexical semantic database called RDI. The researchers generated a sentiment lexicon called SentiRDI by determining the subjectivity and polarity of words and semantic fields in the RDI database. They used a seed list of sentiment words and semantic relations in RDI, like synonyms and antonyms, to propagate sentiment scores. The system achieved 76.1% accuracy in classifying sentences as subjective or objective using multiple machine learning algorithms and an Arabic language corpus. Key challenges in building Arabic sentiment resources included the lack of semantic databases and the morphological complexity of the Arabic language.
Implementation of Syntax Parser for English Language Using Grammar Rules
For many years, Chomsky's generative system of grammars, particularly context-free grammars (CFGs) and regular expressions (REs), has been used to express the syntax of programming languages and protocols. Syntactic parsing works mainly with the syntactic structure of a sentence. 'Syntax' refers to the grammatical arrangement of words in a sentence and their relationship with other words. The main focus of syntactic analysis is to find the syntactic structure of a sentence, usually represented as a tree; identifying this structure is useful in determining the sentence's meaning. Natural language processing processes data through lexical analysis, syntax analysis, semantic analysis, discourse processing, and pragmatic analysis. This paper surveys various parsing methods. The algorithm presented splits English sentences into parts using a POS (Part-Of-Speech) tagger, identifies the sentence type (simple, complex, interrogative, factual, active, passive, etc.), and then parses the sentences using the grammar rules of natural language. As natural language processing becomes increasingly relevant, there is a need for treebanks catered to the specific needs of more individualized systems. Here, we present an open-source technique to check and correct grammar; the methodology gives appropriate grammatical suggestions.
Deterministic Finite State Automaton of Arabic Verb System: A Morphological S...
Finite state morphology serves as an important tool for natural language processing researchers, and morphological analysis forms an essential preprocessing step in natural language processing. This paper discusses the morphological analysis and processing of verb forms in Arabic, focusing on inflected verb forms in the perfective, imperfective, and imperative. The deterministic finite state morphological parser for verb forms can deal with the morphological and orthographic features of Arabic and the morphological processes involved in Arabic verb formation and conjugation. The model is used to generate and attach the necessary information (prefix, suffix, stem, etc.) to each morpheme of a word, using subtags for each morpheme. A finite state tool is used to build the computational lexicon, which is structured as a list of the stems and affixes of the language together with a representation of how words can be assembled and how the network of all forms can be represented.
PARSING ARABIC VERB PHRASES USING PREGROUP GRAMMARS
Parsing of Arabic phrases is a crucial requirement for many applications such as question answering and machine translation. The calculus of pregroups was introduced by Lambek as an algebraic computational machinery for the grammatical analysis of natural languages. Pregroup grammars have been used to analyse sentence structure in many European languages, such as English, and non-European languages, such as Japanese. For Arabic, Lambek employed the notions of pregroups to analyse some grammatical structures such as verb conjugation, tense modifiers, and equational sentences. This work attempts to develop the initial phase of an efficient automatic pregroup grammar parser, using a linear approach to analyse the verbal phrases of Modern Standard Arabic (MSA). The proposed system first builds an Arabic lexicon containing all possible categories of Arabic verbs, then analyses the input Arabic verbal phrase to check whether it is well-formed, using a linear parsing algorithm based on pregroup grammar rules.
USING AUTOMATED LEXICAL RESOURCES IN ARABIC SENTENCE SUBJECTIVITY
A common point in almost any work on sentiment analysis is the need to identify which elements of language (words) contribute to expressing subjectivity in text. A collection of these elements (sentiment words), regardless of context, together with their polarities (positive/negative), is called a sentiment lexical resource or subjective lexicon. In this paper, we investigate a method for generating a Sentiment Arabic Lexical Semantic Database using a lexicon-based approach. We also study the prior-polarity effect of each word in our Sentiment Arabic Lexical Semantic Database on sentence-level subjectivity, using multiple machine learning algorithms. The experiments were conducted on the MPQA corpus, containing subjective and objective Arabic sentences, and we were able to achieve 76.1% classification accuracy.
International Journal on Natural Language Computing (IJNLC) Vol. 4, No. 3, June 2015
DOI: 10.5121/ijnlc.2015.4301
CBAS: CONTEXT BASED ARABIC STEMMER
Mahmoud El-Defrawy, Yasser El-Sonbaty and Nahla A. Belal
College of Computing and Information Technology, AAST, Alexandria, Egypt
ABSTRACT
Arabic morphology encapsulates many valuable features such as word’s root. Arabic roots are being
utilized for many tasks; the process of extracting a word’s root is referred to as stemming. Stemming is an
essential part of most Natural Language Processing tasks, especially for derivative languages such as
Arabic. However, stemming is faced with the problem of ambiguity, where two or more roots could be
extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence
model. It captures the meaning of a word based on its context. In this paper, a distributional semantics
model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its
effectiveness on the stemming analysis task. It achieved an accuracy of 81.5%, with at least a 9.4% improvement over other stemmers.
KEYWORDS
Natural Language Processing, Computational Linguistics, Text Analysis, Stemming
1. INTRODUCTION
Natural Languages (NLs) are the communication channels between humans. They allow conveying information, exchanging knowledge, and sharing ideas. For many years, scientists have studied natural languages and developed theories and rules that govern their use, such as grammar and morphology. Natural Language Processing (NLP) is the intersection between linguistics and Computational Science (CS) [1]. NLP applies linguistics to make natural languages a means of communication with computational devices [1].
The association between linguistics and computational sciences has evolved over time. Machine Translation (MT) was one of the first NLP tasks in the 1950s; it began as translation from Russian to English [2]. The progress of MT was limited due to the complexity of linguistic rules and the low computational power available at the time [1]. However, Chomsky's theory [3] of natural language grammar formed the basis for the Backus-Naur Form (BNF). BNF [4] notations are commonly used to represent Context Free Grammars (CFGs). CFGs are used to systematically describe and validate artificial languages, such as programming languages. Using CFGs to describe some aspects of natural languages requires a non-trivial set of rules, which results in some ambiguity due to unexpected rule interactions.
The introduction of statistical methods gave some insight into reducing NLP ambiguity [1]. For example, Probabilistic CFGs extend traditional CFGs by deducing linguistic rules and assigning weights [5]. Rules and weights are statistically deduced from a large annotated corpus. The resulting noticeable improvement in MT sparked further research in NLP [1].
Table 1. Context Matrix Sample.

                      عجائب (Wonders)   دول (Countries)   جامعة (University)   القاهرة (Cairo)
نظم (Systems)                0                 5                 10                   3
سائح (Tourist)               8                 5                  0                  14
حكم (Judgement)              0                12                  0                   9
NLP tends to work with large sets of data from various languages. This raises the need to define a concise representation for the data while preserving as many of its features as possible. A concise representation is required for almost any NLP task (at the word, sentence, or document level). Stemming is a primary NLP task, and it contributes to many other NLP tasks [6]. Stemming reduces a word to its basic form [7] while preserving its main characteristics. Many languages define linguistic rules for stemming, but not to the same degree [8].
Derivative languages are highly systematic and thus highly amenable to stemming analysis. Most derivative languages share the property that complex forms are derived from basic ones. Arabic is one of the derivative languages that linguistically supports stemming. Arabic is a widely used language [9] and exists in different formats; for example, Arabic words can be given as separate text or extracted from images [10]. The Arabic language defines a precise set of rules known as morphological rules, or morphology. Morphology accurately describes the formation of an Arabic word from its basic form. The basic forms are commonly called roots. However, stemming, like most NLP tasks, is faced with ambiguity.
Various techniques have been used to resolve ambiguity. Among them is semantic analysis, which captures the intended meaning of a word [11]. Semantic analysis is a very powerful tool for tackling the ambiguity problem, but it is very challenging to model. Distributional Semantics (DS) [11] is a type of semantic analysis based on co-occurrence analysis. It represents a word's meaning by the distribution of its context (surrounding words), as shown in Table 1. For example, the first row of Table 1 shows that the words عجائب (means "Wonders") and دول (means "Countries") appeared in the context of the word نظم (means "Systems") with frequencies 0 and 5, respectively. Different measures can be computed, such as Pointwise Mutual Information (PMI), Positive PMI (PPMI), and Smoothed PMI (SPMI). They measure the correlation between a word and its context [12,13].
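As an illustration, the SPMI measure can be computed directly from the counts in Table 1. The sketch below follows the commonly used formulation (context smoothing with an exponent, typically α = 0.75, and clipping at zero as in PPMI); it is an assumption for illustration, not the paper's exact implementation:

```python
import math

# Co-occurrence counts from Table 1 (rows: target words, columns: context words).
counts = [
    [0, 5, 10, 3],    # نظم (Systems)
    [8, 5, 0, 14],    # سائح (Tourist)
    [0, 12, 0, 9],    # حكم (Judgement)
]

def spmi(counts, alpha=0.75):
    """Smoothed positive PMI: context counts are raised to alpha before normalizing."""
    total = sum(sum(row) for row in counts)
    p_row = [sum(row) / total for row in counts]            # P(w)
    smoothed = [sum(col) ** alpha for col in zip(*counts)]  # count(c)^alpha
    p_ctx = [s / sum(smoothed) for s in smoothed]           # P_alpha(c)
    return [
        [max(math.log2((c / total) / (p_row[i] * p_ctx[j])), 0.0) if c else 0.0
         for j, c in enumerate(row)]
        for i, row in enumerate(counts)
    ]

matrix = spmi(counts)
```

Zero counts map to 0 rather than negative infinity, and raising the context distribution to α < 1 dampens PMI's known bias toward rare contexts.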
This paper introduces a context based Arabic stemmer for extracting a word's root. The proposed stemmer (CBAS) explores all possible roots, then selects the appropriate root using Distributional Semantics (DS). The impact of utilizing DS is evaluated through a series of comparisons with other stemmers on a manually annotated set of articles. The paper is organized as follows: Section 2 introduces Arabic morphology. Section 3 explores related work and techniques used for constructing stemmers. Section 4 describes the proposed stemmer (CBAS). Section 5 presents a detailed analysis and evaluation of the proposed stemmer. Finally, Section 6 concludes the paper.
2. BACKGROUND
Regardless of the approach used for developing morphological analyzers, a basic understanding
of morphological rules is needed to set expectations, evaluate the results, and design
improvements. This section introduces Arabic morphology, and common challenges.
Morphology is the study of word formation. Arabic morphology is based on the derivation principle, whereby words are acquired from roots. Roots are usually three, four, or five characters long.
Roots are the seeds for Arabic word generation. A new word is acquired by modifying its root. For example, the word كاتب (kātb, means "Writer") is derived from the root ك ت ب (kāf tāʾ bāʾ, means "Wrote") by adding ا (ʾalif) in the middle.
Not every addition is considered valid. The Arabic language introduces a set of templates, referred to as patterns, to define valid combinations and additions. A pattern is an ordered sequence of letters. Since patterns work with all roots, a set of generic letters represents the root letters and their order, while augmented letters are represented by themselves in their correct positions. For example, the pattern فاعل (fāʿl, means "Actor") was used to derive the previous word كاتب (kātb, means "Writer") by substituting ف (fāʾ) with ك (kāf), ع (ʿayn) with ت (tāʾ), and ل (lām) with ب (bāʾ), respectively. As noted, roots are commonly written as separated characters to indicate possible insertions [14].
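The substitution described above can be sketched mechanically: treat the generic letters ف, ع, and ل as slots for the first, second, and third root letters, and copy every other (augmented) letter of the pattern as-is. This is a toy illustration of the template idea, not a full derivation engine:

```python
# Positions of the three root letters, marked by the generic letters fāʾ, ʿayn, lām.
ROOT_SLOTS = {"ف": 0, "ع": 1, "ل": 2}

def derive(root, pattern):
    """Fill a pattern's slot letters with the root's letters; keep augmented letters."""
    return "".join(root[ROOT_SLOTS[ch]] if ch in ROOT_SLOTS else ch for ch in pattern)

# Root kāf tāʾ bāʾ plus pattern fāʿl yields kātb ("Writer").
word = derive("كتب", "فاعل")
```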
Augmented letters are reflected in the pattern, and finally in the word itself. An augmented unit (one or more augmented letters) added at the front or at the end of a word is called a prefix or a suffix addition, respectively. In most cases, prefixes and suffixes are not part of a word's meaning; rather, they are additional features [14]. For example, الكاتب (ʾlkātb, means "The writer") is formed by adding ال (ʾl, means "The") in front of the word. This point of view substantially reduces the number of enumerated patterns.
As described above, the root-pattern system is simple, elegant, and straightforward. However, the system is faced with morphological challenges, namely vocalization, mutation, and the absence of diacritics (annotations above or below a word's letters that capture additional morphological and grammatical features). For example, a letter can change its form due to grammatical or phonological rules. Another challenge to the root-pattern system is stopwords, such as connective words, which do not obey derivational rules.
The process of deriving a word back to its root (stemming) looks like a straightforward operation: simply align a word to its pattern and collect the letters corresponding to the ف (fāʾ), ع (ʿayn), and ل (lām) positions. However, due to the challenges described above, a word may be derived back to multiple roots.
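A naive root extractor makes this ambiguity concrete: align the word against every pattern of the same length, read off the letters in the ف/ع/ل positions, and require the augmented letters to match exactly. The pattern list below is a small illustrative subset, not the full Arabic inventory:

```python
ROOT_SLOTS = {"ف", "ع", "ل"}
PATTERNS = ["فاعل", "فعال", "مفعل", "فعل"]  # illustrative subset only

def candidate_roots(word):
    """Return every root obtainable by aligning the word to a same-length pattern."""
    roots = set()
    for pat in PATTERNS:
        if len(pat) != len(word):
            continue
        root = [w for w, p in zip(word, pat) if p in ROOT_SLOTS]
        augments_ok = all(w == p for w, p in zip(word, pat) if p not in ROOT_SLOTS)
        if augments_ok and len(root) == 3:
            roots.add("".join(root))
    return roots

# With a realistic pattern inventory, many words match several patterns,
# producing more than one candidate root — the ambiguity discussed above.
roots = candidate_roots("كاتب")
```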
3. RELATED WORK
Information Retrieval (IR) is one of the early tasks that utilized stemming analysis [7]. However, stemming analysis is not limited to IR; it has improved many tasks, such as Machine Translation (MT) [15], Sentiment Analysis [16], and more. This section reviews common stemming analysis algorithms for different Natural Languages (NLs).
The nature of a language has a great impact on the development of related stemming algorithms. For example, the nature of the English language makes English stemmers concerned with removing a word's suffixes only, while removing prefixes may imply a different meaning, as in sufficient versus insufficient. Various stemmers were developed for English [8,17]. However, the underlying nature of the language limited their extension to other languages, among which are Arabic [18] and Urdu [19]. Moreover, stemming is not equally effective for all languages [20,21]. Arabic is a morphologically rich language, which has enriched Natural Language Processing (NLP) research [22-24].
Arabic roots have rich linguistic features; they are semantically representative, derivable, and finite in number. Stemmers have been developed over the years to take advantage of such features. This section introduces common Arabic stemmers and their root extraction processes, which encapsulate semantic decisions. Finally, it introduces semantic analysis techniques independently from stemming analysis.
The Khoja stemmer [18] is one of the earliest and most powerful approaches developed for Arabic stemming [7]. Khoja simulates the linguistic process as closely as possible. It removes prefixes and suffixes from a word, usually after a normalization process, then matches the resulting word to a pattern, and finally extracts the root. The extracted root is validated against a list of correct Arabic roots to ensure linguistic correctness. The Khoja stemmer [18] resolves ambiguity by defining a set of linguistic paths, or decisions, based on various features such as a word's first character or the length of its prefixes or suffixes. Additionally, decisions are implicitly ordered, so the result is the first correct root; for example, the Khoja stemmer handles roots with duplicate letters first.
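The Khoja-style flow (strip affixes, match a pattern, validate against a root list) can be sketched as follows. The affix lists, patterns, and root dictionary here are tiny toy assumptions; the real stemmer uses much larger, linguistically curated resources and a more elaborate decision order:

```python
PREFIXES = ("ال", "و")          # toy prefix list
SUFFIXES = ("ون", "ات")         # toy suffix list
PATTERNS = ("فاعل", "فعل")      # toy pattern subset
ROOT_SLOTS = {"ف", "ع", "ل"}
VALID_ROOTS = {"كتب"}           # a real stemmer validates against a full root list

def stem(word):
    for p in PREFIXES:          # strip at most one prefix
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:          # strip at most one suffix
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    for pat in PATTERNS:        # implicitly ordered decisions: first valid root wins
        if len(pat) != len(word):
            continue
        root = "".join(w for w, p in zip(word, pat) if p in ROOT_SLOTS)
        augments_ok = all(w == p for w, p in zip(word, pat) if p not in ROOT_SLOTS)
        if augments_ok and root in VALID_ROOTS:
            return root
    return None

result = stem("الكاتبون")  # "The writers" reduces to the root كتب
```

Validating against a root list is what distinguishes this flow from lighter approaches: a candidate that matches a pattern but is not a known root is rejected.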
Root extraction is a highly complex process due to the existence of overlapping rules, which requires more information. A new type of stemming analysis was therefore introduced: light stemming. Light stemming is another way of acquiring a reduced representation of Arabic words. It is not as complex as root extraction; it only removes prefixes and suffixes from a word. For example, the word الكاتبون (ʾlkātbūn, means "The writers") would be stemmed to the word كاتب (kātb, means "Writer") instead of ك ت ب (kāf tāʾ bāʾ, means "Writing"). Light stemming is widely used for Information Retrieval (IR) [25] and has shown competitive results in IR against root-extraction-based stemmers [6,26]. Light stemmers are relatively fast and efficient, and they preserve more specific features of the word; for example, كاتب (kātb, means "Writer") is more closely related to الكاتبون (ʾlkātbūn, means "The writers") than to ك ت ب (kāf tāʾ bāʾ, means "Writing"). However, the number of Arabic words without prefixes and suffixes is far greater than the number of roots listed in Arabic dictionaries [14], and there is no explicit evidence that lightly stemmed words are more effective than roots [26].
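The contrast with root extraction can be seen in a minimal light stemmer, which only trims affixes and never consults patterns or a root dictionary. The affix lists are again illustrative assumptions, not those of any published light stemmer:

```python
# Longest-match-first affix stripping; a length guard avoids destroying short stems.
PREFIXES = ("وال", "بال", "ال", "و")
SUFFIXES = ("ون", "ين", "ات", "ة")

def light_stem(word):
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

stemmed = light_stem("الكاتبون")  # keeps كاتب rather than reducing to the root
```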
ISRI [27] is another linguistically based Arabic stemmer that roughly follows the same sequence as the Khoja stemmer [18]. The main differences are that ISRI does not linguistically validate the extracted root (no dictionary is used) and that it organizes the defined pattern set into sub-groups, where each sub-group has common features, in addition to changes to the normalization process and to prefix and suffix handling. The main goal of ISRI is to obtain the minimum representation of a given word. Due to these changes and its prioritization, ISRI resolves ambiguity differently, by changing the order in which morphological rules are applied. However, ISRI still makes static decisions. Most likely, ISRI would be used for information retrieval rather than for linguistically based tasks.
Tashaphyne [28] is another Arabic stemmer. It mainly supports light stemming (removing prefixes
and suffixes), following the same approach as the Khoja [18] and ISRI [27] stemmers, but it can be
used for root extraction as well.
Darwish [29] utilizes existing word-root pairs to construct Finite State Transducer (FST)-like
stemmers. It uses a learning-based technique not only to infer Arabic patterns, but also to rank
the extracted roots. This methodology enumerates possible roots like FST approaches [30], but
additionally assigns a preference to each extracted root, which is another way of handling
ambiguity besides static rules. However, some of the inferred patterns are inadequate in cases
such as vocalization and mutation, and the ranking of roots is based on inferred pattern
frequency, neglecting the word's features. This approach was later modified to handle light
stemming (prefix and suffix removal) [7].
Arabic grammar strongly influences morphological analysis [14]. ElixirFM [31] employs syntactic
features to enhance morphological results. It takes advantage of the Prague Arabic Dependency
Treebank (PADT) [32] to acquire syntactic features, and other morphological features
International Journal on Natural Language Computing (IJNLC), Vol. 4, No. 3, June 2015
from the BuckWalter [33] stem dictionary. ElixirFM [31] defines a set of morphological rules to
extract possible stems, while ranking over the underlying data is used to disambiguate the
extracted roots or stems.
The MADAMIRA [15] morphological analyzer combines two tools, MADA [34] and AMIRA [35]. MADAMIRA
takes advantage of a large annotated corpus by using machine learning techniques such as Support
Vector Machines [35]. It combines several tools, such as word segmentation, Part-of-Speech Tagging
(POST), and light stemming. It differs from the previous approaches in that it does not explicitly
define morphological rules.
Stemmers are part of many NLP tasks. In sentiment analysis, for example, stemmed features feed
classifiers such as Naive Bayes or Support Vector Machines (SVMs), which classify test samples
given a tagged training set and rich feature sets [16, 36, 37]; stemming is similarly used in
question answering systems [38, 39], and many more tasks. However, stemming is not limited to NLP.
It has long been used for Information Retrieval (IR), and most stemmers are evaluated indirectly
using IR benchmarks [7]. Many IR experiments [6, 26, 27] showed that Arabic roots improved Arabic
IR.
4. PROPOSED STEMMER (CBAS)
The proposed stemmer, Context-Based Arabic Stemmer (CBAS), utilizes the distributional similarity
of a word's context to gain additional information about its semantics. This information assists
in selecting the correct root by excluding semantically irrelevant candidate roots within the
context.
This section introduces the main phases of the proposed stemmer (CBAS): context matrix
construction, root generation, and root selection, where each phase consists of a set of steps.
The proposed algorithm is shown in Fig. 1 and Fig. 2.
Fig.1: Context Matrix Construction Algorithm
Fig.2: Stemmer’s Algorithm
4.1. Data Resources
Predefined linguistic data is an essential part of the proposed stemmer (CBAS). This section
introduces the data defined by CBAS. An Arabic word consists of three parts: prefix, infix, and
suffix [14]. Prefixes and suffixes are features that can be added to a word, such as the definite
article ال (ʾl, meaning “The”) or connected pronouns such as هم (hm, meaning “Them”). The prefix
and suffix lists contain individual and compound letters that can appear at the front or the end
of an Arabic word. There is also a list of Arabic patterns, which is used to extract possible
Arabic roots. The final list is the roots' dictionary, extracted from the Khoja stemmer [18],
which validates the extracted roots. The prefix, suffix, pattern, and dictionary lists are used in
the root generation phase.
The previous lists are commonly defined for Arabic stemmers. In addition, however, CBAS uses a raw
dataset consisting of articles extracted from Omani newspapers [40]. The dataset contains 20291
articles on various topics, for example culture and sport [40], representing a wide range of
current Arabic language usage. The raw dataset plays a central role in selecting semantically
correct roots, which distinguishes CBAS from other stemmers, which do not commonly employ context
in their algorithms.
4.2. Context Matrix Construction
The context matrix is a powerful and flexible tool for acquiring semantic properties [11]. It
defines a window of n words, where the target word is at position i and the surrounding words are
its context [11]. The window slides over the corpus, associating the target word with its context
distribution, as shown in Table 1. Various measures can be computed from the context matrix and
employed in different tasks.
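The sliding-window construction described above can be sketched as follows. This is a minimal
illustration: the window size and the toy English corpus are arbitrary choices for the example,
not values from the paper.

```python
from collections import Counter, defaultdict

def build_context_matrix(tokens, window=3):
    """Slide a window over the corpus; count each (target, context) pair.

    For the token at position i, every other token within `window`
    positions is counted as part of its context distribution.
    """
    matrix = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                matrix[target][tokens[j]] += 1
    return matrix

corpus = "the writer wrote the book the writer read the book".split()
m = build_context_matrix(corpus, window=2)
print(m["writer"]["the"])  # 4
```

Each row of `matrix` is the context distribution of one target word, from which measures such as
PMI can then be computed.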
4.3. Root Generation
Root generation automates root extraction; unlike the manual process, however, it extracts all
possible roots. It consists of three major sub-processes: word segmentation, pattern matching, and
root validation.
– Word segmentation breaks a word into all possible combinations of its three parts (prefix,
infix, and suffix), using the predefined prefix and suffix lists.
– Pattern matching matches the infix obtained from word segmentation against one or more patterns
of the same length. For each matched pattern, the root's characters are collected and passed to
dictionary validation. This step also handles weak letters, stopwords, and some other linguistic
cases.
– Dictionary validation ensures that the extracted roots are linguistically correct by checking
them against a list of correct Arabic roots.
Because the dictionary alone cannot narrow the result to a single correct root, owing to various
linguistic cases, root generation may produce one or more correct roots.
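A minimal sketch of the word-segmentation sub-process, enumerating every affix split rather than
committing to one, might look like the following. The affix lists are an illustrative subset, not
the lists used by CBAS, and the triliteral-length check stands in for the pattern-matching stage.

```python
# Sketch of the word-segmentation sub-process: enumerate every
# (prefix, infix, suffix) split permitted by the affix lists.
# The affix lists below are a small illustrative subset.
PREFIXES = ["", "ال", "و", "وال"]
SUFFIXES = ["", "ون", "ة", "ات"]

def segmentations(word):
    """Yield all (prefix, infix, suffix) splits allowed by the affix lists."""
    for p in PREFIXES:
        if not word.startswith(p):
            continue
        for s in SUFFIXES:
            rest = word[len(p):]
            if s and not rest.endswith(s):
                continue
            infix = rest[:len(rest) - len(s)] if s else rest
            if len(infix) >= 3:  # must still be able to host a triliteral root
                yield p, infix, s

candidates = list(segmentations("الكاتبون"))
for seg in candidates:
    print(seg)
```

Each yielded infix would then be matched against the pattern list and the surviving roots
validated against the dictionary.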
4.4. Roots’ Selection
This phase utilizes the context matrix to select an appropriate root from two or more candidate
roots. Pointwise Mutual Information (PMI) [12, 13] measures the correlation between two or more
words. Since some words produce two or more candidate roots, the proposed algorithm (CBAS) uses a
variation of PMI, Smoothed PMI (SPMI) [12], to handle sparse matrices. As shown in Table 2, SPMI
achieved the highest accuracy of 81.5%, compared with 78.84% for PMI and 79.49% for PPMI. This is
because SPMI counteracts the bias towards rare co-occurrence events, which is a known side effect
of PMI [12].
SPMI is used to measure the correlation between each generated root and its previous context. To
take advantage of the underlying matrix, a set of words is derived for each candidate root, in
addition to the root itself. For each derived word, the average SPMI with the previous word (as
context) is computed. The root with the highest average correlation to its context is then
selected.
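One common formulation of SPMI applies context-distribution smoothing: context counts are raised
to a power α (typically 0.75) before normalizing, as in Jurafsky's presentation [12]. The sketch
below illustrates that formulation; the toy counts and the α value are assumptions for the
example, not the paper's actual matrix.

```python
import math
from collections import Counter

def spmi(pair_counts, target, context, alpha=0.75):
    """Smoothed PMI: PMI with the context distribution raised to alpha.

    Raising context counts to alpha < 1 lifts the probability of rare
    contexts, damping PMI's bias towards rare co-occurrence events.
    """
    total = sum(pair_counts.values())
    target_counts, context_counts = Counter(), Counter()
    for (t, c), n in pair_counts.items():
        target_counts[t] += n
        context_counts[c] += n
    joint = pair_counts.get((target, context), 0)
    if joint == 0:
        return float("-inf")  # unseen pair: no evidence of association
    p_tc = joint / total
    p_t = target_counts[target] / total
    z = sum(n ** alpha for n in context_counts.values())
    p_c = context_counts[context] ** alpha / z  # smoothed context probability
    return math.log2(p_tc / (p_t * p_c))

# Toy co-occurrence counts: (word, context-word) -> frequency.
pairs = {("writer", "book"): 8, ("writer", "tree"): 1, ("gardener", "tree"): 6}
print(round(spmi(pairs, "writer", "book"), 3))
```

Averaging such scores over the derived words of each candidate root, against the previous word as
context, gives the selection criterion described above.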
5. RESULTS AND EVALUATION
IR is the common methodology for evaluating a new stemmer because of the lack of stemmed
benchmarks [21]. This section introduces the validation dataset, the evaluation measures, and
finally the experimental results.
5.1. Validation Dataset
Direct evaluation is important to show stemming accuracy and potential improvements. A manually
annotated dataset was used to measure the stemmer's accuracy and to compare it with other
stemmers. The dataset is part of the International Corpus of Arabic (ICA) [41]. Various Arabic
resources, such as newspapers, books, and magazines, contributed to the collection of the ICA,
which was constructed to provide an appropriate representation of the Arabic language in Modern
Standard Arabic (MSA) [41]. The dataset consists of 10302 tokens associated with various features.
There are 3629 unique word-root pairs, while the remaining words have no associated roots because
they are stopwords or non-Arabic words. After stopword removal, the dataset contains 8941 words.
This is shown in Fig. 3.
Fig.3: Validation Dataset
5.2. Evaluation Criteria
Stemming is beneficial for many tasks, and every task uses the roots in a different way. For
example, IR uses roots as cluster representatives to group related words, while sentiment analysis
is more concerned with the linguistic accuracy of a root. A set of metrics was used to measure
these different usages and to compare CBAS with other stemmers.
Stemming accuracy is one of the basic measures of a stemmer's effectiveness. It is defined as the
ratio between the number of correctly stemmed words and the total number of words in the dataset.
Collecting related words under the same group is important for tasks such as IR. There are two
variations of grouping related words. First, words can be grouped correctly under a semantically
correct root; this is referred to as classification. Second, related words can be grouped
together, not necessarily under a correct root; this is referred to as clustering. Standard
metrics for classification and clustering are accuracy, precision, recall, and the F1 measure,
defined as follows [42, 43]:
accuracy = (1/n) Σ_{i=1}^{n} |Xi ∩ Yi| / |Xi ∪ Yi|

precision = (1/n) Σ_{i=1}^{n} |Xi ∩ Yi| / |Yi|

recall = (1/n) Σ_{i=1}^{n} |Xi ∩ Yi| / |Xi|

F1 measure = (1/n) Σ_{i=1}^{n} 2|Xi ∩ Yi| / (|Xi| + |Yi|)

where
n is the number of evaluated words,
Xi is the set of valid roots for word i, and
Yi is the set of roots extracted for word i.
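These set-based measures can be computed directly over Python sets, as sketched below. Here Xi is
taken as the set of valid roots and Yi as the set of extracted roots for word i (an assumption
about the intended mapping), and the toy transliterated root sets are illustrative only.

```python
def grouping_scores(valid, extracted):
    """Average the set-overlap measures defined above over n words.

    valid[i] plays the role of Xi (valid roots for word i) and
    extracted[i] the role of Yi (roots the stemmer produced for word i).
    """
    n = len(valid)
    acc = sum(len(x & y) / len(x | y) for x, y in zip(valid, extracted)) / n
    prec = sum(len(x & y) / len(y) for x, y in zip(valid, extracted)) / n
    rec = sum(len(x & y) / len(x) for x, y in zip(valid, extracted)) / n
    f1 = sum(2 * len(x & y) / (len(x) + len(y))
             for x, y in zip(valid, extracted)) / n
    return acc, prec, rec, f1

# Toy example: two words, transliterated root candidates.
valid = [{"ktb"}, {"drs"}]
extracted = [{"ktb"}, {"drs", "qrA"}]
print(grouping_scores(valid, extracted))
```

With singleton sets throughout, all four measures collapse to the simple stemming-accuracy ratio
described earlier.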
5.3. Results
The complete set of 8941 words was used to test the proposed stemmer (CBAS) with a window size of
n=3; the set was then reduced to the unique word-root pairs for comparison with other stemmers.
Table 2 shows the comparison between the proposed stemmer (CBAS) and other stemmers: CBAS achieved
an accuracy of 81.5%, an improvement of 9.4%, 67.3%, and 51.2% over the Khoja, ISRI, and
Tashaphyne stemmers, respectively. The accuracy enhancement comes from exploring various possible
roots. Such exploration would not be possible without distributional semantics, which provides a
dynamic and robust way of selecting an appropriate root.
Table 3 and Table 4 show the performance of the proposed stemmer (CBAS) when used as a grouping
mechanism. Table 3 shows that the proposed stemmer (CBAS) has a higher potential to linguistically
group Arabic words than the other stemmers: CBAS outperformed them in the classification task,
with an accuracy of 65.45%. Table 4 shows that the proposed stemmer (CBAS) also offers
improvements in non-linguistic tasks, achieving an accuracy of 73.83% in clustering.
Comparing the linguistic (classification) and non-linguistic (clustering) grouping measures, all
corresponding measures increase, because some clusters were correctly formed irrespective of the
cluster seeds. The classification and clustering measures show the superiority of CBAS over the
other stemmers, indicating the beneficial features of CBAS for the IR task.
Table 2. Stemmers Linguistic Accuracy
Stemmer Linguistic Accuracy
Khoja 72.1%
ISRI 14.2%
Tashaphyne 30.3%
CBAS-PMI 78.84%
CBAS-PPMI 79.49%
CBAS 81.5%
Table 3. Stemmers Classification Measures
Stemmer Accuracy Precision Recall F1 measure
Khoja 57.53% 57.53% 59.59% 58.55%
ISRI 10.43% 10.43% 10.49% 10.46%
Tashaphyne 25.07% 25.07% 25.15% 25.11%
CBAS 65.45% 65.45% 68.23% 66.51%
Table 4. Stemmers Clustering Measures
Stemmer Accuracy Precision Recall F1 measure
Khoja 71.71% 93.09% 75.74% 83.52%
ISRI 12.59% 69.40% 13.34% 22.27%
Tashaphyne 32.25% 72.54% 37.03% 49.03%
CBAS 73.83% 93.71% 75.46% 84.50%
6. CONCLUSION
Many stemmers have been developed to exploit the rich linguistic features provided by roots. Most
stemmers make explicit decisions, statistical or linguistic, to select only one root. Other
stemmers use ranking to express a selection preference rather than choosing a single root
directly; however, in the end a single root is still chosen. Static decisions are very appropriate
for common and frequent cases, but adding other features, such as syntactic information or manual
annotations, would also be valuable. The introduced stemmer employs distributional similarity to
handle incorrect root selection, which is a side effect of the root generation phase. The
existence of a robust filtering mechanism, such as distributional analysis, allows various roots
to be explored. Distributional analysis has several advantages. It can be computed for any corpus
and any language, and it is relatively fast and inexpensive to construct compared with a manually
annotated corpus. Distributional semantics covers many relations between words, and it is robust
against preferences or missing information. It is also very adaptive to context changes, which
makes it suitable for many topics. However, distributional analysis is not as accurate as manually
annotated data; hence, the word derivation process was added to the root selection phase to
tolerate possible errors. The previous techniques were compared with the proposed stemmer (CBAS).
CBAS shows an accuracy of 81.5%, an improvement of 9.4%, 67.3%, and 51.2% over the Khoja, ISRI,
and Tashaphyne stemmers, respectively. CBAS also shows improvements in classification and
clustering, with accuracies of 65.45% and 73.83%, respectively. The results indicate that the
proposed stemmer (CBAS) enhances stemming and related tasks. CBAS represents a methodology for
capturing a word's context and making decisions based on it. CBAS can change its behaviour based
on the underlying data, which could be specialized to a sub-domain of the Arabic language. The
statistical model used by CBAS is relatively simple, yet it incorporates important information
(the word's context) that would be complex to include in a rule-based stemmer. The statistical
model reduces the linguistic complexity of representing various linguistic cases, and it avoids
unexpected interactions and the prioritization schemes needed for ordering rules.
REFERENCES
[1] P. M. Nadkarni, L. Ohno-Machado, and W. W. Chapman, "Natural language processing: an
introduction," Journal of the American Medical Informatics Association, vol. 18, pp. 544-551, 2011.
[2] J. Hutchins, "The first public demonstration of machine translation: the Georgetown-IBM
system, 7th January 1954," 2005.
[3] N. Chomsky, "Three models for the description of language," Information Theory, IRE Transactions
on, vol. 2, pp. 113-124, 1956.
[4] A. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, 1988.
[5] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of the 41st Annual
Meeting on Association for Computational Linguistics-Volume 1, 2003, pp. 423-430.
[6] M. Aljlayl and O. Frieder, "On Arabic search: improving the retrieval effectiveness via a light
stemming approach," in Proceedings of the eleventh international conference on Information and
knowledge management, 2002, pp. 340-347.
[7] I. A. Al‐Sughaiyer and I. A. Al‐Kharashi, "Arabic morphological analysis techniques: A
comprehensive survey," Journal of the American Society for Information Science and Technology,
vol. 55, pp. 189-213, 2004.
[8] M. F. Porter, "Snowball: A language for stemming algorithms," ed, 2001.
[9] J. Xu, A. Fraser, and R. Weischedel, "Empirical studies in strategies for Arabic retrieval," in
Proceedings of the 25th annual international ACM SIGIR conference on Research and development
in information retrieval, 2002, pp. 269-274.
[10] R. Fathalla, Y. El Sonbaty, and M. A. Ismail, "Extraction of Arabic Words form Complex Color
Images," in 9th IEEE International Conference on Document Analysis and Recognition (ICDAR
2007), Brazil, pp. 1223-1227.
[11] C. Akkaya, J. Wiebe, and R. Mihalcea, "Utilizing semantic composition in distributional semantic
models for word sense discrimination and word sense disambiguation," in Semantic Computing
(ICSC), 2012 IEEE Sixth International Conference on, 2012, pp. 45-51.
[12] D. Jurafsky. Word Senses and Word Relations.
[13] G. Bouma, "Normalized (pointwise) mutual information in collocation extraction," Proceedings of
GSCL, pp. 31-40, 2009.
[14] K. C. Ryding, A reference grammar of modern standard Arabic: Cambridge university press, 2005.
[15]A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander, N. Habash, et al., "Madamira: A fast,
comprehensive tool for morphological analysis and disambiguation of arabic," in Proceedings of the
Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland, 2014.
[16] S. M. Oraby, Y. El-Sonbaty, and M. A. El-Nasr, "Exploring the Effects of Word Roots for Arabic
Sentiment Analysis," in International Joint Conference on Natural Language Processing, Nagoya,
Japan, 2013, pp. 471-479.
[17] J. B. Lovins, Development of a stemming algorithm: MIT Information Processing Group, Electronic
Systems Laboratory, 1968.
[18] S. Khoja and R. Garside, "Stemming arabic text," Lancaster, UK, Computing Department, Lancaster
University, 1999.
[19] M. S. Husain, "An unsupervised approach to develop stemmer," International Journal on Natural
Language Computing, vol. 1, pp. 15-23, 2012.
[20] D. Harman, "How effective is suffixing?," JASIS, vol. 42, pp. 7-15, 1991.
[21] I. Smirnov, "Overview of stemming algorithms," Mechanical Translation, vol. 52, 2008.
[22] Y. Benajiba, M. Diab, and P. Rosso, "Arabic named entity recognition using optimized feature sets,"
in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2008, pp.
284-293.
[23] K. Darwish and D. W. Oard, "CLIR Experiments at Maryland for TREC-2002: Evidence combination
for Arabic-English retrieval," DTIC Document2003.
[24] L. S. Larkey and M. E. Connell, "Arabic information retrieval at UMass in TREC-10," DTIC
Document2006.
[25] L. S. Larkey, L. Ballesteros, and M. E. Connell, "Light stemming for Arabic information retrieval," in
Arabic computational morphology, ed: Springer, 2007, pp. 221-243.
[26] L. S. Larkey, L. Ballesteros, and M. E. Connell, "Improving stemming for Arabic information
retrieval: light stemming and co-occurrence analysis," in Proceedings of the 25th annual international
ACM SIGIR conference on Research and development in information retrieval, 2002, pp. 275-282.
[27] K. Taghva, R. Elkhoury, and J. Coombs, "Arabic stemming without a root dictionary," in
Proceedings of the International Conference on Information Technology: Coding and Computing
(ITCC), 2005, pp. 152-157.
[28] T. Zerrouki. (2010). Tashaphyne, Arabic light stemmer/segment.
[29] K. Darwish, "Building a shallow Arabic morphological analyzer in one day," in Proceedings of the
ACL-02 workshop on Computational approaches to semitic languages, 2002, pp. 1-8.
[30] K. R. Beesley, "Arabic morphological analysis on the Internet," in Proceedings of the 6th
International Conference and Exhibition on Multi-lingual Computing, 1998.
[31] O. Smrž, "Elixirfm: implementation of functional arabic morphology," in Proceedings of the 2007
Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources,
2007, pp. 1-8.
[32] O. Smrž and P. Zemánek, "Prague Arabic Dependency Treebank: A Word on the Million Words."
[33] T. Buckwalter, "Buckwalter {Arabic} Morphological Analyzer Version 1.0," 2002.
[34] N. Habash, O. Rambow, and R. Roth, "MADA+ TOKAN: A toolkit for Arabic tokenization,
diacritization, morphological disambiguation, POS tagging, stemming and lemmatization," in
Proceedings of the 2nd International Conference on Arabic Language Resources and Tools
(MEDAR), Cairo, Egypt, 2009, pp. 102-109.
[35] M. Diab, K. Hacioglu, and D. Jurafsky, "Automated methods for processing arabic text: From
tokenization to base phrase chunking," Arabic Computational Morphology: Knowledge-based and
Empirical Methods. Kluwer/Springer, 2007.
[36] S. N. Saleh and Y. El-Sonbaty, "A feature selection algorithm with redundancy reduction for text
classification," in Computer and information sciences, 2007. iscis 2007. 22nd international
symposium on, 2007, pp. 1-6.
[37] S. Oraby, Y. El-Sonbaty, and M. A. El-Nasr, "Finding Opinion Strength Using Rule-Based Parsing
for Arabic Sentiment Analysis," in Advances in Soft Computing and Its Applications, ed: Springer,
2013, pp. 509-520.
[38] A. M. Ezzeldin, M. H. Kholief, and Y. El-Sonbaty, "ALQASIM: Arabic language question answer
selection in machines," in Information Access Evaluation. Multilinguality, Multimodality, and
Visualization, ed: Springer, 2013, pp. 100-103.
[39] A. M. Ezzeldin, Y. El-Sonbaty, and M. H. Kholief, "Exploring the Effects of Root Expansion,
Sentence Splitting and Ontology on Arabic Answer Selection," Natural Language Processing and
Cognitive Science: Proceedings 2014, p. 273, 2015.
[40] M. Abbas, K. Smaïli, and D. Berkani, "Evaluation of Topic Identification Methods on Arabic
Corpora," JDIM, vol. 9, pp. 185-192, 2011.
[41] S. Alansary, M. Nagi, and N. Adly, "Building an International Corpus of Arabic (ICA): progress of
compilation stage," in 7th international conference on language engineering, Cairo, Egypt, 2007, pp.
5-6.
[42] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Advances
in Knowledge Discovery and Data Mining, ed: Springer, 2004, pp. 22-30.
[43] M. Hillenmeyer. Machine Learning.