This document describes research on developing an Arabic sentiment lexicon using an existing Arabic lexical semantic database called RDI. The researchers generated a sentiment lexicon called SentiRDI by determining the subjectivity and polarity of words and semantic fields in the RDI database. They used a seed list of sentiment words and semantic relations in RDI, like synonyms and antonyms, to propagate sentiment scores. The system achieved 76.1% accuracy in classifying sentences as subjective or objective using multiple machine learning algorithms and an Arabic language corpus. Key challenges in building Arabic sentiment resources included the lack of semantic databases and the morphological complexity of the Arabic language.
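The seed-list propagation over synonym and antonym relations can be sketched as follows. The relation tables, romanization, and scores below are invented illustrations, not entries from the RDI database:

```python
# Hypothetical sketch of seed-based sentiment propagation over lexical
# relations: synonyms copy a word's polarity score, antonyms flip it.
SYNONYMS = {"happy": ["glad"], "glad": ["happy"], "sad": ["unhappy"], "unhappy": ["sad"]}
ANTONYMS = {"happy": ["sad"], "sad": ["happy"]}

def propagate(seeds, iterations=2):
    """Spread polarity scores from a seed list through the relation graph."""
    scores = dict(seeds)
    for _ in range(iterations):
        updates = {}
        for word, score in scores.items():
            for syn in SYNONYMS.get(word, []):
                updates.setdefault(syn, score)       # same polarity
            for ant in ANTONYMS.get(word, []):
                updates.setdefault(ant, -score)      # opposite polarity
        for w, s in updates.items():
            scores.setdefault(w, s)                  # seeds are never overwritten
    return scores

lexicon = propagate({"happy": 1.0})
```

With one positive seed, the antonym link assigns a negative score to "sad", and a second pass carries it on to "unhappy" via synonymy.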
This document summarizes current research on morphological analysis techniques for the Assamese language. It discusses prior work using rule-based and unsupervised methods for morphological analysis of several Indian languages, including Hindi, Bengali, Punjabi, Marathi, Tamil, Malayalam, Kannada, and Assamese. For Assamese specifically, it describes several studies that used suffix stripping and rule-based approaches to develop morphological analyzers, as well as some initial work on unsupervised techniques. The document concludes that while most existing work on Assamese has used supervised suffix stripping methods, unsupervised techniques show promise but have not been fully explored.
An implementation of apertium based assamese morphological analyzer (ijnlc)
Morphological analysis is an important branch of linguistics for any natural language processing technology. Morphology studies the word structure and word formation of a language. In the current NLP research scenario, morphological analysis techniques have become increasingly popular. Before processing any language, the morphology of its words should first be analyzed. The Assamese language has a very complex morphological structure. In our work we used Apertium-based finite-state transducers to develop a morphological analyzer for the Assamese language over a limited domain, achieving 72.7% accuracy.
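A finite-state-style analysis can be illustrated with a longest-match suffix lookup. This is only a toy sketch of the idea, not Apertium's actual dictionary format, and the romanized stems and suffix tags are invented examples:

```python
# Illustrative finite-state-style analyzer: surface suffixes map to
# morphological tags, in the spirit of a monolingual dictionary lookup.
SUFFIX_TAGS = {
    "bor": "<pl>",   # hypothetical plural suffix
    "t": "<loc>",    # hypothetical locative suffix
}
STEMS = {"kitap": "book"}

def analyze(surface):
    """Return (lemma, tags) by longest-match stem + suffix split, or None."""
    for i in range(len(surface), 0, -1):
        stem, suffix = surface[:i], surface[i:]
        if stem in STEMS and (suffix == "" or suffix in SUFFIX_TAGS):
            return STEMS[stem], SUFFIX_TAGS.get(suffix, "")
    return None
```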
Comparative performance analysis of two anaphora resolution systems (ijfcstjournal)
Anaphora resolution is the process of finding referents in a given discourse, and it is one of the more complex tasks in linguistics. This paper presents a performance analysis of two computational models that use the Gazetteer method for resolving anaphora in the Hindi language. In the Gazetteer method, different classes (gazettes) of elements are created; these gazettes provide external knowledge to the system. Both models use recency and animistic factors for resolving anaphors. For the recency factor, the first model uses the centering approach and the second uses the Lappin-Leass approach; gazetteers provide the animistic knowledge. The paper presents experimental results for both models, conducted on short Hindi stories, news articles, and biographical content from Wikipedia. The accuracy of each model is analyzed, and a conclusion is drawn about the model best suited to the Hindi language.
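The recency-plus-animacy scoring idea can be sketched as a candidate ranking: antecedents score higher when mentioned more recently and when the gazette agrees with the anaphor's animacy. The weighting scheme and gazette entries are illustrative assumptions, not the systems' actual parameters:

```python
# Hypothetical animate-entity gazette (the "animistic" knowledge source).
ANIMATE_GAZETTE = {"Sita", "Ram"}

def resolve(candidates, anaphor_is_animate=True):
    """candidates: list of (mention, sentence_distance); return best mention."""
    def score(item):
        mention, distance = item
        recency = 1.0 / (1 + distance)   # closer mention => higher score
        animacy = 1.0 if (mention in ANIMATE_GAZETTE) == anaphor_is_animate else 0.0
        return recency + animacy
    return max(candidates, key=score)[0]

best = resolve([("Sita", 1), ("book", 0)])
```

For an animate anaphor, the gazette match lets "Sita" beat the more recent but inanimate "book".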
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON (ijnlc)
In English, the wide availability of lexical resources accelerates and simplifies sentiment analysis; in Arabic, resources are few and not comprehensive. Most current efforts to construct an Arabic Sentiment Lexicon (ASL) depend on a large number of lexical entries. However, coverage of Arabic sentiment expressions can be achieved with refined regular expressions rather than a large number of lexical entries. This paper presents an ASL that is more comprehensive than existing lexicons, covering many expressions in different dialects including Franco-Arabic, while at the same time being more compact. The paper also shows how to integrate different lexicons and refine them. To enrich the lexical entries with robust morphological and syntactic information, regular expressions, sentiment-polarity weights, and n-gram terms have been added to each entry.
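The regular-expression idea is that one refined pattern covers many surface variants (letter elongation, Franco-Arabic spellings) that would otherwise each need their own lexicon entry. The pattern below is an invented example, not one of the paper's actual expressions:

```python
import re

# One pattern for the positive word "حلو" (nice): Arabic spelling with
# optional letter elongation, plus two Franco-Arabic transliterations.
PATTERN = re.compile(r"ح+ل+و+|7elw|7ilw")

def matches(token):
    """True if the token is a surface variant covered by the pattern."""
    return bool(PATTERN.fullmatch(token))
```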
This document summarizes research that has been done on computational morphology for the Odia language. It begins with an abstract that outlines how morphological analysis, generation, and parsing are important tools for natural language processing. The document then reviews different works that have developed morphological analyzers and generators for Odia. It describes various methods that have been used, including suffix stripping, finite state transducers, two-level morphology, corpus-based approaches, and paradigm-based approaches. Finally, it outlines several applications of morphology like machine translation, spelling checking, and part-of-speech tagging.
Automatic text simplification evaluation aspects (iwan_rg)
The document discusses automatic text simplification (ATS) and methods for evaluating ATS systems. It provides an overview of common evaluation metrics like BLEU, SARI, FKGL, and SAMSA and compares their abilities to measure simplicity, meaning preservation, and grammaticality. The document also outlines a proposed project to build a corpus and develop a graded reading scale to guide the simplification of Arabic fiction works for educational purposes.
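Of the metrics listed, FKGL has the simplest closed form; it can be computed directly from word, sentence, and syllable totals (syllable counting itself is language-specific and is taken as given here):

```python
def fkgl(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level from pre-counted corpus totals."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

# 100 words in 10 sentences with 130 syllables:
grade = fkgl(total_words=100, total_sentences=10, total_syllables=130)
```

Note that FKGL only measures surface readability; unlike SARI or SAMSA it says nothing about meaning preservation.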
A corpus driven comparative analysis of modal verbs in pakistani and british ... (Alexander Decker)
This document discusses a corpus-driven comparative analysis of modal verbs in Pakistani and British English fiction. It begins by introducing the study, which compiled corpora of 1 million words each from Pakistani English fiction (PEF) and British English fiction (BEF). Part-of-speech tagging was performed on both corpora using CLAWS tagging, and concordance lines of modal verbs were manually explored using Antconc software. The study aims to identify differences in modal verb usage between PEF and BEF and provide stylistic interpretations. A literature review covers previous research on modal verb classification and analyses of modal verbs in different varieties of English. The methodology explains that both qualitative and quantitative methods are used, including corpus compilation, POS tagging, and concord
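The quantitative step of such a comparison reduces to normalized frequencies (per million words), so counts from corpora of different sizes stay comparable. The counts below are invented, not the study's figures:

```python
MODALS = ["can", "could", "may", "might", "must", "shall", "should", "will", "would"]

def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size

def compare(counts_a, size_a, counts_b, size_b):
    """Return {modal: (freq_in_A, freq_in_B)} normalized per million words."""
    return {m: (per_million(counts_a.get(m, 0), size_a),
                per_million(counts_b.get(m, 0), size_b))
            for m in MODALS}

# Toy counts for two 1-million-word corpora (e.g. PEF vs BEF):
table = compare({"will": 2100}, 1_000_000, {"will": 1500}, 1_000_000)
```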
Development of morphological analyzer for Hindi (Made Artha)
This document presents the development of a morphological analyzer for the Hindi language. It discusses the importance of morphological analysis in natural language processing. Hindi is described as a morphologically rich, inflected language capable of generating hundreds of words from root words. The proposed analyzer will take in single Hindi words or sentences and return the root word along with features like part of speech, gender, number, and person. It will use both a rule-based approach and corpus-based approach, as previous analyzers have not worked for both inflectional and derivational morphemes. The document outlines the structure of Hindi words and sentences to provide context for developing the analyzer.
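The analyzer's input/output contract (surface form in, root plus features out) can be sketched as a paradigm-style lookup. The romanized forms and the paradigm table are invented examples, not the paper's data:

```python
# Toy paradigm table: surface form -> (root, grammatical features).
PARADIGM = {
    "ladka": ("ladka", {"pos": "noun", "gender": "m", "number": "sg"}),
    "ladke": ("ladka", {"pos": "noun", "gender": "m", "number": "pl"}),
    "ladki": ("ladki", {"pos": "noun", "gender": "f", "number": "sg"}),
}

def analyze_hindi(word):
    """Return (root, features) for a known inflected form, else None."""
    return PARADIGM.get(word)
```

A real analyzer would combine such paradigm tables with rules so derivational forms outside the table are still covered, which is the gap the paper's hybrid rule-plus-corpus approach targets.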
A Tool to Search and Convert Reduplicate Words from Hindi to Punjabi (IJERA Editor)
This document discusses reduplication in Hindi and proposes developing a tool to identify and translate reduplicated words from Hindi to Punjabi. It begins by defining reduplication and describing different types, including morphological, lexical, echo formation, compounds, word and discontinuous reduplication. It then states the problem of automatically identifying echo words in Hindi conversations to translate to Punjabi. Next, it discusses using rule-based and statistical approaches and choosing rule-based. Finally, it outlines the work to be done, including studying reduplication, past research, translating echo words, and rule-based approaches.
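A rule-based echo-word check can be sketched as follows: in Hindi echo formation the second element typically repeats the first with its initial sound replaced (often by "v"/"w" in romanization, e.g. "chai-vai"). The onset rule below is a simplification for romanized input, not the tool's actual rule set:

```python
def is_echo_pair(first, second):
    """True if `second` looks like an echo word of `first` (v/w onset rule)."""
    if not first or not second:
        return False

    def split_onset(w):
        # Split off the initial consonant cluster (romanized, naive vowel set).
        i = 0
        while i < len(w) and w[i] not in "aeiou":
            i += 1
        return w[:i], w[i:]

    _, rhyme1 = split_onset(first)
    onset2, rhyme2 = split_onset(second)
    return onset2 in ("v", "w") and rhyme1 == rhyme2

result = is_echo_pair("chai", "vai")
```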
Sanskrit in Natural Language Processing (Hitesh Joshi)
Sanskrit is the least ambiguous language compared to other natural languages. As stated by Rick Briggs of NASA, it is the most suitable language for computers in natural language processing.
An OT Account of Phonological Alignment and Epenthesis in Aligarh Urdu (ijtsrd)
This paper examines the phonological properties of alignment and the economical procedures of epenthesis in the syllable structure of words in Aligarh Urdu. It describes the behavior of certain segments that attach to neighboring words and elaborates the economy of the syllable structure of tokens in the language. In Aligarh Urdu, various segmental processes of phoneme addition or deletion affect the root and alter the entire structure of words. The objective of the paper is to determine the exact economic conditions of syllable structures after the addition, elision, and alignment of segments. Conflicts between the addition and deletion of segments are handled within the framework of constraint rankings in Optimality Theory (Prince and Smolensky, 1993). The broader purpose is to reveal how the principles of Optimality Theory apply and to explore the framework of syllables with their marginal and obligatory components. The phonological behavior of consonant clusters is governed with the help of faithfulness constraints and markedness constraints. The structure of a root word differs completely from the formation of other words, but after the imposition of constraints the concrete facts about linguistic items in the language are revealed. The groundwork of the paper thus leads to a systematic account of epenthesis and of the elimination of vowels or consonants within the tenets of OT. The paper also deals with the gradient property of segments, which alters the underlying form and is affected by other features at the surface form. Each step in the syllable structure of words is related to the positional variation of input and output candidates.
Competition between input and output candidates for the status of optimal candidate is resolved only through the constraint rankings developed in Optimality Theory. Mohd Hamid Raza, "An OT Account of Phonological Alignment and Epenthesis in Aligarh Urdu", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-3, April 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23517.pdf
Paper URL: https://www.ijtsrd.com/humanities-and-the-arts/other/23517/an-ot-account-of-phonological-alignment-and-epenthesis-in-aligarh-urdu/mohd-hamid-raza
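The core OT mechanism (ranked constraints assigning violation marks, with the optimal candidate winning on the highest-ranked constraint where candidates differ) can be sketched as a tableau evaluation. The constraints and candidates below are toy English-like examples, not the paper's Aligarh Urdu data:

```python
def eval_tableau(candidates, constraints):
    """constraints: violation-counting functions, highest-ranked first.
    Lexicographic comparison of violation profiles picks the optimal candidate."""
    def profile(cand):
        return tuple(c(cand) for c in constraints)
    return min(candidates, key=profile)

# Toy ranking: *COMPLEX (no consonant clusters) >> DEP (no epenthesis).
def no_complex(cand):
    cons = [ch not in "aeiou" for ch in cand]
    return sum(1 for a, b in zip(cons, cons[1:]) if a and b)

def dep(cand, underlying="sport"):
    # Count epenthesized segments relative to the underlying form.
    return max(0, len(cand) - len(underlying))

winner = eval_tableau(["sport", "siporit", "siport"], [no_complex, dep])
```

Because *COMPLEX outranks DEP, the fully epenthesized candidate "siporit" wins despite its two DEP violations, mirroring how cluster-breaking epenthesis is forced under such a ranking.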
Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most natural language processing tasks, especially for derivative languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots could be extracted from the same word. Distributional semantics, on the other hand, is a powerful co-occurrence model: it captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It achieved an accuracy of 81.5%, at least a 9.4% improvement over other stemmers.
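Smoothed PMI over word/context co-occurrence counts is the association measure such a model builds on. Add-one smoothing below stands in for whatever smoothing the authors used, and the romanized counts are toy data:

```python
import math

def spmi(cooc, word, context, k=1.0):
    """cooc: {(word, context): count} -> add-k smoothed PMI in bits."""
    total = sum(cooc.values()) + k * len(cooc)
    p_wc = (cooc.get((word, context), 0) + k) / total
    p_w = sum(c + k for (w, _), c in cooc.items() if w == word) / total
    p_c = sum(c + k for (_, ctx), c in cooc.items() if ctx == context) / total
    return math.log2(p_wc / (p_w * p_c))

# Toy counts: "kitab" (book) co-occurs with "qara" (read) far more
# often than with "akl" (food).
COOC = {("kitab", "qara"): 8, ("kitab", "akl"): 1,
        ("tufaha", "akl"): 8, ("tufaha", "qara"): 1}
assoc = spmi(COOC, "kitab", "qara")
non_assoc = spmi(COOC, "kitab", "akl")
```

A positive score marks an associated word/context pair and a negative score an unassociated one, which is the contextual signal used to pick among ambiguous root candidates.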
The document discusses House's model of translation quality assessment and its application to analyzing a literary work and its translation. House's model assesses translation quality through a systematic comparison of the original text and its translation at the levels of language, register (including field, tenor, and mode), and genre. It categorizes errors in translation as either covertly erroneous errors due to mismatches in register dimensions, or overtly erroneous errors involving denotative mismatches or target language errors. The document applies House's model to analyze a translation of John Steinbeck's novel "The Grapes of Wrath" into Persian to determine if it is a covert or overt translation as defined by House's typology.
This document outlines several papers on Arabic spell checking techniques. It discusses approaches that use morphological analysis, edit distance, and knowledge bases to detect and correct spelling errors in Arabic text. The techniques detect ill-formed words, semantically incorrect words, missing, incorrect, or extra letters. Evaluations show the approaches achieve over 80% recall and 90% precision on error detection and correction. Later papers propose full spell checking systems but do not report experimental results.
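The edit-distance component common to several of the surveyed approaches ranks candidate corrections by Levenshtein distance to the misspelled word. Below is a standard dynamic-programming sketch; the tiny romanized word list is illustrative, not from any of the papers:

```python
def levenshtein(a, b):
    """Minimum edits (insert/delete/substitute) to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_dist=2):
    """Dictionary words within max_dist edits, nearest first."""
    ranked = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for d, w in ranked if d <= max_dist]
```

A full system would add a language model or error model on top to rank candidates at the same distance, which is what the later surveyed papers refine.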
Implementation Of Syntax Parser For English Language Using Grammar Rules (IJERA Editor)
For many years we have used Chomsky's generative system of grammars, particularly context-free grammars (CFGs) and regular expressions (REs), to express the syntax of programming languages and protocols. Syntactic parsing deals with the syntactic structure of a sentence; "syntax" refers to the grammatical arrangement of words in a sentence and their relationship with other words. The main goal of syntactic analysis is to find the syntactic structure of a sentence, usually represented as a tree. Identifying the syntactic structure is useful in determining the meaning of a sentence. Natural language processing processes data through lexical analysis, syntax analysis, semantic analysis, discourse processing, and pragmatic analysis. This paper surveys various parsing methods. The algorithm presented splits English sentences into parts using a POS (part-of-speech) tagger, identifies the sentence type (simple, complex, interrogative, factual, active, passive, etc.), and then parses those sentences using the grammar rules of natural language. As natural language processing becomes increasingly relevant, there is a need for treebanks catered to the specific needs of more individualized systems. We present an open-source technique to check and correct grammar; the methodology provides appropriate grammatical suggestions.
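The sentence-type identification step can be sketched with simple surface rules over the first token. A real system would use POS tags from the tagger; the keyword lists here are a simplification for illustration:

```python
WH_WORDS = {"what", "who", "where", "when", "why", "how", "which"}
AUXILIARIES = {"is", "are", "do", "does", "did", "can", "could", "will", "would"}

def sentence_type(sentence):
    """Classify a sentence as interrogative or declarative by surface cues."""
    tokens = sentence.lower().rstrip("?.!").split()
    if not tokens:
        return "empty"
    # Questions typically start with a wh-word or a fronted auxiliary.
    if tokens[0] in WH_WORDS or tokens[0] in AUXILIARIES:
        return "interrogative"
    if sentence.rstrip().endswith("?"):
        return "interrogative"
    return "declarative"
```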
The document discusses several approaches to Arabic spell checking. It begins by outlining common Arabic spelling errors and then summarizes seven different approaches: 1) a stochastic approach focusing on space insertions and deletions, 2) a Prolog-based approach detecting errors and offering corrections, 3) an approach using n-gram scores and a matrix method, 4) generating an Arabic word list for error detection and language modeling, 5) a system with error detection and correction components using a dictionary and error model, 6) improving this system by enhancing the dictionary and candidate generation, and 7) concluding that the results are promising but there is still room for improvement.
The document describes the key steps in natural language processing including morphological analysis, part-of-speech tagging, lexical processing, syntactic processing, semantic analysis, knowledge representation, discourse analysis, and applications such as machine translation. It outlines techniques for analyzing words, assigning parts of speech, determining word meanings, parsing sentences into syntactic structures, assigning semantic meanings, representing knowledge, analyzing discourse, and translating between languages.
The document discusses natural language processing and some of the key challenges involved. It describes how NLP systems aim to understand human language in written or spoken form by performing tasks like morphological analysis, parsing, semantic analysis, and discourse processing. It also discusses sources of ambiguity in natural language and different models and algorithms used to represent linguistic knowledge and process language, with the goal of building intelligent systems that can understand human communication.
The document discusses key concepts from J.C. Catford's theory of translation, including definitions of translation, levels of translation, and categories of linguistic analysis that are applicable to translation. It defines translation as replacing textual material in one language with equivalent material in another. It outlines Catford's categories of linguistic analysis including unit, structure, class, and system. It also discusses ranks in grammar from morpheme to sentence and types of translation according to extent, level, and rank.
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM (ijnlc)
This article describes the morphological analysis of standard Arabic for natural language processing, as part of an electronic-dictionary construction phase. A model of fully inflected 3-lettered verbs is formalized based on a linguistic classification, using the NooJ platform. The classification yields certain representative verbs that are treated as lemmas; these verbs form our dictionary entries and are conjugated according to our inflection paradigm, relying on certain specific morphological properties. This dictionary will serve as an Arabic resource that helps NLP applications and the NooJ platform analyze sophisticated Arabic corpora.
Shallow parser for Hindi language with an input from a transliterator (Shashank Shisodia)
This document summarizes a student project to develop a shallow parser for Hindi language with input from a transliterator. The plan is to create a transliterator to convert Roman script to Devanagari, generate a lexicon from corpus analysis, develop a morphological analyzer using finite state transducers, and implement a shallow parser using context free grammar. The system architecture and flow chart are presented. In conclusion, the document notes that shallow parsing is needed to build full parsers for Hindi and transliteration is important for translating names and terms across languages with different alphabets.
This document discusses coreference resolution in Arabic texts. It begins with an introduction to coreference and characteristics of the Arabic language. It then discusses related work on coreference resolution, the methodology used including an MT-based stemmer and coreference resolution process. It discusses challenges of Arabic information extraction due to its complex morphology. It also covers annotated Arabic corpora and tools used for annotation. In conclusion, it summarizes information on coreference resolution in Arabic.
J.C. Ian Catford was a Scottish linguist born in 1917 who studied translation. He defined translation as substituting text in one language for an equivalent text in another language. Catford viewed translation as unidirectional, from the source language to the target language. He examined issues like extent, meaning, language levels, and finding equivalents in the target language. Catford differentiated between full translation, where every part of the source text is replaced, and partial translation, where some parts may be left untranslated.
This document discusses discourse analysis and discourse relations in Arabic texts. It defines discourse as written or spoken language and explains that texts have cohesive devices that link words, clauses, and sentences. There are two types of discourse relations: explicitly signaled relations indicated by connectives, and implicitly inferred relations without signaling. The document also describes different types of discourse connectives in Arabic and presents an annotation scheme for connectives and their arguments. It discusses related work on Arabic corpora and efforts to collect and analyze Arabic connectives. Finally, it introduces Rhetorical Structure Theory as one of the most popular theories for representing discourse structure.
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMER (ijnlc)
The document describes a deep BiGRU-CNN-CRF neural network model that jointly performs word segmentation, stemming, and named entity recognition for the Myanmar language. The model uses character-level and syllable-level representations as input. It was trained on a manually annotated Myanmar corpus. Evaluation results showed the neural sequence labeling architecture achieved state-of-the-art performance for the joint tasks of word segmentation, stemming, and named entity recognition in Myanmar text.
USING AUTOMATED LEXICAL RESOURCES IN ARABIC SENTENCE SUBJECTIVITY (ijaia)
A common point in almost any work on sentiment analysis is the need to identify which elements of language (words) contribute to expressing subjectivity in text. A collection of these elements (sentiment words), together with their polarities (positive/negative) and independent of context, is called a sentiment lexical resource or subjective lexicon. In this paper, we investigate a method for generating a Sentiment Arabic Lexical Semantic Database using a lexicon-based approach. We also study the effect of each word's prior polarity, taken from our Sentiment Arabic Lexical Semantic Database, on sentence-level subjectivity across multiple machine learning algorithms. The experiments were conducted on the MPQA corpus containing subjective and objective Arabic sentences, and we achieved 76.1% classification accuracy.
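The lexicon-based feature step can be sketched as mapping each sentence to counts of positive and negative lexicon hits, which then feed the machine learning classifiers. The mini-lexicon below is an invented romanized illustration, not the paper's Sentiment Arabic Lexical Semantic Database:

```python
# Toy prior-polarity lexicon: word -> +1 (positive) / -1 (negative).
LEXICON = {"jamil": 1, "ra2i3": 1, "sayyi2": -1}

def sentence_features(tokens):
    """Count lexicon hits per polarity; a sentence with any hit is a
    candidate subjective sentence."""
    pos = sum(1 for t in tokens if LEXICON.get(t, 0) > 0)
    neg = sum(1 for t in tokens if LEXICON.get(t, 0) < 0)
    return {"pos_count": pos, "neg_count": neg,
            "is_subjective_guess": pos + neg > 0}

feats = sentence_features(["alfilm", "jamil", "sayyi2"])
```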
USING OBJECTIVE WORDS IN THE REVIEWS TO IMPROVE THE COLLOQUIAL ARABIC SENTIME... (ijnlc)
One of the main difficulties in sentiment analysis of the Arabic language is the presence of colloquialism. In this paper, we examine the effect of using objective words in conjunction with sentimental words on sentiment classification of colloquial Arabic reviews, specifically Jordanian colloquial reviews. Reviews often include both sentimental and objective words; however, most existing sentiment analysis models ignore the objective words as they are considered useless. In this work, we created two lexicons: the first includes colloquial sentimental words and compound phrases, while the other contains objective words associated with sentiment-tendency values based on a particular estimation method. We used these lexicons to extract sentiment features that serve as training input to Support Vector Machines (SVM) for classifying the sentiment polarity of reviews. The reviews dataset was collected manually from the JEERAN website. The experimental results show that the proposed approach improves polarity classification in comparison to two baseline models, with 95.6% accuracy.
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG... (ijnlc)
This article introduces a methodology for analyzing sentiment in Arabic text using a foreign lexical source. Our method leverages a resource available in another language, such as SentiWordNet in English, to compensate for Arabic's limited language resources. The knowledge taken from the external resource is injected into the feature model while the machine-learning-based classifier is trained. The first step of our method builds the bag-of-words (BOW) model of the Arabic text. The second step calculates polarity scores using machine translation and the English SentiWordNet; the scores for each text are added to the model as three values: objective, positive, and negative. The last step trains the ML classifier on that model to predict the sentiment of the Arabic text. Our method improves performance compared with the BOW baseline in most cases. In addition, it appears to be a viable approach to sentiment analysis of Arabic text where available resources are limited.
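The feature-injection step can be sketched as augmenting the bag-of-words vector with three scores (objective, positive, negative) looked up for the text. The translation and SentiWordNet lookups are abstracted into a toy score table here; the feature names and values are illustrative assumptions:

```python
# Toy stand-in for translate-then-SentiWordNet lookup: word -> (obj, pos, neg).
SCORES = {"good": (0.1, 0.8, 0.1), "bad": (0.1, 0.1, 0.8)}

def featurize(tokens):
    """Bag-of-words counts plus three averaged polarity scores."""
    bow = {}
    for t in tokens:
        bow[t] = bow.get(t, 0) + 1
    hits = [SCORES[t] for t in tokens if t in SCORES]
    if hits:
        n = len(hits)
        obj, pos, neg = (sum(h[i] for h in hits) / n for i in range(3))
    else:
        obj = pos = neg = 0.0
    bow.update({"__obj__": obj, "__pos__": pos, "__neg__": neg})
    return bow

vec = featurize(["movie", "good", "good"])
```

The classifier then trains on these augmented vectors, so the external polarity knowledge is available even for words unseen in the Arabic training data.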
Development of morphological analyzer for hindiMade Artha
This document presents the development of a morphological analyzer for the Hindi language. It discusses the importance of morphological analysis in natural language processing. Hindi is described as a morphologically rich, inflected language capable of generating hundreds of words from root words. The proposed analyzer will take in single Hindi words or sentences and return the root word along with features like part of speech, gender, number, and person. It will use both a rule-based approach and corpus-based approach, as previous analyzers have not worked for both inflectional and derivational morphemes. The document outlines the structure of Hindi words and sentences to provide context for developing the analyzer.
A Tool to Search and Convert Reduplicate Words from Hindi to PunjabiIJERA Editor
This document discusses reduplication in Hindi and proposes developing a tool to identify and translate reduplicated words from Hindi to Punjabi. It begins by defining reduplication and describing different types, including morphological, lexical, echo formation, compounds, word and discontinuous reduplication. It then states the problem of automatically identifying echo words in Hindi conversations to translate to Punjabi. Next, it discusses using rule-based and statistical approaches and choosing rule-based. Finally, it outlines the work to be done, including studying reduplication, past research, translating echo words, and rule-based approaches.
Sanskrit in Natural Language ProcessingHitesh Joshi
As Sanskrit is most unambiguous language as compare to other natural languages. As stated by Rick Briggs, NASA it is the most suitable language for the computer in natural language processing.
An OT Account of Phonological Alignment and Epenthesis in Aligarh Urduijtsrd
This paper examines the phonological properties of alignment and the economy of epenthesis in the syllable structure of words in Aligarh Urdu. It determines the behavior of certain segments that attach to neighboring words and elaborates the economy of the syllable structure of tokens in the language. In Aligarh Urdu, various segmental processes of phoneme addition or deletion affect the root and alter the entire structure of words. The objective of this paper is to establish the exact economic conditions of syllable structures after the addition, elision and alignment of segments in Aligarh Urdu. The conflicts between the addition and deletion of segments are handled within the framework of constraint rankings in Optimality Theory (Prince and Smolensky, 1993). The general purpose of this paper is to reveal the implications of the principles of Optimality Theory and to explore the actual framework of syllables with their marginal and obligatory components. The researcher analyzes the phonological properties of consonant clusters with the help of faithfulness constraints and markedness constraints. The architecture of the root word differs considerably from the formation of other words, but after the imposition of constraints, the concrete facts of linguistic items in the language are revealed. The groundwork of this paper leads to a systematic account of epenthesis and the elimination of vowels or consonants under the tenets of OT. The paper deals with the gradient properties of segments that alter the underlying form and are affected by other features at the surface form. Each step in the syllabification of words is related to the positional variation of input and output candidates.
Conflicts among input and output candidates competing to become the optimal candidate can be resolved only through the constraint rankings of Optimality Theory. Mohd Hamid Raza, "An OT Account of Phonological Alignment and Epenthesis in Aligarh Urdu", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-3, April 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23517.pdf
Paper URL: https://www.ijtsrd.com/humanities-and-the-arts/other/23517/an-ot-account-of-phonological-alignment-and-epenthesis-in-aligarh-urdu/mohd-hamid-raza
Arabic morphology encapsulates many valuable features such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivational languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots can be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence model: it captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, at least a 9.4% improvement over other stemmers.
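As a rough illustration of the weighting at the core of such a model, here is a minimal SPMI sketch over toy co-occurrence counts (the count tables and the alpha = 0.75 smoothing value are illustrative assumptions, not the paper's actual setup):

```python
import math

def smoothed_pmi(pair_counts, word_counts, context_counts, alpha=0.75):
    """Smoothed PMI: context probabilities are computed from counts
    raised to alpha, which dampens the bias toward rare contexts."""
    total = sum(pair_counts.values())
    total_ctx = sum(c ** alpha for c in context_counts.values())
    scores = {}
    for (word, ctx), n in pair_counts.items():
        p_joint = n / total
        p_word = word_counts[word] / total
        p_ctx = (context_counts[ctx] ** alpha) / total_ctx
        scores[(word, ctx)] = math.log2(p_joint / (p_word * p_ctx))
    return scores
```

Under a model of this kind, the candidate root whose typical contexts score higher with the ambiguous surface word would be preferred during stemming.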
The document discusses House's model of translation quality assessment and its application to analyzing a short story and its translation. House's model assesses translation quality through a systematic comparison of the original text and its translation at the levels of language, register (including field, tenor, and mode), and genre. It categorizes errors in translation as either covertly erroneous errors due to mismatches in register dimensions, or overtly erroneous errors involving denotative mismatches or target language errors. The document applies House's model to analyze a translation of John Steinbeck's short story "The Grapes of Wrath" into Persian to determine if it is a covert or overt translation as defined by House's typology.
This document outlines several papers on Arabic spell checking techniques. It discusses approaches that use morphological analysis, edit distance, and knowledge bases to detect and correct spelling errors in Arabic text. The techniques detect ill-formed words, semantically incorrect words, missing, incorrect, or extra letters. Evaluations show the approaches achieve over 80% recall and 90% precision on error detection and correction. Later papers propose full spell checking systems but do not report experimental results.
Implementation Of Syntax Parser For English Language Using Grammar RulesIJERA Editor
For many years we have been using Chomsky's generative system of grammars, particularly context-free grammars (CFGs) and regular expressions (REs), to express the syntax of programming languages and protocols. Syntactic parsing works with the syntactic structure of a sentence. 'Syntax' refers to the grammatical arrangement of words in a sentence and their relationships with one another. The main focus of syntactic analysis is to find the syntactic structure of a sentence, which is usually represented as a tree. Identifying the syntactic structure is useful in determining the meaning of a sentence. Natural language processing processes data through lexical analysis, syntax analysis, semantic analysis, discourse processing, and pragmatic analysis. This paper surveys various parsing methods. The algorithm in this paper splits English sentences into parts using a POS (Part-Of-Speech) tagger, identifies the type of sentence (simple, complex, interrogative, factual, active, passive, etc.), and then parses these sentences using grammar rules of natural language. As natural language processing becomes increasingly relevant, there is a need for treebanks catering to the specific needs of more individualized systems. Here, we present an open-source technique to check and correct grammar. The methodology gives appropriate grammatical suggestions.
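To make the rule-based pipeline concrete, a toy recursive-descent parser over hand-written grammar rules might look like this (the grammar and mini-lexicon are invented for illustration; a real system would run a trained POS tagger first and use far richer rules):

```python
# Hypothetical grammar rules and mini-lexicon.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {"the": "Det", "dog": "N", "cat": "N", "chased": "V"}

def parse(symbol, tokens):
    """Recursive descent: return leftover tokens if `symbol` matches
    a prefix of `tokens`, else None. (No left recursion allowed.)"""
    if symbol not in GRAMMAR:  # terminal: match a POS tag
        if tokens and LEXICON.get(tokens[0]) == symbol:
            return tokens[1:]
        return None
    for production in GRAMMAR[symbol]:
        rest = tokens
        for sym in production:
            rest = parse(sym, rest)
            if rest is None:
                break
        else:
            return rest
    return None

def accepts(sentence):
    # The sentence is grammatical iff S consumes every token.
    return parse("S", sentence.lower().split()) == []
```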
The document discusses several approaches to Arabic spell checking. It begins by outlining common Arabic spelling errors and then summarizes seven different approaches: 1) a stochastic approach focusing on space insertions and deletions, 2) a Prolog-based approach detecting errors and offering corrections, 3) an approach using n-gram scores and a matrix method, 4) generating an Arabic word list for error detection and language modeling, 5) a system with error detection and correction components using a dictionary and error model, 6) improving this system by enhancing the dictionary and candidate generation, and 7) concluding that the results are promising but there is still room for improvement.
The document describes the key steps in natural language processing including morphological analysis, part-of-speech tagging, lexical processing, syntactic processing, semantic analysis, knowledge representation, discourse analysis, and applications such as machine translation. It outlines techniques for analyzing words, assigning parts of speech, determining word meanings, parsing sentences into syntactic structures, assigning semantic meanings, representing knowledge, analyzing discourse, and translating between languages.
The document discusses natural language processing and some of the key challenges involved. It describes how NLP systems aim to understand human language in written or spoken form by performing tasks like morphological analysis, parsing, semantic analysis, and discourse processing. It also discusses sources of ambiguity in natural language and different models and algorithms used to represent linguistic knowledge and process language, with the goal of building intelligent systems that can understand human communication.
The document discusses key concepts from J.C. Catford's theory of translation, including definitions of translation, levels of translation, and categories of linguistic analysis that are applicable to translation. It defines translation as replacing textual material in one language with equivalent material in another. It outlines Catford's categories of linguistic analysis including unit, structure, class, and system. It also discusses ranks in grammar from morpheme to sentence and types of translation according to extent, level, and rank.
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORMijnlc
This article describes the morphological analysis of standard Arabic for natural language processing, as part of an electronic dictionary-construction phase. A model of fully inflected 3-lettered verbs is formalized based on a linguistic classification, using the NOOJ platform. The classification yields certain representative verbs that are considered as lemmas; these verbs form our dictionary entries, and they are conjugated according to our inflection paradigm, relying on certain specific morphological properties. This dictionary will serve as an Arabic resource that helps NLP applications and the NOOJ platform analyse sophisticated Arabic corpora.
Shallow parser for hindi language with an input from a transliteratorShashank Shisodia
This document summarizes a student project to develop a shallow parser for Hindi language with input from a transliterator. The plan is to create a transliterator to convert Roman script to Devanagari, generate a lexicon from corpus analysis, develop a morphological analyzer using finite state transducers, and implement a shallow parser using context free grammar. The system architecture and flow chart are presented. In conclusion, the document notes that shallow parsing is needed to build full parsers for Hindi and transliteration is important for translating names and terms across languages with different alphabets.
This document discusses coreference resolution in Arabic texts. It begins with an introduction to coreference and characteristics of the Arabic language. It then discusses related work on coreference resolution, the methodology used including an MT-based stemmer and coreference resolution process. It discusses challenges of Arabic information extraction due to its complex morphology. It also covers annotated Arabic corpora and tools used for annotation. In conclusion, it summarizes information on coreference resolution in Arabic.
J.C. Ian Catford was a Scottish linguist born in 1917 who studied translation. He defined translation as substituting text in one language for an equivalent text in another language. Catford viewed translation as unidirectional, from the source language to the target language. He examined issues like extent, meaning, language levels, and finding equivalents in the target language. Catford differentiated between full translation, where every part of the source text is replaced, and partial translation, where some parts may be left untranslated.
This document discusses discourse analysis and discourse relations in Arabic texts. It defines discourse as written or spoken language and explains that texts have cohesive devices that link words, clauses, and sentences. There are two types of discourse relations: explicitly signaled relations indicated by connectives, and implicitly inferred relations without signaling. The document also describes different types of discourse connectives in Arabic and presents an annotation scheme for connectives and their arguments. It discusses related work on Arabic corpora and efforts to collect and analyze Arabic connectives. Finally, it introduces Rhetorical Structure Theory as one of the most popular theories for representing discourse structure.
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMERijnlc
The document describes a deep BiGRU-CNN-CRF neural network model that jointly performs word segmentation, stemming, and named entity recognition for the Myanmar language. The model uses character-level and syllable-level representations as input. It was trained on a manually annotated Myanmar corpus. Evaluation results showed the neural sequence labeling architecture achieved state-of-the-art performance for the joint tasks of word segmentation, stemming, and named entity recognition in Myanmar text.
USING AUTOMATED LEXICAL RESOURCES IN ARABIC SENTENCE SUBJECTIVITYijaia
A common point in almost any work on sentiment analysis is the need to identify which elements of language (words) contribute to expressing subjectivity in text. A collection of these elements (sentiment words) with their polarities (positive/negative), independent of context, is called a sentiment lexical resource or subjective lexicon. In this paper, we investigate a method for generating a Sentiment Arabic Lexical Semantic Database using a lexicon-based approach. We also study the effect of each word's prior polarity, drawn from our Sentiment Arabic Lexical Semantic Database, on sentence-level subjectivity using multiple machine learning algorithms. The experiments were conducted on an MPQA corpus containing subjective and objective sentences of the Arabic language, and we were able to achieve 76.1% classification accuracy.
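As a simplified sketch of how prior-polarity lexicon entries can drive sentence-level subjectivity decisions (the tiny lexicon and threshold rule below are illustrative assumptions; the paper feeds such counts to machine learning classifiers rather than a fixed rule):

```python
def subjectivity_features(tokens, lexicon):
    """Count prior-polarity hits from a sentiment lexicon in one sentence."""
    n_pos = sum(1 for t in tokens if lexicon.get(t) == "pos")
    n_neg = sum(1 for t in tokens if lexicon.get(t) == "neg")
    return n_pos, n_neg

def classify_subjectivity(tokens, lexicon, threshold=1):
    """Baseline rule of thumb: a sentence with enough lexicon hits is
    called subjective; with no hits it is likely objective."""
    n_pos, n_neg = subjectivity_features(tokens, lexicon)
    return "subjective" if n_pos + n_neg >= threshold else "objective"
```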
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICONkevig
The abundance of lexical resources greatly accelerates and simplifies sentiment analysis in English. In Arabic, there are few resources, and these resources are not comprehensive. Most current research efforts for constructing an Arabic Sentiment Lexicon (ASL) depend on a large number of lexical entities. However, coverage of Arabic sentiment expressions can be achieved using refined regular expressions rather than a large number of lexical entities. This paper presents an ASL that is more comprehensive than existing lexicons, covering many expressions in different dialects including Franco-Arabic, while at the same time being more compact. The paper also shows how to integrate different lexicons and refine them. To enrich the lexical entries with robust morphological and syntactical information, regular expressions, sentiment polarity weights and n-gram terms have been added to each entry.
USING OBJECTIVE WORDS IN THE REVIEWS TO IMPROVE THE COLLOQUIAL ARABIC SENTIME...ijnlc
One of the main difficulties in sentiment analysis of the Arabic language is the presence of colloquialism. In this paper, we examine the effect of using objective words in conjunction with sentimental words on sentiment classification of colloquial Arabic reviews, specifically Jordanian colloquial reviews. Reviews often include both sentimental and objective words; however, most existing sentiment analysis models ignore the objective words, considering them useless. In this work, we created two lexicons: the first includes colloquial sentimental words and compound phrases, while the other contains objective words associated with sentiment-tendency values derived by a particular estimation method. We used these lexicons to extract sentiment features that serve as training input to Support Vector Machines (SVM) for classifying the sentiment polarity of the reviews. The reviews dataset was collected manually from the JEERAN website. The experimental results show that the proposed approach improves polarity classification in comparison to two baseline models, with an accuracy of 95.6%.
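A minimal sketch of how the two lexicons could be turned into features for the SVM (the lexicon contents and feature layout are assumptions for illustration, not the paper's exact feature set):

```python
def sentiment_feature_vector(tokens, senti_lex, obj_lex):
    """Three features per review: counts of colloquial positive and
    negative sentiment words, plus the summed sentiment tendency of
    the objective words appearing in the review."""
    pos = sum(1 for t in tokens if senti_lex.get(t) == "pos")
    neg = sum(1 for t in tokens if senti_lex.get(t) == "neg")
    tendency = sum(obj_lex.get(t, 0.0) for t in tokens)
    return [pos, neg, tendency]
```

Vectors like these, paired with gold polarity labels, would then be passed to an SVM trainer (e.g. scikit-learn's `SVC`).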
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...ijnlc
This article introduces a methodology for analyzing sentiment in Arabic text using a foreign lexical source. Our method leverages a resource available in another language, such as SentiWordNet in English, for the resource-limited Arabic language. The knowledge taken from the external resource is injected into the feature model while the machine-learning-based classifier is trained. The first step of our method is to build the bag-of-words (BOW) model of the Arabic text. The second step calculates polarity scores using machine translation and the English SentiWordNet. The scores for each text are added to the model as three values for objective, positive, and negative. The last step of our method involves training the ML classifier on that model to predict the sentiment of the Arabic text. Our method improves performance over the BOW baseline in most cases. In addition, it appears to be a viable approach to sentiment analysis of Arabic text, where available resources are limited.
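The steps above can be sketched as follows, with a stand-in `translate` lookup in place of a real machine translation call and made-up SentiWordNet-style score triples:

```python
from collections import Counter

def augment_bow(tokens, translate, swn_scores):
    """Bag-of-words counts plus three injected score features
    (objective, positive, negative) obtained by translating each
    token and looking it up in an English-side score table."""
    features = dict(Counter(tokens))  # plain BOW term counts
    obj = pos = neg = 0.0
    for tok in tokens:
        # translate() stands in for an MT lookup; unknown words score 0
        o, p, n = swn_scores.get(translate(tok), (0.0, 0.0, 0.0))
        obj += o
        pos += p
        neg += n
    # reserved keys, assumed not to collide with real tokens
    features.update({"_OBJ": obj, "_POS": pos, "_NEG": neg})
    return features
```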
Arabic words stemming approach using arabic wordnetIJDKP
The strong growth of Arabic internet content in recent years has raised the need for effective stemming techniques for the Arabic language. Arabic stemming algorithms can be grouped into three categories: root-based approaches (e.g. Khoja), stem-based approaches (e.g. Larkey), and statistical approaches (e.g. N-gram). However, no stemmer for this language is perfect: the existing stemmers have low efficiency. In this paper, we introduce a new stemming technique for Arabic words that also solves the problem of the plural form of irregular nouns in Arabic, known as the broken plural. The proposed stem extractor provides very accurate results in comparison with other algorithms. Consequently, search effectiveness is improved.
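One pragmatic way to combine affix stripping with broken-plural handling is a lookup table consulted before stemming. The affix lists and table entries below are a small illustrative subset, not the paper's algorithm:

```python
# Illustrative (not exhaustive) affix lists, longest first.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ية", "ه"]
# Broken plurals are irregular, so a lookup table is one pragmatic fix.
BROKEN_PLURALS = {"كتب": "كتاب", "رجال": "رجل"}  # plural -> singular

def light_stem(word):
    """Check the broken-plural table first, then strip at most one
    prefix and one suffix, keeping at least three letters of stem."""
    if word in BROKEN_PLURALS:
        return BROKEN_PLURALS[word]
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word
```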
This document discusses corpus linguistics and quantitative research design. It defines a corpus as a large collection of texts used for linguistic analysis. Corpus linguistics allows researchers to empirically test hypotheses about language patterns and features based on large amounts of real-world data. Quantitative analysis of corpus data shows how frequently certain words, constructions, and patterns are used. Specialized corpora can focus on particular text types, languages, or learner language. Various software tools are used to analyze corpora through frequency lists, keyword lists, collocation analysis, and other methods.
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEMkevig
This paper proposes an improved morphological analyser for the Arabic pronominal system using the finite-state method. The main advantage of the finite-state method is that it is very flexible, powerful and efficient. The most important results about FSAs relate the class of languages generated by finite-state automata to certain closure properties; these results make the theory of finite-state automata a very versatile and descriptive framework. The main contribution of this work is the full analysis and representation of the morphological analysis of all inflections of pronoun forms in Arabic. We build a finite-state network for the inflectional forms of root words, covering all the inflections and grammatical properties needed to generate the dependent and independent forms of pronouns in the Arabic language. The results show high accuracy in the output with all the needed linguistic features; the output was evaluated with the f-score test, achieving a rate of 80% to 83%. The results also provide evidence that Arabic has strongly concatenative word formation.
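The idea of mapping pronoun inflections to grammatical features can be sketched with a toy suffix analyzer (three third-person dependent pronouns only, for illustration; the paper's finite-state network covers the full pronominal system):

```python
# Toy suffix "transducer": attached pronoun -> grammatical features.
SUFFIX_PRONOUNS = {
    "هم": {"person": 3, "gender": "m", "number": "pl"},
    "ها": {"person": 3, "gender": "f", "number": "sg"},
    "ه": {"person": 3, "gender": "m", "number": "sg"},
}

def analyze_pronoun(word):
    """Strip a pronoun suffix (longest first) and return the pair
    (stem, features), or None if no dependent pronoun is attached."""
    for suffix, feats in SUFFIX_PRONOUNS.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)], feats
    return None
```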
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
In recent decades, speech interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large corpus of speech. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach to speech. This approach requires a large speech corpus with sentence- or phoneme-level transcription of the speech utterances. The transcriptions must also cover diverse speech so that the recognizer can build models for all the sounds present. But for the Telugu language, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of the grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu; phrases comprising several words (that is, tokens) in English map onto a single word in Telugu. The Telugu language is phonetic in nature in addition to being rich in morphology. That is why speech technology developed for English cannot be applied directly to the Telugu language. This paper highlights the work carried out in an attempt to build a voice-enabled text editor with automatic term suggestion. The main claim of the paper is the recognition enhancement process we developed for highly inflecting, morphologically rich languages. This method increases speech recognition accuracy with a substantial reduction in corpus size. It also adds Telugu words to the database dynamically, resulting in growth of the corpus.
Arabic is the most widely spoken language in the Semitic language group, and one of the most common languages in the world, spoken by more than 422 million people. It is also of paramount importance to Muslims: it is the sacred language of the Islamic Holy Book (the Quran), and prayer (and other acts of worship) in Islam is performed only by mastering some Arabic words. Arabic is also a major ritual language of a number of Christian churches in the Arab world, and it was used in writing several intellectual and religious Jewish books in the Middle Ages. Despite this, there is no semantic Arabic lexicon on which researchers can depend. In this paper we introduce Azhary, a lexical ontology for the Arabic language. It groups Arabic words into sets of synonyms called synsets, and records a number of relationships between words such as synonym, antonym, hypernym, hyponym, meronym, holonym and association relations. The ontology contains 26,195 words organized in 13,328 synsets. It has been developed and contrasted against AWN, the most common available Arabic lexical ontology.
Adopting Quadrilateral Arabic Roots in Search Engine of E-library Systempaperpublications3
Abstract: Information retrieval is the process of retrieving information according to user needs. The E-library is an attractive means of study and education because it includes a huge amount of information, stored in a special database or extracted from a corpus of documents. The E-library is part of an information retrieval system; it provides methods to obtain information and increase knowledge. However, there are inadequacies in Arabic-term search in the library, and these can be addressed by enhancing or adopting algorithms that make Arabic-language search more efficient and easier. In this work, an algorithm for quadrilateral Arabic word roots has been adopted for the search engine of the E-library and integrated with the New Approach for Extracting Quadrilateral Arabic Roots and the Pattern-based Stemmer for Finding Arabic Roots. According to the analysis comparing ordinary search and quad search, the 95% Confidence Interval of the Difference lies between 1.34 and 1.66 for ordinary search, versus between 1.45 and 1.95 for quad search. Thus, the quadrilateral Arabic root algorithm is more effective than ordinary search according to the analysis results.
Word sense disambiguation using wsd specific wordnet of polysemy wordsijnlc
This paper presents a new model of WordNet that is used to disambiguate the correct sense of polysemy
word based on the clue words. The related words for each sense of a polysemy word as well as single sense
word are referred to as the clue words. The conventional WordNet organizes nouns, verbs, adjectives and
adverbs together into sets of synonyms called synsets each expressing a different concept. In contrast to the
structure of WordNet, we developed a new model of WordNet that organizes the different senses of
polysemy words as well as the single sense words based on the clue words. These clue words for each sense
of a polysemy word as well as for single sense word are used to disambiguate the correct meaning of the
polysemy word in the given context using knowledge based Word Sense Disambiguation (WSD) algorithms.
The clue word can be a noun, verb, adjective or adverb.
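A minimal sketch of clue-word-based disambiguation, in the spirit of overlap-based knowledge-driven WSD (the sense ids and clue sets below are invented for illustration):

```python
def disambiguate(word, context, clue_words):
    """Pick the sense of `word` whose clue-word set overlaps most
    with the surrounding context words."""
    best_sense, best_overlap = None, -1
    for sense, clues in clue_words.items():
        overlap = len(clues & set(context))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```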
Stemming is the process of reducing words to their stems or roots. Due to the morphological richness and
complexity of the Arabic language, stemming is an essential part of most Natural Language Processing
(NLP) tasks for this language. In this paper, we study the impact of different stemming approaches on the
Named Entity Recognition (NER) task for Arabic and explore the merits, limitations and differences
between light stemming and root-extraction methods. Our experiments are evaluated on the standard
ANERCorp dataset as well as the AQMAR Arabic Wikipedia Named Entity Corpus.
COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMSIJMIT JOURNAL
In the context of Information Retrieval, Arabic stemming algorithms have become a major research area. Many researchers have developed algorithms to solve the stemming problem. Each researcher proposed his own methodology and measurements to test the performance and compute the accuracy of his algorithm; thus, nobody can make accurate comparisons between these algorithms. Many generic conflation techniques and stemming algorithms are theoretically analyzed in this paper. Then, the main Arabic language characteristics that need to be mentioned before discussing Arabic stemmers are summarized. The evaluation in this paper shows that Arabic stemming is still one of the major information retrieval challenges. This paper aims to compare most of the commonly used light stemmers in terms of affix lists, algorithms, main ideas, and information retrieval performance. The results show that the light10 stemmer outperformed the other stemmers. Finally, recommendations for future research regarding the development of a standard Arabic stemmer are presented.
Hybrid approaches for automatic vowelization of arabic textsijnlc
Hybrid approaches for automatic vowelization of Arabic texts are presented in this article. The process is
made up of two modules. In the first one, a morphological analysis of the text words is performed using the
open source morphological Analyzer AlKhalil Morpho Sys. Outputs for each word analyzed out of context,
are its different possible vowelizations. The integration of this Analyzer in our vowelization system required
the addition of a lexical database containing the most frequent words in Arabic language. Using a
statistical approach based on two hidden Markov models (HMM), the second module aims to eliminate the
ambiguities. Indeed, for the first HMM, the unvowelized Arabic words are the observed states and the
vowelized words are the hidden states. The observed states of the second HMM are identical to those of the
first, but the hidden states are the lists of possible diacritics of the word without its Arabic letters. Our
system uses Viterbi algorithm to select the optimal path among the solutions proposed by Al Khalil Morpho
Sys. Our approach opens an important way to improve the performance of automatic vowelization of
Arabic texts for other uses in automatic natural language processing.
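The Viterbi selection step described above can be sketched generically as follows, with unvowelized words as observations and candidate vowelizations as hidden states (the probability tables in the test are invented toy values, not trained HMM parameters):

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden-state sequence for the observations
    (here: vowelized forms for a sequence of unvowelized words)."""
    # Probability of the best path ending in each state, per step.
    V = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    path = {s: [s] for s in states}
    for obs in observations[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # Best predecessor for state s at this step.
            prob, prev = max(
                (V[-2][p] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(V[-1], key=V[-1].get)
    return path[best]
```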
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...gerogepatton
Many automatic translation works have addressed major European language pairs by taking advantage of large-scale parallel corpora, but very few research works have been conducted on the Amharic-Arabic language pair due to its parallel data scarcity, and there is no benchmark parallel Amharic-Arabic text corpus available for the machine translation task. Therefore, a small parallel Quranic text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent Amharic translation available on Tanzile. Experiments are carried out on two Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) based Neural Machine Translation (NMT) models using an attention-based encoder-decoder architecture adapted from the open-source OpenNMT system. The LSTM- and GRU-based NMT models and the Google Translation system are compared, and the LSTM-based OpenNMT is found to outperform the GRU-based OpenNMT and the Google Translation system, with BLEU scores of 12%, 11%, and 6% respectively.
Natural language processing with python and amharic syntax parse tree by dani...Daniel Adenew
Natural Language Processing is an interdisciplinary field that brings human-like communication capabilities to the computing world. The Amharic language has seen much improvement over time thanks to researchers at PhD and MSc level at AAU. Here, I have tried to study and come up with a limited-scope solution that performs syntax parsing for the Amharic language and draws syntax parse trees using Python.
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...ijnlc
Building
dialogues systems
interaction
has recently gained considerable
attention, but most of the
resourc
es and systems built so far are
tailored to
English and other Indo
-
European languages. The need
for designing
systems for
other languages is increasing such as Arabic language.
For this reasons, there
are more int
erest for Arabic dialogue acts classification
task because it
a key player in Arabic language
under
standing
to
bu
ilding this systems
.
This paper surveys
different techniques
for dialogue acts classification
for Arabic.
W
e describe the
main existing techniques for utterances segmentations and
classification, annotation schemas, and
test corpora for Arabic
dialogues understanding
that have introduced
in the literature
Similar to Using automated lexical resources in arabic sentence subjectivity (20)
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Using automated lexical resources in arabic sentence subjectivity
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 5, No. 6, November 2014
USING AUTOMATED LEXICAL RESOURCES IN
ARABIC SENTENCE SUBJECTIVITY
Hanaa Mobarz1, Mohsen Rashown 2, Ibrahim Farag 3
1,3Department of Computer Science, Faculty of Computers and Information, Cairo
University
2Department of Electronics and Communications, Faculty of Engineering, Cairo
University
ABSTRACT
A common point in almost any work on sentiment analysis is the need to identify which elements of
language (words) contribute to expressing subjectivity in text. A collection of these elements (sentiment
words), together with their context-independent polarities (positive/negative), is called a sentiment lexical resource
or subjective lexicon. In this paper, we investigate a method for generating a Sentiment Arabic Lexical
Semantic Database using a lexicon-based approach. We also study the prior polarity effects of each word,
using our Sentiment Arabic Lexical Semantic Database, on sentence-level subjectivity with multiple
machine learning algorithms. The experiments were conducted on an MPQA corpus containing subjective and
objective sentences of the Arabic language, and we were able to achieve 76.1% classification accuracy.
KEYWORDS
Sentiment analysis, Lexical resources, Opinion mining, Subjectivity lexicon, Arabic opinion mining.
1. INTRODUCTION
Subjective text expresses opinions, emotions, sentiments, and beliefs, while objective text generally
reports facts. The task of distinguishing subjective from objective text is therefore useful for many
natural language processing applications, such as mining opinions from product reviews [1], summarizing
different opinions [2], and question answering [3].
Sentiment analysis or opinion mining refers to the general method to extract subjectivity and
polarity from text, and semantic orientation refers to the polarity and strength of words, phrases,
or texts. Our concern is primarily with the semantic orientation of texts, but we extract the
sentiment of words and phrases towards that goal [4].
Lexical resources for sentiment analysis are collections of words that express subjectivity in
text. In the research literature, these words are also known as sentiment words, opinion words,
opinion-bearing words, and polar words. There are two types of opinion words: positive opinion
words, which are used to express desired states (for example, "beautiful", "wonderful", "good", and
"love"), and negative opinion words, which are used to express undesired states (for example,
"terrible", "bad", "hate", and "ugly").
There are three main approaches for generating subjective lexicons: the manual approach, the dictionary-based
approach, and the corpus-based approach. The manual approach is very time consuming ([5], [6],
[7], and [8]) and thus is not usually used alone, but is combined with automated approaches as a
final check, because automated methods make mistakes. The manual approach also lacks
neutrality, in that classification decisions may be affected by culture, religion, ethics, etc.
DOI : 10.5121/ijaia.2014.5601
In the dictionary-based approach, one of the simplest techniques is bootstrapping
using a small set of seed opinion words and a dictionary, e.g., WordNet [9]. The
strategy is to first collect a small set of opinion words manually with known orientations, and then
to grow this set by searching WordNet for their synonyms and antonyms. The newly found
words are added to the seed list, and the next iteration starts; the iterative process stops when no
more new words are found. This approach is unable to find opinion words with domain-specific
orientations. In the corpus-based approach, the methods depend on a seed list of opinion words,
together with syntactic or co-occurrence patterns, to find other opinion words in a large corpus.
Using the corpus-based approach alone to identify all opinion words is not as effective as the dictionary-based
approach, because it is hard to prepare a corpus large enough to cover all the words of a language. However,
this approach has a major advantage that the dictionary-based approach does not have: it can help
find domain-specific opinion words and their orientations if a corpus from only the specific
domain is used in the discovery process [10].
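The dictionary-based bootstrapping loop described above can be sketched as follows. The tiny synonym/antonym tables are hypothetical stand-ins for a real lexical database such as WordNet, not data from the paper: synonyms inherit the seed's orientation, antonyms flip it, and the loop stops when no new words are found.

```python
# Hypothetical toy dictionaries standing in for a real lexical database.
SYNONYMS = {
    "good": ["nice", "fine"],
    "nice": ["pleasant"],
    "bad": ["poor", "nasty"],
}
ANTONYMS = {
    "good": ["bad"],
    "bad": ["good"],
}

def bootstrap(seeds):
    """Grow a seed set of opinion words until no new words are found."""
    known = dict(seeds)        # word -> polarity (+1 positive, -1 negative)
    frontier = list(seeds)
    while frontier:
        word = frontier.pop()
        pol = known[word]
        # Synonyms keep the same orientation; antonyms flip it.
        for syn in SYNONYMS.get(word, []):
            if syn not in known:
                known[syn] = pol
                frontier.append(syn)
        for ant in ANTONYMS.get(word, []):
            if ant not in known:
                known[ant] = -pol
                frontier.append(ant)
    return known

lexicon = bootstrap({"good": +1})
# "nice", "fine", "pleasant" inherit +1; "bad", "poor", "nasty" get -1
```

The same loop illustrates the approach's stated limitation: any orientation it assigns is inherited from the general-purpose dictionary, so domain-specific orientations cannot be discovered.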
Numerous studies are available on building sentiment lexicons for English and other
languages, but very few such resources exist for Arabic. Several factors have contributed
to the scarcity of research on constructing Arabic lexical resources for sentiment analysis:
• The lack of an available Arabic semantic database is still considered one of the most
difficult problems facing researchers who try to generate an Arabic subjective lexicon.
• Arabic is a highly inflectional and derivational language; a single word often has more than
one affix, so it may be expressed as a combination of prefix(es), lemma (stem), and
suffix(es). The prefixes are articles, prepositions, or conjunctions; the suffixes are
generally objects or personal/possessive anaphora.
• The same three-letter root can give rise to different words with different meanings and
different orientations. For example, one root yields both (beautiful, positive) and
(camel, objective); another yields both (easy, positive) and (plain, objective); a third
yields both (modesty, positive) and (rascaldom, negative).
• Arabic letters can be written with different shapes according to their position in the word,
e.g., Alif has four forms (ا, أ, إ, آ), Yaa has two forms (ي, ى), and taa marbuta (ة) can be
confused with haa (ه).
The main contribution of this work is the development of an Arabic lexical resource for sentiment
analysis by exploiting the relations available in the Arabic lexical semantic database. The
database archives approximately 150,000 Arabic words, 18,413 semantic fields, and 20 semantic
relations, such as synonymy, antonymy, hyponymy, and causality. In addition, we study the prior
polarity effects of each word, using our Sentiment Arabic lexical resource, on sentence-level
subjectivity with multiple machine learning algorithms.
The paper is organized as follows: Section 2 reviews related work on subjective lexicons and
sentence-level subjectivity classification. Section 3 gives an overview of the Arabic Lexical
Semantic Database (RDI), which is part of the Research and Development International (RDI)
toolkit and the main resource for generating our subjective lexicon. Section 4 presents the
proposed algorithm for generating the Sentiment Arabic lexical semantic database (SentiRDI)
and how the term orientation and subjectivity are determined. Section 5 describes the corpus
used in the sentence subjectivity classifier. Section 6 covers the tools used in classification and the
experimental results. Finally, Section 7 draws our conclusions and future work.
2. RELATED WORK
In this section, we present the most notable research on building English sentiment
lexicons and previous attempts to build Arabic sentiment lexicons. We also present
earlier research on subjectivity analysis applied to English and Arabic.
Lexical resources for opinion mining are used to determine term orientation (subjective/objective)
and term polarity (positive/negative) out of context, using the cognitive knowledge of
human beings; such resources are called prior polarity lexicons. Various previous works ([11]; [12]; [13]) have
already proposed techniques for building dictionaries of such sentiment words. Several sentiment
lexicons are available for English, such as SentiWordNet [13], the Subjectivity Word List [14],
the WordNet Affect list [15], and Taboada's adjective list [16]. SentiWordNet is the most widely used in
applications such as sentiment analysis, opinion mining, and emotion analysis.
SentiWordNet is an automatically constructed lexical resource for English that assigns a
positivity score and a negativity score to each WordNet synset [17].
For Arabic sentiment lexicons there have been few attempts. An Arabic lexicon was created from a
similarity graph (a type of graph in which two words or phrases share an edge if they are similar
in polarity or meaning) in [18]. That lexicon consists of two columns: the first is the word
and the second is the word's score. An Arabic sentiment lexicon was created
in [19] that contained strong as well as weak subjective clues, by manually translating the MPQA
lexicon [14].
A machine translation procedure was used in [20] to translate available English lexicons, including
SentiWordNet [13], which is the most famous and most widely used English polarity lexicon
[21], into Arabic. They retrieved 229,452 entries, including expressions commonly used in social
media. The authors reported problems both with coverage and with the quality of some of
the entries. They also stated that they had not tested the system on the task of sentiment analysis.
An Egyptian dialect sentiment lexicon was created in [22]. The researchers identified a set of
lexico-syntactic patterns indicative of subjectivity, used a seed list of 380 manually constructed
words, and subsequently performed pattern matching on a data set collected from Twitter.
Incorrectly learned candidate terms were manually filtered. They retrieved 4,392 entries (193
compound negative, 83 compound positive, 3,344 negative, and 772 positive). The work
addressed dialectal and slang terms of the Egyptian dialect, which makes it unsuitable for
other dialects.
An Arabic sentiment lexicon that assigns sentiment scores to the words found in the Arabic WordNet
(the Arabic version of WordNet) was created in [23]. The authors used a lexicon-based approach:
starting from a small seed list of 20 words (10 positive and 10 negative), they used
semi-supervised learning to propagate the scores through the Arabic WordNet by exploiting the
synset relations. The algorithm assigned a positive sentiment score to more than 800 words, a
negative score to more than 600, and a neutral score to more than 6,000 words in the Arabic
WordNet. The lexicon was evaluated by incorporating it into a machine learning-based classifier.
Sentence-level classification considers each sentence as a separate unit and assumes that a
sentence contains only one opinion. Sentence-level sentiment analysis has two tasks:
subjectivity classification and sentiment classification.

Earlier research on subjectivity analysis was applied only to English, and mostly at the
document and word levels. Some methods that concentrated on the sentence level ([24], [25], [26],
[27], [28], [29], [30], [31], and [32]) explored the effects of adjective orientation and
gradability on sentence subjectivity. The authors in [29] judge sentence subjectivity based on the
adjectives used in the sentence. The idea was extended in [1] by using the polarity of the previous
sentence to break classification ties.
Bootstrapping, another approach used in [27], is considered the state of the art in sentence
subjectivity detection. The idea is to use a high-precision classifier to automatically label a large
un-annotated data set. The labelled data is then given to a supervised pattern learning algorithm to
generate a set of linguistically rich extraction patterns. The learned patterns are used to identify
more subjective statements, and the bootstrapping process continues. A similar method to
automatically construct a corpus of HTML documents with polarity labels was used in [30].
Similar work involving self-training is described in [33] and [34]. A comprehensive survey of
subjectivity recognition using different clues and features was presented in [35].
For Arabic, there are few works addressing the subjectivity classification problem at the sentence
level ([18], [35]). The authors in [18] tackle the subjectivity analysis task at the document, phrase,
and word levels and also provide the polarity of mined sentences; they use a lexicon-based
similarity graph algorithm to obtain sentiment polarities. A two-stage classification approach for
sentence-level subjectivity and polarity classification was adopted in [35]; the authors use an SVM
classifier with a linear kernel for these tasks.
3. ARABIC LEXICAL SEMANTIC DATABASE (RDI)
The Arabic Lexical Semantic database (RDI-ArabSemanticDB) is part of the Research and
Development International (RDI) toolkit, which mainly consists of the Arabic RDI-ArabMorpho-POS
tagger [37] and the RDI-ArabSemanticDB tool [38]. The Arabic lexical semantic database of semantic
fields [38] is based on a number of important references (a glossary of Arabic verb conjugation,
the intermediary dictionary, and the great thesaurus dictionary). It applies 20 semantic relations
between fields (synonym, antonym, part-of, kind-of, causality, hyponym, etc.) and a morphological
analysis of each word into its type, prefix, form, root, and suffix. Around 2,000 general concepts
are managed under 18,413 semantic fields and 150,000 stem-based words (multiplied by the given
prefixes and suffixes). The database can be flexibly updated for specific domains such as news and
broadcasting. Some examples of semantic relations in RDI-ArabSemanticDB are given in Figure 1.
The Arabic Lexical Semantics Database RDI tool can perform four functions [38]: first, retrieving
the Arabic words that fall under a certain semantic field belonging to a previously known category
of semantic fields (forward placement); second, attributing any word to one or more of the semantic
fields referred to above (posterior placement); third, retrieving the semantic relations between any
pair of semantic fields; fourth, retrieving the semantic fields associated by a particular semantic
relation with any given semantic field.
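The four lookup functions can be sketched as a small query interface over an in-memory store. This is a hypothetical illustration of the database's role in our pipeline, not the RDI tool's actual API; the class name, method names, and toy data are all assumptions.

```python
# Hypothetical sketch of the four RDI lookup functions over toy data.
class SemanticDB:
    def __init__(self, field_words, relations):
        self.field_words = field_words   # semantic field -> list of words
        self.relations = relations       # (field, relation) -> related fields

    def words_of_field(self, field):
        """1) Retrieve the words that fall under a semantic field."""
        return self.field_words.get(field, [])

    def fields_of_word(self, word):
        """2) Attribute a word to one or more semantic fields."""
        return [f for f, ws in self.field_words.items() if word in ws]

    def relations_between(self, f1, f2):
        """3) Retrieve the semantic relations linking two fields."""
        return [r for (f, r), targets in self.relations.items()
                if f == f1 and f2 in targets]

    def related_fields(self, field, relation):
        """4) Retrieve the fields related to a field by a given relation."""
        return self.relations.get((field, relation), [])

db = SemanticDB(
    {"terrorism": ["terrorism", "extremism"], "peace": ["peace"]},
    {("terrorism", "antonym"): ["peace"]},
)
```

Function 2 (posterior placement) is the one the subjectivity classifier in Section 5 relies on: given an input word, it returns the semantic fields whose pre-computed polarity can be looked up.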
Synonym relation: searching for the word "terrorism" with the synonym relation selected returns,
as search results, all positions that contain the word terrorism or any of its synonyms
(extremism, etc.).

Inclusion relation: searching for the word "terrorism" with the inclusion relation selected returns,
as search results, the topics it covers (bombing, destruction, prohibition, etc.).

Figure1. Examples of semantic relations
4. BUILDING AN ARABIC SUBJECTIVE LEXICON (SENTIRDI)
SentiRDI is a Sentiment Arabic lexical semantic database (RDI) which determines both
subjectivity and orientation for all 18,413 semantic fields and all 150,000 words
covered by the Arabic Lexical Semantic database. The algorithm determines both subjectivity and
orientation for each semantic field, starting from chosen seed sets. Initially, we used
the seed list defined by Turney and Littman [39]. The seed list contained 14 words {good,
nice, excellent, positive, fortunate, correct, superior, bad, nasty, poor, negative, unfortunate,
wrong, inferior}. We translated them into Arabic and used the 14 translated words in the initial
run, but they could not reach all the semantic fields. The seed sets were extended by randomly
choosing new words from the semantic fields that were unreachable by the previous seed sets and
adding these words to the sets. We also added an objective seed set in order to reach all
semantic fields in the Arabic lexical semantic database. Figure 2 presents the three seed sets of our
algorithm (positive, negative, and objective).
Positive Seed Set (S_pos): [familiarity, humility, courage, magnanimity, generosity,
quality, tender, laugh, beauty, honesty, easy, intelligence, diligence, polite, hope]

Negative Seed Set (S_neg): [fear, poverty, rebellion, pessimism, weak, corruption,
sadness, hatred, greed, crying, beg, guilt, ugliness, ignorance, chagrin]

Objective Seed Set (S_obj): [machine, earth, man, sound]

Figure2. Positive, negative, and objective seed sets
The expansion algorithm is divided into two parts: the first expands the objective seed
set (Figure 3); the second expands the positive and negative seed sets (Figure 4).
Function: Expand_Objective_Seedset
Input:
- S_obj: objective seed set
- Obj_Rel: {Totality, Part_of, Inclusion_K, KindOf, Circumstantial_Place, Locality_Place,
  Circumstantial_Time, Locality_Time}
- Q: an object containing the Lexical Semantic database RDI
- Current_level: number of the current iteration; initialized to 0
Output:
- Obj_list: list of objective semantic fields
Begin
  For each Semantic field in S_obj
    Add Semantic field to obj_queue                  // initialize obj_queue
  While (obj_queue is not empty)
    Current_level ← Current_level + 1
    Cur_node_obj ← remove the top node from obj_queue
    If Cur_node_obj not found in Obj_list            // ensure the node has not been visited
      Cur_node_obj.level ← Current_level
      Cur_node_obj.repetition ← 1
      Add Cur_node_obj to Obj_list
    Else
      Cur_node_obj.repetition ← Cur_node_obj.repetition + 1
    For each R in Obj_Rel, search Q with Cur_node_obj to get Related_Nodes
      For each N in Related_Nodes
        If (N not found in obj_queue AND N not found in Obj_list)
          Add N to obj_queue
End
Figure3. Expansion algorithm of objective seed set
Figure 3 explains the procedure Expand_Objective_Seedset. The procedure takes the objective seed
set (Figure 2), the set of semantic relations needed to expand it, the Lexical
Semantic database RDI, and the Current_level variable, which is initialized to zero and incremented
by one in each iteration. The queue is initialized with the semantic fields in the objective seed set.
Before putting any semantic field in Obj_list, we ensure that it was not added before, set its level
to the value of Current_level, and set its repetition value to one. If the semantic field was already
in Obj_list, its repetition value is incremented by one. All objective relations are applied to each
new semantic field in order to expand the set. The function stops when no new semantic
fields are added (the queue is empty).
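The procedure is essentially a breadth-first traversal of the relation graph. A minimal Python rendering of it follows; the relation entries below are illustrative toy data, not taken from the actual RDI database.

```python
from collections import deque

# Toy relation graph: (semantic_field, relation) -> related fields.
OBJ_REL = ["part_of", "kind_of"]
RELATIONS = {
    ("machine", "kind_of"): ["engine", "computer"],
    ("engine", "part_of"): ["piston"],
}

def expand_objective(seed_fields):
    """BFS expansion of the objective seed set, as in Figure 3."""
    queue = deque(seed_fields)            # initialize obj_queue
    obj_list = {}                         # field -> {"level", "repetition"}
    level = 0
    while queue:
        level += 1
        node = queue.popleft()
        if node not in obj_list:          # first visit
            obj_list[node] = {"level": level, "repetition": 1}
        else:
            obj_list[node]["repetition"] += 1
        for rel in OBJ_REL:               # apply every objective relation
            for related in RELATIONS.get((node, rel), []):
                if related not in queue and related not in obj_list:
                    queue.append(related)
    return obj_list

fields = expand_objective(["machine"])
# "machine", its kinds "engine"/"computer", and engine's part "piston"
# all end up marked objective
```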
Figure 4 explains the procedure Expand_Pos_Neg_Seedset. The procedure takes the positive and
negative seed sets (Figure 2), the set of semantic relations needed to expand the two seed sets, the
Lexical Semantic database RDI, and the Current_level variable, which is initialized to zero and
incremented by one in each iteration. This procedure is the same as the previous one, with two
differences: the initialization of two queues, and the types of semantic relations needed to expand
the positive and negative seed sets. The positive and negative queues are initialized with the
synonyms of the words in the positive and negative seed sets. There are two types of relations:
relations under which the resulting semantic fields keep the same orientation as the source field
{Causality, Causative, Hyponymy, Hypernymy}, and relations under which the resulting semantic
fields take the opposite orientation {Antonym}.
Function: Expand_Pos_Neg_Seedset
Input:
- S_pos, S_neg: positive and negative seed sets
- PN_Same_Rel: {Causality, Causative, Hyponymy, Hypernymy}
- PN_Opposite_Rel: {Antonym}
- Q: an object containing the Lexical Semantic database RDI
- Current_level: number of the current iteration; initialized to 0
Output:
- Pos_list: list of positive semantic fields
- Neg_list: list of negative semantic fields
Begin
  For each Word in S_pos: get Synonyms; add Synonyms to pos_queue   // initialize pos_queue
  For each Word in S_neg: get Synonyms; add Synonyms to neg_queue   // initialize neg_queue
  While ((pos_queue is not empty) OR (neg_queue is not empty))
    Current_level ← Current_level + 1
    If pos_queue is not empty
      Cur_node_pos ← remove the top node from pos_queue
      If Cur_node_pos not found in Pos_list        // ensure the node has not been visited
        Cur_node_pos.level ← Current_level
        Cur_node_pos.repetition ← 1
        Add Cur_node_pos to Pos_list
      Else
        Cur_node_pos.repetition ← Cur_node_pos.repetition + 1
      For each R in PN_Same_Rel, search Q with Cur_node_pos to get Related_Nodes
        For each N in Related_Nodes
          If (N not found in pos_queue AND N not found in Pos_list)
            Add N to pos_queue
      For each R in PN_Opposite_Rel, search Q with Cur_node_pos to get Related_Nodes
        For each N in Related_Nodes
          If (N not found in neg_queue AND N not found in Neg_list)
            Add N to neg_queue
    If neg_queue is not empty
      Cur_node_neg ← remove the top node from neg_queue
      If Cur_node_neg not found in Neg_list
        Cur_node_neg.level ← Current_level
        Cur_node_neg.repetition ← 1
        Add Cur_node_neg to Neg_list
      Else
        Cur_node_neg.repetition ← Cur_node_neg.repetition + 1
      For each R in PN_Same_Rel, search Q with Cur_node_neg to get Related_Nodes
        For each N in Related_Nodes
          If (N not found in neg_queue AND N not found in Neg_list)
            Add N to neg_queue
      For each R in PN_Opposite_Rel, search Q with Cur_node_neg to get Related_Nodes
        For each N in Related_Nodes
          If (N not found in pos_queue AND N not found in Pos_list)
            Add N to pos_queue
End
Figure4. Expansion algorithm of positive and negative seed sets
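The core of the two-polarity expansion, same-orientation relations propagating a field's polarity unchanged while the antonym relation flips it into the other queue, can be sketched as follows. The relation tables are illustrative toy data, not entries from the RDI database.

```python
from collections import deque

# Toy relation tables: (field, relation) -> related fields.
SAME_REL = {("joy", "hyponymy"): ["delight"]}
OPPOSITE_REL = {("joy", "antonym"): ["sadness"],
                ("sadness", "antonym"): ["joy"]}

def expand_pos_neg(pos_seeds, neg_seeds):
    """Sketch of Figure 4: propagate polarity, flipping it on antonyms."""
    queues = {"pos": deque(pos_seeds), "neg": deque(neg_seeds)}
    lists = {"pos": set(), "neg": set()}
    while queues["pos"] or queues["neg"]:
        for polarity, other in (("pos", "neg"), ("neg", "pos")):
            if not queues[polarity]:
                continue
            node = queues[polarity].popleft()
            lists[polarity].add(node)
            for (field, _rel), targets in SAME_REL.items():
                if field == node:
                    for t in targets:        # same orientation as source
                        if t not in queues[polarity] and t not in lists[polarity]:
                            queues[polarity].append(t)
            for (field, _rel), targets in OPPOSITE_REL.items():
                if field == node:
                    for t in targets:        # opposite orientation
                        if t not in queues[other] and t not in lists[other]:
                            queues[other].append(t)
    return lists["pos"], lists["neg"]

pos, neg = expand_pos_neg(["joy"], [])
# "delight" (hyponym of joy) joins the positive set;
# "sadness" (antonym of joy) joins the negative set
```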
After more than a hundred trials, we found that a seed set of thirty-four terms (15 positive
terms, 15 negative terms, and 4 objective terms) covers 98.64% of the semantic fields of the
database [40]. The remaining 250 semantic fields were annotated manually. The number of positive
semantic fields is 3,156; negative semantic fields, 4,169; and objective semantic fields, 10,839.
4.1 Experiments analysis
The evaluation of the semantic fields in SentiRDI is based on exact matching between the identified
polarity of each semantic field and its manual annotation according to three classes (positive,
negative, and objective). We used precision, recall, and F-measure for evaluation.

SentiRDI is tested in two ways. In the first method, a subset of Arabic semantic fields is taken as
a "gold standard" and annotated manually; the subset contained 7,612 semantic fields and was
annotated by 5 annotators. In the second method, the Micro-WNOp [41] gold standard that is used to
evaluate SentiWordNet, containing 1,105 synsets, was translated (by Google Translate).
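The per-class precision/recall/F-measure figures reported below can be computed from exact matches against the gold annotation as follows. This is a generic sketch of the standard metrics, not the authors' evaluation code; the toy gold/predicted labels are hypothetical.

```python
def prf(gold, predicted, label):
    """Precision, recall, and F-measure for one class via exact match."""
    tp = sum(1 for f in gold if gold[f] == label and predicted.get(f) == label)
    pred_count = sum(1 for f in predicted if predicted[f] == label)
    gold_count = sum(1 for f in gold if gold[f] == label)
    precision = tp / pred_count if pred_count else 0.0
    recall = tp / gold_count if gold_count else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: one correct positive, one objective correct, one false positive.
gold = {"beauty": "pos", "fear": "neg", "machine": "obj"}
pred = {"beauty": "pos", "fear": "pos", "machine": "obj"}
p, r, f = prf(gold, pred, "pos")   # precision 0.5, recall 1.0
```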
Table 1 shows the results of the two tests explained above. For each test, we
calculate recall, precision, and F-measure over all semantic fields (All_SF), objective semantic
fields (Obj_SF), negative semantic fields (Neg_SF), and positive semantic fields (Pos_SF)
included in the test. The results of the first test are better than those of the second in terms of
recall, precision, and F-measure for all and for objective semantic fields. The results of the second
test are better than those of the first in terms of recall, precision, and F-measure for negative
semantic fields.
Table1. Testing SentiRDI Results

            First Test                        Second Test
            All_SF  Obj_SF  Neg_SF  Pos_SF    All_SF  Obj_SF  Neg_SF  Pos_SF
Recall      87.32   89.30   83.16   86.44     84.67   89.00   85.00   79.43
Precision   87.72   93.09   79.95   74.76     84.95   87.93   82.45   80.00
F-measure   87.52   91.16   81.52   80.18     84.81   88.46   83.70   79.71

From the results, we conclude that direct translation from one language (English) to another
(Arabic) does not give accurate results, owing to the drawbacks of machine translation,
including the loss of the sentiment polarity of some words when they are translated into another
language (the same term may have many meanings with different polarities). Table 2 shows some
terms found in the Micro-WNOp [41] gold standard that have different meanings with different
polarities.

Table2. Samples in the Micro-WNOp gold standard

Word     Positive meanings     Negative meanings    Objective meanings
Ball     a picnic, shindig     -                    a game of ball, football, a bullet
Shark    -                     crook, swindler      shark, sea dog
Humble   humble, meek          lowly, servile       -
Cure     healing               -                    curing fish, salting meat
International Journal of Artificial Intelligence Applications (IJAIA), Vol. 5, No. 6, November 2014
5. SUBJECTIVITY CLASSIFIER
5.1 Polarity determination of an Arabic word using SentiRDI
In the simplest case, an input Arabic word (from text) exactly matches one of the pre-defined
polarities of the lexicon's semantic fields. In most cases, however, the input word is
derivational or falls under a certain semantic field, so the RDI-ArabMorpho-POS tagger and
the RDI-ArabSemanticDB tool are used to obtain the stem, root, and synonyms (one or more semantic
fields) of the word in order to find the corresponding pre-defined polarity of the lexicon's semantic
fields. If the input word is not found, we classify it manually and add it to our lexicon.
Function: determining the polarity of an Arabic word
Input:
- Word
- Subjective lexicon (SentiRDI)
Output:
- Pol_word: word polarity (positive, negative, objective)
Begin
1- The inputs are the word, its stem and its root (the stem and root are obtained using the RDI
   tool, which uses an A* search algorithm to find the correct root of the word and performs its
   morphological analysis with 96.8% accuracy).
2- Search for the inputs in SentiRDI; if found, get the polarity (pol_word) and exit.
3- Else, get the synonyms of the input word and the polarity of each one.
4- Get the number of positive synonyms (P), the number of negative synonyms (N) and the number of
   objective synonyms (O).
5- Get M = max(P, N, O) // the maximum of the three numbers
   {
   if (M = P) then pol_word = positive
   if (M = N) then pol_word = negative
   if (M = O) then pol_word = objective
   else if (N = P) then pol_word = objective
   else if (N = O) then pol_word = negative
   else if (P = O) then pol_word = positive
   }
6- If the word cannot be found in the semantic database, determine its polarity manually and
   add it with its assigned polarity to SentiRDI.
End
Figure 5. Algorithm for determining the polarity of an Arabic word
Figure 5 explains how to determine the prior polarity of an input Arabic word using SentiRDI.
The procedure takes two inputs, the word and SentiRDI. First, the stem and root of the input word
are obtained using the RDI tools. The word, root and stem are searched for in SentiRDI; if found,
the polarity of the word is set to the matched polarity and the procedure exits (step 2).
Otherwise, the synonyms of the word and the polarity of each one are retrieved. The number of
positive synonyms (P), negative synonyms (N), and objective synonyms (O) are counted (step 4),
the maximum of the three numbers is taken as (M), and the polarity of the input word is set to
the polarity corresponding to (M) (step 5). If the input word is not covered by the semantic
database, its polarity is set manually and the word is added to SentiRDI.
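The procedure of Figure 5 can be sketched as follows (a hedged illustration: `lexicon` and `synonyms` are hypothetical stand-ins for the SentiRDI and RDI-ArabSemanticDB lookups, and English words are used only for readability):

```python
# Hedged sketch of the Figure 5 procedure. `lexicon` maps an entry to its
# polarity; `synonyms` maps a word to the members of its semantic field.

def word_polarity(word, lexicon, synonyms, stem=None, root=None):
    # Step 2: direct match on the word, its stem or its root.
    for key in (word, stem, root):
        if key is not None and key in lexicon:
            return lexicon[key]
    # Step 3: fall back to the polarities of the word's synonyms.
    polarities = [lexicon[s] for s in synonyms.get(word, ()) if s in lexicon]
    if not polarities:
        return None  # Step 6: uncovered word -> annotate manually, add to lexicon.
    # Steps 4-5: majority vote with the algorithm's tie-breaking rules.
    p, n, o = (polarities.count(c) for c in ("positive", "negative", "objective"))
    m = max(p, n, o)
    if p == n == m:
        return "objective"   # positive/negative tie
    if n == o == m:
        return "negative"    # negative/objective tie
    if p == o == m:
        return "positive"    # positive/objective tie
    return {p: "positive", n: "negative", o: "objective"}[m]

lex = {"good": "positive", "fine": "positive", "bad": "negative"}
syn = {"nice": ["good", "fine", "bad"], "odd": ["good", "bad"]}
print(word_polarity("nice", lex, syn))  # positive (2 positive vs 1 negative)
```

A positive/negative tie (as for the hypothetical "odd" above) resolves to objective, per step 5.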
5.2 Corpus
Corpus1 is Translated MPQA corpus provided in [42] containing subjective and objective
sentences of Arabic language .for English, MPQA corpus2 consisting of 535 English-language
news articles from 187 different foreign and U.S., manually annotated for subjectivity [32]. The
corpus consists of 9700sentences, 55% of them are labelled as subjective, while the rest are
objective.
5.3 Text Pre-processing
Simple text pre-processing was executed in order to remove special characters and non Arabic
characters. More advanced text pre-processing was executed in order to prepare it for input to
different learning algorithm.
• First step was extracting names from documents: this step reduced calculations for
determining the prior polarity for tokens because all named entities are objective.
Technique used in this step was Arabic Named Entity Recognition (ANER) [43] which
was an integration approach between two machine learning techniques, namely
bootstrapping semi-supervised pattern recognition and Conditional Random Fields (CRF)
classifier as a supervised technique. This technique extracted approximately 59,772
named entity from subjective corpus and 50,787 named entity from objective corpus.
• The Second step that was assigning Part Of Speech tags (POS). The tool that used to
assign POS tags was the Research and Development International (RDI) tool.
• Third step to obtain the prior polarity of words by using SentiRDI (Section 5.1).
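The character-cleaning step above can be sketched as follows (a minimal sketch: keeping only the basic Arabic Unicode block is an assumption, as the authors' exact character set is not specified):

```python
import re

# Hedged sketch of the simple pre-processing step: stripping special and
# non-Arabic characters. The range U+0600-U+06FF (basic Arabic block) is
# an assumed approximation of the characters the authors kept.
NON_ARABIC = re.compile(r"[^\u0600-\u06FF\s]+")

def clean(text):
    # Replace non-Arabic runs with a space, then collapse whitespace.
    return re.sub(r"\s+", " ", NON_ARABIC.sub(" ", text)).strip()

print(clean("الرأي rocks!! 123 جميل"))  # -> "الرأي جميل"
```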
Table 3. Statistics from the MPQA Arabic opinion corpus

                  Subjective corpus   Objective corpus
Total tokens      129,169             90,357
Total sentences   5335                4365
Total NER         59,772              50,787
1 http://www.cse.unt.edu/~rada/downloads.html#msa
2 http://www.cs.pitt.edu/mpqa
5.4 Feature selection
Feature extraction involves extracting tokens that are relevant to detecting sentiment in the
sentence. The features proposed here are similar to those in [44]:
1. POS: the Part-of-Speech (POS) tag of the word. The RDI-ArabMorpho-POS tagger [38] was
used, together with the prior polarity lexicon (SentiRDI), for determining the polarity of each
word.
1.1 Number of positive nouns.
1.2 Number of positive verbs.
1.3 Number of negative nouns.
1.4 Number of negative verbs.
2. Average polarity of the sentence:

   Avg_Polarity(S) = (1/n) * sum_{i=1..n} Pwi

   where n is the number of words in the sentence and Pwi is the polarity of word i, specified
   beforehand from the prior polarity database (SentiRDI), such that Pwi = +1 if word i is
   positive, -1 if it is negative, and 0 if it is objective.
3. Average Term Frequency - Inverse Sentence Frequency (TF-ISF) for a sentence (Si), computed
   by the following equation:

   TF-ISF(Si) = (1/||Si||) * sum_{t in Si} TF(t, Si) * ISF(t)

   where TF represents the number of occurrences of each term within the sentence, normalized
   by the sentence length:

   TF(t, S) = Nt,S / ||S||

   where Nt,S is the number of occurrences of term t in sentence S and ||S|| is the number of
   words in sentence S. ISF weights terms that appear in a small number of sentences:

   ISF(t) = log(N / nt)

   where N is the total number of sentences and nt is the number of sentences containing term t.
   This factor is useful because the number of subjective terms is small compared with neutral
   (objective) ones.
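Features 2 and 3 can be sketched as follows (a hedged illustration: the `polarity` mapping of words to +1/-1/0 is a hypothetical stand-in for a SentiRDI lookup):

```python
import math
from collections import Counter

# Hedged sketch of the average-polarity and average TF-ISF features.

def average_polarity(sentence, polarity):
    # Feature 2: mean word polarity (+1 positive, -1 negative, 0 objective).
    return sum(polarity.get(w, 0) for w in sentence) / len(sentence)

def avg_tf_isf(sentence, corpus):
    # Feature 3: average TF-ISF over the sentence's terms. TF is normalised
    # by sentence length; ISF = log(N / n_t) rewards terms occurring in few
    # sentences. Assumes every term of `sentence` occurs in `corpus`.
    n_sentences = len(corpus)
    tf = Counter(sentence)
    total = 0.0
    for term, count in tf.items():
        sentence_freq = sum(1 for s in corpus if term in s)
        total += (count / len(sentence)) * math.log(n_sentences / sentence_freq)
    return total / len(tf)

corpus = [["a", "b"], ["a", "c"]]
print(average_polarity(["good", "bad", "x"], {"good": 1, "bad": -1}))  # 0.0
```

A term occurring in every sentence contributes nothing (log 1 = 0), which matches the intent of down-weighting ubiquitous, typically objective terms.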
6. EXPERIMENTAL ANALYSIS
We used more than thirty classifiers of different types (Bayes, Rules, Trees and Functions). The
experiments were executed using 10-fold cross-validation with two prior polarity subjectivity
lexicons (the MPQA subjectivity lexicon proposed by [19] and the SentiRDI lexicon).
The evaluation is based on the exact match between the identified subjectivity of a sentence and
the manually annotated true subjectivity. We used precision, recall and F-measure for
evaluation.
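The 10-fold protocol can be sketched as follows (the classifiers themselves are not reproduced; this only shows how each fold serves once as the held-out test set):

```python
# Hedged sketch of 10-fold cross-validation over the sentence feature
# vectors: the data indices are split into ten folds, and each fold is
# held out once for testing while the other nine are used for training.

def k_fold_indices(n_samples, k=10):
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)          # round-robin assignment to folds
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, test

splits = list(k_fold_indices(100, k=10))
train, test = splits[0]
print(len(train), len(test))            # 90 10
```

Per-fold precision, recall and F-measure are averaged across the ten splits to give the figures in Table 4.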
The results of the subjectivity classifiers using SentiRDI and the MPQA subjectivity lexicon are
presented in Table 4. From our experiments, we find that most classifiers give better results with
SentiRDI, except Naïve Bayes, which gives higher precision with the MPQA subjectivity lexicon.
The FT classifier gives the best result with the selected features, reaching 76.1% accuracy when
SentiRDI is used as the prior polarity subjectivity lexicon.
Table 4. Subjectivity classifier results

Classifier    Metric      MPQA Subjectivity Lexicon   SentiRDI
NaiveBayes    F-measure   60.8                        70.55
              Precision   71.9                        70.6
              Recall      65.5                        70.5
Ridor         F-measure   71.2                        73.82
              Precision   71.7                        75.4
              Recall      71.8                        72.3
FT            F-measure   72.4                        76.1
              Precision   72.5                        76.2
              Recall      72.6                        76
LMT           F-measure   72.4                        76
              Precision   72.6                        76
              Recall      72.6                        76

7. CONCLUSION AND FUTURE WORK
This paper investigates a method for generating a sentiment Arabic lexical semantic database
using a dictionary-based approach. We also study the effect of the prior polarity of each word,
taken from our sentiment Arabic lexical semantic database, on sentence-level subjectivity
classification with multiple machine learning algorithms. Our main contribution is the
acquisition of the sentiment Arabic lexical semantic database (SentiRDI), which contains a very
large number of Arabic words with different derivational forms and part-of-speech tags. The
created lexicon is context independent and can be used on any opinion corpus.
In the future, we plan to extend the database based on further analysis of existing Arabic
opinion mining corpora. We also plan to try more classifiers to improve the results further, and
to apply contextual polarity to obtain good examples that can be added to our lexicon and
improve the classification results.
REFERENCES
[1] Minqing Hu and Bing Liu, 2004. Mining opinion features in customer reviews, Proceedings of the
National Conference on Artificial Intelligence, 755–760.
[2] Kavita Ganesan, ChengXiang Zhai and Jiawei Han, 2010. Opinosis: a graph-based approach to
abstractive summarization of highly redundant opinions, Proceedings of the 23rd International
Conference on Computational Linguistics, 340–348.
[3] Alexandra Balahur, Ester Boldrini, Andrés Montoyo and Patricio Martínez-Barco, 2009. Opinion
and Generic Question Answering systems: a performance analysis, Proceedings of the ACL-IJCNLP
2009 Conference Short Papers, 157–160.
[4] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede, 2011. Lexicon-based methods for
sentiment analysis, Computational Linguistics, vol. 37, no. 2, pp. 267–307.
[5] S. R. Das and M. Y. Chen, 2007. "Yahoo! for Amazon: Sentiment extraction from small talk on the
Web," Management Science, vol. 53, pp. 1375–1388.
[6] S. Morinaga, K. Yamanishi, K. Tateishi, and T. Fukushima, “Mining product reputations on the
Web,” Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining
(KDD), pp. 341–349, 2002.
[7] R. M. Tong, 2001. "An operational system for detecting and tracking opinions in on-line
discussion," Proceedings of the Workshop on Operational Text Classification (OTC).
[8] YI, J., NASUKAWA, T., BUNESCU, R. AND NIBLACK, W. 2003. Sentiment analyzer: Extracting
sentiments about a given topic using natural language processing techniques, In Proceedings of the
3rd IEEE International Conference on Data Mining, 427-434.
[9] C. Fellbaum, ed., Wordnet: An Electronic Lexical Database. MIT Press, 1998.
[10] B. Liu, Sentiment Analysis and Subjectivity, Handbook of Natural Language Processing. Second
Edition, (editors: N. Indurkhya and F. J. Damerau), 2010.
[11] Pang Bo, Lee Lillian, and Vaithyanathan Shivakumar,2002. Thumbs up? Sentiment classification
using machine learning techniques In Proceedings of EMNLP, pages 79–86.
[12] Janyce Wiebe and Rada Mihalcea , 2006. Word sense and subjectivity, In Proceedings of
COLING/ACL-06. Pages 1065--1072.
[13] Esuli Andrea and Sebastiani Fabrizio, 2006. SentiWordNet: A publicly available lexical resource
for opinion mining, In Proceedings of LREC.
[14] Wilson Theresa, Wiebe Janyce and Hoffmann Paul ,2005. Recognizing Contextual Polarity in
Phrase-Level Sentiment Analysis, In Proceedings of HLT/EMNLP 2005, Vancouver,Canada.
[15] Carlo Strapparava and Alessandro Valitutti ,2004. WordNet-Affect: an affective extension of
WordNet, In Proceedings of LREC 2004, pages 1083 – 1086, Lisbon, May .
[16] Kimberly Voll and Maite Taboada, 2007. Not All Words are Created Equal: Extracting
Semantic Orientation as a Function of Adjective Relevance, In Proceedings of the 20th Australian
Joint Conference on Artificial Intelligence, pages 337–346.
[17] Amitava Das and Sivaji Bandyopadhyay,2010. Towards The Global SentiWordNet, In the
Workshop on Model and Measurement of Meaning (M3), PACLIC 24, November 4, Sendai, Japan.
[18] Elhawary, M., and Elfeky, M. (2010). “Mining Arabic Business Reviews.” IEEE International
Conference on Data Mining Workshops
[19] M. Elarnaoty, S. AbdelRahman, and A. Fahmy. A Machine Learning Approach For Opinion Holder
Extraction Arabic Language. CoRR, abs/1206.1011, 2012.
[20] Muhammad Abdul-Mageed, Mona Diab, 2012. Toward building a large-scale Arabic sentiment
lexicon, Proceedings of the 6th International Global WordNet Conference.
[21] M. Abdul-Mageed, M. Korayem, and A. YoussefAgha. “Yes we can?”: Subjectivity annotation and
tagging for the health domain. In Proceedings of the International Confrence Recent Advances in
Natural Language Processing RANLP, Hissar, Bulgaria, 2011.
[22] Samhaa R. El-Beltagym, Ahmed Ali, 2013. Open issues in the sentiment analysis of Arabic social
media: a case study, Proceedings of 9th International Conference on Innovations in Information
Technology (IIT), pp. 215–220.
[23] Mahyoub, F. H., Siddiqui, M. A., Dahab, M. Y. (2014). Building an Arabic Sentiment Lexicon
Using Semi-Supervised Learning. Journal of King Saud University-Computer and Information
Sciences.
[24] Vasileios Hatzivassiloglou and Janyce M. Wiebe, 2000. Effects of adjective orientation and
gradability on sentence subjectivity, in Proceedings of the International Conference on
Computational Linguistics (COLING).
[25] Janyce Wiebe and Theresa Wilson,2002 . Learning to disambiguate potentially subjective
expressions, in Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 112–
118.
[26] Hong Yu and Vasileios Hatzivassiloglou, 2003. Towards answering opinion questions: Separating
facts from opinions and identifying the polarity of opinion sentences,” in Proceedings of the
Conference on Empirical Methods in Natural Language Processing (EMNLP).
[27] Ellen Riloff and Janyce Wiebe ,2003. Learning extraction patterns for subjective expressions, in
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
[28] Pang, B. , and Lee, L. ,2004. A sentimental education: Sentiment analysis using subjectivity
summarization based on minimum cuts, in Proceedings of the Association for Computational
Linguistics (ACL), pp. 271–278.
[29] Kim, S.M. , and Hovy, E. ,2005. Automatic detection of opinion bearing words and sentences, in
Companion Volume to the Proceedings of the International Joint Conference on Natural Language
Processing (IJCNLP).
[30] Kaji, N., and Kitsuregawa, M., 2006. Automatic construction of polarity-tagged corpus from
HTML documents, in Proceedings of the COLING/ACL Main Conference Poster Sessions.
[31] Wilson, T., Wiebe, J., and Hwa, R., 2004. Just how mad are you? Finding strong and weak
opinion clauses, in Proceedings of AAAI, pp. 761–769. (Extended version in Computational
Intelligence, vol. 22, no. 2, pp. 73–99, 2006.)
[32] Breck, E. , Choi, Y. , and Cardie, C. ,2007. Identifying expressions of opinion in context, in
Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad,
India.
[33] Janyce Wiebe and Ellen Riloff, 2005. Creating Subjective and Objective Sentence Classifiers from
Unannotated Texts, In Proceeding of CICLing-05, International Conference on Intelligent Text
Processing and Computational Linguistics, pages 486–497, Mexico City, Mexico.
[34] Riloff, E. , and Wiebe, J. , and Wilson, T., 2003. Learning subjective nouns using extraction pattern
bootstrapping, in Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 25–
32.
[35] Wiebe, J. , Wilson, T. , Bruce, R. , Bell, M. , and Martin, M. ,2004. Learning subjective language,
Computational Linguistics, vol. 30, pp. 277–308.
[36] M. Abdul-Mageed and M. T. Diab. Subjectivity and sentiment annotation of modern standard arabic
newswire. In Proceedings of the 5th Linguistic Annotation Workshop, LAW V ’11,pages 110–118,
2011.
[37] M.Attia, and M.Rashwan (2004). “A Large Scale Arabic POS Tagger Based on a Compact Arabic
POS Tag Set and Application on the Statistical Inference of Syntactic Diacritics of Arabic Text
Words”, NEMLAR.
[38] M.Attia, M.Rashwan, A.Ragheb, M.A.Al-Basoumy, and S.Abdou, 2008.A Compact Arabic Lexical
Semantics Language Resource Based on the Theory of Semantic Fields , Proceedings of the 6th
international conference on Advances in Natural Language Processing.
[39] Peter D. Turney and Michael L. Littman, 2002. Unsupervised learning of semantic orientation
from a hundred-billion-word corpus, Technical Report ERB-1094, National Research Council Canada.
[40] Hanaa B. Mobarz, , Mohsen Rashwan, and Samir AbdelRahman, 2011, Generating lexical
Resources for Opinion Mining in Arabic language automatically, The Eleventh Conference on
Language Engineering ESOLEC', Cairo-Egypt, http://esole-eg.org/index.php/en/conferences, Sept.
[41] Cerini, S., Compagnoni, V., Demontis, A., Formentelli, M., and Gandini, G. ,2007. Language
resources and linguistic theory: Typology, second language acquisition, English linguistics
(Forthcoming), chapter Micro-WNOp: A gold standard for the evaluation of automatically compiled
lexical resources for opinion mining,Franco Angeli Editore, Milano, IT.
[42] J. Wiebe and E. Riloff, 2005. Creating Subjective and Objective Sentence Classifiers from
Unannotated Texts, In Proceeding of CICLing-05, International Conference on Intelligent Text
Processing and Computational Linguistics, pages 486–497, Mexico City, Mexico.
[43] Samir AbdelRahman, Mohamed Elarnaoty, Marwa Magdy and Aly Fahmy, 2010. Integrated
Machine Learning Techniques for Arabic Named Entity Recognition, IJCSI International Journal of
Computer Science, pp. 1694-0784.
[44] Samir AbdelRahman ,Hanaa B. Mobarz, , Mohsen Rashwan, and Ibrahim Farg, 2014, Arabic
Phrase-Level Contextual Polarity Recognition to Enhance Sentiment Arabic Lexical Semantic
Database Generation, IJACSC International Journal of Advanced Research in Artificial Intelligence,
Vol 5,No.10.