Ontology is one of the most popular representation models used for knowledge representation, sharing, and reuse. The Arabic language has complex morphological, grammatical, and semantic aspects, and this complexity makes automatic Arabic terminology extraction difficult. In addition, concept extraction from Arabic documents has been a challenging research area because, as opposed to term extraction, concept extraction is more domain-related and more selective. In this paper, we present a new concept extraction method for Arabic ontology construction, which is part of our ontology construction framework. A new method to extract domain-relevant single- and multi-word concepts has been proposed, implemented, and evaluated. Our method combines linguistic information, statistical information, and domain knowledge. It first uses linguistic patterns based on POS tags to extract concept candidates, and then a stop-word filter removes unwanted strings. To determine the relevance of these candidates within the domain, different statistical measures and a new domain relevance measure are implemented for the first time for the Arabic language. To enhance the performance of concept extraction, domain knowledge is integrated into the module. The concept scores are calculated from their statistical values and domain knowledge values. To evaluate the performance of the method, precision scores were calculated. The results show the high effectiveness of the proposed approach for extracting concepts for Arabic ontology construction.
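The pipeline described above (POS-pattern candidates, stop-word filtering, then domain-relevance scoring) can be sketched roughly as follows. The candidate list, stop list, frequency counts, and the specific relevance ratio are all invented for illustration; the paper's actual measures are not given in the abstract.

```python
# Hypothetical sketch: filter POS-derived candidates with a stop list,
# then rank them by a simple domain-relevance ratio.

def domain_relevance(term, domain_freq, general_freq):
    """Score in [0, 1]: how specific a term is to the domain corpus."""
    d = domain_freq.get(term, 0)
    g = general_freq.get(term, 0)
    if d + g == 0:
        return 0.0
    return d / (d + g)

def rank_candidates(candidates, stop_words, domain_freq, general_freq):
    filtered = [c for c in candidates if c not in stop_words]
    return sorted(filtered,
                  key=lambda c: domain_relevance(c, domain_freq, general_freq),
                  reverse=True)

# Toy counts: "concept extraction" and "ontology" dominate in the domain
# corpus, "day" dominates in the general corpus.
domain_freq = {"ontology": 12, "concept extraction": 9, "day": 1}
general_freq = {"ontology": 1, "day": 40}
ranked = rank_candidates(["ontology", "concept extraction", "day", "the"],
                         {"the"}, domain_freq, general_freq)
print(ranked)  # → ['concept extraction', 'ontology', 'day']
```

Terms frequent in the domain but rare elsewhere float to the top, which is the intuition behind any domain relevance measure.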
The complexity of Arabic morphology is a core challenge of Arabic information retrieval. It makes Arabic a difficult environment for information retrieval specialists because the language is highly inflected. This paper explains the importance of stemming in information retrieval by developing a method to search for information needs in Arabic text. The newly developed method is based on a light stemmer to increase keyword matching between the query and the related documents in the test collection.
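As a rough illustration of how a light stemmer increases query-document matching, the sketch below strips a few common Arabic prefixes and suffixes. The affix lists here are abbreviated examples, not the full lists of any published light stemmer.

```python
# Hedged sketch of Arabic light stemming: strip one common prefix and one
# common suffix, keeping at least three letters of the word.
# These affix lists are illustrative, not any stemmer's actual lists.

PREFIXES = ["ال", "وال", "بال", "كال", "فال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]

def light_stem(word):
    # longest matching prefix first
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    # longest matching suffix first
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("المكتبات"))  # → مكتب
```

Because both «المكتبات» and «مكتبة» reduce to the stem «مكتب», a query containing either surface form can match documents containing the other, which is exactly the matching gain the method relies on.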
A Survey of Arabic Text Classification Models (IJECE / IAES)
There is a huge amount of Arabic text available online, and it requires organization. As a result, there are many natural language processing (NLP) applications concerned with text organization; one of them is text classification (TC), which makes unorganized text easier to deal with by classifying it into suitable classes or labels. This paper is a survey of Arabic text classification. It also presents a comparison among different methods for classifying Arabic texts, since Arabic is a complex language due to its vocabulary. Arabic is one of the richest languages in the world and has many linguistic bases, yet research in Arabic language processing is very limited compared to English. These problems represent challenges in the classification and organization of Arabic text. Text classification helps users access the documents or information that have already been assigned to one or more specific classes or categories. In addition, classifying documents helps a search engine reduce the number of documents it must consider, making it easier to search them and match them against queries.
MULTI-WORD TERM EXTRACTION BASED ON NEW HYBRID APPROACH FOR ARABIC LANGUAGE (csandit)
Arabic Multiword Terms (AMWTs) are relevant strings of words in text documents. Once they are automatically extracted, they can be used to increase the performance of text mining applications such as categorisation, clustering, information retrieval systems, machine translation, and summarization. This paper introduces our proposed multiword term extraction system based on contextual information. In fact, we propose a new method based on a hybrid approach for Arabic multiword term extraction. Like other hybrid methods, ours is composed of two main steps: a linguistic one and a statistical one. In the first step, the linguistic approach uses a Part-Of-Speech (POS) tagger (Taani's tagger) and a sequence identifier as patterns in order to extract candidate AMWTs. In the second step, which includes our main contribution, the statistical approach incorporates contextual information through a newly proposed association measure based on termhood and unithood for AMWT extraction. To evaluate the efficiency of the proposed method, it has been tested and compared using three different association measures: the proposed one, named NTC-Value, plus NC-Value and C-Value. Experimental results on Arabic texts from the environment domain show that our hybrid method outperforms the others in terms of precision; in addition, it deals correctly with tri-gram Arabic multiword terms.
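Of the three association measures compared, C-Value is the standard baseline (the paper's NTC-Value additionally folds in unithood and context weights, which the abstract does not specify). A minimal C-Value sketch with toy frequencies and a simple substring-based nesting test:

```python
import math

def c_value(term, freq, candidates):
    """Classic C-Value: longer terms weighted up, nested terms penalized.
    Only meaningful for multiword terms (log2 of a 1-word term is 0)."""
    words = term.split()
    base = math.log2(len(words)) * freq[term]
    # candidates that contain this term as a nested substring
    longer = [t for t in candidates if term in t and t != term]
    if not longer:
        return base
    avg_nested = sum(freq[t] for t in longer) / len(longer)
    return math.log2(len(words)) * (freq[term] - avg_nested)

freq = {"information retrieval system": 4,
        "information retrieval": 10,
        "retrieval system": 3}
cands = list(freq)
scores = {t: c_value(t, freq, cands) for t in cands}
print(scores["information retrieval"])  # → 6.0
```

A term that mostly occurs nested inside a longer candidate (here "retrieval system") is pushed down, which is the termhood intuition C-Value captures.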
COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMS (IJMIT Journal)
In the context of information retrieval, Arabic stemming algorithms have become a major research area. Many researchers have developed algorithms to solve the stemming problem, but each proposed their own methodology and measurements to test performance and compute accuracy, so accurate comparisons between these algorithms are hard to make. Many generic conflation techniques and stemming algorithms are theoretically analyzed in this paper. Then, the main Arabic language characteristics that must be covered before discussing Arabic stemmers are summarized. The evaluation in this paper shows that Arabic stemming is still one of the main information retrieval challenges. This paper compares the most commonly used light stemmers in terms of affix lists, algorithms, main ideas, and information retrieval performance. The results show that the light10 stemmer outperformed the other stemmers. Finally, recommendations for future research regarding the development of a standard Arabic stemmer are presented.
AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ... (IJAIA)
With the growing number of electronic documents used in many applications, a fast and accurate text classification method is very important. Arabic text classification is one of the most challenging topics, probably because Arabic words have almost unlimited variation in meaning, in addition to problems specific to the Arabic language. Many studies have shown that the Naive Bayes (NB) classifier is relatively robust, easy to implement, fast, and accurate in many fields, including text classification. However, non-linear classification problems and strong violations of the independence assumptions can lead to very poor NB performance. In this paper, we first preprocess the Arabic documents to tokenize only the Arabic words. Second, we convert those words into vectors using the term frequency-inverse document frequency (TF-IDF) technique. Third, we propose an efficient approach based on a Kernel Naive Bayes (KNB) classifier to address the non-linearity of Arabic text classification. Finally, experimental results and a performance evaluation on our collected Arabic topic-mining corpus are presented, showing the effectiveness of the proposed KNB classifier against other baseline classifiers.
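The abstract does not spell out the KNB formulation, but one common way to "kernelize" Naive Bayes is to replace each per-feature class density with a kernel density estimate instead of a single Gaussian, relaxing the distributional assumption. The sketch below does this on toy 2-feature vectors (which in the paper would be TF-IDF values); labels and numbers are invented.

```python
import math

def kde(x, samples, h=0.5):
    """Gaussian kernel density estimate at x from 1-D training samples."""
    return sum(math.exp(-((x - s) / h) ** 2 / 2) for s in samples) / (
        len(samples) * h * math.sqrt(2 * math.pi))

def knb_predict(x, train):
    """Naive Bayes with KDE class-conditional densities.
    train: {label: list of feature vectors}; x: feature vector."""
    best, best_lp = None, float("-inf")
    total = sum(len(v) for v in train.values())
    for label, vecs in train.items():
        lp = math.log(len(vecs) / total)          # class prior
        for j, xj in enumerate(x):
            col = [v[j] for v in vecs]            # j-th feature, this class
            lp += math.log(kde(xj, col) + 1e-12)  # naive independence
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train = {"sports": [[0.9, 0.1], [0.8, 0.2]],
         "politics": [[0.1, 0.9], [0.2, 0.7]]}
print(knb_predict([0.85, 0.15], train))  # → sports
```

The independence assumption is kept (one KDE per feature), so only the per-feature density model changes relative to plain NB.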
Class Diagram Extraction from Textual Requirements Using NLP Techniques (IOSR-JCE)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes high-quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
Named Entity Recognition System for Hindi Language: A Hybrid Approach (Waqas Tariq)
Named Entity Recognition (NER) is a major early step in Natural Language Processing (NLP) tasks like machine translation, text-to-speech synthesis, and natural language understanding. It seeks to classify words which represent names in text into predefined categories like location, person name, organization, date, and time. In this paper we use a combination of machine learning and rule-based approaches to classify named entities. The paper introduces a hybrid approach for NER. We have experimented with statistical approaches, namely Conditional Random Fields (CRF) and Maximum Entropy (MaxEnt), and a rule-based approach built on a set of linguistic rules. The linguistic approach plays a vital role in overcoming the limitations of statistical models for a morphologically rich language like Hindi. The system also uses a voting method to improve the performance of the NER system. Keywords: NER, MaxEnt, CRF, Rule base, Voting, Hybrid approach
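The voting step can be sketched simply: each token's final tag is the majority vote across the three taggers' outputs. The tag sequences below are hypothetical examples, not the paper's data.

```python
from collections import Counter

# Sketch of majority voting over per-token tags from three taggers
# (CRF, MaxEnt, and rule-based outputs; labels here are invented).

def vote(*taggings):
    final = []
    for token_tags in zip(*taggings):
        # most_common(1) gives the majority tag for this token
        final.append(Counter(token_tags).most_common(1)[0][0])
    return final

crf    = ["PER", "O", "LOC"]
maxent = ["PER", "O", "ORG"]
rules  = ["PER", "O", "LOC"]
print(vote(crf, maxent, rules))  # → ['PER', 'O', 'LOC']
```

With three voters, any tag agreed on by two systems wins, which is how the rule-based output can correct isolated statistical errors.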
Hybrid approaches for automatic vowelization of Arabic texts (IJNLC)
Hybrid approaches for the automatic vowelization of Arabic texts are presented in this article. The process is made up of two modules. In the first, a morphological analysis of the text words is performed using the open-source morphological analyzer AlKhalil Morpho Sys; its outputs for each word, analyzed out of context, are the word's different possible vowelizations. Integrating this analyzer into our vowelization system required adding a lexical database containing the most frequent words in the Arabic language. Using a statistical approach based on two hidden Markov models (HMMs), the second module aims to eliminate the ambiguities. For the first HMM, the unvowelized Arabic words are the observed states and the vowelized words are the hidden states. The observed states of the second HMM are identical to those of the first, but its hidden states are the lists of possible diacritics of each word without its Arabic letters. Our system uses the Viterbi algorithm to select the optimal path among the solutions proposed by AlKhalil Morpho Sys. Our approach opens an important way to improve the performance of automatic vowelization of Arabic texts for other uses in automatic natural language processing.
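The Viterbi selection over the analyzer's per-word candidate vowelizations can be sketched as follows. The candidate words (transliterated), start probabilities, and transition probabilities are invented for illustration; the real system's probabilities come from its trained HMMs.

```python
import math

# Minimal Viterbi over a per-word candidate lattice: each observed
# (unvowelized) word has several vowelized candidates, and bigram
# transition probabilities between consecutive candidates pick the path.

def viterbi(candidates, trans, start):
    """candidates: list of candidate lists per word; returns best path."""
    paths = {c: (math.log(start.get(c, 1e-9)), [c]) for c in candidates[0]}
    for cands in candidates[1:]:
        new = {}
        for c in cands:
            best_prev = max(paths, key=lambda p:
                            paths[p][0] + math.log(trans.get((p, c), 1e-9)))
            lp, path = paths[best_prev]
            new[c] = (lp + math.log(trans.get((best_prev, c), 1e-9)),
                      path + [c])
        paths = new
    return max(paths.values())[1]  # path with the highest log-probability

candidates = [["kataba", "kutiba"], ["ad-darsa", "ad-darsu"]]
start = {"kataba": 0.6, "kutiba": 0.4}
trans = {("kataba", "ad-darsa"): 0.7, ("kataba", "ad-darsu"): 0.1,
         ("kutiba", "ad-darsa"): 0.05, ("kutiba", "ad-darsu"): 0.6}
print(viterbi(candidates, trans, start))  # → ['kataba', 'ad-darsa']
```

The lattice structure is exactly the one the abstract describes: observed unvowelized words, hidden vowelized candidates, and Viterbi choosing among the analyzer's proposals.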
A New Approach to Parts of Speech Tagging in Malayalam (IJCSIT)
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag indicates the word's usage in the sentence. Usually, these tags mark a syntactic classification like noun or verb, and sometimes include additional information such as case markers (number, gender, etc.) and tense markers. A large number of current language processing systems use a parts-of-speech tagger for preprocessing.
There are mainly two approaches to parts-of-speech tagging: the rule-based approach and the stochastic approach. The rule-based approach uses predefined handwritten rules; it is the oldest approach and consults a lexicon or dictionary for reference. The stochastic approach uses probabilistic and statistical information to assign tags to words. It requires a large corpus, so its time and space complexity are high, whereas the rule-based approach has lower complexity in both time and space. The stochastic approach is the more widely used nowadays because of its accuracy.
Malayalam is a language of the Dravidian family, inflectional with suffixes attached to root word forms. The currently used algorithms are efficient machine learning algorithms, but they were not built for Malayalam, which affects the accuracy of Malayalam POS tagging.
My proposed approach uses dictionary entries along with adjacent-tag information, and the algorithm uses multithreading. Here, tagging is done with the probability of occurrence of the sentence structure together with the dictionary entry.
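The "dictionary entries plus adjacent-tag information" idea can be sketched as follows: the lexicon proposes possible tags for each word, and the bigram probability with the previous tag picks one. The lexicon, tag set, and probabilities below are hypothetical.

```python
# Sketch: dictionary lookup constrains the tag options, and the
# adjacent-tag (bigram) probability disambiguates among them.
# Lexicon entries and probabilities are invented for illustration.

def tag_sentence(words, lexicon, bigram_p):
    tags, prev = [], "<s>"  # "<s>" marks the sentence start
    for w in words:
        options = lexicon.get(w, ["NOUN"])  # unknown words default to NOUN
        best = max(options, key=lambda t: bigram_p.get((prev, t), 0.0))
        tags.append(best)
        prev = best
    return tags

lexicon = {"avan": ["PRON"], "odi": ["VERB", "NOUN"]}
bigram_p = {("<s>", "PRON"): 0.5, ("PRON", "VERB"): 0.6, ("PRON", "NOUN"): 0.2}
print(tag_sentence(["avan", "odi"], lexicon, bigram_p))  # → ['PRON', 'VERB']
```

The ambiguous word gets the tag that is most probable after the previous word's tag, which is the adjacent-tag disambiguation the approach describes (the greedy left-to-right choice here is a simplification).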
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (IJNLC)
A large amount of information lies dormant in historical documents and manuscripts, and it will go to waste if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text through optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style, and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word-spotting technique based on codes for matching word images in the Devanagari script. Shape information is used to generate integer codes for the words in a document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
Diacritic Oriented Arabic Information Retrieval System (CSC Journals)
Arabic language support in search engines and operating systems has improved in recent years. Searching the Internet in Arabic is reliable and comparable to the excellent support for several other languages, including English. However, there are limitations for text with diacritics; for this reason, most information retrieval (IR) systems remove diacritics from text and ignore them because of their complexity. Searching text with diacritics is important for some kinds of documents, such as religious books, some newspapers, and children's stories. This research presents the design and development of a system that overcomes this problem by taking diacritics into account. The proposed system places the design complexity in the retrieval algorithm rather than in the information repository, which in this study is a database. The study also analyses the results and the performance; the results are promising, and the performance analysis suggests ways to enhance the design and increase performance. The proposed system can be integrated into search engines, text editors, and any information retrieval system that includes Arabic text, and the performance analysis shows that it is reliable. The system is applied to a database of Hadeeth, a religious collection of the prophet's actions and statements, and it can be applied to any kind of data repository.
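The core distinction the system handles, diacritic-sensitive versus diacritic-insensitive matching, is easy to illustrate: Arabic diacritics (harakat) are Unicode combining marks, so an insensitive search strips them before comparing, while a sensitive search compares strings as-is.

```python
import unicodedata

# Arabic short vowels (fatha, damma, kasra, etc.) are combining marks,
# so unicodedata.combining() identifies them for removal.

def strip_diacritics(text):
    return "".join(ch for ch in text if not unicodedata.combining(ch))

word_with = "كَتَبَ"   # vowelized form
word_without = "كتب"   # unvowelized form

print(word_with == word_without)                    # → False (sensitive)
print(strip_diacritics(word_with) == word_without)  # → True  (insensitive)
```

A diacritic-oriented system keeps the first comparison available instead of always collapsing to the second, which is what most IR systems do.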
An information retrieval (IR) system aims to retrieve the documents relevant to a user query, where the query is a set of keywords. Cross-language information retrieval (CLIR) is a retrieval process in which the user issues queries in one language to retrieve information in another language. The growing need for Internet users to access information expressed in languages other than their own has established CLIR as a major topic in IR.
New instances classification framework on Quran ontology applied to question ... (TELKOMNIKA Journal)
Instance classification with a small dataset for a Quran ontology is a current research problem that appears in Quran ontology development. The existing classification approach used machine learning, namely a backpropagation neural network. However, this method has a drawback: if the training set is small, classifier accuracy can decline, and the Holy Quran unfortunately has a small corpus. Given this problem, our study aims to formulate a new instance classification framework for small training corpora, applied to a semantic question answering system. The resulting framework consists of several essential components: preprocessing, morphological analysis, semantic analysis, feature extraction, instance classification with the Radial Basis Function (RBF) network algorithm, and a transformation module. This algorithm is chosen because it is robust to noisy data and performs well on small datasets. Furthermore, the document processing module of the question answering system is used to access the instance classification results in the Quran ontology.
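As a toy illustration of the RBF idea (not the paper's trained network): hidden units are Gaussian radial basis functions around class prototypes, and the predicted class is the one whose units respond most strongly. The labels, prototype vectors, and uniform unit weights below are invented, standing in for the trained output layer.

```python
import math

def rbf(x, center, sigma=1.0):
    """Gaussian radial basis function response of one hidden unit."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-dist2 / (2 * sigma ** 2))

def classify(x, prototypes):
    """prototypes: {label: list of RBF center vectors}."""
    return max(prototypes,
               key=lambda lbl: sum(rbf(x, c) for c in prototypes[lbl]))

# Hypothetical 2-D feature vectors for two instance classes.
prototypes = {"prophet": [[1.0, 0.0]], "place": [[0.0, 1.0]]}
print(classify([0.9, 0.1], prototypes))  # → prophet
```

Because each unit responds locally around its center, a handful of well-placed centers can separate classes even with few training instances, which is why RBF networks suit small corpora.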
Novelty detection via topic modeling in research articles (csandit)
In today's world, redundancy is one of the most pressing problems in almost all domains. Novelty detection is the identification of new or unknown data or signals that a machine learning system was not aware of during training. The problem becomes more intense when it comes to research articles: a method of identifying novelty in each section of an article is highly desirable for determining the novel idea proposed in a paper. Since research articles are semi-structured, detecting novel information in them requires more accurate systems. Topic models provide a useful means to process them and a simple way to analyze them. This work compares the most predominantly used topic model, Latent Dirichlet Allocation, with the hierarchical Pachinko Allocation Model. The results obtained favour the hierarchical Pachinko Allocation Model when used for document retrieval.
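The abstract does not detail the novelty scoring, but once a topic model (LDA or PAM) has produced per-section topic distributions, one common way to score novelty is to measure how far a section's distribution diverges from the corpus-level distribution. The sketch below uses Jensen-Shannon divergence on invented 3-topic distributions.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric, bounded in [0, 1] when using log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2

corpus_topics = [0.50, 0.40, 0.10]   # average topic mix of the corpus
section_same  = [0.45, 0.45, 0.10]   # a typical section
section_novel = [0.05, 0.10, 0.85]   # a section dominated by a rare topic

print(js_divergence(corpus_topics, section_novel) >
      js_divergence(corpus_topics, section_same))  # → True
```

A section whose topic mix diverges strongly from the corpus is flagged as carrying novel content; hierarchical models like PAM refine this by also capturing topic correlations.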
Implementation of Urdu Probabilistic Parser (Waqas Tariq)
The implementation of an Urdu probabilistic parser is the main contribution of this research work. First, a large amount of Urdu text was collected from different sources, and the sentences in it were tagged. The tagged sentences were then parsed by a chart parser to formulate rules. In the next step, probabilities were assigned to these rules to obtain a probabilistic context-free grammar. The Urdu probabilistic parser uses the idea of a shift-reduce multi-path strategy. The developed software performs the syntactic analysis of a sentence using a given set of probabilistic phrase structure rules. The parse with the highest probability is selected as the most suitable one from the set of possible parses this parser produces. The structure of each sentence is represented as a sequence of rules. The parser parses sentences with 74% accuracy.
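The "select the parse with the highest probability" step can be illustrated with a toy Viterbi-style CKY over a tiny PCFG in Chomsky normal form. The grammar, probabilities, and Urdu-like words are invented; the paper's parser is shift-reduce based rather than CKY, so this is only a sketch of the probability-maximizing selection.

```python
# Toy probabilistic CKY: each cell keeps the best probability per
# nonterminal, so table[0][n] holds the max-probability parse symbols.

binary = {("NP", "VP"): ("S", 0.9), ("DET", "N"): ("NP", 0.7)}
lexical = {"larka": ("N", 0.6), "ek": ("DET", 0.8), "soya": ("VP", 0.5)}

def cky(words):
    n = len(words)
    table = [[{} for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):                 # fill in word-level cells
        tag, p = lexical[w]
        table[i][i + 1][tag] = p
    for span in range(2, n + 1):                  # combine adjacent spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b, pb in table[i][k].items():
                    for c, pc in table[k][j].items():
                        if (b, c) in binary:
                            a, pr = binary[(b, c)]
                            p = pb * pc * pr
                            if p > table[i][j].get(a, 0.0):
                                table[i][j][a] = p  # keep the best parse
    return table[0][n]

print(cky(["ek", "larka", "soya"]))  # → {'S': 0.1512}
```

The probability of the whole parse is the product of the rule probabilities used, and the table keeps only the best product per constituent, which is exactly the highest-probability-parse selection.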
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES (IJNLC)
Stemming is the process of term conflation: it conflates all the variants of a word to a common form called a stem. It plays a significant role in numerous Natural Language Processing (NLP) applications like morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question answering, machine translation, word sense disambiguation, and information retrieval (IR). Each of these tasks requires some preprocessing, and stemming is one of the important building blocks for all of them. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and the existing stemmers for Indic languages.
Single document keywords extraction in Bahasa Indonesia using phrase chunking (TELKOMNIKA Journal)
Keywords help readers understand the idea of a document quickly. Unfortunately, considerable time and effort are often needed to come up with a good set of keywords manually. This research focuses on generating keywords from a document automatically using phrase chunking. Firstly, we collected part-of-speech patterns from a collection of documents. Secondly, we used those patterns to extract candidate keywords from the abstract and the content of a document. Finally, keywords are selected from the candidates based on the number of words in the keyword phrases and several scenarios involving candidate reduction and sorting. We evaluated each scenario using precision, recall, and F-measure. The experiments show that: i) shorter-phrase keywords with string reduction, extracted from the abstract and sorted by frequency, yield the highest score; ii) in every proposed scenario, extracting keywords from the abstract always gives a better result; iii) using shorter-phrase patterns in keyword extraction scores better than using all phrase patterns; and iv) sorting scenarios based on the product of candidate frequencies and the weights of the phrase patterns offer better results.
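POS-pattern phrase chunking can be sketched by matching a noun-phrase-like pattern over the sequence of tags. The Indonesian words, their tags, and the single JJ*/NN+ pattern below are invented for illustration; the research collects its patterns from real documents.

```python
import re

# Sketch: join the POS tags into a string and pull out token spans whose
# tags match a noun-phrase pattern (optional adjectives, then nouns).
# Caveat: the toy regex would also match the "NN" prefix of a tag like
# "NNP"; a real implementation should match whole tags only.

def chunk(tagged, pattern=r"(?:JJ )*(?:NN ?)+"):
    tag_string = " ".join(t for _, t in tagged) + " "
    phrases = []
    for m in re.finditer(pattern, tag_string):
        start = tag_string[:m.start()].count(" ")       # first token index
        length = m.group().strip().count(" ") + 1       # tokens in the span
        phrases.append(" ".join(w for w, _ in tagged[start:start + length]))
    return phrases

tagged = [("sistem", "NN"), ("informasi", "NN"), ("yang", "IN"),
          ("baru", "JJ"), ("dibangun", "VB")]
print(chunk(tagged))  # → ['sistem informasi']
```

Each matched span is a candidate keyword phrase, which the scenarios in the paper then reduce and sort.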
Script Identification of Text Words from a Tri-Lingual Document Using Voting ... (CSC Journals)
In a multi-script environment, many documents contain text printed in more than one script or language. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this context, this paper develops a model to identify and separate text words of Kannada, Hindi, and English scripts in a printed tri-lingual document. The proposed method is trained to learn the distinct features of each script thoroughly, and a binary tree classifier is used to classify the input text image. The experiments involved 1500 text words for learning and 1200 text words for testing, carried out on both a manually created dataset and a scanned dataset. The results are very encouraging and prove the efficacy of the proposed model: the average success rate is 99% for the manually created dataset and 98.5% for the dataset constructed from scanned document images.
Design and Development of a Malayalam to English Translator - A Transfer Based... (Waqas Tariq)
This paper describes a transfer-based scheme for translating Malayalam, a Dravidian language, to English. The system takes Malayalam sentences as input and outputs equivalent English sentences. It comprises a preprocessor for splitting compound words, a morphological parser for context disambiguation and chunking, a syntactic structure transfer module, and a bilingual dictionary. All the modules are morpheme-based to reduce dictionary size. The system does not rely on a stochastic approach; it is built on a rule-based architecture along with various linguistic knowledge components of both Malayalam and English. It uses two sets of rules: rules for Malayalam morphology and rules for syntactic structure transfer from Malayalam to English. The system is designed using artificial intelligence techniques.
A survey on phrase structure learning methods for text classification (IJNLC)
Text classification is the task of automatically classifying text into one of a set of predefined categories. The problem has been widely studied in communities such as natural language processing, data mining, and information retrieval. Text classification is an important constituent of many information management tasks like topic identification, spam filtering, email routing, language identification, genre classification, and readability assessment. The performance of text classification improves notably when phrase patterns are used, since they help capture non-local behaviour. Phrase structure extraction is the first step toward phrase pattern identification. This survey provides a detailed study of phrase structure learning methods. It should enable future work in several NLP tasks that use syntactic information from phrase structure, such as grammar checking, question answering, information extraction, machine translation, and text classification. The paper also provides different levels of classification and a detailed comparison of the phrase structure learning methods.
An Efficient Semantic Relation Extraction Method For Arabic Texts Based On Si... (CSCJournals)
Semantic relation extraction is an important component of ontologies that can support many applications, e.g. text mining, question answering and information extraction. However, extracting semantic relations between concepts is not trivial and is one of the main challenges in the Natural Language Processing (NLP) field. In this paper, we propose a method for semantic relation extraction between concepts. The method relies on the definition of a concept context and on semantic similarity measures to extract relations from a domain corpus. In this work, we implemented algorithms for concept context construction and for similarity computation based on different semantic similarity measures, and we analyze the proposed methods and evaluate their performance. Preliminary experiments showed that the best results are a precision of 83%, obtained with the Lin measure at a minimum confidence of 0.50, and a precision of 85% with the Cosine and Jaccard similarity measures. The main advantage is the automatic and unsupervised operation: the method does not need any pre-labeled training data and can also be used effectively for relation extraction in various domains. The results show the high effectiveness of the proposed approach for extracting relations for Arabic ontology construction.
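As a rough illustration of the similarity computation this abstract describes, the Cosine and Jaccard measures over concept context vectors can be sketched as follows. The function names and toy co-occurrence counts are invented for the example; this is not the authors' implementation.

```python
import math

def cosine(a, b):
    # a, b: term -> frequency dicts representing concept contexts
    num = sum(a[t] * b[t] for t in a if t in b)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def jaccard(a, b):
    # Set overlap of the context terms, ignoring frequencies
    inter = len(set(a) & set(b))
    union = len(set(a) | set(b))
    return inter / union if union else 0.0

# Toy context vectors for two concepts (term co-occurrence counts)
ctx_bank = {"money": 3, "loan": 2, "account": 1}
ctx_credit = {"money": 2, "loan": 1, "card": 2}

print(round(cosine(ctx_bank, ctx_credit), 3))  # 0.713
print(jaccard(ctx_bank, ctx_credit))           # 0.5
```

A pair of concepts whose context similarity exceeds a confidence threshold (e.g. 0.50, as in the abstract) would then be proposed as related.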
A Novel Approach for Rule Based Translation of English to Marathi (aciijournal)
This paper presents the design of a rule-based machine translation system for the English to Marathi language pair. The system takes an English sentence as input and parses it with the Stanford parser, which handles the main processing on the source side. An English to Marathi bilingual dictionary is created; the system takes the parsed output, separates the source text word by word, and searches for the corresponding target words in the bilingual dictionary. Hand-coded rules are written for Marathi inflections, along with reordering rules. After the reordering rules are applied, the English sentence is syntactically reordered to suit the Marathi language.
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG... (ijnlc)
This article introduces a methodology for analyzing sentiment in Arabic text using a foreign lexical source. Our method leverages a resource available in another language, SentiWordNet in English, to compensate for the limited language resources of Arabic. The knowledge taken from the external resource is injected into the feature model while the machine-learning-based classifier is trained. The first step of our method is to build the bag-of-words (BOW) model of the Arabic text. The second step calculates the polarity scores using machine translation and the English SentiWordNet. The scores for each text are added to the model as three features: objective, positive, and negative. The last step trains the ML classifier on that model to predict the sentiment of the Arabic text. Our method improves performance over the BOW baseline model in most cases, and it appears to be a viable approach to sentiment analysis in Arabic text where available resources are limited.
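The feature-model construction described here, a bag-of-words vector extended with lexicon-derived polarity scores, can be sketched roughly like this. The toy lexicon stands in for scores that would be obtained from SentiWordNet via machine translation; all names and values are illustrative.

```python
def bow(tokens, vocab):
    # Bag-of-words counts over a fixed vocabulary
    return [tokens.count(w) for w in vocab]

def add_polarity(features, lexicon, tokens):
    # Append aggregate (positive, negative, objective) scores taken
    # from an external lexicon; each entry is a (pos, neg, obj) tuple
    pos = sum(lexicon.get(t, (0, 0, 0))[0] for t in tokens)
    neg = sum(lexicon.get(t, (0, 0, 0))[1] for t in tokens)
    obj = sum(lexicon.get(t, (0, 0, 0))[2] for t in tokens)
    return features + [pos, neg, obj]

vocab = ["good", "bad", "movie"]
lexicon = {"good": (0.75, 0.0, 0.25), "bad": (0.0, 0.7, 0.3)}
tokens = ["good", "movie", "good"]
vec = add_polarity(bow(tokens, vocab), lexicon, tokens)
print(vec)  # BOW counts followed by the three polarity features
```

The resulting vector would be fed, together with a sentiment label, to any standard ML classifier for training.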
Car-Following Parameters by Means of Cellular Automata in the Case of Evacuation (CSCJournals)
This study focuses on the car-following model, an important part of microscopic traffic flow. Unlike the Nagel-Schreckenberg studies, whose car-following model contains neither agent drivers nor diligent ones, this work introduces agent drivers and diligent drivers into the car-following part of the model, and lane changing is also included. The impact of agent drivers and diligent drivers under particular circumstances, such as evacuation, is considered. Based on simulation results, the relations between evacuation time and diligent drivers are obtained using different numbers of agent drivers, and the previous (Nagel-Schreckenberg) model is compared with the proposed one with respect to evacuation time. The effectiveness of reducing the evacuation time is also presented for various numbers of agent and diligent drivers.
Customer Opinions Evaluation: A Case Study on Arabic Tweets (gerogepatton)
This paper presents an automatic method for extracting, processing and analyzing customer opinions on Arabic social media. We present a four-step approach for mining Arabic tweets. First, Natural Language Processing (NLP) with different types of analyses is performed. Second, we present an automatic and expandable lexicon for Arabic adjectives; the initial lexicon is built using 1350 adjectives as seeds, obtained from processing different Arabic-language datasets, and is automatically expanded by collecting synonyms and morphemes of each word through Arabic resources and Google Translate. Third, emotional analysis is carried out with two different methods: Machine Learning (ML) and a rule-based method. Finally, Feature Selection (FS) is applied to enhance the mining results. The experimental results reveal that the proposed method outperforms counterpart ones with an improvement margin of up to 4% in F-Measure.
Hybrid approaches for automatic vowelization of arabic texts (ijnlc)
Hybrid approaches for automatic vowelization of Arabic texts are presented in this article. The process is made up of two modules. In the first, a morphological analysis of the text words is performed using the open-source morphological analyzer AlKhalil Morpho Sys; its output for each word, analyzed out of context, is the word's different possible vowelizations. Integrating this analyzer into our vowelization system required adding a lexical database containing the most frequent words in the Arabic language. The second module uses a statistical approach based on two hidden Markov models (HMMs) to eliminate the ambiguities. In the first HMM, the unvowelized Arabic words are the observed states and the vowelized words are the hidden states. The observed states of the second HMM are identical to those of the first, but its hidden states are the lists of possible diacritics of the word without its Arabic letters. Our system uses the Viterbi algorithm to select the optimal path among the solutions proposed by AlKhalil Morpho Sys. Our approach opens an important way to improve the performance of automatic vowelization of Arabic texts for other uses in automatic natural language processing.
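The Viterbi selection step over an HMM, as used in the second module above, can be sketched as follows. The candidate forms, transliterated for readability, and all probabilities are a toy example, not data from the described system.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # Standard Viterbi: pick the most probable hidden-state path
    # (e.g. vowelized forms) for an observed sequence
    # (e.g. unvowelized words).
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][p][0] * trans_p[p][s] * emit_p[s][o],
                 V[-2][p][1] + [s])
                for p in states)
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]

states = ["kataba", "kutiba"]          # candidate vowelized forms (toy)
obs = ["ktb", "ktb"]                   # unvowelized skeletons observed
start_p = {"kataba": 0.6, "kutiba": 0.4}
trans_p = {"kataba": {"kataba": 0.7, "kutiba": 0.3},
           "kutiba": {"kataba": 0.4, "kutiba": 0.6}}
emit_p = {"kataba": {"ktb": 1.0}, "kutiba": {"ktb": 1.0}}
print(viterbi(obs, states, start_p, trans_p, emit_p))
```

In the real system the transition and emission probabilities would be estimated from a vowelized corpus, and the state space would be the candidate analyses returned by AlKhalil Morpho Sys.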
A New Approach to Parts of Speech Tagging in Malayalam (ijcsit)
Part-of-speech tagging is the process of labeling each word in a sentence with a tag indicating the word's usage in that sentence. Usually these tags indicate a syntactic classification such as noun or verb, and sometimes include additional information such as case markers (number, gender, etc.) and tense markers. A large number of current language processing systems use a part-of-speech tagger for pre-processing.
There are two main approaches to part-of-speech tagging: the rule-based approach and the stochastic approach. The rule-based approach uses predefined handwritten rules; it is the oldest approach and uses a lexicon or dictionary for reference. The stochastic approach uses probabilistic and statistical information to assign tags to words. It requires a large corpus, so its time and space complexity are high, whereas the rule-based approach has lower complexity in both time and space. The stochastic approach is the more widely used nowadays because of its accuracy.
Malayalam is a Dravidian language, inflectional, with suffixes attached to the root word forms. The currently used algorithms are efficient machine learning algorithms, but they were not built for Malayalam, which affects the accuracy of Malayalam POS tagging.
The proposed approach uses dictionary entries along with adjacent-tag information, implemented with multithreading. Tagging is done using the probability of the occurrence of the sentence structure together with the dictionary entries.
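The core idea of combining dictionary entries with adjacent-tag information can be sketched as a greedy bigram tagger. The words, tagset and probabilities below are invented for the example; the paper's multithreaded algorithm is not reproduced here.

```python
def tag(sentence, dictionary, bigram_p, start_tag="<s>"):
    # Greedy tagger: for each word, choose the dictionary tag that is
    # most probable given the previous tag (adjacent-tag information).
    tags, prev = [], start_tag
    for word in sentence:
        candidates = dictionary.get(word, ["NOUN"])  # fall back to NOUN
        best = max(candidates, key=lambda t: bigram_p.get((prev, t), 0.0))
        tags.append(best)
        prev = best
    return tags

# Toy dictionary: each word maps to its possible tags
dictionary = {"time": ["NOUN", "VERB"], "flies": ["VERB", "NOUN"]}
# Toy adjacent-tag probabilities P(tag | previous tag)
bigram_p = {("<s>", "NOUN"): 0.6, ("<s>", "VERB"): 0.2,
            ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.3}
print(tag(["time", "flies"], dictionary, bigram_p))
```

A full implementation would search all tag sequences (e.g. with Viterbi) rather than choosing greedily, but the interplay of dictionary lookup and adjacent-tag scores is the same.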
There is a huge amount of Arabic text available online, which requires organization. As a result, there are many applications of natural language processing (NLP) concerned with text organization; one of them is text classification (TC). TC makes it easier to deal with unorganized text by classifying it into suitable classes or labels. This paper is a survey of Arabic text classification. It also presents a comparison among different methods for the classification of Arabic texts; Arabic text is complex because of its vocabulary. Arabic is one of the richest languages in the world, with many linguistic bases, yet research in Arabic language processing is scarce compared to English. As a result, these problems represent challenges in the classification and organization of Arabic text. Text classification helps users access documents or information that has already been classified into one or more specific classes or categories. In addition, the classification of documents helps a search engine reduce the number of documents to consider, making it easier to search and match against queries.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information is lying dormant in historical documents and manuscripts; this information would go to waste if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text by optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word spotting technique based on codes for matching the word images of Devanagari script: shape information is utilised to generate integer codes for the words in the document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
Diacritic Oriented Arabic Information Retrieval System (CSCJournals)
Arabic language support in search engines and operating systems has improved in recent years; searching the Internet is reliable and comparable to the excellent support for several other languages, including English. However, for text with diacritics there are some limitations. For this reason, most information retrieval (IR) systems remove diacritics from text and ignore them because of their complexity. Yet searching text with diacritics is important for some kinds of documents, such as religious books, some newspapers and children's stories. This research presents the design and development of a system that overcomes this problem by taking diacritics into account. The proposed system places the design complexity in the retrieval algorithm rather than in the information repository, which in this study is a database. The study also analyses the results and the performance: the results are promising, and the performance analysis suggests ways to enhance the design and increase performance. The proposed system can be integrated into search engines, text editors and any information retrieval system that handles Arabic text, and the performance analysis shows that it is reliable. The system is applied to a database of Hadeeth, a religious book containing the Prophet's actions and statements, and it can be applied to any kind of data repository.
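The distinction the system draws between diacritic-sensitive and diacritic-insensitive matching can be sketched as follows. This is a simplification for illustration; the actual retrieval algorithm described in the abstract is more involved.

```python
# Arabic diacritics (tashkeel): fathatan through sukun, U+064B-U+0652
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def strip_diacritics(text):
    # Remove combining diacritic marks, keeping the letter skeleton
    return "".join(ch for ch in text if ch not in DIACRITICS)

def match(query, word, diacritic_sensitive=False):
    # Sensitive mode compares the full forms; insensitive mode
    # compares skeletons with diacritics removed.
    if diacritic_sensitive:
        return query == word
    return strip_diacritics(query) == strip_diacritics(word)

w1 = "كَتَبَ"   # fully vowelized form
w2 = "كتب"     # same letter skeleton, no diacritics
print(match(w2, w1))         # skeletons match
print(match(w2, w1, True))   # full forms differ
```

A diacritic-aware system like the one described would support both modes, letting a query over a vowelized corpus such as Hadeeth distinguish words that share a skeleton.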
An information retrieval (IR) system aims to retrieve documents relevant to a user query, where the query is a set of keywords. Cross-language information retrieval (CLIR) is a retrieval process in which the user issues queries in one language to retrieve information in another language. The growing requirement for Internet users to access information expressed in languages other than their own has established CLIR as a major topic in IR.
New instances classification framework on Quran ontology applied to question ... (TELKOMNIKA JOURNAL)
Instance classification with a small dataset for a Quran ontology is a current research problem in Quran ontology development. The existing classification approach used machine learning, namely a backpropagation neural network. However, this method has a drawback: if the training set is small, classifier accuracy can decline, and the Holy Quran unfortunately has a small corpus. Given this problem, our study aims to formulate a new instance classification framework for a small training corpus applied to a semantic question answering system. The resulting framework consists of several essential components: pre-processing, morphology analysis, semantic analysis, feature extraction, instance classification with the Radial Basis Function network algorithm, and a transformation module. This algorithm is chosen because it is robust to noisy data and performs well on small datasets. Furthermore, the document processing module of the question answering system uses the instance classification results in the Quran ontology.
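The Radial Basis Function network chosen here computes Gaussian activations around learned centers and sums them with output weights. A minimal forward-pass sketch, where the centers, widths and weights are toy values rather than parameters trained on Quran data:

```python
import math

def rbf_predict(x, centers, widths, weights):
    # RBF network forward pass: hidden activations are Gaussians
    # around the centers; the output is their weighted sum.
    hidden = [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                       / (2 * s ** 2))
              for c, s in zip(centers, widths)]
    return sum(w * h for w, h in zip(weights, hidden))

centers = [(0.0, 0.0), (1.0, 1.0)]   # toy prototype feature vectors
widths = [0.5, 0.5]
weights = [1.0, -1.0]                # toy output-layer weights
print(round(rbf_predict((0.0, 0.0), centers, widths, weights), 3))
```

Because each prediction depends only on distances to a handful of prototypes, such networks can behave sensibly even when the training corpus is small, which matches the motivation given in the abstract.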
Novelty detection via topic modeling in research articles (csandit)
In today's world, redundancy is a vital problem faced in almost all domains. Novelty detection is the identification of new or unknown data or signals that a machine learning system was not aware of during training. The problem becomes more intense for research articles: a method for identifying novelty in each section of an article is needed to determine the novel idea proposed in a research paper. Since research articles are semi-structured, detecting novel information in them requires more accurate systems. Topic models provide a useful and simple means to process and analyze them. This work compares the most predominantly used topic model, Latent Dirichlet Allocation, with the hierarchical Pachinko Allocation Model. The results obtained favor the hierarchical Pachinko Allocation Model when used for document retrieval.
Implementation of Urdu Probabilistic Parser (Waqas Tariq)
The implementation of an Urdu probabilistic parser is the main contribution of this research work. First, a large amount of Urdu text was collected from different sources; the sentences in the text were then tagged, and the tagged sentences were parsed by a chart parser to formulate the rules. In the next step, probabilities were assigned to these rules to obtain a Probabilistic Context-Free Grammar. The parser uses a shift-reduce multi-path strategy. The developed software performs the syntactic analysis of a sentence using a given set of probabilistic phrase structure rules; the parse with the highest probability is selected as the most suitable from the set of possible parses produced by the parser. The structure of each sentence is represented as a sequence of rules. This parser parses sentences with 74% accuracy.
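Selecting the highest-probability parse from a PCFG, as the parser above does, can be sketched as follows. The grammar, its probabilities and the candidate parses are toy examples, not the actual Urdu grammar rules.

```python
from math import prod

def parse_probability(rules, grammar):
    # A parse's probability is the product of its rules' probabilities
    return prod(grammar[r] for r in rules)

def best_parse(parses, grammar):
    # Pick the most probable of the candidate parses
    return max(parses, key=lambda rules: parse_probability(rules, grammar))

# Toy PCFG: probabilities of rules sharing a left-hand side sum to 1
grammar = {"S -> NP VP": 1.0, "NP -> N": 0.6, "NP -> Det N": 0.4,
           "VP -> V": 0.5, "VP -> V NP": 0.5}
parses = [["S -> NP VP", "NP -> N", "VP -> V"],
          ["S -> NP VP", "NP -> Det N", "VP -> V NP", "NP -> N"]]
print(best_parse(parses, grammar))
```

In the described system the candidate parses come from the shift-reduce multi-path strategy, and the rule probabilities are estimated from the tagged, chart-parsed corpus.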
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES (ijnlc)
Stemming is the process of term conflation: it conflates all variants of a word to a common form called the stem. It plays a significant role in numerous Natural Language Processing (NLP) applications such as morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation and information retrieval (IR). Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all of them. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and the existing stemmers for Indic languages.
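As an illustration of the suffix-stripping style of term conflation surveyed here, a toy English light stemmer is sketched below. The affix lists are invented for the example; real Indic stemmers use language-specific rule sets.

```python
def light_stem(word, prefixes=("un", "re"),
               suffixes=("ing", "ed", "es", "s")):
    # Longest-match affix stripping: a minimal rule-based conflation
    # step; the length guard keeps very short stems intact.
    for p in sorted(prefixes, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) > 2:
            word = word[len(p):]
            break
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) > 2:
            word = word[:-len(s)]
            break
    return word

print([light_stem(w) for w in ["playing", "played", "plays"]])
```

All three variants conflate to the same stem, which is exactly the behaviour an IR system relies on when matching query terms against document terms.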
Single document keywords extraction in Bahasa Indonesia using phrase chunking (TELKOMNIKA JOURNAL)
Keywords help readers to understand the idea of a document quickly. Unfortunately, considerable time and effort are often needed to come up with a good set of keywords manually. This research focused on generating keywords from a document automatically using phrase chunking. Firstly, we collected part of speech patterns from a collection of documents. Secondly, we used those patterns to extract candidate keywords from the abstract and the content of a document. Finally, keywords are selected from the candidates based on the number of words in the keyword phrases and some scenarios involving candidate reduction and sorting. We evaluated the result of each scenario using precision, recall, and F-measure. The experiment results show: i) shorter-phrase keywords with string reduction extracted from the abstract and sorted by frequency provides the highest score, ii) in every proposed scenario, extracting keywords using the abstract always presents a better result, iii) using shorter-phrase patterns in keywords extraction gives better score in comparison to using all phrase patterns, iv) sorting scenarios based on the multiplication of candidate frequencies and the weight of the phrase patterns offer better results.
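The pattern-based candidate extraction step described above can be sketched as follows. The POS tags and the single noun-noun pattern are illustrative, not the patterns collected in the study, and the tagging itself is assumed to have been done already.

```python
from collections import Counter

def extract_candidates(tagged, patterns):
    # Slide over the POS-tagged tokens and emit every phrase whose
    # tag sequence matches one of the collected patterns.
    phrases = []
    tags = [t for _, t in tagged]
    for pat in patterns:
        n = len(pat)
        for i in range(len(tagged) - n + 1):
            if tuple(tags[i:i + n]) == pat:
                phrases.append(" ".join(w for w, _ in tagged[i:i + n]))
    return phrases

tagged = [("keyword", "NN"), ("extraction", "NN"), ("uses", "VBZ"),
          ("phrase", "NN"), ("chunking", "NN")]
patterns = [("NN", "NN")]           # toy pattern: noun followed by noun
counts = Counter(extract_candidates(tagged, patterns))
print(counts.most_common())
```

Sorting the resulting candidates by frequency, as in the best-performing scenario the abstract reports, then yields the keyword list.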
Dictionary based concept mining: an application for Turkish (csandit)
In this study, a dictionary-based method is used to extract expressive concepts from documents. So far there have been many studies of concept mining in English, but this area of study is still immature for Turkish, an agglutinative language. We used a dictionary instead of WordNet, the lexical database grouping words into synsets that is widely used for concept extraction. Dictionaries are rarely used in the domain of concept mining, but given that dictionary entries carry synonyms, hypernyms, hyponyms and other relationships in their definition texts, the success rate for determining concepts has been high. This concept extraction method is applied to documents collected from different corpora.
A New Keyphrases Extraction Method Based on Suffix Tree Data Structure for Ar...
Document Clustering is a branch of a larger area of scientific study known as data mining; it is an unsupervised classification used to find structure in a collection of unlabeled data. The useful information in documents can be accompanied by a large amount of noise words when using Full Text Representation, which negatively affects the result of the clustering process. It is therefore necessary to eliminate the noise words and keep just the useful information in order to enhance the quality of the clustering results. This problem occurs to different degrees in any language, such as English, European languages, Hindi, Chinese, and Arabic. To overcome this problem, in this paper we propose a new and efficient keyphrase extraction method based on the Suffix Tree data structure (KpST); the extracted keyphrases are then used in the clustering process instead of Full Text Representation. The proposed method for keyphrase extraction is language independent and may therefore be applied to any language. In this investigation, we are interested in the Arabic language, which is one of the most complex languages. To evaluate our method, we conduct an experimental study on Arabic documents using the most popular clustering approach of hierarchical algorithms: the agglomerative hierarchical algorithm with seven linkage techniques and a variety of distance functions and similarity measures to perform the Arabic document clustering task. The obtained results show that our method for extracting keyphrases increases the quality of the clustering results. We also study the effect of stemming the test dataset when clustering it with the same document clustering techniques and similarity/distance measures.
Due to the exponential growth in the generation of textual data, the need for tools and mechanisms for the automatic summarization of documents has become critical. Text documents are vital to any organization's day-to-day work, and long documents often hamper trivial tasks; an automatic summarizer is therefore vital to reducing human effort. Text summarization is an important activity in the analysis of high-volume text documents and is currently a major research topic in Natural Language Processing. It is the process of generating a summary of an input text by extracting its representative sentences. In this project, we present a novel technique for generating summaries of domain-specific text using Semantic Analysis, a subset of Natural Language Processing.
Arabic Text Categorization Algorithm Using Vector Evaluation Method
Text categorization is the process of grouping documents into categories based on their contents. This
process is important to make information retrieval easier, and it became more important due to the huge
textual information available online. The main problem in text categorization is how to improve the
classification accuracy. Although Arabic text categorization is a new and promising field, there is little research in it. This paper proposes a new method for Arabic text categorization using vector evaluation. The proposed method uses a categorized Arabic document corpus; the weights of the tested document's words are then calculated to determine the document's keywords, which are compared with the keywords of the corpus categories to determine the tested document's best category.
The effect of training set size in authorship attribution: application on sho...
Authorship attribution (AA) is a subfield of linguistic analysis that aims to identify the original author among a set of candidate authors. Several research papers have been published and several methods and models developed for many languages. However, the number of related works for Arabic is limited, and the impact of short word length and training set size is not well addressed; to the best of our knowledge, no published works in this direction, even for other languages, are available. Therefore, we propose to investigate this effect, taking into account different stylometric combinations. The Mahalanobis distance (MD), Linear Regression (LR), and Multilayer Perceptron (MP) are selected as AA classifiers. During the experiments, the training dataset size is increased and the accuracy of the classifiers is recorded. The results are quite interesting and show different classifier behaviours. Combining word-based stylometric features with n-grams provides the best accuracy, reaching 93% on average.
Automatically finding domain-specific key terms in a given set of research papers is a challenging task, and matching research papers to a particular area of research is a concern for many people, including students, professors, and researchers. A domain classification of papers facilitates that search process: given a list of domains in a research field, we try to find out to which domain(s) a given paper is most related. Besides, reading a whole paper takes a long time, and using domain knowledge requires much human effort, e.g., manually labeling a large corpus. In particular, we use the abstract and keywords of a research paper as the seed terms to identify similar terms from a domain corpus, which are then filtered by checking their appearance in the research papers. Experiments show that the TF-IDF measure and the classification step make this method assign papers to domains more precisely. The results show that our approach can extract the terms effectively while being domain independent.
There is a vast amount of unstructured Arabic information on the Web, this data is always organized in
semi-structured text and cannot be used directly. This research proposes a semi-supervised technique that
extracts binary relations between two Arabic named entities from the Web. Several works have been
performed for relation extraction from Latin texts and as far as we know, there isn’t any work for Arabic
text using a semi-supervised technique. The goal of this research is to extract a large list or table from
named entities and relations in a specific domain. A small set of a handful of instance relations are
required as input from the user. The system exploits summaries from Google search engine as a source
text. These instances are used to extract patterns. The output is a set of new entities and their relations. The
results from four experiments show that precision and recall vary according to relation type. Precision ranges from 0.61 to 0.75, while recall ranges from 0.71 to 0.83. The best result is obtained for the (player, club) relationship, with 0.72 precision and 0.83 recall.
A Hybrid Composite Features Based Sentence-Level Sentiment Analyzer
Current lexica and machine learning based sentiment analysis approaches
still suffer from a two-fold limitation. First, manual lexicon construction and
machine training is time consuming and error-prone. Second, the
prediction’s accuracy entails sentences and their corresponding training text
should fall under the same domain. In this article, we experimentally
evaluate four sentiment classifiers, namely support vector machines (SVMs),
Naive Bayes (NB), logistic regression (LR) and random forest (RF). We
quantify the quality of each of these models using three real-world datasets
that comprise 50,000 movie reviews, 10,662 sentences, and 300 generic
movie reviews. Specifically, we study the impact of a variety of natural
language processing (NLP) pipelines on the quality of the predicted
sentiment orientations. Additionally, we measure the impact of incorporating
lexical semantic knowledge captured by WordNet on expanding original
words in sentences. Findings demonstrate that utilizing different NLP
pipelines and semantic relationships impacts the quality of the sentiment
analyzers. In particular, results indicate that coupling lemmatization and
knowledge-based n-gram features proved to produce higher accuracy results.
With this coupling, the accuracy of the SVM classifier improved to 90.43%, while it was 86.83%, 90.11%, and 86.20% for the three other classifiers, respectively.
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
Extracting key phrases from documents is a common task in many applications. In general, a noun phrase extractor consists of three modules: tokenization, part-of-speech tagging, and noun phrase identification. These are used as the three main steps in building the new system, ANPE. This paper aims at picking Arabic noun phrases from a corpus of documents, and the relevant criteria (recall and precision) are used as evaluation measures. On the one hand, when using NPs rather than single terms, the system yields more relevant documents among those retrieved; on the other hand, it gives lower precision because the number of retrieved documents decreases. The researchers conclude by recommending improvements for more effective and efficient research in the future.
Query Answering Approach Based on Document Summarization
The growth of online information has necessitated thorough research in the domain of automatic text summarization within the Natural Language Processing (NLP) community. The aim of this paper is to propose a novel, language-independent automatic summarization approach that combines three main approaches: the Rhetorical Structure Theory (RST), the query processing approach, and the Network Representation Approach (NRA). RST, as a theory of the major aspects of the structure of natural text, is used to extract the semantic relations behind the text. The query processing approach classifies the question type and finds the answer in a way that suits the user's needs. The NRA is used to create a graph representing the extracted semantic relations. The output is an answer that not only responds to the question but also gives the user an opportunity to find additional information related to it. We implemented the proposed approach and, as a case study, applied it to Arabic text in the agriculture field. The implemented approach succeeded in summarizing extension documents according to the user's query. The results have been evaluated using Recall, Precision, and F-score measures.
A Novel Approach for Document Clustering Using Concept Extraction
In this paper we present a novel approach to extract the concept from a document and cluster a set of such documents depending on the concept extracted from each of them. We transform the corpus into a vector space using term frequency-inverse document frequency, calculate the cosine distance between each pair of documents, and then cluster them using the K-means algorithm. We also use multidimensional scaling to reduce the dimensionality of the corpus. This results in grouping the documents that are most similar to each other with respect to their content and genre.
ATAR: Attention-Based LSTM for Arabizi Transliteration
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale "Arabizi to Arabic script" parallel corpus focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).
A Survey of Various Methods for Text Summarization
Document summarization means retrieving short, important text from a source document. In this paper, we study various techniques. Plenty of techniques have been developed for English summarization and other Indian languages, but very little effort has been made for Hindi. Here, we discuss various techniques in terms of features such as time and memory consumption, efficiency, accuracy, ambiguity, and redundancy.
Abeer Alarfaj & Abdulmalik Alsalamn
International Journal of Artificial Intelligence and Expert Systems (IJAE), Volume (9) : Issue (1) : 2020
A New Concept Extraction Method for Ontology Construction
From Arabic Text
Abeer Alarfaj aaalarfaj@pnu.edu.sa
Department of Computer Sciences
College of Computer and Information Sciences
Princess Nora Bint AbdulRahman University
Riyadh, Saudi Arabia
Abdulmalik Alsalamn salman@ksu.edu.sa
Department of Computer Science
College of Computer and Information Sciences
King Saud University
Riyadh, Saudi Arabia
Abstract
Ontology is one of the most popular representation models used for knowledge representation, sharing, and reuse. The Arabic language has complex morphological, grammatical, and semantic aspects. Due to the complexity of the Arabic language, automatic Arabic terminology extraction is difficult. In addition, concept extraction from Arabic documents has been a challenging research area because, as opposed to term extraction, concept extraction is more domain-related and more selective. In this paper, we present a new concept extraction method for Arabic ontology construction, which is part of our ontology construction framework. A new method to extract domain-relevant single-word and multi-word concepts has been proposed, implemented, and evaluated. Our method combines linguistic information, statistical information, and domain knowledge. It first uses linguistic patterns based on POS tags to extract concept candidates, and then a stop-word filter is applied to remove unwanted strings. To determine the relevance of these candidates to the domain, different statistical measures and a new domain relevance measure are implemented for the first time for the Arabic language. To enhance the performance of concept extraction, domain knowledge is integrated into the module. The concept scores are calculated according to their statistical values and domain knowledge values. To evaluate the performance of the method, precision scores were calculated. The results show the high effectiveness of the proposed approach in extracting concepts for Arabic ontology construction.
Keywords: Ontology Construction, Arabic Ontology, Arabic Language Processing, Concept
Extraction, Arabic Term Extraction, Specific Domain Corpus.
1. INTRODUCTION
Ontology construction includes several steps: term extraction, synonym extraction, concept learning, finding relations between the extracted concepts, and adding them to the existing ontology [1]. Automatic extraction of concepts is one of the most important tasks of ontology learning.
Term extraction is a prerequisite for all aspects of ontology learning from text. Its purpose is to extract domain-relevant terms from natural language text. Terms are the linguistic realization of domain-specific concepts; a term can be a single word or a multi-word compound relevant to the domain in question [9].
The Arabic language has complex morphological, grammatical, and semantic aspects since it is a
highly derivational and inflectional language, which makes morphological analysis a very complex
task. All these difficulties pose a significant challenge to researchers developing NLP systems in general, and terminology extraction for Arabic in particular [1].
To build an Arabic ontology, the first step is to find the important concepts of the domain. Concepts are linguistically represented by terms, so the domain-specific terms must be extracted from texts. For English, several studies address concept extraction; for unstructured Arabic documents, there are studies on key phrase extraction and multi-word term extraction, such as (El-Beltagy and Rafea [11]; Boulaknadel et al. [5]; Bounhas and Slimani [6]; Saif and AbAziz [18]). However, key phrase extraction differs from concept extraction. In the framework of (Al-Arfaj and Al-Salman [3]), concept extraction consists of terminology extraction and concept identification. Concept extraction from Arabic documents has been a challenging research area because, as opposed to term extraction, concept extraction is more domain-related and more selective.
According to the study [1], how to extract and rank single-word and multi-word concepts by relevance to the domain remains an open problem.
In this paper, we first describe the new proposed method for concept extraction from Arabic text. Then we present the experiments used to evaluate its performance on medicine documents from the hadith corpus.
The distinctive features of our method for concept extraction are as follows:
- Our method is unsupervised; it does not need training data.
- Our method does not use contrasting corpora to identify domain concepts, which avoids the problem of skewness in term frequency information.
- It can extract domain-relevant concepts using a combination of statistical measures and domain knowledge. This overcomes the problems found in methods that rely only on frequency or TF-IDF to measure the importance of candidates.
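The combination of statistical measures and domain knowledge described above can be sketched as a single candidate score. The snippet below is a minimal illustration only: the TF-IDF weight, the binary domain-knowledge weight, and the linear combination with `alpha` are assumptions, not the paper's exact formulas.

```python
import math

def score_candidate(term, tf, df, n_docs, domain_terms, alpha=0.7):
    """Score a candidate concept by combining a statistical weight
    (here plain TF-IDF) with a binary domain-knowledge weight.
    All weighting choices are illustrative, not the paper's method."""
    tfidf = tf * math.log(n_docs / (1 + df))
    domain_weight = 1.0 if term in domain_terms else 0.0
    return alpha * tfidf + (1 - alpha) * domain_weight
```

With such a score, a candidate that is both statistically prominent and confirmed by the domain knowledge resource outranks one supported by frequency alone.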
The main contributions of the paper are as follows:
- We propose a new method for concept extraction from Arabic text.
- We compare different statistical algorithms for term weighting and extraction.
The rest of the paper is organized as follows. A brief overview of the existing works in concepts
extraction is summarized in section 2. Section 3 provides the details of the proposed method and
its main steps. Section 3.1 describes the pre-processing steps. The algorithm used and
implemented in this study for candidate concept extraction is described in Section 3.2. Section
3.3 presents the concept extraction selection algorithm. In Section 3.4, a candidate concepts
selection algorithm using domain knowledge is presented. Section 3.5 describes the post-
processing techniques used in this study. Section 4 provides a comparative evaluation of the
method in terms of precision and summarizes the findings and results. Finally, section 5 concludes the paper and discusses areas of future work.
2. RELATED WORKS
The main objective of the concept extraction task is to identify domain concepts in pre-processed documents for the domain being investigated. Concept extraction is the primary, basic layer of ontology learning from text and is very useful in many applications, such as search, classification, and clustering. The key challenge is how to extract domain-specific concepts that represent the key information of a corpus in a domain of interest.
Concept formation should provide an intensional definition of concepts, their extension, and the lexical items used to refer to them [9]. Also, [7] considered that a concept should have a linguistic realization. Therefore, in order to identify the set of concepts of a domain, it is necessary to analyze a document to identify the important domain terms that represent concepts, which can
be a single word or a multi-word term. The importance of a term is measured by modeling statistical and linguistic features, and the terms above a certain threshold are taken as concepts. Therefore, the major challenge in concept extraction is being able to differentiate domain terms from non-domain terms [4].
Many concept extraction methods have been proposed in the literature. TF-IDF (Term Frequency-Inverse Document Frequency) is a popular method widely used in information retrieval and machine learning. This method is adopted by Text-To-Onto [8]: first, it employs a set of pre-defined linguistic filters (particularly POS-tag-based rules) to extract candidate terms, including single-word and multi-word terms, from texts. Then, statistical measures are used to remove irrelevant concepts.
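The linguistic-filter step of such pipelines can be sketched as pattern matching over POS-tag sequences. The sketch below assumes a generic tagger that emits (word, tag) pairs; the tag names and the single noun-phrase pattern are illustrative, not any particular tool's tag set.

```python
import re

# Candidate noun-phrase pattern: one or more nouns optionally followed
# by an adjective. The tag names are illustrative assumptions; real
# POS taggers define their own tag sets.
PATTERN = re.compile(r"\bNOUN(?: NOUN)*(?: ADJ)?\b")

def extract_candidates(tagged_tokens):
    """tagged_tokens: list of (word, pos_tag) pairs from a POS tagger.
    Returns the surface strings matching the noun-phrase pattern."""
    tags = " ".join(pos for _, pos in tagged_tokens)
    words = [w for w, _ in tagged_tokens]
    candidates = []
    for m in PATTERN.finditer(tags):
        start = tags[:m.start()].count(" ")   # token index of match start
        length = m.group(0).count(" ") + 1    # number of tokens matched
        candidates.append(" ".join(words[start:start + length]))
    return candidates
```

The extracted strings then go through the stop-word filter and the statistical ranking described in the text.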
Clustering techniques can be used to induce concepts. Based on Harris's distributional hypothesis (Harris 1970 [24]), which states that words occurring in similar contexts often share related meanings, a concept can be considered a cluster of related and similar terms. Formal concept analysis and latent semantic indexing have also been used to build attribute-value pairs that correspond to concepts [23]. Another approach utilized WordNet to extract synonyms and relevant information about a given concept that contribute to the concept definition [4].
In [20], an unsupervised approach was used for concept extraction from clinical text. The method
combined natural language descriptions of concepts with word representations and composed
these into higher-order concept vectors. These concept vectors are then used to assign labels to
candidate phrases, which are extracted using a syntactic chunker. The authors reported an exact
F-score of 0.32 and an inexact F-score of 0.45 on the well-known i2b2-2010 challenge corpus.
The authors of [17] proposed a method for concept extraction using semantic information,
semantic graph-based Chinese domain concept extraction (SGCCE). First, the similarities between
terms are calculated using word co-occurrence, the LDA topic model, and Word2Vec. Then, a
semantic graph of terms is constructed based on these similarities. Finally,
according to the semantic graph of the terms, community detection algorithms are used to divide
the terms into different concept clusters. The experimental results showed the effectiveness of
the proposed method. The results of the concept extraction are used for relation extraction
tasks [16].
A method for multi-word term (MWT) extraction in Arabic for the environment domain is proposed
in [5]. The authors identified candidate terms first using linguistic filters; then, four statistical
measures (LLR, FLR, MI, and t-score) were used for ranking the MWT candidates. Their
experiments showed that the LLR, FLR, and t-score measures outperform the MI measure, and
that LLR outperforms the other methods with a precision of 85%.
The authors of [22] presented a hybrid approach for extracting collocations from the Crescent
Quranic Corpus. They first analyzed the text with AraMorph, and simple terms were then extracted
using the TF-IDF measure, obtaining a precision of 88%. For collocation extraction, the authors
used a rule-based approach and MI, reaching a precision of 0.86.
In [26], the authors extracted Arabic terminology from an Islamic corpus. They used a linguistic
filter to extract candidate MWTs matching given syntactic patterns. In the statistical filter, they
applied TF-IDF to rank the single-word term candidates, and statistical measures (PMI, Kappa,
Chi-square, t-test, Piatetsky-Shapiro, and Rank Aggregation) to rank the MWT candidates.
From the experiments, the authors indicated the effectiveness of Rank Aggregation compared to
the other association measures, with a precision of 80% in the n-best list with n=100.
The method in [27] considered contextual information and both termhood and unithood in the
association measures at the statistical filtering stage. To extract MWT candidates, the authors
applied syntactic patterns, and for candidate ranking, several statistical measures were used,
including C-value, NC-value, NTC-value, and NLC-value. Their experimental results showed that
the NLC-value measure outperformed the others with a precision of 82% on an Arabic
environment corpus.
An extensive analysis of term extraction approaches and a summary of Arabic terminology
extraction methods, with the main intent of highlighting their strengths and weaknesses in
extracting domain-relevant terms, are detailed in [1].
In this work, we implemented five algorithms for concept extraction: Concept Frequency-
Document Frequency (CF-DF), Term Frequency-Inverse Document Frequency (TF-IDF),
Concept Frequency-Inverse Document Frequency (CF-IDF), Relative Concept Frequency (RCF),
and Average Concept Frequency in the corpus (Avg-CF).
We investigate the performance of these algorithms and demonstrate that the CF-DF algorithm is
the best choice for concept extraction. We also discuss why one method performs better than the
others and what could be done to further improve performance.
Our contributions to the Arabic concept extraction field are as follows. We evaluate and compare
different statistical measures and propose a new one for weighting candidate concepts. Our
method can extract rare concepts, even those appearing with low frequency. It also excludes
irrelevant concepts even if they occur frequently in the corpus.
We assessed the effectiveness of our method by computing the precision for each experiment.
For evaluation purposes, we focus on the medicine domain of the hadith corpus. We observe that
our method for concept extraction from Arabic text significantly improves precision.
The output list from this module constitutes the fundamental layer of the ontology.
3. PROPOSED METHOD FOR ARABIC CONCEPT EXTRACTION
This section describes the baseline measures and the new method that we propose for Arabic
concept extraction and ranking based on linguistic information, statistical information, and domain
knowledge. Our method for concept extraction has four main steps, shown in Figure 1: pre-processing,
a linguistic filter to extract candidate terms (single- and multi-word), candidate selection according to
different weighting models, and post-processing (refinement and expert validation).
After the preprocessing phase, the annotated corpus is input to the concept extraction phase, in
which terms representing concepts are extracted. First, we apply a linguistic filter (patterns on
POS-annotated text) to extract concept candidates. Additionally, a stop word filter is used to
eliminate terms that are not expected to be concepts in the domain, which improves the precision
of the output candidate terms. Second, we calculate the weights of the candidates using a
combination of statistical information and domain knowledge: the higher the weight of a candidate,
the more relevant it is to the domain. Statistical information is obtained using different statistical
measures (CF-DF, TF-IDF, CF-IDF, RCF, Avg-CF). Third, since all the statistical measures are
based on frequency alone, multi-word terms and concepts that are important to the domain but
have low frequency would not be extracted. To enhance the performance of concept extraction,
domain knowledge is integrated into the module; it is obtained from a domain-specific list
extracted from the taxonomy of the corpus.
Finally, our method generates a ranked list of key concepts according to their weights. An average
threshold is computed to prune incorrect concepts. The concepts are displayed to the expert, who
chooses the valid concepts and adds any missing ones. In the following, we describe each of
these steps in more detail.
3.1 Pre-Processing
Arabic documents were processed using the steps described in [2]. After the preprocessing phase,
the annotated corpus is input to the concept extraction phase. More details of the preprocessing
steps and tools are presented in [2]. Table 1 shows a sample hadith text before and after the
preprocessing steps.
Normalization
Several levels of orthographic normalization are carried out on the input documents before
analyzing the text. This includes normalization of the hamzated alif to a bare alif, normalization of
taa marbutah to haa, and removal of extra spaces between words. In this paper we work on texts
without diacritics, since diacritics have no effect on determining concepts and the relations
between them, and they generate problems and overhead in most Arabic morphological
analyzers. In this work, ambiguity is resolved by using context; for example, the word العين appears
in contexts such as {حق العين, النظره العين} and {العين, الكحل, امراه}, so, according to the
context, we can determine the meaning of the word.
These steps are performed to normalize the input text:
Convert to UTF-8 encoding.
Remove diacritics and tatweel.
Remove non-Arabic letters.
Replace (أ, إ, آ) by (ا).
Replace (ة) by (ه).
Remove extra spaces between words.
Separate punctuation from words.
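A minimal Python sketch of these normalization steps (the Unicode ranges and replacement sets are our assumptions, not the implementation used in this work; the non-Arabic-letter removal step is omitted for brevity):

```python
import re

# Arabic diacritics (tashkeel, U+064B-U+0652) and tatweel (U+0640)
DIACRITICS = re.compile(r'[\u064B-\u0652\u0640]')

def normalize(text: str) -> str:
    """Apply the orthographic normalization steps listed above."""
    text = DIACRITICS.sub('', text)                          # remove diacritics and tatweel
    text = re.sub(r'[\u0623\u0625\u0622]', '\u0627', text)   # alif variants -> bare alif
    text = text.replace('\u0629', '\u0647')                  # taa marbutah -> haa
    text = re.sub(r'([\u060C\u061B\u061F:,.!?])', r' \1 ', text)  # separate punctuation
    text = re.sub(r'\s+', ' ', text).strip()                 # collapse extra spaces
    return text
```

For example, normalize('شَرْبَة عَسَل') yields 'شربه عسل'.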
Sentence Segmentation and Tokenization
The output of this phase is a set of individual sentences to be considered for further processing.
Each sentence is given as input to the next module. After sentence segmentation, the next step is
to pass each sentence to the Stanford POS tagger for tagging the tokens.
3.2 Candidate Concept Extraction
The main objective of this step is to extract all possible concept candidates using linguistic
techniques. The detailed steps are as follows.
3.2.1 The Linguistic Filter
We use the Stanford POS tagger for Arabic text [19] to perform the linguistic analysis phase. The
tagger assigns to each token in the sentence its grammatical category. For example, for each
sentence of the hadith text in Table 1, the tagger identifies nouns, verbs, and prepositions. We
chose the Stanford POS tagger because it reaches an accuracy of 96.42% on Arabic and is
written in Java, so it can be easily integrated into our module.
Arabic terms consist mostly of nouns and adjectives, and sometimes prepositions. We observed
that many of the terms mentioned in the hadith corpus are multi-word terms. The syntactic
structure of multi-word terms can be used to extract semantic relations between concepts.
Preprocessing steps Sample hadith Text
Sample hadith Text ثالثة في الشفاء:الكي عن أمتي وأنهى ،نار َّةيوك ،محجم وشرطة ،عسل شربة.
Text after normalization ثالثه في الشفاء:، عسل شربهالكي عن امتي وانهى ، نار وكيه ، محجم وشرطه.
Text after sentence segmentation ثالثه في الشفاء
عسل شربه
محجم وشرطه
نار وكيه
الكي عن امتي وانهى
Text after Stanford POS tagger الشفاء/DTNNفي/INثالثه/CD
شربه/NNعسل/NN
و/CCشرطه/NNمحجم/JJ
و/CCكيه/NNنار/NN
و/CCانهى/VBDامتي/NNPعن/INالكي/DTNN
The list of Noun phrases after using
linguistic filters
الشفاء,عسل شربه,شرطه,محجم شرطه,نار كيه,امتي,الكي
The list of Noun phrases after stemming شفاء,عسل شرب,شرط,محجم شرط,نار كيه,امتي,الكي
The list of Noun phrases after root
extraction
شفو,عسل شرب,شرط,حجم شرط,نير كيه,امتي,الكي
Text after MADA+Stanford POS tagger الشفاء/DTNNفي/INثالث/NNه/PRP$
:/PUNCشرب/VBDه/PRPعسل/NN
،/JJو/CCشرط/NNه/PRP$محجم/NN
،/NNPو/CCكي/NNPه/PRP$نار/NN
،/JJو/CCانهي/VBDامة/NNي/PRP$عن/INالكي/DTNNP
./PUNC
The list of Noun phrases after using
linguistic filters
الشفاء,ثالث,عسل,شرط,محجم,كي,نار,امة,الكي
TABLE 1: A sample hadith text from Medicine Book after linguistic analysis steps.
3.2.2 Candidate Concepts Extraction Following Patterns
Concepts in an ontology represent a set of classes of entities or things within a domain [14].
According to the literature, concepts are usually described by noun phrases, so we considered
every noun phrase in the document as a candidate term. Therefore, to extract noun phrase
candidates, we implemented linguistic filters based on predefined POS patterns.
First, we applied the Stanford POS tagger prior to the linguistic filters. The next step is to
determine patterns for noun phrase concepts using the tags. The algorithm for extracting noun
phrases based on these patterns is shown in Algorithm 1. A noun phrase consists of a head noun
followed by one or more nouns or adjectives [5,6]. A linguistic filter is applied to each tagged
sentence to extract candidate
multi-word terms.

FIGURE 1: The Proposed Concept Extraction Method.

Based on our inspection of the concepts identified by studying the medicine
book of the hadith corpus, we implemented all the filters in Table 2 for our experiments. In Table 1,
we see an example of the noun phrases extracted from the sample hadith text. Many candidate
terms identified by this method are not key concepts. Therefore, several filters are used to reduce
the number of candidate terms. The first filter is the stop word filter.
Pattern Example
(Noun )+ الشفاء
(Noun)+ (Adjective)+ السوداء َّةبالح
Noun Preposition Noun الداء من الحجامة
TABLE 2: Examples of Linguistic patterns of noun phrases for ontological terms extraction.
3.2.3 Stop Word Filter
We notice that the POS tagger returns the terms such as شيء , احدهما, احيانا as nouns. In order to
improve precision of the output candidate terms, we have implemented our own stop words filter
to eliminate terms that are not expected to be a concept in the domain. The list of stop words is
constructed based on the domain observations. Further, stop words are not allowed at the
beginning or end of the phrase. If a phrase starts or ends with a stop word, it will be removed
from the list of candidate terms and keep ones in the extracted noun phrase for relation
extraction. (examples: “الداء من الحجامه” “الراس على الحجامه”).
In Table 3, we see the example of the candidate terms after applying stop word filter.
TABLE 3: Example of using the Stop Word Filter.

The list of Noun phrases after the linguistic filter: النبي, عليه, وسلم, هللا, الحلواء, العسل
The list of Candidate terms after the stop word filter: النبي, هللا, الحلواء, العسل

ALGORITHM 1: Candidate Concepts Extraction Algorithm.

Algorithm: Linguistic Filters
Input: set of statements S in preprocessed text
Patterns:
  Pattern 1: a noun followed by one or more nouns
  Pattern 2: one or more nouns followed by one or more adjectives
  Pattern 3: a noun followed by a preposition followed by a noun
Output: set of noun phrases NPlist
Procedure:
Begin
  For each statement Si in S do
    Ci ← Stanford POS tagger(Si)
    For each Pattern in Patterns
      NPi ← Apply Pattern(Ci)
      Add NPi to NPlist
    i ← i + 1
  End
  For each NPi in NPlist
    If NPi[head] or NPi[end] exists in the stop word list
      Remove NPi from NPlist
  End
  return NPlist
End
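Algorithm 1 can be sketched in Python as follows. The tag sets and the greedy left-to-right matching are simplifying assumptions; the tagger output is modeled as a list of (word, tag) pairs using the Penn-style tags shown in Table 1:

```python
# A simplified sketch of Algorithm 1: extract noun-phrase candidates from a
# POS-tagged sentence, then drop phrases that start or end with a stop word.
NOUN = {'NN', 'NNS', 'NNP', 'NNPS', 'DTNN', 'DTNNP'}
ADJ = {'JJ', 'DTJJ'}
PREP = {'IN'}

def extract_candidates(tagged_sentence, stop_words):
    """tagged_sentence: list of (word, tag) pairs produced by the POS tagger."""
    candidates = []
    n = len(tagged_sentence)
    for i, (word, tag) in enumerate(tagged_sentence):
        if tag not in NOUN:
            continue
        # Patterns 1/2: a noun followed by nouns and/or adjectives
        j = i + 1
        while j < n and tagged_sentence[j][1] in NOUN | ADJ:
            j += 1
        if j > i + 1:
            candidates.append(' '.join(w for w, _ in tagged_sentence[i:j]))
        # Pattern 3: noun + preposition + noun
        if i + 2 < n and tagged_sentence[i + 1][1] in PREP \
                and tagged_sentence[i + 2][1] in NOUN:
            candidates.append(' '.join(w for w, _ in tagged_sentence[i:i + 3]))
        # A single noun is also kept as a candidate, as in Table 2
        if j == i + 1:
            candidates.append(word)
    # Stop-word filter: discard phrases that begin or end with a stop word
    return [c for c in candidates
            if c.split()[0] not in stop_words and c.split()[-1] not in stop_words]
```

Note that stop words occurring inside a phrase (such as a preposition in Pattern 3) are kept, matching the behavior described in Section 3.2.3.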
3.3 Candidate Concept Selection
After extracting candidate terms using the linguistic filter, our method assigns weights to these
candidate terms, ranks them, and extracts the ones with higher weights. The weight of concept
candidate is determined using a combination of the statistical and domain knowledge. It is based
on the following idea: candidate term that is included in the domain specific list is likely to be a
domain concept. Statistical knowledge is obtained from the different statistical measures
implemented in this module. To measure the relative importance of the concept candidates in the
specific domain, we use a domain list that consists of domain key concepts. The list of concept
candidates is parsed to determine if it is found in domain specific list.
Our method assigns higher weights to more domain specific concepts; candidates that are
frequently appear in the corpus and in the domain specific list. All the weights of concept
candidates are normalized to range from 0 to 1 after they are calculated. All candidate concepts
are then ranked in descending order by their weights.
The domain concepts can be extracted from the ranked list according to different parameters,
either by defining a specific number of concepts to be extracted or by setting the average
frequency as a threshold. Algorithm 2 shows the algorithm for candidate concept selection and
ranking using the different statistical measures. Details of the proposed method for candidate
concept selection are presented in the following sub-sections.
ALGORITHM 2: The Candidate Concepts Selection and Ranking.

Algorithm: Candidate concepts selection
Purpose: extract and rank candidate concepts using statistical measures
Input: corpus = set of documents of a specific domain
       Avg-threshold = frequency threshold for candidate terms
Output: Lterms: list of ranked terms
Begin
  Read corpus
  Normalize the input text
  Segment sentences
  Tag the tokens
  Extract candidate terms t by filtering with patterns
  Apply the stop word filter to each candidate term t
  For each candidate term t, apply the statistical measure
    CF-DF-value(t) = CF(t) × DF(t)
    Add t to Lterms
  End
  Rank Lterms by the obtained CF-DF-value
  Compute Avg-threshold
  Select from Lterms the candidates above Avg-threshold
End

3.3.1 Candidate Concepts Selection using Different Statistical Measures
Although our linguistic filter returns noun phrases, it may include phrases that are not domain-
specific (e.g. ‘امتي’ in Table 1). To refine the results, weighting models, i.e. statistical measures
that determine the importance of terms, are employed, placing the highest-weighted terms as the
most important concepts. We have implemented different algorithms to assign weights to every
candidate term.

3.3.2 Weight Calculation of Concept Candidates
After multi-word term extraction using linguistic filters, the extracted phrases may not be cohesive
enough to be treated as terms. To tackle this problem, [6,22] use statistical measures such as the
mutual information measure and the likelihood ratio test to
score the extracted units. However, term frequency has been shown to produce better
performance than other measures for multi-word term extraction [10].
Our goal is to extract domain-specific terms. To measure the importance of concept candidates in
the domain corpus, different algorithms exist based on the following idea: terms that occur most
frequently are considered the relevant terms of the domain.
Text2Onto [8] extracts concepts using different weighting schemes (entropy, Relative Term
Frequency (RTF), TF-IDF) to measure their domain relevance. In the OntoLearn system [13], two
statistical measures, Domain Relevance (DR) and Domain Consensus (DC), are used together to
extract domain concepts. However, these measures do not consider rare concepts. Also, the
domain relevance measure depends on the term's frequency in the target domain corpus and in a
contrastive corpus.
Jiang and Tan [12] extract concepts by first extracting noun phrases from the corpus using a
linguistic filter. Then, they compute the frequency of these phrases in the target corpus and in a
contrastive corpus to assign higher weights to more domain-specific terms.
The problem with using a contrastive corpus is that, if there is a large difference between the
sizes of the corpus and the contrastive corpora, it leads to high skewness in the frequency of
noun phrases across domains. To tackle this problem, our algorithm for measuring domain
relevance builds on the measures of [12,13], but we do not use a contrastive corpus to measure
the importance of concepts to the target domain.
The assignment of weights is made using one of the following algorithms:
Concept Frequency-Document Frequency (CF-DF)
The baseline TF-IDF measures a term's relevance to a given document in the document
collection. In this paper, we propose a new algorithm for extracting conceptual terms from
domain texts. The proposed method is based on the assumption that the document frequency of
a concept in the document collection is a good measure for estimating its relevance to a specific
domain. The idea is that if a concept occurs in multiple documents, it is considered more relevant
than those occurring in only a single document. Each concept obtains a value equal to its
frequency in the document multiplied by the frequency of the same concept in the document
collection.
Let C be the set of candidate concepts extracted using the linguistic filter. The weight of a
candidate c ∈ C is computed as:

CF-DF(c, d) = CF(c, d) × DF(c)    (1)

where the Concept Frequency CF(c, d) is the number of occurrences of concept c in document d,
normalized to prevent bias towards longer documents by dividing the count by the maximum
concept frequency in document d over all candidates c ∈ C, and the Document Frequency DF(c)
is the number of documents in the corpus that contain concept c.
From equation 1, we can see that the higher the frequency of a concept, the more likely the
concept is to be important.
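Under these definitions, CF-DF can be sketched as follows (an illustrative assumption: each document is represented as the list of candidate occurrences found in it):

```python
def cf_df(candidate, doc, corpus):
    """CF-DF(c, d): normalized concept frequency in doc d times document frequency.
    doc: list of candidate occurrences in one document; corpus: list of such docs."""
    # CF normalized by the maximum concept frequency in the document
    cf = doc.count(candidate) / max(doc.count(c) for c in set(doc))
    # DF: number of documents containing the candidate
    df = sum(1 for d in corpus if candidate in d)
    return cf * df
```

For example, a candidate that appears in several documents outranks one confined to a single document even when their in-document frequencies are comparable.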
Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency TF(t, d) is the number of times term t occurs in document d. It measures the
popularity of the term in the document and is normalized by the maximum term frequency in the
document, as shown in equation 2. TF is usually normalized to prevent a bias towards longer
documents, which may have a higher term count regardless of the actual importance of the term
in the document:

TF(t, d) = f(t, d) / max f(t′, d)    (2)

where f(t, d) is the frequency of occurrence of term t in the target document d, and max f(t′, d) is
the maximum frequency in document d over all terms t′.
Document Frequency (DF) is the number of documents in a corpus that contain the term.
Inverse Document Frequency (IDF) measures the importance of the term to the corpus in
general. It estimates the rarity of a term in the whole document collection and attempts to
automatically discount terms that are not important because they are common across the whole
corpus; if a term occurs in all the documents of the collection, its IDF is zero.
It is computed by dividing the number of documents in the corpus by the number of documents
that contain the term (DF) and then taking the logarithm of that quotient:

IDF(t) = log(N / DF(t))    (3)

where IDF(t) is the inverse document frequency of term t, N is the total number of documents,
and DF(t) is the number of documents containing t.
If term t does not occur in the corpus, the denominator can lead to division by zero, so it is
common to add 1, as shown in equation 4:

IDF(t) = log(N / (1 + DF(t)))    (4)

For each candidate concept, we calculate TF-IDF by combining equations 2 and 4 to form
equation 5:

TF-IDF(t, d) = TF(t, d) × IDF(t)    (5)

Finally, normalization is done by dividing the TF-IDF value of each concept by the square root of
the sum of squares of the TF-IDF values of all concepts.
Terms with high TF-IDF values are important terms, since they have a high TF in the given
document and low occurrence in the remaining documents of the corpus, which filters out
common terms. TF-IDF tends to extract more single-word terms than multi-word term concepts,
because multi-word terms are less likely to appear than single-word terms, whereas in a domain-
specific corpus the domain concepts are multi-word terms and the probability of concepts
occurring in many documents is high.
Also, TF-IDF assigns higher weights to rarer candidate concepts; if an important domain
concept appears in most of the domain documents, it will be discarded, since its IDF value tends to be
zero. Thus, TF-IDF may perform poorly in some contexts. However, in our work we observed
that the important concepts in the domain do not appear frequently in most of the domain
documents.
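The TF-IDF of equations 2, 4, and 5 can be sketched as follows (same illustrative document representation as before):

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF with max-normalized TF (eq. 2) and +1 smoothed IDF (eq. 4)."""
    tf = doc.count(term) / max(doc.count(t) for t in set(doc))
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + df))
    return tf * idf
```

Note that with the +1 smoothing, a term occurring in every document receives a slightly negative weight, reflecting the discarding behavior discussed above.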
Concept Frequency-Inverse Document Frequency (CF-IDF)
CF-IDF is a modified version of TF-IDF that handles multi-word terms [25]. For each candidate
concept, the weight is calculated as follows:

If the concept length = 1:
CF-IDF(c, d) = CF(c, d) × IDF(c)    (6)
Else:
CF-IDF(c, d) = CF(c, d) × log(D / 1)    (7)

where CF(c, d) is the number of times the concept occurs in document d, and D is the total
number of documents in the corpus.
For a single-word concept, the weight is computed using equation 6, where IDF(c) is computed
from equation 4. For a multi-word concept, the weight is computed using equation 7, which sets
DF to 1. This is because multi-word terms do not occur as frequently within a collection of
documents as single-word terms.
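Equations 6 and 7 can be combined into one function (a sketch; the single/multi-word distinction is made by splitting on whitespace):

```python
import math

def cf_idf(concept, doc, corpus):
    """CF-IDF: equation 6 for single-word concepts, equation 7 (DF fixed at 1)
    for multi-word concepts."""
    cf = doc.count(concept)
    n_docs = len(corpus)
    if len(concept.split()) == 1:                 # single-word: equation 6
        df = sum(1 for d in corpus if concept in d)
        return cf * math.log(n_docs / (1 + df))
    return cf * math.log(n_docs / 1)              # multi-word: equation 7
```

This keeps multi-word concepts from being penalized for their naturally low document frequency.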
Relative Concept Frequency (RCF)
RCF is the relative concept frequency, calculated by dividing the absolute concept frequency of
concept c in document d (the number of times concept c occurs in document d) by the maximum
absolute concept frequency in the document (the number of times the most frequent concept
occurs in document d):

RCF(c, d) = CF(c, d) / max CF(c′, d)    (8)
Average Concept Frequency In Corpus (Avg-CF)
Avg-CF is calculated by dividing the total frequency of a concept in the corpus by its document
frequency [21]. We aggregate the concept's frequencies across all documents of the corpus D.
Formally, given a candidate c ∈ C, its weight with respect to D is calculated as follows:

Avg-CF(c) = ( Σ CF(c, di) ) / DF(c)    (9)

where CF(c, di) is the frequency of concept c in document di, for all di ∈ D.
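RCF and Avg-CF can be sketched as follows (same illustrative document representation):

```python
def rcf(concept, doc):
    """RCF: concept frequency in d divided by the maximum concept frequency in d."""
    return doc.count(concept) / max(doc.count(c) for c in set(doc))

def avg_cf(concept, corpus):
    """Avg-CF: total corpus frequency of the concept divided by its document frequency."""
    total = sum(d.count(concept) for d in corpus)
    df = sum(1 for d in corpus if concept in d)
    return total / df if df else 0.0
```

Avg-CF rewards concepts that occur repeatedly within the documents they appear in, independently of corpus size.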
The ranked lists of extracted candidate concepts, shown in Table 4, illustrate how our method
selects the most relevant concepts for the domain.
[Five ranked columns, one per measure (CF-DF, TF-IDF, CF-IDF, RCF, Avg-CF), each listing the
top 15 Arabic concept candidates with their weights.]
TABLE 4: The top 15 ranked concepts extracted by the different algorithms for the Medicine domain from
the Hadith corpus.
3.4 Candidate Concepts Selection using Domain Knowledge
After extracting terms with the different algorithms, several concepts that are important to the
domain still receive significantly low values. Domain concept extraction can benefit from domain
knowledge, because background concepts with low frequency can then be found: even if they
occur only once or twice in the given corpus, they will be extracted if they contain domain
knowledge units. Our method integrates domain knowledge to allow extracting the concepts that
are semantically relevant to the target domain.
As seen in Table 4, العود الهندي, القسط البحري, and الحبه السوداء are important concepts for
the domain, but they have scores less than the average threshold for the RCF algorithm, which is
0.4. After domain knowledge integration, a high score is assigned to these concepts and they are
extracted as important concepts for the domain, as shown in Table 5.
3.4.1 New Weight Calculation Based On The Domain Knowledge
After assigning weights to the concept candidates in C, we use the domain list to extract the
relevant terms. For each extracted candidate concept, we identify the individual words of the
phrase. If all of its words, or at least one of its words, appear in the domain list, then the term is
assigned a high score and defined as a domain concept. When the heuristic {all of its words are
in the domain list} is applied, the precision of the method increases; when the heuristic {at least
one of its words is in the domain list} is applied, the recall increases.
Given a candidate c ∈ C, its domain relevance measure is defined as follows:

New-Weight(c) = statistical-value(c) + domain-value(c)

where statistical-value(c) is the weight assigned to the concept using the different statistical
measures, and domain-value(c) is the weight assigned to the concept using the domain list.
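The New-Weight combination and the two matching heuristics can be sketched as follows; the 1.0 domain score is an illustrative assumption (the paper does not fix the exact value assigned from the domain list):

```python
def domain_value(candidate, domain_list, all_words=True):
    """Domain score: high if all (precision-oriented) or at least one
    (recall-oriented) of the candidate's words appear in the domain list."""
    words = candidate.split()
    hit = (all(w in domain_list for w in words) if all_words
           else any(w in domain_list for w in words))
    return 1.0 if hit else 0.0  # 1.0 is an assumed "high score"

def new_weight(candidate, statistical_value, domain_list):
    """New-Weight(c) = statistical-value(c) + domain-value(c)."""
    return statistical_value + domain_value(candidate, domain_list)
```

Switching all_words to False trades precision for recall, as described above.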
Finally, we generate a ranked list of concept candidates {c1, c2, ..., cn} ⊆ C according to their
weights. Algorithm 3 shows the algorithm for candidate selection using the domain knowledge.
Table 5 shows the top 15 ranked concepts extracted using the statistical algorithms and domain
knowledge.
ALGORITHM 3: The Candidate Selection using The Domain Knowledge.
Algorithm: Extract candidate terms and ranking
Purpose: extract and rank candidate terms using domain knowledge
Input: list of candidate terms Lterms,
       domain-specific list Dlist,
       Avg-threshold = frequency threshold for candidate terms
Output: Lconcepts: list of ranked concepts
Begin
  For each candidate term t in Lterms
    If t exists in Dlist, return D-weight(t)
    For each word in t
      If t[word] exists in Dlist, return D-weight(t)
    Weight(t) = S-weight(t) + D-weight(t)
    Add t to Lconcepts
  End
  Rank Lconcepts by their weight values
  Select from Lconcepts the candidates above Avg-threshold
End
[Five ranked columns, one per measure (CF-DF, TF-IDF, CF-IDF, RCF, Avg-CF), each listing the
top 15 Arabic concepts with their new weights after domain knowledge integration.]
TABLE 5: The top 15 ranked concepts extracted using domain knowledge for the Medicine domain from the
Hadith corpus.
We observed that our method extracts multi-word terms as concept candidates, while the others
tend to extract single-word terms. For example, looking at the 15 concept candidates extracted by
the Avg-CF algorithm in Table 4, most terms are single-word terms. On the other hand, in Table 5
we observe that our method using domain-specific knowledge is able to extract multi-word terms
that are relevant concepts.
3.5 Post Processing
Refinement and Validation
An important task at this stage is removing unrelated concepts from the extracted candidate concepts list. We adopt a pruning strategy: terms that are used frequently in a corpus are likely to denote domain concepts, while less frequent terms are removed from the extracted candidate concepts list.
We set the average frequency of the extracted concepts as a threshold value and prune all concepts that have a frequency lower than this value:

Avg-threshold = (1/N) * sum(freq(c_i)), i = 1, ..., N

where N is the total number of concepts.
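The pruning step can be sketched as follows (the function name is illustrative):

```python
def prune_by_average_frequency(concept_freq):
    """Keep only concepts whose frequency reaches the average
    frequency over all extracted concepts."""
    n = len(concept_freq)                       # N: total number of concepts
    threshold = sum(concept_freq.values()) / n  # average frequency
    return {c: f for c, f in concept_freq.items() if f >= threshold}
```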
When the average threshold is applied, concepts below it are omitted from the candidate list, which reduces recall but increases precision; if all terms need to be kept, the threshold check is simply skipped. Since not all terms generated by our method are domain concepts, domain experts determine their relevance to the domain. The relevant concepts are then used to represent domain concepts.
4. EXPERIMENTS, EVALUATION AND RESULTS
4.1 Experiments
We set up a number of experiments to investigate the effectiveness of our method using the different statistical measures and domain knowledge for Arabic concept extraction. In the first experiment, we implemented and compared the following algorithms: TF-IDF (as a baseline) and multi-word variants of it, CF-IDF, RCF, and Avg-CF, together with our domain relevance measure CF-DF, to rank the candidate concepts. In the second experiment, we integrated domain knowledge to assign weights to the candidate concepts.
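As a point of reference, the TF-IDF baseline can be sketched as below; the CF-IDF, RCF, Avg-CF, and CF-DF variants adapt this scheme to multi-word candidate frequencies and domain relevance, and their exact formulas are not repeated here:

```python
import math

def tf_idf(docs):
    """Standard TF-IDF over a list of {term: frequency} dictionaries:
    score(t, d) = tf(t, d) * log(N / df(t)), where df(t) is the number
    of documents containing t and N is the number of documents."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in doc.items()}
            for doc in docs]
```

Note that a term occurring in every document scores zero, which is exactly why TF-IDF can miss important concepts that appear in most domain documents.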
In this section, the experimental results obtained by our method are presented. As mentioned above, the experiments were conducted on the Medicine Book from the Hadith corpus. To evaluate the performance of our method, recall and precision scores were considered in these experiments. These measures are the most commonly used for assessing term extraction systems, and trace their origins back to the Information Retrieval discipline.
Precision is defined as the number of concepts correctly returned by the extractor, divided by the
total number of concepts returned by the extractor (see equation 11). On the other hand, recall is
the ratio of the number of concepts correctly extracted to the total number of concepts in the
documents (see equation 12).
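These two definitions correspond directly to the following sketch (the equation numbers refer to the paper's equations 11 and 12; the gold set stands for the expert-validated concepts):

```python
def precision(extracted, gold):
    """Equation 11: correctly extracted concepts / all extracted concepts."""
    correct = len(set(extracted) & set(gold))
    return correct / len(extracted)

def recall(extracted, gold):
    """Equation 12: correctly extracted concepts / all concepts in the documents."""
    correct = len(set(extracted) & set(gold))
    return correct / len(gold)
```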
To find the number of concepts correctly identified by our extractor, domain experts are needed to examine the output. For recall, domain experts are also needed to identify the total number of concepts in the documents manually. Because manually identifying concepts from documents is time consuming, in these experiments we followed previous studies [1] and used precision only as the measure to evaluate our method.
4.2 Results
For each of the five experiments, we list the top 15 ranked concepts from the medicine field of the Hadith corpus. The ranked candidate concepts are extracted by the measures CF-DF, TF-IDF, CF-IDF, RCF, and Avg-CF.
To evaluate our method, we tested the precision of the five measures and of the algorithms after domain knowledge integration. The results are shown in Tables 6 and 7 respectively.
Since evaluating entire candidate lists is tedious, past studies have focused on the top-n terms [1]. Previous research has also shown that evaluation over a sample of size n is comparable to evaluation over the entire population set [15]. We therefore calculated the precision for each of the above experiments, with the set of concepts to be evaluated fixed at the top 100, the top 200, and all concepts above the average threshold.
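The top-n evaluation setting can be expressed as precision-at-n over the ranked candidate list (a sketch; the gold set is again the expert-validated concepts):

```python
def precision_at_n(ranked, gold, n):
    """Precision over the top-n ranked candidates only, matching the
    top-100 / top-200 evaluation settings."""
    top = ranked[:n]
    hits = sum(1 for concept in top if concept in gold)
    return hits / len(top)
```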
The precision values of the candidates extracted at the different values of N were computed using equation 11. The comparison among the performances of the algorithms is shown in Table 6, and their performances after domain knowledge integration are shown in Table 7.
Total number of concepts evaluated = N

N                  CF-DF   TF-IDF   CF-IDF   RCF    Avg-CF
> avg. threshold   0.92    0.65     0.65     0.67   0.66
Top 100            0.88    0.81     0.81     0.72   0.69
Top 200            0.72    0.66     0.65     0.68   0.60
TABLE 6: The precision for each algorithm evaluated for Medicine domain from Hadith corpus.
The reported results are as follows: for CF-DF, TF-IDF, CF-IDF, RCF, and Avg-CF the precision was 0.92, 0.65, 0.65, 0.67, and 0.66 respectively at N above the threshold value. As Table 6 shows, the method using CF-DF performs best for all values of N. The TF-IDF and CF-IDF measures produced the same precision at all N; this is because the important concepts in the domain do not appear frequently in most of the domain documents. The remaining measures, RCF and Avg-CF, show fluctuating performance at different N.
Table 7 shows the evaluation results for all algorithms after domain knowledge integration. We can observe an increase in precision for all algorithms at the different values of N: for CF-DF, TF-IDF, CF-IDF, RCF, and Avg-CF the precision was 0.94, 0.88, 0.90, 0.74, and 0.97 respectively at N above the threshold value. From this experiment we conclude that the CF-DF algorithm produced the best and most stable results both before and after domain knowledge integration: important concepts concentrate at the top of the list, while unimportant concepts tend to appear at the bottom. The Avg-CF algorithm gives the best concept coverage.
Most of the studies described in the previous section for multi-word term extraction from Arabic text deal with bi-grams only. Moreover, they rely on LRR or a combination of LRR and C-value, and they ignore contextual information in the ranking step. Compared to these methods, our method considers contextual information through domain knowledge, which raises precision above the 85% reported for [5], 80% for [26], and 82% for [27].
Total number of concepts evaluated = N

N                  CF-DF   TF-IDF   CF-IDF   RCF    Avg-CF
> avg. threshold   0.94    0.88     0.90     0.74   0.97
Top 100            0.90    0.88     0.85     0.82   0.75
Top 200            0.77    0.74     0.73     0.75   0.65
TABLE 7: The precision for each algorithm evaluated after using domain knowledge.
Figure 2 illustrates the comparison among the five algorithms for concept extraction, and Figure 3 illustrates the comparison among them after domain knowledge integration.
FIGURE 3: The performance of the algorithms after using domain knowledge at different N.
As can be seen from the experimental results, CF-DF outperforms all other methods in terms of precision.
To see how our method performs in different domains, we compared its performance on the medicine domain and the food domain. Table 8 shows its precision in both domains when the number of extracted concepts is 100. The results show that our concept extraction method performs better in the medicine domain than in the food domain. Some errors arise from errors in the preceding POS tagging and noun phrase extraction: the medicine field contains more noun phrases, whereas the food domain contains more verbs, which the tagger mis-tags as nouns. Accordingly, the method assigns a high weight to such terms and treats them as domain concepts, for example ياكلها الدباء and االقط واكل (weight values garbled in extraction).
FIGURE 2: The performance of algorithms for concept extraction at different N.
The results (Table 8) show that the precision lies between 55% and 65%, meaning that more than half of the concepts are identified. Most of the concepts are single-word terms and are identified correctly by our method, for example الدباء, التمر, الثريد, and الثوم. Multi-word concepts include مادوم بر خبز, والفضه الذهب انيه, and وقديد دباء (weight values garbled in extraction).
TABLE 8: The precision for each algorithm evaluated for medicine domain and Food domain from hadith
corpus.
FIGURE 4: The performance of algorithms for concepts extraction in Medicine domain and Food domain.
5. CONCLUSION
We have presented a new method for concept extraction from Arabic texts. It uses a combination of linguistic, statistical, and domain knowledge to extract domain-relevant concepts. We described which NLP techniques are used to extract concept candidates, and how statistical information and domain knowledge can be combined to extract domain-relevant concepts.
Our method is unsupervised, so no training data is needed. It also does not require a general corpus to measure the relevance of concepts to the domain, which avoids the problem of skewness in term frequency information. Further, we experimented with a small set of documents and obtained promising results, unlike other methods that work effectively only on corpora with a large number of documents.
Our method can also be used effectively for concept extraction in various domains. Although it works by combining statistical and domain knowledge, our evaluation showed that the statistical information can be used alone if no domain knowledge is available.
Various experiments were performed to assess the effectiveness of each algorithm used. We reported the experimental results on concept extraction from Arabic texts in the previous section. They show that the concept extractor module is effective in extracting concepts from Arabic documents, and that our method with domain knowledge integration performed better than the algorithms used alone.
In conclusion, our initial experiments support our assumption about the usefulness of our method for concept extraction. As shown through the evaluation, our method has a strong ability to extract domain-relevant concepts using a combination of statistical measures and domain knowledge. This overcomes the problems found in methods based only on frequency or TF-IDF to measure the importance of the candidates.

Total number of concepts evaluated = 100

Domain     CF-DF   TF-IDF   CF-IDF   RCF    Avg-CF
Medicine   0.90    0.88     0.85     0.82   0.75
Food       0.60    0.55     0.57     0.60   0.63

From our initial results, we have found that using domain knowledge to determine domain-relevant concepts increases the precision of concept extraction.
Our contributions to the Arabic concept extraction field are as follows. We evaluated and compared different statistical measures and proposed a new one for candidate weighting. Our method can extract rare concepts, even those appearing with low frequency, and it excludes irrelevant concepts even if they occur frequently in the corpus.
To see how our method performs in different domains, we compared its performance on other domains. In the food domain the precision is worse than in the medicine domain, because of errors in the preceding POS tagging and noun phrase extraction. Even so, more than half of the concepts are correctly identified.
The results show the high effectiveness of the proposed approach in extracting concepts for Arabic ontology construction. The output list from this module constitutes the fundamental layer of ontologies. In the future, we will continue to evaluate and compare results on other domains and will use other preprocessing tools to enhance precision. We will also develop a method for extracting semantic relations between the extracted concepts.
6. REFERENCES
[1] A. Al-Arfaj and A. Al-Salman. “Towards Concept Extraction for Ontologies on Arabic Language,” in Proceedings of the 3rd International Conference on Islamic Applications in Computer Science and Technology, 1-3 Oct. 2015, Turkey.
[2] A. Al-Arfaj and A. Al-Salman. “Arabic NLP Tools for Ontology Construction from Arabic Text: An Overview,” in Proceedings of the International Conference on Electrical and Information Technologies (ICEIT'15), March 25-27, 2015, Marrakech, Morocco, pp. 246-251.
[3] A. Al-Arfaj and A. Al-Salman. “Towards Ontology Construction from Arabic Texts - A Proposed Framework,” in Proceedings of the 14th IEEE International Conference on Computer and Information Technology (CIT 2014), 2014, pp. 737-742.
[4] A. Zouaq, D. Gasevic, M. Hatala. “Towards open ontology learning and filtering”. Information
Systems, vol. 36, no.7, pp. 1064–1081, 2011
[5] S. Boulaknadel, B. Daille and D. Aboutajdine. "A multi-word term extraction program for
Arabic language," in Proceeding of the 6th International Conference on Language
Resources and Evaluation, May 28-30, Marrakech Morocco., 2008, pp.1485-1488.
[6] I.Bounhas and Y.Slimani. "A hybrid approach for Arabic multi-word term extraction," In
Proceedings of the IEEE International Conference on Natural Language Processing and
Knowledge Engineering (IEEE NLP-KE), Dalian, China, August 21-23 2009, pp. 429-436.
[7] P. Buitelaar, P. Cimiano and B. Magnini. “Ontology Learning from Text: An Overview”. In:
Ontology learning from text: methods, evaluation and applications. Breuker J, Dieng R,
Guarino N, Mantaras RLd, Mizoguchi R, Musen M, editors. Amsterdam, Berlin, Oxford,
Tokyo, Washington DC: IOS Press 2005.
[8] P. Cimiano and J. Volker. "Text2Onto – A Framework for Ontology Learning and Data
Driven Change Discovery," in Proceeding of the 10th International Conference on
Applications of Natural language to Information system (NLDB), Spain, 2005, pp. 227–238.
[9] P. Cimiano., A. Madche., S. Staab and J. Volker. “Ontology Learning.” IN Staab,S and
Studer, R (eds.), Handbook on Ontologies, International Handbooks on Information
Systems, DOI 10.1007/978-3-540-92673-3, Springer-Verlag Berlin Heidelberg 2009,
pp.245-267.
[10] B. Daille. “Study and Implementation of Combined Techniques for Automatic Extraction of
Terminology.”. In Resnik and Judith (Eds): The Balancing Act: Combining Symbolic and
Statistical Approaches to Language, Cambridge, MA, USA: Mit Press, 1996, pp.49-66.
[11] S. El-Beltagy and A. Rafea. "KP-Miner: A Keyphrase Extraction System for English and
Arabic Documents". Information systems, 34(1), pp. 132-144, 2009.
[12] X. Jiang and A. Tan. "CRCTOL: A Semantic-Based Domain Ontology Learning System."
Journal of the American Society for Information Science and Technology (JASIST), 61(1),
pp.150-168, 2010.
[13] R. Navigli and P. Velardi. “Learning Domain Ontologies from Document Warehouses and Dedicated Websites,” Computational Linguistics, 30(2), pp. 151-179, 2004.
[14] N. Noy and D. McGuinness. "Ontology Development 101: A Guide to Creating Your First Ontology." Report SMI-2001-0880, Stanford Medical Informatics, Stanford University, pp. 1-25, 2001.
[15] M. Pazienza., M. Pennacchiotti and F. Zanzotto. “Terminology Extraction: An Analysis of
Linguistic and Statistical Approaches,” Knowledge Mining, ser.: Studies in Fuzziness and
Soft Computing, Sirmakessis, S., Ed., Berlin/Heidelberg: Springer, vol. 185, 2005, pp. 255–
279.
[16] J. Qiu., Y. Chai., Y. Liu., Z. Gu., S. Li and Z. Tian. “Automatic nontaxonomic relation
extraction from big data in smart city,” IEEE Access, vol. 0, pp. 0.58.–74864, 2018
[17] J. Qiu., Y. Chai., Z. Tian., X. Du and M. Guizani. “Automatic Concept Extraction Based on
Semantic Graphs From Big Data in Smart City,” IEEE TRANSACTIONS ON
COMPUTATIONAL SOCIAL SYSTEMS, pp.1-9,2019.
[18] A.Saif and M. AbAziz, "An Automatic Collocation Extraction from Arabic Corpus." Journal of
Computer Science ,7 (1), pp. 6-11, 2011.
[19] K. Toutanova,, D. Klein., C. Manning and Y. Singer. “Feature-rich part-of-speech tagging
with cyclic dependency network,” In Proceedings of Human Language Technology –North
American Chapter( HLT-NAACL), 2003, pp. 252–259.
[20] S. Tulkens., S. Suster and W. Dealemans. “Unsupervised concept extraction from clinical
text through semantic composition”. Journal of Biomedical Informatics. Vol.01,pp.110-
120,2019
[21] Z. Zhang., B. Christopher and F. Ciravegna. “A Comparative Evaluation of Term Recognition Algorithms,” in Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), May 28-31, 2008, Marrakech, Morocco, pp. 2108-2113.
[22] S. Zaidi., M. Laskri and A. Abdelali. "Arabic collocations extraction using Gate," in
Proceeding of International Conference on Machine and Web Intelligence (ICMWI), 2010,
pp. 473 - 475.
[23] M. Rizoiu and J. Velcin. “Topic Extraction for Ontology Learning”. Ontology Learning and
Knowledge Discovery Using the Web: Challenges and Recent Advances, Wong W., Liu W.
and Bennamoun M. eds. (Ed.).2011, pp. 38-61
[24] Z. Harris. “Distributional Structure,” in Papers in Structural and Transformational Linguistics, pp. 775-794, 1970.
[25] K. Sarkar. “A Hybrid Approach to Extract Keyphrases from Medical Documents.”
International Journal of Computer Application, 36(18), pp. 14-19, 2013.
[26] A. Mashaan Abed., S. Tiun and M. AlBared. “Arabic Term Extraction using Combined
Approach on Islamic document”. Journal of Theoretical & Applied Information Technology,
58 (3), pp.601-608,2013.
[27] A. El-Mahdaouy, S. Alaoui Ouatik and E. Gaussier. “A study of association measures and their combination for Arabic MWT extraction,” in Proceedings of the 11th International Conference on Terminology and Artificial Intelligence, 2013, pp. 45-52.