The massive growth of modern information retrieval systems (IRS), especially for natural languages, makes search increasingly difficult, and search in Arabic, as a natural language, is not yet good enough. This
paper builds a similarity thesaurus for the Arabic language using two mechanisms, a full-word
mechanism and a stemmed mechanism, and then compares them.
The comparison made by this study shows that the similarity thesaurus built with the stemmed mechanism obtains
better results than the traditional approach under the same mechanism, and that the similarity thesaurus improves
recall and precision over a traditional information retrieval system at all recall and precision levels.
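The recall and precision measures used in the comparison above can be sketched as follows; the document identifiers are hypothetical illustrations, not data from the study:

```python
# Hedged sketch: computing precision and recall for one query,
# the two measures used to compare the full-word and stemmed thesauri.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 documents retrieved, 5 relevant in the collection, 3 overlap.
p, r = precision_recall({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d4", "d5"})
print(p, r)  # 0.75 0.6
```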
Automatically finding domain-specific key terms from a given set of research papers is a challenging task, and matching research papers to a particular area of research is a concern for many people, including students, professors, and researchers. A domain classification of papers facilitates that search process: given a list of domains in a research field, we try to find out to which domain(s) a given paper is most related. Besides, reading and processing a whole paper takes a long time, and using domain knowledge requires much human effort, e.g., manually labeling a large corpus. In particular, we use the abstract and keywords of a research paper as the seed terms to identify similar terms from a domain corpus, which are then filtered by checking their appearance in the research papers. Experiments show that the TF-IDF measure and the classification step make this method assign papers to domains more precisely. The results show that our approach can extract the terms effectively while being domain independent.
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...csandit
The goal of Document Clustering algorithms is to create clusters that are internally coherent but
clearly different from each other. The useful expressions in documents are often accompanied
by a large amount of noise caused by unnecessary words, so it is indispensable
to eliminate that noise and keep just the useful information.
Keyphrase extraction systems for Arabic are a new phenomenon. A number of Text Mining
applications can use them to improve their results. Keyphrases are defined as phrases that
capture the main topics discussed in a document; they offer a brief and precise summary of
document content. Therefore, they can be a good way to remove the existing noise from
documents.
In this paper, we propose a new method to solve the problem cited above, especially for documents
in Arabic, one of the most complex languages, using a new keyphrase
extraction algorithm based on the Suffix Tree data structure (KpST). To evaluate our approach,
we conduct an experimental study on Arabic Document Clustering using the most popular
family of hierarchical algorithms: the Agglomerative Hierarchical algorithm with seven linkage
techniques and a variety of distance functions and similarity measures. The obtained results
show that our approach to extracting keyphrases improves the clustering results.
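The agglomerative procedure the abstract evaluates can be sketched as below. The paper tests seven linkage techniques; only single linkage is shown here, and the 2-D points are toy stand-ins for keyphrase-based document vectors:

```python
# Hedged sketch of agglomerative hierarchical clustering with single linkage.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(c1, c2, points):
    # cluster distance = smallest pairwise point distance (single linkage)
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, k):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]], points),
        )
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters

docs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(agglomerate(docs, 2))  # two tight groups: [[0, 1], [2, 3]]
```

Swapping `min` for `max` in `single_linkage` gives complete linkage; the other linkage techniques differ only in this cluster-distance function.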
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A new keyphrases extraction method based on suffix tree data structure for ar...ijma
Document Clustering is a branch of the larger area of scientific study known as data mining; it is an
unsupervised classification used to find structure in a collection of unlabeled data. The useful
information in documents can be accompanied by a large amount of noise words when a Full Text
Representation is used, which negatively affects the result of the clustering process. It is therefore
very important to eliminate the noise words and keep just the useful information in order to enhance the quality of
the clustering results. This problem occurs, to different degrees, in any language, such as English,
other European languages, Hindi, Chinese, or Arabic. To overcome this problem, in this paper we propose a
new and efficient keyphrase extraction method based on the Suffix Tree data structure (KpST); the
extracted keyphrases are then used in the clustering process instead of the Full Text Representation. The
proposed keyphrase extraction method is language independent and may therefore be applied to any
language. In this investigation, we are interested in the Arabic language, which is one of the most
complex languages. To evaluate our method, we conduct an experimental study on Arabic documents
using the most popular family of hierarchical clustering algorithms: the Agglomerative Hierarchical
algorithm with seven linkage techniques and a variety of distance functions and similarity measures to
perform the Arabic Document Clustering task. The obtained results show that our method for extracting
keyphrases increases the quality of the clustering results. We also study the effect of stemming the
test dataset before clustering it with the same document clustering techniques and
similarity/distance measures.
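The stemming experiment mentioned at the end would rely on a stemmer such as the light-stemming sketch below. The affix lists are illustrative assumptions, far from a complete Arabic stemmer:

```python
# Hedged sketch of Arabic light stemming: strip a few common prefixes
# and suffixes, keeping at least a 3-letter core. Affix lists are toy examples.
PREFIXES = ["ال", "وال", "بال"]
SUFFIXES = ["ات", "ون", "ين", "ة"]

def light_stem(word):
    # longest matching prefix first, then longest matching suffix
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("المعلومات"))  # strips "ال" and "ات", leaving "معلوم"
```

Clustering stemmed tokens instead of surface forms merges morphological variants of the same word into one feature, which is the effect the authors propose to measure.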
Dictionary based concept mining an application for turkishcsandit
In this study, a dictionary-based method is used to extract expressive concepts from documents.
So far, there have been many studies on concept mining in English, but this area of
study for Turkish, an agglutinative language, is still immature. We used a dictionary instead of
WordNet, the lexical database that groups words into synsets and is widely used for concept
extraction. Dictionaries are rarely used in the domain of concept mining, but given that
dictionary entries contain synonyms, hypernyms, hyponyms, and other relationships in
their definition texts, the success rate in determining concepts has been high. This concept
extraction method is applied to documents collected from different corpora.
A template based algorithm for automatic summarization and dialogue managemen...eSAT Journals
Abstract: This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of the research is to arrive at a methodology that helps extract important events such as dates, places, and subjects of interest. It is also convenient if the methodology presents users with a shorter version of the text that contains all the non-trivial information. We also discuss our implementation of algorithms that perform exactly this task. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
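The cosine similarity listed among the keywords above can be sketched on bag-of-words vectors; a sentence that scores high against the whole text is a candidate for the extractive summary. The sentences are illustrative, not from the paper:

```python
# Hedged sketch: cosine similarity between bag-of-words Counters.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc = Counter("the meeting is set for june in paris".split())
s1 = Counter("meeting set for june".split())
s2 = Counter("unrelated words entirely".split())
print(cosine(doc, s1) > cosine(doc, s2))  # True: s1 overlaps the document
```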
Arabic text categorization algorithm using vector evaluation methodijcsit
Text categorization is the process of grouping documents into categories based on their contents. This
process makes information retrieval easier, and it has become more important due to the huge
amount of textual information available online. The main problem in text categorization is how to improve
classification accuracy. Although Arabic text categorization is a new and promising field, there is little
research in it. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a categorized corpus of Arabic documents; the weights of the
tested document's words are then calculated to determine the document's keywords, which are compared with
the keywords of the corpus categories to determine the tested document's best category.
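The keyword-comparison step described above can be sketched as follows. Weighting here is plain term frequency and the category keyword lists are invented; the paper's exact weighting scheme and corpus may differ:

```python
# Hedged sketch: weight the test document's words, keep the top-weighted
# ones as keywords, and pick the category whose keyword list they overlap most.
from collections import Counter

def top_keywords(text, n=3):
    return {w for w, _ in Counter(text.split()).most_common(n)}

def best_category(doc, category_keywords):
    kws = top_keywords(doc)
    return max(category_keywords, key=lambda c: len(kws & category_keywords[c]))

cats = {"sports": {"match", "team", "goal"}, "economy": {"market", "bank", "price"}}
doc = "the team scored a goal and the team won the match"
print(best_category(doc, cats))  # sports
```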
A survey of named entity recognition in assamese and other indian languagesijnlc
Named Entity Recognition is important in major Natural Language Processing
tasks such as information extraction, question answering, machine translation, and document summarization,
so in this paper we put forward a survey of Named Entity Recognition in Indian languages, with particular
reference to Assamese. Various rule-based and machine-learning approaches are available for
Named Entity Recognition. At the start of the paper we give an overview of the available approaches for
Named Entity Recognition, and then we discuss the related research in this field. Assamese, like other
Indian languages, is agglutinative and suffers from a lack of appropriate resources, as Named Entity
Recognition requires large data sets, gazetteer lists, dictionaries, etc., and some useful features, such as the
capitalization found in English, do not exist in Assamese. We also describe some of
the issues faced when doing Named Entity Recognition in Assamese.
Text Mining is a technique that helps users find useful information in a large collection of text documents on the web or in a database. The most popular text mining and classification methods have adopted term-based approaches, alongside pattern-based methods for describing user preferences. This review paper analyses how text mining works at three levels: sentence level, document level, and feature level. We review the related work done previously and demonstrate the problems that arise when text mining is done at the feature level. The paper presents a technique for text mining of compound sentences.
Explore detailed Topic Modeling via LDA (Latent Dirichlet Allocation) and its steps.
This paper proposes a natural-language Discourse Analysis method for extracting
information from news articles in different domains. The discourse analysis uses Rhetorical Structure
Theory (RST) to find the coherent groups of text most prominent for extracting information
from the text. RST uses the Nucleus-Satellite concept to find the most prominent text in the text
document. After the discourse analysis, text analysis is performed to extract domain-related objects
and relate those objects. A knowledge-based system is used to extract the information; it
consists of a domain dictionary, which holds a bag of words for the domain. The system is
evaluated against a gold-standard analysis and human judgment of the extracted information.
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...cscpconf
Online text documents rapidly increase in number with the growth of the World Wide Web. To manage such a huge amount of text, several text mining applications have come into existence. Applications
such as search engines, text categorization, summarization, and topic detection are based on feature extraction. It is an extremely time-consuming and difficult task to extract keywords or features manually, so an automated process that extracts keywords or features needs to be
established. This paper proposes a new domain keyword extraction technique that adds a new weighting method to the conventional TF-IDF. Term frequency-inverse
document frequency is widely used to express document feature weights, but it cannot reflect the distribution of terms across documents, and therefore cannot reflect a term's degree of significance or the differences between categories. This paper proposes a new weighting method in which a new weight, expressing the differences between domains, is added on top of the original TF-IDF. The extracted features represent the content of the text better and distinguish between domains more effectively.
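The idea of adding a domain weight on top of TF-IDF can be sketched as below. The abstract does not give the exact formula, so a simple in-domain/out-of-domain frequency ratio is assumed here, and the corpora are toy examples:

```python
# Hedged sketch: classic TF-IDF times an assumed domain factor that boosts
# terms frequent in the target domain but rare in the other domains.
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

def domain_score(term, doc, corpus, domain_docs, other_docs):
    in_dom = sum(d.count(term) for d in domain_docs) + 1   # add-one smoothing
    out_dom = sum(d.count(term) for d in other_docs) + 1
    return tf_idf(term, doc, corpus) * (in_dom / out_dom)

medical = [["patient", "dose", "trial"], ["dose", "patient"]]
general = [["market", "trial", "price"]]
corpus = medical + general
doc = medical[0]
# "dose" and "trial" have equal TF-IDF in doc, but "dose" is domain-specific:
print(domain_score("dose", doc, corpus, medical, general) >
      domain_score("trial", doc, corpus, medical, general))  # True
```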
A detailed document with the definition of text mining, along with its challenges, modeling techniques, word clouds, and much more.
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
A large amount of digital text is generated every day. Effectively searching, managing, and
exploring this text data has become a major task. In this paper, we first present an introduction to text
mining and the LDA topic model. We then explain in depth how to apply the LDA topic model to a text corpus through
experiments on Simple Wikipedia documents. The experiments include all the necessary steps: data
retrieval, pre-processing, fitting the model, and an application in a document exploring system. The results of
the experiments show the LDA topic model working effectively for clustering documents and finding
similar documents. Furthermore, the document exploring system could be a useful research tool for
students and researchers.
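The model-fitting step can be illustrated with a minimal collapsed Gibbs sampler for LDA. Real experiments would use a library such as gensim; this toy version, with an invented two-document corpus, only shows the mechanics:

```python
# Hedged sketch: collapsed Gibbs sampling for LDA on a tiny toy corpus.
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    nzw = defaultdict(int)  # topic-word counts
    ndz = defaultdict(int)  # document-topic counts
    nz = defaultdict(int)   # topic totals
    z = [[rng.randrange(k) for _ in d] for d in docs]  # random initial topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]; nzw[t, w] += 1; ndz[d, t] += 1; nz[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # remove the word's current assignment
                nzw[t, w] -= 1; ndz[d, t] -= 1; nz[t] -= 1
                # resample its topic from the conditional distribution
                weights = [(nzw[t2, w] + beta) / (nz[t2] + beta * len(vocab))
                           * (ndz[d, t2] + alpha) for t2 in range(k)]
                t = rng.choices(range(k), weights)[0]
                z[d][i] = t
                nzw[t, w] += 1; ndz[d, t] += 1; nz[t] += 1
    # per-document topic distributions (theta)
    return [[(ndz[d, t] + alpha) / (len(doc) + k * alpha) for t in range(k)]
            for d, doc in enumerate(docs)]

docs = [["apple", "fruit", "apple"], ["stock", "market", "stock"]]
theta = lda_gibbs(docs, k=2)
print([round(sum(row), 6) for row in theta])  # each row is a distribution summing to 1
```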
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...cseij
In this paper we combine our previous research in the field of the Semantic Web, especially ontology learning and population, with sentence retrieval. To do this we developed a new approach to sentence retrieval,
modifying our previous TF-ISF method, which uses local context information, to take into account only document-level information. This is quite a new approach to sentence retrieval, presented for the first time
in this paper and compared to existing methods that use information from the whole document collection. Using this approach and the developed methods for document-level sentence retrieval, it is possible to assess the relevance of a sentence using only information from the retrieved sentence's document, and to define a document-level OWL representation for sentence retrieval that can be
automatically populated. In this way the idea of the Semantic Web is supported through automatic and semi-automatic
extraction of additional information from existing web resources. The additional information is
formatted in an OWL document containing document sentence relevance for sentence retrieval.
Presentation of the main IR models
Presentation of our submission to TREC KBA 2014 (Entity oriented information retrieval), in partnership with Kware company (V. Bouvier, M. Benoit)
Answer extraction and passage retrieval forWaheeb Ahmed
Question Answering systems (QASs) perform the task of
retrieving, from a collection of documents, the text portions that
contain the answer to a user's question. These QASs use a
variety of linguistic tools that can deal only with small
fragments of text. Therefore, to retrieve the documents that
contain the answer from a large document collection, QASs
employ Information Retrieval (IR) techniques to reduce the
collection to a manageable amount of
relevant text. In this paper, we propose a passage
retrieval model that performs this task with better performance for
Arabic QASs. We first segment each of the top five
ranked documents returned by the IR module into passages.
Then we compute a similarity score between the user's
question terms and each passage, and the top five passages (those with
the highest similarity scores) are retrieved. Finally,
Answer Extraction techniques are applied to extract the final
answer. Our method achieved an average precision of
87.25%, recall of 86.2%, and F1-measure of 87%.
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE) - ijnlc
Extracting key phrases from documents is a common task in many applications. In general, a noun phrase extractor consists of three modules: tokenization, part-of-speech tagging, and noun phrase identification. These are used as the three main steps in building the new system, ANPE. This paper aims at picking Arabic noun phrases from a corpus of documents; the relevant criteria (recall and precision) are used as evaluation measures. On the one hand, when using NPs rather than single terms, the system yields more relevant documents among those retrieved; on the other hand, it gives low precision because the number of retrieved documents decreases. At the end, the researchers conclude and recommend improvements for more effective and efficient research in the future.
Survey On Building A Database Driven Reverse Dictionary - Editor IJMTER
Reverse dictionaries are widely used as reference works organized by concepts, phrases, or the definitions of words. This paper describes the many challenges inherent in building a reverse lexicon and maps the drawback to the well-known abstract similarity problem. The criterion web search engines are basic versions of the system; they take advantage of huge scale, which permits inferring general interest concerning documents from link information. This paper describes a basic study of a database-driven reverse dictionary using three large-scale datasets, namely person names, general English words and biomedical concepts, and analyzes the difficulties arising in the use of documents produced by a reverse dictionary.
A Novel Approach for Keyword extraction in learning objects using text mining - IJSRD
Keyword extraction and concept finding in learning objects are very important subjects in today's eLearning environment. Keywords are a subset of words that contain useful information about the content of a document, and keyword extraction is the process used to get the important keywords from documents. In the proposed system, a decision tree algorithm is used for the feature selection process with the WordNet dictionary. WordNet is a lexical database of English that is used to find the similarity among candidate words; the words having the highest similarity are taken as keywords.
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION - cscpconf
The internet has caused a humongous growth in the amount of data available to the common
man. Summaries of documents can help find the right information and are particularly effective
when the document base is very large. Keywords are closely associated to a document as they
reflect the document's content and act as indexes for the given document. In this work, we
present a method to produce extractive summaries of documents in the Kannada language. The
algorithm extracts key words from pre-categorized Kannada documents collected from online
resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse
Document Frequency) methods along with TF (Term Frequency) for extracting key words and
later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
Diacritic Oriented Arabic Information Retrieval System - CSCJournals
Arabic language support in search engines and operating systems has improved in recent years. Searching the Internet is reliable and can be compared to the excellent support for several other languages, including English. However, for text with diacritics there are some limitations; for this reason, most information retrieval (IR) systems remove diacritics from text and ignore them because of their complexity. Searching text with diacritics is important for some kinds of documents, such as religious books, some newspapers and children's stories. This research shows the design and development of a system that overcomes this problem by considering diacritics. The proposed system places the design complexity in the retrieval algorithm rather than in the information repository, which is a database in this study. The study also analyses the results and the performance; the results are promising, and the performance analysis shows methods to enhance the design and increase the performance. The proposed system can be integrated into search engines, text editors and any information retrieval system that includes Arabic text, and performance analysis shows that it is reliable. The system is applied to a database of Hadeeth, a religious book that includes the prophet's actions and statements, and it can be applied to any kind of data repository.
Car-Following Parameters by Means of Cellular Automata in the Case of Evacuation - CSCJournals
This study focuses on the car-following model, an important part of micro traffic flow. Unlike Nagel–Schreckenberg's studies, in which the car-following model has no agent or diligent drivers, agent and diligent drivers are proposed in the car-following part of this work, and lane-changing is also included in the model. The impact of agent and diligent drivers under certain circumstances, such as evacuation, is considered. Based on simulation results, the relations between evacuation time and diligent drivers are obtained using different numbers of agent drivers; a comparison between the previous (Nagel–Schreckenberg) model and the proposed model is also made in order to find the evacuation time. In addition, the effectiveness of reducing the evacuation time is presented for various numbers of agent and diligent drivers.
An Evaluation and Overview of Indices Based on Arabic Documents - IJCSEA Journal
The paper aims at giving an overview of inverted files, signature files, suffix arrays and suffix trees based on an Arabic document collection. It also aims at giving the comparison points between all these techniques and the performance of these techniques on each of the comparison points. Any information retrieval system is usually evaluated through its efficiency and effectiveness. Moreover, there are two aspects of efficiency: time and space. The time measure represents the time needed to retrieve a document relevant to a specified query, while space represents the memory needed to create the indices.
In this paper, four indices are built: inverted file, signature file, suffix array and suffix tree. To measure the performance of each one, a retrieval system is built to compare the results of using these indices.
A collection of 242 Arabic abstracts from the proceedings of the Saudi Arabian National Computer Conferences has been used in these systems, and a collection of 60 Arabic queries has been run on these systems. We found that the retrieval result for inverted files is better than the retrieval result for the other indices.
The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns. Text mining is the discovery of valuable, yet hidden, information in text documents. Text classification (also called text categorization) is one of the important research issues in the field of text mining: it is necessary to classify large texts (documents) into specific classes, and text classification assigns a text document to one of a set of predefined classes. This paper covers different text classification techniques and also includes classifier architecture and text classification applications.
Clustering the results of a search helps the user get an overview of the information returned. In this paper, we treat the clustering task as cataloguing the search results; by catalogue we mean a structured label list that can help the user understand the labels and search results. Cluster labelling is crucial because meaningless or confusing labels may mislead users into checking the wrong clusters for the query and losing extra time. Additionally, labels should accurately reflect the contents of the documents within the cluster. To label clusters effectively, a new cluster labelling method is introduced, with emphasis on producing comprehensible and accurate cluster labels in addition to discovering document clusters. We also present a new metric employed to assess the success of cluster labelling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods: Suffix Tree Clustering and Lingo. We perform the experiments using the publicly available datasets Ambient and ODP-239.
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013
DOI: 10.5121/ijcsea.2013.3201
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON
Essam S. Hanandeh,
Department of Computer Information System, Zarqa University, Zarqa, Jordan
Hanandeh@zu.edu.jo
ABSTRACT
The massive growth of modern information retrieval systems (IRS), especially for natural languages, makes search increasingly difficult, and search in Arabic, as a natural language, is not yet good enough. This paper builds a similarity thesaurus based on Arabic using two mechanisms, a full-word mechanism and a stemmed mechanism, and then compares them.
The comparison made by this study shows that the similarity thesaurus using the stemmed mechanism gets better results than the traditional system using the same mechanism, and that the similarity thesaurus improves recall and precision over the traditional information retrieval system at the measured recall and precision levels.
KEYWORDS
Similarity thesaurus, Recall, Precision, information retrieval, Traditional
1. INTRODUCTION
A thesaurus (plural: thesauri) is a valuable tool in IR, in both the indexing and searching processes. It is used as a controlled vocabulary and as a means of expanding or altering queries (query expansion) [10]. Most thesauri that users encounter are manually constructed by domain experts and/or experts at document description. Manual thesaurus construction is a time-consuming and quite expensive process, and the results are bound to be more or less subjective, since the person creating the thesaurus makes choices that affect its structure. There is a need for methods of automatically constructing thesauri, which, in addition to the improvements in time and cost, can result in more objective thesauri that are easier to update. The thesauri developed here have been designed to comply with international and Arabic standards and are capable of performing all the tasks and duties needed in thesaurus building. The system automates most tasks in the building and maintenance of thesauri and ensures the integrity of their structures and their relations with database materials.
2. INFORMATION SYSTEM
Information retrieval exhibits similarity to many other areas of information processing. The most important computer-based information systems today are management information systems (MIS), database management systems (DBMS), decision support systems (DSS), question-answering systems (QA), and information retrieval systems (IRS) [8].
Information retrieval (IR) is best understood if one remembers that the information consists of documents. In that context, information retrieval deals with the representation, storage, and access to documents or representatives of documents [10]. Information retrieval systems have an important role in library and information studies, where they are considered the top of the field: they make use of facts and ideas achieved in all studies, theoretical or practical, such as indexing, classification, subject analysis, summarization, computing, bibliography and others.
It is said that half of science is in its organization, and the main objective of this specialization is to supply sufficient and suitable information at the suitable time to the suitable person or researcher.
3. REPRESENTATION OF DOCUMENTS AND QUERIES:
The following section describes the steps of representing the documents and queries
automatically:
3.1 Filtering:
Filtering the document collection consists mainly of the following processes:
3.1.1 Eliminating Stop words:
A stop word is a word that occurs so frequently in the documents of the collection that it is useless for purposes of retrieval [3]. Elimination of stop words reduces the size of the indexing structure, and thus increases the performance of the system and enables it to retrieve more relevant documents.
Stop words in Arabic include some grammatical links such as the definite article (الـ), attached and separated prepositions, conjunctions, interrogative words, negative words, exclamations, vocative particles, and adverbs of time and place. They also include all the pronouns, demonstratives, subject and object pronouns, the Five Distinctive Nouns, some numbers, additions and verbs. Stop words may be separate or attached, in the form of prefixes or suffixes [3].
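The elimination step can be sketched as follows. This is a minimal illustration only: the stop-word set below is a tiny hypothetical sample, whereas the actual Arabic lists (such as the one collected by Al-Shalabi et al. [2]) contain far more entries.

```python
# Minimal sketch of stop-word elimination for Arabic text.
# The stop-word set here is a tiny illustrative sample, not a real list.
ARABIC_STOP_WORDS = {"في", "من", "على", "إلى", "عن", "هذا", "هذه", "التي", "الذي"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set, preserving order."""
    return [t for t in tokens if t not in ARABIC_STOP_WORDS]

tokens = "ذهب الولد إلى المدرسة في الصباح".split()
print(remove_stop_words(tokens))
```

In a real system the filter would run after tokenization and before indexing, so that the removed words never enter the indexing structure.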
3.1.2 Stemming:
Stemming is the deletion of prefixes and suffixes (getting the root of the word). The root is the part of the word that is left after the deletion of prefixes, suffixes and infixes [4].
Stems are thought to be useful for improving retrieval performance because they reduce variants of the same root word to a common concept. Furthermore, stemming has the secondary effect of
reducing the size of the indexing structure because the number of distinct index terms is reduced
[3].
3.1.3 Indexing:
Indexing is the selection of index terms from the collection of filtered terms, as seen in the construction processes. It is defined as the process of choosing a term or a number of terms that can represent what the document contains; these terms are called index terms [8]. Indexing can be performed either manually (manual indexing) or by using computer software and programs (automatic indexing) [6]. To record the exact locations of the keywords in a given text, the researcher generates an inverted file. This file contains all the keywords that the text contains, accompanied by the number of each paragraph containing those words. A paragraph may be a sentence, a paragraph, a whole page of a document or the complete document [3].
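The inverted file described here can be sketched as follows; a minimal illustration assuming already-filtered, whitespace-tokenized paragraphs, with hypothetical example terms:

```python
from collections import defaultdict

# Sketch of an inverted file: each keyword is mapped to the numbers of
# the paragraphs that contain it (paragraph granularity is one of the
# options mentioned in the text; it could equally be pages or documents).
def build_inverted_file(paragraphs):
    index = defaultdict(set)
    for number, paragraph in enumerate(paragraphs, start=1):
        for term in paragraph.split():
            index[term].add(number)
    return dict(index)

paragraphs = ["information retrieval system", "retrieval of arabic documents"]
index = build_inverted_file(paragraphs)
print(index["retrieval"])  # → {1, 2}
```

Looking a term up in this structure immediately yields every paragraph that contains it, which is what makes the inverted file fast for search.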
4. ALGORITHM
Phase 1: preparing documents:
1 Use the vector space model to put the text of the documents and the query into vectors.
2 Normalization:
Removing the stop words collected by Al-Shalabi et al. [2], which achieved 98% success in distinguishing stop words, in addition to deleting some signs that appeared. (Stop words are words that occur so frequently in the documents of the collection that they are useless for purposes of retrieval [9].) Elimination of stop words reduces the size of the indexing structure and thus increases the performance of the system and enables it to retrieve more relevant documents.
Deleting punctuation marks, commas, full stops, special signs and numbers (content that has no meaning).
3 Stemming: the following stemming algorithm, as in [11], with a slight modification:
Let T denote the set of characters of the Arabic surface Full word
Let Li denote the position of letter i in term T
Let Stem denote the term after stemming in each step
Let D denote the set of definite articles (ال)
Let S denote the set of suffixes
S ={ ،ه،ة ،ك ،و ،ي ،ن ،ا ت
،ان ،ﯾﻦ ،ون ،ات ،ھﻢ ،ھﻦ ،ھﺎ ،ﻛﻢ ،ﻛﻦ ،ﻧﺎ ،وا ،ﺗﻢ ،ﻧﻲ ،ﺗﻦ ،ﺗﮫ ،ﯾﮫ ،ﻣﺎ ،ﯾﺎ ،ﺗﺎ ،ﺗﻚ ،ري ،ﻟﻲ ،ار
،ﯾﺮ ،ﺗﻲ ،ﯾﻞ ﯾﺔ
،ﻟﯿﻦ ،ﯾﺎن ،ﯾﺘﺶ ،ﯾﻮن ،رات ،ﻣﺎن ،رﯾﻦ رھﺎ، ،ﯾﻨﺎ }ﻟﮭﺎ
Let P denote the set of prefixes
P={ ،ي ت ، ،ن ،ب ل
،ال ،ﻟﻞ ،ﺳﻲ ،ﺳﺎ ،ﺳﺖ ،ﺳﻦ ﻛﺎ ، ﻓﺎ ، ﺑﺎ ، ،ﻟﻲ ﻟﺖ ، ،ﻟﻦ ،ﻓﺖ ﻓﻲ ، ،ﻓﻦ اس
،اﻟﺢ ،ﻣﺎل ،ﻻل ،اﻟﻢ ،اﻻ ،اﻟﺲ ،اﻟﻊ ،ﻟﻠﻢ ،اﻟﻚ اﻟﻒ }
Let n be the total number of characters in the Arabic word
Step 1: Remove any diacritic in T
4. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013
4
Step 2: If the length of T is > 3 characters then,
Remove the prefix Waw “ ”و in position L1
Step 3: Normalize آ ,إ ,أ of T to ا (plain alif)
Step 4: Normalize ى in Ln of T to ي
Replace the sequence of ى in Ln-1 and ء in Ln to ئ
Replace the sequence of ي in Ln-1 and ء in Ln to ئ
Normalize ه in Ln of T to ة
Step 5: For all variations of D ()ال do,
Locate the definite article Di in T
If Di in T matches Di = Di + Characters in T ahead of Di
Stem = T – Di
Step 6: If the length of Stem is > 3 characters then,
For all variations of S, obtain the most frequent suffix,
Match the region of Si to longest suffix in Stem
If length of (Stem -Si ) >= to 3 char then,
Stem = Stem – Si
Step 7: If the length of Stem is > 3 characters then,
For all variations of P do
Match the region of Pi in Stem
If the length of (Stem -Pi ) > 3 characters then,
Stem = Stem – Pi
Step 8: Return the Stem
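A rough sketch of Steps 1-8 in Python is given below. The suffix set S and prefix set P here are small illustrative samples only (the full sets from [11] are partly garbled in this copy), and the character-sequence replacements of Step 4 are only partially implemented:

```python
import re

# Sketch of the stemming steps above. SUFFIXES and PREFIXES are small
# illustrative samples, not the full sets S and P from [11].
DIACRITICS = re.compile(r"[\u064B-\u0652]")             # fathatan .. sukun
SUFFIXES = ["ات", "ون", "ين", "ها", "هم", "ة", "ه", "ي"]
PREFIXES = ["ال", "لل", "ب", "ل"]

def light_stem(word):
    word = DIACRITICS.sub("", word)                     # Step 1: strip diacritics
    if len(word) > 3 and word.startswith("و"):          # Step 2: leading waw
        word = word[1:]
    word = re.sub("[أإآ]", "ا", word)                   # Step 3: normalize alif
    if word.endswith("ى"):                              # Step 4 (partial)
        word = word[:-1] + "ي"
    for article in ("ال", "لل"):                        # Step 5: definite article
        if word.startswith(article) and len(word) - len(article) >= 3:
            word = word[len(article):]
            break
    for suf in sorted(SUFFIXES, key=len, reverse=True): # Step 6: longest suffix
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            word = word[:-len(suf)]
            break
    for pre in sorted(PREFIXES, key=len, reverse=True): # Step 7: prefix
        if word.startswith(pre) and len(word) - len(pre) > 3:
            word = word[len(pre):]
            break
    return word                                         # Step 8

print(light_stem("المعلومات"))
```

For example, light_stem("المعلومات") yields "معلوم": a light stem rather than the true morphological root, which is typical of suffix/prefix-removal stemmers.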
Phase 2: building a traditional IRS.
4 Selection of index terms from the collection of filtered terms. Ricardo Baeza-Yates and Berthier Ribeiro-Neto [9] show that the inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task. Index terms can be individual words, groups of words, or phrases, but most of them are single words [9]; for this reason the researcher chose single words (i.e., single terms) as index terms in this work.
This phase includes a decision that affects the results: how often must a word be repeated within
a document for it to serve as an index term? Can a word that appears only once be used as an
index term, or must the researcher use only words that are repeated many times in the same
document?
In this work, the index terms are the words repeated from 2 to 7 times in the text. After
studying some files and taking the average number of occurrences of some words, the authors
of [7] found that the best index terms are those repeated close to the average (not too little,
not too much).
It is expected that the use of a controlled vocabulary leads to an improvement in retrieval
performance. We therefore ignore the terms that appear in most documents in the collection
(i.e., terms with high frequencies) and the terms that appear only once in a document (i.e.,
terms with low frequencies).
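The filtering rules above (keep only terms that occur 2-7 times in a document, and drop terms that appear in most documents) could be sketched as follows. The data layout and the 0.9 document-frequency cutoff are assumptions for illustration, not values from the paper.

```python
from collections import Counter

def select_index_terms(docs, min_tf=2, max_tf=7, max_df_ratio=0.9):
    """docs: list of token lists, one per document.
    Returns {doc_id: {term: tf}} containing only the selected index terms."""
    counts = [Counter(tokens) for tokens in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    n_docs = len(docs)
    index = {}
    for doc_id, c in enumerate(counts):
        index[doc_id] = {
            t: tf for t, tf in c.items()
            if min_tf <= tf <= max_tf            # the (2-7) occurrence rule
            and df[t] / n_docs <= max_df_ratio   # skip near-ubiquitous terms
        }
    return index
```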
1 Creating the inverted file based on the stemmed words of each document. (The stemming
technique used is suffix/prefix removal.)
2 Compute the frequencies of each index term in each document (tf).
3 Compute the normalized frequency of each term in each document by using the
following formula:

$f_{i,j} = \dfrac{freq_{i,j}}{\max_{l} freq_{l,j}}$ (1)
4 Compute the inverse document frequency for each index term ki as follows:

$idf_i = \log \dfrac{N}{n_i}$ (2)

where N is the total number of documents in the collection, and ni is the number
of documents in which index term ki appears.
5 Calculate the weight of each term in a document by multiplying the normalized
frequency by the inverse document frequency, as follows:

$w_{i,j} = f_{i,j} \times idf_i$ (3)
After these steps, we have an inverted file that contains the index terms (i.e., words), their
frequencies, and the weight of each term in each document.
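Formulas (1)-(3) can be combined into one weighting routine. The following is a minimal sketch assuming the inverted file is a term-to-postings dictionary; this layout is an assumption for illustration, not the authors' exact data structure.

```python
import math

def tfidf_weights(inverted, n_docs):
    """inverted: {term: {doc_id: raw_freq}}.
    Returns {term: {doc_id: weight}} using the tf-idf scheme above:
    tf normalized by the maximum raw frequency in the document (1),
    idf = log(N / n_i) (2), and weight = tf * idf (3)."""
    # Maximum raw frequency per document, for tf normalization.
    max_freq = {}
    for postings in inverted.values():
        for doc, f in postings.items():
            max_freq[doc] = max(max_freq.get(doc, 0), f)
    weights = {}
    for term, postings in inverted.items():
        idf = math.log10(n_docs / len(postings))  # base-10 log assumed
        weights[term] = {
            doc: (f / max_freq[doc]) * idf
            for doc, f in postings.items()
        }
    return weights
```

A term that appears in every document gets idf = 0 and therefore zero weight, which matches the intuition behind ignoring high-frequency terms.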
Phase 3: Building the Similar Thesaurus:
In this paper, the researcher uses the cosine equation, as it is the most common equation for
building a similarity thesaurus; the similarity threshold is a variable entered while the system
is running.
$similarity(S_j, S_k) = \dfrac{\sum_{i=1}^{n} w_{i,j} \cdot w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \cdot \sqrt{\sum_{i=1}^{n} w_{i,k}^2}}$ (4)
All the results are between 0 and 1, since (0 <= Wi,k <= 1) and (0 <= Wi,j <= 1).
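Given the tf-idf weights, building the similarity thesaurus amounts to computing the cosine of equation (4) between every pair of term vectors and keeping the pairs that reach the threshold. The sketch below assumes each term is represented as a sparse {doc_id: weight} vector; this representation is an illustration, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term vectors {doc_id: weight}."""
    dot = sum(w * v.get(d, 0.0) for d, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_thesaurus(term_vectors, threshold):
    """Link every pair of terms whose cosine similarity reaches the threshold."""
    thesaurus = {}
    terms = list(term_vectors)
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if cosine(term_vectors[a], term_vectors[b]) >= threshold:
                thesaurus.setdefault(a, []).append(b)
                thesaurus.setdefault(b, []).append(a)
    return thesaurus
```

Because the weights are non-negative, every similarity falls between 0 and 1, consistent with the bound stated above.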
5. EXPERIMENTS AND RESULTS
This study aims to reinforce IRS performance using the 242 Arabic abstract documents used by
(Hmeidi & Kanaan, 1997) in [5], and to demonstrate the importance of using stemmed words in
these systems instead of full words. All the abstracts concern computer science and information
systems.
To achieve this aim, the researcher designed and built an automatic information retrieval system
from scratch to handle Arabic text. Work on the results obtained after applying 59 queries
from the Relevance Judgments documents began, and the results were analyzed using the recall
and precision criteria. The averages of recall and precision were then calculated.
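The recall and precision criteria for a single query can be sketched as follows; averaging over the 59 queries and interpolating at the eleven recall levels (0, 0.1, ..., 1.0) are omitted for brevity.

```python
def recall_precision(retrieved, relevant):
    """Recall and precision for one query, given iterables of document ids.

    recall    = relevant retrieved / all relevant documents
    precision = relevant retrieved / all retrieved documents
    """
    hits = len(set(retrieved) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```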
The researcher constructed automatic stemmed-word and full-word indexes using the inverted
file technique. Based on these index words, the researcher built two information retrieval
systems. The first is a traditional information retrieval system using term frequency-inverse
document frequency (tf-idf) for the index term weights. The second uses a similar thesaurus
built with the vector space model and the cosine formula, again using tf-idf for the index term
weights; the similarity measurements were compared to find the best one to use in building the
similar thesaurus.
5.1 Results
Table 1 shows the effect of using the traditional system (full words, stemmed) and the
similarity thesaurus (full words, stemmed).
Table 1: Number of retrieved, relevant, and irrelevant documents using the traditional system
and the similarity thesaurus
                             Retrieved   Relevant   Irrelevant
Traditional - Full Words        1706        763         943
Traditional - Stemmed words     2399       1022        1377
Thesaurus - Full Words          1704        771         933
Thesaurus - Stemmed words       2029        991        1038
Figure 1 and Figure 2 compare the numbers of retrieved documents when using the traditional
system and when using the similarity thesaurus.
[Bar chart - Traditional (Recall Docs): number of retrieved, relevant, and irrelevant documents, full words vs. stemmed words]
Figure 1: Relevant and irrelevant portions of the retrieved documents in the traditional system
[Bar chart - Similarity Thesaurus (Recall Docs): number of retrieved, relevant, and irrelevant documents, full words vs. stemmed words]
Figure 2: Relevant and irrelevant portions of the retrieved documents when using the similarity
thesaurus retrieving system
Table 2, Figure 3, and Figure 4 show the percentage of relevant documents retrieved out of all
the relevant documents in the collection, when using the traditional system and when using the
similarity thesaurus.
Table 2: Percentage of the relevant documents that were retrieved

                             % of Relevant Docs Retrieved
Traditional - Full Words           46.07487923
Traditional - Stemmed words        61.71497585
Thesaurus - Full Words             46.557971
Thesaurus - Stemmed words          62.681159
[Bar chart: % of relevant docs retrieved (scale 0-70), full words vs. stemmed words]
Figure 3: Percentage of relevant documents retrieved in the traditional system
[Bar chart - Similarity: % of relevant docs retrieved (scale 0-80), full words vs. stemmed words]
Figure 4: Percentage of relevant documents retrieved in the thesaurus system
Table 3 summarizes these percentages for the traditional system and the similarity thesaurus.
Table 3: Percentage of relevant documents retrieved using the traditional system and the
similarity thesaurus
Traditional Similarity
Full words 46.07487923 46.55797101
Stemmed words 61.71497585 62.68115942
Table 4 shows how much better the results were when using the similarity thesaurus compared
with traditional retrieving.
Table 4: Average recall-precision for the similarity thesaurus and traditional retrieving

Recall   Similarity Thesaurus   Traditional retrieving   % Improvement (thesaurus over traditional)
0        0.908                  0.917966102              -1.00%
0.1      0.87                   0.875762712              -0.58%
0.2      0.810178571            0.785762712               2.44%
0.3      0.709464286            0.695254237               1.42%
0.4      0.664821429            0.626237288               3.86%
0.5      0.541428571            0.523389831               1.80%
0.6      0.438571429            0.442542373              -0.40%
0.7      0.325357143            0.290847458               3.45%
0.8      0.251428571            0.198305085               5.31%
0.9      0.13875                0.084745763               5.40%
1        `2q2                   0.047288136               0.91%
Table 5 shows that using stemmed words for information retrieval was always better than using
full words, and confirms that using thesauri is much better than using traditional information
retrieval.
Table 5: Average of all the relative work

Recall:                       0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
Traditional - Full Words      0.91  0.87  0.81  0.71  0.66  0.54  0.44  0.33  0.25  0.14
Traditional - Stemmed words   0.92  0.88  0.79  0.7   0.63  0.52  0.44  0.29  0.2   0.08
Thesaurus - Full Words        0.92  0.8   0.68  0.54  0.4   0.33  0.21  0.13  0.11  0.07
Thesaurus - Stemmed words     0.9   0.82  0.65  0.49  0.39  0.26  0.21  0.12  0.08  0.04
[Line chart - Averages: precision vs. recall (0-1) for Simi with Stemming, Trad with Stemming, Simi with fullword, and Trad with fullword]
Figure 5: Comparison of the average recall-precision values for all cases
6. CONCLUSION
In this paper, the researcher built a similar thesaurus with two mechanisms (full word and
stemmed). With traditional search, information retrieval using stemmed words gave better
results than using full words; when the similar thesaurus was used with both mechanisms,
stemmed words again outperformed full words. Finally, using stemmed words with the similar
thesaurus gave better results than using stemmed words with traditional search.
7. FUTURE WORK
In this study, I built the similar thesaurus with both mechanisms, full words and stemmed
words. In the future, I hope to build an associative thesaurus and compare the two to determine
which is better for information retrieval.
REFERENCES
[1] Adriani, M. and Croft, W., "Retrieval Effectiveness of Various Indexing Techniques on Indonesian
News Articles", 1997.
[2] Al-Shalabi, R., Kanaan, G., Al-Jaam, J., Hasnah, A., and Helat, E., "Stop-word Removal Algorithm
for Arabic Language", Proceedings of the 1st International Conference on Information &
Communication Technologies: From Theory to Applications (ICTTA), Damascus, 2004.
[3] Baeza-Yates, R. and Ribeiro-Neto, B., "Modern Information Retrieval", Addison-Wesley, New
York, 1999.
[4] Darwish, K., "Building a Shallow Arabic Morphological Analyzer in One Day", ACL Workshop on
Computational Approaches to Semitic Languages, pp. 47-57, 2002.
[5] Kanaan, G., "Comparing Automatic Statistical and Syntactic Phrase Indexing for Arabic Information
Retrieval", Ph.D. Thesis, University of Illinois, Chicago, USA, 1997.
[6] Lassi, M., "Automatic Thesaurus Construction", paper written within the GSLT course Linguistic
Resources, autumn 2002.
[7] Al-Shalabi, R., Kanaan, G., Al-Jaam, J., Hasnah, A., and Helat, E., "Stop-word Removal Algorithm
for Arabic Language", Proceedings of the 1st International Conference on Information &
Communication Technologies: From Theory to Applications (ICTTA), Damascus, 2004.
[8] Kanaan, G. and Wedyan, M., "Constructing an Automatic Thesaurus to Enhance Arabic
Information Retrieval Systems", The 2nd Jordanian International Conference on Computer Science
and Engineering (JICCSE 2006), Salt, Jordan, pp. 89-97, 2006.
[9] Smeaton, A.F. and Van Rijsbergen, C.J., "The Retrieval Effects of Query Expansion on a Feedback
Document Retrieval System", The Computer Journal, 26(3), pp. 239-246, 1983.
[10] Addis, T.R., "Machine Understanding of Natural Language", International Journal of Man-Machine
Studies, Vol. 9, No. 2, March 1977, pp. 207-222.
[11] Aljlayl, M. and Frieder, O., "On Arabic Search: Improving the Retrieval Effectiveness via a Light
Stemming Approach", ACM Conference on Information and Knowledge Management, McLean, VA,
November 2002.
AUTHORS
Assistant Professor, Zarqa University, Jordan