This document proposes using Word2Vec and decision trees to extract keywords from textual documents and to classify those documents. It reviews related work on keyword extraction and text classification techniques. The proposed approach preprocesses the text, represents words as vectors with Word2Vec, computes frequently occurring keywords for each category, and uses decision trees to classify documents based on keyword similarity. Experiments with different preprocessing and Word2Vec settings achieved an F-score of up to 82% for document classification.
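As a rough illustration of that pipeline, the sketch below averages Word2Vec word vectors into document vectors and trains a decision tree on them; the toy corpus, labels and hyperparameters are invented, and gensim plus scikit-learn merely stand in for whatever implementation was actually used.

```python
# Hypothetical two-document corpus; real experiments would use a far
# larger collection and tuned Word2Vec settings.
from gensim.models import Word2Vec
from sklearn.tree import DecisionTreeClassifier
import numpy as np

docs = [["stock", "market", "shares", "profit"],
        ["match", "goal", "league", "score"]]
labels = ["finance", "sports"]

w2v = Word2Vec(sentences=docs, vector_size=50, min_count=1, seed=1)

def doc_vector(tokens, model):
    # Represent a document as the mean of its word vectors.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)

X = np.array([doc_vector(d, w2v) for d in docs])
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
print(clf.predict([doc_vector(["goal", "league"], w2v)]))  # one of the two labels
```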
Semantic tagging for documents using 'short text' information (csandit)
Tagging documents with relevant and comprehensive keywords offers invaluable assistance to readers, letting them quickly overview any document. With the ever-increasing volume and variety of documents published on the internet, interest in developing newer and more successful techniques for annotating (tagging) documents is also increasing. However, an interesting challenge in document tagging occurs when the full content of the document is not readily accessible. In such a scenario, techniques which use 'short text', e.g., a document title or a news article headline, to annotate the entire article are particularly useful. In this paper, we propose a novel approach to automatically tag documents with relevant tags or key-phrases using only 'short text' information from the documents. We employ crowd-sourced knowledge from Wikipedia, DBpedia, Freebase, YAGO and similar open-source knowledge bases to generate semantically relevant tags for the document. Using the intelligence from the open web, we prune out tags that create ambiguity in, or 'topic drift' from, the main topic of our query document. We have used a real-world dataset from a corpus of research articles to annotate 50 research articles. As a baseline, we used the full-text information from the document to generate tags. The proposed and baseline approaches were compared using the author-assigned keywords for the documents as the ground-truth information. We found that the tags generated using the proposed approach are better than those from the baseline in terms of overlap with the ground-truth tags, measured via the Jaccard index (0.058 vs. 0.044). In terms of computational efficiency, the proposed approach is at least 3 times faster than the baseline approach. Finally, we qualitatively analyse the quality of the predicted tags for a few samples in the test corpus. The evaluation shows the effectiveness of the proposed approach both in the quality of the tags generated and the computational time.
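The overlap metric quoted above is the Jaccard index; a minimal sketch of how such tag-set overlap can be computed follows, with invented tag sets.

```python
# Jaccard index: |intersection| / |union| of two tag sets.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

predicted = {"semantic tagging", "wikipedia", "short text"}
ground_truth = {"document tagging", "short text", "knowledge base"}
print(jaccard(predicted, ground_truth))  # 0.2 for these toy sets
```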
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC... (cscpconf)
On-line text documents rapidly increase in number with the growth of the World Wide Web. To manage such a huge amount of text, several text-mining applications came into existence. Those applications, such as search engines, text categorization, summarization, and topic detection, are based on feature extraction. Extracting keywords or features manually is an extremely time-consuming and difficult task, so an automated process that extracts keywords or features needs to be established. This paper proposes a new domain keyword extraction technique that includes a new weighting method built on conventional TF-IDF. Term frequency-inverse document frequency is widely used to express a document's feature weights, but it cannot reflect the distribution of terms within the document, and therefore cannot reflect their degree of significance or the differences between categories. This paper proposes a new weighting method in which a new weight is added to express the differences between domains, on the basis of the original TF-IDF. The extracted features represent the content of the text better and have a better ability to distinguish between categories.
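The abstract does not give the exact form of the added domain weight, so the sketch below is only a plausible reading: classic TF-IDF scaled by a simple in-domain vs. out-of-domain frequency ratio.

```python
# Assumed illustration only: the paper's actual domain weight is not
# specified here, so a frequency ratio stands in for it.
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / (1 + df))

def domain_weight(term, domain_docs, other_docs):
    # Ratio of the term's frequency inside vs. outside the domain
    # (add-one smoothed to avoid division by zero).
    in_freq = sum(d.count(term) for d in domain_docs) + 1
    out_freq = sum(d.count(term) for d in other_docs) + 1
    return in_freq / out_freq

doc = ["keyword", "extraction", "keyword"]
corpus = [doc, ["search", "engine"], ["topic", "detection"]]
score = tf_idf("keyword", doc, corpus) * domain_weight("keyword", [doc], corpus[1:])
print(score)
```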
Dictionary based concept mining: an application for Turkish (csandit)
In this study, a dictionary-based method is used to extract expressive concepts from documents. So far, there have been many studies concerning concept mining in English, but this area of study for Turkish, an agglutinative language, is still immature. We used a dictionary instead of WordNet, the lexical database grouping words into synsets that is widely used for concept extraction. Dictionaries are rarely used in the domain of concept mining, but because dictionary entries carry synonyms, hypernyms, hyponyms and other relationships in their definition texts, the success rate for determining concepts has been high. This concept extraction method is applied to documents collected from different corpora.
Performance analysis on secured data method in natural language steganography (journalBEEI)
The rapid growth of exchanged information that drove the expansion of the internet during the last decade has motivated research in this field. Recently, steganography approaches have received unexpected attention. Hence, the aim of this paper is to review different performance metrics, covering decoding, decrypting and extracting performance. The process of data decoding interprets the received hidden message into a code word. Data encryption, in turn, is the best way to provide secure communication, and decrypting takes an encrypted text and converts it back into the original text. Data extracting is the reverse of the data-embedding process. The effectiveness evaluation is mainly determined by the chosen performance metrics, and researchers aim to improve those metric characteristics. The objective of this paper is to present a review of the study of natural-language steganography based on the criteria of performance analysis. The findings clarify the preferred performance-metric aspects used. This review is intended to help future research in evaluating the performance of natural-language steganography in general, and of the proposed secured-data methods in particular.
The Process of Information Extraction through Natural Language Processing (Waqas Tariq)
Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured, e.g., a sentence or even another document, or structured, e.g., a boolean expression. The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both in internal corporate document collections and in the immense and growing number of document sources on the Internet. The topics covered include: formulation of structured and unstructured queries and topic statements, indexing (including term weighting) of document collections, methods for computing the similarity of queries and documents, classification and routing of documents in an incoming stream to users on the basis of topic or need statements, clustering of document collections on the basis of language or topic, and statistical, probabilistic, and semantic methods of analyzing and retrieving documents. Information extraction from text has therefore been pursued actively as an attempt to present knowledge from published material in a computer-readable format. An automated extraction tool would not only save time and effort, but also pave the way to discovering hitherto unknown information implicitly conveyed in published text. Work in this area has focused on extracting a wide range of information such as chromosomal locations of genes, protein functional information, associating genes by functional relevance, and relationships between entities of interest. While clinical records provide a semi-structured, technically rich data source for mining information, publications, in their unstructured format, pose a greater challenge, addressed by many approaches.
A Document Exploring System on LDA Topic Model for Wikipedia Articles (ijma)
A large amount of digital text information is generated every day, and effectively searching, managing and exploring the text data has become a main task. In this paper, we first present an introduction to text mining and the LDA topic model. Then we explain in depth how to apply the LDA topic model to a text corpus by conducting experiments on Simple Wikipedia documents. The experiments include all the necessary steps of data retrieval, pre-processing, fitting the model, and an application in a document exploring system. The results of the experiments show the LDA topic model working effectively for clustering documents and finding similar documents. Furthermore, the document exploring system could be a useful research tool for students and researchers.
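For readers unfamiliar with the workflow, here is a minimal LDA sketch in the spirit of those experiments, using gensim on a toy corpus rather than the Simple Wikipedia dump.

```python
# Fit a tiny LDA model: build a dictionary, convert documents to
# bag-of-words, then train and inspect the topics.
from gensim import corpora, models

texts = [["topic", "model", "wikipedia"],
         ["cluster", "document", "topic"],
         ["wikipedia", "article", "search"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
for tid in range(2):
    print(lda.print_topic(tid))  # top words and weights per topic
```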
A Text Mining Research Based on LDA Topic Modelling (csandit)
A large amount of digital text information is generated every day, and effectively searching, managing and exploring the text data has become a main task. In this paper, we first present an introduction to text mining and a probabilistic topic model, Latent Dirichlet Allocation. Then two experiments are proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds a document topic model, aiming at a topic-perspective solution for searching, exploring and recommending articles. The latter sets up a user topic model, providing full research and analysis of Twitter users' interests. The experiment process, including data collection, data pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could be a useful computational tool for social and business research.
Text mining is the technique that helps users find useful information in large amounts of text documents on the web or in databases. Most popular text mining and classification methods have adopted term-based approaches; pattern-based methods have also been used to describe user preferences. This review paper analyses how text mining works at three levels: the sentence level, the document level and the feature level. We review related work previously done and discuss the problems that arise when text mining is performed at the feature level. The paper also presents a technique for text mining over compound sentences.
Single document keywords extraction in Bahasa Indonesia using phrase chunking (TELKOMNIKA JOURNAL)
Keywords help readers to understand the idea of a document quickly. Unfortunately, considerable time and effort are often needed to come up with a good set of keywords manually. This research focused on generating keywords from a document automatically using phrase chunking. Firstly, we collected part-of-speech patterns from a collection of documents. Secondly, we used those patterns to extract candidate keywords from the abstract and the content of a document. Finally, keywords are selected from the candidates based on the number of words in the keyword phrases and several scenarios involving candidate reduction and sorting. We evaluated the result of each scenario using precision, recall, and F-measure. The experimental results show: i) shorter-phrase keywords with string reduction, extracted from the abstract and sorted by frequency, provide the highest score, ii) in every proposed scenario, extracting keywords from the abstract always gives a better result, iii) using shorter-phrase patterns in keyword extraction gives a better score than using all phrase patterns, iv) sorting scenarios based on the product of candidate frequencies and the weights of the phrase patterns offer better results.
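A minimal illustration of the phrase-chunking step follows; it assumes NLTK and a single hand-written noun-phrase POS pattern, whereas the paper mines its patterns from a document collection.

```python
# Extract candidate key phrases matching one assumed POS pattern:
# optional adjectives followed by one or more nouns.
# Requires the NLTK tokenizer and tagger models (see nltk.download).
import nltk

grammar = "KP: {<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tokens = nltk.pos_tag(nltk.word_tokenize("Automatic keyword extraction helps busy readers."))
tree = chunker.parse(tokens)
candidates = [" ".join(w for w, _ in st.leaves())
              for st in tree.subtrees() if st.label() == "KP"]
print(candidates)  # e.g. ['Automatic keyword extraction', 'busy readers']
```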
A template based algorithm for automatic summarization and dialogue managemen... (eSAT Journals)
This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of the research is to arrive at a methodology that helps extract important events such as dates, places, and subjects of interest. It would also be convenient if the methodology presented users with a shorter version of the text containing all non-trivial information. We also discuss our implementation of algorithms that perform exactly this task. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO... (cseij)
In this paper we combine our previous research in the field of the Semantic Web, especially ontology learning and population, with sentence retrieval. To do this we developed a new approach to sentence retrieval, modifying our previous TF-ISF method, which uses local-context information, to take into account only document-level information. This is quite a new approach to sentence retrieval, presented for the first time in this paper, and it is compared to existing methods that use information from the whole document collection. Using this approach and the developed methods for sentence retrieval at the document level, it is possible to assess the relevance of a sentence using only the information from the retrieved sentence's document, and to define a document-level OWL representation for sentence retrieval that can be automatically populated. In this way the idea of the Semantic Web is supported through automatic and semi-automatic extraction of additional information from existing web resources. The additional information is formatted in an OWL document containing document-sentence relevance for sentence retrieval.
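TF-ISF mirrors TF-IDF with sentences in place of documents; the sketch below computes the document-level variant described above, with the exact scoring and ranking details treated as assumptions.

```python
# Score each sentence by term frequency times inverse sentence frequency,
# computed over the sentences of a single document only.
import math

def tf_isf(sentence, sentences):
    n = len(sentences)
    score = 0.0
    for term in set(sentence):
        tf = sentence.count(term) / len(sentence)
        sf = sum(1 for s in sentences if term in s)  # sentence frequency
        score += tf * math.log(n / sf)
    return score

sents = [["ontology", "learning", "population"],
         ["sentence", "retrieval", "ontology"],
         ["owl", "representation"]]
ranked = sorted(sents, key=lambda s: tf_isf(s, sents), reverse=True)
print(ranked[0])  # most distinctive sentence of the document
```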
A prior case study of natural language processing on different domains (IJECEIAES)
In the present state of the digital world, computer machines do not understand humans' ordinary language. This is the great barrier between humans and digital systems. Hence, researchers have sought advanced technology that provides information to users from digital machines. Natural language processing (NLP) is a branch of AI with significant implications for the ways that computer machines and humans can interact. NLP has become an essential technology in bridging the communication gap between humans and digital data. Thus, this study presents the necessity of NLP in the current computing world, along with different approaches and their applications. It also highlights the key challenges in the development of new NLP models.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information lies dormant in historical documents and manuscripts, and this information would go to waste if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text by optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word-spotting technique based on codes for matching the word images of the Devanagari script. The shape information is utilised to generate integer codes for words in the document image, and these codes are matched for final retrieval of relevant documents. The technique is illustrated using Marathi document images.
Answer extraction and passage retrieval for... (Waheeb Ahmed)
Question Answering systems (QASs) perform the task of retrieving the text portions, from a collection of documents, that contain the answer to the user's question. These QASs use a variety of linguistic tools able to deal with small fragments of text. Therefore, to retrieve the documents which contain the answer from a large document collection, QASs employ Information Retrieval (IR) techniques to reduce the document collection to a tractable amount of relevant text. In this paper, we propose a passage retrieval model that performs this task with better performance for the purposes of Arabic QASs. We first segment each of the top five ranked documents returned by the IR module into passages. Then we compute the similarity score between the user's question terms and each passage, and the top five passages (with the highest similarity scores) are retrieved. Finally, answer extraction techniques are applied to extract the final answer. Our method achieved an average precision of 87.25%, recall of 86.2% and F1-measure of 87%.
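A rough sketch of that passage-ranking step follows; TF-IDF cosine similarity via scikit-learn stands in for the authors' similarity computation, and the question and passages are invented.

```python
# Rank passages against a question by TF-IDF cosine similarity
# and keep the top five.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "who founded the library of alexandria"
passages = ["The Library of Alexandria was founded in Egypt ...",
            "Papyrus scrolls were stored in the main hall ..."]

vec = TfidfVectorizer()
matrix = vec.fit_transform([question] + passages)  # row 0 is the question
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
top_five = sorted(zip(scores, passages), reverse=True)[:5]
print(top_five[0])
```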
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES (ijcseit)
As information access across languages increases, the importance of a system that supports query-based searching over multilingual content also grows. Gathering information in different natural languages is a most difficult task, requiring huge resources such as databases and digital libraries. Cross-language information retrieval (CLIR) enables searching multilingual document collections using one's native language, and it can be supported by different data mining techniques. This paper deals with the various data mining techniques that can be used to solve the problems encountered in CLIR.
A Novel Approach for Keyword extraction in learning objects using text mining (IJSRD)
Keyword extraction and concept finding in learning objects are very important subjects in today's e-learning environment. Keywords are a subset of words that contain useful information about the content of the document, and keyword extraction is a process used to obtain the important keywords from documents. In the proposed system, a decision tree algorithm is used for the feature-selection process together with the WordNet dictionary. WordNet is a lexical database of English which is used to compute similarity among the candidate words. The words with the highest similarity are taken as keywords.
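A minimal sketch of ranking candidate words by WordNet similarity, assuming NLTK's WordNet interface; the decision-tree feature-selection step of the system is not reproduced here.

```python
# Score candidates by their best WordNet path similarity to a topic word.
# Requires nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def similarity(word_a, word_b):
    syns_a, syns_b = wn.synsets(word_a), wn.synsets(word_b)
    scores = [a.path_similarity(b) for a in syns_a for b in syns_b]
    scores = [s for s in scores if s is not None]  # cross-POS pairs yield None
    return max(scores, default=0.0)

candidates = ["lesson", "banana", "course"]
ranked = sorted(candidates, key=lambda w: similarity(w, "learning"), reverse=True)
print(ranked)  # most learning-related candidates first
```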
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR... (IJDKP)
As existing computer search engines struggle to understand the meaning of natural language, semantically enriched metadata may improve interest-based search engine capabilities and user satisfaction. This paper presents an enhanced version of an ecosystem focusing on semantic topic metadata detection and enrichment. It builds on a previous paper on a semantic metadata enrichment software ecosystem (SMESE). Through text analysis approaches for topic detection and metadata enrichment, this paper proposes an algorithm to enhance search engine capabilities and consequently help users find content according to their interests. It presents the design, implementation and evaluation of the SATD (Scalable Annotation-based Topic Detection) model and algorithm, using metadata from the web, linked open data, concordance rules, and bibliographic record authorities. It includes a prototype of a semantic engine using keyword extraction, classification and concept extraction that generates semantic topics through text and multimedia document analysis with the proposed SATD model and algorithm. The performance of the proposed ecosystem is evaluated through a number of prototype simulations, comparing it to existing metadata enrichment techniques (e.g., AlchemyAPI, DBpedia, Wikimeta, Bitext, AIDA, TextRazor). The SATD algorithm was found to support more attributes than the other algorithms. The results show that the enhanced platform and its algorithm enable greater understanding of documents related to user interests.
An Information Retrieval System is an effective process that helps a user trace relevant information via Natural Language Processing (NLP). In this research paper, we present an algorithmic Information Retrieval System (BIRS) that is grounded mathematically and statistically. The paper demonstrates two algorithms for finding the lemmatization of Bengali words, Trie and Dictionary Based Search by Removing Affix (DBSRA), which are compared with Edit Distance for exact lemmatization. We present a Bengali anaphora resolution system using Hobbs' algorithm to obtain the correct expression of the information. For question answering, TF-IDF and cosine similarity are employed to find the accurate answer from the documents. In this study, we introduce a Bengali Language Toolkit (BLTK) and Bengali Language Expression (BRE) that make the implementation of our task easy. We have also developed a Bengali root-word corpus, a synonym corpus and a stop-word corpus, and gathered 672 articles from the popular Bengali newspaper 'The Daily Prothom Alo' as our input information. For testing the system, we created 19,335 questions from this information and obtained 97.22% accurate answers.
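Of the components listed, the edit-distance baseline is the most self-contained; a classic Levenshtein sketch follows (the Bengali-specific Trie and DBSRA affix handling are not shown).

```python
# Classic dynamic programming: dp[i][j] is the edit distance between
# the first i characters of a and the first j characters of b.
def edit_distance(a, b):
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
          for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(a)][len(b)]

print(edit_distance("running", "run"))  # 4
```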
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: 'D4 A... (idescitation)
The Genetic Algorithm (GA) has been a successful method used for extracting keywords. This paper presents a complete method by which keywords can be derived from various corpora. We have built equations that exploit the structure of the documents from which the keywords need to be extracted. The procedure is broken into two distinct profiles: one weighs the words across the whole document content, and the other explores the possible occurrences of key terms using a genetic algorithm. The basic equations of the heuristic mechanism are varied to allow complete exploitation of the document. The genetic algorithm and the enhanced standard-deviation method are used to their full potential to enable the generation of key terms that describe the given text document. The new technique has enhanced performance and better time complexity.
Automatically finding domain-specific key terms in a given set of research papers is a challenging task, and assigning research papers to a particular area of research is a concern for many people, including students, professors and researchers. A domain classification of papers facilitates that search process: given a list of domains in a research field, we try to find out to which domain(s) a given paper is most related. Besides, reading and processing a whole paper takes a long time, and using domain knowledge requires much human effort, e.g., manually labeling a large corpus. In this paper, we use the abstract and keywords of a research paper as the seed terms to identify similar terms from a domain corpus, which are then filtered by checking their appearance in the research papers. Experiments show that the TF-IDF measure and the classification step make this method map terms to domains more precisely. The results show that our approach can extract the terms effectively while being domain independent.
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F... (ijaia)
Regression models and their statistical analyses are among the most important tools used by scientists and practitioners. The aim of a regression model is to fit parametric functions to data. The true regression is unknown, and specific methods are created and used strictly pertaining to the problem. For the pioneering work on procedures for fitting functions, we refer to the methods of least absolute deviations, least squares deviations and minimax absolute deviations. Today's widely celebrated method of least squares for function fitting is credited to the published works of Legendre and Gauss. However, least-squares-based models in practice may fail to provide optimal results in non-Gaussian situations, especially when the errors follow fat-tailed distributions. In this paper an unorthodox method of estimating linear regression coefficients by minimising the GMSE (geometric mean of squared errors) is explored. Though the GMSE is often used to compare models, it is rarely used to obtain the coefficients. Such a method is tedious to handle due to the large number of roots obtained when minimising the loss function; this paper offers a way to tackle that problem. The application is illustrated with the 'Advertising' dataset from ISLR, and the obtained results are compared with those of the method of least squares for a single-index linear regression model.
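As a numerical illustration of the idea (not the paper's own procedure), note that minimising the GMSE is equivalent to minimising the mean of the log squared residuals; the sketch below does this for a toy single-predictor model with fat-tailed noise, using a general-purpose optimiser in place of the paper's root-based treatment.

```python
# Estimate intercept and slope by minimising mean(log(residual^2)),
# which is a monotone transform of the geometric mean of squared errors.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.standard_t(df=2, size=50)  # fat-tailed noise

def log_gmse(beta):
    residuals = y - (beta[0] + beta[1] * x)
    return np.mean(np.log(residuals ** 2 + 1e-12))  # epsilon guards log(0)

fit = minimize(log_gmse, x0=[0.0, 0.0], method="Nelder-Mead")
print(fit.x)  # estimated [intercept, slope]
```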
TEXT MINING AND CLASSIFICATION OF PRODUCT REVIEWS USING STRUCTURED SU... (csandit)
Text mining and text classification are two prominent and challenging tasks in the field of machine learning. Text mining refers to the process of deriving high-quality and relevant information from text, while text classification deals with the categorization of text documents into different classes. The real challenge in these areas is to address problems like handling large text corpora, similarity of words in text documents, and association of text documents with a subset of class categories. The feature extraction and classification of such text documents require an efficient machine learning algorithm which performs automatic text classification. This paper describes the classification of product review documents as a multi-label classification scenario and addresses the problem using a Structured Support Vector Machine. The work also explains the flexibility and performance of the proposed approach for efficient text classification.
Text preprocessing is a vital stage in text classification (TC) particularly and in text mining generally. The purpose of text preprocessing tools is to reduce the multiple forms of a word to one form. Text preprocessing techniques have accordingly received a great deal of attention and are widely studied in machine learning. The basic phase in text classification involves preprocessing features and extracting relevant features against the features in a database; preprocessing has a great impact on reducing the time and computational resources needed. The effect of preprocessing tools on English text classification is an active area of research. This paper provides an evaluation study of several preprocessing tools for English text classification. The study covers using the raw text, tokenization, stop-word removal, and stemming. Two different methods for feature extraction, chi-square and TF-IDF with cosine similarity scores, are used, based on the BBC English dataset. The experimental results show that text preprocessing affects the feature extraction methods and enhances the performance of English text classification, especially for small threshold values.
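A small sketch of the kind of pipeline such a study evaluates: vectorization with stop-word removal followed by chi-square feature selection. The two-document corpus and the value of k are placeholders, not the BBC dataset or the paper's thresholds.

```python
# TF-IDF features with English stop words removed, then keep the
# k features with the highest chi-square scores against the labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the match ended in a draw", "shares fell on the stock market"]
labels = ["sport", "business"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
selected = SelectKBest(chi2, k=4).fit_transform(X, labels)
print(selected.shape)  # (2 documents, 4 selected features)
```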
Text classification supervised algorithms with term frequency inverse documen... (IJECEIAES)
Over the course of the previous two decades, there has been a rise in the quantity of text documents stored digitally. The ability to organize and categorize those documents through an automated mechanism is known as text categorization, which is used to classify them into a set of predefined categories so they may be preserved and sorted more efficiently. Identifying appropriate structures, architectures, and methods for text classification presents a challenge for researchers. This is due to the significant impact this concept has on content management, contextual search, opinion mining, product review analysis, spam filtering, and text sentiment mining. This study analyzes the generic categorization strategy and examines supervised machine learning approaches and their ability to comprehend complex models and nonlinear data interactions. Among these methods are k-nearest neighbors (KNN), support vector machine (SVM), and ensemble learning algorithms employing various evaluation techniques. Thereafter, an evaluation is conducted on the constraints of each technique and how they can be applied to real-life situations.
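A compact sketch of the kind of comparison described: KNN, SVM, and an ensemble classifier trained on the same TF-IDF features. The four toy documents are placeholders, and the particular estimators are assumptions rather than the study's exact setup.

```python
# Train three supervised classifiers on identical TF-IDF features
# and compare their predictions on an unseen document.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

docs = ["cheap meds online", "meeting moved to friday",
        "win a free prize now", "quarterly report attached"]
labels = ["spam", "ham", "spam", "ham"]

for model in (KNeighborsClassifier(n_neighbors=1), LinearSVC(),
              RandomForestClassifier(random_state=0)):
    clf = make_pipeline(TfidfVectorizer(), model).fit(docs, labels)
    print(type(model).__name__, clf.predict(["free meds prize"]))
```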
Feature selection, optimization and clustering strategies of text documents (IJECEIAES)
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. Research on the clustering task must be performed prior to its adaptation to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numbers. Efforts have also been put forward toward efficient clustering in the context of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper details prominent models proposed for clustering, along with the pros and cons of each model. In addition, it focuses on various recent developments in the clustering task in social networks and associated environments.
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION (cscpconf)
The internet has caused a humongous growth in the amount of data available to the common man. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indexes for the given document. In this work, we present a method to produce extractive summaries of documents in the Kannada language. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) methods along with TF (Term Frequency) for extracting keywords, and later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
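The GSS coefficient mentioned above is commonly given as GSS(t, c) = P(t, c) P(t̄, c̄) − P(t, c̄) P(t̄, c); a minimal sketch with invented document counts follows (the combination with TF and IDF is not reproduced).

```python
# GSS coefficient from the four cells of a term/category contingency table:
# n_tc      docs in category c containing term t
# n_tc_bar  docs outside c containing t
# n_t_bar_c docs in c without t
# n_t_bar_c_bar docs outside c without t
def gss(n_tc, n_tc_bar, n_t_bar_c, n_t_bar_c_bar):
    n = n_tc + n_tc_bar + n_t_bar_c + n_t_bar_c_bar
    return (n_tc / n) * (n_t_bar_c_bar / n) - (n_tc_bar / n) * (n_t_bar_c / n)

# Term occurring in 8 of 10 in-category docs and 1 of 40 out-of-category docs:
print(gss(n_tc=8, n_tc_bar=1, n_t_bar_c=2, n_t_bar_c_bar=39))  # 0.124
```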
Keyword Extraction Based Summarization of Categorized Kannada Text Documents (ijsc)
The internet has caused a humongous growth in the number of documents available online. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indices for the given document. In this work, we present a method to produce extractive summaries of documents in the Kannada language, given a number of sentences as the limit. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We use two feature-selection techniques for obtaining features from documents, combining the scores obtained by GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) along with TF (Term Frequency) for extracting keywords, which are later used for summarization based on sentence rank. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns. Text mining is the discovery of valuable, yet hidden, information from text documents. Text classification (also called text categorization) is one of the important research issues in the field of text mining, since it is necessary to classify/categorize large texts (documents) into specific classes. Text classification assigns a text document to one of a set of predefined classes. This paper covers different text classification techniques and also includes classifier architecture and text classification applications.
Survey on Key Phrase Extraction using Machine Learning Approaches (YogeshIJTSRD)
The automated keyword extraction task is to define a collection of representative terms for a text. Extracting keywords yields a small collection of terms, key phrases and keywords that define the document's context. Keyword search allows large document collections to be searched effectively. To allocate suitable key phrases to new documents, text categorization techniques can be applied. A predefined collection of key phrases, from which all key phrases for new documents are selected, is given in the training documents, and the training data for each key phrase comprise the collection of documents associated with it. Standard machine learning techniques are used for each key phrase to construct a classifier from the training material, using the documents relevant to it as positive examples and the rest as negative examples. Given a new text, it is processed by the classifier of each key phrase. Preeti Sondhi and Aakib Jabbar, "Survey on Key Phrase Extraction using Machine Learning Approaches", International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN 2456-6470, Volume 5, Issue 3, April 2021. URL: https://www.ijtsrd.com/papers/ijtsrd39890.pdf Paper URL: https://www.ijtsrd.com/other-scientific-research-area/other/39890/survey-on-key-phrase-extraction-using-machine-learning-approaches/preeti-sondhi
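A small sketch of the scheme the survey describes: one binary classifier per key phrase, trained with documents tagged with that phrase as positives and the rest as negatives (binary relevance). The data and the choice of logistic regression are placeholders.

```python
# One binary classifier per key phrase over shared TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["neural text classification", "keyword extraction with tf-idf",
        "deep learning for text", "graph based keyword ranking"]
doc_phrases = [{"deep learning"}, {"keyword extraction"},
               {"deep learning"}, {"keyword extraction"}]

vectorizer = TfidfVectorizer().fit(docs)
features = vectorizer.transform(docs)

classifiers = {}
for phrase in {"deep learning", "keyword extraction"}:
    y = [1 if phrase in tags else 0 for tags in doc_phrases]  # positives vs. rest
    classifiers[phrase] = LogisticRegression().fit(features, y)

new = vectorizer.transform(["survey of deep learning models"])
assigned = [p for p, clf in classifiers.items() if clf.predict(new)[0] == 1]
print(assigned)
```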
French machine reading for question answering (Ali Kabbadj)
This paper proposes to unlock the main barrier to machine reading and comprehension of French natural-language texts. This opens the way for a machine to find, for a given question, a precise answer buried in the mass of unstructured French text, or to create a universal French chatbot. Deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering, and language translation. But to be effective, deep learning methods need very large training datasets. Until now these techniques could not actually be used for French text question answering (Q&A) applications, since there was no large Q&A training dataset. We produced a large (100,000+) French training dataset for Q&A by translating and adapting the English SQuAD v1.1 dataset, along with GloVe French word and character embedding vectors from the French Wikipedia dump. We trained and evaluated three different Q&A neural network architectures in French and produced French Q&A models with F1 scores around 70%.
An Investigation of Keywords Extraction from Textual Documents using Word2Vec and Decision Tree
Hawa Benghuzzi
Department of Computer Science
Faculty of Information Technology, Misurata University
Misurata - Libya
H.Benghuzzi@it.misuratau.edu.ly
Mohammed M. Elsheh
Department of Computer Science
Faculty of Information Technology, Misurata University
Misurata - Libya
m.elsheh@it.misuratau.edu.ly
Abstract: In recent years, digital data has been growing dramatically, and knowledge discovery and data mining have attracted immense attention given the need to turn such data into useful information and knowledge. Keyword extraction is considered an essential task in natural language processing (NLP) that facilitates mapping documents to a concise set of representative single- and multi-word phrases. This paper investigates the use of Word2Vec and a Decision Tree for keyword extraction from textual documents. The SemEval (2010) dataset is used as the main input for the proposed study. After pre-processing operations are applied to the dataset, words are represented as vectors using the Word2Vec technique. The method is based on word similarity between the candidate keywords collected for each label and the candidate keywords of one sample from the same label. An appropriate threshold is determined, and similarity percentages that exceed this threshold are passed to the Decision Tree in order to decide an appropriate classification for the text document. Several similarity measurements were used in the classification process. The efficiency and accuracy of the algorithm were measured in the classification process using precision, recall and F-score rates. The obtained results indicate that using a vector representation for each keyword is an effective way to identify the most similar words, increasing the chance of recognizing the correct classification of a document. Using Word2Vec CBOW, the F-score was 64% with the Gini method and the WordNet Lemmatizer. Meanwhile, using Word2Vec SG, the F-score was 82% with the Gini Index and English Porter Stemming, the highest ratio across all our experiments.
Keywords- Text Classification; Keywords Extraction; Word2Vec;
Decision Tree; Text Mining.
I. INTRODUCTION
Nowadays, the space of electronic documents is growing daily at a massive rate. At the same time, we need to move quickly through these large amounts of textual information to find documents related to our interests [1]. Unstructured data takes a diversity of forms, and text is a typical example: it is one of the simplest forms of data that can be generated in most scenarios. Humans can easily process and perceive unstructured text, but it is harder for machines to understand. As a result, there is a pressing need to design methods and algorithms to effectively process this flood of text in a broad set of applications [2]. Moreover, this growth of electronic textual documents has led to the need for text mining studies, i.e., the task of extracting meaningful information from text, which has gained more importance recently [1].
Text mining differs from what we are familiar with in web search. In web search, the user is typically looking for something that is already known and has been written by someone else. The problem arises from pushing aside all the material that is currently not appropriate to the user's needs in order to find the relevant information. In text mining, the objective is to discover unknown information, something that no one yet knows and so could not have yet been written down [3]. Text mining involves a set of approaches, such as text summarization, unsupervised learning methods and supervised learning methods.
However, there are many approaches by which keyword
extraction can be carried out, such as supervised and
unsupervised machine learning, statistical methods and
linguistic ones.
Text Classification (TC) is the task of automatically sorting a set of documents into categories from a predefined set; it is an important part of text mining and falls under supervised machine learning methods [4].
The keyword extraction phase comes before text classification. Keywords are the subset of words that carry the most significant information about the content of a document; keyword extraction is the process of selecting, without any human intervention and depending on the model, the words of a text document that probably contain valuable information about it [5].
Basically, TC involves two stages, namely a training stage and a testing stage. In the former stage, documents are preprocessed and trained by a learning algorithm to generate the classifier. In the latter stage, a prediction by the classifier
is performed. Using supervised learning algorithms [3], the
objective is to learn classifiers from known examples (labelled
documents) and to perform the classification automatically on
unknown examples (unlabelled documents). There are many
traditional learning algorithms to train the data, such as
Decision Trees, Naive Bayes (NB), Support Vector Machines
(SVM), k-Nearest Neighbour (KNN), Neural Network (NNet).
The remainder of this paper is organized as follows: Section 2 presents related research dealing with the problem of keyword extraction. Section 3 presents our proposed approach for extracting keywords using Word2Vec and a Decision Tree. Experiments and results are described and discussed in Section 4. Finally, Section 5 presents the conclusion of the paper.
II. EXTRACTION OF KEYWORDS FROM TEXTUAL DOCUMENTS: A LITERATURE REVIEW
This section summarizes a collection of previous studies conducted in the last few years on keyword extraction and text classification.
A. The Textual Datasets
Many textual datasets are available for NLP, and in recent years interest in collecting data for such studies has increased. The investigators in [6] described Task 5 of the Workshop on Semantic Evaluation 2010 (SemEval-2010), which focuses on key-phrase extraction. The researchers compiled a set of 284 scientific articles with key-phrases carefully chosen by both their authors and readers. The dataset consists of trial, training and testing data drawn from conference and workshop papers in the ACM Digital Library. The papers ranged between six and eight pages and contained tables and pictures.
Also, in [7] the researchers collected 1,147,000 scientific abstracts covering different areas from arXiv, and then added the scientific documents present in the benchmark datasets comprising short abstracts (Inspec) and long scientific papers (SemEval-2010), which were later used to evaluate ranked keyword extraction.
In [8], the authors evaluated their algorithm and other baseline algorithms on 2,500 patent documents extracted from Google Patents.
B. Text Preprocessing operations
Text preprocessing is an important task and a basic step in many text mining and IR algorithms, and it is a fundamental part of any NLP system. Since characters, words, and sentences are identified at this stage, these major units are passed to all further processing stages. In [10], the authors present efficient preprocessing techniques that eliminate useless parts of a document such as prepositions, articles, and pronouns. These pre-processing techniques remove noise from text data, identify the root form of words, and reduce the size of the text data. Their objective was to analyze issues of preprocessing methods such as tokenization, stop-word removal and stemming for text documents.
In addition, the authors in [11] perform preprocessing on documents before classifying them. In preprocessing, stop words are removed and words are stemmed. In the researchers' view, stop-words should be removed from a text because they make the text heavier while being of little value for analysis.
Moreover, the authors in [12] applied preprocessing techniques to the input documents to present the text documents in a clean word format. The main steps taken, illustrated by the sketch after this list, are:
• Tokenization: A document is treated as a string and then partitioned into a list of tokens.
• Removing stop words: Stop words such as "the", "a", "and", etc. occur frequently, so these insignificant words need to be removed.
• Stemming: Applying a stemming algorithm that converts different word forms into a similar canonical form. This step is the process of conflating tokens to their root form, e.g. connection to connect and computing to compute.
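As a concrete illustration of these three steps (a minimal sketch, not the code of [12]), the following uses NLTK, the toolkit named later in Section IV; the sample sentence is a made-up input:

    # A minimal sketch of tokenization, stop-word removal and stemming with NLTK.
    # Requires one-time downloads: nltk.download('punkt'), nltk.download('stopwords').
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    text = "The connection between computing nodes is tested."  # toy input

    tokens = nltk.word_tokenize(text.lower())                        # tokenization
    stops = set(stopwords.words("english"))
    content = [t for t in tokens if t.isalpha() and t not in stops]  # stop-word removal
    stems = [PorterStemmer().stem(t) for t in content]               # stemming
    print(stems)  # e.g. ['connect', 'comput', 'node', 'test']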
C. Keywords Extraction
The International Encyclopedia of Information and Library Science [1] defines "keyword" as "A word that succinctly and accurately describes the subject, or an aspect of the subject, discussed in a document."
There are many techniques used to extract keywords. In this work, Word2Vec is used, a method that represents each word by a vector. The Word2Vec technique was created by a research team led by Tomas Mikolov at Google (2013) [13]. They proposed two new model architectures for learning distributed representations of words that minimize computational complexity, namely the Continuous Bag of Words (CBOW) and Skip-Gram (SG) models. Figure 1 illustrates the architectures of CBOW and SG.
Figure 1: The architectures of CBOW and SG
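To make the two architectures concrete, the following minimal sketch trains both with the gensim library (the library named in Section IV); the toy corpus and parameter values are illustrative assumptions, not the authors' settings:

    # Training CBOW and Skip-Gram word vectors with gensim (4.x API).
    from gensim.models import Word2Vec

    # Toy corpus: each document is a list of preprocessed tokens.
    sentences = [
        ["distributed", "systems", "consensus", "protocol"],
        ["information", "retrieval", "search", "ranking"],
    ]

    # sg=0 selects CBOW; sg=1 selects Skip-Gram.
    cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
    sg_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

    # Each vocabulary word is now represented by a dense vector.
    print(sg_model.wv["search"].shape)  # (100,)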
In addition, the authors in [14] present and discuss experiments on sentiment analysis of Twitter posts regarding United States (U.S.) airline companies. Their study aims to determine
whether word embeddings created with the word2vec algorithm can be used to classify sentiment. Their dataset was acquired from Kaggle.com and contains over 14,000 tweets about users' airline experiences, with 15 attributes including the original tweet text, Twitter user-related data, and the class sentiment label.
Furthermore, in the article by J. Lindén, S. Forsström, and T. Zhang [9], they present a combination of the paragraph vector algorithms Distributed Memory and Distributed Bag of Words with four classification algorithms, namely Decision Tree, Random Forest, Multi-Layer Perceptron (MLP) and Long Short-Term Memory (LSTM), to evaluate critical parameter modifications of the mentioned classification algorithms, with the aim of categorizing news articles.
D. Text Classification Algorithms
The aim of text classification is to classify text documents into a definite number of pre-defined classes. In classification, there are key issues such as handling a large number of features, dealing with unstructured text documents, and choosing a machine learning technique suitable for the text classification application.
The authors in [11] applied text mining algorithms to extract keywords from journal papers using TF-IDF and the WordNet thesaurus. The TF-IDF algorithm is used to select the candidate words, while WordNet, a lexical database of English, is used to find the similarity among the candidate words. Documents are then classified based on the extracted keywords using the machine learning algorithms NB, Decision Tree and KNN. The Decision Tree algorithm gives better results based on prediction accuracy when compared to the NB and KNN algorithms, with an accuracy of 98.47%.
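To illustrate the candidate-selection step described in [11] (a sketch, not the cited authors' implementation), TF-IDF scores can be computed with scikit-learn; the two toy documents are assumptions:

    # Ranking candidate keywords of one document by TF-IDF score.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["distributed agents coordinate in multi agent systems",
            "search engines rank documents for information retrieval"]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)

    # Top-scoring terms of the first document become candidate keywords.
    terms = vectorizer.get_feature_names_out()
    scores = tfidf.toarray()[0]
    print(sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)[:5])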
Wongkot Sriurai in his research [15] compared the Bag-of-Words (BOW) feature processing technique with the topic model. Text categorization algorithms such as NB, SVM and Decision Tree were used for experimentation, and precision, recall and the F1 measure were used for evaluating the text classification. The results showed that the topic-model approach to representing the documents yields the best performance, with an F1 measure of 79%, an improvement of 11.1% over the BOW model.
III. PROPOSED APPROACH FOR EXTRACTING KEYWORDS AND
TEXTUAL CLASSIFICATION
In this section, we present the proposed method of using the Word2Vec technique in combination with a Decision Tree classifier to extract keywords from textual documents. The architecture of the proposed method consists of three phases: (1) a preprocessing phase; (2) a keywords extraction phase with Word2Vec; (3) documents classification using a Decision Tree. We describe these three phases in the following subsections.
A. Pre-processing phase
Preprocessing operations are applied to the dataset before feeding it to the second phase. Their importance comes from the fact that they make the data more focused and clearer, which makes it easier to select keywords and place them into the correct categories to which they belong. The following operations are performed:
• Tokenization: the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The aim of tokenization is to expose the words in a sentence.
• Stop words elimination: Many words are repeated frequently in documents but are basically meaningless, as they only link words together in a sentence. Due to their high occurrence, their presence in the text extraction process is an obstacle to understanding the content of documents. Stop words are often common words like "and", "she", "this", etc. They are not helpful in classifying documents, so they must be eliminated.
• Stemming: the process of conflating the variant forms of a word into a common representation. In this work, three different stemming algorithms are used (compared in the sketch after this list):
i. English Porter stemming: used due to its accuracy and simplicity. It is designed for the English language and is based on the idea that the suffixes of words are frequently made up of a combination of smaller and simpler suffixes. If a suffix rule matches a word, then the conditions attached to that rule are tested and the stem is obtained by removing the suffix [16].
ii. Paice-Husk stemmer (Lancaster stemmer): an iterative stemmer that removes the endings from a word in an indefinite number of steps. It uses a separate rule file, which is first read into an array or list; this file is divided into a series of sections, each section corresponding to a letter of the alphabet [16] [17].
iii. WordNet Lemmatizer: lemmatization is the process of converting a word into its basic form. The difference between stemming and lemmatization is that the latter takes the context into account and converts the word into its meaningful base form, while the former removes only the last few letters, often leading to incorrect meanings and misspellings.
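The sketch below contrasts the three normalizers using their NLTK implementations; the example words are illustrative, and the printed forms are typical rather than guaranteed outputs:

    # Comparing Porter, Paice-Husk (Lancaster) and WordNet-based normalization.
    # Requires a one-time download for the lemmatizer: nltk.download('wordnet').
    from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

    porter = PorterStemmer()
    lancaster = LancasterStemmer()    # Paice-Husk, iterative and more aggressive
    lemmatizer = WordNetLemmatizer()  # returns dictionary forms, e.g. studies -> study

    for word in ["connection", "computing", "studies"]:
        print(word, porter.stem(word), lancaster.stem(word),
              lemmatizer.lemmatize(word))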
[Figure 3 schema, summarized: Preprocessing phase (tokenizing, removing stop-words, stemming) -> Keywords Extraction phase (collecting textual documents, calculating the most frequent keywords for each category, representing keywords using vectors) -> Classification phase (classifying the documents).]
B. Keywords Extraction phase
In this work, following the preprocessing stage, Word2Vec is used with its two architectures, SG and CBOW. The most frequent keywords are extracted for each category, and each word in every document is represented by a vector.
In the first stage, we collect the textual documents for each category: all training and testing documents for each class are merged and grouped into a single text file, then passed to the Word2Vec model for training. Then, 15 words from every document and each category are selected to be passed to the subsequent similarity operations.
CBOW takes the context of each word as input and tries to predict the word corresponding to the context. Its training complexity is shown in equation (1) [13]:

Q = N × D + D × log2(V)    (1)

where N is the size of the hidden layer, V is the vocabulary size, and D is the dimensionality of the word representations.
The SG model is the opposite of the CBOW model. The training complexity of this architecture is proportional to equation (2) [13]:

Q = C × (D + D × log2(V))    (2)

where C is the maximum distance of the words, V is the vocabulary size, and D is the dimensionality of the word representations.
The second stage is calculating the most frequent keywords in each document for each category; the 15 candidate words with the highest frequency are then filtered for similarity and affinity calculations. Each word is represented by a vector.
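A minimal sketch of this frequency step (illustrative, not the authors' exact code), selecting up to 15 top candidates from a preprocessed token list:

    # Selecting the most frequent candidate keywords of one document.
    from collections import Counter

    tokens = ["agent", "system", "agent", "protocol", "system", "agent"]  # toy input
    candidates = [w for w, _ in Counter(tokens).most_common(15)]
    print(candidates)  # ['agent', 'system', 'protocol']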
At the final stage, the cosine similarity is calculated between the candidate keywords from the first stage and the candidate keywords from the second one. Word2Vec generates two numerical vectors X and Y for two different words; the cosine similarity between the two words is defined as the normalized dot product of X and Y, as shown in equation (3) [18]:

cos(X, Y) = (X · Y) / (||X|| ||Y||)    (3)
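Equation (3) can be computed directly on two Word2Vec vectors, as in the following minimal NumPy sketch; gensim exposes the same measure as model.wv.similarity(word1, word2):

    # Cosine similarity: the normalized dot product of two word vectors.
    import numpy as np

    def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

    print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707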
C. Documents classification using Decision Tree
Once the keyword extraction stage has taken place, the similarity measure is computed between the nominated words from each document, a file is created for each classification, and the extracted data is presented in the form of a five-scale vector; the Decision Tree is then able to determine the membership of each document in the correct classification.
The target dataset has been divided into 60% as a training set and 40% as a testing set. To choose the optimal attribute for splitting the data, two measures were used, namely Information Gain and the Gini index.
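A minimal sketch of this classification step with scikit-learn (an assumption about tooling; the feature rows and labels below are hypothetical five-scale similarity vectors and ACM category labels, not the paper's data):

    # Decision Tree on similarity-score vectors with a 60/40 train/test split.
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X = [[0.81, 0.12, 0.30, 0.25, 0.44],   # one similarity vector per document
         [0.10, 0.77, 0.41, 0.22, 0.35],
         [0.15, 0.70, 0.38, 0.30, 0.31],
         [0.78, 0.20, 0.28, 0.27, 0.40]]
    y = ["C2.4", "H3.3", "H3.3", "C2.4"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=0)

    # criterion="gini" uses the Gini index; criterion="entropy" uses Information Gain.
    clf = DecisionTreeClassifier(criterion="gini").fit(X_train, y_train)
    print(clf.predict(X_test))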
The Decision Tree of CBOW with the WordNet Lemmatizer at the fourth scale using Entropy is presented in Figure 2.
Figure 2: The Decision Tree of CBOW with the WordNet Lemmatizer
The architecture of the proposed method is summarized by the schema in Figure 3.
Figure 3: The architecture of the proposed method
IV. EXPERIMENTS AND RESULTS
This section first describes the input corpus and the tools used for implementing the proposed keyword extraction approach and measuring its performance. It then presents and discusses the results of the experiments.
A. Input corpus
To test the proposed approach, the SemEval (2010) dataset is used. It covers four different research areas, ensuring a variety of topics, which relate to the following 1998 ACM classifications: C2.4 Distributed Systems; H3.3 Information Search and Retrieval; I2.11 Distributed Artificial Intelligence - Multi-Agent Systems; and J4 Social and Behavioural Sciences - Economics. The trial, training and testing datasets covered the four categories and were provided with 40, 144, and 100 articles, respectively.
B. Used tools
To apply the preprocessing operations to the mentioned data, the Natural Language Toolkit (NLTK) tokenizer is used. To extract keywords with Word2Vec, the gensim library for Python is utilized.
C. Result evaluation
This section presents the measurement metrics used for evaluating the proposed approach: precision, recall and F1-score. These three metrics are commonly used to evaluate the performance of information retrieval and natural language processing tools.
Precision (P) is the number of correct results divided by the number of all returned results, as shown in equation (4):

P = TP / (TP + FP)    (4)

Recall (R) is the number of correct results divided by the number of results that should have been returned, as shown in equation (5):

R = TP / (TP + FN)    (5)

where TP, FP and FN denote true positives, false positives and false negatives, respectively. The F-measure is defined as the harmonic mean of precision (P) and recall (R), as shown in equation (6):

F = 2 × P × R / (P + R)    (6)
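Equations (4)-(6) correspond to standard library metrics; a minimal sketch with scikit-learn on hypothetical true/predicted labels:

    # Macro-averaged precision, recall and F-score over category labels.
    from sklearn.metrics import precision_recall_fscore_support

    y_true = ["C2.4", "H3.3", "I2.11", "J4", "H3.3"]
    y_pred = ["C2.4", "H3.3", "H3.3", "J4", "H3.3"]

    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    print(f"precision={p:.2f} recall={r:.2f} f-score={f:.2f}")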
D. The Results of Word2Vec CBOW
Using the Gini method with the WordNet Lemmatizer, the F-score was 64%. The Entropy method with the same measure achieved a 59% F-score, and the highest percentage using the Entropy method was a 62% F-score, achieved by English Porter Stemming with the fourth scale.
The lowest percentage was obtained with the Paice-Husk stemmer at scale three, with an F-score of 22% using the Gini Index method.
With regard to the combined keywords from authors and readers, the fourth measure achieved the highest F-score of 57% with the Entropy method.
Figure 4 shows the confusion matrix of the highest obtained ratio using the Gini Index and CBOW with the WordNet Lemmatizer. The F-score of 0.49 for label I contributed to lowering the overall score.
Figure 4: The confusion matrix of the highest F-score, CBOW with Gini Index (WordNet Lemmatizer)
E. The Results of Word2Vec Skip Gram
The highest average score for English Porter Stemming at the second scale was 82% with the Gini Index; the same score was likewise achieved at the third scale using the Entropy method. The lowest average score, for keywords nominated jointly by authors and readers, was 52% using the Gini Index method with scale five.
As for the WordNet Lemmatizer, it achieved 78% using both the Gini Index and Entropy methods with the third scale. The Paice-Husk stemmer achieved its highest ratio with the fourth and fifth measures using the Gini Index and Entropy methods, with a value of 76%.
Using the Skip-Gram algorithm, a general improvement in the results is noted, reaching an F-score of 82%, with a significant improvement in the classification performance for the H label. Figures 5 and 6 illustrate the confusion matrices of the highest average scores.
Figure 5: The confusion matrix of the highest F-score, SG with Gini Index (English Porter Stemming)
Figure 6: The confusion matrix of the highest F-score, SG with Entropy (English Porter Stemming)
V. CONCLUSIONS
This paper discussed a method for extracting keywords from text documents and classifying these documents using both Word2Vec and a Decision Tree. The Word2Vec model is used to obtain the keywords, providing a word-similarity mechanism that measures the closeness between words. The Decision Tree is then used to find the correct classification for the target document. To evaluate the performance of the proposed method, the precision, recall and F-score values were computed.
While the results varied across the five measures used and between the CBOW and SG techniques, the SG method proved its effectiveness with the Decision Tree in determining the correct classifications of documents, with an F-score exceeding 80%.
Comparing the obtained results with previous studies, the proposed method proved its effectiveness in finding the correct classification of documents and outperformed its counterparts using the same keywords.
REFERENCES
[1] S. Siddiqi and A. Sharan, "Keyword and keyphrase extraction techniques: a literature review," International Journal of Computer Applications, vol. 109, no. 2, 2015.
[2] M. Allahyari et al., "A brief survey of text mining: Classification, clustering and extraction techniques," arXiv preprint arXiv:1707.02919v2, 2017.
[3] V. Gupta and G. S. Lehal, "A survey of text mining techniques and applications," Journal of Emerging Technologies in Web Intelligence, vol. 1, no. 1, pp. 60-76, 2009.
[4] A. K. S. Tilve and S. N. Jain, "A survey on machine learning techniques for text classification," International Journal of Engineering Sciences & Research Technology, 2017.
[5] S. K. Bharti and K. S. Babu, "Automatic keyword extraction for text summarization: A survey," European Journal of Advances in Engineering and Technology, 2017.
[6] S. N. Kim, O. Medelyan, M.-Y. Kan, and T. Baldwin, "SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles," in Proceedings of the 5th International Workshop on Semantic Evaluation, 2010, pp. 21-26.
[7] D. Mahata, R. R. Shah, J. Kuriakose, R. Zimmermann, and J. R. Talburt, "Theme-weighted ranking of keywords from text documents using phrase embeddings," in 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2018, pp. 184-189.
[8] J. Hu, S. Li, Y. Yao, L. Yu, G. Yang, and J. Hu, "Patent keyword extraction algorithm based on distributed representation for patent classification," Entropy, vol. 20, no. 2, p. 104, 2018.
[9] J. Lindén, S. Forsström, and T. Zhang, "Evaluating combinations of classification algorithms and paragraph vectors for news article classification," in 2018 Federated Conference on Computer Science and Information Systems (FedCSIS), 2018, pp. 489-495.
[10] S. Kannan and V. Gurusamy, "Preprocessing techniques for text mining," International Journal of Computer Science & Communication Networks, vol. 5, no. 1, pp. 7-16, 2014.
[11] S. Menaka and N. Radha, "Text classification using keyword extraction technique," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 12, 2013.
[12] M. Mowafy, A. Rezk, and H. El-Bakry, "An efficient classification model for unstructured text document," American Journal of Computer Science and Information Technology, vol. 6, no. 1, p. 16, 2018.
[13] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781v3, 2013.
[14] J. Acosta, N. Lamaute, M. Luo, E. Finkelstein, and Andreea, "Sentiment analysis of Twitter messages using Word2Vec," Proceedings of Student-Faculty Research Day, CSIS, Pace University, p. 7, 2017.
[15] W. Sriurai, "Improving text categorization by using a topic model," Advanced Computing: An International Journal, vol. 2, no. 6, p. 21, 2011.
[16] M. S. Kumar and K. Murthy, "Corpus based statistical approach for stemming Telugu," Creation of Lexical Resources for Indian Language Computing Processing, C-DAC, Mumbai, India, 2007.
[17] N. Giridhar, K. Prema, N. S. Reddy, and P. Subba, "A prospective study of stemming algorithms for web text mining," Ganpat University Journal of Engineering Technology, vol. 1, pp. 28-34, 2011.
[18] L. Ma, "A multi-label text classification framework: Using supervised and unsupervised feature selection strategy," Bonfring International Journal of Data Mining, 2017.