An Investigation of Keywords Extraction from
Textual Documents using Word2Vec and Decision
Tree
Hawa Benghuzzi
Department of Computer Science
Faculty of Information Technology, Misurata University
Misurata – Libya
H.Benghuzzi@it.misuratau.edu.ly
Mohammed M. Elsheh
Department of Computer Science
Faculty of Information Technology, Misurata University
Misurata - Libya
m.elsheh@it.misuratau.edu.ly
Abstract- In recent years the volume of digital data has grown
dramatically, and knowledge discovery and data mining have
attracted immense attention because of the need to turn such
data into useful information and knowledge. Keyword extraction
is considered an essential task in natural language processing
(NLP) that facilitates mapping of documents to a concise set of
representative single- and multi-word phrases. This paper
investigates the use of Word2Vec and Decision Tree for keyword
extraction from textual documents. The SemEval-2010 dataset is
used as the main input for the proposed study. After
pre-processing operations are applied to the dataset, the words
are represented by vectors using the Word2Vec technique. The
method is based on word similarity between candidate keywords
drawn from the keywords collected for each label and from one
sample of the same label. An appropriate threshold was
determined, and the percentages that exceed this threshold are
passed to the Decision Tree in order to decide the appropriate
classification of the text document. Some similarity
measurements were used for the classification process. The
efficiency and accuracy of the algorithm were measured in the
classification process using precision, recall and F-score. The
obtained results indicate that using a vector representation
for each keyword is an effective way to identify the most
similar words, which increases the chance of recognizing the
correct classification of the document. With Word2Vec CBOW, the
F-score was 64% with the Gini method and WordNet Lemmatizer.
With Word2Vec SG, the F-score was 82% with Gini Index and
English Porter Stemming, which was the highest value across all
our experiments.
Keywords- Text Classification; Keywords Extraction; Word2Vec;
Decision Tree; Text Mining.
I. INTRODUCTION
Nowadays, the volume of electronic documents is growing daily
at a massive rate. At the same time, we need to search quickly
through these large amounts of textual information to find
documents related to our interests [1]. Unstructured data takes
diverse forms, and text is a good example: it is one of the
simplest forms of data and can be generated in most scenarios.
Humans can easily process and perceive unstructured text, but
it is harder for machines to understand. As a result, there is
a pressing need to design methods and algorithms that
effectively process this flood of text in a broad set of
applications [2]. Moreover, this growth in electronic textual
documents has created the need for text mining, the task of
extracting meaningful information from text, which has gained
more importance recently [1].
Text mining differs from web search. In web search, the user is
typically looking for something that is already known and has
been written by someone else. The problem lies in pushing aside
all the material that is not currently relevant to the user's
needs in order to find the relevant information. In text
mining, the objective is to discover unknown information,
something that no one yet knows and so could not yet have
written down [3]. Text mining involves a set of approaches such
as text summarization, unsupervised learning methods and
supervised learning methods.
Keyword extraction can likewise be carried out by many
approaches, such as supervised and unsupervised machine
learning, statistical methods and linguistic ones.
Text Classification (TC) is the task of automatically sorting a
set of documents into categories from a predefined set. It is
an important part of text mining and falls under supervised
machine learning methods [4].
The keyword extraction phase comes before text classification.
Keywords are a subset of words that carry the most important
information about the content of the document. Keyword
extraction is the process of selecting words from the text
document that are likely to carry valuable information about
the document, without any human intervention, depending on the
model [5].
Basically, TC involves two stages, namely a training stage and
a testing stage. In the former stage, documents are
preprocessed and used by a learning algorithm to
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 18, No. 5, May 2020
13 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
generate the classifier. In the latter stage, the classifier is
used for prediction. Using supervised learning algorithms [3],
the objective is to learn classifiers from known examples
(labelled documents) and to perform the classification
automatically on unknown examples (unlabelled documents). There
are many traditional learning algorithms for training on the
data, such as Decision Trees, Naive Bayes (NB), Support Vector
Machines (SVM), k-Nearest Neighbour (KNN) and Neural Networks
(NNet).
The remainder of this paper is organized as follows: Section II
presents some related research works that deal with the
problem of keyword extraction. Section III presents our
proposed approach for extracting keywords using Word2Vec and
Decision Tree. Experiments and results are described and
discussed in Section IV. Finally, Section V presents the
conclusion of the paper.
II. EXTRACTION OF KEYWORDS FROM TEXTUAL DOCUMENTS: A
LITERATURE REVIEW
This section summarizes a collection of previous studies
conducted in the last few years on keyword extraction and text
classification.
A. The Textual Datasets
Many textual datasets are available for NLP, and interest in
collecting data for such studies has increased in recent years.
The investigators in [6] described Task 5 of the Workshop on
Semantic Evaluation 2010 (SemEval-2010), which focuses on
key-phrase extraction. The researchers compiled a set of 284
scientific articles with key-phrases carefully chosen by both
their authors and readers. The dataset consists of trial,
training and testing data drawn from conference and workshop
papers in the ACM Digital Library. The papers range between six
and eight pages in length and contain tables and pictures.
Also, in [7] the researchers collected 1,147,000 scientific
abstracts from different areas on arXiv, then added the
scientific documents from the benchmark datasets, comprising
short abstracts (Inspec) and long scientific papers
(SemEval-2010), which were later used to evaluate the ranking
of extracted keywords.
In [8], the authors evaluated their algorithm and other
baseline algorithms over 2500 patent documents extracted from
Google Patents.
B. Text Preprocessing operations
Text preprocessing is an important task and a basic step in
many text mining and information retrieval (IR) algorithms, and
it is a fundamental part of any NLP system. The characters,
words, and sentences identified at this stage are the basic
units passed to all further processing stages. In [10], the
authors present efficient preprocessing techniques that
eliminate uninformative parts of a document such as
prepositions, articles, and pronouns. These techniques
eliminate noise from text data, identify the root of each word
and reduce the size of the text data. Their objective was to
analyze the issues of preprocessing methods such as
tokenization, stop-word removal and stemming for text
documents.
In addition, the authors in [11] preprocess documents before
classifying them: stop words are removed and the remaining
words are stemmed. In their view, stop words should be removed
from a text because they make the text look heavier and are
less important for analysts.
Moreover, the authors in [12] applied preprocessing techniques
to the input documents to present the text documents in a clean
word format. The most common steps are:
• Tokenization: A document is treated as a string, and
then partitioned into a list of tokens.
• Removing stop words: Stop words such as “the”, “a”,
“and”, etc. are frequently occurring, so the
insignificant words need to be removed.
• Stemming word: Applying the stemming algorithm
that converts different word forms into a similar
canonical form. This step is the process of conflating
tokens to their root form, e.g. connection to connect
and computing to compute.
C. Keywords Extraction
The International Encyclopedia of Information and Library
Science [1] defines “keyword” as “A word that succinctly and
accurately describes the subject, or an aspect of the subject,
discussed in a document.”
Many techniques are used to extract keywords. This work uses
Word2Vec, a method that represents each word by a vector. The
Word2Vec technique was created by a research team led by Tomas
Mikolov at Google in 2013 [13]. They proposed two new model
architectures for learning distributed representations of words
that minimize computational complexity, namely the Continuous
Bag-of-Words (CBOW) and Skip-gram (SG) models. Figure 1
illustrates the architecture of CBOW and SG:
Figure 1: The architecture of CBOW and SG
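The difference between the two architectures can be made concrete by looking at the training examples each one derives from a sentence. The following is a minimal sketch in plain Python, not the original Word2Vec implementation; the function names and the `window` parameter are illustrative assumptions.

```python
def context_windows(tokens, window=2):
    """Yield (center, context_words) for every position in the token list."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


def cbow_examples(tokens, window=2):
    # CBOW: the whole context is the input, the center word is the target.
    return [(tuple(ctx), center)
            for center, ctx in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: the center word is the input, each context word is a target.
    return [(center, word)
            for center, ctx in context_windows(tokens, window)
            for word in ctx]


sentence = ["keyword", "extraction", "from", "text"]
cbow = cbow_examples(sentence, window=1)
sg = skipgram_examples(sentence, window=1)
```

With a window of one word, CBOW produces one example per position (context in, center word out), whereas Skip-gram produces one example per (center, context word) pair.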
In addition, the authors in [14] present and discuss experiments
on sentiment analysis of Twitter posts about United
States (U.S.) airline companies. Their study aims to determine
whether word embeddings created with the Word2Vec algorithm can
be used to classify sentiment. Their dataset was acquired from
Kaggle.com and contains over 14,000 tweets about users' airline
experiences and 15 attributes, including the original tweet
text, Twitter user-related data, and the class sentiment label.
Furthermore, J. Lindén, S. Forsström, and T. Zhang [9] combine
the paragraph vector algorithms Distributed Memory and
Distributed Bag of Words with four classification algorithms,
namely Decision Tree, Random Forest, Multi-Layer Perceptron
(MLP) and Long Short-Term Memory (LSTM), to evaluate critical
parameter modifications of these classification algorithms,
with the aim of categorizing news articles.
D. Text Classification Algorithms
The aim of text classification is to assign text documents to a
fixed number of pre-defined classes. Key issues in
classification include handling a large number of features,
dealing with unstructured text documents, and choosing a
machine learning technique suitable for the text classification
application.
The authors in [11] applied text mining algorithms to extract
keywords from journal papers using TF-IDF and the WordNet
thesaurus. The TF-IDF algorithm is used to select the candidate
words, while WordNet, a lexical database of English, is used to
find similarity among the candidate words. Documents are then
classified based on the extracted keywords using the machine
learning algorithms NB, Decision Tree and KNN. The Decision
Tree algorithm gave better prediction accuracy than the NB and
KNN algorithms, with an accuracy of 98.47%.
Wongkot Sriurai [15] compared the Bag-of-Words (BOW) feature
processing technique with the topic model. Text categorization
algorithms such as NB, SVM and Decision Tree were used for
experimentation, and precision, recall and the F1 measure were
used to evaluate the text classification. The results showed
that the topic-model approach to representing documents yielded
the best performance, with an F1 measure of 79%, an improvement
of 11.1% over the BOW model.
III. PROPOSED APPROACH FOR EXTRACTING KEYWORDS AND
TEXTUAL CLASSIFICATION
In this section, we present the proposed method, which uses the
Word2Vec technique in combination with a Decision Tree
classifier to extract keywords from textual documents. The
architecture of the proposed method consists of three phases:
(1) Preprocessing phase; (2) Keywords extraction phase with
Word2Vec; (3) Documents classification using Decision Tree.
We describe these three phases in the following subsections.
A. Pre-processing phase
Preprocessing operations are applied to the dataset before
feeding it to the second phase. Preprocessing is important
because it makes the data more focused and clearer, which makes
it easier to select keywords and place them into the correct
categories to which they belong. The following operations are
performed:
• Tokenization: is the process of breaking a stream of
text into words, phrases, symbols, or other
meaningful elements called tokens. The aim of the
tokenization is the exploration of the words in a
sentence.
• Stop words elimination: Many words are repeated
frequently in documents but carry little meaning,
since they are used mainly to link words together
in a sentence. Because of their high occurrence,
their presence in the text extraction process is an
obstacle to understanding the content of documents.
Stop words are typically common words such as
"and", "she", "this", etc. They are not helpful in
classifying documents, so they must be eliminated.
• Stemming: It is the process of conflating the variant
forms of a word into a common representation. In
this work, three different stemming algorithms are
used:
i. English Porter stemming: It is used due to its
accuracy and simplicity. It is designed for
English language and based on the idea
that suffixes of words are frequently made
up of a combination of smaller and
simpler suffixes. If a suffix rule matches a
word, then the conditions attached to that
rule are tested and the stem is obtained by
removing the suffix [16].
ii. Paice-Husk Stemmer (Lancaster Stemmer):
It is an iterative stemmer. It removes the
endings from a word in an indefinite
number of steps. It uses a separate rule
file, which is first read into an array or list.
Then this file is divided into a series of
sections, each section corresponding to a
letter of the alphabet [16] [17].
iii. WordNet Lemmatizer: Lemmatization is the
process of converting a word into its basic
form. The difference between stemming
and lemmatization is that, the latter takes
the context into account and converts the
word into its meaningful basic form,
while the former removes only the last
few letters, often leading to incorrect
meanings and misspellings.
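The three preprocessing operations above can be sketched as a small pipeline. This is an illustrative sketch using only the Python standard library: the stop-word list is a deliberately tiny hypothetical one, and `toy_stem` is a simplified suffix stripper that stands in for the Porter, Paice-Husk or WordNet components, which it does not reproduce.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a
# full list such as the one shipped with an NLP toolkit.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "she", "the",
              "this", "to"}

# Toy suffix rules for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ion", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


tokens = preprocess("She is extracting the keywords of this document.")
# tokens == ["extract", "keyword", "document"]
```

In practice the stemming step would be replaced by one of the three algorithms listed above; only the order of the operations is meant to carry over.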
[Figure 3 diagram text: Collecting textual documents; Preprocessing phase: tokenizing, removing stop-words, stemming; Keywords Extraction phase: calculating most frequent keywords for each category, representing keywords using vectors; Classification phase: classifying the documents]
B. Keywords Extraction phase
In this work, following the preprocessing stage, Word2Vec is
used with its two architectures, SG and CBOW. The most frequent
keywords are extracted for each category, and each word in
every document is represented by a vector.
In the first stage, we collect the textual documents for each
category: all training and testing documents for each class are
merged into a single text file, which is then passed to the
Word2Vec model for training. Then, 15 words from every document
and each category are selected and passed to the subsequent
similarity operations.
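CBOW takes the context of each word as input and tries to
predict the word corresponding to that context. Its training
complexity is given in equation (1) [13]:
$$Q = N \times D + D \times \log_2(V) \qquad (1)$$
where N is the number of context words used as input, V is the
vocabulary size, and D is the dimensionality of the word
representations.
The SG model is the opposite of the CBOW model: it takes a word
as input and tries to predict the words in its context. The
training complexity of this architecture is proportional to the
expression in equation (2) [13]:
$$Q = C \times (D + D \times \log_2(V)) \qquad (2)$$
where C is the maximum distance of the words, V is the
vocabulary size, and D is the dimensionality of the word
representations.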
The second stage calculates the most frequent keywords in each
document for each category; the 15 candidate words with the
highest frequency are then selected for the similarity and
affinity calculations. Each word is represented by a vector.
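Selecting the most frequent candidate words can be sketched with a simple frequency count. The snippet below is a minimal illustration; the function name `top_candidates` and the sample tokens are assumptions, and the default `k=15` mirrors the number of candidate words stated above.

```python
from collections import Counter


def top_candidates(tokens, k=15):
    """Return the k most frequent tokens, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(k)]


# Hypothetical preprocessed tokens of one document.
doc_tokens = ["agent", "system", "agent", "query", "agent", "system"]
candidates = top_candidates(doc_tokens, k=2)
# candidates == ["agent", "system"]
```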
In the final stage, the cosine similarity is calculated between
the candidate keywords from the first stage and the candidate
keywords from the second one. Given the two numerical vectors X
and Y that Word2Vec generates for two different words, the
cosine similarity between the two words is defined as the
normalized dot product of X and Y, as shown in equation (3)
[18]:
$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} \qquad (3)$$
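The cosine similarity described here, the normalized dot product of two word vectors, can be computed as in the following sketch. It uses plain Python lists in place of Word2Vec output, and returning 0.0 for a zero vector is a convention chosen for this example rather than something stated in the paper.

```python
import math


def cosine_similarity(x, y):
    """Normalized dot product of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_x * norm_y)


# Hypothetical 2-dimensional word vectors.
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0, which is what allows a threshold on the similarity value to separate related from unrelated candidate words.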
C. Documents classification using Decision Tree
Once keyword extraction has taken place, the similarity between
the candidate words from each document is measured, a file is
created for each class, and the extracted data are presented in
the form of a five-scale vector. The Decision Tree is then able
to assign each document to its correct class.
The target dataset was divided into a 60% training set and a
40% testing set. To choose the best attribute on which to split
the data, two measures were used, namely Information Gain and
Gini index.
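The two splitting measures can be illustrated with their standard textbook definitions. The sketch below assumes non-empty label lists and a binary split; the function names and the sample labels are illustrative and are not taken from the experiments.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy, in bits, of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical node holding two documents of class C and two of class H.
parent = ["C", "C", "H", "H"]
node_gini = gini(parent)                                      # 0.5
gain = information_gain(parent, ["C", "C"], ["H", "H"])       # 1.0
```

A split that separates the two classes perfectly removes all entropy from the children, so its information gain equals the entropy of the parent node.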
The Decision Tree of CBOW – WordNet Lemmatizer with the
fourth scale using Entropy is presented in Figure 2.
Figure 2: The Decision Tree of CBOW – WordNet Lemmatizer
The architecture of the proposed method is summarized by the
schema in Figure 3.
Figure 3: The architecture of the proposed method
IV. EXPERIMENTS AND RESULTS
This section first describes the input corpus and the tools
used to implement the proposed keyword extraction approach and
measure its performance. Second, it presents and discusses the
results of the experiments.
A. Input corpus
To test the proposed approach, the SemEval-2010 dataset is
used. It covers four different research areas, to ensure a
variety of topics, corresponding to the following 1998 ACM
classifications: C2.4 Distributed Systems; H3.3 Information
Search and Retrieval; I2.11 Distributed Artificial Intelligence
- Multi-Agent Systems; and J4 Social and Behavioural Sciences -
Economics. The three datasets (trial, training and testing)
span these four categories and contain 40, 144, and 100
articles, respectively.
B. Used tools
To apply the preprocessing operations to the data, the Natural
Language Toolkit (NLTK), including its tokenizer, is used. To
extract keywords with Word2Vec, the gensim library for Python
is used.
C. Result evaluation
This section presents the metrics used to evaluate the proposed
approach: precision, recall and F1-score. These three metrics
are commonly used to evaluate the performance of information
retrieval and natural language processing tools.
Precision (P) is the number of correct results divided by the
number of all returned results, as shown in equation (4):
$$P = \frac{\text{number of correct results}}{\text{number of all returned results}} \qquad (4)$$
Recall (R) is the number of correct results divided by the
number of results that should have been returned, as shown in
equation (5):
$$R = \frac{\text{number of correct results}}{\text{number of results that should have been returned}} \qquad (5)$$
The F-measure is defined as the harmonic mean of precision (P)
and recall (R), as shown in equation (6):
$$F = \frac{2 \times P \times R}{P + R} \qquad (6)$$
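The three metrics can be computed per class from the true and predicted labels, as in this sketch. The label lists are hypothetical and do not come from the experiments, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall_f1(true_labels, predicted_labels, target):
    """Per-class precision, recall and F1 for the class `target`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels for five documents.
y_true = ["C", "C", "H", "H", "I"]
y_pred = ["C", "H", "H", "H", "C"]
p, r, f = precision_recall_f1(y_true, y_pred, "H")
# For class H: two correct out of three returned, two out of two expected.
```

For class H in this example, precision is 2/3, recall is 1.0 and the F-score is 0.8; averaging the per-class values over all labels gives an overall score.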
D. The Results of Word2Vec CBOW
Using the Gini method with WordNet Lemmatizer, the F-score was
64%. With the Entropy method and the same measure, the F-score
was 59%, and the highest F-score with the Entropy method was
62%, achieved by English Porter Stemming with the fourth scale.
The lowest F-score, 22%, was obtained with the Paice-Husk
stemmer at scale three using the Gini Index method.
For the combined keywords from authors and readers, the fourth
measure achieved the highest F-score, 57%, with the Entropy
method.
Figure 4 shows the confusion matrix of the highest ratio
obtained using Gini Index and CBOW with WordNet Lemmatizer. The
low F-score for label I (reported as 0.49%) contributed to
lowering the overall score.
Figure 4: The confusion matrix of the highest F-score, CBOW,
Gini Index (WordNet Lemmatizer)
E. The Results of Word2Vec Skip Gram
The highest average score, 82%, was obtained by English Porter
Stemming with the second scale using Gini Index. It achieved
the same score with the third scale using the Entropy method.
The lowest average score for the combined keywords chosen by
authors and readers was 52%, using the Gini Index method with
scale five.
WordNet Lemmatizer achieved 78% using both the Gini Index and
Entropy methods with the third scale. The Paice-Husk stemmer
achieved its highest score, 76%, with the fourth and fifth
measures using the Gini Index and Entropy methods.
With the Skip-gram algorithm, a general improvement in the
results is noted, with an F-score of up to 82%. There is a
significant improvement in classification performance for the H
label. Figures 5 and 6 illustrate the confusion matrices of the
highest average scores.
Figure 5: The confusion matrix of the highest F-score, SG, Gini
Index (English Porter Stemming)
Figure 6: The confusion matrix of highest f-score sg entropy (English
porter stemming)
V. CONCLUSIONS
The paper discussed a method for extracting keywords from text
documents and classifying these documents using both Word2Vec
and Decision Tree. The Word2Vec model is used to obtain the
keywords; it provides word similarity measures that identify
how close words are to each other. The Decision Tree is used to
find the correct classification for the target document. To
evaluate the performance of the proposed method, precision,
recall and F-score values were computed.
The results varied across the five measures and between the
CBOW and SG techniques; the SG method proved effective with
Decision Tree in determining the correct classification of
documents, with an F-score exceeding 80%.
Compared with the results of previous studies, the proposed
method proved effective in finding the correct classification
of documents and outperformed its counterpart using the same
keywords.
REFERENCES
[1] S. Siddiqi and A. Sharan, "Keyword and keyphrase extraction
techniques: a literature review," International Journal of
Computer Applications, vol. 109, no. 2, 2015.
[2] M. Allahyari et al., "A brief survey of text mining: Classification,
clustering and extraction techniques," arXiv preprint
arXiv:1707.02919v2, 2017.
[3] V. Gupta and G. S. Lehal, "A survey of text mining techniques
and applications," Journal of Emerging Technologies in Web
Intelligence, vol. 1, no. 1, pp. 60-76, 2009.
[4] A. K. S. Tilve and S. N. Jain, "A survey on machine learning
techniques for text classification," International Journal of
Engineering Sciences & Research Technology, 2017.
[5] . K. Bharti and K. S. Babu, "Automatic keyword extraction
for text summarization: A survey," European Journal of Advances in
Engineering and Technology, 2017.
[6] S. N. Kim, O. Medelyan, M.-Y. Kan, and T. Baldwin, "Semeval-2010
task 5: Automatic keyphrase extraction from scientific articles," in
Proceedings of the 5th International Workshop on Semantic Evaluation,
2010, pp. 21-26.
[7] D. Mahata, R. R. Shah, J. Kuriakose, R. Zimmermann, and J. R.
Talburt, "Theme-weighted Ranking of Keywords from Text Documents
using Phrase Embeddings," in 2018 IEEE Conference on Multimedia
Information Processing and Retrieval (MIPR), 2018, pp. 184-189:
IEEE.
[8] J. Hu, S. Li, Y. Yao, L. Yu, G. Yang, and J. Hu, "Patent keyword
extraction algorithm based on distributed representation for patent
classification," Entropy, vol. 20, no. 2, p. 104, 2018.
[9] J. Lindén, S. Forsström, and T. Zhang, "Evaluating Combinations of
Classification Algorithms and Paragraph Vectors for News Article
Classification," in 2018 Federated Conference on Computer Science
and Information Systems (FedCSIS), 2018, pp. 489-495: IEEE.
[10] S. Kannan and V. Gurusamy, "Preprocessing Techniques for Text
Mining," IEEE conference, 2011, vol. 5, no. 1, pp. 7-16, 2014.
[11] S. Menaka and N. Radha, "Text classification using keyword extraction
technique," International Journal of Advanced Research in Computer
Science and Software Engineering, vol. 3, no. 12, 2013.
[12] M. Mowafy, A. Rezk, and H. El-bakry, "An Efficient
Classification Model for Unstructured Text Document," American
Journal of Computer Science and Information Technology, vol. 6, no. 1,
p. 16, 2018.
[13] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient
estimation of word representations in vector space," arXiv preprint
arXiv:1301.3781v3, 2013.
[14] J. Acosta, N. Lamaute, M. Luo, E. Finkelstein, and Andreea,
"Sentiment Analysis of Twitter Messages Using Word2Vec,"
Proceedings of Student-Faculty Research Day, CSIS, Pace University,
p. 7, 2017.
[15] W. Sriurai, "Improving text categorization by using a topic model,"
Advanced Computing: An International Journal, vol. 2, no. 6, p. 21,
2011.
[16] M. S. Kumar and K. Murthy, "Corpus Based Statistical Approach for
Stemming Telugu," Creation of Lexical Resources for Indian Language
Computing Processing, C-DAC, Mumbai, India, 2007.
[17] N. Giridhar, K. Prema, N. S. Reddy, and P. Subba, "A Prospective
Study of Stemming Algorithms for Web Text Mining," Ganpat
University Journal of Engineering & Technology, vol. 1, pp. 28-34, 2011.
[18] L. Ma, "A Multi-label Text Classification Framework: Using
Supervised and Unsupervised Feature Selection Strategy," Bonfring
International Journal of Data Mining, 2017.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 18, No. 5, May 2020
18 https://sites.google.com/site/ijcsis/
ISSN 1947-5500

More Related Content

What's hot

76 s201906
76 s20190676 s201906
76 s201906
IJRAT
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
ijma
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
csandit
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A Review
INFOGAIN PUBLICATION
 
Single document keywords extraction in Bahasa Indonesia using phrase chunking
Single document keywords extraction in Bahasa Indonesia using phrase chunkingSingle document keywords extraction in Bahasa Indonesia using phrase chunking
Single document keywords extraction in Bahasa Indonesia using phrase chunking
TELKOMNIKA JOURNAL
 
G04124041046
G04124041046G04124041046
G04124041046
IOSR-JEN
 
A template based algorithm for automatic summarization and dialogue managemen...
A template based algorithm for automatic summarization and dialogue managemen...A template based algorithm for automatic summarization and dialogue managemen...
A template based algorithm for automatic summarization and dialogue managemen...
eSAT Journals
 
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
cseij
 
A prior case study of natural language processing on different domain
A prior case study of natural language processing  on different domain A prior case study of natural language processing  on different domain
A prior case study of natural language processing on different domain
IJECEIAES
 
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGESA NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES
ijnlc
 
Answer extraction and passage retrieval for
Answer extraction and passage retrieval forAnswer extraction and passage retrieval for
Answer extraction and passage retrieval for
Waheeb Ahmed
 
Convolutional Neural Networks
Convolutional Neural Networks Convolutional Neural Networks
Convolutional Neural Networks
MichaelRodriguesdosS1
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptbutest
 
What to read next? Challenges and Preliminary Results in Selecting Represen...
What to read next? Challenges and  Preliminary Results in Selecting  Represen...What to read next? Challenges and  Preliminary Results in Selecting  Represen...
What to read next? Challenges and Preliminary Results in Selecting Represen...
MOVING Project
 
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUESMULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
ijcseit
 
A Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text miningA Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text mining
IJSRD
 
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
IJDKP
 
BENGALI INFORMATION RETRIEVAL SYSTEM (BIRS)
BENGALI INFORMATION RETRIEVAL SYSTEM (BIRS)BENGALI INFORMATION RETRIEVAL SYSTEM (BIRS)
BENGALI INFORMATION RETRIEVAL SYSTEM (BIRS)
ijnlc
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
IRJET Journal
 

An Investigation of Keywords Extraction from Textual Documents using Word2Vec and Decision Tree

  • 1. An Investigation of Keywords Extraction from Textual Documents using Word2Vec and Decision Tree

Hawa Benghuzzi, Department of Computer Science, Faculty of Information Technology, Misurata University, Misurata, Libya, H.Benghuzzi@it.misuratau.edu.ly

Mohammed M. Elsheh, Department of Computer Science, Faculty of Information Technology, Misurata University, Misurata, Libya, m.elsheh@it.misuratau.edu.ly

Abstract: In recent years digital data has grown dramatically, and knowledge discovery and data mining have attracted immense attention because of the emerging need to turn such data into useful information and knowledge. Keyword extraction is considered an essential task in natural language processing (NLP) that facilitates the mapping of documents to a concise set of representative single-word and multi-word phrases. This paper investigates the use of Word2Vec and Decision Tree for keyword extraction from textual documents. The SemEval-2010 dataset is used as the main input for the proposed study. After pre-processing operations are applied to the dataset, the words are represented as vectors using the Word2Vec technique. The method is based on word similarity between candidate keywords taken from the keywords collected for each label and from one sample with the same label. An appropriate threshold is determined, and the percentages that exceed this threshold are passed to the Decision Tree so that an appropriate classification of the text document can be made. Several similarity measurements were used for the classification process. The efficiency and accuracy of the algorithm were measured during classification using precision, recall and F-score. The obtained results indicate that using a vector representation for each keyword is an effective way to identify the most similar words, which increases the chance of recognizing the correct classification of the document.
When using Word2Vec CBOW, the F-score was 64% with the Gini method and the WordNet Lemmatizer. Meanwhile, when using Word2Vec SG, the F-score was 82% with the Gini index and English Porter stemming, which was the highest value across all our experiments.

Keywords: Text Classification; Keywords Extraction; Word2Vec; Decision Tree; Text Mining.

I. INTRODUCTION

Nowadays, the volume of electronic documents is growing daily at a massive rate. At the same time, we need to move quickly through these large amounts of textual information to find documents related to our interests [1]. Unstructured data takes diverse forms, and text is a good example: it is one of the simplest forms of data that can be generated in most scenarios. Humans can easily process and perceive unstructured text, but it is harder for machines to understand. As a result, there is a pressing need to design methods and algorithms that effectively process this large volume of text in a broad set of applications [2]. Moreover, the growth of electronic textual documents has created the need for text mining, the task of extracting meaningful information from text, which has gained importance recently [1].

Text mining differs from what is familiar in web search. In web search, the user is typically looking for something that is already known and has been written by someone else. The problem lies in setting aside all the material that is not relevant to the user's needs in order to find the relevant information. In text mining, the objective is to discover unknown information, something that no one yet knows and so could not yet have written down [3]. Text mining involves a set of approaches such as text summarization, unsupervised learning methods and supervised learning methods.
Keyword extraction can be carried out by many approaches, such as supervised and unsupervised machine learning, statistical methods and linguistic ones. Text Classification (TC) is the task of automatically sorting a set of documents into categories from a predefined set; it is an important part of text mining and falls under supervised machine learning methods [4]. The keyword extraction phase comes before text classification, where keywords are a subset of words that carry the most important information about the content of the document. Keyword extraction is the process of selecting, without any human intervention and depending on the model, the words from a text document that are likely to contain its most valuable information [5]. Basically, TC involves two stages, namely a training stage and a testing stage. In the former stage, documents are preprocessed and trained by a learning algorithm to

International Journal of Computer Science and Information Security (IJCSIS), Vol. 18, No. 5, May 2020 13 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 2. generate the classifier. In the latter stage, the classifier is used for prediction. With supervised learning algorithms [3], the objective is to learn classifiers from known examples (labelled documents) and to classify unknown examples (unlabelled documents) automatically. Many traditional learning algorithms can be used to train on the data, such as Decision Trees, Naive Bayes (NB), Support Vector Machines (SVM), k-Nearest Neighbour (KNN) and Neural Networks (NNet).

The remainder of this paper is organized as follows: Section 2 presents related research that deals with the problem of keyword extraction. Section 3 presents our proposed approach for extracting keywords using Word2Vec and Decision Tree. Experiments and results are described and discussed in Section 4. Finally, Section 5 presents the conclusion of the paper.

II. EXTRACTION OF KEYWORDS FROM TEXTUAL DOCUMENTS: A LITERATURE REVIEW

This section summarizes a collection of studies conducted in the last few years on keyword extraction and text classification.

A. The Textual Datasets

Many textual datasets are available for NLP, and interest in collecting data for such studies has increased in recent years. The investigators in [6] described Task 5 of the Workshop on Semantic Evaluation 2010 (SemEval-2010), which focuses on key-phrase extraction. The researchers compiled a set of 284 scientific articles with key-phrases carefully chosen by both their authors and readers. The dataset consists of trial, training and testing data drawn from conference and workshop papers in the ACM Digital Library. The papers range between six and eight pages and contain tables and pictures.
Also, in [7] the researchers collected 1,147,000 scientific abstracts from different areas of arXiv, and then added the scientific documents from the benchmark datasets, comprising short abstracts (Inspec) and long scientific papers (SemEval-2010), which were later used to evaluate the ranking of extracted keywords. In [8], the authors evaluated their algorithm and other baseline algorithms on 2500 patent documents extracted from Google Patents.

B. Text Preprocessing Operations

Text preprocessing is an important task and a basic step in many text mining and information retrieval (IR) algorithms, and it is a fundamental part of any NLP system. Because characters, words and sentences are identified at this stage, these basic units are passed on to all further processing stages. In [10], the authors present efficient preprocessing techniques that eliminate unhelpful parts of a document such as prepositions, articles and pronouns. These techniques remove noise from the text data, identify the root of each word and reduce the size of the text data. Their objective was to analyze the issues of preprocessing methods such as tokenization, stop-word removal and stemming for text documents. In addition, the authors in [11] preprocess documents before classifying them: stop words are removed and the remaining words are stemmed. In the researchers' view, stop words should be removed from a text because they make the text heavier and are of little importance to analysts. Moreover, the authors in [12] applied preprocessing techniques to the input documents to present the text in a clean word format. The most commonly taken steps are:
• Tokenization: A document is treated as a string and then partitioned into a list of tokens.
• Removing stop words: Stop words such as “the”, “a” and “and” occur frequently, so these insignificant words need to be removed.
• Stemming: A stemming algorithm converts different word forms into a similar canonical form. This step conflates tokens to their root form, e.g. connection to connect and computing to compute.

C. Keywords Extraction

The International Encyclopedia of Information and Library Science [1] defines “keyword” as “A word that succinctly and accurately describes the subject, or an aspect of the subject, discussed in a document.” Many techniques are used to extract keywords. In this work Word2Vec is used, a method that represents each word by a vector. The Word2Vec technique was created by a research team led by Tomas Mikolov at Google in 2013 [13]. They proposed two model architectures for learning distributed representations of words that minimize computational complexity, namely the Continuous Bag-of-Words (CBOW) and Skip-gram (SG) models. Figure 1 illustrates the architecture of CBOW and SG.

Figure 1: The architecture of CBOW and SG

In addition, the authors in [14] present and discuss experiments on sentiment analysis of Twitter posts about United
States (U.S.) airline companies. Their study aims to determine whether word embeddings created with the Word2Vec algorithm can be used to classify sentiment. Their dataset was acquired from Kaggle.com and contains over 14,000 tweets about users' airline experience with 15 attributes, including the original tweet text, Twitter user-related data, and the sentiment class label. Furthermore, J. Lindén, S. Forsström, and T. Zhang [9] combine the paragraph vector algorithms Distributed Memory and Distributed Bag of Words with four classification algorithms, namely Decision Tree, Random Forest, Multi-Layer Perceptron (MLP) and Long Short-Term Memory (LSTM), and evaluate critical parameter modifications of these classifiers with the aim of categorizing news articles.

D. Text Classification Algorithms

The aim of text classification is to assign text documents to a definite number of pre-defined classes. Key issues in classification include handling a large number of features, dealing with unstructured text documents, and choosing a machine learning technique suitable for the application. The authors in [11] applied text mining algorithms to extract keywords from journal papers using TF-IDF and the WordNet thesaurus. The TF-IDF algorithm is used to select the candidate words, while WordNet, a lexical database of English, is used to find similarity among the candidate words. Documents are then classified based on the extracted keywords using the machine learning algorithms NB, Decision Tree and KNN. The Decision Tree algorithm gave better prediction accuracy than NB and KNN, reaching 98.47%. Wongkot Sriurai [15] compared Bag-of-Words (BOW) feature processing with a topic model. Text categorization algorithms such as NB, SVM and Decision Tree were used for experimentation.
For the experiment, precision, recall and the F1 measure were used to evaluate the text classification. The results showed that the topic-model approach for representing documents yields the best performance, with an F1 measure of 79%, an improvement of 11.1% over the BOW model.

III. PROPOSED APPROACH FOR EXTRACTING KEYWORDS AND TEXTUAL CLASSIFICATION

In this section, we present the proposed method, which uses the Word2Vec technique in combination with a Decision Tree classifier to extract keywords from textual documents. The architecture of the proposed method consists of three phases: (1) preprocessing; (2) keyword extraction with Word2Vec; (3) document classification using Decision Tree. We describe these three phases in the following subsections.

A. Pre-processing phase

Preprocessing operations are applied to the dataset before it is fed to the second phase. Preprocessing makes the data more focused and clearer, which makes it easier to select keywords and place them into the categories to which they belong. The following operations are performed:
• Tokenization: the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The aim of tokenization is to identify the words in a sentence.
• Stop-word elimination: Many words are repeated frequently in documents but carry little meaning, since they serve to link words together in a sentence. Because of their high occurrence, their presence in the text extraction process is an obstacle to understanding the content of documents. Stop words are common words such as "and", "she", and "this". They are not helpful in classifying documents, so they must be eliminated.
• Stemming: the process of conflating the variant forms of a word into a common representation. In this work, three different stemming algorithms are used:
i. English Porter stemming: It is used due to its accuracy and simplicity.
It is designed for the English language and is based on the idea that word suffixes are frequently made up of combinations of smaller and simpler suffixes. If a suffix rule matches a word, the conditions attached to that rule are tested and the stem is obtained by removing the suffix [16].
ii. Paice-Husk Stemmer (Lancaster Stemmer): It is an iterative stemmer that removes the endings from a word in an indefinite number of steps. It uses a separate rule file, which is first read into an array or list. This file is divided into a series of sections, each corresponding to a letter of the alphabet [16] [17].
iii. WordNet Lemmatizer: Lemmatization is the process of converting a word into its base form. The difference between stemming and lemmatization is that the latter takes the context into account and converts the word into its meaningful base form, while the former only removes the last few letters, often leading to incorrect meanings and misspellings.
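The three preprocessing operations described in this subsection can be sketched as follows. This is a minimal, standard-library-only illustration and not the NLTK-based implementation used in the experiments: the stop-word list and the suffix rules are hypothetical stand-ins, and `toy_stem` is a naive suffix stripper rather than the Porter or Paice-Husk algorithm.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "and", "she", "this", "is", "are", "of", "to", "in"}

# Toy suffix rules, tried in this order. NOT the Porter or Paice-Husk rules.
SUFFIXES = ("ing", "ion", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


if __name__ == "__main__":
    print(preprocess("She is computing the connections in this network"))
```

Swapping `toy_stem` for a real stemmer or lemmatizer changes only the last step of `preprocess`, which is how the three stemming variants compared in this work differ from one another.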
[Figure 3 diagram labels: Collecting textual documents; Tokenizing, removing stop-words, stemming; Calculating most frequent keywords for each category; Representing keywords using vectors; Classifying the documents. Phases: Preprocessing phase, Keywords Extraction phase, Classification phase.]

B. Keywords Extraction phase

In this work, following the preprocessing stage, Word2Vec is used with its two architectures, SG and CBOW. The most frequent keywords are extracted for each category, and each word in every document is represented by a vector. In the first stage, we collect the textual documents for each category: all training and testing documents for each class are merged into a single text file, which is then passed to the Word2Vec model for training. Then, 15 words from every document and each category are selected and passed to the subsequent similarity operations. CBOW takes the context of each word as input and tries to predict the word corresponding to that context. Its training complexity is shown in equation (1) [13]:

$$Q = N \times D + D \times \log_2(V) \tag{1}$$

where N is the number of context words at the input, V is the vocabulary size, and D is the dimensionality of the word representations. The SG model is the opposite of the CBOW model: it uses the current word to predict its surrounding words. The training complexity of this architecture is proportional to the expression in equation (2) [13]:

$$Q = C \times (D + D \times \log_2(V)) \tag{2}$$

where C is the maximum distance of the words, V is the vocabulary size, and D is the dimensionality of the word representations.

The second stage calculates the most frequent keywords in each document for each category, and the 15 candidate words with the highest frequency are then selected for the similarity calculations. Each word is represented by a vector. In the final stage, the cosine similarity is calculated between the candidate keywords from the first stage and those from the second. Given the two numerical vectors X and Y that Word2Vec generates for two different words, the cosine similarity between the two words is defined as the normalized dot product of X and Y, as shown in equation (3) [18]:

$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} \tag{3}$$

C.
Documents classification using Decision Tree

Once keyword extraction has taken place, followed by the similarity measurement between the candidate words of each document, the creation of a file for each class, and the presentation of the extracted data as a five-scale vector, the Decision Tree is able to assign each document to its correct class. The target dataset was divided into 60% for training and 40% for testing. To choose the optimal attribute on which to split the data, two measures were used, namely Information Gain and the Gini index. The Decision Tree of CBOW with the WordNet Lemmatizer and the fourth scale using Entropy is presented in Figure 2.

Figure 2: The Decision Tree of CBOW with the WordNet Lemmatizer

The architecture of the proposed method is summarized by the schema in Figure 3.

Figure 3: The architecture of the proposed method

IV. EXPERIMENTS AND RESULTS

This section first describes the input corpus and the tools used to implement the proposed keyword extraction approach and to measure its performance. It then presents and discusses the results of the experiments.

A. Input corpus
To test the proposed approach, the SemEval-2010 dataset is used. It covers four different research areas, ensuring a variety of topics, which correspond to the following 1998 ACM classifications: C2.4 Distributed Systems, H3.3 Information Search and Retrieval, I2.11 Distributed Artificial Intelligence - Multi-Agent Systems, and J4 Social and Behavioural Sciences - Economics. The three datasets (trial, training and testing) span the four categories and contain 40, 144, and 100 articles, respectively.

B. Used tools

To apply the preprocessing operations to the data, the tokenize module of the Natural Language Toolkit (NLTK) is used. To extract keywords with Word2Vec, the gensim library for Python is used.

C. Result evaluation

This section presents the metrics used to evaluate the proposed approach: precision, recall and F1-score. These three metrics are commonly used to evaluate the performance of information retrieval and natural language processing tools. Precision (P) is the number of correct results divided by the number of all returned results, as shown in equation (4):

$$P = \frac{\text{number of correct results}}{\text{number of returned results}} \tag{4}$$

Recall (R) is the number of correct results divided by the number of results that should have been returned, as shown in equation (5):

$$R = \frac{\text{number of correct results}}{\text{number of results that should have been returned}} \tag{5}$$

The F-measure is defined as the harmonic mean of precision (P) and recall (R), as shown in equation (6):

$$F = \frac{2 \times P \times R}{P + R} \tag{6}$$

D. The Results of Word2Vec CBOW

Using the Gini method with the WordNet Lemmatizer, the F-score was 64%. With the Entropy method and the same measure, the F-score was 59%; the highest F-score with the Entropy method was 62%, achieved by English Porter stemming with the fourth scale. The lowest F-score, 22%, was obtained with the Paice-Husk stemmer at scale three using the Gini index. For the combined keywords from authors and readers, the fourth measure achieved the highest F-score, 57%, with the Entropy method.
Figure 4 shows the confusion matrix of the highest ratio obtained using the Gini index and CBOW with the WordNet Lemmatizer. The F-score for label I, reported as 0.49%, lowered the overall score.

Figure 4: The confusion matrix of the highest F-score, CBOW with Gini index (WordNet Lemmatizer)

E. The Results of Word2Vec Skip Gram

The highest average score, 82%, was achieved by English Porter stemming with the second scale using the Gini index. It reached the same score with the third scale using the Entropy method. The lowest average score for the combined keywords proposed by authors and readers was 52%, using the Gini index with scale five. The WordNet Lemmatizer achieved 78% using both the Gini index and Entropy methods with the third scale. The Paice-Husk stemmer achieved its highest score, 76%, with the fourth and fifth measures using the Gini index and Entropy methods. With the Skip-gram algorithm, a general improvement in the results is observed, with an F-score of up to 82%. There is a significant improvement in classification performance for the H label. Figures 5 and 6 show the confusion matrices of the highest average scores.

Figure 5: The confusion matrix of the highest F-score, SG with Gini index (English Porter stemming)
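Per-label precision, recall and F-score can be derived from a confusion matrix such as those shown in the figures of this section. The sketch below assumes that rows are true labels and columns are predicted labels. The matrix values are invented for illustration and are not the results of this paper, and the function names are hypothetical.

```python
def per_label_scores(matrix, labels):
    """Return {label: (precision, recall, f_score)}.

    Assumes matrix[i][j] counts documents of true label i predicted as label j.
    """
    scores = {}
    for i, label in enumerate(labels):
        correct = matrix[i][i]
        returned = sum(row[i] for row in matrix)  # column sum: predicted as this label
        expected = sum(matrix[i])                 # row sum: truly of this label
        p = correct / returned if returned else 0.0
        r = correct / expected if expected else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        scores[label] = (p, r, f)
    return scores


def macro_f_score(scores):
    """Unweighted mean of the per-label F-scores."""
    return sum(f for _, _, f in scores.values()) / len(scores)


if __name__ == "__main__":
    labels = ["C", "H", "I", "J"]
    # Invented counts (assumption), ten documents per true label.
    matrix = [
        [8, 1, 1, 0],
        [0, 9, 1, 0],
        [2, 2, 5, 1],
        [0, 0, 1, 9],
    ]
    scores = per_label_scores(matrix, labels)
    for label in labels:
        p, r, f = scores[label]
        print(f"{label}: P={p:.2f} R={r:.2f} F={f:.2f}")
    print(f"macro F = {macro_f_score(scores):.2f}")
```

In this invented matrix, label I has the lowest F-score, which illustrates how one weak label pulls down an averaged score in the way described for the CBOW results.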
Figure 6: The confusion matrix of the highest F-score, SG with Entropy (English Porter stemming)

V. CONCLUSIONS

The paper discussed a method for extracting keywords from text documents and classifying these documents using Word2Vec and Decision Tree. The Word2Vec model is used to obtain the keywords; it provides a word-similarity measure that captures how close words are to one another. The Decision Tree is used to find the correct class for the target document. To evaluate the performance of the proposed method, the precision, recall and F-score values were computed. The results varied with the five measures used and with the CBOW and SG techniques; the SG method combined with Decision Tree proved effective in determining the correct classes of documents, with an F-score exceeding 80%. Comparison with previous studies indicates that the proposed method is effective in finding the correct classification of documents and that it outperformed its counterpart using the same keywords.

REFERENCES

[1] S. Siddiqi and A. Sharan, "Keyword and keyphrase extraction techniques: a literature review," International Journal of Computer Applications, vol. 109, no. 2, 2015.
[2] M. Allahyari et al., "A brief survey of text mining: Classification, clustering and extraction techniques," arXiv preprint arXiv:1707.02919v2, 2017.
[3] V. Gupta and G. S. Lehal, "A survey of text mining techniques and applications," Journal of Emerging Technologies in Web Intelligence, vol. 1, no. 1, pp. 60-76, 2009.
[4] A. K. S. Tilve and S. N. Jain, "A survey on machine learning techniques for text classification," International Journal of Engineering Sciences & Research Technology, 2017.
[5] K. Bharti and K. S.
Babu, "Automatic keyword extraction for text summarization: A survey," European Journal of Advances in Engineering and Technology, 2017.
[6] S. N. Kim, O. Medelyan, M.-Y. Kan, and T. Baldwin, "Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles," in Proceedings of the 5th International Workshop on Semantic Evaluation, 2010, pp. 21-26.
[7] D. Mahata, R. R. Shah, J. Kuriakose, R. Zimmermann, and J. R. Talburt, "Theme-weighted Ranking of Keywords from Text Documents using Phrase Embeddings," in 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2018, pp. 184-189, IEEE.
[8] J. Hu, S. Li, Y. Yao, L. Yu, G. Yang, and J. Hu, "Patent keyword extraction algorithm based on distributed representation for patent classification," Entropy, vol. 20, no. 2, p. 104, 2018.
[9] J. Lindén, S. Forsström, and T. Zhang, "Evaluating Combinations of Classification Algorithms and Paragraph Vectors for News Article Classification," in 2018 Federated Conference on Computer Science and Information Systems (FedCSIS), 2018, pp. 489-495, IEEE.
[10] S. Kannan and V. Gurusamy, "Preprocessing Techniques for Text Mining," IEEE conference, 2011, vol. 5, no. 1, pp. 7-16, 2014.
[11] S. Menaka and N. Radha, "Text classification using keyword extraction technique," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 12, 2013.
[12] M. Mowafy, A. Rezk, and H. El-bakry, "An Efficient Classification Model for Unstructured Text Document," American Journal of Computer Science and Information Technology, vol. 6, no. 1, p. 16, 2018.
[13] T. Mikolov, K. Chen, G. Corrado, and J.
Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781v3, 2013.
[14] J. Acosta, N. Lamaute, M. Luo, E. Finkelstein, and Andreea, "Sentiment Analysis of Twitter Messages Using Word2Vec," Proceedings of Student-Faculty Research Day, CSIS, Pace University, p. 7, 2017.
[15] W. Sriurai, "Improving text categorization by using a topic model," Advanced Computing: An International Journal, vol. 2, no. 6, p. 21, 2011.
[16] M. S. Kumar and K. Murthy, "Corpus Based Statistical Approach for Stemming Telugu," Creation of Lexical Resources for Indian Language Computing Processing, C-DAC, Mumbai, India, 2007.
[17] N. Giridhar, K. Prema, N. S. Reddy, and P. Subba, "A Prospective Study of Stemming Algorithms for Web Text Mining," Ganapt University Journal of Engineering Technology, vol. 1, pp. 28-34, 2011.
[18] L. Ma, "A Multi-label Text Classification Framework: Using Supervised and Unsupervised Feature Selection Strategy," Bonfring International Journal of Data Mining, 2017.

International Journal of Computer Science and Information Security (IJCSIS), Vol. 18, No. 5, May 2020, https://sites.google.com/site/ijcsis/, ISSN 1947-5500