This document proposes using Word2Vec and decision trees to extract keywords from textual documents and to classify those documents. It reviews related work on keyword extraction and text classification techniques. The proposed approach preprocesses the text, represents words as vectors with Word2Vec, computes frequently occurring keywords for each category, and uses decision trees to classify documents based on keyword similarity. Experiments with different preprocessing and Word2Vec settings achieved an F-score of up to 82% for document classification.
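As a rough illustration of that pipeline, the sketch below averages Word2Vec word vectors into document vectors and trains a decision tree on them; the toy corpus, labels and hyperparameters are invented, and gensim plus scikit-learn merely stand in for whatever implementation was actually used.

```python
# Hypothetical two-document corpus; real experiments would use a far
# larger collection and tuned Word2Vec settings.
from gensim.models import Word2Vec
from sklearn.tree import DecisionTreeClassifier
import numpy as np

docs = [["stock", "market", "shares", "profit"],
        ["match", "goal", "league", "score"]]
labels = ["finance", "sports"]

w2v = Word2Vec(sentences=docs, vector_size=50, min_count=1, seed=1)

def doc_vector(tokens, model):
    # Represent a document as the mean of its word vectors.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)

X = np.array([doc_vector(d, w2v) for d in docs])
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
print(clf.predict([doc_vector(["goal", "league"], w2v)]))  # one of the two labels
```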
Semantic tagging for documents using 'short text' information (csandit)
Tagging documents with relevant and comprehensive keywords offers invaluable assistance to readers, letting them quickly overview any document. With the ever-increasing volume and variety of documents published on the internet, interest in developing newer and more successful techniques for annotating (tagging) documents is also increasing. However, an interesting challenge in document tagging occurs when the full content of the document is not readily accessible. In such a scenario, techniques which use 'short text', e.g., a document title or a news article headline, to annotate the entire article are particularly useful. In this paper, we propose a novel approach to automatically tag documents with relevant tags or key-phrases using only 'short text' information from the documents. We employ crowd-sourced knowledge from Wikipedia, DBpedia, Freebase, YAGO and similar open-source knowledge bases to generate semantically relevant tags for the document. Using the intelligence from the open web, we prune out tags that create ambiguity in, or 'topic drift' from, the main topic of our query document. We have used a real-world dataset from a corpus of research articles to annotate 50 research articles. As a baseline, we used the full-text information from the document to generate tags. The proposed and baseline approaches were compared using the author-assigned keywords for the documents as the ground-truth information. We found that the tags generated using the proposed approach are better than those from the baseline in terms of overlap with the ground-truth tags, measured via the Jaccard index (0.058 vs. 0.044). In terms of computational efficiency, the proposed approach is at least 3 times faster than the baseline approach. Finally, we qualitatively analyse the quality of the predicted tags for a few samples in the test corpus. The evaluation shows the effectiveness of the proposed approach both in the quality of the tags generated and the computational time.
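The overlap metric quoted above is the Jaccard index; a minimal sketch of how such tag-set overlap can be computed follows, with invented tag sets.

```python
# Jaccard index: |intersection| / |union| of two tag sets.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

predicted = {"semantic tagging", "wikipedia", "short text"}
ground_truth = {"document tagging", "short text", "knowledge base"}
print(jaccard(predicted, ground_truth))  # 0.2 for these toy sets
```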
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC... (cscpconf)
On-line text documents rapidly increase in number with the growth of the World Wide Web. To manage such a huge amount of text, several text-mining applications came into existence. Those applications, such as search engines, text categorization, summarization, and topic detection, are based on feature extraction. Extracting keywords or features manually is an extremely time-consuming and difficult task, so an automated process that extracts keywords or features needs to be established. This paper proposes a new domain keyword extraction technique that includes a new weighting method built on conventional TF-IDF. Term frequency-inverse document frequency is widely used to express a document's feature weights, but it cannot reflect the distribution of terms within the document, and therefore cannot reflect their degree of significance or the differences between categories. This paper proposes a new weighting method in which a new weight is added to express the differences between domains, on the basis of the original TF-IDF. The extracted features represent the content of the text better and have a better ability to distinguish between categories.
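The abstract does not give the exact form of the added domain weight, so the sketch below is only a plausible reading: classic TF-IDF scaled by a simple in-domain vs. out-of-domain frequency ratio.

```python
# Assumed illustration only: the paper's actual domain weight is not
# specified here, so a frequency ratio stands in for it.
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / (1 + df))

def domain_weight(term, domain_docs, other_docs):
    # Ratio of the term's frequency inside vs. outside the domain
    # (add-one smoothed to avoid division by zero).
    in_freq = sum(d.count(term) for d in domain_docs) + 1
    out_freq = sum(d.count(term) for d in other_docs) + 1
    return in_freq / out_freq

doc = ["keyword", "extraction", "keyword"]
corpus = [doc, ["search", "engine"], ["topic", "detection"]]
score = tf_idf("keyword", doc, corpus) * domain_weight("keyword", [doc], corpus[1:])
print(score)
```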
Dictionary based concept mining: an application for Turkish (csandit)
In this study, a dictionary-based method is used to extract expressive concepts from documents. So far, there have been many studies concerning concept mining in English, but this area of study for Turkish, an agglutinative language, is still immature. We used a dictionary instead of WordNet, the lexical database grouping words into synsets that is widely used for concept extraction. Dictionaries are rarely used in the domain of concept mining, but because dictionary entries carry synonyms, hypernyms, hyponyms and other relationships in their definition texts, the success rate for determining concepts has been high. This concept extraction method is applied to documents collected from different corpora.
Performance analysis on secured data method in natural language steganography (journalBEEI)
The rapid growth of exchanged information that drove the expansion of the internet during the last decade has motivated research in this field. Recently, steganography approaches have received unexpected attention. Hence, the aim of this paper is to review different performance metrics, covering decoding, decrypting and extracting performance. The process of data decoding interprets the received hidden message into a code word. Data encryption, in turn, is the best way to provide secure communication, and decrypting takes an encrypted text and converts it back into the original text. Data extracting is the reverse of the data-embedding process. The effectiveness evaluation is mainly determined by the chosen performance metrics, and researchers aim to improve those metric characteristics. The objective of this paper is to present a review of the study of natural-language steganography based on the criteria of performance analysis. The findings clarify the preferred performance-metric aspects used. This review is intended to help future research in evaluating the performance of natural-language steganography in general, and of the proposed secured-data methods in particular.
The Process of Information Extraction through Natural Language Processing (Waqas Tariq)
Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured, e.g., a sentence or even another document, or structured, e.g., a boolean expression. The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both in internal corporate document collections and in the immense and growing number of document sources on the Internet. The topics covered include: formulation of structured and unstructured queries and topic statements, indexing (including term weighting) of document collections, methods for computing the similarity of queries and documents, classification and routing of documents in an incoming stream to users on the basis of topic or need statements, clustering of document collections on the basis of language or topic, and statistical, probabilistic, and semantic methods of analyzing and retrieving documents. Information extraction from text has therefore been pursued actively as an attempt to present knowledge from published material in a computer-readable format. An automated extraction tool would not only save time and effort, but also pave the way to discovering hitherto unknown information implicitly conveyed in published text. Work in this area has focused on extracting a wide range of information such as chromosomal locations of genes, protein functional information, associating genes by functional relevance, and relationships between entities of interest. While clinical records provide a semi-structured, technically rich data source for mining information, publications, in their unstructured format, pose a greater challenge, addressed by many approaches.
A Document Exploring System on LDA Topic Model for Wikipedia Articles (ijma)
A large amount of digital text information is generated every day, and effectively searching, managing and exploring the text data has become a main task. In this paper, we first present an introduction to text mining and the LDA topic model. Then we explain in depth how to apply the LDA topic model to a text corpus by conducting experiments on Simple Wikipedia documents. The experiments include all the necessary steps of data retrieval, pre-processing, fitting the model, and an application in a document exploring system. The results of the experiments show the LDA topic model working effectively for clustering documents and finding similar documents. Furthermore, the document exploring system could be a useful research tool for students and researchers.
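For readers unfamiliar with the workflow, here is a minimal LDA sketch in the spirit of those experiments, using gensim on a toy corpus rather than the Simple Wikipedia dump.

```python
# Fit a tiny LDA model: build a dictionary, convert documents to
# bag-of-words, then train and inspect the topics.
from gensim import corpora, models

texts = [["topic", "model", "wikipedia"],
         ["cluster", "document", "topic"],
         ["wikipedia", "article", "search"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
for tid in range(2):
    print(lda.print_topic(tid))  # top words and weights per topic
```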
A Text Mining Research Based on LDA Topic Modelling (csandit)
A large amount of digital text information is generated every day, and effectively searching, managing and exploring the text data has become a main task. In this paper, we first present an introduction to text mining and a probabilistic topic model, Latent Dirichlet Allocation. Then two experiments are proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds a document topic model, aiming at a topic-perspective solution for searching, exploring and recommending articles. The latter sets up a user topic model, providing full research and analysis of Twitter users' interests. The experiment process, including data collection, data pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could be a useful computational tool for social and business research.
Text mining is the technique that helps users find useful information in large amounts of text documents on the web or in databases. Most popular text mining and classification methods have adopted term-based approaches; pattern-based methods have also been used to describe user preferences. This review paper analyses how text mining works at three levels: the sentence level, the document level and the feature level. We review related work previously done and discuss the problems that arise when text mining is performed at the feature level. The paper also presents a technique for text mining over compound sentences.
Single document keywords extraction in Bahasa Indonesia using phrase chunking (TELKOMNIKA JOURNAL)
Keywords help readers to understand the idea of a document quickly. Unfortunately, considerable time and effort are often needed to come up with a good set of keywords manually. This research focused on generating keywords from a document automatically using phrase chunking. Firstly, we collected part-of-speech patterns from a collection of documents. Secondly, we used those patterns to extract candidate keywords from the abstract and the content of a document. Finally, keywords are selected from the candidates based on the number of words in the keyword phrases and several scenarios involving candidate reduction and sorting. We evaluated the result of each scenario using precision, recall, and F-measure. The experimental results show: i) shorter-phrase keywords with string reduction, extracted from the abstract and sorted by frequency, provide the highest score, ii) in every proposed scenario, extracting keywords from the abstract always gives a better result, iii) using shorter-phrase patterns in keyword extraction gives a better score than using all phrase patterns, iv) sorting scenarios based on the product of candidate frequencies and the weights of the phrase patterns offer better results.
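A minimal illustration of the phrase-chunking step follows; it assumes NLTK and a single hand-written noun-phrase POS pattern, whereas the paper mines its patterns from a document collection.

```python
# Extract candidate key phrases matching one assumed POS pattern:
# optional adjectives followed by one or more nouns.
# Requires the NLTK tokenizer and tagger models (see nltk.download).
import nltk

grammar = "KP: {<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tokens = nltk.pos_tag(nltk.word_tokenize("Automatic keyword extraction helps busy readers."))
tree = chunker.parse(tokens)
candidates = [" ".join(w for w, _ in st.leaves())
              for st in tree.subtrees() if st.label() == "KP"]
print(candidates)  # e.g. ['Automatic keyword extraction', 'busy readers']
```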
A template based algorithm for automatic summarization and dialogue managemen... (eSAT Journals)
This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of the research is to arrive at a methodology that helps extract important events such as dates, places, and subjects of interest. It would also be convenient if the methodology presented users with a shorter version of the text containing all non-trivial information. We also discuss our implementation of algorithms that perform exactly this task. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO... (cseij)
In this paper we combine our previous research in the field of the Semantic Web, especially ontology learning and population, with sentence retrieval. To do this we developed a new approach to sentence retrieval, modifying our previous TF-ISF method, which uses local-context information, to take into account only document-level information. This is quite a new approach to sentence retrieval, presented for the first time in this paper, and it is compared to existing methods that use information from the whole document collection. Using this approach and the developed methods for sentence retrieval at the document level, it is possible to assess the relevance of a sentence using only the information from the retrieved sentence's document, and to define a document-level OWL representation for sentence retrieval that can be automatically populated. In this way the idea of the Semantic Web is supported through automatic and semi-automatic extraction of additional information from existing web resources. The additional information is formatted in an OWL document containing document-sentence relevance for sentence retrieval.
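TF-ISF mirrors TF-IDF with sentences in place of documents; the sketch below computes the document-level variant described above, with the exact scoring and ranking details treated as assumptions.

```python
# Score each sentence by term frequency times inverse sentence frequency,
# computed over the sentences of a single document only.
import math

def tf_isf(sentence, sentences):
    n = len(sentences)
    score = 0.0
    for term in set(sentence):
        tf = sentence.count(term) / len(sentence)
        sf = sum(1 for s in sentences if term in s)  # sentence frequency
        score += tf * math.log(n / sf)
    return score

sents = [["ontology", "learning", "population"],
         ["sentence", "retrieval", "ontology"],
         ["owl", "representation"]]
ranked = sorted(sents, key=lambda s: tf_isf(s, sents), reverse=True)
print(ranked[0])  # most distinctive sentence of the document
```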
A prior case study of natural language processing on different domains (IJECEIAES)
In the present state of the digital world, computer machines do not understand humans' ordinary language. This is the great barrier between humans and digital systems. Hence, researchers have sought advanced technology that provides information to users from digital machines. Natural language processing (NLP) is a branch of AI with significant implications for the ways that computer machines and humans can interact. NLP has become an essential technology in bridging the communication gap between humans and digital data. Thus, this study presents the necessity of NLP in the current computing world, along with different approaches and their applications. It also highlights the key challenges in the development of new NLP models.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information lies dormant in historical documents and manuscripts, and this information would go to waste if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text by optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word-spotting technique based on codes for matching the word images of the Devanagari script. The shape information is utilised to generate integer codes for words in the document image, and these codes are matched for final retrieval of relevant documents. The technique is illustrated using Marathi document images.
Answer extraction and passage retrieval for... (Waheeb Ahmed)
Question Answering systems (QASs) perform the task of retrieving the text portions, from a collection of documents, that contain the answer to the user's question. These QASs use a variety of linguistic tools able to deal with small fragments of text. Therefore, to retrieve the documents which contain the answer from a large document collection, QASs employ Information Retrieval (IR) techniques to reduce the document collection to a tractable amount of relevant text. In this paper, we propose a passage retrieval model that performs this task with better performance for the purposes of Arabic QASs. We first segment each of the top five ranked documents returned by the IR module into passages. Then we compute the similarity score between the user's question terms and each passage, and the top five passages (with the highest similarity scores) are retrieved. Finally, answer extraction techniques are applied to extract the final answer. Our method achieved an average precision of 87.25%, recall of 86.2% and F1-measure of 87%.
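A rough sketch of that passage-ranking step follows; TF-IDF cosine similarity via scikit-learn stands in for the authors' similarity computation, and the question and passages are invented.

```python
# Rank passages against a question by TF-IDF cosine similarity
# and keep the top five.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "who founded the library of alexandria"
passages = ["The Library of Alexandria was founded in Egypt ...",
            "Papyrus scrolls were stored in the main hall ..."]

vec = TfidfVectorizer()
matrix = vec.fit_transform([question] + passages)  # row 0 is the question
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
top_five = sorted(zip(scores, passages), reverse=True)[:5]
print(top_five[0])
```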
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES (ijcseit)
As information access across languages increases, the importance of a system that supports query-based searching over multilingual content also grows. Gathering information in different natural languages is a most difficult task, requiring huge resources such as databases and digital libraries. Cross-language information retrieval (CLIR) enables searching multilingual document collections using one's native language, and it can be supported by different data mining techniques. This paper deals with the various data mining techniques that can be used to solve the problems encountered in CLIR.
A Novel Approach for Keyword extraction in learning objects using text mining (IJSRD)
Keyword extraction and concept finding in learning objects are very important subjects in today's e-learning environment. Keywords are a subset of words that contain useful information about the content of the document, and keyword extraction is a process used to obtain the important keywords from documents. In the proposed system, a decision tree algorithm is used for the feature-selection process together with the WordNet dictionary. WordNet is a lexical database of English which is used to compute similarity among the candidate words. The words with the highest similarity are taken as keywords.
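A minimal sketch of ranking candidate words by WordNet similarity, assuming NLTK's WordNet interface; the decision-tree feature-selection step of the system is not reproduced here.

```python
# Score candidates by their best WordNet path similarity to a topic word.
# Requires nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def similarity(word_a, word_b):
    syns_a, syns_b = wn.synsets(word_a), wn.synsets(word_b)
    scores = [a.path_similarity(b) for a in syns_a for b in syns_b]
    scores = [s for s in scores if s is not None]  # cross-POS pairs yield None
    return max(scores, default=0.0)

candidates = ["lesson", "banana", "course"]
ranked = sorted(candidates, key=lambda w: similarity(w, "learning"), reverse=True)
print(ranked)  # most learning-related candidates first
```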
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR... (IJDKP)
As existing computer search engines struggle to understand the meaning of natural language, semantically enriched metadata may improve interest-based search engine capabilities and user satisfaction. This paper presents an enhanced version of an ecosystem focusing on semantic topic metadata detection and enrichment. It builds on a previous paper on a semantic metadata enrichment software ecosystem (SMESE). Through text analysis approaches for topic detection and metadata enrichment, this paper proposes an algorithm to enhance search engine capabilities and consequently help users find content according to their interests. It presents the design, implementation and evaluation of the SATD (Scalable Annotation-based Topic Detection) model and algorithm, using metadata from the web, linked open data, concordance rules, and bibliographic record authorities. It includes a prototype of a semantic engine using keyword extraction, classification and concept extraction that generates semantic topics through text and multimedia document analysis with the proposed SATD model and algorithm. The performance of the proposed ecosystem is evaluated through a number of prototype simulations, comparing it to existing metadata enrichment techniques (e.g., AlchemyAPI, DBpedia, Wikimeta, Bitext, AIDA, TextRazor). The SATD algorithm was found to support more attributes than the other algorithms. The results show that the enhanced platform and its algorithm enable greater understanding of documents related to user interests.
An Information Retrieval System is an effective process that helps a user trace relevant information via Natural Language Processing (NLP). In this research paper, we present an algorithmic Information Retrieval System (BIRS) that is grounded mathematically and statistically. The paper demonstrates two algorithms for finding the lemmatization of Bengali words, Trie and Dictionary Based Search by Removing Affix (DBSRA), which are compared with Edit Distance for exact lemmatization. We present a Bengali anaphora resolution system using Hobbs' algorithm to obtain the correct expression of the information. For question answering, TF-IDF and cosine similarity are employed to find the accurate answer from the documents. In this study, we introduce a Bengali Language Toolkit (BLTK) and Bengali Language Expression (BRE) that make the implementation of our task easy. We have also developed a Bengali root-word corpus, a synonym corpus and a stop-word corpus, and gathered 672 articles from the popular Bengali newspaper 'The Daily Prothom Alo' as our input information. For testing the system, we created 19,335 questions from this information and obtained 97.22% accurate answers.
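Of the components listed, the edit-distance baseline is the most self-contained; a classic Levenshtein sketch follows (the Bengali-specific Trie and DBSRA affix handling are not shown).

```python
# Classic dynamic programming: dp[i][j] is the edit distance between
# the first i characters of a and the first j characters of b.
def edit_distance(a, b):
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
          for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(a)][len(b)]

print(edit_distance("running", "run"))  # 4
```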
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: 'D4 A... (idescitation)
The Genetic Algorithm (GA) has been a successful method used for extracting keywords. This paper presents a complete method by which keywords can be derived from various corpora. We have built equations that exploit the structure of the documents from which the keywords need to be extracted. The procedure is broken into two distinct profiles: one weighs the words across the whole document content, and the other explores the possible occurrences of key terms using a genetic algorithm. The basic equations of the heuristic mechanism are varied to allow complete exploitation of the document. The genetic algorithm and the enhanced standard-deviation method are used to their full potential to enable the generation of key terms that describe the given text document. The new technique has enhanced performance and better time complexity.
Automatically finding domain-specific key terms in a given set of research papers is a challenging task, and assigning research papers to a particular area of research is a concern for many people, including students, professors and researchers. A domain classification of papers facilitates that search process: given a list of domains in a research field, we try to find out to which domain(s) a given paper is most related. Besides, reading and processing a whole paper takes a long time, and using domain knowledge requires much human effort, e.g., manually labeling a large corpus. In this paper, we use the abstract and keywords of a research paper as the seed terms to identify similar terms from a domain corpus, which are then filtered by checking their appearance in the research papers. Experiments show that the TF-IDF measure and the classification step make this method map terms to domains more precisely. The results show that our approach can extract the terms effectively while being domain independent.
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F... (ijaia)
Regression models and their statistical analyses are among the most important tools used by scientists and practitioners. The aim of a regression model is to fit parametric functions to data. The true regression is unknown, and specific methods are created and used strictly pertaining to the problem. For the pioneering work on procedures for fitting functions, we refer to the methods of least absolute deviations, least squares deviations and minimax absolute deviations. Today's widely celebrated method of least squares for function fitting is credited to the published works of Legendre and Gauss. However, least-squares-based models in practice may fail to provide optimal results in non-Gaussian situations, especially when the errors follow fat-tailed distributions. In this paper an unorthodox method of estimating linear regression coefficients by minimising the GMSE (geometric mean of squared errors) is explored. Though the GMSE is often used to compare models, it is rarely used to obtain the coefficients. Such a method is tedious to handle due to the large number of roots obtained when minimising the loss function; this paper offers a way to tackle that problem. The application is illustrated with the 'Advertising' dataset from ISLR, and the obtained results are compared with those of the method of least squares for a single-index linear regression model.
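As a numerical illustration of the idea (not the paper's own procedure), note that minimising the GMSE is equivalent to minimising the mean of the log squared residuals; the sketch below does this for a toy single-predictor model with fat-tailed noise, using a general-purpose optimiser in place of the paper's root-based treatment.

```python
# Estimate intercept and slope by minimising mean(log(residual^2)),
# which is a monotone transform of the geometric mean of squared errors.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.standard_t(df=2, size=50)  # fat-tailed noise

def log_gmse(beta):
    residuals = y - (beta[0] + beta[1] * x)
    return np.mean(np.log(residuals ** 2 + 1e-12))  # epsilon guards log(0)

fit = minimize(log_gmse, x0=[0.0, 0.0], method="Nelder-Mead")
print(fit.x)  # estimated [intercept, slope]
```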
TEXT MINING AND CLASSIFICATION OF PRODUCT REVIEWS USING STRUCTURED SU... (csandit)
Text mining and text classification are two prominent and challenging tasks in the field of machine learning. Text mining refers to the process of deriving high-quality and relevant information from text, while text classification deals with the categorization of text documents into different classes. The real challenge in these areas is to address problems like handling large text corpora, similarity of words in text documents, and association of text documents with a subset of class categories. The feature extraction and classification of such text documents require an efficient machine learning algorithm which performs automatic text classification. This paper describes the classification of product review documents as a multi-label classification scenario and addresses the problem using a Structured Support Vector Machine. The work also explains the flexibility and performance of the proposed approach for efficient text classification.
Text preprocessing is a vital stage in text classification (TC) particularly and in text mining generally. The purpose of text preprocessing tools is to reduce the multiple forms of a word to one form. Text preprocessing techniques have accordingly received a great deal of attention and are widely studied in machine learning. The basic phase in text classification involves preprocessing features and extracting relevant features against the features in a database; preprocessing has a great impact on reducing the time and computational resources needed. The effect of preprocessing tools on English text classification is an active area of research. This paper provides an evaluation study of several preprocessing tools for English text classification. The study covers using the raw text, tokenization, stop-word removal, and stemming. Two different methods for feature extraction, chi-square and TF-IDF with cosine similarity scores, are used, based on the BBC English dataset. The experimental results show that text preprocessing affects the feature extraction methods and enhances the performance of English text classification, especially for small threshold values.
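A small sketch of the kind of pipeline such a study evaluates: vectorization with stop-word removal followed by chi-square feature selection. The two-document corpus and the value of k are placeholders, not the BBC dataset or the paper's thresholds.

```python
# TF-IDF features with English stop words removed, then keep the
# k features with the highest chi-square scores against the labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the match ended in a draw", "shares fell on the stock market"]
labels = ["sport", "business"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
selected = SelectKBest(chi2, k=4).fit_transform(X, labels)
print(selected.shape)  # (2 documents, 4 selected features)
```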
Text classification supervised algorithms with term frequency inverse documen... (IJECEIAES)
Over the course of the previous two decades, there has been a rise in the quantity of text documents stored digitally. The ability to organize and categorize those documents through an automated mechanism is known as text categorization, which is used to classify them into a set of predefined categories so they may be preserved and sorted more efficiently. Identifying appropriate structures, architectures, and methods for text classification presents a challenge for researchers. This is due to the significant impact this concept has on content management, contextual search, opinion mining, product review analysis, spam filtering, and text sentiment mining. This study analyzes the generic categorization strategy and examines supervised machine learning approaches and their ability to comprehend complex models and nonlinear data interactions. Among these methods are k-nearest neighbors (KNN), support vector machine (SVM), and ensemble learning algorithms employing various evaluation techniques. Thereafter, an evaluation is conducted on the constraints of each technique and how they can be applied to real-life situations.
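A compact sketch of the kind of comparison described: KNN, SVM, and an ensemble classifier trained on the same TF-IDF features. The four toy documents are placeholders, and the particular estimators are assumptions rather than the study's exact setup.

```python
# Train three supervised classifiers on identical TF-IDF features
# and compare their predictions on an unseen document.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

docs = ["cheap meds online", "meeting moved to friday",
        "win a free prize now", "quarterly report attached"]
labels = ["spam", "ham", "spam", "ham"]

for model in (KNeighborsClassifier(n_neighbors=1), LinearSVC(),
              RandomForestClassifier(random_state=0)):
    clf = make_pipeline(TfidfVectorizer(), model).fit(docs, labels)
    print(type(model).__name__, clf.predict(["free meds prize"]))
```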
Feature selection, optimization and clustering strategies of text documents (IJECEIAES)
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. Research on the clustering task must be performed prior to its adaptation to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numbers. Efforts have also been put forward toward efficient clustering in the context of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper details prominent models proposed for clustering, along with the pros and cons of each model. In addition, it focuses on various recent developments in the clustering task in social networks and associated environments.
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION (cscpconf)
The internet has caused a humongous growth in the amount of data available to the common man. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indexes for the given document. In this work, we present a method to produce extractive summaries of documents in the Kannada language. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) methods along with TF (Term Frequency) for extracting keywords, and later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
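The GSS coefficient mentioned above is commonly given as GSS(t, c) = P(t, c) P(t̄, c̄) − P(t, c̄) P(t̄, c); a minimal sketch with invented document counts follows (the combination with TF and IDF is not reproduced).

```python
# GSS coefficient from the four cells of a term/category contingency table:
# n_tc      docs in category c containing term t
# n_tc_bar  docs outside c containing t
# n_t_bar_c docs in c without t
# n_t_bar_c_bar docs outside c without t
def gss(n_tc, n_tc_bar, n_t_bar_c, n_t_bar_c_bar):
    n = n_tc + n_tc_bar + n_t_bar_c + n_t_bar_c_bar
    return (n_tc / n) * (n_t_bar_c_bar / n) - (n_tc_bar / n) * (n_t_bar_c / n)

# Term occurring in 8 of 10 in-category docs and 1 of 40 out-of-category docs:
print(gss(n_tc=8, n_tc_bar=1, n_t_bar_c=2, n_t_bar_c_bar=39))  # 0.124
```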
Keyword Extraction Based Summarization of Categorized Kannada Text Documents (ijsc)
The internet has caused a humongous growth in the number of documents available online. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indices for the given document. In this work, we present a method to produce extractive summaries of documents in the Kannada language, given a number of sentences as the limit. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We use two feature-selection techniques for obtaining features from documents, combining the scores obtained by GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) along with TF (Term Frequency) for extracting keywords, which are later used for summarization based on sentence rank. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns. Text mining is the discovery of valuable, yet hidden, information from text documents. Text classification (also called text categorization) is one of the important research issues in the field of text mining, since it is necessary to classify/categorize large texts (documents) into specific classes. Text classification assigns a text document to one of a set of predefined classes. This paper covers different text classification techniques and also includes classifier architecture and text classification applications.
Survey on Key Phrase Extraction using Machine Learning Approaches (YogeshIJTSRD)
The automated keyword extraction task is to define a collection of representative terms for a text. Extracting keywords yields a small collection of terms, key phrases and keywords that define the document's context. Keyword search allows large document collections to be searched effectively. To allocate suitable key phrases to new documents, text categorization techniques can be applied. A predefined collection of key phrases, from which all key phrases for new documents are selected, is given in the training documents, and the training data for each key phrase comprise the collection of documents associated with it. Standard machine learning techniques are used for each key phrase to construct a classifier from the training material, using the documents relevant to it as positive examples and the rest as negative examples. Given a new text, it is processed by the classifier of each key phrase. Preeti Sondhi and Aakib Jabbar, "Survey on Key Phrase Extraction using Machine Learning Approaches", International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN 2456-6470, Volume 5, Issue 3, April 2021. URL: https://www.ijtsrd.com/papers/ijtsrd39890.pdf Paper URL: https://www.ijtsrd.com/other-scientific-research-area/other/39890/survey-on-key-phrase-extraction-using-machine-learning-approaches/preeti-sondhi
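A small sketch of the scheme the survey describes: one binary classifier per key phrase, trained with documents tagged with that phrase as positives and the rest as negatives (binary relevance). The data and the choice of logistic regression are placeholders.

```python
# One binary classifier per key phrase over shared TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["neural text classification", "keyword extraction with tf-idf",
        "deep learning for text", "graph based keyword ranking"]
doc_phrases = [{"deep learning"}, {"keyword extraction"},
               {"deep learning"}, {"keyword extraction"}]

vectorizer = TfidfVectorizer().fit(docs)
features = vectorizer.transform(docs)

classifiers = {}
for phrase in {"deep learning", "keyword extraction"}:
    y = [1 if phrase in tags else 0 for tags in doc_phrases]  # positives vs. rest
    classifiers[phrase] = LogisticRegression().fit(features, y)

new = vectorizer.transform(["survey of deep learning models"])
assigned = [p for p, clf in classifiers.items() if clf.predict(new)[0] == 1]
print(assigned)
```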
French machine reading for question answering (Ali Kabbadj)
This paper proposes to unlock the main barrier to machine reading and comprehension of French natural-language texts. This opens the way for a machine to find, for a given question, a precise answer buried in the mass of unstructured French text, or to create a universal French chatbot. Deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering, and language translation. But to be effective, deep learning methods need very large training datasets. Until now these techniques could not actually be used for French text question answering (Q&A) applications, since there was no large Q&A training dataset. We produced a large (100,000+) French training dataset for Q&A by translating and adapting the English SQuAD v1.1 dataset, along with GloVe French word and character embedding vectors from the French Wikipedia dump. We trained and evaluated three different Q&A neural network architectures in French and produced French Q&A models with F1 scores around 70%.
An Investigation of Keywords Extraction from Textual Documents using Word2Vec and Decision Tree
Hawa Benghuzzi
Department of Computer Science
Faculty of Information Technology, Misurata University
Misurata - Libya
H.Benghuzzi@it.misuratau.edu.ly
Mohammed M. Elsheh
Department of Computer Science
Faculty of Information Technology, Misurata University
Misurata - Libya
m.elsheh@it.misuratau.edu.ly
Abstract: In recent years, digital data has been growing dramatically, and knowledge discovery and data mining have attracted immense attention given the need to turn such data into useful information and knowledge. Keyword extraction is considered an essential task in natural language processing (NLP) that facilitates mapping documents to a concise set of representative single- and multi-word phrases. This paper investigates the use of Word2Vec and a Decision Tree for keyword extraction from textual documents. The SemEval (2010) dataset is used as the main input for the proposed study. After pre-processing operations are applied to the dataset, words are represented as vectors using the Word2Vec technique. The method is based on word similarity between the candidate keywords collected for each label and the candidate keywords of one sample from the same label. An appropriate threshold is determined, and similarity percentages that exceed this threshold are passed to the Decision Tree in order to decide an appropriate classification for the text document. Several similarity measurements were used in the classification process. The efficiency and accuracy of the algorithm were measured in the classification process using precision, recall and F-score rates. The obtained results indicate that using a vector representation for each keyword is an effective way to identify the most similar words, increasing the chance of recognizing the correct classification of a document. Using Word2Vec CBOW, the F-score was 64% with the Gini method and the WordNet Lemmatizer. Meanwhile, using Word2Vec SG, the F-score was 82% with the Gini Index and English Porter Stemming, the highest ratio across all our experiments.
Keywords- Text Classification; Keywords Extraction; Word2Vec;
Decision Tree; Text Mining.
I. INTRODUCTION
Nowadays, the space of electronic documents is growing daily at a massive rate. At the same time, we need to move quickly through these large amounts of textual information to find documents related to our interests [1]. Unstructured data takes a diversity of forms, and text is a typical example: it is one of the simplest forms of data that can be generated in most scenarios. Humans can easily process and perceive unstructured text, but it is harder for machines to understand. As a result, there is a pressing need to design methods and algorithms to effectively process this flood of text in a broad set of applications [2]. Moreover, this growth of electronic textual documents has led to the need for text mining studies, i.e., the task of extracting meaningful information from text, which has gained more importance recently [1].
Text mining differs from what we are familiar with in web search. In web search, the user is typically looking for something that is already known and has been written by someone else. The problem arises from pushing aside all the material that is currently not appropriate to the user's needs in order to find the relevant information. In text mining, the objective is to discover unknown information, something that no one yet knows and so could not have yet been written down [3]. Text mining involves a set of approaches, such as text summarization, unsupervised learning methods and supervised learning methods.
However, there are many approaches by which keyword
extraction can be carried out, such as supervised and
unsupervised machine learning, statistical methods and
linguistic ones.
Text Classification (TC) is the task of automatically sorting a set of documents into categories from a predefined set; it is an important part of text mining and falls under supervised machine learning methods [4].
The keyword extraction phase comes before text classification. Keywords are the subset of words that carry the most significant information about the content of a document; keyword extraction is the process of selecting, without any human intervention and depending on the model, the words of a text document that probably contain valuable information about it [5].
Basically, TC involves two stages, namely a training stage and a testing stage. In the former stage, documents are preprocessed and trained by a learning algorithm to generate the classifier. In the latter stage, a prediction by the classifier
is performed. Using supervised learning algorithms [3], the
objective is to learn classifiers from known examples (labelled
documents) and to perform the classification automatically on
unknown examples (unlabelled documents). There are many
traditional learning algorithms to train the data, such as
Decision Trees, Naive Bayes (NB), Support Vector Machines
(SVM), k-Nearest Neighbour (KNN), Neural Network (NNet).
The remainder of this paper is organized as follows: Section 2 presents related research dealing with the problem of keyword extraction. Section 3 presents our proposed approach for extracting keywords using Word2Vec and a Decision Tree. Experiments and results are described and discussed in Section 4. Finally, Section 5 presents the conclusion of the paper.
II. EXTRACTION OF KEYWORDS FROM TEXTUAL DOCUMENTS: A LITERATURE REVIEW
This section summarizes a collection of previous studies conducted in the last few years on keyword extraction and text classification.
A. The Textual Datasets
Many textual datasets are available for NLP, and in recent years interest in collecting data for such studies has increased. The investigators in [6] described Task 5 of the Workshop on Semantic Evaluation 2010 (SemEval-2010), which focuses on key-phrase extraction. The researchers compiled a set of 284 scientific articles with key-phrases carefully chosen by both their authors and readers. The dataset consists of trial, training and testing data drawn from conference and workshop papers in the ACM Digital Library. The papers ranged between six and eight pages and contained tables and pictures.
Also, in [7] the researchers collected 1,147,000 scientific abstracts covering different areas from arXiv, and then added the scientific documents present in the benchmark datasets comprising short abstracts (Inspec) and long scientific papers (SemEval-2010), which were later used to evaluate ranked keyword extraction.
In [8], the authors evaluated their algorithm and other baseline algorithms on 2,500 patent documents extracted from Google Patents.
B. Text Preprocessing operations
Text preprocessing is an important task and a basic step in many text mining and IR algorithms, and it is a fundamental part of any NLP system. Since characters, words, and sentences are identified at this stage, these major units are passed to all further processing stages. In [10], the authors present efficient preprocessing techniques that eliminate useless parts of a document such as prepositions, articles, and pronouns. These pre-processing techniques remove noise from text data, identify the root form of words, and reduce the size of the text data. Their objective was to analyze issues of preprocessing methods such as tokenization, stop-word removal and stemming for text documents.
In addition, the authors in [11] perform preprocessing on documents before classifying them. In preprocessing, stop words are removed and words are stemmed. In the researchers' view, stop-words should be removed from a text because they make the text heavier while being of little value for analysis.
Moreover, the authors in [12] applied preprocessing techniques to the input documents to present the text documents in a clean word format. The main steps taken, illustrated by the sketch after this list, are:
• Tokenization: A document is treated as a string and then partitioned into a list of tokens.
• Removing stop words: Stop words such as "the", "a", "and", etc. occur frequently, so these insignificant words need to be removed.
• Stemming: Applying a stemming algorithm that converts different word forms into a similar canonical form. This step is the process of conflating tokens to their root form, e.g. connection to connect and computing to compute.
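As a concrete illustration of these three steps (a minimal sketch, not the code of [12]), the following uses NLTK, the toolkit named later in Section IV; the sample sentence is a made-up input:

    # A minimal sketch of tokenization, stop-word removal and stemming with NLTK.
    # Requires one-time downloads: nltk.download('punkt'), nltk.download('stopwords').
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    text = "The connection between computing nodes is tested."  # toy input

    tokens = nltk.word_tokenize(text.lower())                        # tokenization
    stops = set(stopwords.words("english"))
    content = [t for t in tokens if t.isalpha() and t not in stops]  # stop-word removal
    stems = [PorterStemmer().stem(t) for t in content]               # stemming
    print(stems)  # e.g. ['connect', 'comput', 'node', 'test']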
C. Keywords Extraction
The International Encyclopedia of Information and Library Science [1] defines "keyword" as "A word that succinctly and accurately describes the subject, or an aspect of the subject, discussed in a document."
There are many techniques used to extract keywords. In this work, Word2Vec is used, a method that represents each word by a vector. The Word2Vec technique was created by a research team led by Tomas Mikolov at Google (2013) [13]. They proposed two new model architectures for learning distributed representations of words that minimize computational complexity, namely the Continuous Bag of Words (CBOW) and Skip-Gram (SG) models. Figure 1 illustrates the architectures of CBOW and SG.
Figure 1: The architectures of CBOW and SG
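To make the two architectures concrete, the following minimal sketch trains both with the gensim library (the library named in Section IV); the toy corpus and parameter values are illustrative assumptions, not the authors' settings:

    # Training CBOW and Skip-Gram word vectors with gensim (4.x API).
    from gensim.models import Word2Vec

    # Toy corpus: each document is a list of preprocessed tokens.
    sentences = [
        ["distributed", "systems", "consensus", "protocol"],
        ["information", "retrieval", "search", "ranking"],
    ]

    # sg=0 selects CBOW; sg=1 selects Skip-Gram.
    cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
    sg_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

    # Each vocabulary word is now represented by a dense vector.
    print(sg_model.wv["search"].shape)  # (100,)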
In addition, the authors in [14] present and discuss experiments on sentiment analysis of Twitter posts regarding United States (U.S.) airline companies. Their study aims to determine
whether word embeddings created with the word2vec algorithm can be used to classify sentiment. Their dataset was acquired from Kaggle.com and contains over 14,000 tweets about users' airline experiences, with 15 attributes including the original tweet text, Twitter user-related data, and the class sentiment label.
Furthermore, in the article by J. Lindén, S. Forsström, and T. Zhang [9], they present a combination of the paragraph vector algorithms Distributed Memory and Distributed Bag of Words with four classification algorithms, namely Decision Tree, Random Forest, Multi-Layer Perceptron (MLP) and Long Short-Term Memory (LSTM), to evaluate critical parameter modifications of the mentioned classification algorithms, with the aim of categorizing news articles.
D. Text Classification Algorithms
The aim of text classification is to classify text documents into a definite number of pre-defined classes. In classification, there are key issues such as handling a large number of features, dealing with unstructured text documents, and choosing a machine learning technique suitable for the text classification application.
The authors in [11] applied text mining algorithms to extract keywords from journal papers using TF-IDF and the WordNet thesaurus. The TF-IDF algorithm is used to select the candidate words, while WordNet, a lexical database of English, is used to find the similarity among the candidate words. Documents are then classified based on the extracted keywords using the machine learning algorithms NB, Decision Tree and KNN. The Decision Tree algorithm gives better results based on prediction accuracy when compared to the NB and KNN algorithms, with an accuracy of 98.47%.
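To illustrate the candidate-selection step described in [11] (a sketch, not the cited authors' implementation), TF-IDF scores can be computed with scikit-learn; the two toy documents are assumptions:

    # Ranking candidate keywords of one document by TF-IDF score.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["distributed agents coordinate in multi agent systems",
            "search engines rank documents for information retrieval"]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)

    # Top-scoring terms of the first document become candidate keywords.
    terms = vectorizer.get_feature_names_out()
    scores = tfidf.toarray()[0]
    print(sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)[:5])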
Wongkot Sriurai in his research [15] compared the Bag-of-Words (BOW) feature processing technique with the topic model. Text categorization algorithms such as NB, SVM and Decision Tree were used for experimentation, and precision, recall and the F1 measure were used for evaluating the text classification. The results showed that the topic-model approach to representing the documents yields the best performance, with an F1 measure of 79%, an improvement of 11.1% over the BOW model.
III. PROPOSED APPROACH FOR EXTRACTING KEYWORDS AND
TEXTUAL CLASSIFICATION
In this section, we present the proposed method of using the Word2Vec technique in combination with a Decision Tree classifier to extract keywords from textual documents. The architecture of the proposed method consists of three phases: (1) a preprocessing phase; (2) a keywords extraction phase with Word2Vec; (3) documents classification using a Decision Tree. We describe these three phases in the following subsections.
A. Pre-processing phase
Preprocessing operations are applied to the dataset before feeding it to the second phase. Their importance comes from the fact that they make the data more focused and clearer, which makes it easier to select keywords and place them into the correct categories to which they belong. The following operations are performed:
• Tokenization: the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The aim of tokenization is to expose the words in a sentence.
• Stop words elimination: Many words are repeated frequently in documents but are basically meaningless, as they only link words together in a sentence. Due to their high occurrence, their presence in the text extraction process is an obstacle to understanding the content of documents. Stop words are often common words like "and", "she", "this", etc. They are not helpful in classifying documents, so they must be eliminated.
• Stemming: the process of conflating the variant forms of a word into a common representation. In this work, three different stemming algorithms are used (compared in the sketch after this list):
i. English Porter stemming: used due to its accuracy and simplicity. It is designed for the English language and is based on the idea that the suffixes of words are frequently made up of a combination of smaller and simpler suffixes. If a suffix rule matches a word, then the conditions attached to that rule are tested and the stem is obtained by removing the suffix [16].
ii. Paice-Husk stemmer (Lancaster stemmer): an iterative stemmer that removes the endings from a word in an indefinite number of steps. It uses a separate rule file, which is first read into an array or list; this file is divided into a series of sections, each section corresponding to a letter of the alphabet [16] [17].
iii. WordNet Lemmatizer: lemmatization is the process of converting a word into its basic form. The difference between stemming and lemmatization is that the latter takes the context into account and converts the word into its meaningful base form, while the former removes only the last few letters, often leading to incorrect meanings and misspellings.
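The sketch below contrasts the three normalizers using their NLTK implementations; the example words are illustrative, and the printed forms are typical rather than guaranteed outputs:

    # Comparing Porter, Paice-Husk (Lancaster) and WordNet-based normalization.
    # Requires a one-time download for the lemmatizer: nltk.download('wordnet').
    from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

    porter = PorterStemmer()
    lancaster = LancasterStemmer()    # Paice-Husk, iterative and more aggressive
    lemmatizer = WordNetLemmatizer()  # returns dictionary forms, e.g. studies -> study

    for word in ["connection", "computing", "studies"]:
        print(word, porter.stem(word), lancaster.stem(word),
              lemmatizer.lemmatize(word))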
[Figure 3 schema, summarized: Preprocessing phase (tokenizing, removing stop-words, stemming) -> Keywords Extraction phase (collecting textual documents, calculating the most frequent keywords for each category, representing keywords using vectors) -> Classification phase (classifying the documents).]
B. Keywords Extraction phase
In this work, following the preprocessing stage, Word2Vec is used with its two architectures, SG and CBOW. The most frequent keywords are extracted for each category, and each word in every document is represented by a vector.
In the first stage, we collect the textual documents for each category: all training and testing documents for each class are merged and grouped into a single text file, then passed to the Word2Vec model for training. Then, 15 words from every document and each category are selected to be passed to the subsequent similarity operations.
CBOW takes the context of each word as input and tries to predict the word corresponding to the context. Its training complexity is shown in equation (1) [13]:

Q = N × D + D × log2(V)    (1)

where N is the size of the hidden layer, V is the vocabulary size, and D is the dimensionality of the word representations.
The SG model is the opposite of the CBOW model. The training complexity of this architecture is proportional to equation (2) [13]:

Q = C × (D + D × log2(V))    (2)

where C is the maximum distance of the words, V is the vocabulary size, and D is the dimensionality of the word representations.
The second stage is calculating the most frequent keywords in each document for each category; the 15 candidate words with the highest frequency are then filtered for similarity and affinity calculations. Each word is represented by a vector.
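A minimal sketch of this frequency step (illustrative, not the authors' exact code), selecting up to 15 top candidates from a preprocessed token list:

    # Selecting the most frequent candidate keywords of one document.
    from collections import Counter

    tokens = ["agent", "system", "agent", "protocol", "system", "agent"]  # toy input
    candidates = [w for w, _ in Counter(tokens).most_common(15)]
    print(candidates)  # ['agent', 'system', 'protocol']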
At the final stage, the cosine similarity is calculated between the candidate keywords from the first stage and the candidate keywords from the second one. Word2Vec generates two numerical vectors X and Y for two different words; the cosine similarity between the two words is defined as the normalized dot product of X and Y, as shown in equation (3) [18]:

cos(X, Y) = (X · Y) / (||X|| ||Y||)    (3)
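Equation (3) can be computed directly on two Word2Vec vectors, as in the following minimal NumPy sketch; gensim exposes the same measure as model.wv.similarity(word1, word2):

    # Cosine similarity: the normalized dot product of two word vectors.
    import numpy as np

    def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

    print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707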
C. Documents classification using Decision Tree
Once the keyword extraction stage has taken place, the similarity measure is computed between the nominated words from each document, a file is created for each classification, and the extracted data is presented in the form of a five-scale vector; the Decision Tree is then able to determine the membership of each document in the correct classification.
The target dataset has been divided into 60% as a training set and 40% as a testing set. To choose the optimal attribute for splitting the data, two measures were used, namely Information Gain and the Gini index.
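A minimal sketch of this classification step with scikit-learn (an assumption about tooling; the feature rows and labels below are hypothetical five-scale similarity vectors and ACM category labels, not the paper's data):

    # Decision Tree on similarity-score vectors with a 60/40 train/test split.
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X = [[0.81, 0.12, 0.30, 0.25, 0.44],   # one similarity vector per document
         [0.10, 0.77, 0.41, 0.22, 0.35],
         [0.15, 0.70, 0.38, 0.30, 0.31],
         [0.78, 0.20, 0.28, 0.27, 0.40]]
    y = ["C2.4", "H3.3", "H3.3", "C2.4"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=0)

    # criterion="gini" uses the Gini index; criterion="entropy" uses Information Gain.
    clf = DecisionTreeClassifier(criterion="gini").fit(X_train, y_train)
    print(clf.predict(X_test))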
The Decision Tree of CBOW with the WordNet Lemmatizer at the fourth scale using Entropy is presented in Figure 2.
Figure 2: The Decision Tree of CBOW with the WordNet Lemmatizer
The architecture of the proposed method is summarized by the schema in Figure 3.
Figure 3: The architecture of the proposed method
IV. EXPERIMENTS AND RESULTS
This section first describes the input corpus and the tools used for implementing the proposed keyword extraction approach and measuring its performance. It then presents and discusses the results of the experiments.
A. Input corpus
To test the proposed approach, the SemEval (2010) dataset is used. It covers four different research areas, ensuring a variety of topics, which relate to the following 1998 ACM classifications: C2.4 Distributed Systems; H3.3 Information Search and Retrieval; I2.11 Distributed Artificial Intelligence - Multi-Agent Systems; and J4 Social and Behavioural Sciences - Economics. The trial, training and testing datasets covered the four categories and were provided with 40, 144, and 100 articles, respectively.
B. Used tools
To apply the preprocessing operations to the mentioned data, the Natural Language Toolkit (NLTK) tokenizer is used. To extract keywords with Word2Vec, the gensim library for Python is utilized.
C. Result evaluation
This section presents the measurement metrics used for evaluating the proposed approach: precision, recall and F1-score. These three metrics are commonly used to evaluate the performance of information retrieval and natural language processing tools.
Precision (P) is the number of correct results divided by the number of all returned results, as shown in equation (4):

P = TP / (TP + FP)    (4)

Recall (R) is the number of correct results divided by the number of results that should have been returned, as shown in equation (5):

R = TP / (TP + FN)    (5)

where TP, FP and FN denote true positives, false positives and false negatives, respectively. The F-measure is defined as the harmonic mean of precision (P) and recall (R), as shown in equation (6):

F = 2 × P × R / (P + R)    (6)
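Equations (4)-(6) correspond to standard library metrics; a minimal sketch with scikit-learn on hypothetical true/predicted labels:

    # Macro-averaged precision, recall and F-score over category labels.
    from sklearn.metrics import precision_recall_fscore_support

    y_true = ["C2.4", "H3.3", "I2.11", "J4", "H3.3"]
    y_pred = ["C2.4", "H3.3", "H3.3", "J4", "H3.3"]

    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    print(f"precision={p:.2f} recall={r:.2f} f-score={f:.2f}")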
D. The Results of Word2Vec CBOW
Using the Gini method with the WordNet Lemmatizer, the F-score was 64%. The Entropy method with the same measure achieved a 59% F-score, and the highest percentage using the Entropy method was a 62% F-score, achieved by English Porter Stemming with the fourth scale.
The lowest percentage was obtained with the Paice-Husk stemmer at scale three, with an F-score of 22% using the Gini Index method.
With regard to the combined keywords from authors and readers, the fourth measure achieved the highest F-score of 57% with the Entropy method.
Figure 4 shows the confusion matrix of the highest obtained ratio using the Gini Index and CBOW with the WordNet Lemmatizer. The F-score of 0.49 for label I contributed to lowering the overall score.
Figure 4: The confusion matrix of the highest F-score, CBOW with Gini Index (WordNet Lemmatizer)
E. The Results of Word2Vec Skip Gram
The highest average score for English Porter Stemming at the second scale was 82% with the Gini Index; the same score was likewise achieved at the third scale using the Entropy method. The lowest average score, for keywords nominated jointly by authors and readers, was 52% using the Gini Index method with scale five.
As for the WordNet Lemmatizer, it achieved 78% using both the Gini Index and Entropy methods with the third scale. The Paice-Husk stemmer achieved its highest ratio with the fourth and fifth measures using the Gini Index and Entropy methods, with a value of 76%.
Using the Skip-Gram algorithm, a general improvement in the results is noted, reaching an F-score of 82%, with a significant improvement in the classification performance for the H label. Figures 5 and 6 illustrate the confusion matrices of the highest average scores.
Figure 5: The confusion matrix of the highest F-score, SG with Gini Index (English Porter Stemming)
Figure 6: The confusion matrix of the highest F-score, SG with Entropy (English Porter Stemming)
V. CONCLUSIONS
This paper discussed a method for extracting keywords from text documents and classifying these documents using both Word2Vec and a Decision Tree. The Word2Vec model is used to obtain the keywords, providing a word-similarity mechanism that measures the closeness between words. The Decision Tree is then used to find the correct classification for the target document. To evaluate the performance of the proposed method, the precision, recall and F-score values were computed.
While the results varied across the five measures used and between the CBOW and SG techniques, the SG method proved its effectiveness with the Decision Tree in determining the correct classifications of documents, with an F-score exceeding 80%.
Comparing the obtained results with previous studies, the proposed method proved its effectiveness in finding the correct classification of documents and outperformed its counterparts using the same keywords.
REFERENCES
[1] S. Siddiqi and A. Sharan, "Keyword and keyphrase extraction techniques: a literature review," International Journal of Computer Applications, vol. 109, no. 2, 2015.
[2] M. Allahyari et al., "A brief survey of text mining: Classification, clustering and extraction techniques," arXiv preprint arXiv:1707.02919v2, 2017.
[3] V. Gupta and G. S. Lehal, "A survey of text mining techniques and applications," Journal of Emerging Technologies in Web Intelligence, vol. 1, no. 1, pp. 60-76, 2009.
[4] A. K. S. Tilve and S. N. Jain, "A survey on machine learning techniques for text classification," International Journal of Engineering Sciences & Research Technology, 2017.
[5] S. K. Bharti and K. S. Babu, "Automatic keyword extraction for text summarization: A survey," European Journal of Advances in Engineering and Technology, 2017.
[6] S. N. Kim, O. Medelyan, M.-Y. Kan, and T. Baldwin, "SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles," in Proceedings of the 5th International Workshop on Semantic Evaluation, 2010, pp. 21-26.
[7] D. Mahata, R. R. Shah, J. Kuriakose, R. Zimmermann, and J. R. Talburt, "Theme-weighted ranking of keywords from text documents using phrase embeddings," in 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2018, pp. 184-189.
[8] J. Hu, S. Li, Y. Yao, L. Yu, G. Yang, and J. Hu, "Patent keyword extraction algorithm based on distributed representation for patent classification," Entropy, vol. 20, no. 2, p. 104, 2018.
[9] J. Lindén, S. Forsström, and T. Zhang, "Evaluating combinations of classification algorithms and paragraph vectors for news article classification," in 2018 Federated Conference on Computer Science and Information Systems (FedCSIS), 2018, pp. 489-495.
[10] S. Kannan and V. Gurusamy, "Preprocessing techniques for text mining," International Journal of Computer Science & Communication Networks, vol. 5, no. 1, pp. 7-16, 2014.
[11] S. Menaka and N. Radha, "Text classification using keyword extraction technique," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 12, 2013.
[12] M. Mowafy, A. Rezk, and H. El-Bakry, "An efficient classification model for unstructured text document," American Journal of Computer Science and Information Technology, vol. 6, no. 1, p. 16, 2018.
[13] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781v3, 2013.
[14] J. Acosta, N. Lamaute, M. Luo, E. Finkelstein, and Andreea, "Sentiment analysis of Twitter messages using Word2Vec," Proceedings of Student-Faculty Research Day, CSIS, Pace University, p. 7, 2017.
[15] W. Sriurai, "Improving text categorization by using a topic model," Advanced Computing: An International Journal, vol. 2, no. 6, p. 21, 2011.
[16] M. S. Kumar and K. Murthy, "Corpus based statistical approach for stemming Telugu," Creation of Lexical Resources for Indian Language Computing Processing, C-DAC, Mumbai, India, 2007.
[17] N. Giridhar, K. Prema, N. S. Reddy, and P. Subba, "A prospective study of stemming algorithms for web text mining," Ganpat University Journal of Engineering Technology, vol. 1, pp. 28-34, 2011.
[18] L. Ma, "A multi-label text classification framework: Using supervised and unsupervised feature selection strategy," Bonfring International Journal of Data Mining, 2017.