2. Related Work
Text document classification has become an emerging field within the
text mining research area. Consequently, an abundance of approaches has
been developed for this purpose, including support vector machines (SVM) [1],
K-nearest neighbor (KNN) classification [2], Naive Bayes classification [3],
decision trees (DT) [4], neural networks (NN) [5], and maximum entropy [6].
Among these approaches, the multinomial Naive Bayes text classifier has been
used widely because of the simplicity of its training and classification
phases [3]. Many researchers have demonstrated its effectiveness in
classifying unstructured text documents in various domains.
Dalal and Zaveri [7] presented a generic strategy for automatic text
classification comprising phases such as preprocessing, feature selection
using semantic or statistical techniques, and selection of an appropriate
machine learning technique (Naive Bayes, decision trees, hybrid techniques,
support vector machines). They also discussed some of the key issues involved
in text classification, such as handling a large number of features, working
with unstructured text, dealing with missing metadata, and choosing a
suitable machine learning technique for training a text classifier.
Bolaj and Govilkar [8] presented a survey of text categorization techniques
for Indian regional languages and showed that Naive Bayes, K-nearest
neighbor, and support vector machines are the most suitable techniques for
achieving good document classification results for those languages. Jain and
Saini [9] used a statistical approach to classify Punjabi text; their Naive
Bayes classifier was successfully implemented, tested, and achieved
satisfactory results in classifying the documents.
Tilve and Jain [10] applied three text classification algorithms (Naive
Bayes, a vector space model (VSM), and a newly implemented approach based on
the Stanford POS tagger) to two different datasets (20 Newsgroups and a new
five-category news dataset). Compared with the other strategies, Naive Bayes
proved well suited to serving as a text classification model because of its
simplicity. Gogoi and Sharma [11] highlighted the performance of the Naive
Bayes technique in document classification.
In that work, a classification model was built and evaluated on a small
dataset of four categories and 200 documents for training and testing, and
the results were validated using the statistical measures of precision,
recall, and their combination, the F-measure. The results showed that Naive
Bayes is a good classifier, and the study could be extended by applying Naive
Bayes classification to larger datasets. Rajeswari et al. [12] focused on
text classification using the Naive Bayes and K-nearest neighbor classifiers
and compared the performance and accuracy of the two approaches. Their
results showed that Naive Bayes is the better classifier, with an accuracy of
66.67% as opposed to 38.89% for the KNN classifier.
The research most closely related to ours is that of Trstenjak et al. [13],
who proposed a text categorization framework based on the KNN technique
combined with the TF-IDF method. KNN and TF-IDF embedded together gave good
results and confirmed the initial expectations. The framework was tested on
several categories of text documents, and during testing the KNN algorithm
produced accurate classifications. The authors noted that the framework could
be upgraded and improved for higher accuracy.
The Proposed Model
The proposed model is expected to make text classification more efficient.
Whereas the model presented in [13] depends on KNN with TF-IDF, our work uses
the multinomial Naive Bayes technique, chosen for its current popularity and
efficiency in text classification, as discussed earlier, and in particular
for its superior performance over K-nearest neighbor, as illustrated in [12].
The proposed model combines multinomial Naive Bayes as the selected machine
learning technique for classification, TF-IDF as the vector space model for
feature extraction, and the chi-square (chi2) technique for feature
selection, aiming at more accurate text classification with better
performance than the tested framework of [13].
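As a concrete illustration of the chi2 feature-selection stage, the following
is a minimal sketch using scikit-learn (an assumed library; the paper states
only that Python was used, and the corpus and labels are toy stand-ins):

```python
# Hedged sketch of chi-square term scoring over TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

docs = ["the rocket reached orbit today",
        "the team won the final game",
        "engines fired and the rocket landed",
        "players trained hard before the game"]
labels = [0, 1, 0, 1]                      # toy labels: 0 = space, 1 = sport

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)              # TF-IDF vector space model
scores, p_values = chi2(X, labels)         # chi2 score of each term vs. class
for term, score in sorted(zip(tfidf.get_feature_names_out(), scores),
                          key=lambda pair: -pair[1])[:5]:
    print(f"{term}: {score:.3f}")          # the best M terms would be kept
```

Terms with the highest chi2 scores are the ones most dependent on the class
label, which is why keeping only the top M of them tends to preserve the
discriminative features.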
The general steps for building the classification model, as presented in
Figure 1, are:
1. Preprocessing of all labeled and unlabeled documents.
2. Training, in which the classifier is constructed from the prepared labeled
training instances.
3. Testing, in which the model is evaluated on prepared samples whose class
labels are known but were not used for training.
4. Usage, in which the prepared model classifies new prepared data whose
class labels are unknown.
A sketch of these phases chained into a single pipeline follows.
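One way to read these phases concretely is as a scikit-learn pipeline that
chains TF-IDF extraction, chi2 selection, and multinomial Naive Bayes
(assumed tooling; the documents, labels, and k value are toy stand-ins):

```python
# Hedged end-to-end sketch of the proposed model's phases.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

train_docs = ["rocket launch delayed", "team wins final",
              "orbit achieved by probe", "coach praises players"]
train_labels = [0, 1, 0, 1]               # toy labels: 0 = space, 1 = sport

model = Pipeline([
    ("tfidf", TfidfVectorizer()),         # text presentation (TF-IDF)
    ("chi2", SelectKBest(chi2, k=6)),     # feature selection
    ("mnb", MultinomialNB(alpha=1.0)),    # multinomial NB, Laplace smoothing
])
model.fit(train_docs, train_labels)       # training phase
print(model.predict(["the probe reached orbit"]))  # usage phase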
Figure 1: Model Architecture
Pre-processing
This phase is applied to the input documents, whether the labeled data of the
training and testing phases or the unlabeled data of the usage phase. It
presents the text documents in a clean word format, so that the output
documents are prepared for the subsequent text classification phases.
Commonly, steps such as the following are taken:
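A typical sequence covers tokenization, lowercasing, stop-word removal, and
stemming, as in this hedged NLTK-based sketch (the exact steps and tooling
are assumptions):

```python
# Illustrative sketch of a common text-preprocessing pipeline.
import re
from nltk.corpus import stopwords       # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenize, lowercase
    tokens = [t for t in tokens if t not in stop_words]  # drop stop words
    return [stemmer.stem(t) for t in tokens]             # reduce to stems

print(preprocess("The teams were playing games today."))
# e.g. ['team', 'play', 'game', 'today']
```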
Constructing the vector space model: Once the feature selection phase has
been done and the best M features have been selected, the vector space model
is constructed. It represents each document as a vector with M dimensions,
where M is the size of the selected feature set produced in the previous
phase. Each vector can be written as:
$$V_d = [w_{1d}, w_{2d}, \ldots, w_{Md}]$$
where $w_{id}$ is a weight measuring the importance of term i in document d.
Various methods can be used for weighting the terms, as mentioned in the text
presentation phase. Our model uses TF-IDF, which stands for Term
Frequency-Inverse Document Frequency and can be calculated as [17]:
$$w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \cdot \log\frac{N}{df_i} \qquad (4)$$
where $w_{ij}$ is the weight of term i in document j, N is the number of
documents in the training set, $tf_{ij}$ is the frequency of term i in
document j, and $df_i$ is the document frequency of term i over the documents
of the training set. The following pseudocode presents the TF-IDF vector
space model:
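Since it realizes equation (4) directly, a minimal Python sketch can stand in
for that pseudocode (the toy corpus and the use of raw counts for tf are
assumptions):

```python
# Minimal sketch of equation (4): w_ij = tf_ij * log(N / df_i).
import math

def tfidf_vectors(docs, vocabulary):
    N = len(docs)
    # df_i: number of training documents that contain term i.
    df = {t: sum(1 for d in docs if t in d) for t in vocabulary}
    vectors = []
    for d in docs:
        # Terms that never occur (df = 0) get weight 0.
        vectors.append([d.count(t) * math.log(N / df[t]) if df[t] else 0.0
                        for t in vocabulary])
    return vectors

# Toy tokenized documents and a selected M-term feature set.
docs = [["ball", "team", "ball"], ["space", "orbit"], ["team", "win"]]
vocab = ["ball", "team", "space", "orbit", "win"]
for vec in tfidf_vectors(docs, vocab):
    print([round(w, 3) for w in vec])
```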
Training the classifier: Training the classifier is the key component of the
text classification process. The role of this phase is to build the
classifier, that is, to generate the model, by training it on predefined
documents; the resulting model is then used to classify unlabeled documents.
The multinomial Naive Bayes classifier is selected to build the classifier in
our model.
The following probability calculations are performed during this phase:
A. For each term (feature) in the selected feature set, calculate the
probability of that feature for each class. Let C = {c_1, c_2, ..., c_k}
denote the set of class labels, and let N be the length of the selected
feature set {t_1, t_2, ..., t_N}. Using term frequencies to estimate the
class-conditional probabilities in the multinomial model [18]:
$$P(t_i \mid c_k) = \frac{\sum tf(t_i, d \in c_k) + \alpha}{\sum N_{d \in c_k} + \alpha \cdot V} \qquad (5)$$
This can be adapted to use TF-IDF in our model, as in the following equation:
$$P(t_i \mid c_k) = \frac{\sum tfidf(t_i, d \in c_k) + \alpha}{\sum tfidf_{d \in c_k} + \alpha \cdot V} \qquad (6)$$
where:
• $t_i$: a word from the feature vector t of a particular sample.
• $\sum tfidf(t_i, d \in c_k)$: the sum of the raw TF-IDF weights of word
$t_i$ over all documents in the training sample that belong to class $c_k$.
• $\sum tfidf_{d \in c_k}$: the sum of all TF-IDF weights in the training
dataset for class $c_k$.
• $\alpha$: an additive smoothing parameter ($\alpha = 1$ for Laplace
smoothing).
• V: the size of the vocabulary (the number of different words in the
training set).
The pseudocode for this calculation step is:
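A NumPy sketch of this calculation can stand in for that pseudocode (a dense
tf-idf matrix and the toy data are assumptions):

```python
# Sketch of equation (6): Laplace-smoothed class-conditional probabilities
# from the tf-idf mass of each term within each class.
import numpy as np

def class_conditional_probs(X_tfidf, y, alpha=1.0):
    """X_tfidf: (n_docs, V) tf-idf matrix; y: one class label per row.
    Returns {class: length-V array of P(t_i | c_k)}."""
    n_docs, V = X_tfidf.shape
    probs = {}
    for c in np.unique(y):
        rows = X_tfidf[y == c]
        term_mass = rows.sum(axis=0)   # sum of tf-idf of term t_i in class c_k
        total_mass = term_mass.sum()   # sum of all tf-idf in class c_k
        probs[c] = (term_mass + alpha) / (total_mass + alpha * V)
    return probs

# Toy data: four documents over a three-term vocabulary, two classes.
X = np.array([[0.5, 0.0, 0.2],
              [0.3, 0.1, 0.0],
              [0.0, 0.6, 0.4],
              [0.1, 0.5, 0.3]])
y = np.array([0, 0, 1, 1])
print(class_conditional_probs(X, y))
```

Note that for each class the smoothed estimates sum to one over the
vocabulary, as a probability distribution must.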
Figure 4: Usage Phase
Experimental Results
Dataset and setup considerations
The experiment is carried out on the 20 Newsgroups corpus collected by Ken
Lang, which has become one of the standard corpora for text classification.
It is a collection of 19,997 newsgroup documents taken from Usenet news and
partitioned across 20 different categories. The corpus is split 60/40: 11,998
samples are used for training the classifier and the remaining 40% for
testing it. The characteristics of the experimental environment are shown in
Table 1.
Table 1: Experimental environment

OS        Windows 7
CPU       Core(TM) i7-3630
RAM       8.00 GB
Language  Python
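For reference, loading the corpus and reproducing a 60/40 split can be
sketched with scikit-learn (an assumed tool; the paper specifies only Python,
and the random seed is illustrative):

```python
# Hedged sketch: fetch 20 Newsgroups and split it 60% train / 40% test.
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

data = fetch_20newsgroups(subset="all")        # all 20 categories
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, train_size=0.60, random_state=42)
print(len(X_train), "training docs;", len(X_test), "test docs")
```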
Results
The experiment measures the following items:
Superiority of multinomial Naive Bayes (MNB) with TF-IDF over KNN: In this
section we investigate the effectiveness of our choice of combining the
TF-IDF weighting with multinomial Naive Bayes (MNB) and compare it against
combining the same weighting with KNN [13] for classifying unstructured text
documents. Performance is evaluated in terms of accuracy, precision, recall,
and F-score over all the documents, as shown in Table 2 and Figure 5. The run
time of both MNB and KNN with TF-IDF is presented in Figure 6, and the
accuracy, precision, recall, and F-score of each category under both
techniques are illustrated in Table 3.
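These measures can be computed as in the following sketch with scikit-learn's
metrics (an assumption; weighted averaging is one plausible choice for the
multi-class scores, and the labels are toy stand-ins):

```python
# Sketch: the four reported measures from true and predicted labels.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [0, 1, 2, 2, 1, 0]   # toy ground-truth class labels
y_pred = [0, 1, 2, 1, 1, 0]   # toy classifier predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="weighted"))
print("Recall   :", recall_score(y_true, y_pred, average="weighted"))
print("F1-score :", f1_score(y_true, y_pred, average="weighted"))
```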
Table 2: Accuracy, precision, recall and F-measure of both approaches

Results     MNB-TFIDF   KNN-TFIDF
Accuracy    0.87        0.71
Precision   0.88        0.72
Recall      0.87        0.71
F1-Score    0.87        0.71
Time        0.44 ms     18.99 ms
Figure 5: Comparison of performance results for using both KNN and
multinomial Naive Bayes (MNB) with TF-IDF
Figure 8: Runtime comparison for MNB-TFIDF with/without chi2 feature
selection
Evaluation of the proposed model: The whole model is evaluated through the
precision, recall, and F1-score performance measures for each category in the
testing dataset, as shown in Table 5.
Table 5: Performance measures for each category in the testing dataset

Category  No. of docs  Precision  Recall  F1-score
Cat. 0 278 0.69 0.79 0.73
Cat. 1 287 0.92 0.91 0.91
Cat. 2 305 0.94 0.96 0.95
Cat. 3 320 0.86 0.95 0.90
Cat. 4 288 0.96 0.95 0.96
Cat. 5 315 0.94 0.90 0.92
Cat. 6 297 0.98 0.97 0.97
Cat. 7 322 0.96 0.97 0.96
Cat. 8 298 1 0.96 0.98
Cat. 9 278 1 0.97 0.99
Cat. 10 287 0.98 0.99 0.99
Cat. 11 308 0.98 0.96 0.97
Cat. 12 288 0.94 0.90 0.92
Cat. 13 318 0.93 0.96 0.95
Cat. 14 286 0.98 0.99 0.99
Cat. 15 282 0.97 1 0.98
Cat. 16 308 0.81 0.97 0.88
Cat. 17 312 0.92 0.92 0.92
Cat. 18 304 0.84 0.70 0.77
Cat. 19 319 0.72 0.59 0.65
Capability of the model to work with other techniques: This part of the
evaluation demonstrates the capability of the model to improve the quality of
other classification techniques, as shown in Tables 6 and 7 and Figures 9
and 10.
Table 6: Accuracy, precision, recall and F-measure of KNN-TFIDF with and
without the proposed model

Results     KNN-TFIDF        KNN-TFIDF using the model
Accuracy    0.71             0.89
Precision   0.72             0.89
Recall      0.71             0.89
F1-Score    0.71             0.89
Time        0:00:33.681047   0:00:10.93001
Figure 9: Comparison of performance results for another technique with and
without the proposed model
Table 7: Performance measures for each category in the testing dataset for
KNN-TFIDF with and without the proposed model

Cat     No. of docs   Precision         Recall            F1-score
                      with    without   with    without   with    without
Cat. 0  278           0.64    0.59      0.80    0.81      0.71    0.68
Cat. 1  287           0.89    0.53      0.89    0.57      0.89    0.55
Cat. 2  305           0.92    0.58      0.95    0.71      0.93    0.64
6. Nigam K, Lafferty J, McCallum A (1999) Using maximum entropy for text
classification. In Proceedings: IJCAI-99 Workshop on Machine Learning for
Information Filtering 61-67.
7. Dalal MK, Zaveri MA (2011) Automatic Text Classification: A
Technical Review. Int J Comp App 28: 0975-8887.
8. Bolaj P, Govilkar S (2016) A Survey on Text Categorization
Techniques for Indian Regional Languages. Int J Comp Sci Inform
Technol 7: 480-483.
9. Jain U, Saini K (2015) Punjabi Text Classification using Naive Bayes
Algorithm. Int J Curr Engineering Technol 5.
10. Tilve AKS, Jain SN (2017) Text Classification using Naive Bayes,
VSM and POS Tagger. Int J Ethics in Engineering & Management
Education 4: 1.
11. Gogoi M, Sharma SK (2015) Document Classification of Assamese
Text Using Naïve Bayes Approach. Int J Comp Trends Technol 30: 4.
12. Rajeswari RP, Juliet K, Aradhana (2017) Text Classification for
Student Data Set using Naive Bayes Classifier and KNN Classifier.
Int J Comp Trends Technol 43: 1.
13. Trstenjak B, Mikac S, Donko D (2014) KNN with TF-IDF based
Framework for Text Categorization. Procedia Engineering 69:
1356-1364.
14. Grobelnik M, Mladenic D (2004) Text-mining tutorial. In: Proceedings of
Learning Methods for Text Understanding and Mining, pp: 26-29.
15. Vector space method (2017) Available at: en.wikipedia.org/wiki/Vector_space.
16. Thaoroijam K (2014) A Study on Document Classification using
Machine Learning Techniques. Int J Comp Sci Issues 11: 1.
17. Wang D, Zhang H, Liu R, Lv W, Wang D (2014) t-Test feature
selection approach based on term frequency for text
categorization. Pattern Recognition Letters 45: 1-10.
18. Raschka S (2014) Naive Bayes and Text Classification I-Introduction
and Theory.
19. Pop L (2006) An approach of the Naïve Bayes classifier for the
document classification. General Mathematics 14: 135-138.