Text categorization

Contents
• Introduction
• Problem Definition
• Literature Review
• Objective
• Methodology & Algorithmic Strategy
• Proposed System
• System Architecture
• Modules
• Technology
• Application
• References

Introduction
• Data Mining Domain.
• What is Text categorization?
• What is Text Reduction?
• Advantages of Text categorization
• Automated Text Categorization

Introduction (Contd…)
Fig: Working of Text Classifier (Dataset + Unknown Input → Text Classifier → Output Class 1, Output Class 2, …, Output Class N)

Problem definition
• Text categorization is the process of classifying unknown text into
predefined classes.
• Until now, classification in the system has been done manually, and
automated classification does not give better efficiency.
• To improve the efficiency of automated text categorization, we propose a
modified Naïve Bayes algorithm that overcomes the disadvantages of the
existing system.

Objectives
• Preprocess the 20 Newsgroups dataset.
• Perform text reduction.
• Generate frequencies based on the modified algorithm.
• Classify unknown input using the proposed algorithm.
• Compare the existing and proposed systems.

Literature Review
1. Title: Toward Optimal Feature Selection in Naive Bayes for Text Categorization
Authors: Bo Tang, Student Member, IEEE; Steven Kay, Fellow, IEEE; and Haibo He, Senior Member, IEEE
Published: IEEE, Feb 2016
Description: Automated feature selection is important for text categorization to reduce the feature size and to speed up the learning process of classifiers. The paper presents a novel and efficient feature selection framework based on information theory, which aims to rank the features by their discriminative capacity for classification.
Disadvantages: Accuracy is very low when based on wordcount. It can be improved by using other classification algorithms such as Naïve Bayes.

Literature Review
2. Title: Comparative Study Of Classification Algorithm For Text Based Categorization
Authors: Omkar Ardhapure, Gayatri Patil, Disha Udani, Kamlesh Jetha
Published: IJRET, Feb 2016
Description: Text categorization is a process in data mining that assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using categorization techniques.
Disadvantages: Accuracy is not compared on a large amount of data. In data mining, the more data is used, the more reliable the results.

Literature Review
3. Title: Text Classification by Combining Text Classifiers to Improve the Efficiency of Classification
Authors: Aaditya Jain, Jyoti Mandowara
Published: IJCA, April 2016
Description: The basic working of a web crawler is presented in this paper.
Disadvantages: Nothing is said about how pages can be ranked using algorithms. Only the working of text classification is described.

Literature Review
4. Title: Arabic Text Categorization using k-nearest neighbour, Decision Trees (C4.5) and Rocchio Classifier: A Comparative Study
Authors: Adel Hamdan Mohammad, Omar Al-Momani and Tariq Alwada’n
Published: IJCET, April 2016
Description: The authors note that much research on text classification deals with the English language, while few researchers address text classification on Arabic data sets. This research applies three well-known classification algorithms: k-nearest neighbour (k-NN), C4.5 and Rocchio.
Disadvantages: Accuracy is not compared on a large amount of data. In data mining, the more data is used, the more reliable the results.

Literature Review
5. Title: Text Categorization on Multiple Languages Based On Classification Technique
Authors: Kapila Rani, Satvika
Published: IJCSIT, March 2016
Description: The amount of textual data available in electronic form in various Indian regional languages is constantly increasing. Hence, the classification of text documents based on language is essential.
Disadvantages: Not very informative regarding search engine optimization.

Methodology & Algorithmic Strategy

Methodology
Pre-processing Dataset (Apply Reduction)
The first step is to perform text reduction on the 20 Newsgroups dataset using
a text reduction technique.
Dataset → Preprocessor Method → Reduction Procedure → Reduced Dataset

Methodology
Generating Frequencies
The following frequencies are generated from the dataset:
• Wordcount = number of times a word occurs in a file
• Term Frequency (TF) = occurrences of the word in a file / total number of
words in that file
• Inverse Document Frequency (IDF) = total number of documents / number of
documents in which the term occurs
• Normalized Term Frequency (NTF) = ∑ TF / number of occurrences
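
The frequency definitions above can be sketched in Java as follows. This is a minimal illustration, not the system's actual implementation: the class and method names are hypothetical, each file is assumed to be already tokenized into a list of words, and IDF is computed as the plain ratio given on the slide (without a logarithm). The slide's NTF formula is ambiguous, so the sketch assumes one possible reading: the sum of a word's TF values over all files, divided by the number of files in which the word occurs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FrequencySketch {

    // Wordcount: number of times each word occurs in one file (given as tokens).
    static Map<String, Integer> wordCount(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // TF = occurrences of the word in the file / total number of words in the file.
    static double termFrequency(String word, List<String> tokens) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        int count = wordCount(tokens).getOrDefault(word, 0);
        return (double) count / tokens.size();
    }

    // Number of files that contain the word at least once.
    static int documentCount(String word, List<List<String>> files) {
        int n = 0;
        for (List<String> file : files) {
            if (file.contains(word)) {
                n++;
            }
        }
        return n;
    }

    // IDF as written on the slide: total documents / documents containing the word.
    static double inverseDocumentFrequency(String word, List<List<String>> files) {
        int df = documentCount(word, files);
        return df == 0 ? 0.0 : (double) files.size() / df;
    }

    // NTF under the ASSUMED reading: sum of TF over all files / files containing the word.
    static double normalizedTermFrequency(String word, List<List<String>> files) {
        int df = documentCount(word, files);
        if (df == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (List<String> file : files) {
            sum += termFrequency(word, file);
        }
        return sum / df;
    }

    public static void main(String[] args) {
        // Toy corpus invented for illustration only.
        List<List<String>> files = Arrays.asList(
                Arrays.asList("space", "shuttle", "space", "launch"),
                Arrays.asList("hockey", "game"),
                Arrays.asList("space", "game"));
        System.out.println("TF  = " + termFrequency("space", files.get(0)));
        System.out.println("IDF = " + inverseDocumentFrequency("space", files));
        System.out.println("NTF = " + normalizedTermFrequency("space", files));
    }
}
```

If the intended NTF normalization differs (for example, per class rather than over all files), only `normalizedTermFrequency` would need to change.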

Methodology
Classification
Dataset + Unknown Input → Text Classifier → Output Class 1, Output Class 2, …, Output Class N

Proposed Work
• The system will provide automatic text categorization using the
Modified Naïve Bayes algorithm.
• Text reduction and feature selection are performed for dataset
preprocessing.
• Dataset for simulation: 20 Newsgroups
• A comparison of the existing and proposed systems will be provided.

Proposed Architecture
News Dataset + Unknown News → Modified NB Algorithm → Output Class 1, Output Class 2, …, Output Class N

Modules
1. Dataset preprocessing
2. Text Categorization
3. Modified NB Implementation
4. Comparative Study

Module - Dataset Preprocessing
1. Read each file one by one.
2. Perform preprocessing:
a. Remove stop words.
b. Remove special symbols.
c. Remove unwanted spaces.
3. Calculate word count, Term Frequency, Normalized Term Frequency and Inverse
Document Frequency.
4. Insert the data into the DB as word, docid, classid, wordcount, TF, IDF, NTF.
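
The preprocessing step (stop words, special symbols, unwanted spaces) might look roughly like the Java sketch below. The class name, the method name and the very short stop-word list are hypothetical and chosen only for illustration. The sketch also lowercases the text and assumes English input, neither of which is stated on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PreprocessSketch {

    // Tiny illustrative stop-word list (assumption); a real run would load a fuller list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "are", "of", "in", "on", "to", "and"));

    // Returns the cleaned tokens of one file's text.
    static List<String> preprocess(String text) {
        String cleaned = text.toLowerCase()          // lowercasing is an added assumption
                .replaceAll("[^a-z0-9\\s]", " ")     // special symbols become spaces (English text assumed)
                .trim();
        List<String> tokens = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return tokens;
        }
        // Splitting on runs of whitespace also discards unwanted extra spaces.
        for (String token : cleaned.split("\\s+")) {
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("The  shuttle is on the launch-pad!!"));
    }
}
```

The resulting token list is the kind of input from which word count, TF, IDF and NTF could then be calculated before the rows are inserted into the database.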

Module – Text Categorization
1. Perform text categorization on an unknown file.
a. Preprocess the input file: remove stop words, special symbols and unwanted
spaces.
2. Generate the decision matrix.
Rows: News class 1, News class 2, …, News class N
Columns: Word 1, Word 2, …, Word N

1. For the k-NN algorithm, use wordcount in each cell.
News class 1: Wordcount(word1, class 1), Wordcount(word2, class 1), …, Wordcount(wordN, class 1)
News class 2 to News class N: filled in the same way

1. For the Naïve Bayes algorithm, use TF * IDF in each cell.
News class 1: TF*IDF(word1, class 1), TF*IDF(word2, class 1), …, TF*IDF(wordN, class 1)
News class 2 to News class N: filled in the same way

1. For the Modified Naïve Bayes algorithm, use NTF * IDF in each cell.
News class 1: NTF*IDF(word1, class 1), NTF*IDF(word2, class 1), …, NTF*IDF(wordN, class 1)
News class 2 to News class N: filled in the same way

2. Perform row-wise addition to obtain one frequency total per class (Class 1
to Class N):
• Word count for k-NN
• TF * IDF for Naïve Bayes
• NTF * IDF for Modified Naïve Bayes
3. Take the maximum of the resulting array and select its index as the
predicted class.
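
The last two steps (row-wise addition, then choosing the index of the maximum) can be sketched in Java as below. The class and method names are hypothetical, and the matrix values in `main` are invented for illustration. The cell values are assumed to have been filled in beforehand with wordcount, TF * IDF or NTF * IDF, depending on the algorithm being compared.

```java
class DecisionMatrixSketch {

    // Row-wise addition: one total score per class (rows are classes, columns are words).
    static double[] rowSums(double[][] matrix) {
        double[] sums = new double[matrix.length];
        for (int c = 0; c < matrix.length; c++) {
            for (double cell : matrix[c]) {
                sums[c] += cell;
            }
        }
        return sums;
    }

    // Index of the largest score; ties are resolved in favour of the first class.
    static int argMax(double[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("at least one class is required");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    // Predicted class = index of the row with the largest total.
    static int predictClass(double[][] matrix) {
        return argMax(rowSums(matrix));
    }

    public static void main(String[] args) {
        // Invented example: 3 classes, 3 words of the unknown file.
        double[][] matrix = {
                {0.10, 0.00, 0.05},
                {0.30, 0.20, 0.00},
                {0.00, 0.10, 0.10}
        };
        System.out.println("Predicted class index: " + predictClass(matrix));
    }
}
```

Because only the cell values differ between the three algorithms, the same two helper methods could serve all of them in a comparison.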

Technology
1. Front end: Java 8
2. Back end: MySQL

Applications
1. Antivirus System
2. Disease Detection Systems.

Software Requirements
1. Eclipse or Netbeans for Java Development
2. Apache Tomcat 8

Expected Hardware Requirements
1. I3 Processor
2. 4GB RAM
3. 500 GB HDD

References
1. Bo Tang, Student Member, IEEE, Steven Kay, Fellow, IEEE, and Haibo He, Senior Member,
IEEE, “Toward Optimal Feature Selection in Naive Bayes for Text Categorization,”
Dependable and Secure Computing, submitted to IEEE Transactions on Knowledge and Data
Engineering, 09 February 2016.
2. Omkar Ardhapure, Gayatri Patil, Disha Udani, Kamlesh Jetha, “Comparative Study Of
Classification Algorithm For Text Based Categorization,” IJRET: International Journal of
Research in Engineering and Technology, Volume 05, Issue 02, Feb 2016.
3. Aaditya Jain, Jyoti Mandowara, “Text Classification by Combining Text Classifiers to
Improve the Efficiency of Classification,” International Journal of Computer Application
(2250-1797), Volume 6, No. 2, March-April 2016.
4. Adel Hamdan Mohammad, Omar Al-Momani and Tariq Alwada’n, “Arabic Text
Categorization using k-nearest neighbour, Decision Trees (C4.5) and Rocchio Classifier: A
Comparative Study,” International Journal of Current Engineering and Technology, E-ISSN
2277-4106, P-ISSN 2347-5161, Vol. 6, No. 2 (April 2016).
5. Kapila Rani, Satvika, “Text Categorization on Multiple Languages Based On Classification
Technique,” (IJCSIT) International Journal of Computer Science and Information
Technologies, Vol. 7, March 2016, pp. 1578–1581.

References
6. T. Joachims, “Text categorization with support vector machines: Learning with many relevant
features,” in ECML, 1998.
7. W. Lam, M. Ruiz, and P. Srinivasan, “Automatic text categorization and its application to text
retrieval,” IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 6, pp. 865–
879, 1999.
8. F. Sebastiani, “Machine learning in automated text categorization,” ACM computing surveys
(CSUR), vol. 34, no. 1, pp. 1–47, 2002.
9. H. Al-Mubaid, S. Umair et al., “A new text categorization technique using distributional
clustering and learning logic,” IEEE Transactions on Knowledge and Data Engineering, vol.
18, no. 9, pp. 1156–1165, 2006.
10. Y. Aphinyanaphongs, L. D. Fu, Z. Li, E. R. Peskin, E. Efstathiadis, C. F. Aliferis, and A.
Statnikov, “A comprehensive empirical comparison of modern supervised classification and
feature selection methods for text categorization,” Journal of the Association for Information
Science and Technology, vol. 65, no. 10, pp. 1964–1987, 2014.
