Contents
• Introduction
• Problem Definition
• Literature Review
• Objective
• Methodology & Algorithmic Strategy
• Proposed System
• System Architecture
• Modules
• Technology
• Application
• References
Introduction
• Data Mining Domain.
• What is Text categorization?
• What is Text Reduction?
• Advantages of Text categorization
• Automated Text Categorization
Introduction (Contd…)
Fig: Working of Text Classifier
Unknown Input Text + Dataset → Classifier → Output Class 1, Output Class 2, …, Output Class N
Problem Definition
• Text categorization is the process of classifying unknown text into predefined classes.
• Until now, classification has mostly been done manually, and existing automated
classification is not efficient enough.
• To improve the efficiency of automated text categorization, we propose a modified
Naïve Bayes algorithm that overcomes the disadvantages of the existing system.
Objectives
• Preprocess the 20 Newsgroups dataset.
• Perform text reduction.
• Generate frequencies based on the modified algorithm.
• Classify unknown input using the proposed algorithm.
• Compare the existing and proposed systems.
Literature Review
1. Toward Optimal Feature Selection in Naive Bayes for Text Categorization
Authors: Bo Tang (Student Member, IEEE), Steven Kay (Fellow, IEEE) and Haibo He (Senior Member, IEEE)
Published: IEEE, Feb 2016
Description: Automated feature selection is important for text categorization because it
reduces the feature size and speeds up the learning process of classifiers. The paper
presents a novel and efficient feature selection framework based on information theory,
which ranks features by their discriminative capacity for classification.
Disadvantages: Accuracy based on word count is very low. It can be improved by using
another classification algorithm such as Naïve Bayes.
2. Comparative Study Of Classification Algorithm For Text Based Categorization
Authors: Omkar Ardhapure, Gayatri Patil, Disha Udani, Kamlesh Jetha
Published: IJRET, Feb 2016
Description: Text categorization is a data mining process that assigns predefined
categories to free-text documents using machine learning techniques. Any document in the
form of text, image, music, etc. can be classified using some categorization technique.
Disadvantages: Accuracy is not evaluated on a large amount of data. In data mining, more
data gives more reliable results.
3. Text Classification by Combining Text Classifiers to Improve the Efficiency of Classification
Authors: Aaditya Jain, Jyoti Mandowara
Published: IJCA, April 2016
Description: The basic working of a web crawler is presented in this paper.
Disadvantages: Nothing is said about how pages can be ranked by an algorithm. Only the
working of text classification is described.
4. Arabic Text Categorization using k-nearest neighbour, Decision Trees (C4.5) and Rocchio Classifier: A Comparative Study
Authors: Adel Hamdan Mohammad, Omar Al-Momani and Tariq Alwada’n
Published: IJCET, April 2016
Description: The authors note that much research addresses text classification in
English, while few researchers work on text classification with Arabic data sets. The
study applies three well-known classification algorithms: k-nearest neighbour (k-NN),
C4.5 and Rocchio.
Disadvantages: Accuracy is not evaluated on a large amount of data. In data mining, more
data gives more reliable results.
5. Text Categorization on Multiple Languages Based On Classification Technique
Authors: Kapila Rani, Satvika
Published: IJCSIT, March 2016
Description: The authors note that the amount of textual data available in electronic
form in various Indian regional languages is constantly increasing. Hence, classifying
text documents by language is essential.
Disadvantages: Not very informative regarding search engine optimization.
Methodology & Algorithmic Strategy
Methodology
Pre-processing Dataset (Apply Reduction)
The first step is to apply a text reduction technique to the 20 Newsgroups dataset.
Dataset → Preprocessor Method → Reduction Procedure → Reduced Dataset
Generating Frequencies
The following frequencies are generated from the dataset:
• Wordcount = number of times a word occurs in a file.
• Term Frequency (TF) = occurrences of the word in a file / total number of words in that file.
• Inverse Document Frequency (IDF) = total number of documents / number of documents in
which the term occurs.
• Normalized Term Frequency (NTF) = ∑ TF / number of occurrences.
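The frequency definitions above can be illustrated with a short Python sketch (Python is used here only for brevity; it is not the project's Java implementation). The function names and the two-document toy corpus are hypothetical. IDF is taken as the plain ratio given on the slide, without a logarithm. The NTF formula on the slide is ambiguous, so the sketch assumes it means a word's TF summed over the documents, divided by the number of documents in which the word occurs; this is an assumption, not a confirmed definition.

```python
from collections import Counter


def word_count(tokens):
    """Wordcount: number of times each word occurs in one file."""
    return Counter(tokens)


def term_frequency(tokens):
    """TF = occurrences of the word in the file / total words in the file."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def inverse_document_frequency(docs):
    """IDF = total documents / documents containing the word (plain ratio)."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word: len(docs) / df for word, df in doc_freq.items()}


def normalized_term_frequency(docs):
    """ASSUMPTION: NTF = sum of the word's TF over the documents,
    divided by the number of documents in which the word occurs."""
    tf_sum = Counter()
    doc_freq = Counter()
    for tokens in docs:
        for word, tf in term_frequency(tokens).items():
            tf_sum[word] += tf
            doc_freq[word] += 1
    return {word: tf_sum[word] / doc_freq[word] for word in tf_sum}


# Hypothetical toy corpus: two already-preprocessed documents.
docs = [
    ["space", "nasa", "space", "orbit"],
    ["hockey", "team", "nasa", "game"],
]
tf0 = term_frequency(docs[0])
idf = inverse_document_frequency(docs)
ntf = normalized_term_frequency(docs)
```

In this toy corpus "nasa" occurs in both documents, so its IDF is the lowest possible value of the plain ratio, while "space" occurs in only one document and gets a higher IDF.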
Classification
Unknown Input Text + Dataset → Classifier → Output Class 1, Output Class 2, …, Output Class N
Proposed Work
• The system will provide automatic text categorization using the Modified Naïve Bayes
algorithm.
• Text reduction and feature selection are performed during dataset preprocessing.
• Dataset for simulation: 20 Newsgroups.
• A comparison of the existing and proposed systems will be provided.
Proposed Architecture
Unknown News + News Dataset → Modified NB Algorithm → Output Class 1, Output Class 2, …, Output Class N
Modules
1. Dataset preprocessing
2. Text Categorization
3. Modified NB Implementation
4. Comparative Study
Module – Dataset Preprocessing
1. Read each file in turn.
2. Perform preprocessing:
a. Remove stop words.
b. Remove special symbols.
c. Remove unwanted spaces.
3. Calculate the word count, Term Frequency (TF), Normalized Term Frequency (NTF) and
Inverse Document Frequency (IDF).
4. Insert the data into the database as: word, docid, classid, wordcount, TF, IDF, NTF.
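The preprocessing steps above could look roughly like the following Python sketch (again for illustration only, not the project's Java code). The stop-word list, the function names `preprocess` and `to_rows`, and the sample inputs are hypothetical. The rows carry only the per-file values (wordcount and TF); IDF and NTF need the whole dataset and are left out here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def preprocess(text):
    """Lower-case the text, drop special symbols, extra spaces and stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # special symbols become spaces
    tokens = text.split()                      # split() also collapses repeated spaces
    return [token for token in tokens if token not in STOP_WORDS]


def to_rows(doc_id, class_id, text):
    """Build (word, docid, classid, wordcount, TF) tuples, one per distinct word."""
    tokens = preprocess(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [
        (word, doc_id, class_id, count, count / total)
        for word, count in sorted(counts.items())
    ]


tokens = preprocess("The  launch of the   Shuttle!! is to-day.")
rows = to_rows(1, "sci.space", "Space, space and ORBIT")
```

Each tuple maps onto the first columns of the database row listed in step 4, so it can be passed to a parameterized INSERT statement once the dataset-wide IDF and NTF values have been added.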
Module – Text Categorization
1. Perform text categorization on an unknown file.
a. Preprocess the input file: remove stop words, special symbols and unwanted spaces.
2. Generate the decision matrix, with one row per news class (News class 1, News class 2,
…, News class N) and one column per word (Word 1, Word 2, …, Word N).
Module – Text Categorization (Contd…)
1. Fill each cell (word, news class) of the decision matrix with the measure used by the
algorithm:
a. For the k-NN algorithm, use the wordcount of the word in the class.
b. For the Naïve Bayes algorithm, use TF * IDF of the word in the class.
c. For the Modified Naïve Bayes algorithm, use NTF * IDF of the word in the class.
2. Perform row-wise addition: sum the cells of each class row to obtain one frequency
value per class (Class 1, …, Class N), using wordcount for k-NN, TF * IDF for Naïve Bayes
and NTF * IDF for Modified Naïve Bayes.
3. Take the maximum of the resulting array of class frequencies and select its index as
the predicted class. The array holds summed wordcounts for k-NN, summed TF * IDF for
Naïve Bayes and summed NTF * IDF for Modified Naïve Bayes.
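The decision-matrix steps of this module can be sketched in Python as follows (illustrative only, not the project's Java code). The function name `classify`, the class weights and the input tokens are hypothetical. The sketch assumes each distinct word of the input forms one column; the slides do not say how repeated words are handled. The same routine covers all three variants, since only the cell weights differ (wordcount, TF * IDF or NTF * IDF).

```python
def classify(tokens, weights):
    """Predict a class for preprocessed input tokens.

    weights maps class name -> {word: weight}, where the weight is the
    wordcount, TF * IDF or NTF * IDF value, depending on the algorithm.
    Returns (predicted_class, scores).
    """
    words = sorted(set(tokens))  # ASSUMPTION: one column per distinct input word
    # Decision matrix: one row per class, one column per input word.
    # Words unseen in a class contribute 0.
    matrix = {
        class_name: [word_weights.get(word, 0.0) for word in words]
        for class_name, word_weights in weights.items()
    }
    # Row-wise addition gives one score per class.
    scores = {class_name: sum(row) for class_name, row in matrix.items()}
    # The class with the maximum score is the prediction (first one wins on ties).
    predicted = max(scores, key=scores.get)
    return predicted, scores


# Hypothetical per-class weights, e.g. NTF * IDF values read from the database.
weights = {
    "sci.space": {"space": 0.9, "orbit": 0.7, "nasa": 0.4},
    "rec.sport.hockey": {"hockey": 0.8, "team": 0.5, "nasa": 0.1},
}
predicted, scores = classify(["nasa", "orbit", "launch"], weights)
```

Here "launch" is unknown to both classes and adds nothing, so the prediction is decided by "nasa" and "orbit".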
Technology
1. Front end: Java 8
2. Back end: MySQL
Applications
1. Antivirus systems
2. Disease detection systems
Requirements
Software Requirements
1. Eclipse or NetBeans for Java development
2. Apache Tomcat 8
Expected Hardware Requirements
1. Intel Core i3 processor
2. 4 GB RAM
3. 500 GB HDD
References
1. Bo Tang, Steven Kay, and Haibo He, “Toward Optimal Feature Selection in Naive Bayes for
Text Categorization,” submitted to IEEE Transactions on Knowledge and Data Engineering,
9 February 2016.
2. Omkar Ardhapure, Gayatri Patil, Disha Udani, and Kamlesh Jetha, “Comparative Study Of
Classification Algorithm For Text Based Categorization,” IJRET: International Journal of
Research in Engineering and Technology, vol. 5, no. 2, Feb. 2016.
3. Aaditya Jain and Jyoti Mandowara, “Text Classification by Combining Text Classifiers to
Improve the Efficiency of Classification,” International Journal of Computer Application
(2250-1797), vol. 6, no. 2, March-April 2016.
4. Adel Hamdan Mohammad, Omar Al-Momani, and Tariq Alwada’n, “Arabic Text
Categorization using k-nearest neighbour, Decision Trees (C4.5) and Rocchio Classifier: A
Comparative Study,” International Journal of Current Engineering and Technology, E-ISSN
2277-4106, P-ISSN 2347-5161, vol. 6, no. 2, April 2016.
5. Kapila Rani and Satvika, “Text Categorization on Multiple Languages Based On Classification
Technique,” (IJCSIT) International Journal of Computer Science and Information
Technologies, vol. 7, pp. 1578–1581, March 2016.
6. T. Joachims, “Text categorization with support vector machines: Learning with many relevant
features,” in ECML, 1998.
7. W. Lam, M. Ruiz, and P. Srinivasan, “Automatic text categorization and its application to text
retrieval,” IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 6, pp. 865–
879, 1999.
8. F. Sebastiani, “Machine learning in automated text categorization,” ACM computing surveys
(CSUR), vol. 34, no. 1, pp. 1–47, 2002.
9. H. Al-Mubaid, S. Umair et al., “A new text categorization technique using distributional
clustering and learning logic,” IEEE Transactions on Knowledge and Data Engineering, vol.
18, no. 9, pp. 1156–1165, 2006.
10. Y. Aphinyanaphongs, L. D. Fu, Z. Li, E. R. Peskin, E. Efstathiadis, C. F. Aliferis, and A.
Statnikov, “A comprehensive empirical comparison of modern supervised classification and
feature selection methods for text categorization,” Journal of the Association for Information
Science and Technology, vol. 65, no. 10, pp. 1964–1987, 2014.
Thank You
