Document Classification Using Hierarchical Clustering Technique
Master of Technology
(Software Systems)
Submitted By
Ekta Jadon
Enrollment No. (0828CS13MT22)
Under the Supervision of
Prof. Roopesh Sharma
Patel College of Science and Technology, Indore 2017-18
TABLE OF CONTENTS
1. ABSTRACT
2. INTRODUCTION
3. LITERATURE SURVEY
4. PROBLEM DEFINITION
5. PROPOSED APPROACH AND ALGORITHM
6. EXPERIMENTAL RESULTS
7. CONCLUSION AND FUTURE SCOPE
8. REFERENCES
9. PUBLICATIONS
ABSTRACT
 In data mining, classification is a technique that splits data into several dependent and independent regions, where each region is referred to as a class.
 Documents can be classified in a flat (linear) or a hierarchical manner to improve the efficiency of the classification model.
 It has been found that the hierarchical classification technique is more effective than flat classification.
 It also performs better in the case of multi-label document classification.
INTRODUCTION
 Data mining is the process of sorting through large data
sets to identify patterns and establish relationships to
solve problems through data analysis.
 Standard data mining methods may be integrated with
information retrieval techniques.
AUTOMATED DOCUMENT CLASSIFICATION
Rule-Based Classification:
 This approach is very accurate for small document sets.
 It follows IF-THEN rules:
IF condition THEN conclusion
IF age = youth AND student = yes THEN buy_computer = yes
 A rule-based classifier can also be built by extracting IF-THEN rules from a decision tree (a minimal sketch of such a rule list follows).
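As an illustration only (the slides do not prescribe an implementation language), such a rule list can be expressed directly in Python; the second rule and the default class below are hypothetical additions for the sketch.

# Minimal sketch of a rule-based classifier for the buy_computer example.
# Each rule is a (condition, conclusion) pair; the first matching rule wins.
rules = [
    (lambda r: r["age"] == "youth" and r["student"] == "yes", "yes"),
    (lambda r: r["age"] == "senior" and r["credit"] == "fair", "no"),   # hypothetical extra rule
]

def classify(record, default="no"):
    """Return the conclusion of the first rule whose condition holds."""
    for condition, conclusion in rules:
        if condition(record):
            return conclusion
    return default  # assumed default class when no rule fires

print(classify({"age": "youth", "student": "yes", "credit": "fair"}))   # -> yes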
Machine Learning Based Approach:
 The classifier learns from labeled training data, so classifiers are created automatically from this data, as sketched below.
 On one hand it shows high predictive performance; on the other, it requires a sufficient amount of labeled training data.
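Since the conclusion slide reports that a Naive Bayes classifier was used, a minimal machine-learning sketch is given below with scikit-learn's MultinomialNB over TF-IDF features; the library choice and the toy documents are assumptions, not part of the original work.

# Minimal sketch: a classifier is learned automatically from labeled documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs   = ["wheat prices rose sharply", "central bank cuts interest rate"]   # toy data
train_labels = ["grain", "money-fx"]                                              # toy labels

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)            # the classifier is built from the data itself
print(model.predict(["the bank raised the interest rate"]))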
LITERATURE SURVEY
Author: Armand Joulin et al. (2016)
Method: Proposes a simple baseline method for text classification (fastText).
Conclusion: “We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.”
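For context, the fastText tool described by Joulin et al. can be used roughly as sketched below, assuming the official fasttext Python package and a training file in its __label__ format; the file name and hyperparameters are placeholders.

# Sketch of supervised text classification with fastText (Joulin et al., 2016).
import fasttext

# train.txt is assumed to contain one labeled example per line, e.g.:
#   __label__grain wheat prices rose sharply
model = fasttext.train_supervised(input="train.txt", epoch=5, lr=0.5, wordNgrams=2)

labels, probs = model.predict("central bank cuts interest rate", k=3)   # top-3 predicted labels
print(labels, probs)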
PROBLEM
The problems addressed here are categorized according to the following criteria:
 Achieving high accuracy in document classification is a major problem.
 Classifying documents into multiple classes is difficult.
 Retrieving relevant documents is also an issue.
 Flat classification often fails to retrieve the relevant documents.
 The performance of the classification technique is also a concern in text mining.
PROPOSED APPROACH PROCESS
STANDARD DOCUMENT CLASSIFICATION SETUP
TRAINING AND TESTING DATASET
 The documents are collected from the classic Reuters-21578 collection for the purpose of evaluation.
 It is a collection of 21,578 newswire articles originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd.
 For evaluation, the ten largest classes in the Reuters-21578 collection were taken, and some classes were added to this collection to test the accuracy of the classifier.
 The whole collection of documents is divided into two parts:
 One part is the training set, used to develop the model that can then classify new documents of unknown class.
 The second part is the test set, a collection of new documents of unknown class used to evaluate the classification model (see the loading sketch below).
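A sketch of how this split might be loaded is shown below, assuming NLTK's copy of the Reuters-21578 corpus (the slides do not specify the tooling); NLTK's file identifiers already encode the standard training/test partition.

# Sketch: load Reuters-21578 and separate training and test documents by fileid prefix.
import nltk
nltk.download("reuters")                  # one-time download of the corpus
from nltk.corpus import reuters

train_ids = [f for f in reuters.fileids() if f.startswith("training/")]
test_ids  = [f for f in reuters.fileids() if f.startswith("test/")]

train_docs   = [reuters.raw(f) for f in train_ids]
train_labels = [reuters.categories(f) for f in train_ids]   # multi-label: a document may carry several topics
test_docs    = [reuters.raw(f) for f in test_ids]

print(len(train_ids), len(test_ids))      # roughly 7,769 training and 3,019 test documents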
COMPARISON OF FLAT CLASSIFICATION WITH
HIERARCHICAL CLASSIFICATION
Flat Classification: works at a single level.
Hierarchical Classification: follows the layout of a pyramid.
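To make the contrast concrete, the sketch below trains one flat classifier over all classes versus a two-level scheme that first predicts a cluster of classes and then a class within that cluster; the toy data, the cluster assignment, and the use of scikit-learn are assumptions for illustration, not the exact system of this work.

# Sketch: flat classification vs. a two-level hierarchy of classifiers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs   = ["wheat harvest up", "corn exports fall", "bank cuts rate", "dollar gains on yen"]
labels = ["wheat", "corn", "interest", "money-fx"]
cluster_of = {"wheat": "grain", "corn": "grain", "interest": "finance", "money-fx": "finance"}

def train(x, y):
    return make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(x, y)

flat = train(docs, labels)                               # flat: one classifier over all leaf classes
root = train(docs, [cluster_of[l] for l in labels])      # level 1: choose the cluster
leaf = {c: train([d for d, l in zip(docs, labels) if cluster_of[l] == c],
                 [l for l in labels if cluster_of[l] == c])
        for c in set(cluster_of.values())}               # level 2: one classifier per cluster

def predict_hierarchical(doc):
    c = root.predict([doc])[0]
    return leaf[c].predict([doc])[0]

print(flat.predict(["bank raises rate"])[0], predict_hierarchical("bank raises rate"))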
ALGORITHM FOR HIERARCHY BUILDING
function HIERARCHY(TrainSet, Labels, RootNode, Kmin, Kmax)
    Pmin ← PerformanceMeasure(TrainSet)
    for i ← Kmin to Kmax do
        C[i] ← Clustering(TrainSet, Labels, i)
        Dataset ← dataMetaLabeling(TrainSet, C)
        Results[i] ← PerformanceMeasure(Dataset)
    end for
    PerfEstimation, Kbest ← PerfEstimate(Results, C)
    if PerfEstimation > Pmin then
        addChildNodes(RootNode, C[Kbest])
        for i ← 0 to BestNumber do
            HIERARCHY(TrainSet, C[Kbest][i], RootNode.Child(i))
        end for
    end if
end function

function PERFORMANCEMEASURE(Dataset)
    TrainPart, TestPart ← split(Dataset)
    return Performance(TrainPart, TestPart)
end function

function PERFESTIMATE(Results, Clusters)
    ▷ returns the estimated performance and the best number of clusters Kbest
end function
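A compact Python interpretation of this procedure is sketched below. K-means over per-class centroids stands in for Clustering, a Naive Bayes hold-out accuracy stands in for PerformanceMeasure, and the simple "best of the loop" choice stands in for PerfEstimate; all three concrete choices are assumptions, since the slides leave them abstract.

# Sketch of hierarchy building: cluster the label set, keep a clustering only if
# classifying its meta-labels beats the flat baseline, then recurse into each cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def performance_measure(X, y):
    # Hold-out accuracy as a stand-in for PerformanceMeasure (F1 could be used instead).
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
    return MultinomialNB().fit(Xtr, ytr).score(Xte, yte)

def label_centroids(X, y):
    # One centroid per class label; these are the objects handed to the clustering step.
    classes = sorted(set(y))
    return classes, np.vstack([X[np.array(y) == c].mean(axis=0) for c in classes])

def build_hierarchy(X, y, node, k_min=2, k_max=5):
    p_min = performance_measure(X, y)                    # Pmin: flat baseline on this node
    classes, centroids = label_centroids(X, y)
    if len(classes) <= k_min:
        return                                           # too few classes left to split further
    best_k, best_perf, best_assign = None, -1.0, None
    for k in range(k_min, min(k_max, len(classes) - 1) + 1):
        assign = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(centroids)
        meta_y = [assign[classes.index(c)] for c in y]   # dataMetaLabeling: class -> cluster id
        perf = performance_measure(X, meta_y)
        if perf > best_perf:                             # PerfEstimate: keep the best k seen
            best_k, best_perf, best_assign = k, perf, assign
    if best_perf > p_min:                                # accept the split only if it helps
        for cluster_id in range(best_k):
            members = {c for c, a in zip(classes, best_assign) if a == cluster_id}
            child = {"classes": members}
            node.setdefault("children", []).append(child)   # addChildNodes
            mask = np.array([c in members for c in y])
            build_hierarchy(X[mask], [c for c, keep in zip(y, mask) if keep], child, k_min, k_max)

# Usage sketch: X is a dense non-negative document-term matrix, y the class labels.
#   root = {"classes": set(y)}; build_hierarchy(X, y, root)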
ESTIMATING PERFORMANCE FOR CLUSTERS
The quality of each candidate clustering is estimated using the F1-measure (F1*), Precision (P*), or Recall (R*).
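For reference, these measures can be computed as below; the use of scikit-learn and macro averaging is an assumption, since the slides do not state which averaging is applied.

# Sketch: precision, recall and F1 for the meta-label predictions of one clustering.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 2, 2]   # hypothetical cluster ids of the test documents
y_pred = [0, 1, 1, 1, 2, 0]   # hypothetical predictions

p  = precision_score(y_true, y_pred, average="macro")   # P*
r  = recall_score(y_true, y_pred, average="macro")      # R*
f1 = f1_score(y_true, y_pred, average="macro")          # F1*: harmonic mean of P and R per class, then averaged
print(p, r, f1)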
EXPERIMENTAL RESULTS
CONCLUSION
 The main aim of this work is to improve the efficiency and accuracy of the classifier. The Naive Bayes classifier used here performs well even with large datasets. Generating a hierarchy of the available training classes and then applying the classifier model can improve classification performance in most cases. It also increases the performance of the classifier for multi-label classification in the field of multi-class text classification.
FUTURE SCOPE
 Further research is needed to build a statistically significant and meaningful hierarchy. Efficient text classification also requires strong hierarchy information, which needs further investigation. Combining different classification approaches, rather than a single one, along with the hierarchical structure of classes also provides an avenue for future research.
REFERENCES
 P.-N. Tan, M. Steinbach, and V. Kumar, “Introduction to Data Mining”, Addison-Wesley, 2006, pp. 1-6.
 Xindong Wu, “Data Mining: An AI Perspective”, IEEE Conference, pp. 232-238.
 Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, CA, 2001, pp. 1-7.
 Thair Nu Phyu, “Survey of Classification Techniques in Data Mining”, Proceedings of the International MultiConference of Engineers and Computer Scientists 2009, Vol. I, IMECS 2009, March 18-20, 2009, Hong Kong, pp. 1-7.
 Tom M. Mitchell, “Machine Learning”, Carnegie Mellon University, McGraw-Hill Book Co., 1997, pp. 158-165.
 Shantanu Godbole, Abhay Harpale, Sunita Sarawagi, and Soumen Chakrabarti, “Document classification through interactive supervision of document and term labels”, in Proc. of ECML/PKDD, 2004, pp. 1-8.
 Alexandrin Popescul, Lyle H. Ungar, Steve Lawrence, and David M. Pennock, “Statistical relational learning for document mining”, in Proceedings of the IEEE International Conference on Data Mining (ICDM-2003), pp. 275-282.
PUBLICATIONS
 Ekta Jadon and Roopesh Sharma, “Data Mining: Document Classification using Naive Bayes Classifier”, International Journal of Computer Applications (0975-8887), Volume 167, No. 6, June 2017, pp. 1-4.