Document Classification Using Hierarchical Clustering Technique
1. Document Classification Using Hierarchical Clustering Technique
Master of Technology
(Software Systems)
Submitted By
Ekta Jadon
Enrollment No. (0828CS13MT22)
Under the Supervision of
Prof. Roopesh Sharma
Patel College of Science and Technology, Indore 2017-18
2. TABLE OF CONTENTS
1. ABSTRACT
2. INTRODUCTION
3. LITERATURE SURVEY
4. PROBLEM DEFINITION
5. PROPOSED APPROACH AND ALGORITHM
6. EXPERIMENTAL RESULTS
7. CONCLUSION AND FUTURE SCOPE
8. REFERENCES
9. PUBLICATION
3. ABSTRACT
In data mining, classification is a way of splitting the data
into several dependent and independent regions, where each
region is referred to as a class.
Classification models can be organized in a flat (linear) or
hierarchical manner to improve the efficiency of the
classification model.
It has been found that the hierarchical classification
technique is more effective than flat classification.
It also performs better in the case of multi-label document
classification.
4. INTRODUCTION
Data mining is the process of sorting through large data
sets to identify patterns and establish relationships to
solve problems through data analysis.
Standard data mining methods may be integrated with
information retrieval techniques.
5. AUTOMATED DOCUMENT CLASSIFICATION
Rule Based Classification:
This approach is very accurate for small document sets.
Rules follow the IF-THEN form:
IF condition THEN conclusion
IF age=youth AND student=yes THEN
buy_computer=yes
A rule-based classifier can also be built by extracting IF-THEN
rules from a decision tree.
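The IF-THEN rules above can be sketched as a small rule-based classifier. This is an illustrative sketch: the single rule and attribute names come from the buy_computer example on this slide, and the "unknown" default class is an assumption.

```python
# Minimal rule-based classifier sketch. Each rule is a
# (condition, conclusion) pair; a condition is a dict of
# attribute -> required value.
def rule_based_classify(record, rules, default="unknown"):
    """Return the conclusion of the first rule whose condition matches."""
    for condition, conclusion in rules:
        if all(record.get(attr) == value for attr, value in condition.items()):
            return conclusion
    return default

rules = [
    # IF age=youth AND student=yes THEN buy_computer=yes
    ({"age": "youth", "student": "yes"}, "yes"),
]

print(rule_based_classify({"age": "youth", "student": "yes"}, rules))  # yes
print(rule_based_classify({"age": "senior", "student": "no"}, rules))  # unknown
```

Because rules are checked in order, this also illustrates why rule-based approaches stay accurate only while the rule set remains small and hand-maintainable.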
Machine Learning Based Approach:
The classifier learns from labeled training data and hence
automatically creates classifiers based on this data.
On the one hand, it shows high predictive performance.
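As a hedged illustration of the machine-learning approach, the sketch below implements a small multinomial Naive Bayes classifier with Laplace smoothing (the classifier family the conclusion later reports on). The toy training documents and labels are invented for illustration, not taken from the thesis data.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (tokens, label) pairs. Returns log priors and
    Laplace-smoothed per-class log likelihoods."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_counts.values())
    priors, likelihoods = {}, {}
    for label, count in class_counts.items():
        priors[label] = math.log(count / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        likelihoods[label] = {
            w: math.log((word_counts[label][w] + 1) / denom) for w in vocab
        }
    return {"vocab": vocab, "priors": priors, "likelihoods": likelihoods}

def classify(model, tokens):
    """Pick the class with the highest posterior log probability."""
    def score(label):
        s = model["priors"][label]
        for w in tokens:
            if w in model["vocab"]:
                s += model["likelihoods"][label][w]
        return s
    return max(model["priors"], key=score)

# Toy labeled documents (invented for illustration).
docs = [
    ("wheat grain price".split(), "grain"),
    ("crude oil barrel".split(), "crude"),
    ("grain harvest wheat".split(), "grain"),
]
model = train_naive_bayes(docs)
print(classify(model, ["wheat", "price"]))  # grain
```

The key point the slide makes is visible here: nothing is hand-written per class; the classifier is created automatically from whatever labeled data it is given.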
6. LITERATURE SURVEY
Author: Armand Joulin et al. (2016)
Method: Proposed a simple baseline method for text classification (fastText).
Conclusion: fastText can be trained on more than one billion words in less
than ten minutes using a standard multicore CPU, and can classify half a
million sentences among 312K classes in less than a minute.
7. PROBLEM
The problems are categorized according to relevant criteria:
Achieving high accuracy in document classification is a big
problem.
Classifying documents into multiple classes is difficult.
Retrieving relevant documents is also one of the issues.
Flat classification does not retrieve relevant documents
effectively.
The performance of the classification technique is also a
problem in text mining.
9. TRAINING AND TESTING DATASET
The documents are collected from the classic Reuters-21578
collection for the purpose of evaluation.
It is a collection of 21,578 newswire articles originally collected
and labeled by Carnegie Group, Inc. and Reuters, Ltd.
For evaluation, the ten largest classes in the Reuters-21578
collection were taken, and some classes were added to this
collection to test the accuracy of the classifier.
The whole collection of documents is divided into two parts:
one is used as the training set for developing the model, which can
then classify new documents of unknown class;
the second is used as the test set, a collection of new documents
of unknown class used for testing the classification model.
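The train/test split described above can be sketched as follows. The labeled documents here are placeholders rather than actual Reuters-21578 articles, and the 70/30 split ratio and fixed seed are assumptions for illustration.

```python
import random

def split_dataset(labeled_docs, train_fraction=0.7, seed=42):
    """Shuffle and split labeled documents into training and test sets."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)  # fixed seed for a reproducible split
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

# Placeholder (document, class) pairs standing in for Reuters articles.
labeled_docs = [("doc_%d" % i, "earn" if i % 2 else "acq") for i in range(10)]
train_set, test_set = split_dataset(labeled_docs)
print(len(train_set), len(test_set))  # 7 3
```

Shuffling before splitting matters here: Reuters articles arrive grouped by topic and date, so an unshuffled split would give the test set a class distribution unlike the training set's.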
10. COMPARISON OF FLAT CLASSIFICATION WITH
HIERARCHICAL CLASSIFICATION
Flat Classification: works at a single level.
Hierarchical Classification: follows the layout of a pyramid.
11. ALGORITHM FOR HIERARCHY BUILDING
1: function HIERARCHY(TrainSet, Labels, RootNode, Kmin, Kmax)
2:   Pmin ← PerformanceMeasure(TrainSet)
3:   for I ← Kmin to Kmax do
4:     C[I] ← doClustering(TrainSet, Labels, I)
5:     Dataset ← dataMetaLabeling(TrainSet, C)
6:     Results[I] ← PerformanceMeasure(Dataset)
7:   end for
8:   PerfEstimation, Kbest ← PerfEstimate(Results, C)
9:   if PerfEstimation > Pmin then
10:    addChildNodes(RootNode, C[Kbest])
11:    for I ← 0 to BestNumber do
12:      HIERARCHY(TrainSet, C[Kbest][I], RootNode.Child(I))
13:    end for
14:   end if
15: end function
16: function PERFORMANCEMEASURE(Dataset)
17:   TrainPart, TestPart ← split(Dataset)
18:   return Performance(TrainPart, TestPart)
19: end function
20: function PERFESTIMATE(Results, Clusters)
21: end function
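A hedged Python sketch of the HIERARCHY function's control flow follows. The clustering, meta-labeling, performance, and estimation functions are injected as toy stand-ins (round-robin label splitting and a size-based "performance" score are assumptions, not the thesis implementation), so this illustrates the recursion and the "split only if it helps" check rather than the actual clustering or classifier used.

```python
class Node:
    """A node in the class hierarchy; leaves end up holding single labels."""
    def __init__(self, labels):
        self.labels = labels
        self.children = []

def build_hierarchy(train_set, labels, node, k_min, k_max,
                    clustering, meta_label, performance, estimate):
    if len(labels) <= 1:
        return node                        # a single label cannot be split
    p_min = performance(train_set)         # baseline performance (line 2)
    results, partitions = {}, {}
    for k in range(k_min, k_max + 1):      # lines 3-7: try each cluster count
        partitions[k] = clustering(train_set, labels, k)
        meta_set = meta_label(train_set, partitions[k])
        results[k] = performance(meta_set)
    perf_estimate, k_best = estimate(results)   # line 8
    if perf_estimate > p_min:              # lines 9-14: recurse into clusters
        for group in partitions[k_best]:
            if not group:
                continue                   # skip empty clusters
            child = Node(group)
            node.children.append(child)
            build_hierarchy(train_set, group, child, k_min, k_max,
                            clustering, meta_label, performance, estimate)
    return node

# Toy stand-ins (assumptions for illustration only):
labels = ["earn", "acq", "grain", "crude"]

def clustering(train_set, labels, k):
    return [labels[i::k] for i in range(k)]       # round-robin label split

def meta_label(train_set, clusters):
    return clusters                               # meta dataset = partition

def performance(dataset):
    groups = dataset if isinstance(dataset[0], list) else [dataset]
    return 1.0 / max(len(g) for g in groups)      # smaller groups score higher

def estimate(results):
    k_best = max(results, key=results.get)
    return results[k_best], k_best

root = build_hierarchy(labels, labels, Node(labels), 2, 3,
                       clustering, meta_label, performance, estimate)
print([child.labels for child in root.children])
# [['earn', 'grain'], ['acq', 'crude']]
```

Note how the comparison against Pmin guarantees the recursion only deepens the tree while the meta-level problem genuinely outperforms the flat baseline, which is the core idea of the algorithm above.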
14. CONCLUSION
The main aim of this work is to improve the efficiency
and accuracy of the classifier. The Naive Bayes classifier
we have used performs well even with large datasets.
Generating a hierarchy of the available training classes and
then applying the classifier model can improve classification
performance in most cases. It increases the performance
of the classifier even for multi-label classification in the
field of multi-class text classification.
15. FUTURE SCOPE
Further research is needed to build a statistically significant
and meaningful hierarchy. Efficient text classification also
requires strong hierarchy information, which needs further
investigation. Combining different classification approaches,
instead of a single one, along with a hierarchical structure
of classes also provides an avenue for future research.
16. REFERENCES
P.-N. Tan, M. Steinbach, and V. Kumar, "Introduction to Data Mining", Addison Wesley,
2006, pp. 1-6.
Xindong Wu, "Data Mining: An AI Perspective", IEEE Conference, pp. 232-238.
Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan
Kaufmann Publishers, CA, 2001, pp. 1-7.
Thair Nu Phyu, "Survey of Classification Techniques in Data Mining", Proceedings of
the International MultiConference of Engineers and Computer Scientists 2009, Vol. I,
IMECS 2009, March 18-20, 2009, Hong Kong, pp. 1-7.
Tom M. Mitchell, "Machine Learning", McGraw-Hill, 1997, pp. 158-165.
Shantanu Godbole, Abhay Harpale, Sunita Sarawagi, and Soumen Chakrabarti,
"Document Classification Through Interactive Supervision of Document and Term Labels",
in Proc. of ECML/PKDD, 2004, pp. 1-8.
Alexandrin Popescul, Lyle H. Ungar, Steve Lawrence, and David M. Pennock,
"Statistical Relational Learning for Document Mining", in Proceedings of the IEEE
International Conference on Data Mining (ICDM-2003), pp. 275-282.
17. PUBLICATIONS
Ekta Jadon, Roopesh Sharma, "Data Mining: Document
Classification using Naive Bayes Classifier", International
Journal of Computer Applications (0975-8887), Volume 167,
No. 6, June 2017, pp. 1-4.