Document Classification Using Hierarchies Clusters
Technique
Master of Technology
(Software system)
Submitted By
Ekta Jadon
Enrollment No. (0828CS13MT22)
Under the Supervision of
Prof. Roopesh Sharma
Patel College of Science and Technology, Indore 2017-18
TABLE OF CONTENTS
1. ABSTRACT.
2. INTRODUCTION.
3. LITERATURE SURVEY.
4. PROBLEM DEFINITION.
5. PROPOSED APPROACH ALGORITHM.
6. EXPERIMENTAL RESULTS.
7. CONCLUSION AND FUTURE SCOPE.
8. REFERENCES.
9. PUBLICATION.
ABSTRACT
• In data mining, classification splits the data into several dependent and independent regions, and each region is referred to as a class.
• Classification can be carried out in a flat (linear) or a hierarchical manner; both are considered here with the aim of improving the efficiency of the classification model.
• Hierarchical classification was found to be more effective than flat classification.
• It also performs better for multi-label document classification.
INTRODUCTION
• Data mining is the process of sorting through large data sets to identify patterns and establish relationships to solve problems through data analysis.
• Standard data mining methods may be integrated with information retrieval techniques.
AUTOMATED DOCUMENT CLASSIFICATION
Rule Based Classification:
This approach is very accurate for small document sets.
• It follows IF-THEN rules of the form:
IF condition THEN conclusion
Example: IF age=youth AND student=yes THEN buy_computer=yes
• A rule-based classifier can be built by extracting IF-THEN rules from a decision tree.
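As a minimal sketch of how such IF-THEN rules can be applied, the Python below encodes the example rule from this slide as a (conditions, conclusion) pair. The function name `classify`, the `RULES` list, and the `None` fallback for records that match no rule are illustrative assumptions, not part of the presented work.

```python
def classify(record, rules, default=None):
    """Return the conclusion of the first rule whose conditions all hold."""
    for conditions, conclusion in rules:
        # A rule fires only if every attribute in its IF part matches.
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return conclusion
    # No rule fired: fall back to the (assumed) default.
    return default


# IF age=youth AND student=yes THEN buy_computer=yes
RULES = [
    ({"age": "youth", "student": "yes"}, ("buy_computer", "yes")),
]

print(classify({"age": "youth", "student": "yes"}, RULES))
print(classify({"age": "senior", "student": "yes"}, RULES))
```

Rules are checked in order, so the first matching rule wins; a real rule set would need a policy for conflicts and for uncovered records.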
Machine Learning Based Approach:
• The system learns from data and automatically creates classifiers based on that data.
• It shows high predictive performance.
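To illustrate learning a classifier from data, here is a small sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing, written with the standard library only. Naive Bayes is the classifier named in the conclusion of this deck, but the code is not the author's implementation: the function names, whitespace tokenization, and the four toy documents are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs is a list of (text, label) pairs; returns the counts that form the model."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (hypothetical texts; "grain" and "crude" are used as class names).
toy_docs = [
    ("wheat corn harvest", "grain"),
    ("corn wheat export", "grain"),
    ("oil crude barrel", "crude"),
    ("crude oil price", "crude"),
]
model = train_naive_bayes(toy_docs)
print(predict(model, "wheat harvest price"))
```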
LITERATURE SURVEY
Author name | Method | Conclusion
Armand Joulin et al. (2016) | Propose a simple baseline method for text classification (fastText). | fastText can be trained on more than one billion words in less than ten minutes using a standard multicore CPU, and can classify half a million sentences among 312K classes in less than a minute.
PROBLEM
The problems are categorized according to relevant criteria:
• Achieving high accuracy in document classification is a major problem.
• Classifying documents into multiple classes is difficult.
• Retrieving relevant documents is also an issue.
• Flat classification did not retrieve the relevant documents.
• The performance of the classification technique is a further problem in text mining.
PROPOSED APPROACH PROCESS
STANDARD DOCUMENT CLASSIFICATION SETUP
TRAINING AND TESTING DATASET
• The documents are taken from the classic Reuters-21578 collection for the purpose of evaluation.
• It is a collection of 21,578 newswire articles originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd.
• For the evaluation task, the ten largest classes in the Reuters-21578 collection were taken, and some classes were added to this collection for testing the accuracy of the classifier.
• The whole collection of documents is divided into two parts:
  • The first is the training set, used to develop a model that can classify new documents of unknown class.
  • The second is the test set, a collection of new documents of unknown class used for testing the classification model.
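The division into a training set and a test set can be sketched as below. The slides do not state the split ratio or how documents were assigned, so the 70/30 proportion, the random shuffling, and the name `split_train_test` are assumptions for illustration only.

```python
import random


def split_train_test(documents, test_fraction=0.3, seed=0):
    """Shuffle a copy of the documents and split it into (train, test)."""
    shuffled = list(documents)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


docs = [f"doc{i}" for i in range(10)]  # placeholder document identifiers
train, test = split_train_test(docs, test_fraction=0.3, seed=42)
print(len(train), len(test))
```

The model is fitted on `train` only; `test` is held back so that accuracy is measured on documents the model has not seen.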
COMPARISON OF FLAT CLASSIFICATION WITH HIERARCHICAL CLASSIFICATION
Flat Classification | Hierarchical Classification
A flat classification works at a single level. | A hierarchical classification follows the layout of a pyramid.
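The difference between the two can be sketched with a toy keyword-overlap scorer: the flat classifier chooses among all leaf classes in one step, while the hierarchical classifier first chooses a parent class and then a leaf below it. The two-level hierarchy, the keyword sets, and all function names are hypothetical and stand in for trained classifiers.

```python
def score(text, keywords):
    """Count how many keywords occur in the text (a stand-in for a real classifier)."""
    return len(set(text.lower().split()) & keywords)


def flat_classify(text, leaf_keywords):
    """Single level: pick the best leaf class among all leaves at once."""
    return max(leaf_keywords, key=lambda label: score(text, leaf_keywords[label]))


def hierarchical_classify(text, hierarchy, leaf_keywords):
    """Two levels: pick the best parent first, then the best leaf under it."""
    def group_score(children):
        merged = set().union(*(leaf_keywords[c] for c in children))
        return score(text, merged)

    parent = max(hierarchy, key=lambda p: group_score(hierarchy[p]))
    leaf = max(hierarchy[parent], key=lambda c: score(text, leaf_keywords[c]))
    return parent, leaf


# Hypothetical hierarchy and keyword lists.
hierarchy = {
    "commodities": ["grain", "crude"],
    "finance": ["earn", "money-fx"],
}
leaf_keywords = {
    "grain": {"wheat", "corn", "harvest"},
    "crude": {"oil", "barrel", "opec"},
    "earn": {"profit", "dividend", "quarter"},
    "money-fx": {"dollar", "currency", "exchange"},
}

text = "opec raises oil output"
print(flat_classify(text, leaf_keywords))
print(hierarchical_classify(text, hierarchy, leaf_keywords))
```

In the hierarchical case each decision involves only a few candidate classes, which is the practical appeal of the pyramid layout when the number of leaf classes is large.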
ALGORITHM FOR HIERARCHY BUILDING
function HIERARCHY(TrainSet, Labels, RootNode, Kmin, Kmax)
    Pmin ← PerformanceMeasure(TrainSet)
    for I ← Kmin, Kmax do
        C[I] ← doClustering(TrainSet, Labels, I)
        Dataset ← dataMetaLabeling(TrainSet, C[I])
        Results[I] ← PerformanceMeasure(Dataset)
    end for
    PerfEstimation, Kbest ← PerfEstimate(Results, C)
    if PerfEstimation > Pmin then
        addChildNodes(RootNode, C[Kbest])
        for I ← 0, Kbest − 1 do
            HIERARCHY(TrainSet, C[Kbest][I], RootNode.Child(I), Kmin, Kmax)
        end for
    end if
end function

function PERFORMANCEMEASURE(Dataset)
    TrainPart, TestPart ← split(Dataset)
    return Performance(TrainPart, TestPart)
end function

function PERFESTIMATE(Results, Clusters)
end function
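The control flow of the pseudocode above can be sketched in Python as follows. Only the recursion is modelled: clustering, meta-labeling, performance measurement, and performance estimation are passed in as functions, and the `toy_*` versions below are hypothetical stand-ins chosen so the example runs, not the methods used in this work. The sketch also assumes `k_min` is at least 2 and adds a guard so that no more clusters are requested than there are labels.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: list
    children: list = field(default_factory=list)


def build_hierarchy(train_set, labels, node, k_min, k_max,
                    cluster, meta_label, measure, estimate):
    """Split `labels` into clusters while the estimate beats the baseline."""
    p_min = measure(train_set)  # baseline performance without a further split
    clusterings, results = {}, {}
    # Assumed guard: never request more clusters than there are labels.
    for k in range(k_min, min(k_max, len(labels)) + 1):
        clusterings[k] = cluster(train_set, labels, k)
        results[k] = measure(meta_label(train_set, clusterings[k]))
    if not results:
        return node  # too few labels to split any further
    perf, k_best = estimate(results, clusterings)
    if perf > p_min:
        for group in clusterings[k_best]:
            child = Node(labels=list(group))
            node.children.append(child)
            build_hierarchy(train_set, list(group), child, k_min, k_max,
                            cluster, meta_label, measure, estimate)
    return node


# Hypothetical stand-ins so that the sketch can be executed.
def toy_cluster(train_set, labels, k):
    return [labels[i::k] for i in range(k)]  # round-robin grouping of labels


def toy_meta_label(train_set, clustering):
    index = {label: i for i, group in enumerate(clustering) for label in group}
    return [(text, index[label]) for text, label in train_set if label in index]


def toy_measure(dataset):
    return 1.0 / len({label for _, label in dataset})  # fewer classes scores higher


def toy_estimate(results, clusterings):
    k_best = max(results, key=results.get)
    return results[k_best], k_best


train = [("t1", "a"), ("t2", "b"), ("t3", "c"), ("t4", "d")]
all_labels = ["a", "b", "c", "d"]
root = build_hierarchy(train, all_labels, Node(all_labels), 2, 3,
                       toy_cluster, toy_meta_label, toy_measure, toy_estimate)
print([child.labels for child in root.children])
```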
ESTIMATING PERFORMANCE FOR CLUSTER
Cluster performance is estimated using the F1-measure (F1*), Precision (P*) or Recall (R*).
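For reference, the standard per-class definitions of precision, recall, and F1 can be computed as in the sketch below. The slide does not define the starred variants (F1*, P*, R*), so this shows only the usual unstarred measures; the function name and the toy label lists are assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


y_true = ["grain", "grain", "crude", "crude", "grain"]
y_pred = ["grain", "crude", "crude", "grain", "grain"]
p, r, f1 = precision_recall_f1(y_true, y_pred, positive="grain")
print(p, r, f1)
```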
EXPERIMENTAL RESULTS
CONCLUSION
• The main aim of this work is to improve the efficiency and accuracy of the classifier. The Naive Bayes classifier we used performs well even with large datasets. Generating a hierarchy of the available training classes and then applying the classifier model can improve classification performance in most cases. It also improves the performance of the classifier for multi-label classification in the field of multi-class text classification.
FUTURE SCOPE
• Further research is needed to build a statistically significant and meaningful hierarchy. Efficient text classification requires strong hierarchy information, which needs further investigation. Combining different classification approaches, rather than using a single one, along with a hierarchical structure of classes also provides an avenue for future research.
PUBLICATIONS
• Ekta Jadon, Roopesh Sharma, “Data Mining: Document Classification using Naive Bayes Classifier”, International Journal of Computer Applications (0975 – 8887), Volume 167, No. 6, June 2017, pp. 1-4.