
Document clustering and classification

A lecture I gave as part of the seminar program run by the Department of Computer Science and Information Technology at the University College of Science and Technology in 2012.

Published in: Education

Document clustering and classification

  1. Seminar Program: Document Clustering and Classification. Prepared by: Mahmoud Rafeek Alfarra
  2. Outline
       • Classification and its techniques
       • Clustering and its techniques
       • Document clustering
       • Comparison
  3. Classification: Definition
       • Given a collection of records (the training set):
           – Each record contains a set of attributes; one of the attributes is the class.
       • Find a model for the class attribute as a function of the values of the other attributes.
       • Goal: previously unseen records should be assigned a class as accurately as possible.
           – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
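The hold-out procedure described on this slide can be sketched as follows. This is only an illustration under stated assumptions: the records are the ten labeled rows of the table on slide 4, the 70/30 split ratio and the helper names (`split_records`, `learn_majority_model`, `accuracy`) are hypothetical, and the "model" is a deliberately trivial majority-class baseline rather than any technique from the slides.

```python
import random
from collections import Counter

# Labeled records from the slide-4 table: ((Attrib1, Attrib2, Attrib3 in K), Class).
RECORDS = [
    (("Yes", "Large", 125), "No"),
    (("No", "Medium", 100), "No"),
    (("No", "Small", 70), "No"),
    (("Yes", "Medium", 120), "No"),
    (("No", "Large", 95), "Yes"),
    (("No", "Medium", 60), "No"),
    (("Yes", "Large", 220), "No"),
    (("No", "Small", 85), "Yes"),
    (("No", "Medium", 75), "No"),
    (("No", "Small", 90), "Yes"),
]


def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle the records and hold out a fraction of them as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def learn_majority_model(training_set):
    """Build a placeholder model that always predicts the most common training class."""
    counts = Counter(cls for _, cls in training_set)
    majority_class = counts.most_common(1)[0][0]
    return lambda attributes: majority_class


def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, cls in test_set if model(attributes) == cls)
    return correct / len(test_set)


training_set, test_set = split_records(RECORDS)
model = learn_majority_model(training_set)
acc = accuracy(model, test_set)
print(f"train={len(training_set)} test={len(test_set)} accuracy={acc:.2f}")
```

Any real learning algorithm (a decision tree, naive Bayes, and so on) would replace `learn_majority_model`; the split and the accuracy computation stay the same.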
  4. Classification: Definition
     Figure: Training Set → Learning algorithm → Learn Model (induction) → Model; Test Set → Apply Model (deduction).

     Training Set
     Tid  Attrib1  Attrib2  Attrib3  Class
     1    Yes      Large    125K     No
     2    No       Medium   100K     No
     3    No       Small    70K      No
     4    Yes      Medium   120K     No
     5    No       Large    95K      Yes
     6    No       Medium   60K      No
     7    Yes      Large    220K     No
     8    No       Small    85K      Yes
     9    No       Medium   75K      No
     10   No       Small    90K      Yes

     Test Set
     Tid  Attrib1  Attrib2  Attrib3  Class
     11   No       Small    55K      ?
     12   Yes      Medium   80K      ?
     13   Yes      Large    110K     ?
     14   No       Small    95K      ?
     15   No       Large    67K      ?
  5. Classification Techniques
       • Decision tree based methods
       • Rule-based methods
       • Memory-based reasoning
       • Neural networks
       • Naïve Bayes and Bayesian belief networks
       • Support vector machines
  6. Artificial Neural Networks (ANN)
     X1  X2  X3  Y
     1   0   0   0
     1   0   1   1
     1   1   0   1
     1   1   1   1
     0   0   1   0
     0   1   0   0
     0   1   1   1
     0   0   0   0
     Figure: inputs X1, X2, X3 feed a black box that produces output Y.
     Output Y is 1 if at least two of the three inputs are equal to 1.
  7. Artificial Neural Networks (ANN)
     Same truth table as slide 6.
     Figure: input nodes X1, X2, X3 connect to the output node Y, each through a link of weight 0.3; the output node has threshold t = 0.4.
     $$Y = I(0.3X_1 + 0.3X_2 + 0.3X_3 - 0.4 > 0), \quad \text{where } I(z) = \begin{cases} 1 & \text{if } z \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$
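As a quick check of this slide, the sketch below evaluates the output node with the weights (0.3 each) and threshold (0.4) given on the slide for all eight input combinations. The function and constant names are illustrative choices, not taken from the slides.

```python
from itertools import product

WEIGHTS = (0.3, 0.3, 0.3)  # link weights from the slide
THRESHOLD = 0.4            # threshold t from the slide


def indicator(condition):
    """I(z): 1 if the condition z is true, 0 otherwise."""
    return 1 if condition else 0


def perceptron(x1, x2, x3):
    """Output node: weighted sum of the inputs compared against the threshold."""
    weighted_sum = sum(w * x for w, x in zip(WEIGHTS, (x1, x2, x3)))
    return indicator(weighted_sum - THRESHOLD > 0)


# Print the full truth table: X1 X2 X3 -> Y.
for inputs in product((0, 1), repeat=3):
    print(*inputs, "->", perceptron(*inputs))
```

With these values a single active input gives 0.3 - 0.4 < 0, while two or three active inputs give a positive sum, so the node fires exactly when at least two inputs are 1.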
  8. Artificial Neural Networks (ANN)
       • The model is an assembly of interconnected nodes and weighted links.
       • The output node sums its input values, each weighted by the weight of its link.
       • The weighted sum is compared against a threshold t.
     Figure: input nodes X1, X2, X3 connect to the output node Y through links with weights w1, w2, w3; the output node has threshold t.
     Perceptron model:
     $$Y = I\Big(\sum_i w_i X_i - t > 0\Big) \quad \text{or} \quad Y = \operatorname{sign}\Big(\sum_i w_i X_i - t\Big)$$
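The slides give the perceptron model but not how its weights are obtained. As a hedged illustration, the sketch below uses the classic perceptron update rule (an assumption, not something stated in the slides) to learn integer weights and a threshold for the truth table on slide 6. The function names, the zero initialization, and the `max_epochs` limit are hypothetical choices.

```python
from itertools import product


def predict(weights, threshold, x):
    """Perceptron model: Y = I(sum_i w_i * X_i - t > 0)."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) - threshold > 0 else 0


def train_perceptron(samples, max_epochs=200):
    """Classic perceptron rule: after each mistake, nudge weights and threshold."""
    weights = [0, 0, 0]
    threshold = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, threshold, x)  # -1, 0 or +1
            if error != 0:
                weights = [w + error * xi for w, xi in zip(weights, x)]
                threshold -= error  # the threshold acts as a weight on a constant -1 input
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: training is done
            break
    return weights, threshold


# Target function from slide 6: Y is 1 if at least two of the three inputs are 1.
samples = [(x, 1 if sum(x) >= 2 else 0) for x in product((0, 1), repeat=3)]
weights, threshold = train_perceptron(samples)
print("weights:", weights, "threshold:", threshold)
```

The rule is guaranteed to stop making mistakes here because the target function is linearly separable; the learned values need not equal the 0.3 and 0.4 used on slide 7, since many weight settings realize the same function.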
  9. Clustering Definition
       • Clustering is the division of data into groups of similar objects.
       • Each group is called a cluster and consists of objects that are similar to one another and dissimilar to objects in other groups.
  10. Clustering Definition (figure: three clusters labeled C1, C2, and C3)
  11. Document clustering
       • Document clustering is the automatic grouping of text documents into clusters so that documents within a cluster are highly similar to one another but dissimilar to documents in other clusters.
  12. The challenge
       • The problem of document clustering is how to organize a large set of documents on various topics in a satisfactory way. It can be stated as follows:
       • Given: a huge set of documents on various topics (shared, related, or totally different).
       • Required: group the documents into a number of clusters such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized.
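To make the objective on this slide concrete, the sketch below computes the average intra-cluster and inter-cluster similarity for a given grouping. It rests on assumptions not made in the slides: documents are represented as raw term-frequency vectors, similarity is cosine similarity, and the four short documents and their cluster labels are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def average_similarities(vectors, assignment):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity)."""
    intra, inter = [], []
    for a, b in combinations(vectors, 2):
        sim = cosine(vectors[a], vectors[b])
        (intra if assignment[a] == assignment[b] else inter).append(sim)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)


# Hypothetical documents and a hypothetical assignment to two clusters.
docs = {
    "d1": "cat ate cheese",
    "d2": "mouse ate cheese too",
    "d3": "stock market fell",
    "d4": "stock market rose",
}
assignment = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}

vectors = {name: Counter(text.split()) for name, text in docs.items()}
intra_sim, inter_sim = average_similarities(vectors, assignment)
print(f"intra={intra_sim:.3f} inter={inter_sim:.3f}")
```

A good clustering in the sense of the slide is one where the first number is high and the second is low; swapping `d2` and `d3` in `assignment` would show the opposite.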
  13. The challenge (figure: three document clusters, with intra-cluster similarity marked inside a cluster and inter-cluster similarity marked between clusters)
       • Inter-cluster similarity < intra-cluster similarity
  14. Clustering Process
     Figure: document samples pass through four numbered stages, producing clusters and finally knowledge.
       1. Data Model Representation
           • Document cleaning
           • Feature selection or extraction
       2. Clustering Algorithm
           • Similarity measure
           • Criterion of clustering
       3. Cluster Validation
           • External indices
           • Internal indices
       4. Results Interpretation
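The first two stages of this process can be sketched end to end on a toy input. Everything specific here is an assumption made for illustration: the tiny stop-word list, raw term frequency as the features, cosine similarity as the measure, and a simple single-pass "leader" algorithm with a 0.3 threshold as the clustering criterion. Validation and interpretation (stages 3 and 4) are not covered.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "in", "too"}  # hypothetical, tiny stop list


def clean(text):
    """Stage 1, document cleaning: lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def to_vector(tokens):
    """Stage 1, feature extraction: raw term-frequency vector."""
    return Counter(tokens)


def cosine(a, b):
    """Stage 2, similarity measure: cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def leader_clustering(vectors, threshold=0.3):
    """Stage 2, clustering: a document joins the first cluster whose leader
    (its first member) is at least `threshold` similar, else starts a new cluster."""
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical document samples.
documents = [
    "The cat ate the cheese.",
    "A mouse ate the cheese too!",
    "Stock markets fell in Asia.",
    "Stock markets fell again.",
]
vectors = [to_vector(clean(doc)) for doc in documents]
clusters = leader_clustering(vectors)
print(clusters)
```

The leader algorithm is order dependent and was chosen only because it is short; any of the techniques listed on the following slides could take its place in stage 2.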
  15. Clustering Techniques
       • Clustering methods can be viewed from different perspectives. The ones most widely applied to the text domain are:
           – Hierarchical clustering
           – Partitioning clustering
           – Neural network based clustering
  16. Clustering Techniques
       • Suffix Tree Clustering algorithm
     Example documents:
     D1: cat ate cheese
     D2: mouse ate cheese too
     D3: cat ate mouse too
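The idea behind this slide's example is that phrases shared by several documents suggest candidate clusters. The sketch below finds those shared phrases for D1 to D3. It is not a suffix tree: as a simplifying assumption it enumerates every word-level phrase into a dictionary, which is quadratic in document length, and it stops at the candidate ("base") clusters without merging them. The function name and the `min_docs` parameter are hypothetical.

```python
from collections import defaultdict

# The three example documents from the slide.
documents = {
    "D1": "cat ate cheese",
    "D2": "mouse ate cheese too",
    "D3": "cat ate mouse too",
}


def base_clusters(docs, min_docs=2):
    """Map each phrase that occurs in at least `min_docs` documents to those documents."""
    phrase_to_docs = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.split()
        for start in range(len(words)):                    # every suffix of the document
            for end in range(start + 1, len(words) + 1):   # every prefix of that suffix
                phrase_to_docs[" ".join(words[start:end])].add(doc_id)
    return {p: ids for p, ids in phrase_to_docs.items() if len(ids) >= min_docs}


shared = base_clusters(documents)
for phrase in sorted(shared):
    print(f"{phrase!r}: {sorted(shared[phrase])}")
```

For example, the phrase "ate" is shared by all three documents, while "cat ate" links D1 and D3 and "ate cheese" links D1 and D2.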
  17. Clustering Techniques
       • Document Index Graph (DIG) for clustering
  18. Clustering Techniques
       • Graph-based growing hierarchical SOM
  19. Comparison
  20. Thanks
