Prepared by: Mahmoud Rafeek Alfarra
Seminar Program
Document Clustering and Classification
Outline
• Classification and its techniques
• Clustering and its techniques
• Document clustering
• Comparison
Classification: Definition
• Given a collection of records (the training set):
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.
Classification: Definition
[Diagram: Training Set → Learning algorithm → Learn Model (induction) → Model; Model → Apply Model (deduction) → Test Set]

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
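The induction and deduction steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten labeled records come from the training table, but the 7/3 split, the function names, and the majority-class baseline used as the "model" are hypothetical choices made for this sketch, not a technique named in the slides.

```python
from collections import Counter

# The ten labeled records from the training table:
# (Attrib1, Attrib2, Attrib3 in thousands, Class).
records = [
    ("Yes", "Large", 125, "No"),
    ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),
    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),
    ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"),
    ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),
    ("No", "Small", 90, "Yes"),
]


def learn_majority_model(training_set):
    """Induction: the 'model' is simply the most frequent class label."""
    labels = [row[-1] for row in training_set]
    return Counter(labels).most_common(1)[0][0]


def apply_model(model, attributes):
    """Deduction: this baseline ignores the attributes and predicts one class."""
    return model


def accuracy(model, test_set):
    """Fraction of held-out records whose class is predicted correctly."""
    correct = sum(apply_model(model, row[:-1]) == row[-1] for row in test_set)
    return correct / len(test_set)


# Assumed split: first seven records build the model, last three validate it.
train, test = records[:7], records[7:]
model = learn_majority_model(train)
held_out_accuracy = accuracy(model, test)
print(model, held_out_accuracy)
```

A real classifier from the list of techniques would replace `learn_majority_model` and `apply_model`; the split-then-validate procedure stays the same.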
Classification Techniques
 Decision Tree based Methods
 Rule-based Methods
 Memory based reasoning
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines
Artificial Neural Networks (ANN)
X1 X2 X3 Y
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1
0 0 1 0
0 1 0 0
0 1 1 1
0 0 0 0
[Figure: inputs X1, X2, X3 enter a black box that produces output Y]
Output Y is 1 if at least two of the three inputs are equal
to 1.
Artificial Neural Networks (ANN)
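[Figure: input nodes X1, X2, X3 are each linked with weight 0.3 to an output node Y with threshold t = 0.4]
$$Y = I(0.3 X_1 + 0.3 X_2 + 0.3 X_3 - 0.4 > 0)$$
$$\text{where } I(z) = \begin{cases} 1 & \text{if } z \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$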
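As a quick check of the weights 0.3 and the threshold 0.4, two rows of the truth table can be worked through by hand. The rows chosen here are arbitrary; the other six rows work the same way.

```latex
% Row (X_1, X_2, X_3) = (1, 0, 1): two inputs are 1, so Y should be 1.
0.3(1) + 0.3(0) + 0.3(1) - 0.4 = 0.2 > 0 \;\Rightarrow\; Y = 1

% Row (X_1, X_2, X_3) = (1, 0, 0): only one input is 1, so Y should be 0.
0.3(1) + 0.3(0) + 0.3(0) - 0.4 = -0.1 \not> 0 \;\Rightarrow\; Y = 0
```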
Artificial Neural Networks (ANN)
• The model is an assembly of interconnected nodes and weighted links.
• The output node sums its input values, each multiplied by the weight of its link.
• The weighted sum is compared against a threshold t.
[Figure: input nodes X1, X2, X3 are linked with weights w1, w2, w3 to an output node Y with threshold t]
Perceptron Model
$$Y = I\left(\sum_i w_i X_i - t > 0\right)$$
or
$$Y = \operatorname{sign}\left(\sum_i w_i X_i - t\right)$$
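The two output conventions of the perceptron model can be sketched as follows. The function names are hypothetical, and mapping a weighted sum of exactly zero to -1 in the sign variant is an assumption of this sketch. The weights 0.3 and threshold 0.4 are the ones from the earlier slide, and the truth table is the one shown for the black-box example.

```python
def weighted_sum(x, w, t):
    """Compute sum_i w_i * x_i - t."""
    return sum(wi * xi for wi, xi in zip(w, x)) - t


def perceptron_indicator(x, w, t):
    """Y = I(sum_i w_i * x_i - t > 0): output is 1 or 0."""
    return 1 if weighted_sum(x, w, t) > 0 else 0


def perceptron_sign(x, w, t):
    """Y = sign(sum_i w_i * x_i - t): output is +1 or -1.

    Assumption: a sum of exactly zero is mapped to -1.
    """
    return 1 if weighted_sum(x, w, t) > 0 else -1


# Truth table from the slides: ((X1, X2, X3), Y).
truth_table = [
    ((1, 0, 0), 0),
    ((1, 0, 1), 1),
    ((1, 1, 0), 1),
    ((1, 1, 1), 1),
    ((0, 0, 1), 0),
    ((0, 1, 0), 0),
    ((0, 1, 1), 1),
    ((0, 0, 0), 0),
]

weights, threshold = (0.3, 0.3, 0.3), 0.4
predictions = [perceptron_indicator(x, weights, threshold) for x, _ in truth_table]
print(predictions == [y for _, y in truth_table])
```

With these weights the indicator form reproduces every row of the table, which matches the rule that Y is 1 when at least two inputs are 1.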
Clustering Definition
• Clustering is the division of data into groups of similar objects.
• Each group is called a cluster and consists of objects that are similar to one another and dissimilar to objects in other groups.
Clustering Definition
[Figure: three clusters labeled C1, C2 and C3]
Document clustering
• Document clustering is the automatic grouping of text documents into clusters so that documents within a cluster are highly similar to one another but dissimilar to documents in other clusters.
The challenge
• The problem of document clustering is how to organize a large set of documents on various topics into a satisfactory arrangement. It can be stated as follows:
– Given: a huge set of documents on various topics (shared, related, or totally different).
– Required: group the documents into a number of clusters such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized.
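The intra-cluster and inter-cluster similarities can be made concrete with a small sketch. Everything here is an assumption for illustration: the slides do not name a similarity measure, so cosine similarity over word counts is swapped in, and the four short documents and their grouping are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(doc_a, doc_b):
    """Cosine similarity between the word-count vectors of two documents."""
    va, vb = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_similarities(clusters):
    """Return (average intra-cluster, average inter-cluster) similarity."""
    labeled = [(doc, k) for k, docs in enumerate(clusters) for doc in docs]
    intra, inter = [], []
    for (d1, k1), (d2, k2) in combinations(labeled, 2):
        (intra if k1 == k2 else inter).append(cosine(d1, d2))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical clustering of four toy documents into two clusters.
clusters = [
    ["cat ate cheese", "cat ate mouse"],
    ["stocks fell sharply", "stocks rose sharply"],
]
intra_sim, inter_sim = average_similarities(clusters)
print(intra_sim, inter_sim)
```

A good clustering in the sense of the requirement above is one where the first number is high and the second is low.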
The challenge
[Figure: three document clusters; similarity within a cluster is labeled Intra-Cluster Sim. and similarity between clusters is labeled Inter-Cluster Sim.]
Inter-Cluster Sim. < Intra-Cluster Sim.
The Clustering Process
[Diagram: document samples pass through four steps to yield clusters and, finally, knowledge]
1. Document data model representation
• Document cleaning
• Feature selection or extraction
2. Clustering algorithm
• Similarity measure
• Criterion of clustering
3. Cluster validation
• External indices
• Internal indices
4. Results interpretation
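The representation step (document cleaning followed by feature extraction) might look like the sketch below. The stop-word list, the function names, and the two sample sentences are hypothetical, and plain term counts are only one of several possible feature choices.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "too"}


def clean(text):
    """Document cleaning: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def to_vector(tokens, vocabulary):
    """Feature extraction: term counts in a fixed vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


docs = ["The cat ate the cheese.", "A mouse ate cheese, too!"]
cleaned = [clean(d) for d in docs]
vocabulary = sorted({t for tokens in cleaned for t in tokens})
vectors = [to_vector(tokens, vocabulary) for tokens in cleaned]
print(vocabulary)
print(vectors)
```

The resulting vectors are what a clustering algorithm and its similarity measure would operate on in the next step.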
Clustering Techniques
• Clustering methods can be viewed from different perspectives; those most widely applied to the text domain are:
– Hierarchical clustering
– Partitioning clustering
– Neural network based clustering
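As one possible instance of hierarchical clustering, the sketch below merges clusters bottom-up using single-link cosine similarity. The choice of single link, the stopping rule (a target number of clusters k), the function names, and the four term-count vectors are all assumptions for illustration; partitioning and neural network based methods are not shown.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_link(vectors, cluster_a, cluster_b):
    """Similarity of two clusters = similarity of their closest members."""
    return max(cosine(vectors[i], vectors[j]) for i in cluster_a for j in cluster_b)


def agglomerative_clustering(vectors, k):
    """Start with one cluster per document; merge the most similar pair until k remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_link(vectors, clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a].extend(clusters[b])  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical term-count vectors for four documents.
vectors = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
result = agglomerative_clustering(vectors, k=2)
print(result)
```

Recording the order of the merges instead of stopping at k would give the full hierarchy.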
Clustering Techniques
• Suffix Tree Clustering algorithm
D1: cat ate cheese
D2: mouse ate cheese too
D3: cat ate mouse too
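The idea behind the first stage of Suffix Tree Clustering, finding phrases shared by several documents, can be illustrated on D1 to D3. This sketch is not a suffix tree: it enumerates every contiguous word sequence by brute force, which is far less efficient, and the function name and the `min_docs` parameter are hypothetical. The later stage that merges overlapping groups is not shown.

```python
from collections import defaultdict

# The three example documents from the slide.
docs = {
    "D1": "cat ate cheese",
    "D2": "mouse ate cheese too",
    "D3": "cat ate mouse too",
}


def shared_phrases(docs, min_docs=2):
    """Map each contiguous word phrase to the documents containing it,
    keeping only phrases that occur in at least min_docs documents."""
    phrase_docs = defaultdict(set)
    for name, text in docs.items():
        words = text.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase_docs[" ".join(words[start:end])].add(name)
    return {p: sorted(d) for p, d in phrase_docs.items() if len(d) >= min_docs}


base_clusters = shared_phrases(docs)
for phrase, members in sorted(base_clusters.items()):
    print(phrase, members)
```

Each shared phrase defines a candidate group of documents; for example, "ate" occurs in all three documents while "ate cheese" occurs only in D1 and D2.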
Clustering Techniques
• Document Index Graph (DIG) for clustering
Clustering Techniques
• Graph-based growing hierarchical SOM (self-organizing map)
Comparison
Thanks
