Document clustering and classification

Prepared by: Mahmoud Rafeek Alfarra
Seminar Program

Outline
 Classification and its techniques
 Clustering and its techniques
 Document clustering
 Comparison

Classification: Definition
 Given a collection of records (the training set):
– Each record contains a set of attributes; one of the attributes is the class.
 Find a model for the class attribute as a function of the values of the other attributes.
 Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
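The split-then-validate idea can be sketched in a few lines of Python. This is only an illustration: the ten records are toy data (one numeric attribute in thousands plus a class), the 7/3 split is arbitrary, and the one-nearest-neighbour rule in `predict_1nn` is a stand-in model chosen for brevity, not a technique the slides prescribe.

```python
# Toy labelled records: (Attrib3 in thousands, class). Illustrative data only.
records = [
    (125, "No"), (100, "No"), (70, "No"), (120, "No"), (95, "Yes"),
    (60, "No"), (220, "No"), (85, "Yes"), (75, "No"), (90, "Yes"),
]

def split(data, n_train):
    """Divide the data set into a training part and a test part."""
    return data[:n_train], data[n_train:]

def predict_1nn(train, value):
    """Stand-in model (an assumption): copy the class of the closest training value."""
    nearest = min(train, key=lambda rec: abs(rec[0] - value))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    hits = sum(predict_1nn(train, value) == label for value, label in test)
    return hits / len(test)

train_set, test_set = split(records, 7)
test_accuracy = accuracy(train_set, test_set)
print(f"accuracy on {len(test_set)} held-out records: {test_accuracy:.2f}")
```

The test records are never shown to the model while it is built, which is what makes the measured accuracy an estimate for previously unseen records.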

Classification: Definition
Figure: a learning algorithm learns a model from the training set (induction); the model is then applied to the test set (deduction).

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Classification Techniques
 Decision Tree based Methods
 Rule-based Methods
 Memory based reasoning
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines

Artificial Neural Networks (ANN)
X1 X2 X3 Y
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1
0 0 1 0
0 1 0 0
0 1 1 1
0 0 0 0
Figure: a black box with inputs X1, X2, X3 and output Y.
Output Y is 1 if at least two of the three inputs are equal to 1.

Figure: input nodes X1, X2, X3 are each linked with weight 0.3 to a single output node Y with threshold t = 0.4.

$$Y = I(0.3X_1 + 0.3X_2 + 0.3X_3 - 0.4 > 0), \quad \text{where } I(z) = \begin{cases} 1 & \text{if } z \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$
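As a quick check, the threshold unit with weights 0.3 and t = 0.4 can be evaluated on all eight input combinations and compared with the "at least two inputs are 1" rule. The function names `unit` and `majority` are illustrative, not from the slides.

```python
from itertools import product

def indicator(condition):
    """I(z): 1 if z is true, 0 otherwise."""
    return 1 if condition else 0

def unit(x1, x2, x3):
    """Threshold unit with the slide's weights (0.3 each) and threshold 0.4."""
    return indicator(0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4 > 0)

def majority(x1, x2, x3):
    """Target rule: output 1 when at least two of the three inputs are 1."""
    return indicator(x1 + x2 + x3 >= 2)

rows = list(product([0, 1], repeat=3))
outputs = {row: unit(*row) for row in rows}
matches = all(unit(*row) == majority(*row) for row in rows)

for row in rows:
    print(row, "->", outputs[row])
print("unit reproduces the truth table:", matches)
```

One active input gives 0.3 - 0.4 < 0 and two give 0.6 - 0.4 > 0, which is why a single threshold is enough for this function.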

 The model is an assembly of interconnected nodes and weighted links.
 The output node sums its input values, each weighted by its link.
 The sum is compared against a threshold t.

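Perceptron Model
Figure: input nodes X1, X2, X3 are linked with weights w1, w2, w3 to a single output node Y with threshold t.

$$Y = I\left(\sum_i w_i X_i - t > 0\right) \quad \text{or} \quad Y = \operatorname{sign}\left(\sum_i w_i X_i - t\right)$$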
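The two output conventions of the perceptron model differ only in how the same weighted sum is encoded. A minimal sketch follows, assuming that the sign variant maps a sum of exactly zero to -1 (the slides do not say how ties are handled) and using hypothetical function names.

```python
def net_input(weights, inputs, t):
    """Weighted sum of the inputs minus the threshold t."""
    return sum(w * x for w, x in zip(weights, inputs)) - t

def perceptron_indicator(weights, inputs, t):
    """Y = I(sum_i w_i X_i - t > 0), giving outputs in {0, 1}."""
    return 1 if net_input(weights, inputs, t) > 0 else 0

def perceptron_sign(weights, inputs, t):
    """Y = sign(sum_i w_i X_i - t), giving outputs in {-1, +1}.
    Assumption: a net input of exactly zero is mapped to -1."""
    return 1 if net_input(weights, inputs, t) > 0 else -1

weights, t = [0.3, 0.3, 0.3], 0.4
for inputs in [(1, 1, 0), (1, 0, 0)]:
    print(inputs,
          perceptron_indicator(weights, inputs, t),
          perceptron_sign(weights, inputs, t))
```

Which encoding to use is a matter of how the class labels are written: {0, 1} for the indicator form, {-1, +1} for the sign form.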

Clustering Definition
 Clustering is the division of data into groups of similar objects.
 Each group is called a cluster and consists of objects that are similar to one another and dissimilar to objects in other groups.

Document clustering
 Document clustering is the automatic grouping of text documents into clusters so that documents within a cluster are highly similar to one another but dissimilar to documents in other clusters.

The challenge
 The problem of document clustering is how to organize a large set of documents on various topics into a satisfactory organization. It can be stated as follows:
 Given: a huge set of documents on various topics (shared, related, or totally different).
 Required: group the documents into a number of clusters such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized.
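One way to make the requirement concrete is to average pairwise similarities within clusters and across clusters. The sketch below assumes cosine similarity over term-count vectors and uses made-up four-term vectors; the slides do not fix a particular similarity measure.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def intra_similarity(clusters):
    """Average similarity over all document pairs that share a cluster.
    Assumes at least one cluster holds two or more documents."""
    return mean(cosine(a, b) for c in clusters for a, b in combinations(c, 2))

def inter_similarity(clusters):
    """Average similarity over all document pairs from different clusters."""
    return mean(cosine(a, b)
                for c1, c2 in combinations(clusters, 2)
                for a in c1 for b in c2)

# Two hypothetical clusters of two documents each (illustrative vectors).
clusters = [
    [(3, 1, 0, 0), (2, 2, 0, 0)],
    [(0, 0, 1, 3), (0, 1, 2, 2)],
]
intra = intra_similarity(clusters)
inter = inter_similarity(clusters)
print(f"intra-cluster: {intra:.3f}  inter-cluster: {inter:.3f}")
```

A good grouping shows a large gap between the two averages; moving a document into the wrong cluster lowers the first and raises the second.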

The challenge
Figure: three document clusters, with similarity measured within a cluster (intra-cluster) and between clusters (inter-cluster).
Inter-Cluster Sim. < Intra-Cluster Sim.

Clustering Process
Input: document samples
1. Representation (document data model)
• Document cleaning
• Feature selection or extraction
2. Clustering algorithm
• Similarity measure
• Criterion of clustering
3. Cluster validation
• External indices
• Internal indices
4. Results interpretation
Output: clusters, then knowledge
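The representation step and the similarity measure can be sketched on three short toy documents. Everything specific here is an assumption made for illustration: the tiny stop list, raw term counts as features, and cosine as the similarity measure.

```python
import math
import re
from collections import Counter
from itertools import combinations

STOP_WORDS = {"the", "a", "too"}  # hypothetical stop list for this example

def clean(text):
    """Document cleaning: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_vector(tokens, vocabulary):
    """Feature extraction: one term count per vocabulary entry."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def cosine(a, b):
    """Similarity measure between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

docs = {
    "D1": "cat ate cheese",
    "D2": "mouse ate cheese too",
    "D3": "cat ate mouse too",
}
tokens = {name: clean(text) for name, text in docs.items()}
vocabulary = sorted({t for ts in tokens.values() for t in ts})
vectors = {name: to_vector(ts, vocabulary) for name, ts in tokens.items()}
similarity = {(a, b): cosine(vectors[a], vectors[b])
              for a, b in combinations(sorted(vectors), 2)}

print(vocabulary)
for pair, value in similarity.items():
    print(pair, round(value, 3))
```

A clustering algorithm would consume the `similarity` values next, and validation indices would then score the resulting clusters.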

Clustering Techniques
 Clustering methods can be viewed from different perspectives; those most widely applied in the text domain are:
 Hierarchical Clustering
 Partitioning Clustering
 Neural Network based Clustering
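To give a feel for the first family, here is a minimal sketch of bottom-up (agglomerative) hierarchical clustering. It assumes single-link merging with Euclidean distance on 2-D points purely for readability; real document clustering would operate on term vectors with a text-appropriate similarity measure.

```python
import math
from itertools import combinations

def single_link(c1, c2):
    """Distance between two clusters: the closest pair of their members."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def agglomerate(points, k):
    """Start with one cluster per point and merge the two closest
    clusters repeatedly until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
result = agglomerate(points, 2)
print(result)
```

A partitioning method such as k-means would instead fix k up front and reassign points to the nearest centre until the assignment stops changing.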

 Suffix Tree Clustering (STC) algorithm
Example documents:
D1: cat ate cheese
D2: mouse ate cheese too
D3: cat ate mouse too
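The idea behind suffix tree clustering is to group documents that share phrases. The sketch below finds the phrases shared by at least two of the example documents. Note the substitution: it enumerates every contiguous word sequence by brute force instead of building a suffix tree, so it illustrates the shared-phrase grouping only, not the data structure or its efficiency, and it can report more phrases than the nodes of an actual suffix tree would.

```python
from collections import defaultdict

docs = {
    "D1": "cat ate cheese",
    "D2": "mouse ate cheese too",
    "D3": "cat ate mouse too",
}

def phrases(text):
    """All contiguous word sequences of a document (brute force, no suffix tree)."""
    words = text.split()
    return {" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)}

def base_clusters(documents, min_docs=2):
    """Map each phrase to the documents containing it, keeping only
    phrases shared by at least min_docs documents."""
    index = defaultdict(set)
    for name, text in documents.items():
        for phrase in phrases(text):
            index[phrase].add(name)
    return {p: sorted(names) for p, names in index.items() if len(names) >= min_docs}

shared = base_clusters(docs)
for phrase in sorted(shared):
    print(f"{phrase!r}: {shared[phrase]}")
```

Each shared phrase defines a candidate group of documents; groups whose document sets overlap heavily would then be merged into final clusters.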

 Document Index Graph for clustering (DIG)

 Graph-based growing hierarchical SOM (self-organizing map)
