2. Model-Based Clustering Methods
Attempt to optimize the fit between the data and some mathematical model
Assumption: data are generated by a mixture of underlying probability distributions
Techniques
Expectation-Maximization
Conceptual Clustering
Neural Networks Approach
3. Expectation Maximization
Each cluster is represented mathematically by a parametric probability distribution
Component distribution
Data is a mixture of these distributions
Mixture density model
Problem: To estimate parameters of probability distributions
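Concretely, the mixture density model writes the overall data density as a weighted sum of the k component densities, with component parameters θj and mixing weights wj:

$$p(x) = \sum_{j=1}^{k} w_j \, p(x \mid \theta_j), \qquad \sum_{j=1}^{k} w_j = 1, \; w_j \ge 0$$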
4. Expectation Maximization
Iterative refinement algorithm used to find the parameter estimates
Extension of k-means
Assigns an object to a cluster according to a weight representing its probability of membership
Starts with an initial estimate of the parameters
Iteratively rescores the objects and refines the estimates
5. Expectation Maximization
Initial guess for the parameters: randomly select k objects to represent the cluster means or centers
Iteratively refine parameters / clusters
Expectation Step
Assign each object $x_i$ to cluster $C_k$ with probability
$P(x_i \in C_k) = P(C_k \mid x_i) = \dfrac{P(C_k)\, P(x_i \mid C_k)}{P(x_i)}$
where $P(x_i \mid C_k) = N(m_k, E_k(x_i))$ follows the normal distribution around mean $m_k$ with expectation $E_k$
Maximization Step
Re-estimate model parameters
Simple and easy to implement
Complexity is linear in the number of features, objects, and iterations
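A minimal sketch of these two steps for a one-dimensional Gaussian mixture; the function name, defaults, and example data are illustrative, not from the slides:

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50, seed=0):
    """Minimal EM for a 1-D Gaussian mixture (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    means = rng.choice(x, size=k, replace=False)   # k random objects as initial means
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # Expectation step: membership weight P(xi in Ck) ~ P(Ck) * N(xi; mk, vk)
        dens = (weights / np.sqrt(2 * np.pi * variances)
                * np.exp(-(x[:, None] - means) ** 2 / (2 * variances)))
        resp = dens / dens.sum(axis=1, keepdims=True)   # n x k membership weights
        # Maximization step: re-estimate parameters from the weighted objects
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances

# Example: recover two overlapping components
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1.5, 200)])
print(em_gmm_1d(data))
```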
6. Conceptual Clustering
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled objects
Finds a characteristic description for each concept (class)
COBWEB
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that concept
8. COBWEB
Classification tree
Each node stores a concept and its probabilistic description
(a summary of the objects classified under that node)
Description: conditional probabilities P(Ai = vij | Ck)
Sibling nodes at a given level form a partition
Category utility
The increase in the expected number of attribute values that can be correctly guessed given a partition
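Written out, the category utility of a partition $\{C_1, \dots, C_k\}$ is

$$CU = \frac{1}{k} \sum_{l=1}^{k} P(C_l) \left[ \sum_{i} \sum_{j} P(A_i = v_{ij} \mid C_l)^2 - \sum_{i} \sum_{j} P(A_i = v_{ij})^2 \right]$$

The bracketed difference is the expected number of attribute values guessable given membership in $C_l$, minus the number guessable with no class information.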
9. COBWEB
Category utility rewards:
Intra-class similarity P(Ai = vij | Ck)
A high value indicates that many class members share this attribute-value pair
Inter-class dissimilarity P(Ck | Ai = vij)
A high value indicates that few objects in other classes share this attribute-value pair
Placement of new objects
Descend the tree
Identify the best host
Temporarily place the object in each node and compute the category utility of the resulting partition
The placement that yields the highest category utility is chosen
COBWEB may also form a new node if the object does not fit into the existing tree
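A toy sketch of this placement search, assuming objects are flat attribute-value dictionaries over a common attribute set; category_utility and best_host are illustrative names, and real COBWEB also evaluates merging and splitting:

```python
from collections import Counter

def category_utility(partition):
    """CU of a partition (list of clusters, each a list of attr->value dicts).
    Implements CU = (1/k) * sum_l P(Cl) * [sum P(Ai=vij|Cl)^2 - sum P(Ai=vij)^2],
    assuming every object carries the same attribute set."""
    n = sum(len(c) for c in partition)
    objects = [o for c in partition for o in c]
    attrs = {a for o in objects for a in o}
    base = sum((cnt / n) ** 2
               for a in attrs
               for cnt in Counter(o.get(a) for o in objects).values())
    cu = 0.0
    for cluster in partition:
        within = sum((cnt / len(cluster)) ** 2
                     for a in attrs
                     for cnt in Counter(o.get(a) for o in cluster).values())
        cu += (len(cluster) / n) * (within - base)
    return cu / len(partition)

def best_host(clusters, obj):
    """Try the new object in every cluster and as a new singleton node;
    keep the placement with the highest category utility (one level of search)."""
    candidates = []
    for i in range(len(clusters)):
        trial = [c + [obj] if j == i else c for j, c in enumerate(clusters)]
        candidates.append((category_utility(trial), trial))
    candidates.append((category_utility(clusters + [[obj]]), clusters + [[obj]]))
    return max(candidates, key=lambda t: t[0])[1]
```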
10. COBWEB
COBWEB is sensitive to the order of the input records
Additional operations to compensate:
Merging and splitting
The two best hosts are considered for merging
The best host is considered for splitting
Limitations
The assumption that the attributes are independent of each other is often too strong, because correlations may exist
Not suitable for clustering large databases
CLASSIT: an extension of COBWEB for incremental clustering of continuous data
11. Neural Network Approach
Represent each cluster as an exemplar, acting as a “prototype” of the cluster
New objects are distributed to the cluster whose exemplar is the most similar, according to some distance measure
Self-Organizing Map (SOM)
Competitive learning
Involves a hierarchical architecture of several units (neurons)
Neurons compete in a “winner-takes-all” fashion for the object currently being presented
Organization of the units forms a feature map
Has been applied, e.g., to Web document clustering
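A minimal sketch of the competitive (“winner-takes-all”) update in a one-dimensional SOM; names and hyperparameters are illustrative:

```python
import numpy as np

def train_som(data, grid_size=10, iters=1000, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal 1-D self-organizing map (illustrative sketch).
    Each unit holds a weight vector acting as the exemplar/prototype of a cluster."""
    rng = np.random.default_rng(seed)
    weights = rng.random((grid_size, data.shape[1]))  # one prototype per unit
    for t in range(iters):
        x = data[rng.integers(len(data))]             # present one object
        # Competition: the unit nearest to x wins ("winner takes all")
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Decaying learning rate and neighborhood width
        lr = lr0 * (1 - t / iters)
        sigma = sigma0 * (1 - t / iters) + 1e-9
        # The winner and its grid neighbors move toward x,
        # which organizes the units into a feature map
        dist = np.abs(np.arange(grid_size) - winner)
        h = np.exp(-dist ** 2 / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights

# Example: map 2-D points onto a line of 10 units
data = np.random.default_rng(1).random((500, 2))
prototypes = train_som(data)
```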
13. Clustering High-Dimensional Data
As dimensionality increases
the growing number of irrelevant dimensions may produce noise and mask the real clusters
the data become increasingly sparse
distance measures become almost meaningless
Feature transformation methods
PCA, SVD: summarize the data by creating linear combinations of the attributes
but they do not remove any of the original attributes, and the transformed attributes can be hard to interpret
Feature selection methods
Find the most relevant subset of attributes with respect to the class labels
e.g., entropy analysis
Subspace clustering: searches for groups of clusters within different subspaces of the same data set
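A short sketch of feature transformation via PCA computed through SVD; the function name is illustrative. Note that every original attribute contributes to each new feature, which is why the transformed attributes are hard to interpret:

```python
import numpy as np

def pca_transform(X, n_components=2):
    """Feature transformation via PCA/SVD (illustrative sketch).
    Projects the data onto the top principal components, i.e., onto
    linear combinations of the original attributes."""
    Xc = X - X.mean(axis=0)                     # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T             # coordinates in the new basis

# Example: compress 50-D points to 2-D before clustering
X = np.random.default_rng(0).random((200, 50))
Z = pca_transform(X)
```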
14. CLIQUE: CLustering In QUest
Dimension-growth subspace clustering
Starts at 1-D and grows upward to higher dimensions
Partitions each dimension into a grid and determines whether a cell is dense
CLIQUE
Determines sparse and crowded (dense) units
Dense unit: the fraction of data points falling in it exceeds a threshold
Cluster: a maximal set of connected dense units
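A sketch of the 1-D starting step: partition each dimension into a grid and keep the units whose fraction of data points exceeds the density threshold (names and defaults are illustrative):

```python
import numpy as np
from collections import Counter

def dense_units_1d(X, bins=10, threshold=0.05):
    """CLIQUE's starting point: find dense 1-D units (illustrative sketch).
    A unit (dimension, cell) is dense if the fraction of points falling
    into it exceeds the density threshold."""
    n, d = X.shape
    dense = []
    for dim in range(d):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        span = (hi - lo) or 1.0                      # guard constant dimensions
        cells = np.minimum(((X[:, dim] - lo) / span * bins).astype(int), bins - 1)
        dense += [(dim, c) for c, cnt in Counter(cells).items() if cnt / n > threshold]
    return dense
```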
15. CLIQUE
First partitions the d-dimensional data space into non-overlapping units
Performed in 1-D first
Based on the Apriori property: if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space
This reduces the size of the search space
Determines the maximal dense regions and generates a minimal description of each cluster
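A sketch of the Apriori-style growth step, under the assumption that a unit is encoded as a frozenset of (dimension, cell) pairs; this is illustrative, not CLIQUE's exact data structure:

```python
from itertools import combinations

def grow_units(dense_k):
    """Join dense k-dimensional units into (k+1)-dimensional candidates,
    pruning any candidate with a non-dense k-dim projection (Apriori property).
    Candidates must still be checked for actual density against the data."""
    out = set()
    for a, b in combinations(dense_k, 2):
        u = a | b
        dims = [dim for dim, _ in u]
        # valid join: exactly one new dimension, no dimension used twice
        if len(u) == len(a) + 1 and len(dims) == len(set(dims)):
            # every k-dimensional projection of u must itself be dense
            if all(frozenset(p) in dense_k for p in combinations(u, len(a))):
                out.add(u)
    return out

# Usage with the 1-D dense units from the previous sketch:
#   k1 = {frozenset([u]) for u in dense_units_1d(X)}
#   k2 = grow_units(k1)   # dense 2-D candidates, and so on upward
```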
16. CLIQUE
Finds the subspaces of the highest dimensionality
Insensitive to the order of the input objects
Performance depends on the grid size and the density threshold
These are difficult to choose so that they work across all dimensions
Many lower-dimensional subspaces will have to be processed
An adaptive strategy can be used instead
17. PROCLUS – PROjected CLUStering
Dimension-reduction subspace clustering technique
Finds an initial approximation of the clusters in the full high-dimensional space
Avoids generating a large number of overlapping clusters of lower dimensionality
Finds the best set of medoids by a hill-climbing process (similar to CLARANS)
Uses the Manhattan segmental distance measure (see the sketch below)
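The Manhattan segmental distance restricts the Manhattan distance to the cluster's set of relevant dimensions and averages over their number; a direct sketch:

```python
def manhattan_segmental(x, y, dims):
    """Manhattan segmental distance used by PROCLUS: the Manhattan distance
    restricted to the relevant dimensions, divided by their count."""
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)

# Example: distance over the projected subspace {0, 3}
print(manhattan_segmental([1, 9, 4, 2], [3, 0, 0, 5], dims=[0, 3]))  # (2 + 3) / 2 = 2.5
```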
18. PROCLUS
Initialization phase
A greedy algorithm selects a set of initial medoids that are far apart
Iteration phase
Selects a random set of k medoids
Replaces bad medoids
For each medoid, a set of dimensions is chosen whose average distances are small
Refinement phase
Computes new dimensions for each medoid based on the clusters found, reassigns points to medoids, and removes outliers
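A sketch of the greedy initialization, simplified to choose medoids from all points rather than from a drawn sample as in the full PROCLUS algorithm:

```python
import numpy as np

def greedy_medoids(X, k, seed=0):
    """Greedily select k initial medoids that are far apart (illustrative
    sketch): repeatedly pick the point farthest from all medoids so far."""
    rng = np.random.default_rng(seed)
    medoids = [int(rng.integers(len(X)))]          # start from a random object
    dist = np.linalg.norm(X - X[medoids[0]], axis=1)
    while len(medoids) < k:
        nxt = int(np.argmax(dist))                 # farthest remaining point
        medoids.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return medoids
```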
19. Frequent Pattern based Clustering
Frequent patterns may also form clusters
Instead of growing clusters dimension by dimension, sets of frequent itemsets are determined
Two common techniques
Frequent term-based text clustering
Clustering by pattern similarity
20. Frequent-term based text clustering
Text documents are clustered based on the frequent terms they contain
Documents are represented by their terms, so the dimensionality is very high
Frequent term-based analysis
A well-selected subset of the set of all frequent term sets must be discovered
Fi: a frequent term set; cov(Fi): the set of documents covered by Fi (i.e., containing all terms of Fi)
Requirement: $\bigcup_{i=1}^{k} \mathrm{cov}(F_i) = D$, and the overlap between the Fi must be minimized
Description of clusters – their frequent term sets
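A simplified sketch in this spirit: greedily pick the frequent term set that covers the most still-uncovered documents, which keeps overlap low (a plain greedy cover, not the exact published frequent-term clustering algorithm):

```python
def greedy_term_clusters(docs, frequent_term_sets):
    """Select frequent term sets to cover the document set D (illustrative
    sketch). docs: list of sets of terms; frequent_term_sets: list of sets
    of terms. A cluster's description is its frequent term set."""
    cov = {i: {d for d, doc in enumerate(docs) if f <= doc}
           for i, f in enumerate(frequent_term_sets)}
    covered, clusters = set(), []
    while covered != set(range(len(docs))) and cov:
        # pick the term set whose cover adds the most new documents
        best = max(cov, key=lambda i: len(cov[i] - covered))
        if not cov[best] - covered:
            break                                  # no further progress possible
        clusters.append((frequent_term_sets[best], cov[best]))
        covered |= cov[best]
        del cov[best]
    return clusters
```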
21. Clustering by Pattern Similarity
pCluster: applied to DNA micro-array data analysis
In DNA micro-array analysis, the expression levels of two genes may rise and fall synchronously in response to a set of stimuli
Two objects are similar if they exhibit a coherent pattern on a subset of dimensions
22. pCluster
Shift pattern discovery
Euclidean distance is not suitable
Alternatives: derive new attributes, or bi-clustering based on the mean squared residue score
pCluster
Objects x, y; attributes a, b; d denotes an expression value
$\mathrm{pScore}\!\left(\begin{pmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{pmatrix}\right) = \left| (d_{xa} - d_{xb}) - (d_{ya} - d_{yb}) \right|$
A pair (O, T) forms a δ-pCluster if for any 2 × 2 matrix X in (O, T), pScore(X) ≤ δ
Each pair of objects and each pair of their attributes must satisfy the threshold
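A direct check of the definition, assuming the expression data is stored as a dict keyed by (object, attribute):

```python
from itertools import combinations

def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of a 2x2 submatrix: how far two objects deviate from a pure
    shift pattern on two attributes."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_delta_pcluster(data, objects, attrs, delta):
    """Check whether (objects, attrs) forms a delta-pCluster: every 2x2
    submatrix must have pScore <= delta (illustrative sketch)."""
    return all(
        p_score(data[x, a], data[x, b], data[y, a], data[y, b]) <= delta
        for x, y in combinations(objects, 2)
        for a, b in combinations(attrs, 2)
    )
```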