Overview of Clustering
Rong Jin
Outline
• K-means for clustering
• Expectation Maximization algorithm for clustering
• Spectral clustering (if time permits)
Clustering
• Find the underlying structure of the given data points
[Figure: scatter plot of data points with axes "age" and "$$$"]
Application (I): Search Result Clustering
Application (II): Navigation
Application (III): Google News
Application (IV): Visualization
Islands of music
(Pampalk et al., KDD '03)
Application (V): Image Compression
http://www.ece.neu.edu/groups/rpl/kmeans/
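How to Find a Good Clustering?
• Minimize the sum of squared distances within clusters
[Figure: data points grouped into clusters labeled C1 to C5]
$$\{C_j, m_{i,j}\}^* = \arg\min_{\{C_j\},\{m_{i,j}\}} \sum_{j=1}^{6} \sum_{i=1}^{n} m_{i,j}\, \|x_i - C_j\|^2$$
$$m_{i,j} = \begin{cases} 1 & x_i \in \text{the } j\text{-th cluster} \\ 0 & x_i \notin \text{the } j\text{-th cluster} \end{cases}$$
$$\sum_{j=1}^{6} m_{i,j} = 1 \quad \Rightarrow \quad \text{any } x_i \text{ belongs to a single cluster}$$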
How to Cluster Data Efficiently?
$$\{C_j, m_{i,j}\}^* = \arg\min_{\{C_j\},\{m_{i,j}\}} \sum_{j=1}^{6} \sum_{i=1}^{n} m_{i,j}\, \|x_i - C_j\|^2$$
Memberships $\{m_{i,j}\}$ and centers $\{C_j\}$ are correlated.
Given memberships $\{m_{i,j}\}$:
$$C_j = \frac{\sum_{i=1}^{n} m_{i,j}\, x_i}{\sum_{i=1}^{n} m_{i,j}}$$
Given centers $\{C_j\}$:
$$m_{i,j} = \begin{cases} 1 & j = \arg\min_k \|x_i - C_k\|^2 \\ 0 & \text{otherwise} \end{cases}$$
K-means for Clustering
• K-means
  - Start with a random guess of cluster centers
  - Determine the membership of each data point
  - Adjust the cluster centers
K-means
1. Ask the user how many clusters they'd like (e.g., k=5).
2. Randomly guess k cluster center locations.
3. Each data point finds out which center it is closest to. (Thus each center "owns" a set of data points.)
4. Each center finds the centroid of the points it owns.
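The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the lecture's own code: the function name `kmeans`, the `n_iter` cap, the stopping test, and the choice to initialize centers from randomly chosen data points are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate between assigning points to centers and re-centering."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: use k distinct data points as the random initial centers (an assumption).
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Step 3: squared distance from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 4: move each center to the centroid of the points it owns.
        new_centers = centers.copy()
        for j in range(k):
            owned = X[labels == j]
            if len(owned) > 0:  # an empty cluster keeps its old center
                new_centers[j] = owned.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # assignments can no longer change
        centers = new_centers
    return centers, labels
```

Repeating steps 3 and 4 never increases the within-cluster sum of squared distances, which is why the loop can stop as soon as the centers stop moving.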
Any Computational Problem?
Computational Complexity: O(N)
where N is the number of points?
Improved K-means
• Group points by region
  - KD tree
  - SR tree
• Key difference
  - Find the closest center for each rectangle
  - Assign all the points within a rectangle to one cluster
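One way to decide whether a whole rectangle can be assigned to a single center is a corner test: a candidate center owns the rectangle if, against every rival, it is still closer at the rectangle corner that lies furthest toward that rival. The slides do not say which test is used, so the sketch below is an assumption, and `box_owner` is a hypothetical helper name.

```python
import numpy as np

def box_owner(lo, hi, centers):
    """Return the index of the center closest to every point of the
    axis-aligned box [lo, hi], or None if no single center owns the box."""
    lo, hi, centers = (np.asarray(a, dtype=float) for a in (lo, hi, centers))
    mid = (lo + hi) / 2.0
    # An owner of the whole box must in particular be nearest to its midpoint.
    cand = int(((centers - mid) ** 2).sum(axis=1).argmin())
    for j, rival in enumerate(centers):
        if j == cand:
            continue
        # Corner of the box that lies furthest in the direction of the rival.
        corner = np.where(rival > centers[cand], hi, lo)
        d_cand = ((corner - centers[cand]) ** 2).sum()
        d_rival = ((corner - rival) ** 2).sum()
        if d_cand >= d_rival:
            return None  # the rival is at least as close here, so the box is split
    return cand
```

When the function returns an index, every point stored under that tree node can be assigned in one step; when it returns None, the node's children have to be examined instead.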
A Gaussian Mixture Model for Clustering
• Assume that data are generated from a mixture of Gaussian distributions
• For each Gaussian distribution
  - Center: $\mu_j$
  - Variance: $\sigma_j^2$ (ignore)
• For each data point
  - Determine membership $z_{ij}$: whether $x_i$ belongs to the $j$-th cluster
Learning a Gaussian Mixture
(with known covariance)
• Probability $p(x = x_i)$
$$\begin{aligned} p(x = x_i) &= \sum_{\mu_j} p(x = x_i, \mu = \mu_j) = \sum_{\mu_j} p(\mu = \mu_j)\, p(x = x_i \mid \mu = \mu_j) \\ &= \sum_{\mu_j} p(\mu = \mu_j)\, \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{\|x_i - \mu_j\|^2}{2\sigma^2}\right) \end{aligned}$$
• Log-likelihood of data
$$\sum_i \log p(x = x_i) = \sum_i \log\left[\sum_{\mu_j} p(\mu = \mu_j)\, \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{\|x_i - \mu_j\|^2}{2\sigma^2}\right)\right]$$
• Apply MLE to find optimal parameters $\{p(\mu = \mu_j), \mu_j\}_j$
Learning a Gaussian Mixture
(with known covariance)
E-Step
$$\begin{aligned} E[z_{ij}] &= p(\mu = \mu_j \mid x = x_i) = \frac{p(x = x_i \mid \mu = \mu_j)\, p(\mu = \mu_j)}{\sum_{n=1}^{k} p(x = x_i \mid \mu = \mu_n)\, p(\mu = \mu_n)} \\ &= \frac{e^{-\frac{1}{2\sigma^2}\|x_i - \mu_j\|^2}\, p(\mu = \mu_j)}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}\|x_i - \mu_n\|^2}\, p(\mu = \mu_n)} \end{aligned}$$
Learning a Gaussian Mixture
(with known covariance)
M-Step
$$\mu_j = \frac{1}{\sum_{i=1}^{m} E[z_{ij}]} \sum_{i=1}^{m} E[z_{ij}]\, x_i$$
$$p(\mu = \mu_j) = \frac{1}{m} \sum_{i=1}^{m} E[z_{ij}]$$
Gaussian Mixture Example: Start
After First Iteration
After 2nd Iteration
After 3rd Iteration
After 4th Iteration
After 5th Iteration
After 6th Iteration
After 20th Iteration
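The E-step and M-step for a Gaussian mixture with a known, shared variance can be sketched as follows. This is an illustrative sketch only: the function name `em_gmm_known_var`, the requirement that the caller passes initial centers in `mu_init`, the fixed iteration count, and the log-space normalization are assumptions, not part of the slides.

```python
import numpy as np

def em_gmm_known_var(X, mu_init, sigma2=1.0, n_iter=50):
    """EM for a mixture of spherical Gaussians with known variance sigma2."""
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu_init, dtype=float).copy()   # centers mu_j, shape (k, d)
    m, k = len(X), len(mu)
    prior = np.full(k, 1.0 / k)                    # p(mu = mu_j)
    for _ in range(n_iter):
        # E-step: E[z_ij] is proportional to exp(-||x_i - mu_j||^2 / (2 sigma^2)) * p(mu_j).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_r = -d2 / (2.0 * sigma2) + np.log(prior)
        log_r -= log_r.max(axis=1, keepdims=True)  # stabilize before exponentiating
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)          # rows now sum to 1
        # M-step: responsibility-weighted means and mixing proportions.
        nk = r.sum(axis=0)
        mu = (r.T @ X) / nk[:, None]
        prior = nk / m
    return mu, prior, r
```

Unlike K-means, every point contributes to every center, weighted by its expected membership; as the variance shrinks the memberships become nearly 0 or 1 and the updates approach the K-means updates.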
Mixture Model for Doc Clustering
• A set of language models $\Theta = \{\theta_1, \theta_2, \ldots, \theta_K\}$
$$\theta_i = \{p(w_1 \mid \theta_i), p(w_2 \mid \theta_i), \ldots, p(w_V \mid \theta_i)\}$$
• Probability $p(d = d_i)$
$$\begin{aligned} p(d = d_i) &= \sum_{\theta_j} p(d = d_i, \theta = \theta_j) \\ &= \sum_{\theta_j} p(\theta = \theta_j)\, p(d = d_i \mid \theta = \theta_j) \\ &= \sum_{\theta_j} p(\theta = \theta_j) \prod_{k=1}^{V} \left[p(w_k \mid \theta_j)\right]^{tf(w_k, d_i)} \end{aligned}$$
• Introduce hidden variable $z_{ij}$
  - $z_{ij}$: document $d_i$ is generated by the $j$-th language model $\theta_j$
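Learning a Mixture Model
E-Step
$$\begin{aligned} E[z_{ij}] &= p(\theta = \theta_j \mid d = d_i) = \frac{p(d = d_i \mid \theta = \theta_j)\, p(\theta = \theta_j)}{\sum_{n=1}^{K} p(d = d_i \mid \theta = \theta_n)\, p(\theta = \theta_n)} \\ &= \frac{\prod_{m=1}^{V} \left[p(w_m \mid \theta_j)\right]^{tf(w_m, d_i)}\, p(\theta = \theta_j)}{\sum_{n=1}^{K} \prod_{m=1}^{V} \left[p(w_m \mid \theta_n)\right]^{tf(w_m, d_i)}\, p(\theta = \theta_n)} \end{aligned}$$
K: number of language models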
Learning a Mixture Model
M-Step
$$p(\theta = \theta_j) = \frac{1}{N} \sum_{i=1}^{N} E[z_{ij}]$$
$$p(w_i \mid \theta_j) = \frac{\sum_{k=1}^{N} E[z_{kj}]\, tf(w_i, d_k)}{\sum_{k=1}^{N} E[z_{kj}]\, |d_k|}$$
N: number of documents
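The same E-step and M-step can be written for a term-frequency matrix. The sketch below is an assumption-laden illustration: the function name `em_doc_mixture`, the caller-supplied `theta_init`, the small `eps` added inside the logarithms, and the fixed iteration count are not from the slides.

```python
import numpy as np

def em_doc_mixture(tf, theta_init, n_iter=50, eps=1e-12):
    """EM for a mixture of unigram language models.

    tf:         (N, V) term-frequency matrix, tf[i, k] = tf(w_k, d_i)
    theta_init: (K, V) initial word distributions, rows p(w | theta_j)
    """
    tf = np.asarray(tf, dtype=float)
    theta = np.asarray(theta_init, dtype=float)
    N, K = tf.shape[0], theta.shape[0]
    prior = np.full(K, 1.0 / K)                       # p(theta = theta_j)
    for _ in range(n_iter):
        # E-step in log space:
        # log p(d_i | theta_j) = sum_k tf(w_k, d_i) * log p(w_k | theta_j)
        log_post = tf @ np.log(theta + eps).T + np.log(prior + eps)
        log_post -= log_post.max(axis=1, keepdims=True)
        z = np.exp(log_post)
        z /= z.sum(axis=1, keepdims=True)             # E[z_ij], shape (N, K)
        # M-step: mixing proportions and expected word counts per model.
        prior = z.sum(axis=0) / N
        counts = z.T @ tf                             # (K, V)
        theta = counts / counts.sum(axis=1, keepdims=True)
    return prior, theta, z
```

Working with log-probabilities avoids the underflow that the raw product over the vocabulary would cause for long documents. The row sum of `counts` equals the expected total length of the documents assigned to each model, which is the denominator of the word-probability update.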
Examples of Mixture Models
Other Mixture Models
• Probabilistic latent semantic indexing (PLSI)
• Latent Dirichlet Allocation (LDA)
Problems (I)
• Both k-means and mixture models need to compute centers of clusters and an explicit distance measurement
• Given a strange distance measurement, the center of clusters can be hard to compute
E.g., $\|x - x'\|_\infty = \max\left(|x_1 - x_1'|, |x_2 - x_2'|, \ldots, |x_n - x_n'|\right)$
[Figure: three points x, y, z with $\|x - y\|_\infty = \|x - z\|_\infty$]
Problems (II)
 Both k-means and mixture models look for compact
clustering structures
 In some cases, connected clustering structures are more desirable
Graph Partition
• MinCut: split the graph into two parts with the minimal number of cut edges
[Figure: example graph partition with CutSize = 2]
2-way Spectral Graph Partitioning
• Weight matrix W
  - $w_{i,j}$: the weight between two vertices i and j
• Membership vector q
$$q_i = \begin{cases} 1 & i \in \text{Cluster } A \\ -1 & i \in \text{Cluster } B \end{cases}$$
$$\mathbf{q}^* = \arg\min_{\mathbf{q} \in \{-1,1\}^n} \text{CutSize}$$
$$\text{CutSize} = J = \frac{1}{4} \sum_{i,j} (q_i - q_j)^2\, w_{i,j}$$
Solving the Optimization Problem
• Directly solving the above problem requires combinatorial search → exponential complexity
• How to reduce the computational complexity?
$$\mathbf{q}^* = \arg\min_{\mathbf{q} \in \{-1,1\}^n} \frac{1}{4} \sum_{i,j} (q_i - q_j)^2\, w_{i,j}$$
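To see why the direct approach does not scale, the objective can be minimized by enumerating every ±1 membership vector on a tiny graph. This is only a sketch: the helper names `cut_objective` and `brute_force_min_cut` are made up for the example, and requiring both sides to be non-empty is an added assumption (without it, putting every vertex on one side trivially gives J = 0).

```python
import itertools
import numpy as np

def cut_objective(q, W):
    """J(q) = 1/4 * sum_{i,j} (q_i - q_j)^2 * w_ij for q in {-1, +1}^n.

    With a symmetric W the double sum visits each cut edge twice,
    so J equals twice the total weight of the cut edges.
    """
    q = np.asarray(q, dtype=float)
    diff = q[:, None] - q[None, :]
    return 0.25 * float((diff ** 2 * W).sum())

def brute_force_min_cut(W):
    n = W.shape[0]
    best_q, best_j = None, np.inf
    # Fix q_0 = +1 to skip mirrored labelings; 2^(n-1) candidates remain.
    for rest in itertools.product([-1, 1], repeat=n - 1):
        q = np.array((1,) + rest)
        if np.all(q == 1):
            continue  # both sides must be non-empty
        j = cut_objective(q, W)
        if j < best_j:
            best_q, best_j = q, j
    return best_q, best_j
```

The loop body runs 2^(n-1) - 1 times, so even a graph with a few dozen vertices is out of reach, which motivates relaxing the ±1 constraint.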
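Relaxation Approach
• Key difficulty: $q_i$ has to be either –1 or 1
• Relax $q_i$ to be any real number
• Impose constraint $\sum_{i=1}^{n} q_i^2 = n$
$$\begin{aligned} J &= \frac{1}{4} \sum_{i,j} (q_i - q_j)^2\, w_{i,j} = \frac{1}{4} \sum_{i,j} \left(q_i^2 - 2 q_i q_j + q_j^2\right) w_{i,j} \\ &= \frac{1}{4} \sum_{i,j} 2 q_i^2\, w_{i,j} - \frac{1}{4} \sum_{i,j} 2 q_i q_j\, w_{i,j} \\ &= \frac{1}{2} \sum_{i} q_i^2\, d_i - \frac{1}{2} \sum_{i,j} q_i q_j\, w_{i,j} = \frac{1}{2} \sum_{i,j} q_i \left(d_i \delta_{i,j} - w_{i,j}\right) q_j \end{aligned}$$
where $d_i = \sum_j w_{i,j}$ and $D = [d_i \delta_{i,j}]$, so
$$J = \frac{1}{2}\, \mathbf{q}^T (D - W)\, \mathbf{q}$$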
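The identity between the pairwise form and the quadratic form can be checked numerically on a random symmetric weight matrix. This is a small sanity check, not part of the slides; the matrix size, the random seed, and the variable names are arbitrary choices. The constant factor 1/2 is carried along explicitly here, and it does not affect which q minimizes J.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 5))
W = (A + A.T) / 2.0            # symmetric weights w_ij
np.fill_diagonal(W, 0.0)       # no self-loops
q = rng.standard_normal(5)     # relaxed, real-valued membership vector

D = np.diag(W.sum(axis=1))     # D = diag(d_i), d_i = sum_j w_ij

# Pairwise form: 1/4 * sum_{i,j} (q_i - q_j)^2 w_ij
pairwise = 0.25 * ((q[:, None] - q[None, :]) ** 2 * W).sum()
# Quadratic form: 1/2 * q^T (D - W) q
quadratic = 0.5 * q @ (D - W) @ q

print(np.isclose(pairwise, quadratic))
```

The derivation relies on W being symmetric, which is why the example symmetrizes the random matrix first.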
Relaxation Approach
$$\mathbf{q}^* = \arg\min_{\mathbf{q}} J = \arg\min_{\mathbf{q}} \mathbf{q}^T (D - W)\, \mathbf{q} \quad \text{subject to } \sum_k q_k^2 = n$$
• Solution: the eigenvector of D − W with the second-smallest eigenvalue
$$(D - W)\, \mathbf{q} = \lambda_2\, \mathbf{q}$$
Graph Laplacian
$$L = D - W, \qquad W = [w_{i,j}], \qquad D = \left[\delta_{i,j} \sum_j w_{i,j}\right]$$
• L is a positive semi-definite matrix
  - For any x, we have $x^T L x \ge 0$. Why?
• Minimum eigenvalue $\lambda_1 = 0$ (what is the eigenvector?)
$$0 = \lambda_1 \le \lambda_2 \le \lambda_3 \le \ldots \le \lambda_k$$
• The eigenvector of the second-smallest eigenvalue $\lambda_2$ gives the best bipartition of the graph
Recovering Partitions
• Due to the relaxation, q can be any number (not just –1 and 1)
• How to construct a partition based on the eigenvector?
• Simple strategy: $A = \{i \mid q_i \ge 0\}$, $B = \{i \mid q_i < 0\}$
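Putting the relaxation and the sign-based recovery together gives a short procedure: build D − W, take the eigenvector of the second-smallest eigenvalue, and split the vertices by sign. The sketch below assumes a symmetric weight matrix and a connected graph; the function name `spectral_bipartition` is hypothetical, and vertices with q_i = 0 are placed in A as an arbitrary convention.

```python
import numpy as np

def spectral_bipartition(W):
    """Split a weighted graph in two using the second eigenvector of D - W."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))
    L = D - W                               # graph Laplacian
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(L)
    q = eigvecs[:, 1]                       # eigenvector of the second-smallest eigenvalue
    A = np.flatnonzero(q >= 0)
    B = np.flatnonzero(q < 0)
    return A, B, eigvals
```

The overall sign of an eigenvector is arbitrary, so which group ends up labeled A and which B can differ between runs or libraries; only the split itself is meaningful.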
Spectral Clustering
• Minimum cut does not balance the sizes of the two parts of the graph
Normalized Cut (Shi & Malik, 1997)
• Minimize the similarity between clusters and meanwhile maximize the similarity within clusters
$$s(A, B) = \sum_{i \in A} \sum_{j \in B} w_{i,j}, \qquad d_A = \sum_{i \in A} d_i, \qquad d_B = \sum_{i \in B} d_i$$
$$J = \frac{s(A, B)}{d_A} + \frac{s(A, B)}{d_B} = \frac{d_B + d_A}{d_A d_B} \sum_{i \in A} \sum_{j \in B} w_{i,j}$$
With $d = \sum_j d_j$:
$$J = \sum_{i \in A} \sum_{j \in B} w_{i,j} \left(\sqrt{\frac{d_B}{d_A d}} + \sqrt{\frac{d_A}{d_B d}}\right)^2$$
$$q_i = \begin{cases} \sqrt{d_B / (d_A d)} & \text{if } i \in A \\ -\sqrt{d_A / (d_B d)} & \text{if } i \in B \end{cases}$$
$$J \propto \sum_{i,j} w_{i,j} (q_i - q_j)^2$$
Normalized Cut
$$J \propto \sum_{i,j} w_{i,j} (q_i - q_j)^2 \propto \mathbf{q}^T (D - W)\, \mathbf{q}$$
$$q_i = \begin{cases} \sqrt{d_B / (d_A d)} & \text{if } i \in A \\ -\sqrt{d_A / (d_B d)} & \text{if } i \in B \end{cases}$$
• Relax q to real values under the constraints $\mathbf{q}^T D \mathbf{q} = 1$, $\mathbf{q}^T D \mathbf{e} = 0$ (e is the all-ones vector)
• Solution: $(D - W)\, \mathbf{q} = \lambda D\, \mathbf{q}$
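The generalized eigenproblem (D − W)q = λDq can be solved with plain NumPy by substituting q = D^(-1/2) v, which turns it into an ordinary symmetric eigenproblem. The sketch below assumes every vertex has positive degree; the function name `normalized_cut_vector` is hypothetical, and taking the second-smallest eigenvalue mirrors the unnormalized case rather than anything stated on the slide.

```python
import numpy as np

def normalized_cut_vector(W):
    """Return (q, lam) solving (D - W) q = lam * D q for the second-smallest lam."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                        # degrees d_i, assumed strictly positive
    L = np.diag(d) - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # With q = D^{-1/2} v the problem becomes
    # D^{-1/2} (D - W) D^{-1/2} v = lam * v, an ordinary symmetric eigenproblem.
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # ascending eigenvalues
    q = d_inv_sqrt * eigvecs[:, 1]
    return q, eigvals[1]
```

The returned vector satisfies q^T D e = 0 automatically, because the eigenvector for λ = 0 is v = D^(1/2) e and eigenvectors of a symmetric matrix are orthogonal. Thresholding q at zero then gives the two groups.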
Image Segmentation
Non-negative Matrix Factorization

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 

Pert 05 aplikasi clustering

  • 2. Outline: K-means for clustering; Expectation Maximization algorithm for clustering; Spectral clustering (if time permits)
  • 3. Clustering: find the underlying structure of the given data points (figure: scatter plot with axes "$$$" and "age")
  • 4. Application (I): Search Result Clustering
  • 7. Application (III): Visualization. Islands of music (Pampalk et al., KDD '03)
  • 8. Application (IV): Image Compression. http://www.ece.neu.edu/groups/rpl/kmeans/
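  • 9. How to Find a Good Clustering?
      - Minimize the sum of squared distances within clusters (figure: clusters C1, C2, C3, C4, C5)
      - Objective: $\arg\min_{\{C_j\},\{m_{i,j}\}} \sum_{j=1}^{6} \sum_{i=1}^{n} m_{i,j} \|x_i - C_j\|^2$
      - Membership: $m_{i,j} = 1$ if $x_i \in$ the j-th cluster, and $m_{i,j} = 0$ if $x_i \notin$ the j-th cluster
      - Constraint: $\sum_{j=1}^{6} m_{i,j} = 1$, so any $x_i$ belongs to a single cluster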
  • 10. How to Efficiently Cluster Data?
      - Objective: $\arg\min_{\{C_j\},\{m_{i,j}\}} \sum_{j=1}^{6} \sum_{i=1}^{n} m_{i,j} \|x_i - C_j\|^2$
      - Memberships $\{m_{i,j}\}$ and centers $\{C_j\}$ are correlated.
      - Given memberships $\{m_{i,j}\}$: $C_j = \frac{\sum_{i=1}^{n} m_{i,j} x_i}{\sum_{i=1}^{n} m_{i,j}}$
      - Given centers $\{C_j\}$: $m_{i,j} = 1$ if $j = \arg\min_k \|x_i - C_k\|^2$, and $m_{i,j} = 0$ otherwise
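The center update for fixed memberships can be checked by setting the gradient of the objective to zero. The following is a short derivation sketch under that assumption (memberships held fixed, squared Euclidean distance):

```latex
% Fix the memberships m_{i,j} and minimise over one center C_j.
\begin{aligned}
\frac{\partial}{\partial C_j} \sum_{i=1}^{n} m_{i,j} \|x_i - C_j\|^2
  &= -2 \sum_{i=1}^{n} m_{i,j} (x_i - C_j) = 0 \\
\Rightarrow \quad C_j &= \frac{\sum_{i=1}^{n} m_{i,j} x_i}{\sum_{i=1}^{n} m_{i,j}}
\end{aligned}
```

So the best center for a fixed set of members is their centroid, which is why the method alternates between the two updates.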
  • 11-13. K-means for Clustering
      - Start with a random guess of the cluster centers
      - Determine the membership of each data point
      - Adjust the cluster centers
  • 14. K-means 1. Ask user how many clusters they’d like. (e.g. k=5)
  • 15. K-means 1. Ask user how many clusters they’d like. (e.g. k=5) 2. Randomly guess k cluster Center locations
  • 16. K-means 1. Ask user how many clusters they’d like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)
  • 17. K-means 1. Ask user how many clusters they’d like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it’s closest to. 4. Each Center finds the centroid of the points it owns
  • 18. K-means 1. Ask user how many clusters they’d like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it’s closest to. 4. Each Center finds the centroid of the points it owns Any Computational Problem? Computational Complexity: O(N) where N is the number of points?
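Steps 1 to 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenter's code: the function name `kmeans`, the stopping rule (stop when memberships no longer change), the choice of initial centers (random data points), and the handling of empty clusters are all assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on an (N, d) array X; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: guess k centers by picking k distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Step 3: each point finds the center it is closest to.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # memberships stopped changing, so centers will not move
        labels = new_labels
        # Step 4: each center moves to the centroid of the points it owns.
        for j in range(k):
            owned = X[labels == j]
            if len(owned) > 0:  # keep the old center if a cluster is empty
                centers[j] = owned.mean(axis=0)
    return centers, labels

# Usage on two well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(50, 2))])
centers, labels = kmeans(X, k=2)
```

In this sketch the distance table in step 3 costs on the order of N·k·d operations per iteration, which is the part that region-based speedups try to reduce.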
  • 19-20. Improved K-means
      - Group points by region: KD tree, SR tree
      - Key difference: find the closest center for each rectangle, then assign all the points within a rectangle to one cluster
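The slides do not say how to decide that a whole rectangle belongs to one center. One conservative test, shown here as an assumption rather than the method used in the talk, is to compare the largest possible distance to the best center with the smallest possible distance to every other center. The helper name `box_owner` is hypothetical, and building the KD tree that produces the rectangles is left out.

```python
import numpy as np

def box_owner(lo, hi, centers):
    """Return the index of the center that is closest to every point of the
    axis-aligned box [lo, hi], or None if this conservative test cannot tell."""
    # Smallest possible distance from each center to any point in the box.
    nearest = np.clip(centers, lo, hi)
    d_min = np.linalg.norm(centers - nearest, axis=1)
    # Largest possible distance from each center to any point in the box
    # (attained at the farthest corner).
    farthest = np.maximum(np.abs(centers - lo), np.abs(centers - hi))
    d_max = np.linalg.norm(farthest, axis=1)
    best = int(d_max.argmin())
    others = np.delete(d_min, best)
    if others.size == 0 or d_max[best] <= others.min():
        return best  # every point in the box is at least as close to `best`
    return None      # undecided: split the box or fall back to per-point search

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
owner_near_origin = box_owner(np.array([0.0, 0.0]), np.array([1.0, 1.0]), centers)
owner_in_middle = box_owner(np.array([4.0, 4.0]), np.array([6.0, 6.0]), centers)
```

When the test returns `None`, a tree-based implementation would typically descend into the child rectangles and repeat the test there.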
  • 30. A Gaussian Mixture Model for Clustering
      - Assume that the data are generated from a mixture of Gaussian distributions
      - For each Gaussian distribution: center $\mu_i$; variance $\sigma_i$ (ignored)
      - For each data point: determine the membership $z_{ij}$, which indicates whether $x_i$ belongs to the j-th cluster
  • 31-32. Learning a Gaussian Mixture (with known covariance)
      - Probability of a data point: $p(x = x_i) = \sum_{\mu_j} p(x = x_i, \mu = \mu_j) = \sum_{\mu_j} p(\mu = \mu_j)\, p(x = x_i \mid \mu = \mu_j) = \sum_{\mu_j} p(\mu = \mu_j) \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{\|x_i - \mu_j\|^2}{2\sigma^2}\right)$
      - Log-likelihood of the data: $\sum_i \log p(x = x_i) = \sum_i \log \left[ \sum_{\mu_j} p(\mu = \mu_j) \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{\|x_i - \mu_j\|^2}{2\sigma^2}\right) \right]$
      - Apply MLE to find the optimal parameters $\{p(\mu = \mu_j), \mu_j\}$
  • 33. Learning a Gaussian Mixture (with known covariance)
      - E-Step: $E[z_{ij}] = p(\mu = \mu_j \mid x = x_i) = \frac{p(x = x_i \mid \mu = \mu_j)\, p(\mu = \mu_j)}{\sum_{n=1}^{k} p(x = x_i \mid \mu = \mu_n)\, p(\mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}\|x_i - \mu_j\|^2}\, p(\mu = \mu_j)}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}\|x_i - \mu_n\|^2}\, p(\mu = \mu_n)}$
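  • 34. Learning a Gaussian Mixture (with known covariance)
      - M-Step: $\mu_j \leftarrow \frac{1}{\sum_{i=1}^{m} E[z_{ij}]} \sum_{i=1}^{m} E[z_{ij}]\, x_i$ and $p(\mu = \mu_j) \leftarrow \frac{1}{m} \sum_{i=1}^{m} E[z_{ij}]$, where $m$ is the number of data points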
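The E-step and M-step for the known-variance case can be sketched as follows. This is an illustrative NumPy version, not the presenter's implementation: the function name `em_gmm`, the fixed iteration count, the random initialisation from data points, and the log-space stabilisation are assumptions.

```python
import numpy as np

def em_gmm(X, k, sigma2=1.0, n_iter=50, seed=0):
    """EM for a mixture of k spherical Gaussians with known variance sigma2."""
    m = len(X)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, size=k, replace=False)].astype(float)  # initial centers
    prior = np.full(k, 1.0 / k)                                 # p(mu = mu_j)
    for _ in range(n_iter):
        # E-step: E[z_ij] is proportional to
        # exp(-||x_i - mu_j||^2 / (2 sigma2)) * p(mu = mu_j).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_r = -sq / (2.0 * sigma2) + np.log(prior)
        log_r -= log_r.max(axis=1, keepdims=True)  # stabilise before exponentiating
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)          # r[i, j] = E[z_ij]
        # M-step: responsibility-weighted means and updated mixing weights.
        mu = (r.T @ X) / r.sum(axis=0)[:, None]
        prior = r.sum(axis=0) / m
    return mu, prior, r

# Usage on two synthetic blobs (illustrative data only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
mu, prior, r = em_gmm(X, k=2)
```

Unlike k-means, each point contributes to every center, weighted by its responsibility `r[i, j]`; as `sigma2` shrinks the responsibilities approach hard 0/1 memberships.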
  • 43-46. Mixture Model for Doc Clustering
      - A set of language models: $\Theta = \{\theta_1, \theta_2, \ldots, \theta_K\}$, where $\theta_i = \{p(w_1 \mid \theta_i), p(w_2 \mid \theta_i), \ldots, p(w_V \mid \theta_i)\}$
      - Probability of a document: $p(d = d_i) = \sum_{\theta_j} p(d = d_i, \theta = \theta_j) = \sum_{\theta_j} p(\theta = \theta_j)\, p(d = d_i \mid \theta = \theta_j) = \sum_{\theta_j} p(\theta = \theta_j) \prod_{k=1}^{V} \left[p(w_k \mid \theta_j)\right]^{tf(w_k, d_i)}$
      - Introduce hidden variable $z_{ij}$: document $d_i$ is generated by the j-th language model $\theta_j$
  • 47. Learning a Mixture Model
      - E-Step: $E[z_{ij}] = p(\theta = \theta_j \mid d = d_i) = \frac{p(d = d_i \mid \theta = \theta_j)\, p(\theta = \theta_j)}{\sum_{n=1}^{K} p(d = d_i \mid \theta = \theta_n)\, p(\theta = \theta_n)} = \frac{\prod_{k=1}^{V} \left[p(w_k \mid \theta_j)\right]^{tf(w_k, d_i)}\, p(\theta = \theta_j)}{\sum_{n=1}^{K} \prod_{k=1}^{V} \left[p(w_k \mid \theta_n)\right]^{tf(w_k, d_i)}\, p(\theta = \theta_n)}$
      - K: number of language models
  • 48. Learning a Mixture Model
      - M-Step: $p(\theta = \theta_j) \leftarrow \frac{1}{N} \sum_{i=1}^{N} E[z_{ij}]$ and $p(w_k \mid \theta_j) \leftarrow \frac{\sum_{i=1}^{N} E[z_{ij}]\, tf(w_k, d_i)}{\sum_{i=1}^{N} E[z_{ij}]\, |d_i|}$
      - N: number of documents
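The same E-step and M-step for documents can be sketched with a term-frequency matrix. This is a hedged illustration: the function name `em_doc_mixture`, the Dirichlet initialisation, the log-space E-step, and the small `smooth` constant (added so that no word probability becomes exactly zero) are assumptions that do not appear in the slides.

```python
import numpy as np

def em_doc_mixture(tf, K, n_iter=50, seed=0, smooth=1e-9):
    """EM for a mixture of K unigram language models.
    tf is an (N, V) array of term frequencies tf(w_k, d_i)."""
    N, V = tf.shape
    rng = np.random.default_rng(seed)
    p_w = rng.dirichlet(np.ones(V), size=K)  # p(w_k | theta_j), shape (K, V)
    prior = np.full(K, 1.0 / K)              # p(theta = theta_j)
    doc_len = tf.sum(axis=1)                 # |d_i|
    for _ in range(n_iter):
        # E-step in log space:
        # log p(theta_j) + sum_k tf(w_k, d_i) * log p(w_k | theta_j)
        log_r = np.log(prior) + tf @ np.log(p_w).T
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)    # r[i, j] = E[z_ij]
        # M-step: mixing weights and per-model word distributions.
        prior = r.sum(axis=0) / N
        p_w = (r.T @ tf + smooth) / ((r.T @ doc_len)[:, None] + smooth * V)
    return prior, p_w, r

# Usage on a toy corpus with four documents and four words (illustrative only).
tf = np.array([[5, 4, 0, 0],
               [6, 5, 0, 0],
               [0, 0, 5, 6],
               [0, 0, 4, 5]], dtype=float)
prior, p_w, r = em_doc_mixture(tf, K=2)
```

Working in log space matters here because a product over the whole vocabulary underflows quickly for documents of realistic length.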
  • 50. Other Mixture Models: Probabilistic latent semantic indexing (PLSI); Latent Dirichlet Allocation (LDA)
  • 51. Problems (I)
      - Both k-means and mixture models need to compute cluster centers and use an explicit distance measure
      - With an unusual distance measure, the cluster centers can be hard to compute
      - E.g., $\|x - x'\|_\infty = \max\left(|x_1 - x_1'|, |x_2 - x_2'|, \ldots, |x_n - x_n'|\right)$ (figure: points x, y, z with $\|x - y\|_\infty = \|x - z\|_\infty$)
  • 52. Problems (II)
      - Both k-means and mixture models look for compact clustering structures
      - In some cases, connected clustering structures are more desirable
  • 53. Graph Partition: MinCut splits the graph into two parts with the minimal number of cut edges (figure: example with CutSize = 2)
  • 54. 2-way Spectral Graph Partitioning
      - Weight matrix $W$, where $w_{i,j}$ is the weight between two vertices $i$ and $j$
      - Membership vector $q$: $q_i = 1$ if $i \in$ Cluster A, and $q_i = -1$ if $i \in$ Cluster B
      - $q^* = \arg\min_{q \in \{-1,1\}^n} \text{CutSize}$, with $\text{CutSize} = J = \frac{1}{4} \sum_{i,j} (q_i - q_j)^2 w_{i,j}$
  • 55. Solving the Optimization Problem
      - Directly solving the above problem requires combinatorial search, which has exponential complexity
      - How to reduce the computational complexity?
      - $q^* = \arg\min_{q \in \{-1,1\}^n} \frac{1}{4} \sum_{i,j} (q_i - q_j)^2 w_{i,j}$
  • 56. Relaxation Approach
      - Key difficulty: $q_i$ has to be either $-1$ or $1$
      - Relax $q_i$ to be any real number
      - Impose the constraint $\sum_{i=1}^{n} q_i^2 = n$
      - $J = \frac{1}{4} \sum_{i,j} (q_i - q_j)^2 w_{i,j} = \frac{1}{4} \sum_{i,j} (q_i^2 + q_j^2 - 2 q_i q_j) w_{i,j} = \frac{1}{4} \sum_i 2 q_i^2 \sum_j w_{i,j} - \frac{1}{4} \sum_{i,j} 2 q_i q_j w_{i,j}$
      - With $d_i = \sum_j w_{i,j}$: $J = \frac{1}{2} \sum_i q_i^2 d_i - \frac{1}{2} \sum_{i,j} q_i q_j w_{i,j} = \frac{1}{2} \sum_{i,j} q_i (d_i \delta_{i,j} - w_{i,j}) q_j$
      - With $D = [d_i \delta_{i,j}]$: $J = \frac{1}{2} q^T (D - W) q$
  • 57-58. Relaxation Approach
      - $q^* = \arg\min_{q} J = \arg\min_{q} q^T (D - W) q$ subject to $\sum_k q_k^2 = n$
      - Solution: the eigenvector of $D - W$ with the second-smallest eigenvalue, $(D - W) q = \lambda_2 q$
  • 59. Graph Laplacian
      - $L = D - W$, with $W = [w_{i,j}]$ and $D = [\delta_{i,j} \sum_k w_{i,k}]$
      - $L$ is a positive semi-definite matrix: for any $x$, we have $x^T L x \ge 0$. Why?
      - Minimum eigenvalue $\lambda_1 = 0$ (what is the eigenvector?), so $0 = \lambda_1 \le \lambda_2 \le \lambda_3 \le \ldots \le \lambda_k$
      - The eigenvector of the second-smallest eigenvalue $\lambda_2$ gives the best bipartition of the graph
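The two questions on this slide have short answers. The sketch below assumes symmetric, non-negative weights, which the slides use implicitly but do not state:

```latex
% Assumes w_{i,j} = w_{j,i} \ge 0 and d_i = \sum_j w_{i,j}.
\begin{aligned}
x^T L x &= \sum_i d_i x_i^2 - \sum_{i,j} w_{i,j} x_i x_j \\
        &= \frac{1}{2} \sum_{i,j} w_{i,j} \left( x_i^2 + x_j^2 - 2 x_i x_j \right) \\
        &= \frac{1}{2} \sum_{i,j} w_{i,j} (x_i - x_j)^2 \;\ge\; 0, \\
L e &= D e - W e = 0 \quad \text{for } e = (1, \ldots, 1)^T .
\end{aligned}
```

So every eigenvalue is non-negative, and the all-ones vector is an eigenvector with eigenvalue 0. That vector puts every vertex on the same side, which is why the partition is read from the next eigenvector.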
  • 60. Recovering Partitions
      - Due to the relaxation, $q_i$ can be any number (not just $-1$ and $1$)
      - How to construct a partition based on the eigenvector?
      - Simple strategy: $A = \{i \mid q_i \ge 0\}$, $B = \{i \mid q_i < 0\}$
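The whole 2-way procedure (build the Laplacian, take the eigenvector of the second-smallest eigenvalue, threshold at zero) fits in a short NumPy sketch. The function name `spectral_bipartition` and the toy graph are assumptions for illustration. Because the sign of an eigenvector is arbitrary, which side ends up labelled A is arbitrary as well.

```python
import numpy as np

def spectral_bipartition(W):
    """Split a graph with symmetric weight matrix W using the eigenvector
    of the second-smallest eigenvalue of L = D - W."""
    d = W.sum(axis=1)
    L = np.diag(d) - W
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    q = eigvecs[:, 1]                     # relaxed membership vector
    A = np.flatnonzero(q >= 0)
    B = np.flatnonzero(q < 0)
    return A, B, eigvals

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
A, B, eigvals = spectral_bipartition(W)
cut_size = W[np.ix_(A, B)].sum()  # total weight of edges crossing the split
```

`np.linalg.eigh` is used because the Laplacian is symmetric; it returns eigenvalues in ascending order, so column 1 is the one the slides call the second minimum eigenvector.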
  • 61. Spectral Clustering: minimum cut does not balance the sizes of the two partitions
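  • 62. Normalized Cut (Shi & Malik, 1997)
      - Minimize the similarity between clusters and meanwhile maximize the similarity within clusters
      - $s(A, B) = \sum_{i \in A} \sum_{j \in B} w_{i,j}$, $d_A = \sum_{i \in A} d_i$, $d_B = \sum_{i \in B} d_i$
      - $J = \frac{s(A, B)}{d_A} + \frac{s(A, B)}{d_B} = \sum_{i \in A} \sum_{j \in B} w_{i,j} \frac{d_A + d_B}{d_A d_B}$
      - With $d = \sum_j d_j = d_A + d_B$: $J = \sum_{i \in A} \sum_{j \in B} w_{i,j} \left( \sqrt{\frac{d_B}{d_A d}} + \sqrt{\frac{d_A}{d_B d}} \right)^2$
      - Define $q_i = \sqrt{d_B / (d_A d)}$ if $i \in A$ and $q_i = -\sqrt{d_A / (d_B d)}$ if $i \in B$, so that $J = \frac{1}{2} \sum_{i,j} w_{i,j} (q_i - q_j)^2$ (each A-B pair appears twice in the full sum)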
  • 63-64. Normalized Cut
      - $J = \frac{1}{2} \sum_{i,j} w_{i,j} (q_i - q_j)^2 = q^T (D - W) q$, with $q_i = \sqrt{d_B / (d_A d)}$ if $i \in A$ and $q_i = -\sqrt{d_A / (d_B d)}$ if $i \in B$
      - Relax $q$ to real values under the constraints $q^T D q = 1$ and $q^T D e = 0$, where $e$ is the all-ones vector
      - Solution: $(D - W) q = \lambda D q$
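The generalized eigenproblem $(D - W) q = \lambda D q$ can be solved with a plain symmetric eigensolver by substituting $q = D^{-1/2} v$. The sketch below takes that route; the function name `normalized_cut_bipartition`, the sign threshold at zero, and the toy graph are assumptions, and every vertex is assumed to have positive degree.

```python
import numpy as np

def normalized_cut_bipartition(W):
    """Relaxed normalized cut: solve (D - W) q = lambda D q for the
    second-smallest eigenvalue and split on the sign of q.
    Assumes W is symmetric and every vertex has positive degree."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # With q = D^{-1/2} v the generalized problem becomes the symmetric problem
    # D^{-1/2} (D - W) D^{-1/2} v = lambda v.
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # ascending eigenvalues
    q = d_inv_sqrt * eigvecs[:, 1]
    return np.flatnonzero(q >= 0), np.flatnonzero(q < 0), q, eigvals[1]

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
A, B, q, lam = normalized_cut_bipartition(W)
d = W.sum(axis=1)
```

Because `eigh` returns unit-norm, mutually orthogonal eigenvectors, the recovered `q` satisfies both relaxation constraints, $q^T D q = 1$ and $q^T D e = 0$, up to floating-point error.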