Unsupervised Learning: K-Means & Gaussian Mixture Models
Unsupervised Learning

• Supervised learning used labeled data pairs (x, y) to learn a function f : X→Y
  – But what if we don't have labels?
• No labels = unsupervised learning
• Only some points are labeled = semi-supervised learning
  – Labels may be expensive to obtain, so we only get a few
• Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.
  
K-Means Clustering

Some material adapted from slides by Andrew Moore, CMU.

Visit http://www.autonlab.org/tutorials/ for Andrew's repository of Data Mining tutorials.
  
Clustering Data
K-Means Clustering

K-Means(k, X)
• Randomly choose k cluster center locations (centroids)
• Loop until convergence:
  – Assign each point to the cluster of the closest centroid
  – Re-estimate the cluster centroids based on the data assigned to each cluster
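The loop above is short enough to sketch directly. Below is a minimal NumPy implementation, not taken from the slides; the function name `kmeans`, the convergence test, and the toy data at the bottom are illustrative assumptions.

```python
# A minimal K-Means sketch (assumes NumPy; Euclidean distance).
import numpy as np

def kmeans(X, k, max_iters=100, rng=None):
    rng = np.random.default_rng(rng)
    # Randomly choose k cluster center locations (centroids).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the cluster of the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate the centroids from the data assigned to each cluster.
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, labels

# Example usage on toy 2-D data (three well-separated blobs):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
                   for loc in ([0, 0], [5, 5], [0, 5])])
    centroids, labels = kmeans(X, k=3, rng=0)
    print(centroids)
```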
  
K-Means Animation

Example generated by Andrew Moore using Dan Pelleg's super-duper fast K-means system:

Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases, 1999.
K-Means Objective Function

• K-means finds a local optimum of the following objective function:

$$
\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert_2^2
$$

where $S = \{S_1, \dots, S_k\}$ is a partitioning over $X = \{x_1, \dots, x_n\}$ such that $X = \bigcup_{i=1}^{k} S_i$ and $\mu_i = \mathrm{mean}(S_i)$.
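For concreteness, the objective can be evaluated in a couple of lines. This sketch assumes NumPy and the `labels`/`centroids` arrays produced by the `kmeans` sketch above (both names are illustrative, not from the slides).

```python
# Within-cluster sum of squared distances: sum_i sum_{x in S_i} ||x - mu_i||^2
import numpy as np

def kmeans_objective(X, labels, centroids):
    return float(np.sum((X - centroids[labels]) ** 2))
```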
Problems with K-Means

• Very sensitive to the initial points
  – Do many runs of K-Means, each with different initial centroids
  – Seed the centroids using a better method than randomly choosing the centroids
    • e.g., farthest-first sampling (see the sketch below)
• Must manually choose k
  – Learn the optimal k for the clustering
    • Note that this requires a performance measure
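As referenced above, one common way to seed the centroids is a farthest-first traversal. The sketch below (assuming NumPy) is one standard variant, not necessarily the exact procedure the slide has in mind; the function name `farthest_first_seeds` is illustrative.

```python
# Farthest-first seeding: each new seed is the point farthest from all
# seeds chosen so far.
import numpy as np

def farthest_first_seeds(X, k, rng=None):
    rng = np.random.default_rng(rng)
    seeds = [X[rng.integers(len(X))]]          # first seed chosen at random
    for _ in range(k - 1):
        # Distance from each point to its nearest already-chosen seed.
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(seeds)[None, :, :],
                                  axis=2), axis=1)
        seeds.append(X[d.argmax()])            # farthest point becomes the next seed
    return np.array(seeds)
```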
  
Problems with K-Means

• How do you tell it which clustering you want? (The example figure uses k = 2.)
• Constrained clustering techniques (semi-supervised) address this with:
  – Same-cluster constraints (must-link)
  – Different-cluster constraints (cannot-link)
Gaussian Mixture Models

• Recall the Gaussian distribution:

$$
P(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \lvert\Sigma\rvert}}
\exp\!\left( -\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)
$$
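As a quick sanity check of the formula, a direct NumPy translation might look like the following; the function name `gaussian_pdf` is illustrative, and `np.linalg.solve` is used in place of an explicit matrix inverse.

```python
# Evaluate the multivariate Gaussian density at a single point x.
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))
```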
The GMM assumption

• There are k components. The i'th component is called ω_i
• Component ω_i has an associated mean vector µ_i
• Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Draw the datapoint from N(µ_i, σ²I).
The General GMM assumption

• There are k components. The i'th component is called ω_i
• Component ω_i has an associated mean vector µ_i
• Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Draw the datapoint from N(µ_i, Σ_i).
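The two-step recipe translates directly into code. The sketch below (assuming NumPy) samples from a GMM with per-component covariances Σ_i; the function name `sample_gmm` and the example mixture parameters are illustrative, not from the slides.

```python
# Generate n points from a Gaussian mixture, following the recipe above.
import numpy as np

def sample_gmm(n, weights, means, covs, rng=None):
    rng = np.random.default_rng(rng)
    # 1. Pick a component at random: component i with probability P(w_i).
    comps = rng.choice(len(weights), size=n, p=weights)
    # 2. Draw each datapoint from N(mu_i, Sigma_i).
    X = np.array([rng.multivariate_normal(means[i], covs[i]) for i in comps])
    return X, comps

# Example: a 2-component mixture in 2-D.
X, z = sample_gmm(
    200,
    weights=[0.3, 0.7],
    means=[np.zeros(2), np.array([4.0, 4.0])],
    covs=[np.eye(2), 0.5 * np.eye(2)],
    rng=0,
)
```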
Fitting a Gaussian Mixture Model

(Optional)
Expectation-Maximization for GMMs

Iterate until convergence. On the t'th iteration let our estimates be

$$
\lambda_t = \{ \mu_1(t), \mu_2(t), \dots, \mu_c(t) \}
$$

E-step: Compute the "expected" classes of all datapoints for each class:

$$
P(\omega_i \mid x_k, \lambda_t)
= \frac{p(x_k \mid \omega_i, \lambda_t)\, P(\omega_i \mid \lambda_t)}{p(x_k \mid \lambda_t)}
= \frac{p\!\left(x_k \mid \omega_i, \mu_i(t), \sigma^2 I\right) p_i(t)}
       {\sum_{j=1}^{c} p\!\left(x_k \mid \omega_j, \mu_j(t), \sigma^2 I\right) p_j(t)}
$$

The likelihood term $p(x_k \mid \omega_i, \mu_i(t), \sigma^2 I)$ is just a Gaussian evaluated at $x_k$.

M-step: Estimate µ given our data's class membership distributions:

$$
\mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}
$$
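A compact sketch of this E/M loop for the spherical-covariance model (assuming NumPy, a fixed known σ, and uniform mixing weights held constant, since only the means are updated on this slide). The function name `em_spherical` and the initialization scheme are illustrative assumptions.

```python
# EM for a spherical-covariance GMM: only the means are re-estimated.
import numpy as np

def em_spherical(X, k, sigma=1.0, n_iters=50, rng=None):
    rng = np.random.default_rng(rng)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]   # initial means
    p = np.full(k, 1.0 / k)                        # P(w_i), held fixed in this sketch
    for _ in range(n_iters):
        # E-step: responsibilities P(w_i | x_k, lambda_t); the likelihood is
        # a spherical Gaussian evaluated at each x_k.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)      # (n, k)
        lik = np.exp(-0.5 * sq / sigma**2) / (2 * np.pi * sigma**2) ** (d / 2)
        resp = lik * p
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mu_i(t+1) = sum_k P(w_i|x_k) x_k / sum_k P(w_i|x_k)
        mu = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return mu, resp
```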
E.M. for General GMMs

Iterate. On the t'th iteration let our estimates be

$$
\lambda_t = \{ \mu_1(t), \dots, \mu_c(t),\ \Sigma_1(t), \dots, \Sigma_c(t),\ p_1(t), \dots, p_c(t) \}
$$

where $p_i(t)$ is shorthand for the estimate of $P(\omega_i)$ on the t'th iteration.

E-step: Compute the "expected" clusters of all datapoints:

$$
P(\omega_i \mid x_k, \lambda_t)
= \frac{p(x_k \mid \omega_i, \lambda_t)\, P(\omega_i \mid \lambda_t)}{p(x_k \mid \lambda_t)}
= \frac{p\!\left(x_k \mid \omega_i, \mu_i(t), \Sigma_i(t)\right) p_i(t)}
       {\sum_{j=1}^{c} p\!\left(x_k \mid \omega_j, \mu_j(t), \Sigma_j(t)\right) p_j(t)}
$$

Again, the likelihood term is just a Gaussian evaluated at $x_k$.

M-step: Estimate µ, Σ given our data's class membership distributions:

$$
\mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}
$$

$$
\Sigma_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\,
\big[x_k - \mu_i(t+1)\big]\big[x_k - \mu_i(t+1)\big]^{\top}}
{\sum_k P(\omega_i \mid x_k, \lambda_t)}
$$

$$
p_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)}{R}, \qquad R = \#\text{records}
$$
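The M-step updates above map directly onto a few NumPy lines. In this sketch, `resp` is assumed to be the (n, k) matrix of responsibilities P(ω_i | x_k, λ_t) from the E-step (for example, as computed in `em_spherical` above but with full covariances); the function name `m_step_general` is illustrative.

```python
# General-covariance M-step: new means, covariances, and mixing weights.
import numpy as np

def m_step_general(X, resp):
    n, d = X.shape
    k = resp.shape[1]
    Nk = resp.sum(axis=0)                      # sum_k P(w_i | x_k, lambda_t)
    mu = (resp.T @ X) / Nk[:, None]            # mu_i(t+1)
    Sigma = np.empty((k, d, d))
    for i in range(k):
        diff = X - mu[i]                       # x_k - mu_i(t+1)
        Sigma[i] = (resp[:, i, None] * diff).T @ diff / Nk[i]
    p = Nk / n                                 # p_i(t+1) = sum_k P(w_i|x_k) / R
    return mu, Sigma, p
```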
(End optional section)
Gaussian Mixture Example

[Figures omitted: EM fitting a Gaussian mixture, shown at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations.]

[Figures omitted: some bio-assay data, the GMM clustering of the assay data, and the resulting density estimator.]
