1. A new implementation of k-MLE for mixture modelling of Wishart distributions
Christophe Saint-Jean, Frank Nielsen
Geometric Science of Information 2013
August 28, 2013 - Mines ParisTech
2. Application Context (1)
We are interested in clustering varying-length sets of multivariate observations of the same dimension p.
$$X_1 = \begin{pmatrix} 3.6 & 0.05 & 4.0 \\ 3.6 & 0.05 & 4.0 \\ 3.6 & 0.05 & 4.0 \end{pmatrix}, \;\dots,\; X_N = \begin{pmatrix} 5.3 & 0.5 & 2.5 \\ 3.6 & 0.5 & 3.5 \\ 1.6 & 0.5 & 4.6 \\ 1.6 & 0.5 & 5.1 \\ 2.9 & 0.5 & 6.1 \end{pmatrix}$$
The sample mean is a good feature, but not discriminative enough.
Second-order cross-product matrices ${}^tX_i X_i$ may capture some relations between the (column) variables.
3. Application Context (2)
The problem is now the clustering of a set of $p \times p$ PSD matrices:
$$\chi = \{\, x_1 = {}^tX_1 X_1,\; x_2 = {}^tX_2 X_2,\; \dots,\; x_N = {}^tX_N X_N \,\}$$
Examples of applications: multispectral/DTI/radar imaging, motion retrieval systems, ...
5. Outline of this talk
1 MLE and Wishart Distribution
Exponential Family and Maximum Likelihood Estimate
Wishart Distribution
Two sub-families of the Wishart Distribution
2 Mixture modeling with k-MLE
Original k-MLE
k-MLE for Wishart distributions
Heuristics for the initialization
3 Application to motion retrieval
6. Reminder : Exponential Family (EF)
An exponential family is a set of parametric probability distributions
$$EF = \big\{\, p(x; \lambda) = p_F(x; \theta) = \exp\{\langle t(x), \theta \rangle + k(x) - F(\theta)\} \;\big|\; \theta \in \Theta \,\big\}$$
Terminology:
$\lambda$ source parameters.
$\theta$ natural parameters.
$t(x)$ sufficient statistic.
$k(x)$ auxiliary carrier measure.
$F(\theta)$ the log-normalizer: differentiable, strictly convex.
$\Theta = \{\theta \in \mathbb{R}^D \mid F(\theta) < +\infty\}$ is an open convex set.
Almost all commonly used distributions are EF members, with exceptions such as the uniform and Cauchy distributions.
7. Reminder : Maximum Likelihood Estimate (MLE)
The Maximum Likelihood Estimate principle is a very common approach for fitting the parameters of a distribution:
$$\hat{\theta} = \operatorname*{argmax}_{\theta} L(\theta; \chi) = \operatorname*{argmax}_{\theta} \prod_{i=1}^N p(x_i; \theta) = \operatorname*{argmin}_{\theta} -\frac{1}{N}\sum_{i=1}^N \log p(x_i; \theta)$$
assuming a sample $\chi = \{x_1, x_2, \dots, x_N\}$ of i.i.d. observations.
The log-density has a convenient expression for EF members:
$$\log p_F(x; \theta) = \langle t(x), \theta \rangle + k(x) - F(\theta)$$
It follows that
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{i=1}^N \log p_F(x_i; \theta) = \operatorname*{argmax}_{\theta} \Big\langle \sum_{i=1}^N t(x_i), \theta \Big\rangle - N F(\theta)$$
9. MLE with EF
Since $F$ is a strictly convex, differentiable function, the MLE exists and is unique:
$$\nabla F(\hat{\theta}) = \frac{1}{N}\sum_{i=1}^N t(x_i)$$
Ideally, we have a closed form:
$$\hat{\theta} = \nabla F^{-1}\Big(\frac{1}{N}\sum_{i=1}^N t(x_i)\Big)$$
Otherwise, numerical methods including Newton-Raphson can be successfully applied.
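As a minimal illustration (a hypothetical one-parameter example, not from the talk), take the exponential distribution written as an EF with $t(x) = x$, $k(x) = 0$, $F(\theta) = -\log(-\theta)$ on $\Theta = (-\infty, 0)$, so that $\nabla F(\theta) = -1/\theta$ and the MLE is available in closed form:

```python
import numpy as np

# Exponential distribution as an EF: p(x; theta) = exp(theta*x - F(theta)),
# with F(theta) = -log(-theta) and theta = -lambda < 0.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # true theta = -1/scale = -0.5

eta = x.mean()              # (1/N) sum_i t(x_i), with t(x) = x
theta_hat = -1.0 / eta      # closed form: grad F(theta) = -1/theta = eta
print(theta_hat)            # approx -0.5
```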
11. Definition (Central Wishart distribution)
The Wishart distribution characterizes empirical covariance matrices of zero-mean Gaussian samples:
$$\mathcal{W}_d(X; n, S) = \frac{|X|^{\frac{n-d-1}{2}} \exp\big(-\frac{1}{2}\operatorname{tr}(S^{-1}X)\big)}{2^{\frac{nd}{2}}\, |S|^{\frac{n}{2}}\, \Gamma_d\big(\frac{n}{2}\big)}$$
where, for $x > \frac{d-1}{2}$, $\Gamma_d(x) = \pi^{\frac{d(d-1)}{4}} \prod_{j=1}^d \Gamma\big(x - \frac{j-1}{2}\big)$ is the multivariate gamma function.
Remarks: $n > d - 1$, $E[X] = nS$.
It is the multivariate generalization of the chi-square distribution.
12. Wishart Distribution as an EF
It is an exponential family:
$$\log \mathcal{W}_d(X; \theta_n, \theta_S) = \langle \theta_n, \log|X| \rangle_{\mathbb{R}} + \Big\langle \theta_S, -\frac{1}{2}X \Big\rangle_{HS} + k(X) - F(\theta_n, \theta_S)$$
with $k(X) = 0$ and
$$(\theta_n, \theta_S) = \Big(\frac{n-d-1}{2},\; S^{-1}\Big), \qquad t(X) = \Big(\log|X|,\; -\frac{1}{2}X\Big),$$
$$F(\theta_n, \theta_S) = \Big(\theta_n + \frac{d+1}{2}\Big)\big(d \log 2 - \log|\theta_S|\big) + \log \Gamma_d\Big(\theta_n + \frac{d+1}{2}\Big)$$
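As a sanity check (a sketch, not the authors' code; SciPy's `multigammaln` computes $\log \Gamma_d$), this decomposition can be verified numerically against SciPy's Wishart density:

```python
import numpy as np
from scipy.special import multigammaln
from scipy.stats import wishart

def t(X):
    """Sufficient statistic t(X) = (log|X|, -X/2)."""
    return np.linalg.slogdet(X)[1], -0.5 * X

def F(theta_n, theta_S):
    """Log-normalizer F(theta_n, theta_S) of the Wishart EF."""
    d = theta_S.shape[0]
    a = theta_n + (d + 1) / 2.0
    return a * (d * np.log(2.0) - np.linalg.slogdet(theta_S)[1]) + multigammaln(a, d)

d, n = 3, 10.0
rng = np.random.default_rng(1)
A = rng.normal(size=(d, d)); S = A @ A.T + d * np.eye(d)   # some PD scale matrix
X = wishart.rvs(df=n, scale=S, random_state=42)

theta_n, theta_S = (n - d - 1) / 2.0, np.linalg.inv(S)
t1, t2 = t(X)
logp = theta_n * t1 + np.sum(theta_S * t2) - F(theta_n, theta_S)
print(np.isclose(logp, wishart.logpdf(X, df=n, scale=S)))  # True
```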
13. MLE for Wishart Distribution
In the case of the Wishart distribution, a closed form would be obtained by solving the system $\hat{\theta} = \nabla F^{-1}\big(\frac{1}{N}\sum_{i=1}^N t(x_i)\big)$:
$$\begin{cases} d \log 2 - \log|\theta_S| + \psi_d\big(\theta_n + \frac{d+1}{2}\big) = \eta_n \\[4pt] -\big(\theta_n + \frac{d+1}{2}\big)\, \theta_S^{-1} = \eta_S \end{cases} \tag{1}$$
with $(\eta_n, \eta_S)$ the expectation parameters and $\psi_d$ the derivative of $\log \Gamma_d$.
Unfortunately, no closed-form solution is known.
14. Two sub-families of the Wishart Distribution (1)
Case $n$ fixed ($n = 2\theta_n + d + 1$):
$$F_n(\theta_S) = \frac{nd}{2}\log 2 - \frac{n}{2}\log|\theta_S| + \log \Gamma_d\Big(\frac{n}{2}\Big), \qquad k_n(X) = \frac{n-d-1}{2}\log|X|$$
Case $S$ fixed ($\theta_S = S^{-1}$):
$$F_S(\theta_n) = \Big(\theta_n + \frac{d+1}{2}\Big)\log|2S| + \log \Gamma_d\Big(\theta_n + \frac{d+1}{2}\Big), \qquad k_S(X) = -\frac{1}{2}\operatorname{tr}(S^{-1}X)$$
17. Two sub-families of the Wishart Distribution (2)
Both are exponential families, and their MLE equations are solvable!
Case $n$ fixed:
$$-\frac{n}{2}\hat{\theta}_S^{-1} = \frac{1}{N}\sum_{i=1}^N -\frac{1}{2}X_i \;\Longrightarrow\; \hat{\theta}_S = Nn\Big(\sum_{i=1}^N X_i\Big)^{-1} \tag{2}$$
Case $S$ fixed:
$$\hat{\theta}_n = \psi_d^{-1}\Big(\frac{1}{N}\sum_{i=1}^N \log|X_i| - \log|2S|\Big) - \frac{d+1}{2}, \qquad \hat{\theta}_n \geq 0 \tag{3}$$
with $\psi_d^{-1}$ the functional reciprocal of $\psi_d$.
20. An iterative estimator for the Wishart Distribution
Algorithm 1: An estimator for the parameters of the Wishart
Input: A sample $X_1, X_2, \dots, X_N$ of $S_d^{++}$ matrices
Output: Final values of $\hat{\theta}_n$ and $\hat{\theta}_S$
Initialize $\hat{\theta}_n$ with some value $\geq 0$;
repeat
  Update $\hat{\theta}_S$ using Eq. 2 with $n = 2\hat{\theta}_n + d + 1$;
  Update $\hat{\theta}_n$ using Eq. 3 with $S$ the inverse matrix of $\hat{\theta}_S$;
until convergence of the likelihood;
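A minimal Python sketch of Algorithm 1 (an illustration built from the formulas above, not the authors' implementation): $\psi_d$ is a sum of digammas, and its inverse is obtained numerically by bracketing and root finding.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def psi_d(x, d):
    """psi_d(x) = d/dx log Gamma_d(x): a sum of digamma functions."""
    return sum(digamma(x - j / 2.0) for j in range(d))

def psi_d_inv(y, d):
    """Invert the increasing function psi_d numerically (no closed form)."""
    lo = (d - 1) / 2.0 + 1e-9         # left edge of the domain of Gamma_d
    hi = lo + 1.0
    while psi_d(hi, d) < y:           # grow the bracket until it contains y
        hi *= 2.0
    return brentq(lambda x: psi_d(x, d) - y, lo, hi)

def fit_wishart(Xs, theta_n0=1.0, iters=100, tol=1e-9):
    """Algorithm 1: alternate Eq. (2) (update theta_S given n) and
    Eq. (3) (update theta_n given S)."""
    N, d = len(Xs), Xs[0].shape[0]
    sum_X = sum(Xs)
    mean_logdet = np.mean([np.linalg.slogdet(X)[1] for X in Xs])
    theta_n = theta_n0
    for _ in range(iters):
        n = 2.0 * theta_n + d + 1
        S = sum_X / (N * n)           # Eq. (2): theta_S = N n (sum_i X_i)^{-1}
        y = mean_logdet - np.linalg.slogdet(2.0 * S)[1]
        theta_n_new = max(psi_d_inv(y, d) - (d + 1) / 2.0, 0.0)   # Eq. (3)
        if abs(theta_n_new - theta_n) < tol:
            theta_n = theta_n_new
            break
        theta_n = theta_n_new
    return 2.0 * theta_n + d + 1, S   # estimated (n, S)
```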
21. Questions and open problems
From a sample of Wishart matrices, the distribution parameters are recovered in a few iterations.
Major question: do we have an MLE? Probably...
Minor question: what about sample size N = 1?
Under-determined system;
Regularization by sampling around $X_1$.
23. A (finite) mixture is a flexible tool to model a more complex distribution m:
$$m(x) = \sum_{j=1}^k w_j\, p_j(x), \qquad 0 \leq w_j \leq 1, \quad \sum_{j=1}^k w_j = 1$$
where the $p_j$ are the component distributions of the mixture and the $w_j$ the mixing proportions.
In our case, we consider each $p_j$ as a member of some parametric family (EF):
$$m(x; \theta) = \sum_{j=1}^k w_j\, p_{F_j}(x; \theta_j)$$
with $\theta = (w_1, w_2, \dots, w_{k-1}, \theta_1, \theta_2, \dots, \theta_k)$.
Expectation-Maximization is not fast enough [5]...
24. Original k-MLE (primal form.) in one slide
Algorithm 2: k-MLE
Input: A sample $\chi = \{x_1, x_2, \dots, x_N\}$, Bregman generators $F_1, F_2, \dots, F_k$
Output: Estimate $\hat{\theta}$ of the mixture parameters
A good initialization for $\hat{\theta}$ (see later);
repeat
  repeat
    foreach $x_i \in \chi$ do $z_i = \operatorname*{argmax}_j \log \hat{w}_j\, p_{F_j}(x_i; \hat{\theta}_j)$;
    foreach $C_j := \{x_i \in \chi \mid z_i = j\}$ do $\hat{\theta}_j = \mathrm{MLE}_{F_j}(C_j)$;
  until convergence of the complete likelihood;
  Update the mixing proportions: $\hat{w}_j = |C_j|/N$;
until further convergence of the complete likelihood;
25. k-MLE's properties
Another formulation comes from the bijection between EF and Bregman divergences [3]:
$$\log p_F(x; \theta) = -B_{F^*}(t(x) : \eta) + F^*(t(x)) + k(x)$$
where $B_F(\cdot : \cdot)$ is the Bregman divergence associated to a strictly convex and differentiable function $F$:
$$B_F(y : y') = F(y) - F(y') - \langle y - y', \nabla F(y') \rangle$$
26. Original k-MLE (dual form.) in one slide
Algorithm 3: k-MLE
Input: A sample $Y = \{y_1 = t(x_1), y_2 = t(x_2), \dots, y_N = t(x_N)\}$, Bregman generators $F_1^*, F_2^*, \dots, F_k^*$
Output: $\hat{\theta} = (\hat{w}_1, \hat{w}_2, \dots, \hat{w}_{k-1}, \hat{\eta}_1 = \nabla F(\hat{\theta}_1), \dots, \hat{\eta}_k = \nabla F(\hat{\theta}_k))$
A good initialization for $\hat{\theta}$ (see later);
repeat
  repeat
    foreach $x_i \in \chi$ do $z_i = \operatorname*{argmin}_j \big[ B_{F_j^*}(y_i : \hat{\eta}_j) - \log \hat{w}_j \big]$;
    foreach $C_j := \{x_i \in \chi \mid z_i = j\}$ do $\hat{\eta}_j = \sum_{x_i \in C_j} y_i / |C_j|$;
  until convergence of the complete likelihood;
  Update the mixing proportions: $\hat{w}_j = |C_j|/N$;
until further convergence of the complete likelihood;
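A compact sketch of this dual loop (illustrative only: `breg_div` is an assumed callable computing $B_{F^*}(y : \eta)$, and the crude random initialization stands in for the k-MLE++ seeding discussed later):

```python
import numpy as np

def kmle_dual(Y, k, breg_div, iters=100, seed=0):
    """Sketch of k-MLE in dual form. Y is an (N, D) array of sufficient
    statistics y_i = t(x_i); cluster MLEs are plain means of the y_i."""
    rng = np.random.default_rng(seed)
    N = len(Y)
    eta = Y[rng.choice(N, size=k, replace=False)].copy()   # crude init
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # assignment step: z_i = argmin_j B_{F*}(y_i : eta_j) - log w_j
        cost = np.array([[breg_div(y, e) for e in eta] for y in Y])
        z = (cost - np.log(w)).argmin(axis=1)
        # MLE step (dual): eta_j = mean of the y_i assigned to cluster j
        for j in range(k):
            if np.any(z == j):        # empty clusters: see Hartigan variant
                eta[j] = Y[z == j].mean(axis=0)
        w = np.maximum(np.bincount(z, minlength=k) / N, 1e-12)
    return w, eta, z
```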
27. k-MLE for Wishart distributions
Practical considerations impose modifications of the algorithm:
During the assignment step, empty clusters may appear (high-dimensional data make this worse).
A possible solution is to consider Hartigan and Wong's strategy [6] instead of Lloyd's strategy:
  Optimally transfer one observation at a time;
  Update the parameters of the involved clusters;
  Stop when no transfer is possible.
This should guarantee non-empty clusters [7], but it does not work when considering weighted clusters...
Get back to an old-school criterion: $|C_{z_i}| > 1$.
It has been experimentally shown to perform better in high dimension than Lloyd's strategy.
29. k-MLE - Hartigan and Wong
Criterion for potential transfer (max form):
$$\frac{\log \hat{w}_{z_i^*}\, p_{F_{z_i^*}}(x_i; \hat{\theta}_{z_i^*})}{\log \hat{w}_{z_i}\, p_{F_{z_i}}(x_i; \hat{\theta}_{z_i})} > 1, \qquad \text{with } z_i^* = \operatorname*{argmax}_j \log \hat{w}_j\, p_{F_j}(x_i; \hat{\theta}_j)$$
Update rules:
$$\hat{\theta}_{z_i} = \mathrm{MLE}_{F_{z_i}}(C_{z_i} \setminus \{x_i\}), \qquad \hat{\theta}_{z_i^*} = \mathrm{MLE}_{F_{z_i^*}}(C_{z_i^*} \cup \{x_i\})$$
OR
Criterion for potential transfer (min form):
$$\frac{B_{F^*}(y_i : \eta_{z_i^*}) - \log w_{z_i^*}}{B_{F^*}(y_i : \eta_{z_i}) - \log w_{z_i}} < 1, \qquad \text{with } z_i^* = \operatorname*{argmin}_j \big( B_{F^*}(y_i : \eta_j) - \log w_j \big)$$
Update rules:
$$\eta_{z_i} = \frac{|C_{z_i}|\, \eta_{z_i} - y_i}{|C_{z_i}| - 1}, \qquad \eta_{z_i^*} = \frac{|C_{z_i^*}|\, \eta_{z_i^*} + y_i}{|C_{z_i^*}| + 1}$$
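A sketch of a single Hartigan-style pass in the min/dual form (illustrative only: `breg_div` is an assumed callable, weights are held fixed within the pass, and the $|C_{z_i}| > 1$ guard from the previous slide is applied):

```python
import numpy as np

def hartigan_pass(Y, z, eta, w, breg_div):
    """Move points one at a time to the cluster minimizing
    B_{F*}(y : eta_j) - log w_j, with the incremental centroid updates
    above; transfers out of singleton clusters are forbidden."""
    counts = np.bincount(z, minlength=len(eta)).astype(float)
    moved = False
    for i, y in enumerate(Y):
        cost = np.array([breg_div(y, e) for e in eta]) - np.log(w)
        j_new, j_old = int(cost.argmin()), int(z[i])
        if j_new != j_old and counts[j_old] > 1:
            eta[j_old] = (counts[j_old] * eta[j_old] - y) / (counts[j_old] - 1)
            eta[j_new] = (counts[j_new] * eta[j_new] + y) / (counts[j_new] + 1)
            counts[j_old] -= 1.0
            counts[j_new] += 1.0
            z[i] = j_new
            moved = True
    return moved
```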
30. Towards a good initialization...
Classical initialization methods: random centers, random partition, furthest point (2-approximation), ...
A better approach is k-means++ [8]:
Sampling proportionally to the squared distance to the nearest center.
Fast and greedy approximation: $O(kN)$.
Probabilistic guarantee of a good initialization:
$$\mathrm{OPT}_F \leq \text{k-means}_F \leq O(\log k)\, \mathrm{OPT}_F$$
The dual Bregman divergence $B_{F^*}$ may replace the squared distance.
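A sketch of this seeding with a Bregman divergence in place of the squared Euclidean distance (again, `breg_div` is an assumed callable):

```python
import numpy as np

def kmle_pp_seeds(Y, k, breg_div, seed=0):
    """k-MLE++ seeding sketch: k-means++ with a (dual) Bregman divergence
    breg_div(y, c) replacing the squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    N = len(Y)
    C = [Y[rng.integers(N)]]                       # first seed: uniform
    for _ in range(k - 1):
        d = np.array([min(breg_div(y, c) for c in C) for y in Y])
        C.append(Y[rng.choice(N, p=d / d.sum())])  # prop. to divergence
    return np.array(C)
```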
43. No need to fix k, the number of clusters
We propose on-the-fly cluster creation together with k-MLE++ (inspired by DP-k-means [9]):
Create a cluster when there exist observations contributing too much to the loss function with the already selected centers.
It may overestimate the number of clusters...
44. Initialization with DP-k-MLE++
Algorithm 4: DP-k-MLE++
Input: A sample $y_1 = t(X_1), \dots, y_N = t(X_N)$, $F$, $\lambda > 0$
Output: $C$, a subset of $\{y_1, \dots, y_N\}$; $k$, the number of clusters
Choose the first seed $C = \{y_j\}$, for $j$ uniformly random in $\{1, 2, \dots, N\}$;
repeat
  foreach $y_i$ do compute $p_i = B_F(y_i : C) / \sum_{i'=1}^N B_F(y_{i'} : C)$, where $B_F(y_i : C) = \min_{c \in C} B_F(y_i : c)$;
  if $\exists\, p_i > \lambda$ then
    Choose the next seed $s$ among $y_1, y_2, \dots, y_N$ with probability $p_i$;
    Add the selected seed to $C$: $C = C \cup \{s\}$;
until all $p_i \leq \lambda$;
$k = |C|$;
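The same seeding with on-the-fly cluster creation, following Algorithm 4 (a sketch: it stops once no point carries more than a fraction $\lambda$ of the total loss, and `breg_div` is again an assumed callable):

```python
import numpy as np

def dp_kmle_pp(Y, breg_div, lam, seed=0):
    """DP-k-MLE++ sketch: keep adding seeds while some point still carries
    a fraction p_i > lam of the total loss; k is an output, not an input."""
    rng = np.random.default_rng(seed)
    N = len(Y)
    C = [Y[rng.integers(N)]]
    while True:
        d = np.array([min(breg_div(y, c) for c in C) for y in Y])
        tot = d.sum()
        if tot == 0:                      # all points coincide with seeds
            break
        p = d / tot
        if np.all(p <= lam):              # no point contributes too much
            break
        C.append(Y[rng.choice(N, p=p)])
    return np.array(C), len(C)
```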
46. Motion capture
Real dataset: motion capture of contemporary dancers (15 sensors in 3D).
47. Application to motion retrieval (1)
Motion capture data can be viewed as matrices $X_i$ with different row sizes but the same column size d.
The idea is to describe each $X_i$ through the parameters $\hat{\theta}_i$ of one mixture model.
Remark: the size of each sub-motion is known (and so is its n).
Mixture parameters can be viewed as a sparse representation of the local dynamics in $X_i$.
53. Application to motion retrieval (2)
Comparing two movements amounts to computing a dissimilarity measure between $\hat{\theta}_i$ and $\hat{\theta}_j$.
Remark 1: with DP-k-MLE++, the two mixtures would probably not have the same number of components.
Remark 2: when both mixtures have a single component, a natural choice is
$$KL\big(\mathcal{W}_d(\cdot; \hat{\theta}) \,\|\, \mathcal{W}_d(\cdot; \hat{\theta}')\big) = B_F(\hat{\theta}' : \hat{\theta}) = B_{F^*}(\hat{\eta} : \hat{\eta}')$$
A closed form is always available!
No closed form exists for the KL divergence between general mixtures.
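As an illustration of Remark 2 (a sketch, reusing the `F` defined in the EF slide above), the KL between two single Wisharts can be computed as a Bregman divergence; the gradient of $F$ is read off system (1):

```python
import numpy as np
from scipy.special import digamma

def grad_F(theta_n, theta_S):
    """Gradient of the Wishart log-normalizer F (cf. system (1))."""
    d = theta_S.shape[0]
    a = theta_n + (d + 1) / 2.0
    g_n = d * np.log(2.0) - np.linalg.slogdet(theta_S)[1] \
          + sum(digamma(a - j / 2.0) for j in range(d))
    g_S = -a * np.linalg.inv(theta_S)
    return g_n, g_S

def kl_wishart(tn1, tS1, tn2, tS2):
    """KL(W(.; theta_1) || W(.; theta_2)) = B_F(theta_2 : theta_1),
    with F as in the EF slide (assumed in scope)."""
    g_n, g_S = grad_F(tn1, tS1)
    return (F(tn2, tS2) - F(tn1, tS1)
            - (tn2 - tn1) * g_n - np.sum((tS2 - tS1) * g_S))
```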
54. Application to motion retrieval (3)
A possible solution is to use the CS divergence [10]:
$$CS(m : m') = -\log \frac{\int m(x)\, m'(x)\, dx}{\sqrt{\int m(x)^2\, dx \int m'(x)^2\, dx}}$$
It has an analytic formula, since
$$\int m(x)\, m'(x)\, dx = \sum_{j=1}^{k} \sum_{j'=1}^{k'} w_j\, w'_{j'} \exp\Big( F(\theta_j + \theta'_{j'}) - \big(F(\theta_j) + F(\theta'_{j'})\big) \Big)$$
Note that this expression is well defined whenever $\theta_j + \theta'_{j'} \in \Theta$.
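A sketch of the closed-form computation (assuming components with $k(x) = 0$ sharing one log-normalizer `F`, with natural parameters packed as NumPy arrays so they can be added):

```python
import numpy as np

def mixture_inner(w_a, th_a, w_b, th_b, F):
    """Closed form for int m(x) m'(x) dx when all components belong to one
    EF with carrier k(x) = 0 (true for the Wishart family)."""
    return sum(wi * wj * np.exp(F(ti + tj) - F(ti) - F(tj))
               for wi, ti in zip(w_a, th_a) for wj, tj in zip(w_b, th_b))

def cs_divergence(w, th, w2, th2, F):
    """Cauchy-Schwarz divergence between two EF mixtures (sketch, cf. [10])."""
    num = mixture_inner(w, th, w2, th2, F)
    den = np.sqrt(mixture_inner(w, th, w, th, F) *
                  mixture_inner(w2, th2, w2, th2, F))
    return -np.log(num / den)
```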
57. ... code in MatlabTM.
Today, the implementation is in Python (based on pyMEF [2]).
Ongoing proof of concept (with Herranz F., Beurive A.).
58. Conclusions - Future works
Still some mathematical work to be done:
  Solve the MLE equations to get $\nabla F^* = (\nabla F)^{-1}$, then $F^*$;
  Characterize our estimator for the full Wishart distribution.
Complete and validate the prototype of the motion retrieval system.
Speed up the algorithm: computational/numerical/algorithmic tricks.
A library for Bregman divergence learning?
Possible extensions:
  Reintroduce the mean vector in the model: Gaussian-Wishart;
  Online k-means - online k-MLE ...
59. References I
[1] Nielsen, F.: k-MLE: A fast algorithm for learning statistical mixture models. In: International Conference on Acoustics, Speech and Signal Processing. (2012) pp. 869-872
[2] Schwander, O., Nielsen, F.: pyMEF - A framework for exponential families in Python. In: Proceedings of the 2011 IEEE Workshop on Statistical Signal Processing
[3] Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6 (2005) 1705-1749
[4] Nielsen, F., Garcia, V.: Statistical exponential families: A digest with flash cards. http://arxiv.org/abs/0911.4863 (11 2009)
[5] Hidot, S., Saint-Jean, C.: An Expectation-Maximization algorithm for the Wishart mixture model: Application to movement clustering. Pattern Recognition Letters 31(14) (2010) 2318-2324
60. References II
[6] Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1) (1979) 100-108
[7] Telgarsky, M., Vattani, A.: Hartigan's method: k-means clustering without Voronoi. In: Proc. of International Conference on Artificial Intelligence and Statistics (AISTATS). (2010) pp. 820-827
[8] Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms. (2007) pp. 1027-1035
[9] Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics. In: International Conference on Machine Learning (ICML). (2012)
[10] Nielsen, F.: Closed-form information-theoretic divergences for statistical mixtures. In: International Conference on Pattern Recognition (ICPR). (2012) pp. 1723-1726