1. A new implementation of k-MLE for mixture modelling of Wishart distributions
Christophe Saint-Jean, Frank Nielsen
Geometric Science of Information 2013
August 28, 2013 - Mines ParisTech
2. Application Context (1)
We are interested in clustering varying-length sets of multivariate observations of the same dimension p.
$$X_1 = \begin{pmatrix} 3.6 & 0.05 & 4.0 \\ 3.6 & 0.05 & 4.0 \\ 3.6 & 0.05 & 4.0 \end{pmatrix}, \;\dots,\; X_N = \begin{pmatrix} 5.3 & 0.5 & 2.5 \\ 3.6 & 0.5 & 3.5 \\ 1.6 & 0.5 & 4.6 \\ 1.6 & 0.5 & 5.1 \\ 2.9 & 0.5 & 6.1 \end{pmatrix}$$
The sample mean is a good feature, but not discriminative enough.
Second-order cross-product matrices ${}^tX_i X_i$ may capture some relations between the (column) variables.
3. Application Context (2)
The problem is now the clustering of a set of $p \times p$ PSD matrices:
$$\chi = \{\, x_1 = {}^tX_1 X_1,\; x_2 = {}^tX_2 X_2,\; \dots,\; x_N = {}^tX_N X_N \,\}$$
Examples of applications: multispectral/DTI/radar imaging, motion retrieval systems, ...
5. Outline of this talk
1 MLE and Wishart Distribution
Exponential Family and Maximum Likelihood Estimate
Wishart Distribution
Two sub-families of the Wishart Distribution
2 Mixture modeling with k-MLE
Original k-MLE
k-MLE for Wishart distributions
Heuristics for the initialization
3 Application to motion retrieval
6. Reminder : Exponential Family (EF)
An exponential family is a set of parametric probability distributions
$$EF = \big\{\, p(x; \lambda) = p_F(x; \theta) = \exp\{\langle t(x), \theta \rangle + k(x) - F(\theta)\} \;\big|\; \theta \in \Theta \,\big\}$$
Terminology:
$\lambda$ source parameters.
$\theta$ natural parameters.
$t(x)$ sufficient statistic.
$k(x)$ auxiliary carrier measure.
$F(\theta)$ the log-normalizer: differentiable, strictly convex.
$\Theta = \{\theta \in \mathbb{R}^D \mid F(\theta) < +\infty\}$ is an open convex set.
Almost all commonly used distributions are EF members, with exceptions such as the uniform and Cauchy distributions.
7. Reminder : Maximum Likelihood Estimate (MLE)
The Maximum Likelihood Estimate principle is a very common approach for fitting the parameters of a distribution:
$$\hat{\theta} = \operatorname*{argmax}_{\theta} L(\theta; \chi) = \operatorname*{argmax}_{\theta} \prod_{i=1}^N p(x_i; \theta) = \operatorname*{argmin}_{\theta} -\frac{1}{N}\sum_{i=1}^N \log p(x_i; \theta)$$
assuming a sample $\chi = \{x_1, x_2, \dots, x_N\}$ of i.i.d. observations.
The log-density has a convenient expression for EF members:
$$\log p_F(x; \theta) = \langle t(x), \theta \rangle + k(x) - F(\theta)$$
It follows that
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{i=1}^N \log p_F(x_i; \theta) = \operatorname*{argmax}_{\theta} \Big\langle \sum_{i=1}^N t(x_i), \theta \Big\rangle - N F(\theta)$$
9. MLE with EF
Since $F$ is a strictly convex, differentiable function, the MLE exists and is unique:
$$\nabla F(\hat{\theta}) = \frac{1}{N}\sum_{i=1}^N t(x_i)$$
Ideally, we have a closed form:
$$\hat{\theta} = \nabla F^{-1}\Big(\frac{1}{N}\sum_{i=1}^N t(x_i)\Big)$$
Otherwise, numerical methods including Newton-Raphson can be successfully applied.
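As a minimal illustration (a hypothetical one-parameter example, not from the talk), take the exponential distribution written as an EF with $t(x) = x$, $k(x) = 0$, $F(\theta) = -\log(-\theta)$ on $\Theta = (-\infty, 0)$, so that $\nabla F(\theta) = -1/\theta$ and the MLE is available in closed form:

```python
import numpy as np

# Exponential distribution as an EF: p(x; theta) = exp(theta*x - F(theta)),
# with F(theta) = -log(-theta) and theta = -lambda < 0.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # true theta = -1/scale = -0.5

eta = x.mean()              # (1/N) sum_i t(x_i), with t(x) = x
theta_hat = -1.0 / eta      # closed form: grad F(theta) = -1/theta = eta
print(theta_hat)            # approx -0.5
```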
11. Definition (Central Wishart distribution)
The Wishart distribution characterizes empirical covariance matrices of zero-mean Gaussian samples:
$$\mathcal{W}_d(X; n, S) = \frac{|X|^{\frac{n-d-1}{2}} \exp\big(-\frac{1}{2}\operatorname{tr}(S^{-1}X)\big)}{2^{\frac{nd}{2}}\, |S|^{\frac{n}{2}}\, \Gamma_d\big(\frac{n}{2}\big)}$$
where, for $x > \frac{d-1}{2}$, $\Gamma_d(x) = \pi^{\frac{d(d-1)}{4}} \prod_{j=1}^d \Gamma\big(x - \frac{j-1}{2}\big)$ is the multivariate gamma function.
Remarks: $n > d - 1$, $E[X] = nS$.
It is the multivariate generalization of the chi-square distribution.
12. Wishart Distribution as an EF
It is an exponential family:
$$\log \mathcal{W}_d(X; \theta_n, \theta_S) = \langle \theta_n, \log|X| \rangle_{\mathbb{R}} + \Big\langle \theta_S, -\frac{1}{2}X \Big\rangle_{HS} + k(X) - F(\theta_n, \theta_S)$$
with $k(X) = 0$ and
$$(\theta_n, \theta_S) = \Big(\frac{n-d-1}{2},\; S^{-1}\Big), \qquad t(X) = \Big(\log|X|,\; -\frac{1}{2}X\Big),$$
$$F(\theta_n, \theta_S) = \Big(\theta_n + \frac{d+1}{2}\Big)\big(d \log 2 - \log|\theta_S|\big) + \log \Gamma_d\Big(\theta_n + \frac{d+1}{2}\Big)$$
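As a sanity check (a sketch, not the authors' code; SciPy's `multigammaln` computes $\log \Gamma_d$), this decomposition can be verified numerically against SciPy's Wishart density:

```python
import numpy as np
from scipy.special import multigammaln
from scipy.stats import wishart

def t(X):
    """Sufficient statistic t(X) = (log|X|, -X/2)."""
    return np.linalg.slogdet(X)[1], -0.5 * X

def F(theta_n, theta_S):
    """Log-normalizer F(theta_n, theta_S) of the Wishart EF."""
    d = theta_S.shape[0]
    a = theta_n + (d + 1) / 2.0
    return a * (d * np.log(2.0) - np.linalg.slogdet(theta_S)[1]) + multigammaln(a, d)

d, n = 3, 10.0
rng = np.random.default_rng(1)
A = rng.normal(size=(d, d)); S = A @ A.T + d * np.eye(d)   # some PD scale matrix
X = wishart.rvs(df=n, scale=S, random_state=42)

theta_n, theta_S = (n - d - 1) / 2.0, np.linalg.inv(S)
t1, t2 = t(X)
logp = theta_n * t1 + np.sum(theta_S * t2) - F(theta_n, theta_S)
print(np.isclose(logp, wishart.logpdf(X, df=n, scale=S)))  # True
```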
13. MLE for Wishart Distribution
In the case of the Wishart distribution, a closed form would be obtained by solving the system $\hat{\theta} = \nabla F^{-1}\big(\frac{1}{N}\sum_{i=1}^N t(x_i)\big)$:
$$\begin{cases} d \log 2 - \log|\theta_S| + \psi_d\big(\theta_n + \frac{d+1}{2}\big) = \eta_n \\[4pt] -\big(\theta_n + \frac{d+1}{2}\big)\, \theta_S^{-1} = \eta_S \end{cases} \tag{1}$$
with $(\eta_n, \eta_S)$ the expectation parameters and $\psi_d$ the derivative of $\log \Gamma_d$.
Unfortunately, no closed-form solution is known.
14. Two sub-families of the Wishart Distribution (1)
Case $n$ fixed ($n = 2\theta_n + d + 1$):
$$F_n(\theta_S) = \frac{nd}{2}\log 2 - \frac{n}{2}\log|\theta_S| + \log \Gamma_d\Big(\frac{n}{2}\Big), \qquad k_n(X) = \frac{n-d-1}{2}\log|X|$$
Case $S$ fixed ($\theta_S = S^{-1}$):
$$F_S(\theta_n) = \Big(\theta_n + \frac{d+1}{2}\Big)\log|2S| + \log \Gamma_d\Big(\theta_n + \frac{d+1}{2}\Big), \qquad k_S(X) = -\frac{1}{2}\operatorname{tr}(S^{-1}X)$$
17. Two sub-families of the Wishart Distribution (2)
Both are exponential families, and their MLE equations are solvable!
Case $n$ fixed:
$$-\frac{n}{2}\hat{\theta}_S^{-1} = \frac{1}{N}\sum_{i=1}^N -\frac{1}{2}X_i \;\Longrightarrow\; \hat{\theta}_S = Nn\Big(\sum_{i=1}^N X_i\Big)^{-1} \tag{2}$$
Case $S$ fixed:
$$\hat{\theta}_n = \psi_d^{-1}\Big(\frac{1}{N}\sum_{i=1}^N \log|X_i| - \log|2S|\Big) - \frac{d+1}{2}, \qquad \hat{\theta}_n \geq 0 \tag{3}$$
with $\psi_d^{-1}$ the functional reciprocal of $\psi_d$.
20. An iterative estimator for the Wishart Distribution
Algorithm 1: An estimator for the parameters of the Wishart
Input: A sample $X_1, X_2, \dots, X_N$ of $S_d^{++}$ matrices
Output: Final values of $\hat{\theta}_n$ and $\hat{\theta}_S$
Initialize $\hat{\theta}_n$ with some value $\geq 0$;
repeat
  Update $\hat{\theta}_S$ using Eq. 2 with $n = 2\hat{\theta}_n + d + 1$;
  Update $\hat{\theta}_n$ using Eq. 3 with $S$ the inverse matrix of $\hat{\theta}_S$;
until convergence of the likelihood;
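A minimal Python sketch of Algorithm 1 (an illustration built from the formulas above, not the authors' implementation): $\psi_d$ is a sum of digammas, and its inverse is obtained numerically by bracketing and root finding.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def psi_d(x, d):
    """psi_d(x) = d/dx log Gamma_d(x): a sum of digamma functions."""
    return sum(digamma(x - j / 2.0) for j in range(d))

def psi_d_inv(y, d):
    """Invert the increasing function psi_d numerically (no closed form)."""
    lo = (d - 1) / 2.0 + 1e-9         # left edge of the domain of Gamma_d
    hi = lo + 1.0
    while psi_d(hi, d) < y:           # grow the bracket until it contains y
        hi *= 2.0
    return brentq(lambda x: psi_d(x, d) - y, lo, hi)

def fit_wishart(Xs, theta_n0=1.0, iters=100, tol=1e-9):
    """Algorithm 1: alternate Eq. (2) (update theta_S given n) and
    Eq. (3) (update theta_n given S)."""
    N, d = len(Xs), Xs[0].shape[0]
    sum_X = sum(Xs)
    mean_logdet = np.mean([np.linalg.slogdet(X)[1] for X in Xs])
    theta_n = theta_n0
    for _ in range(iters):
        n = 2.0 * theta_n + d + 1
        S = sum_X / (N * n)           # Eq. (2): theta_S = N n (sum_i X_i)^{-1}
        y = mean_logdet - np.linalg.slogdet(2.0 * S)[1]
        theta_n_new = max(psi_d_inv(y, d) - (d + 1) / 2.0, 0.0)   # Eq. (3)
        if abs(theta_n_new - theta_n) < tol:
            theta_n = theta_n_new
            break
        theta_n = theta_n_new
    return 2.0 * theta_n + d + 1, S   # estimated (n, S)
```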
21. Questions and open problems
From a sample of Wishart matrices, the distribution parameters are recovered in a few iterations.
Major question: do we have an MLE? Probably...
Minor question: what about sample size N = 1?
Under-determined system;
Regularization by sampling around $X_1$.
23. A (finite) mixture is a flexible tool to model a more complex distribution m:
$$m(x) = \sum_{j=1}^k w_j\, p_j(x), \qquad 0 \leq w_j \leq 1, \quad \sum_{j=1}^k w_j = 1$$
where the $p_j$ are the component distributions of the mixture and the $w_j$ the mixing proportions.
In our case, we consider each $p_j$ as a member of some parametric family (EF):
$$m(x; \theta) = \sum_{j=1}^k w_j\, p_{F_j}(x; \theta_j)$$
with $\theta = (w_1, w_2, \dots, w_{k-1}, \theta_1, \theta_2, \dots, \theta_k)$.
Expectation-Maximization is not fast enough [5]...
24. Original k-MLE (primal form.) in one slide
Algorithm 2: k-MLE
Input: A sample $\chi = \{x_1, x_2, \dots, x_N\}$, Bregman generators $F_1, F_2, \dots, F_k$
Output: Estimate $\hat{\theta}$ of the mixture parameters
A good initialization for $\hat{\theta}$ (see later);
repeat
  repeat
    foreach $x_i \in \chi$ do $z_i = \operatorname*{argmax}_j \log \hat{w}_j\, p_{F_j}(x_i; \hat{\theta}_j)$;
    foreach $C_j := \{x_i \in \chi \mid z_i = j\}$ do $\hat{\theta}_j = \mathrm{MLE}_{F_j}(C_j)$;
  until convergence of the complete likelihood;
  Update the mixing proportions: $\hat{w}_j = |C_j|/N$;
until further convergence of the complete likelihood;
25. k-MLE's properties
Another formulation comes from the bijection between EF and Bregman divergences [3]:
$$\log p_F(x; \theta) = -B_{F^*}(t(x) : \eta) + F^*(t(x)) + k(x)$$
where $B_F(\cdot : \cdot)$ is the Bregman divergence associated to a strictly convex and differentiable function $F$:
$$B_F(y : y') = F(y) - F(y') - \langle y - y', \nabla F(y') \rangle$$
26. Original k-MLE (dual form.) in one slide
Algorithm 3: k-MLE
Input: A sample $Y = \{y_1 = t(x_1), y_2 = t(x_2), \dots, y_N = t(x_N)\}$, Bregman generators $F_1^*, F_2^*, \dots, F_k^*$
Output: $\hat{\theta} = (\hat{w}_1, \hat{w}_2, \dots, \hat{w}_{k-1}, \hat{\eta}_1 = \nabla F(\hat{\theta}_1), \dots, \hat{\eta}_k = \nabla F(\hat{\theta}_k))$
A good initialization for $\hat{\theta}$ (see later);
repeat
  repeat
    foreach $x_i \in \chi$ do $z_i = \operatorname*{argmin}_j \big[ B_{F_j^*}(y_i : \hat{\eta}_j) - \log \hat{w}_j \big]$;
    foreach $C_j := \{x_i \in \chi \mid z_i = j\}$ do $\hat{\eta}_j = \sum_{x_i \in C_j} y_i / |C_j|$;
  until convergence of the complete likelihood;
  Update the mixing proportions: $\hat{w}_j = |C_j|/N$;
until further convergence of the complete likelihood;
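A compact sketch of this dual loop (illustrative only: `breg_div` is an assumed callable computing $B_{F^*}(y : \eta)$, and the crude random initialization stands in for the k-MLE++ seeding discussed later):

```python
import numpy as np

def kmle_dual(Y, k, breg_div, iters=100, seed=0):
    """Sketch of k-MLE in dual form. Y is an (N, D) array of sufficient
    statistics y_i = t(x_i); cluster MLEs are plain means of the y_i."""
    rng = np.random.default_rng(seed)
    N = len(Y)
    eta = Y[rng.choice(N, size=k, replace=False)].copy()   # crude init
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # assignment step: z_i = argmin_j B_{F*}(y_i : eta_j) - log w_j
        cost = np.array([[breg_div(y, e) for e in eta] for y in Y])
        z = (cost - np.log(w)).argmin(axis=1)
        # MLE step (dual): eta_j = mean of the y_i assigned to cluster j
        for j in range(k):
            if np.any(z == j):        # empty clusters: see Hartigan variant
                eta[j] = Y[z == j].mean(axis=0)
        w = np.maximum(np.bincount(z, minlength=k) / N, 1e-12)
    return w, eta, z
```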
27. k-MLE for Wishart distributions
Practical considerations impose modifications of the algorithm:
During the assignment step, empty clusters may appear (high-dimensional data make this worse).
A possible solution is to consider Hartigan and Wong's strategy [6] instead of Lloyd's strategy:
  Optimally transfer one observation at a time;
  Update the parameters of the involved clusters;
  Stop when no transfer is possible.
This should guarantee non-empty clusters [7], but it does not work when considering weighted clusters...
Get back to an old-school criterion: $|C_{z_i}| > 1$.
It has been experimentally shown to perform better in high dimension than Lloyd's strategy.
29. k-MLE - Hartigan and Wong
Criterion for potential transfer (max form):
$$\frac{\log \hat{w}_{z_i^*}\, p_{F_{z_i^*}}(x_i; \hat{\theta}_{z_i^*})}{\log \hat{w}_{z_i}\, p_{F_{z_i}}(x_i; \hat{\theta}_{z_i})} > 1, \qquad \text{with } z_i^* = \operatorname*{argmax}_j \log \hat{w}_j\, p_{F_j}(x_i; \hat{\theta}_j)$$
Update rules:
$$\hat{\theta}_{z_i} = \mathrm{MLE}_{F_{z_i}}(C_{z_i} \setminus \{x_i\}), \qquad \hat{\theta}_{z_i^*} = \mathrm{MLE}_{F_{z_i^*}}(C_{z_i^*} \cup \{x_i\})$$
OR
Criterion for potential transfer (min form):
$$\frac{B_{F^*}(y_i : \eta_{z_i^*}) - \log w_{z_i^*}}{B_{F^*}(y_i : \eta_{z_i}) - \log w_{z_i}} < 1, \qquad \text{with } z_i^* = \operatorname*{argmin}_j \big( B_{F^*}(y_i : \eta_j) - \log w_j \big)$$
Update rules:
$$\eta_{z_i} = \frac{|C_{z_i}|\, \eta_{z_i} - y_i}{|C_{z_i}| - 1}, \qquad \eta_{z_i^*} = \frac{|C_{z_i^*}|\, \eta_{z_i^*} + y_i}{|C_{z_i^*}| + 1}$$
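A sketch of a single Hartigan-style pass in the min/dual form (illustrative only: `breg_div` is an assumed callable, weights are held fixed within the pass, and the $|C_{z_i}| > 1$ guard from the previous slide is applied):

```python
import numpy as np

def hartigan_pass(Y, z, eta, w, breg_div):
    """Move points one at a time to the cluster minimizing
    B_{F*}(y : eta_j) - log w_j, with the incremental centroid updates
    above; transfers out of singleton clusters are forbidden."""
    counts = np.bincount(z, minlength=len(eta)).astype(float)
    moved = False
    for i, y in enumerate(Y):
        cost = np.array([breg_div(y, e) for e in eta]) - np.log(w)
        j_new, j_old = int(cost.argmin()), int(z[i])
        if j_new != j_old and counts[j_old] > 1:
            eta[j_old] = (counts[j_old] * eta[j_old] - y) / (counts[j_old] - 1)
            eta[j_new] = (counts[j_new] * eta[j_new] + y) / (counts[j_new] + 1)
            counts[j_old] -= 1.0
            counts[j_new] += 1.0
            z[i] = j_new
            moved = True
    return moved
```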
30. Towards a good initialization...
Classical initialization methods: random centers, random partition, furthest point (2-approximation), ...
A better approach is k-means++ [8]:
Sampling proportionally to the squared distance to the nearest center.
Fast and greedy approximation: $O(kN)$.
Probabilistic guarantee of a good initialization:
$$\mathrm{OPT}_F \leq \text{k-means}_F \leq O(\log k)\, \mathrm{OPT}_F$$
The dual Bregman divergence $B_{F^*}$ may replace the squared distance.
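A sketch of this seeding with a Bregman divergence in place of the squared Euclidean distance (again, `breg_div` is an assumed callable):

```python
import numpy as np

def kmle_pp_seeds(Y, k, breg_div, seed=0):
    """k-MLE++ seeding sketch: k-means++ with a (dual) Bregman divergence
    breg_div(y, c) replacing the squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    N = len(Y)
    C = [Y[rng.integers(N)]]                       # first seed: uniform
    for _ in range(k - 1):
        d = np.array([min(breg_div(y, c) for c in C) for y in Y])
        C.append(Y[rng.choice(N, p=d / d.sum())])  # prop. to divergence
    return np.array(C)
```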
43. No need to fix k, the number of clusters
We propose on-the-fly cluster creation together with k-MLE++ (inspired by DP-k-means [9]):
Create a cluster when there exist observations contributing too much to the loss function with the already selected centers.
It may overestimate the number of clusters...
44. Initialization with DP-k-MLE++
Algorithm 4: DP-k-MLE++
Input: A sample $y_1 = t(X_1), \dots, y_N = t(X_N)$, $F$, $\lambda > 0$
Output: $C$, a subset of $\{y_1, \dots, y_N\}$; $k$, the number of clusters
Choose the first seed $C = \{y_j\}$, for $j$ uniformly random in $\{1, 2, \dots, N\}$;
repeat
  foreach $y_i$ do compute $p_i = B_F(y_i : C) / \sum_{i'=1}^N B_F(y_{i'} : C)$, where $B_F(y_i : C) = \min_{c \in C} B_F(y_i : c)$;
  if $\exists\, p_i > \lambda$ then
    Choose the next seed $s$ among $y_1, y_2, \dots, y_N$ with probability $p_i$;
    Add the selected seed to $C$: $C = C \cup \{s\}$;
until all $p_i \leq \lambda$;
$k = |C|$;
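The same seeding with on-the-fly cluster creation, following Algorithm 4 (a sketch: it stops once no point carries more than a fraction $\lambda$ of the total loss, and `breg_div` is again an assumed callable):

```python
import numpy as np

def dp_kmle_pp(Y, breg_div, lam, seed=0):
    """DP-k-MLE++ sketch: keep adding seeds while some point still carries
    a fraction p_i > lam of the total loss; k is an output, not an input."""
    rng = np.random.default_rng(seed)
    N = len(Y)
    C = [Y[rng.integers(N)]]
    while True:
        d = np.array([min(breg_div(y, c) for c in C) for y in Y])
        tot = d.sum()
        if tot == 0:                      # all points coincide with seeds
            break
        p = d / tot
        if np.all(p <= lam):              # no point contributes too much
            break
        C.append(Y[rng.choice(N, p=p)])
    return np.array(C), len(C)
```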
46. Motion capture
Real dataset: motion capture of contemporary dancers (15 sensors in 3D).
47. Application to motion retrieval (1)
Motion capture data can be viewed as matrices $X_i$ with different row sizes but the same column size d.
The idea is to describe each $X_i$ through the parameters $\hat{\theta}_i$ of one mixture model.
Remark: the size of each sub-motion is known (and so is its n).
Mixture parameters can be viewed as a sparse representation of the local dynamics in $X_i$.
53. Application to motion retrieval (2)
Comparing two movements amounts to computing a dissimilarity measure between $\hat{\theta}_i$ and $\hat{\theta}_j$.
Remark 1: with DP-k-MLE++, the two mixtures would probably not have the same number of components.
Remark 2: when both mixtures have a single component, a natural choice is
$$KL\big(\mathcal{W}_d(\cdot; \hat{\theta}) \,\|\, \mathcal{W}_d(\cdot; \hat{\theta}')\big) = B_F(\hat{\theta}' : \hat{\theta}) = B_{F^*}(\hat{\eta} : \hat{\eta}')$$
A closed form is always available!
No closed form exists for the KL divergence between general mixtures.
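As an illustration of Remark 2 (a sketch, reusing the `F` defined in the EF slide above), the KL between two single Wisharts can be computed as a Bregman divergence; the gradient of $F$ is read off system (1):

```python
import numpy as np
from scipy.special import digamma

def grad_F(theta_n, theta_S):
    """Gradient of the Wishart log-normalizer F (cf. system (1))."""
    d = theta_S.shape[0]
    a = theta_n + (d + 1) / 2.0
    g_n = d * np.log(2.0) - np.linalg.slogdet(theta_S)[1] \
          + sum(digamma(a - j / 2.0) for j in range(d))
    g_S = -a * np.linalg.inv(theta_S)
    return g_n, g_S

def kl_wishart(tn1, tS1, tn2, tS2):
    """KL(W(.; theta_1) || W(.; theta_2)) = B_F(theta_2 : theta_1),
    with F as in the EF slide (assumed in scope)."""
    g_n, g_S = grad_F(tn1, tS1)
    return (F(tn2, tS2) - F(tn1, tS1)
            - (tn2 - tn1) * g_n - np.sum((tS2 - tS1) * g_S))
```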
54. Application to motion retrieval (3)
A possible solution is to use the CS divergence [10]:
$$CS(m : m') = -\log \frac{\int m(x)\, m'(x)\, dx}{\sqrt{\int m(x)^2\, dx \int m'(x)^2\, dx}}$$
It has an analytic formula, since
$$\int m(x)\, m'(x)\, dx = \sum_{j=1}^{k} \sum_{j'=1}^{k'} w_j\, w'_{j'} \exp\Big( F(\theta_j + \theta'_{j'}) - \big(F(\theta_j) + F(\theta'_{j'})\big) \Big)$$
Note that this expression is well defined whenever $\theta_j + \theta'_{j'} \in \Theta$.
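A sketch of the closed-form computation (assuming components with $k(x) = 0$ sharing one log-normalizer `F`, with natural parameters packed as NumPy arrays so they can be added):

```python
import numpy as np

def mixture_inner(w_a, th_a, w_b, th_b, F):
    """Closed form for int m(x) m'(x) dx when all components belong to one
    EF with carrier k(x) = 0 (true for the Wishart family)."""
    return sum(wi * wj * np.exp(F(ti + tj) - F(ti) - F(tj))
               for wi, ti in zip(w_a, th_a) for wj, tj in zip(w_b, th_b))

def cs_divergence(w, th, w2, th2, F):
    """Cauchy-Schwarz divergence between two EF mixtures (sketch, cf. [10])."""
    num = mixture_inner(w, th, w2, th2, F)
    den = np.sqrt(mixture_inner(w, th, w, th, F) *
                  mixture_inner(w2, th2, w2, th2, F))
    return -np.log(num / den)
```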
57. ... code in MatlabTM.
Today, the implementation is in Python (based on pyMEF [2]).
Ongoing proof of concept (with Herranz F., Beurive A.).
58. Conclusions - Future works
Still some mathematical work to be done:
  Solve the MLE equations to get $\nabla F^* = (\nabla F)^{-1}$, then $F^*$;
  Characterize our estimator for the full Wishart distribution.
Complete and validate the prototype of the motion retrieval system.
Speed up the algorithm: computational/numerical/algorithmic tricks.
A library for Bregman divergence learning?
Possible extensions:
  Reintroduce the mean vector in the model: Gaussian-Wishart;
  Online k-means - online k-MLE ...
59. References I
[1] Nielsen, F.: k-MLE: A fast algorithm for learning statistical mixture models. In: International Conference on Acoustics, Speech and Signal Processing. (2012) pp. 869-872
[2] Schwander, O., Nielsen, F.: pyMEF - A framework for exponential families in Python. In: Proceedings of the 2011 IEEE Workshop on Statistical Signal Processing
[3] Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6 (2005) 1705-1749
[4] Nielsen, F., Garcia, V.: Statistical exponential families: A digest with flash cards. http://arxiv.org/abs/0911.4863 (11 2009)
[5] Hidot, S., Saint-Jean, C.: An Expectation-Maximization algorithm for the Wishart mixture model: Application to movement clustering. Pattern Recognition Letters 31(14) (2010) 2318-2324
60. References II
[6] Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1) (1979) 100-108
[7] Telgarsky, M., Vattani, A.: Hartigan's method: k-means clustering without Voronoi. In: Proc. of International Conference on Artificial Intelligence and Statistics (AISTATS). (2010) pp. 820-827
[8] Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms. (2007) pp. 1027-1035
[9] Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics. In: International Conference on Machine Learning (ICML). (2012)
[10] Nielsen, F.: Closed-form information-theoretic divergences for statistical mixtures. In: International Conference on Pattern Recognition (ICPR). (2012) pp. 1723-1726