k-MLE: 
A fast algorithm for learning statistical mixture models 
(arXiv:1203.5181) 
Frank NIELSEN 
Sony Computer Science Laboratories, Inc. 
28th March 2012 
International Conference on Acoustics, Speech, and Signal Processing 
ICASSP, Kyoto ICC 

Outline
▶ Background
  ▶ Statistical mixtures of exponential families (EFMMs)
  ▶ Legendre transform and mixture dual parameterizations
▶ Contributions
  ▶ k-MLE and its variants
  ▶ k-MLE initialization (k-MLE++)
▶ Summary

Exponential Family Mixture Models (EFMMs)
Generalize Gaussian & Rayleigh MMs to many common distributions.

m(x) = \sum_{i=1}^{k} w_i \, p_F(x; \theta_i), \quad \text{with } \forall i,\ w_i \ge 0, \ \sum_{i=1}^{k} w_i = 1

p_F(x; \theta) = e^{\langle t(x), \theta \rangle - F(\theta) + k(x)}

F: log-Laplace transform (partition, cumulant function):

\int_{x \in \mathcal{X}} p_F(x; \theta) \,\mathrm{d}x = 1 \;\Rightarrow\; F(\theta) = \log \int_{x \in \mathcal{X}} e^{\langle t(x), \theta \rangle + k(x)} \,\mathrm{d}x,

\theta \in \Theta = \left\{ \theta \;\middle|\; \int_{x \in \mathcal{X}} e^{\langle t(x), \theta \rangle + k(x)} \,\mathrm{d}x < +\infty \right\}, the natural parameter space.

▶ d: dimension of the support \mathcal{X}.
▶ D: order of the family (= \dim \Theta). Sufficient statistic: t(x) : \mathbb{R}^d \to \mathbb{R}^D.

Statistical mixtures: Rayleigh MMs [7, 5]
IntraVascular UltraSound (IVUS) imaging.

Rayleigh distribution (a Weibull distribution with shape k = 2):
p(x; \sigma) = \frac{x}{\sigma^2} e^{-\frac{x^2}{2\sigma^2}}, \quad x \in \mathbb{R}^+ = \mathcal{X}

d = 1 (univariate), D = 1 (order 1)
\theta = -\frac{1}{2\sigma^2}, \quad \Theta = (-\infty, 0)
F(\theta) = -\log(-2\theta)
t(x) = x^2
k(x) = \log x

Coronary plaques: fibrotic tissues, calcified tissues, lipidic tissues.
Rayleigh Mixture Models (RMMs): for segmentation and classification tasks.
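A quick check (not on the slide) that these entries do put the Rayleigh density in the canonical form p_F(x; \theta) = e^{\langle t(x), \theta\rangle - F(\theta) + k(x)}:

e^{\langle t(x), \theta \rangle - F(\theta) + k(x)} = e^{\theta x^2 + \log(-2\theta) + \log x} = -2\theta \, x \, e^{\theta x^2} = \frac{x}{\sigma^2} e^{-\frac{x^2}{2\sigma^2}} \quad \text{for } \theta = -\tfrac{1}{2\sigma^2}.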

Statistical mixtures: Gaussian MMs [3, 5]
Gaussian mixture models (GMMs).
Color image interpreted as a 5D xyRGB point set.

Gaussian distribution:
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}} \sqrt{|\Sigma|}} e^{-\frac{1}{2} D_{\Sigma^{-1}}(x - \mu, x - \mu)}

with the squared Mahalanobis distance D_Q(x, y) = (x - y)^\top Q (x - y).

x \in \mathbb{R}^d = \mathcal{X} (multivariate), \quad D = \frac{d(d+3)}{2} (order)
\theta = (\Sigma^{-1}\mu, \tfrac{1}{2}\Sigma^{-1}) = (\theta_v, \theta_M)
\Theta = \mathbb{R}^d \times S_{++}^{d}
F(\theta) = \tfrac{1}{4}\theta_v^\top \theta_M^{-1} \theta_v - \tfrac{1}{2}\log|\theta_M| + \tfrac{d}{2}\log\pi
t(x) = (x, -x x^\top)
k(x) = 0
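For the record (not spelled out on the slide), expanding the Gaussian exponent shows where these natural parameters and this cumulant come from, using \langle A, B \rangle = \mathrm{tr}(A^\top B) for the matrix part:

-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu) = \langle x, \Sigma^{-1}\mu \rangle + \langle -x x^\top, \tfrac{1}{2}\Sigma^{-1} \rangle - \tfrac{1}{2}\mu^\top\Sigma^{-1}\mu = \langle t(x), \theta \rangle - \tfrac{1}{2}\mu^\top\Sigma^{-1}\mu,

so the normalizer collects into

F(\theta) = \tfrac{1}{2}\mu^\top\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| + \tfrac{d}{2}\log(2\pi) = \tfrac{1}{4}\theta_v^\top \theta_M^{-1}\theta_v - \tfrac{1}{2}\log|\theta_M| + \tfrac{d}{2}\log\pi .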

Sampling from a Gaussian Mixture Model (GMM)
To sample a variate x from a GMM:
▶ Choose a component l according to the weight distribution w_1, ..., w_k,
▶ Draw a variate x according to N(\mu_l, \Sigma_l).

Doubly stochastic process:
1. throw a (biased) die with k faces to choose the component:
   l \sim \mathrm{Multinomial}(w_1, ..., w_k)
   (the multinomial distribution also belongs to the exponential families),
2. then draw a variate x at random from the l-th component:
   x \sim N(\mu_l, \Sigma_l), \quad x = \mu_l + C z with the Cholesky factorization \Sigma_l = C C^\top and z = [z_1 \ldots z_d]^\top a standard normal random vector: z_i = \sqrt{-2\log U_1}\cos(2\pi U_2) (Box-Muller transform, U_1, U_2 uniform on (0,1)).
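As a minimal illustration of this doubly stochastic process, here is a NumPy sketch (the function name and array layout are my own choices, not from the talk):

```python
import numpy as np

def sample_gmm(weights, means, covs, n, rng=None):
    """Draw n variates from a Gaussian mixture model.

    weights: (k,) mixture weights summing to 1
    means:   (k, d) component means
    covs:    (k, d, d) component covariance matrices
    """
    rng = np.random.default_rng() if rng is None else rng
    k, d = np.shape(means)
    # 1. throw the biased k-faced die n times to pick the components
    labels = rng.choice(k, size=n, p=weights)
    # 2. draw x = mu_l + C z with Sigma_l = C C^T (Cholesky) and z standard normal
    samples = np.empty((n, d))
    for l in range(k):
        idx = np.where(labels == l)[0]
        C = np.linalg.cholesky(covs[l])
        z = rng.standard_normal((idx.size, d))
        samples[idx] = means[l] + z @ C.T
    return samples, labels
```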

Statistical mixtures: Generative models of data sets
GMM = feature descriptor for information retrieval (IR)
→ classification, matching, etc.
Increase the dimension using color image patches.
Low-frequency information is encoded into a compact statistical model.
Generative model → statistical image by GMM sampling.
(Figure: source image, its GMM, and a sample drawn from the GMM.)

Distance between exponential families: Relative entropy
▶ Distance between features (e.g., GMMs).
▶ Kullback-Leibler divergence (cross-entropy minus entropy):

KL(P : Q) = \int p(x) \log\frac{p(x)}{q(x)} \,\mathrm{d}x \;\ge\; 0
          = \underbrace{\int p(x)\log\frac{1}{q(x)} \,\mathrm{d}x}_{H^{\times}(P:Q)} - \underbrace{\int p(x)\log\frac{1}{p(x)} \,\mathrm{d}x}_{H(P)=H^{\times}(P:P)}
          = F(\theta_Q) - F(\theta_P) - \langle \theta_Q - \theta_P, \nabla F(\theta_P)\rangle
          = B_F(\theta_Q : \theta_P)

Bregman divergence B_F, defined for a strictly convex and differentiable generator (up to some affine terms).
▶ The proof of KL(P : Q) = B_F(\theta_Q : \theta_P) follows from X \sim E_F(\theta) \Rightarrow E[t(X)] = \nabla F(\theta).
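Spelling out that one-line proof (in the slide's notation, plugging the canonical densities in and using E_P[t(X)] = \nabla F(\theta_P); the k(x) terms cancel):

KL(P : Q) = E_P\!\left[\log\frac{p(x)}{q(x)}\right] = E_P\!\left[\langle t(x), \theta_P - \theta_Q \rangle\right] - F(\theta_P) + F(\theta_Q) = F(\theta_Q) - F(\theta_P) - \langle \theta_Q - \theta_P, \nabla F(\theta_P) \rangle = B_F(\theta_Q : \theta_P).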

Bregman divergence: Geometric interpretation
Potential function F, graph plot \mathcal{F} : (x, F(x)).
D_F(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q)\rangle
i.e., the vertical gap at p between the graph of F and the hyperplane tangent to the graph at q.

Convex duality: Legendre transformation
▶ For a strictly convex and differentiable function F : X → R:
  F^*(y) = \sup_{x \in X} \underbrace{\{ \langle y, x \rangle - F(x) \}}_{l_F(y;\, x)}
▶ Maximum obtained for y = \nabla F(x):
  \nabla_x l_F(y; x) = y - \nabla F(x) = 0 \;\Rightarrow\; y = \nabla F(x)
▶ Maximum unique from the convexity of F (\nabla^2 F \succ 0):
  \nabla_x^2 l_F(y; x) = -\nabla^2 F(x) \prec 0
▶ Convex conjugates:
  (F, X) \Leftrightarrow (F^*, Y), \quad Y = \{\nabla F(x) \mid x \in X\}

Legendre duality ⇔ canonical divergence
▶ Convex conjugates have functionally inverse gradients:
  (\nabla F)^{-1} = \nabla F^*
  (\nabla F^* may require numerical approximation when no closed form is available)
▶ Involution: (F^*)^* = F.
▶ Convex conjugate F^* expressed using (\nabla F)^{-1}:
  F^*(y) = \langle (\nabla F)^{-1}(y), y \rangle - F((\nabla F)^{-1}(y))
▶ Fenchel-Young inequality at the heart of the canonical divergence:
  F(x) + F^*(y) \ge \langle x, y \rangle
  A_F(x : y) = A_{F^*}(y : x) = F(x) + F^*(y) - \langle x, y \rangle \ge 0
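A small worked instance (my own, using the Rayleigh cumulant F(\theta) = -\log(-2\theta) from the earlier slide):

\eta = \nabla F(\theta) = -\tfrac{1}{\theta}, \qquad \theta = (\nabla F)^{-1}(\eta) = -\tfrac{1}{\eta},
F^*(\eta) = \langle (\nabla F)^{-1}(\eta), \eta \rangle - F((\nabla F)^{-1}(\eta)) = -1 - \log\tfrac{\eta}{2} = -\log\eta + \log 2 - 1,

which is the -\log x generator of the Itakura-Saito divergence up to affine terms, consistent with the Rayleigh ↔ Itakura-Saito row of the examples table later on.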

Dual Bregman divergences ⇔ canonical divergence [6]

KL(P : Q) = E_P\!\left[\log\frac{p(x)}{q(x)}\right] \ge 0
          = B_F(\theta_Q : \theta_P) = B_{F^*}(\eta_P : \eta_Q)
          = F(\theta_Q) + F^*(\eta_P) - \langle \theta_Q, \eta_P \rangle
          = A_F(\theta_Q : \eta_P) = A_{F^*}(\eta_P : \theta_Q)

with \theta_Q the natural parameterization and \eta_P = E_P[t(X)] = \nabla F(\theta_P) the moment parameterization.

Exponential family mixtures: Dual parameterizations
A finite weighted point set \{(w_i, \theta_i)\}_{i=1}^{k} in a statistical manifold.
Many coordinate systems for computing (two canonical ones):
▶ usual \lambda-parameterization,
▶ natural \theta-parameterization and dual \eta-parameterization.

Original parameters: \lambda \in \Lambda.
Exponential family dual parameterizations, linked by the Legendre transform (\Theta, F) \leftrightarrow (H, F^*):
\theta \in \Theta \;\leftrightarrow\; \eta \in H, \quad \eta = \nabla F(\theta), \quad \theta = \nabla F^*(\eta).
Natural parameters \leftrightarrow expectation parameters.

Maximum Likelihood Estimator (MLE)
Given n independent and identically distributed observations X = \{x_1, ..., x_n\}, the maximum likelihood estimator

\hat\theta = \mathrm{argmax}_{\theta \in \Theta} \prod_{i=1}^{n} p_F(x_i; \theta) = \mathrm{argmax}_{\theta \in \Theta} \, e^{\sum_{i=1}^{n} \langle t(x_i), \theta \rangle - F(\theta) + k(x_i)}

is a unique maximum since \nabla^2 F \succ 0 (Hessian):

\nabla F(\hat\theta) = \frac{1}{n}\sum_{i=1}^{n} t(x_i)

The MLE is consistent and efficient, with asymptotic normal distribution

\hat\theta \sim N\!\left(\theta, \frac{1}{n} I^{-1}(\theta)\right), \quad \text{Fisher information matrix } I(\theta) = \mathrm{var}[t(X)] = \nabla^2 F(\theta).

The MLE may be biased (e.g., for normal distributions).
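As a concrete instance (mine, not on the slide), for the Rayleigh family of the earlier slide the MLE equation \nabla F(\hat\theta) = \frac{1}{n}\sum_i t(x_i) solves in closed form:

\nabla F(\hat\theta) = -\frac{1}{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} x_i^2 \;\Rightarrow\; \hat\theta = -\frac{n}{\sum_{i=1}^{n} x_i^2} \;\Rightarrow\; \hat\sigma^2 = -\frac{1}{2\hat\theta} = \frac{1}{2n}\sum_{i=1}^{n} x_i^2 .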
Duality Bregman ⇔ exponential families [2]

Bregman divergence B_{F^*}(t(x) : \eta), with Bregman generator F^*(\eta)
  ⇔ (Legendre duality, \eta = \nabla F(\theta))
exponential family p_F(x | \theta), with cumulant function F(\theta).

An exponential family density
p_F(x; \theta) = \exp(\langle t(x), \theta \rangle - F(\theta) + k(x))
has its log-density interpreted as a Bregman divergence:
\log p_F(x; \theta) = -B_{F^*}(t(x) : \eta) + F^*(t(x)) + k(x)

Exponential families , Bregman divergences: Examples 
F(x) pF (x|) , BF 
Generator Exponential Family , Dual Bregman divergence 
x2 Spherical Gaussian , Squared loss 
x log x Multinomial , Kullback-Leibler divergence 
x log x − x Poisson , I -divergence 
−log(−2x) Rayleigh , Itakura-Saito divergence 
−log x Geometric , Itakura-Saito divergence 
log |X| Wishart , log-det/Burg matrix div. [8] 

Maximum likelihood estimator revisited

\hat\theta = \mathrm{argmax}_\theta \prod_{i=1}^{n} p_F(x_i; \theta) = \mathrm{argmax}_\theta \sum_{i=1}^{n} \log p_F(x_i; \theta)
          = \mathrm{argmax}_\theta \sum_{i=1}^{n} \left( \langle t(x_i), \theta \rangle - F(\theta) + k(x_i) \right)
          = \mathrm{argmax}_\eta \sum_{i=1}^{n} -B_{F^*}(t(x_i) : \eta) + \underbrace{F^*(t(x_i)) + k(x_i)}_{\text{constant}}
          \equiv \mathrm{argmin}_\eta \sum_{i=1}^{n} B_{F^*}(t(x_i) : \eta)

Right-sided Bregman centroid = center of mass: \hat\eta = \frac{1}{n}\sum_{i=1}^{n} t(x_i).

Bregman batched Lloyd's k-means [2]
Extends Lloyd's k-means heuristic to Bregman divergences.
▶ Initialize k distinct seeds: C_1 = P_1, ..., C_k = P_k.
▶ Repeat until convergence:
  ▶ Assign each point P to its "closest" centroid (w.r.t. B_F(P : C)):
    \mathcal{C}_i = \{P \in \mathcal{P} \mid B_F(P : C_i) \le B_F(P : C_j) \ \forall j \ne i\}
  ▶ Update the cluster centroids by taking their centers of mass:
    C_i = \frac{1}{|\mathcal{C}_i|} \sum_{P \in \mathcal{C}_i} P.

The loss function
L_F(\mathcal{P} : \mathcal{C}) = \sum_{P \in \mathcal{P}} B_F(P : \mathcal{C}), \quad B_F(P : \mathcal{C}) = \min_{i \in \{1,...,k\}} B_F(P : C_i),
monotonically decreases and converges to a local optimum.
(Extend to weighted point sets using barycenters.)
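A minimal sketch of this Bregman Lloyd loop (the interface is mine; `breg` is any callable returning B_F(p : c) row-wise):

```python
import numpy as np

def bregman_lloyd(P, k, breg, n_iter=100, rng=None):
    """Batched Lloyd's k-means with a Bregman divergence B_F.

    P:    (n, D) data points
    breg: callable (P, c) -> (n,) array of B_F(P_i : c)
    """
    rng = np.random.default_rng() if rng is None else rng
    C = P[rng.choice(P.shape[0], size=k, replace=False)].astype(float)  # distinct seeds
    labels = np.zeros(P.shape[0], dtype=int)
    for _ in range(n_iter):
        # assignment step: closest centroid w.r.t. B_F(P : C_j)
        labels = np.stack([breg(P, C[j]) for j in range(k)], axis=1).argmin(axis=1)
        # centroid step: right-sided Bregman centroid = center of mass
        newC = np.stack([P[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                         for j in range(k)])
        if np.allclose(newC, C):
            break
        C = newC
    return C, labels
```

For instance, `breg = lambda P, c: np.sum((P - c)**2, axis=1)` recovers ordinary k-means, since the squared Euclidean distance is the Bregman divergence of F(x) = ||x||^2.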

k-MLE for EFMM ≡ Bregman hard clustering [4]
Bijection between exponential families (distributions) and Bregman divergences:
\log p_F(x; \theta) = -B_{F^*}(t(x) : \eta) + F^*(t(x)) + k(x), \quad \eta = \nabla F(\theta)

k-MLE for EFMMs (with cumulant F) ≡ additively weighted Bregman hard k-means for F^* in the space \{y_i = t(x_i)\}_i.

Complete log-likelihood \log \prod_{i=1}^{n} \prod_{j=1}^{k} \left( w_j \, p_F(x_i | \theta_j) \right)^{\delta_j(z_i)}:

= \max_{\theta, w} \sum_{i=1}^{n} \sum_{j=1}^{k} \delta_j(z_i) \left( \log p_F(x_i | \theta_j) + \log w_j \right)
\equiv \min_{\eta, w} \sum_{i=1}^{n} \sum_{j=1}^{k} \delta_j(z_i) \Big( \left( B_{F^*}(t(x_i) : \eta_j) - \log w_j \right) \underbrace{- \, k(x_i) - F^*(t(x_i))}_{\text{constant}} \Big)
\equiv \min_{\eta, w} \sum_{i=1}^{n} \min_{j=1}^{k} \left( B_{F^*}(t(x_i) : \eta_j) - \log w_j \right)

(the inner minimization is the argmin that gives the z_i's).

Complete average log-likelihood optimization
Minimize monotonically the complete average log-likelihood:

\frac{1}{n} \min_{\eta, w} \sum_{i=1}^{n} \min_{j=1}^{k} \left( B_{F^*}(t(x_i) : \eta_j) - \log w_j \right)

▶ 1. Keeping the weights constant → dual additively weighted Bregman k-means:
  \frac{1}{n} \min_{\eta} \sum_{i=1}^{n} \min_{j=1}^{k} \left( B_{F^*}(t(x_i) : \eta_j) - \log w_j \right)
▶ 2. Keeping the component moment parameters \eta fixed:
  \min_{w} \sum_{i=1}^{n} \sum_{j=1}^{k} -\delta_j(z_i) \log w_j = \min_{w} \sum_{j=1}^{k} -\pi_j \log w_j,
  where \pi_j = \frac{|C_j|}{n}. That is, minimize the cross-entropy: \min_w H^{\times}(\pi : w) \Rightarrow w = \pi.
▶ Go to 1 until (local) convergence is met.
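The weight update w = \pi follows from Gibbs' inequality (a one-line justification not spelled out on the slide):

H^{\times}(\pi : w) - H(\pi) = \sum_{j=1}^{k} \pi_j \log\frac{\pi_j}{w_j} = \mathrm{KL}(\pi : w) \ge 0,

with equality iff w = \pi, so the cross-entropy H^{\times}(\pi : w) is minimized exactly at w = \pi.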

k-MLE-EFMM algorithm [4]
▶ 0. Initialization: \forall i \in \{1, ..., k\}, let w_i = \frac{1}{k} and \eta_i = t(x_i)
  (initialization is further discussed later on).
▶ 1. Assignment: \forall i \in \{1, ..., n\}, \ z_i = \mathrm{argmin}_{j=1}^{k} \left( B_{F^*}(t(x_i) : \eta_j) - \log w_j \right).
  Let C_i = \{x_j \mid z_j = i\}, \forall i \in \{1, ..., k\} be the cluster partition: \mathcal{X} = \cup_{i=1}^{k} C_i.
▶ 2. Update the \eta-parameters: \forall i \in \{1, ..., k\}, \ \eta_i = \frac{1}{|C_i|}\sum_{x \in C_i} t(x).
  Go to step 1 unless local convergence of the complete likelihood is reached.
▶ 3. Update the mixture weights: \forall i \in \{1, ..., k\}, \ w_i = \frac{|C_i|}{n}.
  Go to step 1 unless local convergence of the complete likelihood is reached.
(A code sketch follows below.)
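A compact sketch of these steps in the expectation (\eta) parameters. The interface is my own: `t_x` holds the sufficient statistics t(x_i) and `breg_dual` evaluates B_{F^*}; for brevity the weight update of step 3 is interleaved with step 2 at every pass, i.e., the hard-EM flavor mentioned later rather than the exact inner/outer loop of the slide.

```python
import numpy as np

def k_mle_efmm(t_x, breg_dual, k, n_rounds=100, rng=None):
    """k-MLE sketch for an exponential family mixture, in eta-parameters.

    t_x:       (n, D) sufficient statistics t(x_i)
    breg_dual: callable (Y, eta_j) -> (n,) values of B_{F*}(y : eta_j)
    """
    rng = np.random.default_rng() if rng is None else rng
    t_x = np.asarray(t_x, dtype=float)
    n = t_x.shape[0]
    # 0. initialization: uniform weights, k distinct points t(x_i) as eta seeds
    w = np.full(k, 1.0 / k)
    eta = t_x[rng.choice(n, size=k, replace=False)].copy()
    z = np.zeros(n, dtype=int)
    for _ in range(n_rounds):
        # 1. assignment: z_i = argmin_j B_{F*}(t(x_i) : eta_j) - log w_j
        cost = np.stack([breg_dual(t_x, eta[j]) - np.log(w[j]) for j in range(k)], axis=1)
        z = cost.argmin(axis=1)
        # 2.-3. per-cluster MLE (center of mass of sufficient statistics) and cluster proportions
        for j in range(k):
            members = t_x[z == j]
            if len(members):
                eta[j] = members.mean(axis=0)
                w[j] = len(members) / n
    return w, eta, z
```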

k-MLE initialization
▶ Forgy's random seeding (for d = D),
▶ Bregman k-means (for F^* on Y), followed by the MLE on each cluster.
Usually D > d (e.g., multivariate Gaussians: D = \frac{d(d+3)}{2}):
▶ Compute the global MLE \hat\eta = \frac{1}{n}\sum_{i=1}^{n} t(x_i)
  (well-defined for n > D → \hat\eta \in H).
▶ Consider the restricted exponential family F_{\hat\eta^{(d+1...D)}}(\theta^{(1...d)}),
  then set \eta_i^{(1...d)} = t^{(1...d)}(x_i) and \eta_i^{(d+1...D)} = \hat\eta^{(d+1...D)}
  (e.g., for Gaussians: fix the global covariance matrix and let \mu_i = x_i).
▶ Improve the initialization by applying Bregman k-means++ [1] for the convex conjugate of F_{\hat\eta^{(d+1...D)}}(\theta^{(1...d)}):
  k-MLE++, based on Bregman k-means++.
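For intuition, here is a rough sketch of k-means++-style D^2 seeding with a Bregman divergence in place of the squared Euclidean distance (the k-MLE++ of the talk additionally works with the restricted family described above; interface and naming are my own, not the exact algorithm of [1]):

```python
import numpy as np

def bregman_kmeanspp_seed(Y, k, breg, rng=None):
    """Bregman k-means++-style seeding sketch: D^2 sampling with a Bregman divergence.

    Y:    (n, D) points (here, sufficient statistics t(x_i))
    breg: callable (Y, c) -> (n,) divergences B(y : c)
    """
    rng = np.random.default_rng() if rng is None else rng
    n = Y.shape[0]
    centers = [Y[rng.integers(n)]]                 # first seed uniformly at random
    d = breg(Y, centers[0])
    for _ in range(1, k):
        p = d / d.sum()                            # next seed proportional to its divergence
        centers.append(Y[rng.choice(n, p=p)])
        d = np.minimum(d, breg(Y, centers[-1]))    # divergence to nearest chosen seed
    return np.stack(centers)
```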

k-MLE variants using any Bregman k-means heuristic
▶ Any k-means optimization heuristic can be used to update the mixture \eta-parameters:
  ▶ Hartigan & Wang's greedy swap (after Lloyd convergence),
  ▶ Kanungo et al.'s swap ((9 + \epsilon)-approximation).
▶ Updating the mixture \eta and w parameters in alternation yields the hard EM variant
  (easily implemented with winner-take-all EM weight memberships).

k-MLE for MVNs with the (\mu, \Sigma) parameters
▶ 0. Initialization:
  ▶ Compute the global mean \bar\mu and global covariance matrix \bar\Sigma:
    \bar\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \bar\Sigma = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top - \bar\mu\bar\mu^\top
  ▶ \forall i \in \{1, ..., k\}, initialize the i-th seed as (\mu_i = x_i, \Sigma_i = \bar\Sigma).
▶ 1. Assignment: \forall i \in \{1, ..., n\},
    z_i = \mathrm{argmin}_{j=1}^{k} \left( M_{\Sigma_j^{-1}}(x_i - \mu_j, x_i - \mu_j) + \log|\Sigma_j| - 2\log w_j \right),
  with M_Q(x, y) = (x - y)^\top Q (x - y) the squared Mahalanobis distance.
  Let C_i = \{x_j \mid z_j = i\}, \forall i \in \{1, ..., k\} be the cluster partition: \mathcal{X} = \cup_{i=1}^{k} C_i.
▶ 2. Update the parameters:
    \forall i \in \{1, ..., k\}, \ \mu_i = \frac{1}{|C_i|}\sum_{x \in C_i} x, \quad \Sigma_i = \frac{1}{|C_i|}\sum_{x \in C_i} x x^\top - \mu_i \mu_i^\top.
  Go to step 1 unless local convergence.
▶ 3. Update the mixture weights: \forall i \in \{1, ..., k\}, \ w_i = \frac{|C_i|}{n}.
  Go to step 1 unless local convergence.
(A NumPy sketch follows below.)
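A minimal NumPy sketch of this MVN instantiation (my own code; steps 2 and 3 are interleaved at every pass for brevity, which again corresponds to the hard-EM flavor):

```python
import numpy as np

def k_mle_mvn(X, k, n_rounds=50, rng=None):
    """k-MLE sketch for a Gaussian mixture in (mu, Sigma) parameters."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    mu_bar = X.mean(axis=0)
    Sigma_bar = X.T @ X / n - np.outer(mu_bar, mu_bar)          # global covariance
    # 0. initialization: k data points as means, global covariance, uniform weights
    mu = X[rng.choice(n, size=k, replace=False)].copy()
    Sigma = np.stack([Sigma_bar.copy() for _ in range(k)])
    w = np.full(k, 1.0 / k)
    z = np.zeros(n, dtype=int)
    for _ in range(n_rounds):
        # 1. assignment: argmin_j Mahalanobis^2 + log|Sigma_j| - 2 log w_j
        cost = np.empty((n, k))
        for j in range(k):
            diff = X - mu[j]
            cost[:, j] = (np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma[j]), diff)
                          + np.linalg.slogdet(Sigma[j])[1] - 2.0 * np.log(w[j]))
        z = cost.argmin(axis=1)
        # 2.-3. per-cluster MLE of (mu, Sigma) and cluster proportions
        for j in range(k):
            Cj = X[z == j]
            if len(Cj):
                mu[j] = Cj.mean(axis=0)
                Sigma[j] = Cj.T @ Cj / len(Cj) - np.outer(mu[j], mu[j])
                w[j] = len(Cj) / n
    return w, mu, Sigma, z
```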

Summary of contributions
▶ Hard k-MLE versus soft EM:
  ▶ k-MLE maximizes locally the complete likelihood,
  ▶ EM maximizes the incomplete likelihood.
▶ The component parameter \eta update can be implemented using any Bregman k-means heuristic on the conjugate F^*.
▶ Initialization can be performed using k-MLE++.
▶ Indivisibility: robustness when identifying statistical mixture models? Which k?
  \forall k \in \mathbb{N}, \quad N(\mu, \sigma^2) = \sum_{i=1}^{k} N\!\left(\frac{\mu}{k}, \frac{\sigma^2}{k}\right)
  (a normal variate is the sum of k independent N(\mu/k, \sigma^2/k) variates).
  Simplifying mixtures obtained from kernel density estimators is one fine-to-coarse solution. See:
  Model centroids for the simplification of kernel density estimators, ICASSP 2012, March 29th.

References

[1] Marcel R. Ackermann and Johannes Blömer. Bregman clustering for separable instances. In Scandinavian Workshop on Algorithm Theory (SWAT), pages 212–223, 2010.
[2] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
[3] Vincent Garcia and Frank Nielsen. Simplification and hierarchical representations of mixtures of exponential families. Signal Processing (Elsevier), 90(12):3197–3212, 2010.
[4] Frank Nielsen. k-MLE: A fast algorithm for learning statistical mixture models. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2012. Preliminary technical report on arXiv.
[5] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards, 2009. arXiv.org:0911.4863.
[6] Frank Nielsen and Richard Nock. Entropies and cross-entropies of exponential families. In International Conference on Image Processing (ICIP), pages 3621–3624, 2010.
[7] Jose Seabra, Francesco Ciompi, Oriol Pujol, Josepa Mauri, Petia Radeva, and Joao Sanchez. Rayleigh mixture model for plaque characterization in intravascular ultrasound. IEEE Transactions on Biomedical Engineering, 58(5):1314–1324, 2011.
[8] Shijun Wang and Rong Jin. An information geometry approach for distance metric learning. Journal of Machine Learning Research, 5:591–598, 2009.

Anisotropic Voronoi diagram (for MVN MMs)
From the source color image (a), we build a 5D GMM with k = 32 components, and color each pixel with the mean color of the anisotropic Voronoi cell it belongs to (b).
The assignment step can be sped up using Bregman ball trees or Bregman vantage point trees.

Expectation-Maximization (EM) for EFMMs [2]
EM increases monotonically the expected complete likelihood (marginalizing over the hidden labels):

\sum_{i=1}^{n} \sum_{j=1}^{k} p(z_j | x_i, \theta) \log p(x_i, z_j | \theta)

Banerjee et al. [2] proved that it amounts to a Bregman soft clustering.

Comparisons: k-MLE vs. EM for EFMMs

         k-MLE / Hard EM                 Soft EM (1977)
         = Bregman hard clustering       = Bregman soft clustering
Memory   lighter                         heavier (W matrix)
Speed    lighter (VP-tree)               heavier (all weights w_ij)
Conv.    always, in finitely many steps  asymptotic (∞), needs a stopping criterion
Init.    k-MLE++                         k-means(++)