On learning statistical mixtures maximizing the 
complete likelihood 
The k-MLE methodology using geometric hard clustering 
Frank NIELSEN 
École Polytechnique 
Sony Computer Science Laboratories 
MaxEnt 2014 
September 21-26 2014 
Amboise, France 

Finite mixtures: Semi-parametric statistical models 
◮ Mixture M ∼ MM(W, Λ) with density m(x) = Σ_{i=1}^{k} w_i p(x|λ_i) 
(a mixture of densities, not a sum of random variables!). Λ = {λ_i}_i , W = {w_i}_i 
◮ Multimodal, universally modeling smooth densities 
◮ Gaussian MMs with support X = R, Gamma MMs with 
support X = R+ (modeling distances [34]) 
◮ Pioneered by Karl Pearson [29] (1894); precursors: Francis 
Galton [13] (1869), Adolphe Quetelet [31] (1846), etc. 
◮ Capture sub-populations within an overall population 
(k = 2, crab data [29] in Pearson) 

Example of k = 2-component mixture [17] 
Sub-populations (k = 2) within an overall population... 
Sub-species in species, etc. 
Truncated distributions (what is the support? black swans?!) 

Sampling from mixtures: Doubly stochastic process 
To sample a variate x from a MM: 
◮ Choose a component l according to the weight distribution 
w_1, ..., w_k (multinomial), 
◮ Draw a variate x according to p(x|λl ). 
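
In code, the two stages are just a categorical draw followed by a component draw. A minimal NumPy sketch for a 1D Gaussian mixture (the weights and component parameters below are purely illustrative):

```python
import numpy as np

def sample_mixture(weights, means, stds, n, rng=np.random.default_rng(0)):
    """Sample n variates from a 1D Gaussian mixture via the doubly stochastic process."""
    weights, means, stds = map(np.asarray, (weights, means, stds))
    # Stage 1: choose a component label for each draw (multinomial on w_1, ..., w_k).
    labels = rng.choice(len(weights), size=n, p=weights)
    # Stage 2: draw each variate from its selected component p(x | lambda_l).
    x = rng.normal(loc=means[labels], scale=stds[labels])
    return x, labels

x, z = sample_mixture(weights=[0.3, 0.7], means=[-2.0, 1.5], stds=[0.5, 1.0], n=1000)
```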

Statistical mixtures: Generative data models 
Image = 5D xyRGB point set 
GMM = feature descriptor for information retrieval (IR) 
Increase the dimension d using s × s color image patches: d = 2 + 3s² 
[Figure: source image, fitted GMM, and a sample drawn from the GMM (statistical image)] 
Low-frequency information encoded into compact statistical model. 

Mixtures: ǫ-statistically learnable and ǫ-estimates 
Problem statement: Given n IID d-dimensional observations 
x_1, ..., x_n ∼ MM(Λ, W), estimate MM(Λ̂, Ŵ): 
◮ Theoretical Computer Science (TCS) approach: ǫ-close 
parameter recovery (π: a permutation) 
◮ |w_i − ŵ_{π(i)}| ≤ ǫ 
◮ KL(p(x|λ_i) : p(x|λ̂_{π(i)})) ≤ ǫ (or other divergences like TV, 
etc.) 
Consider ǫ-learnable MMs: 
◮ min_i w_i ≥ ǫ 
◮ KL(p(x|λ_i) : p(x|λ_j)) ≥ ǫ, ∀ i ≠ j (or other divergence) 
◮ Statistical approach: 
Define the best model/MM as the one maximizing the 
likelihood function l(Λ, W) = ∏_i m(x_i|Λ, W). 

Mixture inference: Incomplete versus complete likelihood 
◮ Sub-populations within an overall population: observed data 
xi does not include the subpopulation label li 
◮ k = 2: Classification and Bayes error (upper bounded by 
Chernoff information [24]) 
◮ Inference: Assume IID, maximize (log)-likelihood: 
◮ Complete likelihood, using indicator variables z_{i,j} (for label l_i: z_{i,l_i} = 1): 

l_c = log ∏_{i=1}^{n} ∏_{j=1}^{k} (w_j p(x_i|θ_j))^{z_{i,j}} = Σ_i Σ_j z_{i,j} log(w_j p(x_i|θ_j)) 

◮ Incomplete likelihood (hidden/latent variables) and log-sum 
intractability: 

l_i = log ∏_i m(x_i|W, Λ) = Σ_i log ( Σ_j w_j p(x_i|θ_j) ) 
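
A small numerical sketch of the two criteria for a 1D GMM (illustrative helper functions, not from the talk); the incomplete log-likelihood uses log-sum-exp to handle the log-sum stably:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def incomplete_loglik(x, w, mu, sigma):
    """l_i = sum_i log sum_j w_j p(x_i|theta_j); x (n,), w/mu/sigma (k,) NumPy arrays."""
    logwp = np.log(w) + norm.logpdf(x[:, None], loc=mu, scale=sigma)  # shape (n, k)
    return logsumexp(logwp, axis=1).sum()

def complete_loglik(x, z, w, mu, sigma):
    """l_c = sum_i sum_j z_ij log(w_j p(x_i|theta_j)), with hard labels z_i in {0,...,k-1}."""
    logwp = np.log(w) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    return logwp[np.arange(len(x)), z].sum()
```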

Mixture learnability and inference algorithms 
◮ Which criterion to maximize? incomplete or complete 
likelihood? What kind of evaluation criteria? 
◮ From Expectation-Maximization [8] (1977) to TCS methods: 
Polynomial learnability of mixtures [22, 15] (2014), mixtures 
and core-sets [10] for massive data sets, etc. 
Some technicalities: 
◮ Many local maxima of likelihood functions li and lc (EM 
converges locally and needs a stopping criterion) 
◮ Multimodal density (#modes > k is possible [9], ghost modes even for 
isotropic GMMs) 
◮ Identifiability (permutation of labels, parameter distinctness) 
◮ Irregularity: Fisher information may be zero [6], convergence 
speed of EM 
◮ etc. 

Learning MMs: A geometric hard clustering viewpoint 
max_{W,Λ} l_c(W, Λ) = max_{W,Λ} Σ_{i=1}^{n} max_{j=1}^{k} log(w_j p(x_i|θ_j)) 
≡ min_{W,Λ} Σ_i min_j (−log p(x_i|θ_j) − log w_j) 
= min_{W,Λ} Σ_{i=1}^{n} min_{j=1}^{k} D_j(x_i), 
where c_j = (w_j, θ_j) (cluster prototype) and 
D_j(x_i) = −log p(x_i|θ_j) − log w_j are potential distance-like 
functions. 
◮ Maximizing the complete likelihood amounts to a geometric 
hard clustering [37, 11] for fixed w_j's (the distance D_j(·) depends 
on the cluster prototypes c_j): min_Λ Σ_i min_j D_j(x_i). 
◮ Related to classification EM [5] (CEM), hard/truncated EM 
◮ Use the solution of argmax l_c to initialize the maximization of l_i (optimized by EM) 

The k-MLE method: k-means type clustering algorithms 
k-MLE: 
1. Initialize the weights W (in the open probability simplex ∆_k) 
2. Solve min_Λ Σ_i min_j D_j(x_i) (center-based clustering, W fixed) 
3. Solve min_W Σ_i min_j D_j(x_i) (Λ fixed) 
4. Test for convergence and go to step 2) otherwise. 
⇒ group coordinate ascent (ML)/descent (distance) optimization. 
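
A compact sketch of this alternation in the simplest setting, 1D Gaussian components with a common fixed σ, so that the per-cluster MLE of step 2 reduces to the cluster mean (illustrative code, not the paper's reference implementation):

```python
import numpy as np

def k_mle_fixed_sigma(x, k, sigma=1.0, iters=50, rng=np.random.default_rng(1)):
    """k-MLE sketch: hard assignments, per-cluster MLE, then weight update."""
    mu = rng.choice(x, size=k, replace=False)       # initialize component parameters
    w = np.full(k, 1.0 / k)                         # weights in the open simplex
    for _ in range(iters):
        # Step 2: assign x_i to argmin_j D_j(x_i) = -log p(x_i|mu_j) - log w_j (up to a constant)
        D = 0.5 * ((x[:, None] - mu) / sigma) ** 2 - np.log(w)
        labels = D.argmin(axis=1)
        for j in range(k):                          # per-cluster MLE (here: the cluster mean)
            if np.any(labels == j):
                mu[j] = x[labels == j].mean()
        # Step 3: weights = cluster point proportions, kept inside the open simplex
        w = np.clip(np.bincount(labels, minlength=k) / len(x), 1e-12, None)
        w /= w.sum()
    return w, mu, labels
```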

k-MLE: Center-based clustering, W fixed 
Solve min_Λ Σ_i min_j D_j(x_i) 
k-means type convergence proof for assignment/relocation: 
◮ Data assignment: 
∀i, l_i = argmax_j w_j p(x_i|λ_j) = argmin_j D_j(x_i), C_j = {x_i | l_i = j} 
◮ Center relocation: ∀j, λ_j = MLE(C_j) 
Farthest Maximum Likelihood (FML) Voronoi diagram: 
Vor_FML(c_i) = {x ∈ X : w_i p(x|λ_i) ≥ w_j p(x|λ_j), ∀ j ≠ i} 
Vor(c_i) = {x ∈ X : D_i(x) ≤ D_j(x), ∀ j ≠ i} 
FML Voronoi ≡ additively weighted Voronoi with: 
D_l(x) = −log p(x|λ_l) − log w_l 

k-MLE: Example for mixtures of exponential families 
Exponential family: 
Component density p(x|θ) = exp(t(x)⊤θ − F(θ) + k(x)) is 
log-concave with: 
◮ t(x): sufficient statistic in R^D, D: family order. 
◮ k(x): auxiliary carrier term (wrt the Lebesgue/counting measure) 
◮ F(θ): log-normalizer (cumulant function, log-partition). 
D_j(x) is convex: k-means-type clustering wrt convex “distances”. 
Farthest ML Voronoi ≡ additively-weighted Bregman Voronoi [4]: 
−log p(x; θ) − log w = F(θ) − t(x)^⊤θ − k(x) − log w 
= B_{F*}(t(x) : η) − F*(t(x)) − k(x) − log w 
F*(η) = max_θ (θ^⊤η − F(θ)): the Legendre–Fenchel convex conjugate 
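For instance, for the Poisson family (a worked example not on the slide; Poisson reappears later among the order-1 families): t(x) = x, k(x) = −log x!, F(θ) = e^θ, so 
η = ∇F(θ) = e^θ = λ,   F*(η) = max_θ (θη − e^θ) = η log η − η, 
B_{F*}(t(x) : η) = t(x) log(t(x)/η) − t(x) + η   (a generalized Kullback–Leibler divergence), 
and, up to terms independent of j, D_j(x) = B_{F*}(t(x) : η_j) − log w_j. 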

Exponential families: Rayleigh distributions [36, 25] 
Application: IntraVascular UltraSound (IVUS) imaging: 
Rayleigh distribution: 
p(x; σ) = (x/σ²) exp(−x²/(2σ²)) 
x ∈ R+ = X 
d = 1 (univariate) 
D = 1 (order 1) 
θ = −1/(2σ²) 
Θ = (−∞, 0) 
F(θ) = −log(−2θ) 
t(x) = x² 
k(x) = log x 
(a Weibull with shape k = 2) 
Coronary plaques: fibrotic tissues, calcified tissues, lipidic tissues 
Rayleigh Mixture Models (RMMs): 
for segmentation and classification tasks 
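
A minimal sketch of the per-cluster Rayleigh MLE obtained from the average sufficient statistic (the classical closed form σ̂² = (1/(2n)) Σ_i x_i²):

```python
import numpy as np

def rayleigh_mle(x):
    """MLE of one Rayleigh component from t(x) = x^2 (eta = E[x^2] = 2 sigma^2)."""
    eta_hat = np.mean(np.asarray(x) ** 2)     # average sufficient statistic
    sigma2_hat = eta_hat / 2.0                # sigma^2 estimate
    theta_hat = -1.0 / (2.0 * sigma2_hat)     # natural parameter theta = -1/(2 sigma^2)
    return sigma2_hat, theta_hat
```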

Exponential families: Multivariate Gaussians [14, 25] 
Gaussian Mixture Models (GMMs). 
(Color image interpreted as a 5D xyRGB point set) 
Gaussian distribution p(x; μ, Σ): 
p(x; μ, Σ) = 1/((2π)^{d/2} |Σ|^{1/2}) exp(−(1/2) D_{Σ^{−1}}(x − μ, x − μ)) 
Squared Mahalanobis distance: 
D_Q(x, y) = (x − y)^⊤ Q (x − y) 
x ∈ R^d = X 
d (multivariate) 
D = d(d+3)/2 (order) 
θ = (Σ^{−1}μ, (1/2)Σ^{−1}) = (θ_v, θ_M) 
Θ = R^d × S^d_{++} 
F(θ) = (1/4) θ_v^⊤ θ_M^{−1} θ_v − (1/2) log|θ_M| + (d/2) log π 
t(x) = (x, −x x^⊤) 
k(x) = 0 
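
Correspondingly, the per-cluster Gaussian MLE follows from the average sufficient statistic t(x) = (x, −x x^⊤); a short sketch:

```python
import numpy as np

def gaussian_mle(X):
    """MLE (mu, Sigma) of a multivariate Gaussian from an n x d data matrix X."""
    mu_hat = X.mean(axis=0)                                # E[x]
    second_moment = (X.T @ X) / len(X)                     # E[x x^T]
    sigma_hat = second_moment - np.outer(mu_hat, mu_hat)   # biased (MLE) covariance
    return mu_hat, sigma_hat
```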

The k-MLE method for exponential families 
k-MLE_EF: 
1. Initialize the weights W (in the open probability simplex ∆_k) 
2. Solve min_Λ Σ_i min_j (B_{F*}(t(x_i) : η_j) − log w_j) 
3. Solve min_W Σ_i min_j D_j(x_i) 
4. Test for convergence and go to step 2) otherwise. 
Assignment condition in Step 2: additively-weighted Bregman 
Voronoi diagram. 

k-MLE: Solving for weights given component parameters 
Solve min_W Σ_i min_j D_j(x_i) 
This amounts to argmin_W Σ_j −n_j log w_j = argmin_W Σ_j −(n_j/n) log w_j, where 
n_j = #{x_i ∈ Vor(c_j)} = |C_j|: 
min_{W∈∆_k} H×(N : W) 
where N = (n_1/n, ..., n_k/n) is the cluster point proportion vector ∈ ∆_k. 
The cross-entropy H× is minimized when H×(N : W) = H(N), that is, W = N. 
Kullback–Leibler divergence: 
KL(N : W) = H×(N : W) − H(N) = 0 when W = N. 
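Explicitly (a one-line check, not on the slide): maximizing Σ_j n_j log w_j over the simplex with a Lagrange multiplier μ for the constraint Σ_j w_j = 1 gives n_j/w_j = μ, hence w_j = n_j/μ = n_j/n, i.e. W = N, matching the cross-entropy argument above. 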

MLE for exponential families 
Given a farthest ML Voronoi partition, compute the MLEs θ̂_j: 
θ̂_j = argmax_{θ∈Θ} ∏_{x_i∈Vor(c_j)} p_F(x_i; θ) 
is the unique (***) maximum since ∇²F(θ) ≻ 0: 
Moment equation: ∇F(θ̂_j) = η(θ̂_j) = (1/n_j) Σ_{x_i∈Vor(c_j)} t(x_i) = t̄ = η̂ 
The MLE is consistent and efficient, with asymptotically normal distribution: 
θ̂_j ∼ N(θ_j, (1/n_j) I^{−1}(θ_j)) 
Fisher information matrix: 
I(θ_j) = var[t(X)] = ∇²F(θ_j) = (∇²F*)^{−1}(η_j) 
The MLE may be biased (e.g., for normal distributions). 

Existence of MLEs for exponential families (***) 
For minimal and full EFs, the MLE is guaranteed to exist [3, 21] provided 
that the n × (D + 1) matrix 
T = [ 1 t_1(x_1) ... t_D(x_1) ; ... ; 1 t_1(x_n) ... t_D(x_n) ]    (1) 
has rank D + 1 [3]. 
For example, MLEs of MVNs are problematic with n < d observations 
(undefined, the likelihood is unbounded). 
Condition: t̄ = (1/n_j) Σ_{x_i∈Vor(c_j)} t(x_i) ∈ int(C), where C is the closed 
convex support. 
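
A tiny sketch of this rank test on the observed sufficient statistics (rows t(x_i)^⊤ stacked into an n × D matrix; the helper name is illustrative):

```python
import numpy as np

def mle_exists(T_stats):
    """Check rank([1 | t(x_1), ..., t(x_n)]) == D + 1 for the n x D statistic matrix."""
    n, D = T_stats.shape
    T = np.hstack([np.ones((n, 1)), T_stats])
    return np.linalg.matrix_rank(T) == D + 1
```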

MLE of EFs: Observed point in IG/Bregman 1-mean 
θ̂ = argmax_θ ∏_{i=1}^{n} p_F(x_i; θ) = argmax_θ Σ_{i=1}^{n} log p_F(x_i; θ) 
= argmax_η Σ_{i=1}^{n} −B_{F*}(t(x_i) : η) + F*(t(x_i)) + k(x_i)   (the last two terms are constant in η) 
≡ argmin_η Σ_{i=1}^{n} B_{F*}(t(x_i) : η) 
Right-sided Bregman centroid = center of mass: η̂ = (1/n) Σ_{i=1}^{n} t(x_i). 
Average log-likelihood at the MLE: 
l̄ = (1/n) Σ_{i=1}^{n} (−B_{F*}(t(x_i) : η̂) + F*(t(x_i)) + k(x_i)) 
= ⟨η̂, θ̂⟩ − F(θ̂) + k̄ = F*(η̂) + k̄ 

The k-MLE method: Heuristics based on k-means 
k-means is NP-hard (non-convex optimization) when d > 1 and 
k > 1, and is solved exactly by dynamic programming [26] in 
O(n²k) time and O(n) memory when d = 1. 
Heuristics: 
◮ Kanungo et al. [18] swap: yields a (9 + ǫ)-approximation 
◮ Global seeds: random seed (Forgy [12]), k-means++ [2], 
global k-means initialization [38], 
◮ Local refinements: Lloyd batched update [19], MacQueen 
iterative update [20], Hartigan single-point swap [16], etc. 
◮ etc. 
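
A minimal sketch of the k-means++ seeding mentioned above, in its plain Euclidean form (replacing the squared Euclidean distance by the D_j potentials gives a k-MLE++-style initialization):

```python
import numpy as np

def kmeanspp_seeds(X, k, rng=np.random.default_rng(2)):
    """k-means++: pick each new seed with probability proportional to its squared
    distance to the closest seed chosen so far (D^2 sampling); X is an n x d array."""
    seeds = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)
```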

Generalized k-MLE 
Weibull or generalized Gaussians are parametric families of 
exponential families [35]: F(γ). 
Fixing some parameters yields nested families of (sub)-exponential 
families [34]: obtain one free parameter with convex conjugate F∗ 
approximated by line search (Gamma distributions/generalized 
Gaussians). 

Generalized k-MLE 
k-GMLE: 
1. Initialize the weights W ∈ ∆_k and a family type (F_1, ..., F_k) for each 
cluster 
2. Solve min_Λ Σ_i min_j D_j(x_i) (center-based clustering for W 
fixed) with potential functions: 
D_j(x_i) = −log p_{F_j}(x_i|θ_j) − log w_j 
3. Solve for the family types maximizing the MLE in each cluster C_j by 
choosing the parametric family of distributions F_j = F(γ_j) 
that yields the best likelihood: 
min_{F_1=F(γ_1),...,F_k=F(γ_k)∈F(γ)} Σ_i min_j D_{w_j,θ_j,F_j}(x_i). 
4. Update W as the cluster point proportions 
5. Test for convergence and go to step 2) otherwise. 
D_{w_j,θ_j,F_j}(x) = −log p_{F_j}(x; θ_j) − log w_j 

Generalized k-MLE: Convergence 
◮ Lloyd’s batched generalized k-MLE maximizes monotonically 
the complete likelihood 
◮ Hartigan single-point relocation generalized k-MLE maximizes 
monotonically the complete likelihood [32], improves over 
Lloyd local maxima, and avoids the problem of the existence 
of MLE inside clusters by ensuring nj ≥ D in general position 
(T rank D + 1). 
◮ Model selection: Learn k automatically using DP 
k-means [32] (Dirichlet Process) 

k-MLE [23] versus EM for Exponential Families [1] 
                 k-MLE/Hard EM [23] (2012–)               Soft EM [1] (1977) 
                 = Bregman hard clustering                = Bregman soft clustering 
Memory           lighter, O(n)                            heavier, O(nk) 
Assignment       NNs with VP-trees [27], BB-trees [30]    all k NNs 
Convergence      always, finitely                         ∞, needs a stopping criterion 
Many (probabilistically) guaranteed initializations for k-MLE [18, 2, 28] 

k-MLE: Solving for D = 1 exponential families 
◮ Rayleigh, Poisson or (nested) univariate normal with constant 
σ are order 1 EFs (D = 1). 
◮ Clustering problem: Dual 1D Bregman clustering [1] on 1D 
scalars yi = t(xi ). 
◮ FML Voronoi diagrams have connected cells: Optimal 
clustering yields interval clustering. 
◮ 1D k-means (with additive weights) can be solved exactly 
using dynamic programming in O(n²k) time [26]. Then 
update the weights W (cluster point proportion) and 
reiterate... 

Dynamic programming for D = 1-order mixtures [26] 
Consider W fixed. k-MLE cost: Σ_{j=1}^{k} l(C_j), where the C_j are the clusters. 
[Diagram: the first j − 1 points X_{1,j−1} are split optimally into k − 1 clusters, 
and the last cluster is X_{j,n}, with θ̂_k = θ̂_{j,n}.] 
Dynamic programming optimality equation: 
MLE_k(x_1, ..., x_n) = max_{j=2}^{n} (MLE_{k−1}(X_{1,j−1}) + MLE_1(X_{j,n})) 
X_{l,r} := {x_l, x_{l+1}, ..., x_{r−1}, x_r}. 
◮ Build the dynamic programming table from l = 1 to l = k 
columns, m = 1 to m = n rows. 
◮ Retrieve C_j from the DP table by backtracking on the argmax_j. 
◮ For D = 1 EFs: O(n²k) time [26]. 
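
A compact sketch of this dynamic program with a pluggable per-interval score standing in for MLE_1 (here the 1D Gaussian log-likelihood with σ = 1, purely illustrative); this naive version recomputes interval scores instead of using prefix sums:

```python
import numpy as np

def dp_interval_clustering(x, k, score):
    """Optimal clustering of sorted 1D data into k intervals maximizing the sum of interval scores."""
    x = np.sort(np.asarray(x))
    n = len(x)
    best = np.full((k + 1, n + 1), -np.inf)   # best[l, m]: best score of x[:m] split into l intervals
    cut = np.zeros((k + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for l in range(1, k + 1):
        for m in range(l, n + 1):
            # last interval is x[j:m]: MLE_l(x[:m]) = max_j MLE_{l-1}(x[:j]) + MLE_1(x[j:m])
            cands = [best[l - 1, j] + score(x[j:m]) for j in range(l - 1, m)]
            j_star = int(np.argmax(cands))
            best[l, m] = cands[j_star]
            cut[l, m] = (l - 1) + j_star      # backtrack on cut[] to recover the clusters
    return best[k, n], cut

# Illustrative interval score: Gaussian log-likelihood with sigma = 1
gaussian_score = lambda c: -0.5 * np.sum((c - c.mean()) ** 2) - 0.5 * len(c) * np.log(2 * np.pi)
```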

Experiments with: 1D Gaussian Mixture Models (GMMs) 
gmm1 score = −3.075 (Euclidean k-means, σ fixed) 
gmm2 score = −3.038 (Bregman k-means, σ fitted, better) 

Summary: k-MLE methodology for learning mixtures 
Learn MMs from sequences of geometric hard clustering [11]. 
◮ Hard k-MLE (≡ dual Bregman hard clustering for EFs) versus 
soft EM (≡ soft Bregman clustering [1] for EFs): 
◮ k-MLE maximizes the complete likelihood lc . 
◮ EM maximizes locally the incomplete likelihood li . 
◮ The component-parameter (η) geometric clustering (Step 2) 
can be implemented using any Bregman k-means heuristic on 
the conjugate F* 
◮ Consider generalized k-MLE when F∗ not available in closed 
form: nested exponential families (eg., Gamma) 
◮ Initialization can be performed using k-means initialization: 
k-MLE++, etc. 
◮ Exact solution with dynamic programming for order 1 EFs 
(with prescribed weight proportion W). 
◮ Avoid unbounded likelihood (eg., ∞ for location-scale 
member with σ → 0: Dirac) using Hartigan’s heuristic [32] 

Discussion: Learning statistical models FAST! 
◮ (EF) mixture models are universal approximators of smooth 
densities 
◮ A single (multimodal) EF can approximate any smooth 
density too [7] but F not in closed-form 
◮ Which criterion is best/most realistic to maximize: the incomplete or 
the complete likelihood, or parameter distortions? Leverage the many recent 
results on k-means clustering for learning mixture models. 
◮ Alternative approach: Simplifying mixtures from kernel density 
estimators (KDEs) is one fine-to-coarse solution [33] 
◮ Open problem: How to constrain the MMs to have a 
prescribed number of modes/antimodes? 

Thank you. 
Experiments and performance evaluations on generalized k-MLE: 
◮ k-GMLE for generalized Gaussians [35] 
◮ k-GMLE for Gamma distributions [34] 
◮ k-GMLE for singly-parametric distributions [26] 
(compared with Expectation-Maximization [8]) 
Frank Nielsen (5793b870). 

Bibliography I 
Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. 
Clustering with Bregman divergences. 
Journal of Machine Learning Research, 6:1705–1749, 2005. 
Anup Bhattacharya, Ragesh Jaiswal, and Nir Ailon. 
A tight lower bound instance for k-means++ in constant dimension. 
In T.V. Gopal, Manindra Agrawal, Angsheng Li, and S.Barry Cooper, editors, Theory and Applications of 
Models of Computation, volume 8402 of Lecture Notes in Computer Science, pages 7–22. Springer 
International Publishing, 2014. 
Krzysztof Bogdan and Małgorzata Bogdan. 
On existence of maximum likelihood estimators in exponential families. 
Statistics, 34(2):137–149, 2000. 
Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. 
Bregman Voronoi diagrams. 
Discrete Comput. Geom., 44(2):281–307, September 2010. 
Gilles Celeux and Gérard Govaert. 
A classification EM algorithm for clustering and two stochastic versions. 
Comput. Stat. Data Anal., 14(3):315–332, October 1992. 
Jiahua Chen. 
Optimal rate of convergence for finite mixture models. 
The Annals of Statistics, pages 221–233, 1995. 
Loren Cobb, Peter Koppstein, and Neng Hsin Chen. 
Estimation and moment recursion relations for multimodal distributions of the exponential family. 
Journal of the American Statistical Association, 78(381):124–130, 1983. 

Bibliography II 
Arthur Pentland Dempster, Nan M. Laird, and Donald B. Rubin. 
Maximum likelihood from incomplete data via the EM algorithm. 
Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977. 
Herbert Edelsbrunner, Brittany Terese Fasy, and Günter Rote. 
Add isotropic Gaussian kernels at own risk: more and more resilient modes in higher dimensions. 
In Proceedings of the 2012 symposium on Computational Geometry, SoCG ’12, pages 91–100, New York, 
NY, USA, 2012. ACM. 
Dan Feldman, Matthew Faulkner, and Andreas Krause. 
Scalable training of mixture models via coresets. 
In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural 
Information Processing Systems 24, pages 2142–2150. Curran Associates, Inc., 2011. 
Dan Feldman, Morteza Monemizadeh, and Christian Sohler. 
A PTAS for k-means clustering based on weak coresets. 
In Proceedings of the twenty-third annual symposium on Computational geometry, pages 11–18. ACM, 
2007. 
Edward W. Forgy. 
Cluster analysis of multivariate data: efficiency vs interpretability of classifications. 
Biometrics, 1965. 
Francis Galton. 
Hereditary genius. 
Macmillan and Company, 1869. 
Vincent Garcia and Frank Nielsen. 
Simplification and hierarchical representations of mixtures of exponential families. 
Signal Processing (Elsevier), 90(12):3197–3212, 2010. 

Bibliography III 
Moritz Hardt and Eric Price. 
Sharp bounds for learning a mixture of two gaussians. 
CoRR, abs/1404.4997, 2014. 
John A. Hartigan. 
Clustering Algorithms. 
John Wiley & Sons, Inc., New York, NY, USA, 99th edition, 1975. 
Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. 
Disentangling gaussians. 
Communications of the ACM, 55(2):113–120, 2012. 
Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and 
Angela Y. Wu. 
A local search approximation algorithm for k-means clustering. 
Computational Geometry: Theory & Applications, 28(2-3):89–112, 2004. 
Stuart P. Lloyd. 
Least squares quantization in PCM. 
Technical report, Bell Laboratories, 1957. 
James B. MacQueen. 
Some methods of classification and analysis of multivariate observations. 
In L. M. Le Cam and J. Neyman, editors, Proceedings of the Fifth Berkeley Symposium on Mathematical 
Statistics and Probability. University of California Press, Berkeley, CA, USA, 1967. 
Weiwen Miao and Marjorie Hahn. 
Existence of maximum likelihood estimates for multi-dimensional exponential families. 
Scandinavian Journal of Statistics, 24(3):371–386, 1997. 

Bibliography IV 
Ankur Moitra and Gregory Valiant. 
Settling the polynomial learnability of mixtures of Gaussians. 
In 51st IEEE Annual Symposium on Foundations of Computer Science, pages 93–102. IEEE, 2010. 
Frank Nielsen. 
k-MLE: A fast algorithm for learning statistical mixture models. 
CoRR, abs/1203.5181, 2012. 
Frank Nielsen. 
Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. 
Pattern Recognition Letters, 42(0):25 – 34, 2014. 
Frank Nielsen and Vincent Garcia. 
Statistical exponential families: A digest with flash cards, 2009. 
arXiv.org:0911.4863. 
Frank Nielsen and Richard Nock. 
Optimal interval clustering: Application to bregman clustering and statistical mixture learning. 
Signal Processing Letters, IEEE, 21(10):1289–1292, Oct 2014. 
Frank Nielsen, Paolo Piro, and Michel Barlaud. 
Bregman vantage point trees for efficient nearest neighbor queries. 
In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878–881, 
2009. 
Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. 
The effectiveness of Lloyd-type methods for the k-means problem. 
In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 165–176, 
Washington, DC, USA, 2006. IEEE Computer Society. 

Bibliography V 
Karl Pearson. 
Contributions to the mathematical theory of evolution. 
Philosophical Transactions of the Royal Society A, 185:71–110, 1894. 
Paolo Piro, Frank Nielsen, and Michel Barlaud. 
Tailored Bregman ball trees for effective nearest neighbors. 
In European Workshop on Computational Geometry (EuroCG), LORIA, Nancy, France, March 2009. IEEE. 
Adolphe Quetelet. 
Lettres sur la théorie des probabilités, appliquée aux sciences morales et politiques. 
Hayez, 1846. 
Christophe Saint-Jean and Frank Nielsen. 
Hartigan’s method for k-MLE: Mixture modeling with Wishart distributions and its application to motion 
retrieval. 
In Geometric Theory of Information, pages 301–330. Springer International Publishing, 2014. 
Olivier Schwander and Frank Nielsen. 
Model centroids for the simplification of kernel density estimators. 
In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 737–740, 
2012. 
Olivier Schwander and Frank Nielsen. 
Fast learning of Gamma mixture models with k-MLE. 
In Similarity-Based Pattern Recognition (SIMBAD), pages 235–249, 2013. 
Olivier Schwander, Aurélien J. Schutz, Frank Nielsen, and Yannick Berthoumieu. 
k-MLE for mixtures of generalized Gaussians. 
In Proceedings of the 21st International Conference on Pattern Recognition (ICPR), pages 2825–2828, 2012. 

Bibliography VI 
Jose Seabra, Francesco Ciompi, Oriol Pujol, Josepa Mauri, Petia Radeva, and Joao Sanchez. 
Rayleigh mixture model for plaque characterization in intravascular ultrasound. 
IEEE Transaction on Biomedical Engineering, 58(5):1314–1324, 2011. 
Marc Teboulle. 
A unified continuous optimization framework for center-based clustering methods. 
Journal of Machine Learning Research, 8:65–102, 2007. 
Juanying Xie, Shuai Jiang, Weixin Xie, and Xinbo Gao. 
An efficient global k-means clustering algorithm. 
Journal of computers, 6(2), 2011. 


More Related Content

What's hot

Bregman divergences from comparative convexity
Bregman divergences from comparative convexityBregman divergences from comparative convexity
Bregman divergences from comparative convexityFrank Nielsen
 
On the Jensen-Shannon symmetrization of distances relying on abstract means
On the Jensen-Shannon symmetrization of distances relying on abstract meansOn the Jensen-Shannon symmetrization of distances relying on abstract means
On the Jensen-Shannon symmetrization of distances relying on abstract meansFrank Nielsen
 
Classification with mixtures of curved Mahalanobis metrics
Classification with mixtures of curved Mahalanobis metricsClassification with mixtures of curved Mahalanobis metrics
Classification with mixtures of curved Mahalanobis metricsFrank Nielsen
 
ABC with Wasserstein distances
ABC with Wasserstein distancesABC with Wasserstein distances
ABC with Wasserstein distancesChristian Robert
 
Optimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-peripheryOptimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-peripheryFrancesco Tudisco
 
ABC based on Wasserstein distances
ABC based on Wasserstein distancesABC based on Wasserstein distances
ABC based on Wasserstein distancesChristian Robert
 
Clustering in Hilbert simplex geometry
Clustering in Hilbert simplex geometryClustering in Hilbert simplex geometry
Clustering in Hilbert simplex geometryFrank Nielsen
 
Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...
Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...
Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...Francesco Tudisco
 
Small updates of matrix functions used for network centrality
Small updates of matrix functions used for network centralitySmall updates of matrix functions used for network centrality
Small updates of matrix functions used for network centralityFrancesco Tudisco
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Valentin De Bortoli
 
ABC convergence under well- and mis-specified models
ABC convergence under well- and mis-specified modelsABC convergence under well- and mis-specified models
ABC convergence under well- and mis-specified modelsChristian Robert
 
Slides: The dual Voronoi diagrams with respect to representational Bregman di...
Slides: The dual Voronoi diagrams with respect to representational Bregman di...Slides: The dual Voronoi diagrams with respect to representational Bregman di...
Slides: The dual Voronoi diagrams with respect to representational Bregman di...Frank Nielsen
 
Continuous and Discrete-Time Analysis of SGD
Continuous and Discrete-Time Analysis of SGDContinuous and Discrete-Time Analysis of SGD
Continuous and Discrete-Time Analysis of SGDValentin De Bortoli
 
prior selection for mixture estimation
prior selection for mixture estimationprior selection for mixture estimation
prior selection for mixture estimationChristian Robert
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingFrank Nielsen
 
Macrocanonical models for texture synthesis
Macrocanonical models for texture synthesisMacrocanonical models for texture synthesis
Macrocanonical models for texture synthesisValentin De Bortoli
 

What's hot (20)

Bregman divergences from comparative convexity
Bregman divergences from comparative convexityBregman divergences from comparative convexity
Bregman divergences from comparative convexity
 
On the Jensen-Shannon symmetrization of distances relying on abstract means
On the Jensen-Shannon symmetrization of distances relying on abstract meansOn the Jensen-Shannon symmetrization of distances relying on abstract means
On the Jensen-Shannon symmetrization of distances relying on abstract means
 
Classification with mixtures of curved Mahalanobis metrics
Classification with mixtures of curved Mahalanobis metricsClassification with mixtures of curved Mahalanobis metrics
Classification with mixtures of curved Mahalanobis metrics
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
ABC with Wasserstein distances
ABC with Wasserstein distancesABC with Wasserstein distances
ABC with Wasserstein distances
 
Optimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-peripheryOptimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-periphery
 
ABC based on Wasserstein distances
ABC based on Wasserstein distancesABC based on Wasserstein distances
ABC based on Wasserstein distances
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Clustering in Hilbert simplex geometry
Clustering in Hilbert simplex geometryClustering in Hilbert simplex geometry
Clustering in Hilbert simplex geometry
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...
Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...
Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...
 
Small updates of matrix functions used for network centrality
Small updates of matrix functions used for network centralitySmall updates of matrix functions used for network centrality
Small updates of matrix functions used for network centrality
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...
 
ABC convergence under well- and mis-specified models
ABC convergence under well- and mis-specified modelsABC convergence under well- and mis-specified models
ABC convergence under well- and mis-specified models
 
Slides: The dual Voronoi diagrams with respect to representational Bregman di...
Slides: The dual Voronoi diagrams with respect to representational Bregman di...Slides: The dual Voronoi diagrams with respect to representational Bregman di...
Slides: The dual Voronoi diagrams with respect to representational Bregman di...
 
Continuous and Discrete-Time Analysis of SGD
Continuous and Discrete-Time Analysis of SGDContinuous and Discrete-Time Analysis of SGD
Continuous and Discrete-Time Analysis of SGD
 
prior selection for mixture estimation
prior selection for mixture estimationprior selection for mixture estimation
prior selection for mixture estimation
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processing
 
Macrocanonical models for texture synthesis
Macrocanonical models for texture synthesisMacrocanonical models for texture synthesis
Macrocanonical models for texture synthesis
 

Viewers also liked

(slides 8) Visual Computing: Geometry, Graphics, and Vision
(slides 8) Visual Computing: Geometry, Graphics, and Vision(slides 8) Visual Computing: Geometry, Graphics, and Vision
(slides 8) Visual Computing: Geometry, Graphics, and VisionFrank Nielsen
 
Fundamentals cig 4thdec
Fundamentals cig 4thdecFundamentals cig 4thdec
Fundamentals cig 4thdecFrank Nielsen
 
(slides 9) Visual Computing: Geometry, Graphics, and Vision
(slides 9) Visual Computing: Geometry, Graphics, and Vision(slides 9) Visual Computing: Geometry, Graphics, and Vision
(slides 9) Visual Computing: Geometry, Graphics, and VisionFrank Nielsen
 
A new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsA new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsFrank Nielsen
 
k-MLE: A fast algorithm for learning statistical mixture models
k-MLE: A fast algorithm for learning statistical mixture modelsk-MLE: A fast algorithm for learning statistical mixture models
k-MLE: A fast algorithm for learning statistical mixture modelsFrank Nielsen
 
Voronoi diagrams in information geometry:  Statistical Voronoi diagrams and ...
Voronoi diagrams in information geometry:  Statistical Voronoi diagrams and ...Voronoi diagrams in information geometry:  Statistical Voronoi diagrams and ...
Voronoi diagrams in information geometry:  Statistical Voronoi diagrams and ...Frank Nielsen
 
On approximating the Riemannian 1-center
On approximating the Riemannian 1-centerOn approximating the Riemannian 1-center
On approximating the Riemannian 1-centerFrank Nielsen
 
On Clustering Histograms with k-Means by Using Mixed α-Divergences
 On Clustering Histograms with k-Means by Using Mixed α-Divergences On Clustering Histograms with k-Means by Using Mixed α-Divergences
On Clustering Histograms with k-Means by Using Mixed α-DivergencesFrank Nielsen
 
INF442: Traitement des données massives
INF442: Traitement des données massivesINF442: Traitement des données massives
INF442: Traitement des données massivesFrank Nielsen
 
Traitement massif des données 2016
Traitement massif des données 2016Traitement massif des données 2016
Traitement massif des données 2016Frank Nielsen
 
Traitement des données massives (INF442, A5)
Traitement des données massives (INF442, A5)Traitement des données massives (INF442, A5)
Traitement des données massives (INF442, A5)Frank Nielsen
 
On representing spherical videos (Frank Nielsen, CVPR 2001)
On representing spherical videos (Frank Nielsen, CVPR 2001)On representing spherical videos (Frank Nielsen, CVPR 2001)
On representing spherical videos (Frank Nielsen, CVPR 2001)Frank Nielsen
 
Traitement des données massives (INF442, A8)
Traitement des données massives (INF442, A8)Traitement des données massives (INF442, A8)
Traitement des données massives (INF442, A8)Frank Nielsen
 
Traitement des données massives (INF442, A6)
Traitement des données massives (INF442, A6)Traitement des données massives (INF442, A6)
Traitement des données massives (INF442, A6)Frank Nielsen
 
Traitement des données massives (INF442, A7)
Traitement des données massives (INF442, A7)Traitement des données massives (INF442, A7)
Traitement des données massives (INF442, A7)Frank Nielsen
 
Computational Information Geometry for Machine Learning
Computational Information Geometry for Machine LearningComputational Information Geometry for Machine Learning
Computational Information Geometry for Machine LearningFrank Nielsen
 

Viewers also liked (16)

(slides 8) Visual Computing: Geometry, Graphics, and Vision
(slides 8) Visual Computing: Geometry, Graphics, and Vision(slides 8) Visual Computing: Geometry, Graphics, and Vision
(slides 8) Visual Computing: Geometry, Graphics, and Vision
 
Fundamentals cig 4thdec
Fundamentals cig 4thdecFundamentals cig 4thdec
Fundamentals cig 4thdec
 
(slides 9) Visual Computing: Geometry, Graphics, and Vision
(slides 9) Visual Computing: Geometry, Graphics, and Vision(slides 9) Visual Computing: Geometry, Graphics, and Vision
(slides 9) Visual Computing: Geometry, Graphics, and Vision
 
A new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsA new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributions
 
k-MLE: A fast algorithm for learning statistical mixture models
k-MLE: A fast algorithm for learning statistical mixture modelsk-MLE: A fast algorithm for learning statistical mixture models
k-MLE: A fast algorithm for learning statistical mixture models
 
Voronoi diagrams in information geometry:  Statistical Voronoi diagrams and ...
Voronoi diagrams in information geometry:  Statistical Voronoi diagrams and ...Voronoi diagrams in information geometry:  Statistical Voronoi diagrams and ...
Voronoi diagrams in information geometry:  Statistical Voronoi diagrams and ...
 
On approximating the Riemannian 1-center
On approximating the Riemannian 1-centerOn approximating the Riemannian 1-center
On approximating the Riemannian 1-center
 
On Clustering Histograms with k-Means by Using Mixed α-Divergences
 On Clustering Histograms with k-Means by Using Mixed α-Divergences On Clustering Histograms with k-Means by Using Mixed α-Divergences
On Clustering Histograms with k-Means by Using Mixed α-Divergences
 
INF442: Traitement des données massives
INF442: Traitement des données massivesINF442: Traitement des données massives
INF442: Traitement des données massives
 
Traitement massif des données 2016
Traitement massif des données 2016Traitement massif des données 2016
Traitement massif des données 2016
 
Traitement des données massives (INF442, A5)
Traitement des données massives (INF442, A5)Traitement des données massives (INF442, A5)
Traitement des données massives (INF442, A5)
 
On representing spherical videos (Frank Nielsen, CVPR 2001)
On representing spherical videos (Frank Nielsen, CVPR 2001)On representing spherical videos (Frank Nielsen, CVPR 2001)
On representing spherical videos (Frank Nielsen, CVPR 2001)
 
Traitement des données massives (INF442, A8)
Traitement des données massives (INF442, A8)Traitement des données massives (INF442, A8)
Traitement des données massives (INF442, A8)
 
Traitement des données massives (INF442, A6)
Traitement des données massives (INF442, A6)Traitement des données massives (INF442, A6)
Traitement des données massives (INF442, A6)
 
Traitement des données massives (INF442, A7)
Traitement des données massives (INF442, A7)Traitement des données massives (INF442, A7)
Traitement des données massives (INF442, A7)
 
Computational Information Geometry for Machine Learning
Computational Information Geometry for Machine LearningComputational Information Geometry for Machine Learning
Computational Information Geometry for Machine Learning
 

Similar to On learning statistical mixtures maximizing the complete likelihood

Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...Frank Nielsen
 
Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...Frank Nielsen
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking componentsChristian Robert
 
Bayesian Deep Learning
Bayesian Deep LearningBayesian Deep Learning
Bayesian Deep LearningRayKim51
 
Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...Alexander Litvinenko
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applicationsFrank Nielsen
 
Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9Daisuke Yoneoka
 
Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)Frank Nielsen
 
ENBIS 2018 presentation on Deep k-Means
ENBIS 2018 presentation on Deep k-MeansENBIS 2018 presentation on Deep k-Means
ENBIS 2018 presentation on Deep k-Meanstthonet
 
Classification and regression based on derivatives: a consistency result for ...
Classification and regression based on derivatives: a consistency result for ...Classification and regression based on derivatives: a consistency result for ...
Classification and regression based on derivatives: a consistency result for ...tuxette
 
Tensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantificationTensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantificationAlexander Litvinenko
 
Density theorems for Euclidean point configurations
Density theorems for Euclidean point configurationsDensity theorems for Euclidean point configurations
Density theorems for Euclidean point configurationsVjekoslavKovac1
 
Gaussian process in machine learning
Gaussian process in machine learningGaussian process in machine learning
Gaussian process in machine learningVARUN KUMAR
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionCharles Deledalle
 
Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector MachinesEdgar Marca
 
Automatic Bayesian method for Numerical Integration
Automatic Bayesian method for Numerical Integration Automatic Bayesian method for Numerical Integration
Automatic Bayesian method for Numerical Integration Jagadeeswaran Rathinavel
 
The Multivariate Gaussian Probability Distribution
The Multivariate Gaussian Probability DistributionThe Multivariate Gaussian Probability Distribution
The Multivariate Gaussian Probability DistributionPedro222284
 
Meta-learning and the ELBO
Meta-learning and the ELBOMeta-learning and the ELBO
Meta-learning and the ELBOYoonho Lee
 

Similar to On learning statistical mixtures maximizing the complete likelihood (20)

Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
 
Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
Bayesian Deep Learning
Bayesian Deep LearningBayesian Deep Learning
Bayesian Deep Learning
 
Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applications
 
Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9
 
Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)
 
QMC: Operator Splitting Workshop, Proximal Algorithms in Probability Spaces -...
QMC: Operator Splitting Workshop, Proximal Algorithms in Probability Spaces -...QMC: Operator Splitting Workshop, Proximal Algorithms in Probability Spaces -...
QMC: Operator Splitting Workshop, Proximal Algorithms in Probability Spaces -...
 
ENBIS 2018 presentation on Deep k-Means
ENBIS 2018 presentation on Deep k-MeansENBIS 2018 presentation on Deep k-Means
ENBIS 2018 presentation on Deep k-Means
 
Classification and regression based on derivatives: a consistency result for ...
Classification and regression based on derivatives: a consistency result for ...Classification and regression based on derivatives: a consistency result for ...
Classification and regression based on derivatives: a consistency result for ...
 
talk MCMC & SMC 2004
talk MCMC & SMC 2004talk MCMC & SMC 2004
talk MCMC & SMC 2004
 
Tensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantificationTensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantification
 
Density theorems for Euclidean point configurations
Density theorems for Euclidean point configurationsDensity theorems for Euclidean point configurations
Density theorems for Euclidean point configurations
 
Gaussian process in machine learning
Gaussian process in machine learningGaussian process in machine learning
Gaussian process in machine learning
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - Introduction
 
Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector Machines
 
Automatic Bayesian method for Numerical Integration
Automatic Bayesian method for Numerical Integration Automatic Bayesian method for Numerical Integration
Automatic Bayesian method for Numerical Integration
 
The Multivariate Gaussian Probability Distribution
The Multivariate Gaussian Probability DistributionThe Multivariate Gaussian Probability Distribution
The Multivariate Gaussian Probability Distribution
 
Meta-learning and the ELBO
Meta-learning and the ELBOMeta-learning and the ELBO
Meta-learning and the ELBO
 

Recently uploaded

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 

Recently uploaded (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 

On learning statistical mixtures maximizing the complete likelihood

  • 9. Learning MMs: A geometric hard clustering viewpoint
    $\max_{W,\Lambda} l_c(W,\Lambda) = \max_{W,\Lambda} \sum_{i=1}^n \max_{j=1}^k \log(w_j p(x_i|\theta_j)) \equiv \min_{W,\Lambda} \sum_{i=1}^n \min_{j=1}^k D_j(x_i)$,
    where $c_j = (w_j, \theta_j)$ is the cluster prototype and $D_j(x_i) = -\log p(x_i|\theta_j) - \log w_j$ are potential distance-like functions.
    ◮ Maximizing the complete likelihood amounts to a geometric hard clustering [37, 11] for fixed $w_j$'s (the distance $D_j(\cdot)$ depends on the cluster prototype $c_j$): $\min_\Lambda \sum_i \min_j D_j(x_i)$.
    ◮ Related to classification EM [5] (CEM) and hard/truncated EM.
    ◮ The solution of $\operatorname{argmax} l_c$ can be used to initialize $l_i$ (then optimized by EM).
  • 10. The k-MLE method: k-means type clustering algorithms
    k-MLE (a minimal sketch follows below):
    1. Initialize the weight vector W (in the open probability simplex $\Delta_k$)
    2. Solve $\min_\Lambda \sum_i \min_j D_j(x_i)$ (center-based clustering, W fixed)
    3. Solve $\min_W \sum_i \min_j D_j(x_i)$ ($\Lambda$ fixed)
    4. Test for convergence; otherwise go to step 2.
    ⇒ group coordinate ascent (likelihood) / descent (distance) optimization.
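To fix ideas, here is a minimal Python sketch of this loop; it is ours, not from the slides. The callbacks `log_pdf` and `mle` are placeholder names standing in for the component density and the per-cluster maximum-likelihood estimator, and the scheduling interleaves one assignment/relocation pass with the weight update, whereas the slides run the Step-2 clustering to convergence before updating W.

```python
import numpy as np

def k_mle(x, k, log_pdf, mle, max_iters=100, seed=0):
    """Hard-assignment k-MLE loop (sketch).

    x       : (n, ...) array of observations
    log_pdf : log_pdf(x, theta) -> (n,) array of component log-densities
    mle     : mle(x_subset) -> theta, per-cluster maximum-likelihood estimator
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    labels = rng.integers(k, size=n)                     # random initial partition
    w = np.full(k, 1.0 / k)                              # step 1: weights in the open simplex
    thetas = [mle(x[labels == j]) for j in range(k)]     # assumes no empty initial cluster
    prev_cost = np.inf
    for _ in range(max_iters):
        # step 2 (assignment): D_j(x) = -log p(x | theta_j) - log w_j
        D = np.stack([-log_pdf(x, thetas[j]) - np.log(w[j] + 1e-300) for j in range(k)], axis=1)
        labels = D.argmin(axis=1)
        cost = D.min(axis=1).sum()                       # = minus the complete log-likelihood
        # step 2 (relocation): per-cluster MLE; empty clusters keep their parameters
        thetas = [mle(x[labels == j]) if np.any(labels == j) else thetas[j] for j in range(k)]
        # step 3: weights = cluster point proportions (closed form, cf. slide 16)
        w = np.bincount(labels, minlength=k) / n
        # step 4: convergence test on the complete-likelihood cost
        if prev_cost - cost < 1e-9 * abs(cost):
            break
        prev_cost = cost
    return w, thetas, labels
```

The Rayleigh callables sketched after slide 13 below can be plugged in directly as `log_pdf`/`mle`.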
  • 11. k-MLE: Center-based clustering, W fixed
    Solve $\min_\Lambda \sum_i \min_j D_j(x_i)$
    k-means type convergence proof for assignment/relocation:
    ◮ Data assignment: $\forall i,\ l_i = \operatorname{argmax}_j w_j p(x_i|\lambda_j) = \operatorname{argmin}_j D_j(x_i)$, with $C_j = \{x_i \mid l_i = j\}$
    ◮ Center relocation: $\forall j,\ \lambda_j = \mathrm{MLE}(C_j)$
    Farthest Maximum Likelihood (FML) Voronoi diagram:
    $\mathrm{Vor}_{\mathrm{FML}}(c_i) = \{x \in \mathcal{X} : w_i p(x|\lambda_i) \geq w_j p(x|\lambda_j),\ \forall j \neq i\}$
    $\mathrm{Vor}(c_i) = \{x \in \mathcal{X} : D_i(x) \leq D_j(x),\ \forall j \neq i\}$
    FML Voronoi ≡ additively weighted Voronoi with $D_l(x) = -\log p(x|\lambda_l) - \log w_l$.
  • 12. k-MLE: Example for mixtures of exponential families
    Exponential family: the component density $p(x|\theta) = \exp(t(x)^\top\theta - F(\theta) + k(x))$ is log-concave with:
    ◮ $t(x)$: sufficient statistic in $\mathbb{R}^D$, $D$: family order.
    ◮ $k(x)$: auxiliary carrier term (wrt the Lebesgue/counting measure)
    ◮ $F(\theta)$: log-normalizer, cumulant function, log-partition.
    $D_j(x)$ is convex: clustering is a k-means wrt convex "distances".
    Farthest ML Voronoi ≡ additively-weighted Bregman Voronoi [4]:
    $-\log p(x;\theta) - \log w = F(\theta) - t(x)^\top\theta - k(x) - \log w = B_{F^*}(t(x) : \eta) - F^*(t(x)) - k(x) - \log w$
    $F^*(\eta) = \max_\theta(\theta^\top\eta - F(\theta))$: Legendre-Fenchel convex conjugate
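A quick numerical check of this decomposition, using the Poisson family ($\theta = \log\lambda$, $F(\theta) = e^\theta$, $t(x) = x$, $k(x) = -\log x!$, $\eta = \lambda$) as a concrete instance; the example and helper names are ours, not the slides'.

```python
import numpy as np
from math import lgamma, log

def poisson_minus_log_pdf(x, lam):
    # -log p(x | lambda) = lambda - x log(lambda) + log(x!)
    return lam - x * log(lam) + lgamma(x + 1)

def bregman_decomposition(x, lam):
    # Expectation coordinate eta = lambda, conjugate F*(eta) = eta log(eta) - eta
    F_star = lambda e: e * log(e) - e
    B = F_star(x) - F_star(lam) - (x - lam) * log(lam)   # B_{F*}(t(x) : eta), t(x) = x
    return B - F_star(x) - (-lgamma(x + 1))              # B - F*(t(x)) - k(x), k(x) = -log x!

x, lam = 3, 2.5
assert np.isclose(poisson_minus_log_pdf(x, lam), bregman_decomposition(x, lam))
```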
  • 13. Exponential families: Rayleigh distributions [36, 25]
    Application: IntraVascular UltraSound (IVUS) imaging.
    Rayleigh distribution: $p(x;\sigma) = \frac{x}{\sigma^2}\, e^{-\frac{x^2}{2\sigma^2}}$, $x \in \mathbb{R}^+ = \mathcal{X}$
    d = 1 (univariate), D = 1 (order 1)
    $\theta = -\frac{1}{2\sigma^2}$, $\Theta = (-\infty, 0)$
    $F(\theta) = -\log(-2\theta)$, $t(x) = x^2$, $k(x) = \log x$
    (Weibull for k = 2)
    Coronary plaques: fibrotic tissues, calcified tissues, lipidic tissues.
    Rayleigh Mixture Models (RMMs): for segmentation and classification tasks.
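As a concrete instance, a hedged sketch of the two callables for this Rayleigh family, whose MLE follows in closed form from the moment equation of slide 17 ($\nabla F(\theta) = -1/\theta = \overline{x^2}$); the function names are ours.

```python
import numpy as np

def rayleigh_log_pdf(x, theta):
    # log p(x | theta) = t(x) theta - F(theta) + k(x), with t(x) = x^2, F(theta) = -log(-2 theta), k(x) = log x
    return x**2 * theta + np.log(-2.0 * theta) + np.log(x)

def rayleigh_mle(x):
    # Moment equation: grad F(theta) = -1/theta = mean(x^2)  =>  theta_hat = -1 / mean(x^2)
    return -1.0 / np.mean(x**2)
```

These can serve as the `log_pdf`/`mle` arguments of the loop sketched after slide 10; equivalently, $\hat\sigma^2 = \overline{x^2}/2$.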
  • 14. Exponential families: Multivariate Gaussians [14, 25]
    Gaussian Mixture Models (GMMs). (Color image interpreted as a 5D xyRGB point set.)
    Gaussian density: $p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}\sqrt{|\Sigma|}}\, e^{-\frac{1}{2} D_{\Sigma^{-1}}(x-\mu,\, x-\mu)}$
    Squared Mahalanobis distance: $D_Q(x,y) = (x-y)^\top Q (x-y)$
    $x \in \mathbb{R}^d = \mathcal{X}$ (multivariate), order $D = \frac{d(d+3)}{2}$
    $\theta = (\Sigma^{-1}\mu,\ \tfrac{1}{2}\Sigma^{-1}) = (\theta_v, \theta_M) \in \mathbb{R}^d \times S_{++}^d$
    $F(\theta) = \frac{1}{4}\theta_v^\top \theta_M^{-1}\theta_v - \frac{1}{2}\log|\theta_M| + \frac{d}{2}\log\pi$
    $t(x) = (x, -xx^\top)$, $k(x) = 0$
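To make this canonical decomposition tangible, a small numerical check (our example, plain NumPy) that $\langle t(x), \theta\rangle - F(\theta)$ with the parameters above reproduces the usual Gaussian log-density.

```python
import numpy as np

def gaussian_natural_params(mu, Sigma):
    """Map ordinary parameters (mu, Sigma) to natural parameters (theta_v, theta_M)."""
    P = np.linalg.inv(Sigma)
    return P @ mu, 0.5 * P

def log_pdf_natural(x, theta_v, theta_M):
    """log p(x) = <t(x), theta> - F(theta), with t(x) = (x, -x x^T) and k(x) = 0."""
    d = len(x)
    F = (0.25 * theta_v @ np.linalg.inv(theta_M) @ theta_v
         - 0.5 * np.linalg.slogdet(theta_M)[1] + 0.5 * d * np.log(np.pi))
    return x @ theta_v - np.trace(np.outer(x, x) @ theta_M) - F

# Sanity check against the standard multivariate Gaussian log-density
mu = np.array([1.0, -2.0]); Sigma = np.array([[2.0, 0.3], [0.3, 1.0]]); x = np.array([0.5, 0.5])
tv, tM = gaussian_natural_params(mu, Sigma)
diff = x - mu
ref = (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
       - 0.5 * np.linalg.slogdet(Sigma)[1] - 0.5 * len(x) * np.log(2 * np.pi))
assert np.isclose(log_pdf_natural(x, tv, tM), ref)
```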
  • 15. The k-MLE method for exponential families
    k-MLE_EF:
    1. Initialize the weight vector W (in the open probability simplex $\Delta_k$)
    2. Solve $\min_\Lambda \sum_i \min_j \big(B_{F^*}(t(x_i) : \eta_j) - \log w_j\big)$
    3. Solve $\min_W \sum_i \min_j D_j(x_i)$
    4. Test for convergence; otherwise go to step 2.
    Assignment condition in Step 2: additively-weighted Bregman Voronoi diagram.
  • 16. k-MLE: Solving for weights given component parameters
    Solve $\min_W \sum_i \min_j D_j(x_i)$
    Amounts to $\operatorname{argmin}_W \sum_j -n_j \log w_j \equiv \operatorname{argmin}_W \sum_j -\frac{n_j}{n}\log w_j$, where $n_j = \#\{x_i \in \mathrm{Vor}(c_j)\} = |C_j|$:
    $\min_{W \in \Delta_k} H^\times(N : W)$,
    where $N = (\frac{n_1}{n}, \ldots, \frac{n_k}{n})$ is the cluster point proportion vector $\in \Delta_k$.
    The cross-entropy $H^\times$ is minimized when $H^\times(N : W) = H(N)$, that is $W = N$.
    Kullback-Leibler divergence: $\mathrm{KL}(N : W) = H^\times(N : W) - H(N) = 0$ when $W = N$.
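For completeness, the one-line Lagrangian argument behind this closed-form weight update (not spelled out on the slide):
$$\frac{\partial}{\partial w_j}\Big(\sum_{j'} -\frac{n_{j'}}{n}\log w_{j'} + \mu\big(\textstyle\sum_{j'} w_{j'}-1\big)\Big) = -\frac{n_j}{n\,w_j}+\mu = 0 \;\Rightarrow\; w_j = \frac{n_j}{\mu\, n},$$
and the constraint $\sum_j w_j = 1$ forces $\mu = 1$, hence $w_j = n_j/n$, i.e. $W = N$.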
  • 17. MLE for exponential families
    Given an ML farthest Voronoi partition, compute the MLEs $\hat\theta_j$:
    $\hat\theta_j = \operatorname{argmax}_{\theta\in\Theta} \prod_{x_i \in \mathrm{Vor}(c_j)} p_F(x_i;\theta)$ is the unique (***) maximum since $\nabla^2 F(\theta) \succ 0$.
    Moment equation: $\nabla F(\hat\theta_j) = \eta(\hat\theta_j) = \frac{1}{n_j}\sum_{x_i \in \mathrm{Vor}(c_j)} t(x_i) = \bar t = \hat\eta$
    The MLE is consistent and efficient, with asymptotic normal distribution $\hat\theta_j \sim N\big(\theta_j, \frac{1}{n_j} I^{-1}(\theta_j)\big)$
    Fisher information matrix: $I(\theta_j) = \mathrm{var}[t(X)] = \nabla^2 F(\theta_j) = (\nabla^2 F^*)^{-1}(\eta_j)$
    MLE may be biased (e.g., normal distributions).
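As an illustration (ours, not on the slide), instantiating the moment equation for the multivariate Gaussian of slide 14, where $\eta = \nabla F(\theta) = \mathbb{E}[t(X)] = (\mu,\,-(\Sigma+\mu\mu^\top))$:
$$\hat\eta_j = \Big(\frac{1}{n_j}\sum_{x_i\in C_j} x_i,\; -\frac{1}{n_j}\sum_{x_i\in C_j} x_i x_i^\top\Big) \;\Longrightarrow\; \hat\mu_j = \bar x_j,\qquad \hat\Sigma_j = \frac{1}{n_j}\sum_{x_i\in C_j} x_i x_i^\top - \bar x_j \bar x_j^\top,$$
which is the familiar (biased, cf. the last bullet) Gaussian MLE.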
  • 18. Existence of MLEs for exponential families (***)
    For minimal and full EFs, the MLE is guaranteed to exist [3, 21] provided that the matrix
    $$T = \begin{pmatrix} 1 & t_1(x_1) & \cdots & t_D(x_1) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & t_1(x_n) & \cdots & t_D(x_n) \end{pmatrix} \qquad (1)$$
    of dimension $n \times (D+1)$ has rank $D+1$ [3].
    For example, problems arise for MLEs of MVNs with $n < d$ observations (undefined, likelihood $\to \infty$).
    Condition: $\bar t = \frac{1}{n_j}\sum_{x_i\in\mathrm{Vor}(c_j)} t(x_i) \in \mathrm{int}(C)$, where $C$ is the closed convex support.
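A tiny numerical helper (ours, not from the slides) to test condition (1) on a cluster before running the per-cluster MLE:

```python
import numpy as np

def mle_exists(t_values):
    """Rank test of the n x (D+1) design matrix T = [1 | t(x_i)] of equation (1)."""
    t_values = np.atleast_2d(t_values)                       # shape (n, D)
    n, D = t_values.shape
    T = np.hstack([np.ones((n, 1)), t_values])
    return np.linalg.matrix_rank(T) == D + 1

# Univariate normal: t(x) = (x, -x^2), D = 2, so at least 3 distinct points are needed
print(mle_exists([[1.0, -1.0], [2.0, -4.0]]))                # False: n = 2 < D + 1
print(mle_exists([[1.0, -1.0], [2.0, -4.0], [3.0, -9.0]]))   # True
```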
  • 19. MLE of EFs: Observed point in IG / Bregman 1-mean
    $\hat\theta = \operatorname{argmax}_\theta \prod_{i=1}^n p_F(x_i;\theta) = \operatorname{argmax}_\theta \sum_{i=1}^n \log p_F(x_i;\theta)$
    $= \operatorname{argmax}_\theta \sum_{i=1}^n \big(-B_{F^*}(t(x_i):\eta) + \underbrace{F^*(t(x_i)) + k(x_i)}_{\text{constant}}\big) \equiv \operatorname{argmin}_\eta \sum_{i=1}^n B_{F^*}(t(x_i):\eta)$
    Right-sided Bregman centroid = center of mass: $\hat\eta = \frac{1}{n}\sum_{i=1}^n t(x_i)$.
    Average log-likelihood: $\bar l = \frac{1}{n}\sum_{i=1}^n \big(-B_{F^*}(t(x_i):\hat\eta) + F^*(t(x_i)) + k(x_i)\big) = \langle\hat\eta,\hat\theta\rangle - F(\hat\theta) + \bar k = F^*(\hat\eta) + \bar k$
  • 20. The k-MLE method: Heuristics based on k-means
    k-means is NP-hard (non-convex optimization) when d > 1 and k > 1, and is solved exactly by dynamic programming [26] in $O(n^2 k)$ time and $O(n)$ memory when d = 1.
    Heuristics:
    ◮ Kanungo et al. [18] swap: yields a $(9+\epsilon)$-approximation
    ◮ Global seeds: random seeding (Forgy [12]), k-means++ [2], global k-means initialization [38]
    ◮ Local refinements: Lloyd batched update [19], MacQueen iterative update [20], Hartigan single-point swap [16], etc.
  • 21. Generalized k-MLE
    Weibull or generalized Gaussians are parametric families of exponential families [35]: F(γ).
    Fixing some parameters yields nested families of (sub-)exponential families [34]: one obtains a single free parameter, with the convex conjugate F∗ approximated by line search (Gamma distributions / generalized Gaussians).
  • 22. Generalized k-MLE
    k-GMLE (a sketch of the family-selection step follows below):
    1. Initialize the weight vector $W \in \Delta_k$ and a family type $(F_1, ..., F_k)$ for each cluster
    2. Solve $\min_\Lambda \sum_i \min_j D_j(x_i)$ (center-based clustering for W fixed) with potential functions $D_j(x_i) = -\log p_{F_j}(x_i|\theta_j) - \log w_j$
    3. Solve for the family types maximizing the MLE in each cluster $C_j$ by choosing the parametric family of distributions $F_j = F(\gamma_j)$ that yields the best likelihood: $\min_{F_1=F(\gamma_1),...,F_k=F(\gamma_k)} \sum_i \min_j D_{w_j,\theta_j,F_j}(x_i)$
    4. Update W as the cluster point proportions
    5. Test for convergence; otherwise go to step 2.
    with $D_{w_j,\theta_j,F_j}(x) = -\log p_{F_j}(x;\theta_j) - \log w_j$
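A hedged Python sketch of the per-cluster family-selection step (Step 3) over a finite grid of candidate $\gamma$'s; the callback names and the grid-search strategy are ours, not the slides' line-search formulation.

```python
import numpy as np

def select_family(cluster_x, gammas, fit_theta, log_pdf):
    """Pick the sub-family gamma maximizing the within-cluster likelihood.

    gammas    : iterable of candidate family parameters (e.g. Gamma shapes)
    fit_theta : fit_theta(x, gamma) -> theta, MLE of the free parameter for fixed gamma
    log_pdf   : log_pdf(x, theta, gamma) -> (n,) array of log-densities
    """
    best_gamma, best_theta, best_ll = None, None, -np.inf
    for gamma in gammas:
        theta = fit_theta(cluster_x, gamma)              # inner MLE (closed form or line search)
        ll = log_pdf(cluster_x, theta, gamma).sum()      # within-cluster log-likelihood
        if ll > best_ll:
            best_gamma, best_theta, best_ll = gamma, theta, ll
    return best_gamma, best_theta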
  • 23. Generalized k-MLE: Convergence
    ◮ Lloyd's batched generalized k-MLE monotonically maximizes the complete likelihood.
    ◮ Hartigan's single-point relocation generalized k-MLE monotonically maximizes the complete likelihood [32], improves over Lloyd local maxima, and avoids the problem of MLE existence inside clusters by ensuring $n_j \geq D$ points in general position (T of rank D+1).
    ◮ Model selection: learn k automatically using DP k-means [32] (Dirichlet Process).
  • 24. k-MLE [23] versus EM for Exponential Families [1]
                   k-MLE / Hard EM [23] (2012-)             Soft EM [1] (1977)
                   = Bregman hard clustering                = Bregman soft clustering
    Memory         lighter, O(n)                            heavier, O(nk)
    Assignment     NNs with VP-trees [27], BB-trees [30]    all k-NNs
    Convergence    always, finitely                         ∞, needs a stopping criterion
    Many (probabilistically) guaranteed initializations exist for k-MLE [18, 2, 28].
  • 25. k-MLE: Solving for D = 1 exponential families
    ◮ Rayleigh, Poisson, or (nested) univariate normal with constant σ are order-1 EFs (D = 1).
    ◮ Clustering problem: dual 1D Bregman clustering [1] on the 1D scalars $y_i = t(x_i)$.
    ◮ FML Voronoi diagrams have connected cells: the optimal clustering is an interval clustering.
    ◮ 1D k-means (with additive weights) can be solved exactly by dynamic programming in $O(n^2 k)$ time [26]. Then update the weights W (cluster point proportions) and reiterate.
  • 26. Dynamic programming for order-1 (D = 1) mixtures [26]
    Consider W fixed. k-MLE cost: $\sum_{j=1}^k l(C_j)$ where the $C_j$ are the clusters.
    [Figure: the sorted sequence $x_1, \ldots, x_n$ split at $x_j$: $\mathrm{MLE}_{k-1}(X_{1,j-1})$ on the left, $\mathrm{MLE}_1(X_{j,n})$ on the right.]
    Dynamic programming optimality equation:
    $\mathrm{MLE}_k(x_1, \ldots, x_n) = \max_{j=2}^{n} \big(\mathrm{MLE}_{k-1}(X_{1,j-1}) + \mathrm{MLE}_1(X_{j,n})\big)$, with $X_{l,r} = \{x_l, x_{l+1}, \ldots, x_{r-1}, x_r\}$.
    ◮ Build the dynamic programming table from l = 1 to l = k columns, m = 1 to m = n rows.
    ◮ Retrieve the $C_j$ from the DP table by backtracking on the $\operatorname{argmax}_j$.
    ◮ For D = 1 EFs, $O(n^2 k)$ time [26].
    (A small sketch of this DP follows.)
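A compact, unoptimized Python sketch of this table filling and backtracking; it assumes the data is sorted and that `interval_ll(l, r)` (a placeholder name of ours) returns the maximized single-component log-likelihood on $x_l, \ldots, x_{r-1}$, e.g. via the closed form of slide 19, $\sum_i \big(F^*(\hat\eta) + k(x_i)\big)$ with $\hat\eta$ the interval mean of $t(x_i)$.

```python
import numpy as np

def dp_interval_clustering(x, k, interval_ll):
    """Exact 1D clustering by dynamic programming (sketch), x assumed sorted.

    interval_ll(l, r) : maximized log-likelihood of one component on x[l:r] (r exclusive).
    Returns the list of (start, end) index intervals of the optimal k clusters.
    """
    n = len(x)
    # best[m][j] = best total log-likelihood of x[0:m] split into j intervals
    best = np.full((n + 1, k + 1), -np.inf)
    split = np.zeros((n + 1, k + 1), dtype=int)
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for m in range(j, n + 1):
            for s in range(j - 1, m):                    # last interval is x[s:m]
                cand = best[s][j - 1] + interval_ll(s, m)
                if cand > best[m][j]:
                    best[m][j], split[m][j] = cand, s
    # Backtrack the interval boundaries
    intervals, m = [], n
    for j in range(k, 0, -1):
        s = split[m][j]
        intervals.append((s, m))
        m = s
    return intervals[::-1]
```

The triple loop performs $O(n^2 k)$ calls to `interval_ll`, matching the complexity quoted on the slide.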
  • 27. Experiments with 1D Gaussian Mixture Models (GMMs)
    gmm1 score = −3.075 (Euclidean k-means, σ fixed)
    gmm2 score = −3.038 (Bregman k-means, σ fitted, better)
  • 28. Summary: k-MLE methodology for learning mixtures
    Learn MMs from sequences of geometric hard clusterings [11].
    ◮ Hard k-MLE (≡ dual Bregman hard clustering for EFs) versus soft EM (≡ soft Bregman clustering [1] for EFs):
      ◮ k-MLE maximizes the complete likelihood lc.
      ◮ EM locally maximizes the incomplete likelihood li.
    ◮ The component-parameter geometric clustering (Step 2, in the η coordinates) can be implemented using any Bregman k-means heuristic on the conjugate F∗.
    ◮ Consider generalized k-MLE when F∗ is not available in closed form: nested exponential families (e.g., Gamma).
    ◮ Initialization can be performed using k-means initializations: k-MLE++, etc.
    ◮ Exact solution by dynamic programming for order-1 EFs (with prescribed weight proportions W).
    ◮ Avoid unbounded likelihood (e.g., ∞ for a location-scale member with σ → 0: Dirac) using Hartigan's heuristic [32].
  • 29. Discussion: Learning statistical models FAST!
    ◮ (EF) mixture models allow one to universally approximate smooth densities.
    ◮ A single (multimodal) EF can approximate any smooth density too [7], but its F is not in closed form.
    ◮ Which criterion is best/realistic to maximize: incomplete or complete likelihood, or parameter distortions? Leverage the many recent results on k-means clustering for learning mixture models.
    ◮ Alternative approach: simplifying mixtures obtained from kernel density estimators (KDEs) is a fine-to-coarse solution [33].
    ◮ Open problem: how to constrain the MMs to have a prescribed number of modes/antimodes?
  • 30. Thank you.
    Experiments and performance evaluations on generalized k-MLE:
    ◮ k-GMLE for generalized Gaussians [35]
    ◮ k-GMLE for Gamma distributions [34]
    ◮ k-GMLE for singly-parametric distributions [26] (compared with Expectation-Maximization [8])
    Frank Nielsen (5793b870).
  • 31. Bibliography I
    [1] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
    [2] Anup Bhattacharya, Ragesh Jaiswal, and Nir Ailon. A tight lower bound instance for k-means++ in constant dimension. In T. V. Gopal, Manindra Agrawal, Angsheng Li, and S. Barry Cooper, editors, Theory and Applications of Models of Computation, volume 8402 of Lecture Notes in Computer Science, pages 7–22. Springer International Publishing, 2014.
    [3] Krzysztof Bogdan and Małgorzata Bogdan. On existence of maximum likelihood estimators in exponential families. Statistics, 34(2):137–149, 2000.
    [4] Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2):281–307, September 2010.
    [5] Gilles Celeux and Gérard Govaert. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14(3):315–332, October 1992.
    [6] Jiahua Chen. Optimal rate of convergence for finite mixture models. The Annals of Statistics, pages 221–233, 1995.
    [7] Loren Cobb, Peter Koppstein, and Neng Hsin Chen. Estimation and moment recursion relations for multimodal distributions of the exponential family. Journal of the American Statistical Association, 78(381):124–130, 1983.
  • 32. Bibliography II
    [8] Arthur Pentland Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.
    [9] Herbert Edelsbrunner, Brittany Terese Fasy, and Günter Rote. Add isotropic Gaussian kernels at own risk: more and more resilient modes in higher dimensions. In Proceedings of the 2012 Symposium on Computational Geometry (SoCG '12), pages 91–100, New York, NY, USA, 2012. ACM.
    [10] Dan Feldman, Matthew Faulkner, and Andreas Krause. Scalable training of mixture models via coresets. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2142–2150. Curran Associates, Inc., 2011.
    [11] Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the Twenty-Third Annual Symposium on Computational Geometry, pages 11–18. ACM, 2007.
    [12] Edward W. Forgy. Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 1965.
    [13] Francis Galton. Hereditary Genius. Macmillan and Company, 1869.
    [14] Vincent Garcia and Frank Nielsen. Simplification and hierarchical representations of mixtures of exponential families. Signal Processing (Elsevier), 90(12):3197–3212, 2010.
  • 33. Bibliography III
    [15] Moritz Hardt and Eric Price. Sharp bounds for learning a mixture of two Gaussians. CoRR, abs/1404.4997, 2014.
    [16] John A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 1975.
    [17] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Disentangling Gaussians. Communications of the ACM, 55(2):113–120, 2012.
    [18] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry: Theory & Applications, 28(2-3):89–112, 2004.
    [19] Stuart P. Lloyd. Least squares quantization in PCM. Technical report, Bell Laboratories, 1957.
    [20] James B. MacQueen. Some methods of classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley, CA, USA, 1967.
    [21] Weiwen Miao and Marjorie Hahn. Existence of maximum likelihood estimates for multi-dimensional exponential families. Scandinavian Journal of Statistics, 24(3):371–386, 1997.
  • 34. Bibliography IV
    [22] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In 51st IEEE Annual Symposium on Foundations of Computer Science, pages 93–102. IEEE, 2010.
    [23] Frank Nielsen. k-MLE: A fast algorithm for learning statistical mixture models. CoRR, abs/1203.5181, 2012.
    [24] Frank Nielsen. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognition Letters, 42:25–34, 2014.
    [25] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv:0911.4863, 2009.
    [26] Frank Nielsen and Richard Nock. Optimal interval clustering: Application to Bregman clustering and statistical mixture learning. IEEE Signal Processing Letters, 21(10):1289–1292, October 2014.
    [27] Frank Nielsen, Paolo Piro, and Michel Barlaud. Bregman vantage point trees for efficient nearest neighbor queries. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878–881, 2009.
    [28] Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 165–176, Washington, DC, USA, 2006. IEEE Computer Society.
  • 35. Bibliography V
    [29] Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society A, 185:71–110, 1894.
    [30] Paolo Piro, Frank Nielsen, and Michel Barlaud. Tailored Bregman ball trees for effective nearest neighbors. In European Workshop on Computational Geometry (EuroCG), LORIA, Nancy, France, March 2009. IEEE.
    [31] Adolphe Quetelet. Lettres sur la théorie des probabilités, appliquée aux sciences morales et politiques. Hayez, 1846.
    [32] Christophe Saint-Jean and Frank Nielsen. Hartigan's method for k-MLE: Mixture modeling with Wishart distributions and its application to motion retrieval. In Geometric Theory of Information, pages 301–330. Springer International Publishing, 2014.
    [33] Olivier Schwander and Frank Nielsen. Model centroids for the simplification of kernel density estimators. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 737–740, 2012.
    [34] Olivier Schwander and Frank Nielsen. Fast learning of Gamma mixture models with k-MLE. In Similarity-Based Pattern Recognition (SIMBAD), pages 235–249, 2013.
    [35] Olivier Schwander, Aurélien J. Schutz, Frank Nielsen, and Yannick Berthoumieu. k-MLE for mixtures of generalized Gaussians. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR), pages 2825–2828, 2012.
  • 36. Bibliography VI
    [36] Jose Seabra, Francesco Ciompi, Oriol Pujol, Josepa Mauri, Petia Radeva, and Joao Sanchez. Rayleigh mixture model for plaque characterization in intravascular ultrasound. IEEE Transactions on Biomedical Engineering, 58(5):1314–1324, 2011.
    [37] Marc Teboulle. A unified continuous optimization framework for center-based clustering methods. Journal of Machine Learning Research, 8:65–102, 2007.
    [38] Juanying Xie, Shuai Jiang, Weixin Xie, and Xinbo Gao. An efficient global k-means clustering algorithm. Journal of Computers, 6(2), 2011.