CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures


Published in: Education

  1. 1. Advanced Information Theory in CVPR “in a Nutshell”. CVPR Tutorial, June 13-18, 2010, San Francisco, CA. Gaussian Mixtures: Classification & PDF Estimation. Francisco Escolano & Anand Rangarajan
  2. 2. Gaussian Mixtures
     Background. Gaussian Mixtures (GMs) are ubiquitous in CVPR. For instance, in CBIR it is sometimes interesting to model the image as a pdf over the pixel colors and positions (see for instance [Goldberger et al.,03], where a KL-divergence computation method is presented). GMs often provide a model for the pdf associated to the image, and this is useful for segmentation. GMs, as we have seen in the previous lesson, are also useful for modeling shapes.
     Therefore, GM estimation has been a recurrent topic in CVPR. Traditional methods associated with the EM algorithm have evolved to incorporate IT elements like the MDL principle for model-order selection [Figueiredo and Jain,02], in parallel with the development of Variational Bayes (VB) [Constantinopoulos and Likas,07].
  3. 3. Uses of Gaussian Mixtures
     Figure: Gaussian Mixtures for modeling images (top) and for color-based segmentation (bottom)
  4. 4. Review of Gaussian Mixtures
     Definition
     A d-dimensional random variable $Y$ follows a finite-mixture distribution when its pdf $p(Y|\Theta)$ can be described by a weighted sum of known pdfs named kernels. When all of these kernels are Gaussian, the mixture is named in the same way:
     $$p(Y|\Theta) = \sum_{i=1}^{K} \pi_i \, p(Y|\Theta_i),$$
     where $0 \le \pi_i \le 1$ for $i = 1, \ldots, K$, $\sum_{i=1}^{K} \pi_i = 1$, $K$ is the number of kernels, $\pi_1, \ldots, \pi_K$ are the a priori probabilities of each kernel, and $\Theta_i$ are the parameters describing the kernel. In GMs, $\Theta_i = \{\mu_i, \Sigma_i\}$, that is, the mean vector and the covariance matrix.
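The definition above can be illustrated with a minimal sketch, restricted to the univariate case (d = 1) so that it needs only the standard library. The function names `gaussian_pdf` and `mixture_pdf` and the numeric values are illustrative assumptions, not part of the tutorial.

```python
import math


def gaussian_pdf(y, mean, variance):
    """Density of a univariate Gaussian N(mean, variance) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)


def mixture_pdf(y, weights, means, variances):
    """Evaluate p(y | Theta) = sum_i pi_i * N(y | mu_i, sigma_i^2)."""
    # The priors pi_i must be non-negative and sum to one.
    if any(w < 0.0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * gaussian_pdf(y, m, v) for w, m, v in zip(weights, means, variances))


# Two symmetric kernels with equal priors (illustrative values).
density_at_zero = mixture_pdf(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

With symmetric kernels and equal priors, the mixture density at the midpoint equals the density of either kernel there, which is a quick sanity check on the weighting.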
  5. 5. Review of Gaussian Mixtures (2)
     GMs and Maximum Likelihood
     The whole set of parameters of a given K-mixture is denoted by $\Theta \equiv \{\Theta_1, \ldots, \Theta_K, \pi_1, \ldots, \pi_K\}$. Obtaining the optimal set of parameters $\Theta^*$ is usually posed in terms of maximizing the log-likelihood of the pdf to be estimated, based on a set of $N$ i.i.d. samples of the variable, $Y = \{y_1, \ldots, y_N\}$:
     $$L(\Theta, Y) = \ell(Y|\Theta) = \log p(Y|\Theta) = \log \prod_{n=1}^{N} p(y_n|\Theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, p(y_n|\Theta_k).$$
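As a hedged illustration of the log-likelihood above, the following sketch evaluates it for a univariate mixture. Working in the log domain with the log-sum-exp trick is a design choice of this example (it avoids underflow for samples far from every kernel), and the names `log_gaussian` and `log_likelihood` are assumptions.

```python
import math


def log_gaussian(y, mean, variance):
    """Log-density of a univariate Gaussian N(mean, variance) at y."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (y - mean) ** 2 / variance)


def log_likelihood(samples, weights, means, variances):
    """l(Y | Theta) = sum_n log sum_k pi_k p(y_n | Theta_k)."""
    total = 0.0
    for y in samples:
        # Per-kernel terms log(pi_k) + log p(y | Theta_k); kernels with zero prior are skipped.
        terms = [math.log(w) + log_gaussian(y, m, v)
                 for w, m, v in zip(weights, means, variances) if w > 0.0]
        # Log-sum-exp: factor out the largest term before exponentiating.
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total
```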
  6. 6. Review of Gaussian Mixtures (3)
     GMs and EM
     The EM algorithm finds maximum-likelihood solutions to problems where there are hidden variables. In the case of Gaussian mixtures, these variables are a set of $N$ labels $Z = \{z^{(1)}, \ldots, z^{(N)}\}$ associated with the samples. Each label is a binary vector $z^{(n)} = [z_1^{(n)}, \ldots, z_K^{(n)}]$, $K$ being the number of components, where $z_m^{(n)} = 1$ and $z_p^{(n)} = 0$ for $p \neq m$ denote that $y_n$ has been generated by the kernel $m$. Then, given the complete set of data $X = \{Y, Z\}$, the log-likelihood of this set is given by
     $$\log p(Y, Z|\Theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_k^{(n)} \log[\pi_k \, p(y_n|\Theta_k)].$$
  7. 7. Review of Gaussian Mixtures (4)
     E-Step
     Consists of estimating the expected value of the hidden variables given the visible data $Y$ and the current estimate of the parameters $\Theta^*(t)$:
     $$E[z_k^{(n)}|Y, \Theta^*(t)] = P[z_k^{(n)} = 1|y_n, \Theta^*(t)] = \frac{\pi_k^*(t) \, p(y_n|\Theta_k^*(t))}{\sum_{j=1}^{K} \pi_j^*(t) \, p(y_n|\Theta_j^*(t))}.$$
     Thus, the probability of generating $y_n$ with the kernel $k$ is given by:
     $$p(k|y_n) = \frac{\pi_k \, p(y_n|k)}{\sum_{j=1}^{K} \pi_j \, p(y_n|j)}.$$
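The E-step can be sketched as follows for a univariate mixture. This is a minimal illustration under stated assumptions: the names `gaussian_pdf` and `e_step` and the sample values are hypothetical, and a production version would work in the log domain to guard against a zero denominator.

```python
import math


def gaussian_pdf(y, mean, variance):
    """Density of a univariate Gaussian N(mean, variance) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)


def e_step(samples, weights, means, variances):
    """Return resp[n][k] = p(k | y_n), the posterior probability of kernel k for sample n."""
    responsibilities = []
    for y in samples:
        # Numerators pi_k * p(y_n | k) for every kernel.
        joint = [w * gaussian_pdf(y, m, v) for w, m, v in zip(weights, means, variances)]
        evidence = sum(joint)  # denominator: sum_j pi_j * p(y_n | j)
        responsibilities.append([j / evidence for j in joint])
    return responsibilities


# Illustrative data: one sample midway between the kernels, one close to the second kernel.
resp = e_step([0.0, 3.0], [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Each row of `resp` sums to one, and the sample at 3.0 is attributed mostly to the kernel centered at 1.0.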
  8. 8. Review of Gaussian Mixtures (5)
     M-Step
     Given the expected $Z$, the new parameters $\Theta^*(t+1)$ are given by:
     $$\pi_k = \frac{1}{N} \sum_{n=1}^{N} p(k|y_n),$$
     $$\mu_k = \frac{\sum_{n=1}^{N} p(k|y_n) \, y_n}{\sum_{n=1}^{N} p(k|y_n)},$$
     $$\Sigma_k = \frac{\sum_{n=1}^{N} p(k|y_n)(y_n - \mu_k)(y_n - \mu_k)^T}{\sum_{n=1}^{N} p(k|y_n)}.$$
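A minimal sketch of the M-step updates above, again for the univariate case where the covariance reduces to a variance. The name `m_step` and the toy data are assumptions; the sketch also assumes every kernel receives non-zero total responsibility.

```python
def m_step(samples, responsibilities):
    """Re-estimate (weights, means, variances) from p(k | y_n); univariate case."""
    n_samples = len(samples)
    n_kernels = len(responsibilities[0])
    weights, means, variances = [], [], []
    for k in range(n_kernels):
        r_k = [row[k] for row in responsibilities]
        mass = sum(r_k)  # sum_n p(k | y_n), assumed non-zero here
        mean = sum(r * y for r, y in zip(r_k, samples)) / mass
        # The variance uses the freshly updated mean, as in the formulas above.
        variance = sum(r * (y - mean) ** 2 for r, y in zip(r_k, samples)) / mass
        weights.append(mass / n_samples)
        means.append(mean)
        variances.append(variance)
    return weights, means, variances


# Hard (0/1) responsibilities make the result easy to verify by hand.
weights, means, variances = m_step(
    [0.0, 2.0, 10.0, 12.0],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
)
```

With hard assignments the updates reduce to per-cluster sample proportions, means and (biased) variances.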
  9. 9. Model Order Selection
     Two Extreme Approaches
       • How many kernels are needed to describe the distribution?
       • In [Figueiredo and Jain,02] it is proposed to perform EM for different values of K and take the one optimizing ML and an MDL-like criterion. Starting from a high K, kernel fusions are performed if needed. Local optima arise.
       • In EBEM [Peñalver et al.,09] we show that it is possible to apply MDL more efficiently and robustly by starting from a unique kernel and splitting only if the underlying data is not Gaussian. The main challenge of this approach is how to estimate Gaussianity for multi-dimensional data.
  10. 10. Model Order Selection (2)
     MDL
     Minimum Description Length and related principles choose a representation of the data that allows us to express them with the shortest possible message from a postulated set of models. Rissanen's MDL amounts to minimizing
     $$C_{MDL}(\Theta^{(K)}, K) = -L(\Theta^{(K)}, Y) + \frac{N(K)}{2} \log n,$$
     where $N(K)$ is the number of parameters required to define a K-component mixture, and $n$ is the number of samples:
     $$N(K) = (K - 1) + K \left( d + \frac{d(d+1)}{2} \right).$$
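The MDL cost is simple arithmetic once the log-likelihood is known, as this short sketch shows. The function names `num_parameters` and `mdl_cost` are illustrative assumptions.

```python
import math


def num_parameters(n_kernels, dim):
    """N(K) = (K - 1) + K * (d + d(d+1)/2): free priors, means and symmetric covariances."""
    return (n_kernels - 1) + n_kernels * (dim + dim * (dim + 1) // 2)


def mdl_cost(log_lik, n_kernels, dim, n_samples):
    """C_MDL = -L(Theta, Y) + (N(K) / 2) * log n."""
    return -log_lik + 0.5 * num_parameters(n_kernels, dim) * math.log(n_samples)
```

For example, in d = 2 a single kernel needs 5 parameters and three kernels need 17, so at equal log-likelihood the larger model always pays a higher cost.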
  11. 11. Gaussian Deficiency
     Maximum Entropy of a Mixture
     According to the second Gibbs theorem, Gaussian variables have the maximum entropy among all the variables with equal variance. This theoretical maximum entropy for a d-dimensional variable $Y$ only depends on the covariance $\Sigma$ and is given by:
     $$H_{max}(Y) = \frac{1}{2} \log[(2\pi e)^d |\Sigma|].$$
     Therefore, the maximum entropy of the mixture is given by
     $$H_{max}(Y) = \sum_{k=1}^{K} \pi_k H_{max}(k).$$
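A small sketch of the two entropy formulas above. To stay within the standard library, the covariance determinant $|\Sigma|$ is assumed to be computed elsewhere and passed in; the function names are illustrative.

```python
import math


def gaussian_max_entropy(dim, det_sigma):
    """H_max = 0.5 * log((2*pi*e)^d * |Sigma|), in nats."""
    if det_sigma <= 0.0:
        raise ValueError("covariance determinant must be positive")
    return 0.5 * (dim * math.log(2.0 * math.pi * math.e) + math.log(det_sigma))


def mixture_max_entropy(weights, dim, dets):
    """H_max(Y) = sum_k pi_k * H_max(k), given each kernel's covariance determinant."""
    return sum(w * gaussian_max_entropy(dim, det) for w, det in zip(weights, dets))
```

For a univariate kernel with unit variance, `gaussian_max_entropy(1, 1.0)` evaluates to 0.5 log(2πe), roughly 1.419 nats.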
  12. 12. Gaussian Deficiency (2)
     Gaussian Deficiency
     Instead of using the MDL principle we may compare the estimated entropy of the underlying data with the entropy of a Gaussian. We define the Gaussianity Deficiency (GD) of the whole mixture as the normalized weighted sum of the differences between the maximum and the real entropy of each kernel:
     $$GD = \sum_{k=1}^{K} \pi_k \frac{H_{max}(k) - H_{real}(k)}{H_{max}(k)} = \sum_{k=1}^{K} \pi_k \left( 1 - \frac{H_{real}(k)}{H_{max}(k)} \right),$$
     where $H_{real}(k)$ is the real entropy of the data under the k-th kernel. We have $0 \le GD \le 1$ (0 iff Gaussian). If the GD is low enough we may stop the algorithm.
  13. 13. Gaussian Deficiency (3)
     Kernel Selection
     If the GD ratio is below a given threshold, we consider that all kernels are well fitted. Otherwise, we select the kernel with the highest individual ratio and replace it by two other kernels that are conveniently placed and initialized. Then, a new EM epoch with $K + 1$ kernels starts. The worst kernel is given by
     $$k^* = \arg\max_k \, \pi_k \frac{H_{max}(k) - H_{real}(k)}{H_{max}(k)}.$$
     Independently of using MDL or GD, in order to decide which kernel should be split into two other kernels (if needed), we compute the latter expression, and we decide whether to split $k^*$ according to MDL or GD.
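The Gaussianity Deficiency and the selection of the worst kernel can be sketched as below. The per-kernel entropies are assumed to be already estimated (for instance with an entropic-graph estimator) and positive; all names and numbers are illustrative assumptions.

```python
def kernel_deficiencies(weights, h_max, h_real):
    """Per-kernel terms pi_k * (H_max(k) - H_real(k)) / H_max(k); assumes H_max(k) > 0."""
    return [w * (hm - hr) / hm for w, hm, hr in zip(weights, h_max, h_real)]


def gaussianity_deficiency(weights, h_max, h_real):
    """GD of the whole mixture: the sum of the per-kernel terms."""
    return sum(kernel_deficiencies(weights, h_max, h_real))


def worst_kernel(weights, h_max, h_real):
    """Index k* of the kernel with the largest weighted deficiency."""
    terms = kernel_deficiencies(weights, h_max, h_real)
    return max(range(len(terms)), key=terms.__getitem__)


# Illustrative entropies: kernel 0 is perfectly Gaussian, kernel 1 is not.
pi, h_max, h_real = [0.6, 0.4], [2.0, 2.0], [2.0, 1.0]
gd = gaussianity_deficiency(pi, h_max, h_real)
k_star = worst_kernel(pi, h_max, h_real)
```

Here only the second kernel contributes to the deficiency, so it is the candidate for splitting.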
  14. 14. Split Process
     Split Constraints
     The $k^*$ component must be decomposed into the kernels $k_1$ and $k_2$ with parameters $\Theta_{k_1} = (\mu_{k_1}, \Sigma_{k_1})$ and $\Theta_{k_2} = (\mu_{k_2}, \Sigma_{k_2})$. In multivariate settings, the corresponding priors, mean vectors and covariance matrices should satisfy the following split equations:
     $$\pi_* = \pi_1 + \pi_2,$$
     $$\pi_* \mu_* = \pi_1 \mu_1 + \pi_2 \mu_2,$$
     $$\pi_* (\Sigma_* + \mu_* \mu_*^T) = \pi_1 (\Sigma_1 + \mu_1 \mu_1^T) + \pi_2 (\Sigma_2 + \mu_2 \mu_2^T).$$
     Clearly, the split move is an ill-posed problem because the number of equations is less than the number of unknowns.
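The split equations can be read in the opposite direction: given two child kernels, they determine the parent uniquely. The following univariate sketch uses that reading to check whether a candidate split preserves the parent's moments. The name `parent_from_children` and the numbers are assumptions made for illustration.

```python
def parent_from_children(pi1, mu1, var1, pi2, mu2, var2):
    """Solve the split equations for the parent (pi*, mu*, var*) in the univariate case."""
    pi_star = pi1 + pi2
    mu_star = (pi1 * mu1 + pi2 * mu2) / pi_star
    # Second-moment equation: pi* (var* + mu*^2) = pi1 (var1 + mu1^2) + pi2 (var2 + mu2^2).
    second_moment = (pi1 * (var1 + mu1 ** 2) + pi2 * (var2 + mu2 ** 2)) / pi_star
    return pi_star, mu_star, second_moment - mu_star ** 2


# Two unit-variance children at -1 and +1 correspond to a parent with variance 2.
parent = parent_from_children(0.5, -1.0, 1.0, 0.5, 1.0, 1.0)
```

Many different child pairs map to the same parent, which is exactly why the split direction is ill-posed and needs extra (random) degrees of freedom.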
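  15. 15. Split Process (2)
     Split
     Following [Dellaportas,06], let $\Sigma_* = V_* \Lambda_* V_*^T$. Let also $D$ be a $d \times d$ rotation matrix with orthonormal unit vectors as columns. Then:
     $$\pi_1 = u_1 \pi_*, \qquad \pi_2 = (1 - u_1) \pi_*,$$
     $$\mu_1 = \mu_* - \Big( \sum_{i=1}^{d} u_2^i \sqrt{\lambda_*^i} \, V_*^i \Big) \sqrt{\frac{\pi_2}{\pi_1}}, \qquad \mu_2 = \mu_* + \Big( \sum_{i=1}^{d} u_2^i \sqrt{\lambda_*^i} \, V_*^i \Big) \sqrt{\frac{\pi_1}{\pi_2}},$$
     $$\Lambda_1 = \mathrm{diag}(u_3) \, \mathrm{diag}(\iota - u_2) \, \mathrm{diag}(\iota + u_2) \, \Lambda_* \frac{\pi_*}{\pi_1},$$
     $$\Lambda_2 = \mathrm{diag}(\iota - u_3) \, \mathrm{diag}(\iota - u_2) \, \mathrm{diag}(\iota + u_2) \, \Lambda_* \frac{\pi_*}{\pi_2},$$
     $$V_1 = D V_*, \qquad V_2 = D^T V_*.$$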
  16. 16. Split Process (3)
     Split (cont.)
     The latter spectral split method has a non-evident random component. Here $\iota$ is a $d \times 1$ vector of ones, and $u_1$, $u_2 = (u_2^1, u_2^2, \ldots, u_2^d)^T$ and $u_3 = (u_3^1, u_3^2, \ldots, u_3^d)^T$ are $2d + 1$ random variables needed to build the priors, means and eigenvalues for the new components in the mixture. They are calculated as:
     $$u_1 \sim \beta(2, 2), \quad u_2^1 \sim \beta(1, 2d), \quad u_2^j \sim U(-1, 1), \quad u_3^1 \sim \beta(1, d), \quad u_3^j \sim U(0, 1),$$
     with $j = 2, \ldots, d$, where $U(\cdot,\cdot)$ and $\beta(\cdot,\cdot)$ denote the Uniform and Beta distributions respectively.
  17. 17. Split Process (4)
     Figure: Split of a 2D kernel into two kernels.
  18. 18. EBEM Algorithm
     Alg. 1: EBEM - Entropy Based EM Algorithm
     Input: convergence_th
     K = 1, i = 0, π_1 = 1, Θ_1 = {μ_1, Σ_1}, where μ_1 = (1/N) Σ_{i=1..N} y_i and Σ_1 = (1/(N−1)) Σ_{i=1..N} (y_i − μ_1)(y_i − μ_1)^T
     Final = false
     repeat
         i = i + 1
         repeat
             EM iteration
             Estimate log-likelihood in iteration i: ℓ(Y|Θ(i))
         until |ℓ(Y|Θ(i)) − ℓ(Y|Θ(i − 1))| < convergence_th
         Evaluate H_real(Y) and H_max(Y)
         Select k* with the highest ratio: k* = arg max_k π_k (H_max(k) − H_real(k)) / H_max(k)
         Estimate C_MDL in iteration i: N(K) = (K − 1) + K (d + d(d+1)/2), C_MDL(Θ(i)) = −ℓ(Y|Θ(i)) + (N(K)/2) log n
         if C_MDL(Θ(i)) ≥ C_MDL(Θ(i − 1)) then
             Final = true
             K = K − 1, Θ* = Θ(i − 1)
         else
             Decompose k* into k_1 and k_2
         end
     until Final = true
     Output: Optimal mixture model: K, Θ*
  19. 19. EBEM Algorithm (2)
     Figure: Top: MML (Figueiredo & Jain), Bottom: EBEM
  20. 20. EBEM Algorithm (3)
     Figure: Color Segmentation: EBEM (2nd col.) vs VEM (3rd col.)
  21. 21. EBEM Algorithm (4)
     Table: EM, VEM and EBEM in Color Image Segmentation (PSNR in dB)
     Algorithm     “Forest” (K=5)    “Sunset” (K=7)    “Lighthouse” (K=8)
     Classic EM    5.35 ± 0.39       14.76 ± 2.07      12.08 ± 2.49
     VEM           10.96 ± 0.59      18.64 ± 0.40      15.88 ± 1.08
     EBEM          14.1848 ± 0.35    18.91 ± 0.38      19.4205 ± 2.11
  22. 22. EBEM Algorithm (5)
     EBEM in Higher Dimensions
       • We have also tested the algorithm with the well-known Wine data set, which contains 178 (13-dimensional) instances in 3 classes.
       • The number of samples, 178, is not enough to build the pdf using Parzen's windows method in a 13-dimensional space. With the MST approach (see below), where no pdf estimation is needed, the algorithm has been applied to this data set.
       • After EBEM ends with K = 3, a maximum a posteriori classifier was built. The classification performance was 96.1%. This result is similar to or even better than those reported in the literature.
  23. 23. Entropic Graphs
     EGs and Rényi Entropy
     Entropic Spanning Graphs obtained from data to estimate Rényi's α-entropy [Hero and Michel,02] belong to the “non plug-in” methods for entropy estimation. Rényi's α-entropy of a probability density function $p$ is defined as:
     $$H_\alpha(p) = \frac{1}{1 - \alpha} \ln \int_z p^\alpha(z) \, dz$$
     for $\alpha \in [0, 1[$. The α-entropy converges to the Shannon one, $\lim_{\alpha \to 1} H_\alpha(p) = H(p) \equiv -\int p(z) \ln p(z) \, dz$, so it is possible to obtain the Shannon entropy from Rényi's one if the latter limit is either solved or numerically approximated.
  24. 24. Entropic Graphs (2)
     EGs and Rényi Entropy (cont.)
     Let $G$ be a graph consisting of a set of vertices $X_n = \{x_1, \ldots, x_n\}$, with each $x_i \in \mathbb{R}^d$, and edges $\{e\}$ that connect vertices: $e_{ij} = (x_i, x_j)$. If we denote by $M(X_n)$ the possible sets of edges in the class of acyclic graphs spanning $X_n$ (spanning trees), the total edge length functional of the Euclidean power-weighted Minimal Spanning Tree is:
     $$L_\gamma^{MST}(X_n) = \min_{M(X_n)} \sum_{e \in M(X_n)} \|e\|^\gamma$$
     with $\gamma \in [0, d]$ and $\|\cdot\|$ the Euclidean distance. The MST has been used in order to measure the randomness of a set of points.
  25. 25. Entropic Graphs (3)
     EGs and Rényi Entropy (cont.)
     It is intuitive that the length of the MST for uniformly distributed points increases at a greater rate than does the MST spanning a more concentrated, nonuniform set of points. For $d \ge 2$:
     $$H_\alpha(X_n) = \frac{d}{\gamma} \left[ \ln \frac{L_\gamma(X_n)}{n^\alpha} - \ln \beta_{L_\gamma, d} \right]$$
     is an asymptotically unbiased and almost surely consistent estimator of the α-entropy of $p$, where $\alpha = (d - \gamma)/d$ and $\beta_{L_\gamma, d}$ is a constant bias correction for which there are only known approximations and bounds: (i) Monte Carlo simulation of uniform random samples on the unit cube $[0, 1]^d$; (ii) large-d approximation: $(\gamma/2) \ln(d/(2\pi e))$.
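The MST-based estimator can be sketched with a plain O(n²) Prim's algorithm from the standard library. This is a minimal illustration, not the authors' implementation: the names `mst_length` and `renyi_entropy_mst` are assumptions, and the bias term $\ln \beta_{L_\gamma,d}$ must be supplied by the caller (the value 0.0 in the usage line is only a placeholder, not a real correction constant).

```python
import math


def mst_length(points, gamma):
    """Sum of ||e||^gamma over the edges of the Euclidean MST (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    in_tree[0] = True
    # best[i]: distance from point i to its nearest point already in the tree.
    best = [math.dist(points[0], p) for p in points]
    total = 0.0
    for _ in range(n - 1):
        # Attach the closest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u] ** gamma
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return total


def renyi_entropy_mst(points, gamma, log_beta):
    """H_alpha = (d / gamma) * [ln(L_gamma / n^alpha) - ln(beta)], with alpha = (d - gamma) / d."""
    n = len(points)
    dim = len(points[0])
    alpha = (dim - gamma) / dim
    return (dim / gamma) * (math.log(mst_length(points, gamma) / n ** alpha) - log_beta)


# Toy 2-D example; log_beta = 0.0 is a placeholder for the bias correction.
triangle = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
h_alpha = renyi_entropy_mst(triangle, 1.0, 0.0)
```

Because raising edge lengths to a positive power preserves their ordering, the tree found on plain Euclidean distances is also the minimizer of the power-weighted length.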
  26. 26. Entropic Graphs (4)
     Figure: Uniform (left) vs Gaussian (right) distribution's EGs.
  27. 27. Entropic Graphs (5)
     Figure: Extrapolation to Shannon: $\alpha^* = 1 - \frac{a + b \times e^{cd}}{N}$
  28. 28. Variational Bayes
     Problem Definition
     Given $N$ i.i.d. samples $X = \{x^1, \ldots, x^N\}$ of a d-dimensional random variable $X$, their associated hidden variables $Z = \{z^1, \ldots, z^N\}$ and the parameters $\Theta$ of the model, the Bayesian posterior is given by [Watanabe et al.,09]:
     $$p(Z, \Theta|X) = \frac{p(\Theta) \prod_{n=1}^{N} p(x^n, z^n|\Theta)}{\sum_{Z} \int p(\Theta) \prod_{n=1}^{N} p(x^n, z^n|\Theta) \, d\Theta}.$$
     Since the integration w.r.t. $\Theta$ is analytically intractable, the posterior is approximated by a factorized distribution $q(Z, \Theta) = q(Z) q(\Theta)$, and the optimal approximation is the one that minimizes the variational free energy.
  29. 29. Variational Bayes (2)
     Problem Definition (cont.)
     The variational free energy is given by:
     $$L(q) = \sum_{Z} \int q(Z, \Theta) \log \frac{q(Z, \Theta)}{p(Z, \Theta|X)} \, d\Theta - \log \int p(\Theta) \prod_{n=1}^{N} p(x^n|\Theta) \, d\Theta,$$
     where the first term is the Kullback-Leibler divergence between the approximation and the true posterior. As the second term is independent of the approximation, the Variational Bayes (VB) approach reduces to minimizing the latter divergence. Such minimization is addressed in an EM-like process alternating the updating of $q(\Theta)$ and the updating of $q(Z)$.
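  30. 30. Variational Bayes (3)
     Problem Definition (cont.)
     The EM-like process alternating the updating of $q(\Theta)$ and the updating of $q(Z)$ is given by
     $$q(\Theta) \propto p(\Theta) \exp \left\langle \log \prod_{n=1}^{N} p(x^n, z^n|\Theta) \right\rangle_{q(Z)},$$
     $$q(Z) \propto \exp \left\langle \log \prod_{n=1}^{N} p(x^n, z^n|\Theta) \right\rangle_{q(\Theta)}.$$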
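  31. 31. Variational Bayes (4)
     Problem Definition (cont.)
     In [Constantinopoulos and Likas,07], the optimization of the variational free energy yields ($\mathcal{N}(\cdot)$ and $\mathcal{W}(\cdot)$ being respectively the Gaussian and Wishart densities):
     $$q(Z) = \prod_{n=1}^{N} \prod_{k=1}^{s} (r_k^n)^{z_k^n} \prod_{k=s+1}^{K} (\rho_k^n)^{z_k^n},$$
     $$q(\mu) = \prod_{k=1}^{K} \mathcal{N}(\mu_k|m_k, \Sigma_k),$$
     $$q(\Sigma) = \prod_{k=1}^{K} \mathcal{W}(\Sigma_k|\nu_k, V_k),$$
     $$q(\beta) = \Big( 1 - \sum_{k=1}^{s} \pi_k \Big)^{-K+s} \cdot \frac{\Gamma\big( \sum_{k=s+1}^{K} \tilde{\gamma}_k \big)}{\prod_{k=s+1}^{K} \Gamma(\tilde{\gamma}_k)} \cdot \prod_{k=s+1}^{K} \left( \frac{\pi_k}{1 - \sum_{k=1}^{s} \pi_k} \right)^{\tilde{\gamma}_k - 1}.$$
     After optimizing the free energy w.r.t. $q(\cdot)$, the method proceeds to update the coefficients in $\alpha$, which denote the free components.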
  32. 32. Model Selection in VB
     Fixed and Free Components
       • In the latter framework, it is assumed that a number of $K - s$ components fit the data well in their region of influence (fixed components), and then model order selection is posed in terms of optimizing the parameters of the remaining $s$ (free components).
       • Let $\alpha = \{\pi_k\}_{k=1}^{s}$ be the coefficients of the free components and $\beta = \{\pi_k\}_{k=s+1}^{K}$ the coefficients of the fixed components.
       • Under the i.i.d. sampling assumption, the prior distribution of $Z$ given $\alpha$ and $\beta$ can be modeled by a product of multinomials:
     $$p(Z|\alpha, \beta) = \prod_{n=1}^{N} \prod_{k=1}^{s} \pi_k^{z_k^n} \prod_{k=s+1}^{K} \pi_k^{z_k^n}.$$
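  33. 33. Model Selection in VB (2)
     Fixed and Free Components (cont.)
       • Moreover, assuming conjugate Dirichlet priors over the set of mixing coefficients, we have that
     $$p(\beta|\alpha) = \Big( 1 - \sum_{k=1}^{s} \pi_k \Big)^{-K+s} \cdot \frac{\Gamma\big( \sum_{k=s+1}^{K} \gamma_k \big)}{\prod_{k=s+1}^{K} \Gamma(\gamma_k)} \cdot \prod_{k=s+1}^{K} \left( \frac{\pi_k}{1 - \sum_{k=1}^{s} \pi_k} \right)^{\gamma_k - 1}.$$
       • Then, considering fixed coefficients, $\Theta$ is redefined as $\Theta = \{\mu, \Sigma, \beta\}$ and we have the following factorization:
     $$q(Z, \Theta) = q(Z) q(\mu) q(\Sigma) q(\beta).$$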
  34. 34. Model Selection in VB (3)
     Kernel Splits
       • In [Constantinopoulos and Likas,07], the VBgmm method is used for training an initial K = 2 model. Then, in the so-called VBgmmSplit, they proceed by sorting the obtained kernels and then trying to split them recursively.
       • Each splitting consists of:
           • Removing the original component.
           • Replacing it by two kernels with the same covariance matrix as the original but with means placed in opposite directions along the maximum variability direction.
  35. 35. Model Selection in VB (4)
     Kernel Splits (cont.)
       • Independently of the split strategy, the critical point of VBgmmSplit is the number of splits needed until convergence. At each iteration of the latter algorithm the K currently existing kernels are split. Consider the case in which any split is detected as proper (non-zero π after running the VB update described in the previous section, where each new kernel is considered as free). Then, the number of components increases and a new set of splitting tests starts in the next iteration.
       • This means that if the algorithm stops (all splits failed) with K kernels, the number of splits has been 1 + 2 + . . . + K = K(K + 1)/2.
  36. 36. Model Selection in VB (5)
     EBVS Split
       • We split only one kernel per iteration. In order to do so, we implement a selection criterion based on measuring the entropy of the kernels.
       • If one uses Leonenko's estimator then there is no need for extrapolation as in EGs, and asymptotic consistency is ensured.
       • Then, at each iteration of the algorithm we select the worst kernel, in terms of low entropy, to be split. If the split is successful we will have K + 1 kernels to feed the VB optimization in the next iteration. Otherwise, there is no need to add a new kernel and the process converges to K kernels.
       • The key point here is that the overall process is linear (one split per iteration).
  37. 37. EBVS: Fast BV
     Figure: EBVS Results
  38. 38. EBVS: Fast BV (2)
     Figure: EBVS Results (more)
  39. 39. EBVS: Fast BV (3)
     MD Experiments
       • With this approach using Leonenko's estimator, the classification performance we obtain on this data set is 86%.
       • Although experiments in higher dimensions can be performed, when the number of samples is not high enough, the risk of unbounded maxima of the likelihood function is higher, due to singular covariance matrices.
       • The entropy estimation method, however, performs very well with thousands of dimensions.
  40. 40. Conclusions
     Summarizing Ideas in GMs
       • In the multi-dimensional case, efficient entropy estimators become critical.
       • In VB, where model-order selection is implicit, it is possible to reduce the complexity by at least one order of magnitude.
       • The same approach can be used for shapes in 2D and 3D.
       • Once we have the mixtures, new measures for comparing them are waiting to be discovered and used. Let's do it!
  41. 41. References
     [Goldberger et al.,03] Goldberger, J., Gordon, S., Greenspan, H. (2003). An Efficient Image Similarity Measure Based on Approximations of KL-Divergence Between Two Gaussian Mixtures. ICCV'03.
     [Figueiredo and Jain,02] Figueiredo, M. and Jain, A. (2002). Unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 381–399.
     [Constantinopoulos and Likas,07] Constantinopoulos, C. and Likas, A. (2007). Unsupervised Learning of Gaussian Mixtures based on Variational Component Splitting. IEEE Trans. Neural Networks, vol. 18, no. 3, pp. 745–755.
  42. 42. References (2)
     [Peñalver et al.,09] Peñalver, A., Escolano, F., Sáez, J.M. Learning Gaussian Mixture Models with Entropy Based Criteria. IEEE Trans. on Neural Networks, 20(11), 1756–1771.
     [Dellaportas,06] Dellaportas, P. and Papageorgiou, I. (2006). Multivariate mixtures of normals with unknown number of components. Statistics and Computing, vol. 16, no. 1, pp. 57–68.
     [Hero and Michel,02] Hero, A. and Michel, O. (2002). Applications of spanning entropic graphs. IEEE Signal Processing Magazine, vol. 19, no. 5, pp. 85–95.
     [Watanabe et al.,09] Watanabe, K., Akaho, S., Omachi, S. Variational Bayesian mixture model on a subspace of exponential family distributions. IEEE Transactions on Neural Networks, 20(11), 1783–1796.
  43. 43. References (3)
     [Escolano et al.,10] Escolano, F., Peñalver, A. and Bonev, B. (2010). Entropy-based Variational Scheme for Fast Bayes Learning of Gaussian Mixtures. SSPR'2010 (accepted).
     [Rajwade et al.,09] Rajwade, A., Banerjee, A., Rangarajan, A. (2009). Probability Density Estimation Using Isocontours and Isosurfaces: Applications to Information-Theoretic Image Registration. IEEE Trans. Pattern Anal. Mach. Intell., 31(3), 475–491.
     [Chen et al.,10] Chen, T., Vemuri, B.C., Rangarajan, A., Eisenschenk, S.J. (2010). Group-wise Point-set registration using a novel CDF-based Havrda-Charvát Divergence. Int J Comput Vis., 86(1), 111–124.