
# Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Computational Information Geometry on Matrix Manifolds

http://cdsagenda5.ictp.trieste.it/full_display.php?ida=a12193


### Computational Information Geometry on Matrix Manifolds (ICTP 2013)

1. Computational Information Geometry on Matrix Manifolds. Frank Nielsen, Frank.Nielsen@acm.org, www.informationgeometry.org, Sony Computer Science Laboratories, Inc. July 2013, ICTP, Trieste, Italy.
2. Geometry of matrix manifolds...
   - Euclidean geometry, Frobenius norm → distance: $\|M\|_F^2 = \sum_{i,j} m_{ij}^2 = \sum_i \|M_{i*}\|_2^2 = \sum_j \|M_{*j}\|_2^2 = \mathrm{tr}(M^\top M)$
   - Riemannian geometry of symmetric positive definite (SPD) matrices [9, 2]
   - Riemannian geometry of rank-deficient positive semi-definite (SPSD) matrices: Stiefel/Grassmann manifolds [3]
   - Quantum geometry: SPD matrices with unit trace
   “One geometry cannot be more true than another; it can only be more convenient” (Jules Henri Poincaré, 1902)
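A quick numerical check of these Frobenius-norm identities, as a minimal NumPy sketch (the random matrix is illustrative data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 4))

# ||M||_F^2 four ways: entrywise sum, row norms, column norms, trace form.
fro2_entries = np.sum(M**2)
fro2_rows    = np.sum(np.linalg.norm(M, axis=1)**2)
fro2_cols    = np.sum(np.linalg.norm(M, axis=0)**2)
fro2_trace   = np.trace(M.T @ M)

assert np.allclose([fro2_rows, fro2_cols, fro2_trace], fro2_entries)
print(np.sqrt(fro2_entries), np.linalg.norm(M, 'fro'))  # same value
```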
3. Forthcoming conference: Geometric Science of Information (GSI), 28th-30th August 2013, Paris.
4. What is Computational Information Geometry?
   - What is Information? The essence of data (datum = “thing”); make it tangible, e.g., as the parameters of generative models.
   - Can we do intrinsic computing? (unbiased by any particular “data representation” → same results after recoding the data)
   - Geometry → the science of invariance (mother of Science; compass & ruler; Descartes' analytic/Cartesian coordinates; imaginaries, ...) ... the open-ended poetic mathematics!
5. Rationale for Computational Information Geometry
   - Information is ... never void! → lower bounds:
     - Fisher information and the Cramér-Rao lower bound (estimation)
     - Bayes error and Chernoff information (classification)
     - Coding and Shannon entropy (communication)
     - Program length and Kolmogorov complexity (compression; unfortunately not computable!)
   - Geometry:
     - Language (point, line, ball, dimension, orthogonal, projection, geodesic, immersion, etc.)
     - Power of characterization (e.g., the intersection of two pseudo-segments does not admit a closed-form expression)
   - Computing: information computing. Seeking mathematical convenience and mathematical tricks (RKHS in ML). How to manipulate “spaces of functions”?
6. Example I: Matrix manifolds. Pattern = Gaussian mixture models (a universal class). Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL). Invariance: for $x_i \sim N(\mu_i, \Sigma_i)$ and $y = A(x) = Lx + t$, we get $y_i \sim N(L\mu_i + t, L\Sigma_i L^\top)$ and $D(X_1 : X_2) = D(Y_1 : Y_2)$ ($A$: any affine transformation with invertible linear part $L$ and translation $t$). Shape Retrieval using Hierarchical Total Bregman Soft Clustering [7], IEEE PAMI, 2012.
7. Example II: Matrix manifolds. DTI: diffusion ellipsoids, tensor interpolation. Pattern = zero-centered “Gaussians”. Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL). Invariance: $D(A^\top P A : A^\top Q A) = D(P : Q)$ for $A \in SL(d)$ (volume/orientation preserving). (Figure: 3D rat corpus callosum.) Total Bregman Divergence and its Applications to DTI Analysis [20], IEEE TMI, 2011.
8. Example III: Gaussian manifolds. Consider 5D Gaussian mixture models (GMMs) of color images (image = RGBxy point set). A Gaussian mixture model $\sum_i w_i N(\mu_i, \Sigma_i)$ is interpreted as a weighted point set $\{\theta_i = (\mu_i, \Sigma_i)\}_i$.
9. Matrix center points & clustering. Aggregation (matrix quantization for codebooks): given a data set of matrices $\mathcal{M} = \{M_1, \ldots, M_n\} \subset \mathbb{M}$, compute a center matrix $C$. Centering as a variational minimization problem:
$$(\mathrm{OPT}): \quad C_p = \arg\min_{C \in \mathbb{M}} \sum_i w_i\, \mathrm{distance}^p(C, M_i)$$
Notion of centrality, robustness to outliers? For diagonal matrices with the “Euclidean” distance, the usual geometric center points:
   - median ($p = 1$): robust to outliers (Fermat-Weber point, no closed form),
   - centroid ($p = 2$): breakdown point of $\frac{1}{n}$ (→ tBD),
   - circumcenter ($p \to \infty$): minimizes the farthest-point distance (minimax [1]).
10. Diffusion Tensor Magnetic Resonance Imaging. DT-MRI measures the anisotropic diffusion of water molecules as a $3 \times 3$ tensor assigned to each voxel position (1990s). Used to analyze in-vivo connectivity patterns of brain tissues: gray matter, white matter (corpus callosum) and cerebrospinal fluid (CSF). Image courtesy of Peter J. Basser (Magnetic resonance imaging of the brain and spine, Chapter 31).
11. Gradiometry tensor: $3 \times 3$ SPSD matrices. Beyond the “constant” $g \simeq 9.81\,\mathrm{m/s^2}$: measuring the anisotropy of the gravity field. → Oil & gas industry. Courtesy of BellGeo: http://www.bellgeo.com/tech/technology_theory_of_FTG.html
12. Structure tensors in computer vision. Pioneered in image processing: a tensor descriptor of a region around a pixel (Harris-Stephens [6]). Consider a kernel $K$ and compute the tensor descriptor
$$T(p = (x, y)) = K * \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix} = \sum_{u,v} w(u, v)\, \nabla I(u, v)\, (\nabla I(u, v))^\top$$
$K$: uniform or Gaussian kernel (e.g., an $s \times s$ window $W$ centered at the pixel $p$); $I_x, I_y$: the gradient, i.e., the derivatives of the image. A versatile method: corner detection, optical flow estimation, segmentation, stereo matching, etc. → Tensor image processing.
13. Harris-Stephens structure tensor (1988). Deformation tensor field. Harris-Stephens combined corner/edge detector: $R = \det T - k\,(\mathrm{tr}\, T)^2$ → a measure of tensor anisotropy. The structure tensor represents local orientation (eigenvectors/eigenvalues).
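The structure tensor and the Harris-Stephens response can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, not the authors' implementation; the smoothing scale `sigma=1.5`, the constant `k=0.04`, and the toy image are assumed illustrative choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(I, sigma=1.5, k=0.04):
    """Structure tensor T = K * [Ix^2, IxIy; IxIy, Iy^2] (Gaussian kernel K)
    and Harris-Stephens response R = det(T) - k * tr(T)^2, per pixel."""
    Iy, Ix = np.gradient(I.astype(float))        # image derivatives
    Txx = gaussian_filter(Ix * Ix, sigma)        # smoothed tensor entries
    Tyy = gaussian_filter(Iy * Iy, sigma)
    Txy = gaussian_filter(Ix * Iy, sigma)
    det_T = Txx * Tyy - Txy**2
    tr_T  = Txx + Tyy
    return det_T - k * tr_T**2

# Toy image: a bright square on a dark background; corners give a large R.
I = np.zeros((64, 64)); I[20:44, 20:44] = 1.0
R = harris_response(I)
print(np.unravel_index(np.argmax(R), R.shape))   # near a corner of the square
```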
14. Matrices with the Frobenius metric distance. Matrix space $\mathbb{M}$ with vectorial structure:
$$d_E(P, Q) = \|P - Q\|_F = \sqrt{\mathrm{tr}((P - Q)^\top (P - Q))}$$
Centroid of tensors: $C_E = \frac{1}{n} \sum_{i=1}^n w_i T_i$ → the scalar average of each element of the tensor. Tensor Field Segmentation Using Region Based Active Contour Model [21], ECCV, 2004.
15. Matrix vectorization & computational geometry. Computational geometry on $w \times h$-dimensional matrix spaces w.r.t. the Frobenius distance amounts to computational geometry on a Euclidean vector space of dimension $D = w \times h$ → Voronoi diagrams, smallest enclosing ball, minimum spanning tree, etc. For symmetric matrices we have $D = \frac{d(d+1)}{2}$ degrees of freedom, and vectorize as follows:
$$\|M\|_F^2 = \sum_{i=1}^d \sum_{j=1}^d m_{ij}^2 = \sum_{i=1}^d m_{ii}^2 + 2 \sum_{i=1}^{d-1} \sum_{j=i+1}^d m_{ij}^2 = \|m\|_2^2$$
with $m = [m_{11} \ldots m_{dd}\ \sqrt{2}m_{12} \ldots \sqrt{2}m_{1d} \ldots \sqrt{2}m_{d-1,d}]^\top = \vec{M}$.
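A minimal sketch of this vectorization, checking that the $\sqrt{2}$-weighted half-vectorization preserves the Frobenius norm (the random symmetric matrix is illustrative data):

```python
import numpy as np

def sym_vectorize(M):
    """Half-vectorize a symmetric d x d matrix: diagonal entries as-is,
    off-diagonal entries scaled by sqrt(2), so ||vec(M)||_2 = ||M||_F."""
    d = M.shape[0]
    iu = np.triu_indices(d, k=1)                 # strictly upper-triangular part
    return np.concatenate([np.diag(M), np.sqrt(2.0) * M[iu]])

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)); S = (A + A.T) / 2   # random symmetric matrix
v = sym_vectorize(S)
assert v.size == 4 * 5 // 2                          # D = d(d+1)/2
assert np.isclose(np.linalg.norm(v), np.linalg.norm(S, 'fro'))
```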
16. Matrix functions. From the spectral decomposition $M = U D U^\top$ with $D = \lambda(M) = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ the diagonal matrix of eigenvalues, any real-valued function $x \mapsto f(x)$ extends to matrices as
$$f(M) = U\, \mathrm{diag}(f(\lambda_1), \ldots, f(\lambda_d))\, U^\top$$
Examples: $\log x$, $\exp x$, $|x|$, $x^2$, $x^{\frac{1}{2}}$, etc. $O(d^3)$ factorization complexity.
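A minimal sketch of this spectral extension of scalar functions to symmetric matrices (the test matrix is an assumed example):

```python
import numpy as np

def matrix_fun(M, f):
    """Extend a real function f to a symmetric matrix M = U diag(lambda) U^T
    as f(M) = U diag(f(lambda)) U^T, via an O(d^3) eigendecomposition."""
    lam, U = np.linalg.eigh(M)
    return (U * f(lam)) @ U.T

# Example on an SPD matrix: log then exp recovers the matrix, sqrt squares back.
P = np.array([[2.0, 0.5], [0.5, 1.0]])
L = matrix_fun(P, np.log)
assert np.allclose(matrix_fun(L, np.exp), P)
assert np.allclose(matrix_fun(P, np.sqrt) @ matrix_fun(P, np.sqrt), P)
```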
17. Riemannian cone of SPD matrices. Exponential map from the tangent plane (symmetric matrices $\mathrm{Sym}$) to the manifold cone $\mathcal{C}$: $\exp_P : T_P\mathcal{C} = \mathrm{Sym} \to \mathcal{C}$. Logarithmic map from the cone $\mathcal{C}$ to the tangent planes: $\log_P : \mathcal{C} \to T_P\mathcal{C} = \mathrm{Sym}$, with
$$\log_P(Q) = P^{\frac{1}{2}} \log(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}) P^{\frac{1}{2}}$$
mapping any point $Q \in \mathrm{Sym}^{++}$ to the unique tangent vector at $P$ such that $\gamma_0 = P$ and $\gamma_1 = Q$. Geodesic equation:
$$\gamma_t(P, Q) = P^{\frac{1}{2}} \left(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\right)^t P^{\frac{1}{2}}$$
Geodesic (metric length) distance: $d_R(P, Q) = \|\log(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}})\|_F$.
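A sketch of the geodesic and the Riemannian distance on the SPD cone, using the eigendecomposition-based matrix functions of the previous slide (the matrices P and Q are illustrative; the final assertion checks the midpoint equidistance property):

```python
import numpy as np

def _fun(M, f):                                   # f(M) via eigendecomposition
    lam, U = np.linalg.eigh(M)
    return (U * f(lam)) @ U.T

def spd_geodesic(P, Q, t):
    """gamma_t(P,Q) = P^{1/2} (P^{-1/2} Q P^{-1/2})^t P^{1/2}."""
    Ph  = _fun(P, np.sqrt)
    Pih = _fun(P, lambda x: 1 / np.sqrt(x))
    return Ph @ _fun(Pih @ Q @ Pih, lambda x: x**t) @ Ph

def spd_distance(P, Q):
    """d_R(P,Q) = || log(P^{-1/2} Q P^{-1/2}) ||_F."""
    Pih = _fun(P, lambda x: 1 / np.sqrt(x))
    return np.linalg.norm(_fun(Pih @ Q @ Pih, np.log), 'fro')

P = np.array([[2.0, 0.3], [0.3, 1.0]]); Q = np.array([[1.0, -0.2], [-0.2, 3.0]])
M = spd_geodesic(P, Q, 0.5)                       # geodesic midpoint
assert np.isclose(spd_distance(P, M), spd_distance(M, Q))
```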
18. Riemannian Karcher centroid.
$$d_R(P, Q) = \sqrt{\sum_{i=1}^d \log^2 \lambda_i} = \sqrt{\mathrm{tr}\,\log^2(P^{-1} Q)} = \|\log(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}})\|_F,$$
where the $\lambda_i$'s are the eigenvalues of $P^{-1} Q$ (note that $P^{-1} Q$ is similar to the SPD matrix $Q^{\frac{1}{2}} P^{-1} Q^{\frac{1}{2}}$). The unique mean is characterized by $\sum_{i=1}^n \log(T_i^{-1} C_R) = 0$. Closed-form solution only for $n = 2$ (the geodesic midpoint):
$$C_R(P, Q) = P^{\frac{1}{2}} \left(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\right)^{\frac{1}{2}} P^{\frac{1}{2}},$$
otherwise iterative approximation ($C_R = \lim_{t \to \infty} C_t$):
$$C_{t+1} = C_t \exp\left(\frac{1}{n} \sum_{i=1}^n \log(C_t^{-1} T_i)\right).$$
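A sketch of this fixed-point iteration for the Karcher mean, written in the equivalent symmetrized form $C \leftarrow C^{1/2} \exp\big(\frac{1}{n}\sum_i \log(C^{-1/2} T_i C^{-1/2})\big) C^{1/2}$ so that every iterate stays symmetric; the initialization at the arithmetic mean and the iteration count are assumed choices:

```python
import numpy as np

def _fun(M, f):
    lam, U = np.linalg.eigh(M)
    return (U * f(lam)) @ U.T

def karcher_mean(Ts, iters=50):
    """Fixed-point iteration for the Riemannian (Karcher) mean of SPD matrices,
    in the symmetrized form equivalent to C <- C exp((1/n) sum_i log(C^-1 T_i))."""
    C = sum(Ts) / len(Ts)                         # start at the arithmetic mean
    for _ in range(iters):
        Ch, Cih = _fun(C, np.sqrt), _fun(C, lambda x: 1 / np.sqrt(x))
        S = sum(_fun(Cih @ T @ Cih, np.log) for T in Ts) / len(Ts)
        C = Ch @ _fun(S, np.exp) @ Ch
    return C

rng = np.random.default_rng(2)
Ts = []
for _ in range(5):
    A = rng.standard_normal((3, 3)); Ts.append(A @ A.T + 3 * np.eye(3))
C = karcher_mean(Ts)
# At the mean, sum_i log(C^{-1/2} T_i C^{-1/2}) vanishes:
Cih = _fun(C, lambda x: 1 / np.sqrt(x))
print(np.linalg.norm(sum(_fun(Cih @ T @ Cih, np.log) for T in Ts)))  # ~ 0
```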
19. Riemannian minimax SPD center (circumcenter [1]). Case of $p = \infty$: the center that minimizes the maximum distance. GEO-ALG: start with $c_1 \in \mathcal{P}$ and iteratively update the current circumcenter as $c_{i+1} = \mathrm{Geodesic}(c_i, f_i, \frac{1}{i+1})$, where $f_i$ denotes the farthest point of $\mathcal{P}$ from $c_i$, and $\mathrm{Geodesic}(p, q, t)$ denotes the intermediate point $m$ on the geodesic through $p$ and $q$ such that $\rho(p, m) = t \times \rho(p, q)$. Geodesic: $\gamma_t(P, Q) = P^{\frac{1}{2}} (P^{-\frac{1}{2}} Q P^{-\frac{1}{2}})^t P^{\frac{1}{2}}$. To cut a geodesic at distance $r$, find $t$ such that $\sum_{i=1}^d \log^2 \lambda_i^t = t^2 \sum_{i=1}^d \log^2 \lambda_i = r^2$, that is, $t = r / \sqrt{\sum_{i=1}^d \log^2 \lambda_i}$. Proven core-set property and guaranteed convergence.
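A sketch of GEO-ALG on SPD matrices under the Riemannian distance, assuming the $1/(i+1)$ geodesic step described above (the sample matrices and iteration budget are illustrative):

```python
import numpy as np

def _fun(M, f):
    lam, U = np.linalg.eigh(M)
    return (U * f(lam)) @ U.T

def geodesic(P, Q, t):
    Ph, Pih = _fun(P, np.sqrt), _fun(P, lambda x: 1 / np.sqrt(x))
    return Ph @ _fun(Pih @ Q @ Pih, lambda x: x**t) @ Ph

def dist(P, Q):
    Pih = _fun(P, lambda x: 1 / np.sqrt(x))
    return np.linalg.norm(_fun(Pih @ Q @ Pih, np.log), 'fro')

def riemannian_minimax_center(Ps, iters=200):
    """GEO-ALG sketch [1]: walk a 1/(i+1) fraction along the geodesic toward
    the current farthest point; converges to the minimax (circum)center."""
    c = Ps[0]
    for i in range(1, iters + 1):
        far = max(Ps, key=lambda P: dist(c, P))   # farthest matrix from c
        c = geodesic(c, far, 1.0 / (i + 1))
    return c

rng = np.random.default_rng(3)
Ps = []
for _ in range(6):
    A = rng.standard_normal((2, 2)); Ps.append(A @ A.T + 2 * np.eye(2))
c = riemannian_minimax_center(Ps)
print(max(dist(c, P) for P in Ps))                # approximate minimax radius
```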
20. Matrices as parameters in probability distributions. Exponential families (Gaussian, Wishart, etc.):
$$p(x; \lambda) = p_F(x; \theta) = \exp\left(\langle t(x), \theta \rangle - F(\theta) + k(x)\right).$$
Example: the Poisson distribution $p(x; \lambda) = \frac{\lambda^x}{x!} \exp(-\lambda)$, with
   - the sufficient statistic $t(x) = x$,
   - $\theta = \log \lambda$, the natural parameter,
   - $F(\theta) = \exp \theta$, the log-normalizer → CONVEX,
   - and $k(x) = -\log x!$, the carrier measure (with respect to the counting measure).
21. Gaussians as an exponential family.
$$p(x; \lambda) = p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \sqrt{\det \Sigma}} \exp\left(-\frac{(x - \mu)^\top \Sigma^{-1} (x - \mu)}{2}\right)$$
   - $\theta = (\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}) \in \Theta = \mathbb{R}^d \times K_{d \times d}$, with $K_{d \times d}$ the cone of positive definite matrices,
   - $F(\theta) = \frac{1}{4} \mathrm{tr}(\theta_2^{-1} \theta_1 \theta_1^\top) - \frac{1}{2} \log \det \theta_2 + \frac{d}{2} \log \pi$ → CONVEX,
   - $t(x) = (x, -x x^\top)$,
   - $k(x) = 0$.
   Inner product: composite, the sum of a dot product and a matrix trace: $\langle \theta, \theta' \rangle = \theta_1^\top \theta_1' + \mathrm{tr}(\theta_2^\top \theta_2')$. The coordinate transformation $\tau : \Lambda \to \Theta$ is given for $\lambda = (\mu, \Sigma)$ by
$$\tau(\lambda) = \left(\lambda_2^{-1}\lambda_1, \tfrac{1}{2}\lambda_2^{-1}\right), \qquad \tau^{-1}(\theta) = \left(\tfrac{1}{2}\theta_2^{-1}\theta_1, \tfrac{1}{2}\theta_2^{-1}\right).$$
22. Convex duality: the Legendre transformation.
   - For a strictly convex and differentiable function $F : \mathcal{X} \to \mathbb{R}$: $F^*(y) = \sup_{x \in \mathcal{X}} \{\langle y, x \rangle - F(x)\} = \sup_{x} l_F(y; x)$.
   - The maximum is obtained for $y = \nabla F(x)$: $\nabla_x l_F(y; x) = y - \nabla F(x) = 0 \Rightarrow y = \nabla F(x)$.
   - The maximum is unique by the convexity of $F$ ($\nabla^2 F \succ 0$): $\nabla_x^2 l_F(y; x) = -\nabla^2 F(x) \prec 0$.
   - Convex conjugates: $(F, \mathcal{X}) \Leftrightarrow (F^*, \mathcal{Y})$, with $\mathcal{Y} = \{\nabla F(x) \mid x \in \mathcal{X}\}$.
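A numerical check of the supremum definition, using the quadratic generator $F(x) = x^2$ whose conjugate is $F^*(y) = y^2/4$ (the grid bounds are an assumed truncation of $\mathcal{X}$):

```python
import numpy as np

def conjugate(F, y, xs):
    """Numerical Legendre conjugate F*(y) = sup_x { y*x - F(x) } on a grid."""
    return np.max(y * xs - F(xs))

xs = np.linspace(-10, 10, 200001)
F = lambda x: x**2
for y in [-3.0, 0.5, 2.0]:
    # F(x) = x^2 attains the sup at x = y/2, so F*(y) = y^2/4.
    assert np.isclose(conjugate(F, y, xs), y**2 / 4, atol=1e-6)
```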
23. Legendre duality: geometric interpretation. Consider the epigraph of $F$ as a convex object:
   - convex hull ($V$-representation), versus
   - intersection of half-spaces ($H$-representation).
   The Legendre transform is also called the “slope” transform.
24. Legendre duality & canonical divergence.
   - Convex conjugates have functionally inverse gradients: $\nabla F^{-1} = \nabla F^*$; $\nabla F^*$ may require numerical approximation (not always available in analytical closed form).
   - Involution: $(F^*)^* = F$ with $\nabla F^* = (\nabla F)^{-1}$.
   - The convex conjugate $F^*$ expressed using $(\nabla F)^{-1}$: $F^*(y) = \langle x, y \rangle - F(x)$ with $x = \nabla_y F^*(y)$, i.e., $F^*(y) = \langle (\nabla F)^{-1}(y), y \rangle - F((\nabla F)^{-1}(y))$.
   - The Fenchel-Young inequality is at the heart of the canonical divergence: $F(x) + F^*(y) \geq \langle x, y \rangle$, so
$$A_F(x : y) = A_{F^*}(y : x) = F(x) + F^*(y) - \langle x, y \rangle \geq 0.$$
25. Dual Bregman divergences & canonical divergence [14].
$$\mathrm{KL}(P : Q) = E_P\left[\log \frac{p(x)}{q(x)}\right] = B_F(\theta_Q : \theta_P) = B_{F^*}(\eta_P : \eta_Q) = F(\theta_Q) + F^*(\eta_P) - \langle \theta_Q, \eta_P \rangle = A_F(\theta_Q : \eta_P) = A_{F^*}(\eta_P : \theta_Q)$$
with $\theta_Q$ the natural parameterization and $\eta_P = E_P[t(X)] = \nabla F(\theta_P)$ the moment parameterization. Decomposing
$$\mathrm{KL}(P : Q) = \int p(x) \log \frac{1}{q(x)}\, \mathrm{d}x - \int p(x) \log \frac{1}{p(x)}\, \mathrm{d}x = H^\times(P : Q) - H(p), \qquad H(p) = H^\times(P : P),$$
the Shannon cross-entropy and entropy of an exponential family are [14]:
$$H^\times(P : Q) = F(\theta_Q) - \langle \theta_Q, \nabla F(\theta_P) \rangle - E_P[k(x)]$$
$$H(P) = F(\theta_P) - \langle \theta_P, \nabla F(\theta_P) \rangle - E_P[k(x)] = -F^*(\eta_P) - E_P[k(x)]$$
26. Bregman divergence: geometric interpretation (I). Potential function $F$, graph plot $\mathcal{F} : (x, F(x))$.
$$D_F(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q) \rangle$$
27. Bregman divergence: geometric interpretation (II). Potential function $f$, graph plot $\mathcal{F} : (x, f(x))$.
$$B_f(p \| q) = f(p) - f(q) - (p - q) f'(q)$$
$B_f(\cdot \| q)$: the vertical distance between the hyperplane $H_q$ tangent to $\mathcal{F}$ at the lifted point $\hat{q}$, and the translated hyperplane at $\hat{p}$.
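A minimal sketch of this univariate Bregman divergence with two classical generators (squared Euclidean from $f(x) = x^2$, scalar KL from $f(x) = x \log x$); the generator choices are standard examples, not specific to the slides:

```python
import numpy as np

def bregman(f, grad_f, p, q):
    """B_f(p || q) = f(p) - f(q) - (p - q) f'(q)."""
    return f(p) - f(q) - (p - q) * grad_f(q)

# f(x) = x^2 yields the squared Euclidean distance:
assert np.isclose(bregman(lambda x: x**2, lambda x: 2 * x, 3.0, 1.0), 4.0)

# f(x) = x log x (negative Shannon entropy) yields the scalar KL divergence:
kl = bregman(lambda x: x * np.log(x), lambda x: np.log(x) + 1, 0.3, 0.6)
assert np.isclose(kl, 0.3 * np.log(0.3 / 0.6) - 0.3 + 0.6)
```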
28. Bregman divergence: geometric interpretation (III). Bregman divergences as path integrals:
$$B(\theta_1 : \theta_2) = F(\theta_1) - F(\theta_2) - \langle \theta_1 - \theta_2, \nabla F(\theta_2) \rangle = \int_{\theta_2}^{\theta_1} \langle \nabla F(t) - \nabla F(\theta_2), \mathrm{d}t \rangle = \int_{\eta_1}^{\eta_2} \langle \nabla F^*(t) - \nabla F^*(\eta_1), \mathrm{d}t \rangle = B^*(\eta_2 : \eta_1)$$
29. Matrix Bregman divergences [4, 16]. Choose a real-valued functional generator $F$ and extend it to matrices: $F(X) = \mathrm{tr}(\Psi(X))$ with $\Psi(X) = \sum_{k \geq 0} t_{F,k} X^k$ ($t_{F,k}$ from the Taylor expansion of the real-valued $F$). Then
$$B_F(P : Q) = F(P) - F(Q) - \mathrm{tr}((P - Q)^\top \nabla F(Q)), \qquad \nabla F(X) = \sum_{k \geq 0} t_{F',k} X^k$$
($t_{F',k}$ from the Taylor expansion of the real-valued $F'$).
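A sketch of matrix Bregman divergences for spectral generators $F(X) = \mathrm{tr}(f(X))$, for which $\nabla F(X) = f'(X)$ on symmetric matrices; the von Neumann and Burg generators below appear in the slides, while the test matrices are illustrative:

```python
import numpy as np

def _fun(M, f):
    lam, U = np.linalg.eigh(M)
    return (U * f(lam)) @ U.T

def matrix_bregman(P, Q, f, fprime):
    """B_F(P : Q) = F(P) - F(Q) - tr((P - Q)^T grad F(Q)), where
    F(X) = tr(f(X)) is a spectral generator and grad F(X) = f'(X)."""
    F = lambda X: np.trace(_fun(X, f))
    return F(P) - F(Q) - np.trace((P - Q).T @ _fun(Q, fprime))

P = np.array([[2.0, 0.3], [0.3, 1.0]]); Q = np.array([[1.5, 0.0], [0.0, 1.2]])

# von Neumann generator f(x) = x log x - x -> Umegaki-type relative entropy
d_vn = matrix_bregman(P, Q, lambda x: x * np.log(x) - x, np.log)
# Burg generator f(x) = -log x -> log-det divergence
d_ld = matrix_bregman(P, Q, lambda x: -np.log(x), lambda x: -1.0 / x)
print(d_vn, d_ld)        # both are nonnegative for SPD arguments
```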
30. Matrix Bregman divergences [16]. (Table slide.)
31. Particular case: Bregman Schatten $p$-divergences [5, 16]. Schatten $p$-norm of a real symmetric matrix $X$ (a unitarily invariant matrix norm): $\|X\|_p = \|\lambda(X)\|_p$. Bregman generator: $F(X) = \frac{1}{2}\|X\|_p^2$. Used in regularized convex optimization [5] and matrix data mining [16].
32. Matrix Legendre transformation. Extends the classical Legendre-Fenchel transformation:
$$F^*(\eta) = \sup_{\mathrm{spec}(\theta) \subseteq \mathrm{dom}(F)} \mathrm{tr}(\theta \eta^\top) - F(\theta)$$
$$D_F(\theta_P : \theta_Q) = D_{F^*}(\eta_Q : \eta_P) = F(\theta_P) + F^*(\eta_Q) - \mathrm{tr}(\theta_P \eta_Q^\top)$$
$\theta$ and $\eta$ are dual matrix coordinate systems on the matrix manifold: a non-metric differential structure with dual coordinate systems.
33. Bregman matrix means.
$$B_F(X : P) = F(X) - F(P) - \mathrm{tr}((X - P)^\top \nabla F(P)),$$
with $F(\cdot)$ a strictly convex and differentiable function on an open convex matrix space. Left-sided centroid:
$$C = \nabla F^{-1}\left(\sum_{i=1}^n w_i \nabla F(T_i)\right),$$
a quasi-arithmetic mean for $\nabla F$. Since $B_F(X : P) \neq B_F(P : X)$, also define a right-sided centroid $M'$: the center of mass [13] (independent of the generator $F$). Generators:
   - $F(X) = \mathrm{tr}(X^\top X)$: the quadratic matrix entropy,
   - $F(X) = -\log \det X$: the matrix Burg entropy, and
   - $F(X) = \mathrm{tr}(X \log X - X)$: the von Neumann entropy [19, 18, 15] (Umegaki quantum relative entropy).
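A sketch of the left-sided (quasi-arithmetic) matrix centroid $C = \nabla F^{-1}(\sum_i w_i \nabla F(T_i))$ for three generators; gradients are applied spectrally via the eigendecomposition, the data and weights are illustrative, and the quadratic generator is halved (which only rescales its gradient):

```python
import numpy as np

def _fun(M, f):
    lam, U = np.linalg.eigh(M)
    return (U * f(lam)) @ U.T

def bregman_left_centroid(Ts, ws, grad, grad_inv):
    """Quasi-arithmetic matrix mean C = (grad F)^{-1}( sum_i w_i grad F(T_i) )."""
    G = sum(w * _fun(T, grad) for w, T in zip(ws, Ts))
    return _fun(G, grad_inv)

rng = np.random.default_rng(4)
Ts = []
for _ in range(4):
    A = rng.standard_normal((3, 3)); Ts.append(A @ A.T + 2 * np.eye(3))
ws = [0.25] * 4

# F(X) = tr(X^T X)/2 : grad F = X        -> arithmetic mean
# F(X) = -log det X  : grad F = -X^{-1}  -> harmonic matrix mean
# F(X) = tr(X log X - X): grad F = log X -> log-Euclidean-type mean
arith = bregman_left_centroid(Ts, ws, lambda x: x, lambda x: x)
harm  = bregman_left_centroid(Ts, ws, lambda x: -1 / x, lambda x: -1 / x)
logE  = bregman_left_centroid(Ts, ws, np.log, np.exp)
print(np.trace(arith), np.trace(harm), np.trace(logE))
```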
34. Total Bregman divergences (tBD). Instead of the “vertical” projection in the Bregman divergence, consider the perpendicular (orthogonal) projection, in analogy with least squares versus total least squares regression:
$$tB_F(P, Q) = \frac{B_F(P, Q)}{\sqrt{1 + \|\nabla F(Q)\|^2}}$$
→ proven statistically robust. Applications to robust DT-MRI segmentation [8].
35. Matrix Jensen/Burbea-Rao divergences [10]. The convexity gap defines a divergence:
$$BR_F(P, Q) = \frac{F(P) + F(Q)}{2} - F\left(\frac{P + Q}{2}\right) \geq 0$$
   - $F(X) = \mathrm{tr}(X^\top X)$: the quadratic matrix entropy,
   - $F(X) = -\log \det X$: the matrix Burg entropy,
   - $F(X) = \mathrm{tr}(X \log X - X)$: the von Neumann entropy,
   - etc.
36. Smooth family of convex generators [12, 17]. A 1-parameter family of generators:
$$F_\alpha(X) = \frac{1}{\alpha(1 - \alpha)} \mathrm{tr}(\alpha X - X^\alpha + (1 - \alpha) I), \quad \alpha \notin \{0, 1\}$$
$$B_\alpha(P : Q) = \frac{1}{\alpha(1 - \alpha)} \mathrm{tr}\left(Q^\alpha - P^\alpha + \alpha Q^{\alpha - 1}(P - Q)\right)$$
$$\nabla F_\alpha(X) = \frac{1}{1 - \alpha}(I - X^{\alpha - 1}), \qquad \nabla F_\alpha^{-1}(X) = (I - (1 - \alpha)X)^{\frac{1}{\alpha - 1}}$$
When $\alpha \to 1$, $\nabla F_\alpha(X) \to \nabla F_1(X) = \log X$. When $\alpha \to 0$, $\nabla F_\alpha(X) \to \nabla F_0(X) = I - X^{-1}$.
   - $\alpha = 2$: quadratic matrix information
   - $\alpha \to 1$: von Neumann information
   - $\alpha \to 0$: Burg log-det information
37. Jensen (Burbea-Rao) divergences. Based on Jensen's inequality for a strictly convex function $F(\cdot)$:
$$BR_F(X, P) \stackrel{\mathrm{def}}{=} \frac{F(X) + F(P)}{2} - F\left(\frac{X + P}{2}\right) \geq 0.$$
Includes the special case of the Jensen-Shannon divergence:
$$JS(p, q) = H\left(\frac{p + q}{2}\right) - \frac{H(p) + H(q)}{2}$$
with $F(x) = -H(x)$, the negative Shannon entropy $H(x) = -x \log x$. → Generators are convex and entropies are concave (negative generators).
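A minimal sketch checking that the Jensen divergence with the negative Shannon entropy as generator reproduces the Jensen-Shannon divergence (the two discrete distributions are illustrative):

```python
import numpy as np

def jensen_divergence(F, p, q):
    """BR_F(p, q) = (F(p) + F(q))/2 - F((p + q)/2)."""
    return (F(p) + F(q)) / 2 - F((p + q) / 2)

neg_H = lambda p: np.sum(p * np.log(p))      # F = -H (negative Shannon entropy)
p = np.array([0.7, 0.2, 0.1]); q = np.array([0.3, 0.3, 0.4])
js = jensen_divergence(neg_H, p, q)

# Cross-check against the defining formula H((p+q)/2) - (H(p)+H(q))/2:
H = lambda p: -np.sum(p * np.log(p))
assert np.isclose(js, H((p + q) / 2) - (H(p) + H(q)) / 2)
```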
38. Visualizing Burbea-Rao divergences. (Figure slide.) Burbea-Rao divergences include the squared Mahalanobis distance.
39. Burbea-Rao from symmetrizing Bregman divergences [13].
   - Jeffreys-Bregman divergences:
$$S_F(p; q) = \frac{B_F(p, q) + B_F(q, p)}{2} = \frac{1}{2}\langle p - q, \nabla F(p) - \nabla F(q) \rangle$$
   - Jensen-Bregman divergences (diversity index):
$$J_F(p; q) = \frac{B_F\left(p, \frac{p+q}{2}\right) + B_F\left(q, \frac{p+q}{2}\right)}{2} = \frac{F(p) + F(q)}{2} - F\left(\frac{p + q}{2}\right) = BR_F(p, q)$$
40. Skew Burbea-Rao divergences. $BR_F^{(\alpha)} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$:
$$BR_F^{(\alpha)}(p, q) = \alpha F(p) + (1 - \alpha) F(q) - F(\alpha p + (1 - \alpha) q) = BR_F^{(1 - \alpha)}(q, p)$$
Skew symmetrization of Bregman divergences:
$$\alpha B_F(p, \alpha p + (1 - \alpha) q) + (1 - \alpha) B_F(q, \alpha p + (1 - \alpha) q) \stackrel{\mathrm{def}}{=} BR_F^{(\alpha)}(p, q)$$
= skew Jensen-Bregman divergences.
41. Bregman divergences = asymptotic skew Jensen divergences:
$$B_F(p, q) = \lim_{\alpha \to 1} \frac{1}{1 - \alpha} BR_F^{(\alpha)}(p, q), \qquad B_F(q, p) = \lim_{\alpha \to 0} \frac{1}{\alpha} BR_F^{(\alpha)}(p, q)$$
42. Burbea-Rao/Jensen centroids ($p = 1$).
$$\mathrm{OPT}: \quad C_F = \arg\min_X \sum_{i=1}^n w_i BR_F^{(\alpha_i)}(X, T_i) = \arg\min_X L(X)$$
W.l.o.g., equivalent to minimizing
$$E(C) = \left(\sum_{i=1}^n w_i \alpha_i\right) F(C) - \sum_{i=1}^n w_i F(\alpha_i C + (1 - \alpha_i) T_i)$$
The sum $E = F + G$ of a convex $F$ and a concave $G$ ⇒ Convex-ConCave Procedure (CCCP, NIPS*01). Start from an arbitrary $C_0$ and iteratively update as $\nabla F(C_{t+1}) = -\nabla G(C_t)$ ⇒ guaranteed convergence to a (local) minimum.
43. ConCave Convex Procedure (CCCP). $\min_x E(x) = F(x) + G(x)$; iterate $\nabla F(c_{t+1}) = -\nabla G(c_t)$. The decomposition may not be unique...
44. Iterative algorithm for Burbea-Rao centroids. Apply the CCCP scheme:
$$\nabla F(C_{t+1}) = \frac{1}{\sum_{i=1}^n w_i \alpha_i} \sum_{i=1}^n w_i \alpha_i \nabla F(\alpha_i C_t + (1 - \alpha_i) T_i)$$
$$C_{t+1} = \nabla F^{-1}\left(\frac{1}{\sum_{i=1}^n w_i \alpha_i} \sum_{i=1}^n w_i \alpha_i \nabla F(\alpha_i C_t + (1 - \alpha_i) T_i)\right)$$
This yields arbitrarily fine approximations of the (skew) Burbea-Rao matrix centroids and barycenters.
45. Special case: α-log det divergence [15, 11]. Cone of Hermitian positive definite matrices (self-adjoint matrices $M^H = \bar{M}^\top = M$). $F(X) = -\log \det X$, $\nabla F(X) = \nabla F^{-1}(X) = -X^{-1}$. Burbea-Rao α-log det divergences:
$$D_{ld}^{(\alpha)}(P, Q) = \begin{cases} \mathrm{tr}(Q^{-1} P - I) - \log \det(Q^{-1} P) & \alpha = 1 \\ \frac{4}{1 - \alpha^2} \log \frac{\det\left(\frac{1 - \alpha}{2} P + \frac{1 + \alpha}{2} Q\right)}{(\det P)^{\frac{1 - \alpha}{2}} (\det Q)^{\frac{1 + \alpha}{2}}} & \alpha \in \mathbb{R} \setminus \{-1, 1\} \\ \mathrm{tr}(P^{-1} Q - I) - \log \det(P^{-1} Q) & \alpha = -1 \end{cases}$$
Start with $C_1 = \frac{1}{n} \sum_{i=1}^n T_i$ and iterate
$$C_{t+1} = \left(\frac{1}{n} \sum_{i=1}^n \left(\frac{1 - \alpha}{2} T_i + \frac{1 + \alpha}{2} C_t\right)^{-1}\right)^{-1}$$
→ unique global mean (obtained from CCCP).
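A sketch of this CCCP fixed point for the α-log-det mean, assuming the update shown above with uniform weights (the normalization by $n$ and the iteration count are assumed choices; $\alpha = 0$ gives the symmetric Jensen log-det case):

```python
import numpy as np

def alpha_logdet_mean(Ts, alpha=0.0, iters=100):
    """CCCP fixed point for the alpha-log-det centroid (a sketch of the
    update C <- [ (1/n) sum_i ((1-a)/2 T_i + (1+a)/2 C)^{-1} ]^{-1})."""
    n = len(Ts)
    C = sum(Ts) / n                                  # start at the arithmetic mean
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    for _ in range(iters):
        C = np.linalg.inv(sum(np.linalg.inv(a * T + b * C) for T in Ts) / n)
    return C

rng = np.random.default_rng(5)
Ts = []
for _ in range(5):
    A = rng.standard_normal((3, 3)); Ts.append(A @ A.T + np.eye(3))
C = alpha_logdet_mean(Ts, alpha=0.0)                 # symmetric case
print(np.linalg.eigvalsh(C))                         # SPD output
```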
46. Bhattacharyya coefficients/distances. Bhattacharyya coefficient and non-metric distance:
$$C(p, q) = \int \sqrt{p(x) q(x)}\, \mathrm{d}x, \quad 0 < C(p, q) \leq 1, \qquad B(p, q) = -\ln C(p, q)$$
(the coefficient is always strictly positive). Hellinger metric:
$$H(p, q) = \sqrt{\frac{1}{2} \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 \mathrm{d}x}, \quad 0 \leq H(p, q) \leq 1,$$
$$H(p, q) = \sqrt{\frac{1}{2}\left(\int p(x)\, \mathrm{d}x + \int q(x)\, \mathrm{d}x - 2 \int \sqrt{p(x)} \sqrt{q(x)}\, \mathrm{d}x\right)} = \sqrt{1 - C(p, q)}.$$
47. Chernoff coefficients/α-divergences. Skew Bhattacharyya divergences based on Chernoff α-coefficients:
$$B_\alpha(p, q) = -\ln \int_x p^\alpha(x) q^{1 - \alpha}(x)\, \mathrm{d}x = -\ln \int_x q(x) \left(\frac{p(x)}{q(x)}\right)^\alpha \mathrm{d}x = -\ln E_q[L^\alpha(x)] = -\ln C_\alpha(p, q)$$
Amari α-divergence:
$$D_\alpha(p \| q) = \begin{cases} \frac{4}{1 - \alpha^2}\left(1 - \int p(x)^{\frac{1 - \alpha}{2}} q(x)^{\frac{1 + \alpha}{2}}\, \mathrm{d}x\right) & \alpha \neq \pm 1 \\ \int p(x) \log \frac{p(x)}{q(x)}\, \mathrm{d}x = \mathrm{KL}(p, q) & \alpha = -1 \\ \int q(x) \log \frac{q(x)}{p(x)}\, \mathrm{d}x = \mathrm{KL}(q, p) & \alpha = 1 \end{cases}$$
$D_\alpha(p \| q) = D_{-\alpha}(q \| p)$. Remapping $\alpha' = \frac{1 - \alpha}{2}$ (i.e., $\alpha = 1 - 2\alpha'$) yields the Chernoff $\alpha'$-divergences.
48. Bhattacharyya/Chernoff for exponential families [10]. Equivalence with skew Burbea-Rao distances:
$$B_\alpha(p_F(x; \theta_p), p_F(x; \theta_q)) = BR_F^{(\alpha)}(\theta_p, \theta_q) = \alpha F(\theta_p) + (1 - \alpha) F(\theta_q) - F(\alpha \theta_p + (1 - \alpha) \theta_q)$$
The Bhattacharyya divergence between probability distributions of an exponential family amounts to computing a Jensen divergence on their parameters.
49. Closed-form Bhattacharyya distances for exponential families. A generic formula that instantiates into well-known formulas of statistical pattern recognition ($BR_F(\lambda_p, \lambda_q) = BR_F(\tau(\lambda_p), \tau(\lambda_q))$):

| Exp. fam. | $F(\theta)$ (up to a constant) | Bhattacharyya/Burbea-Rao |
| --- | --- | --- |
| Multinomial | $\log(1 + \sum_{i=1}^{d-1} \exp \theta_i)$ | $-\ln \sum_{i=1}^d \sqrt{p_i q_i}$ |
| Poisson | $\exp \theta$ | $\frac{1}{2}(\sqrt{\mu_p} - \sqrt{\mu_q})^2$ |
| Gaussian (univariate) | $-\frac{\theta_1^2}{4\theta_2} + \frac{1}{2}\log\left(-\frac{\pi}{\theta_2}\right)$ | $\frac{1}{4}\frac{(\mu_p - \mu_q)^2}{\sigma_p^2 + \sigma_q^2} + \frac{1}{2}\ln\frac{\sigma_p^2 + \sigma_q^2}{2\sigma_p \sigma_q}$ |
| Gaussian (multivariate) | $\frac{1}{4}\mathrm{tr}(\Theta^{-1}\theta\theta^\top) - \frac{1}{2}\log\det\Theta$ | $\frac{1}{8}(\mu_p - \mu_q)^\top\left(\frac{\Sigma_p + \Sigma_q}{2}\right)^{-1}(\mu_p - \mu_q) + \frac{1}{2}\ln\frac{\det\frac{\Sigma_p + \Sigma_q}{2}}{\sqrt{\det\Sigma_p \det\Sigma_q}}$ |
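As a worked instance of the last table row, a minimal sketch of the closed-form Bhattacharyya distance between two multivariate Gaussians (the means and covariances are illustrative data):

```python
import numpy as np

def bhattacharyya_gaussians(mu_p, S_p, mu_q, S_q):
    """Closed-form Bhattacharyya distance between N(mu_p, S_p) and N(mu_q, S_q):
    B = 1/8 (mu_p - mu_q)^T S^{-1} (mu_p - mu_q)
        + 1/2 ln( det S / sqrt(det S_p det S_q) ), with S = (S_p + S_q)/2."""
    S = (S_p + S_q) / 2
    dmu = mu_p - mu_q
    maha = dmu @ np.linalg.solve(S, dmu) / 8
    logdet = 0.5 * np.log(np.linalg.det(S) /
                          np.sqrt(np.linalg.det(S_p) * np.linalg.det(S_q)))
    return maha + logdet

mu_p, mu_q = np.array([0.0, 0.0]), np.array([1.0, 2.0])
S_p = np.array([[1.0, 0.2], [0.2, 1.0]]); S_q = np.array([[2.0, 0.0], [0.0, 0.5]])
print(bhattacharyya_gaussians(mu_p, S_p, mu_q, S_q))   # symmetric in p and q
```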
50. Wrapping up.
   - Besides the Euclidean, log-Euclidean and Riemannian metric-based means, proposed divergence-based matrix centroids,
   - total Bregman divergences and robustness (conformal geometry),
   - the Riemannian minimax center,
   - skew Burbea-Rao/Jensen divergences extending Bregman divergences,
   - Bhattacharyya means of densities = Burbea-Rao means on (matrix) parameters.
   Which mean do you mean, or need?
51. Non-metric matrix manifolds with dually affine connections. In a nutshell:
   - asymmetric (Bregman) non-metric divergences,
   - Legendre transform, convex conjugates & dual divergences,
   - dual $\theta$-, $\eta$-, or mixed coordinate systems,
   - dual closed-form affine geodesics (computationally convenient),
   - Pythagorean theorem.
52. Thank you. www.informationgeometry.org “One geometry cannot be more true than another; it can only be more convenient” (Jules Henri Poincaré, 1902)
53. Bibliographic references I
   [1] Marc Arnaudon and Frank Nielsen. On approximating the Riemannian 1-center. Comput. Geom., 46(1):93-104, 2013.
   [2] Rajendra Bhatia. The Riemannian mean of positive matrices. In Frank Nielsen and Rajendra Bhatia, editors, Matrix Information Geometry, pages 35-51, 2012.
   [3] Silvere Bonnabel and Rodolphe Sepulchre. Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank. SIAM J. Matrix Analysis Applications, 31(3):1055-1070, 2009.
   [4] Inderjit S. Dhillon and Joel A. Tropp. Matrix nearness problems with Bregman divergences. SIAM J. Matrix Anal. Appl., 29(4):1120-1146, November 2007.
   [5] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In Adam Tauman Kalai and Mehryar Mohri, editors, COLT, pages 14-26. Omnipress, 2010.
   [6] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the Fourth Alvey Vision Conference, pages 147-151, 1988.
54. Bibliographic references II
   [7] Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407-2419, 2012.
   [8] Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2407-2419, 2012.
   [9] Maher Moakher. A differential geometric approach to the geometric mean of symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications, 26(3):735-747, 2005.
   [10] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455-5466, 2011.
   [11] Frank Nielsen, Meizhu Liu, Xiaojing Ye, and Baba C. Vemuri. Jensen divergence based SPD matrix means and applications. In International Conference on Pattern Recognition (ICPR), 2012.
   [12] Frank Nielsen and Richard Nock. Quantum Voronoi diagrams and Holevo channel capacity for 1-qubit quantum states. In IEEE International Symposium on Information Theory (ISIT), pages 96-100, 2008.
55. Bibliographic references III
   [13] Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory, 55(6):2882-2904, June 2009.
   [14] Frank Nielsen and Richard Nock. Entropies and cross-entropies of exponential families. In International Conference on Image Processing (ICIP), pages 3621-3624, 2010.
   [15] Richard Nock, Brice Magdalou, Eric Briys, and Frank Nielsen. On tracking portfolios with certainty equivalents on a generalization of Markowitz model: the fool, the wise and the adaptive. In Thorsten Joachims, editor, International Conference on Machine Learning (ICML). Omnipress, 2011.
   [16] Richard Nock, Brice Magdalou, Eric Briys, and Frank Nielsen. Mining matrix data with Bregman matrix divergences for portfolio selection. In Frank Nielsen and Rajendra Bhatia, editors, Matrix Information Geometry, pages 373-402, 2012.
   [17] Masanori Ohya and Dénes Petz. Quantum Entropy and Its Use. 1st ed. 1993; corrected 2nd printing, 2004.
   [18] Koji Tsuda, Gunnar Rätsch, and Manfred K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res., 6:995-1018, December 2005.
56. Bibliographic references IV
   [19] Hisaharu Umegaki. Conditional expectation in an operator algebra. IV. Entropy and information. Kodai Mathematical Seminar Reports, 14(2):59-85, 1962.
   [20] Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, 2011.
   [21] Zhizhou Wang and Baba C. Vemuri. An affine invariant tensor dissimilarity measure and its applications to tensor-valued image segmentation. In CVPR (1), pages 228-233, 2004.