1. Fundamentals of Algorithms and Data-Structures in Information-Geometric Spaces
Frank Nielsen
École Polytechnique, France
Sony Computer Science Laboratories, Inc.
MEXT-ISM Workshop on Information Geometry for Machine Learning
Brain Science Institute, RIKEN
4th December 2014
2014 Frank Nielsen 1/75
2. Brief historical review of Computational Geometry (CG)
◮ Three research periods:
1. Geometric algorithms: Voronoi/Delaunay, minimum spanning trees, data-structures for proximity queries
2. Geometric computing: robustness, algebraic degree of predicates, programs that work/scale!
3. Computational topology: simplicial complexes, filtrations, input = distance matrix
→ paradigm of Topological Data Analysis (TDA)
◮ Showcasing libraries for CG software:
◮ CGAL http://www.cgal.org/ and Geometry Factory http://geometryfactory.com/
◮ Gudhi https://project.inria.fr/gudhi/ and Ayasdi http://www.ayasdi.com/
3. Outline
◮ Review of the basic algorithmic toolbox in computational geometry: Voronoi diagrams and dual Delaunay complexes, spanning balls
◮ Generalizations of those concepts and of the toolbox to information spaces:
◮ Riemannian computational information geometry
◮ Dually affine connections computational information geometry
◮ Applications to clustering, learning mixtures, etc.
What is a good/friendly geometric computing space?
4. Basics of Euclidean Computational Geometry: Voronoi diagrams and dual Delaunay complexes
5. Euclidean (ordinary) Voronoi diagrams
P = {P1, ..., Pn}: n distinct point generators in Euclidean space E^d
V(Pi) = {X : DE(Pi, X) ≤ DE(Pj, X), ∀j ≠ i}
Voronoi diagram = cell complex of the V(Pi)'s with their faces
6. Voronoi diagrams from bisectors and ∩ halfspaces
Bisectors Bi(P, Q) = {X : DE(P, X) = DE(Q, X)}
→ are hyperplanes in Euclidean geometry
Voronoi cells as halfspace intersections:
V(Pi) = {X : DE(Pi, X) ≤ DE(Pj, X), ∀j ≠ i} = ∩_{j≠i} Bi+(Pi, Pj)
DE(P, Q) = ‖θ(P) − θ(Q)‖₂ = √(Σ_{i=1}^d (θi(P) − θi(Q))²)
θ(P) = p: Cartesian coordinate system with θj(Pi) = p_i^(j).
⇒ Many applications of Voronoï diagrams: crystal growth, codebook/quantization, molecule interfaces/docking, motion planning, etc.
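The defining predicate above translates directly into code. A minimal sketch (assuming NumPy; the function name is illustrative) that locates the Voronoi cell V(P_i) containing a query point as the argmin of squared Euclidean distances to the generators:

```python
import numpy as np

def voronoi_cell_index(generators, x):
    """Index i of the Voronoi cell V(P_i) containing the query point x:
    the generator minimizing the Euclidean distance D_E(P_i, x)."""
    d2 = np.sum((generators - x) ** 2, axis=1)  # squared distances suffice
    return int(np.argmin(d2))

# Three generators in E^2
P = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
assert voronoi_cell_index(P, np.array([0.5, 0.5])) == 0
assert voronoi_cell_index(P, np.array([3.5, 1.0])) == 1
```

This point-location-by-argmin view is exactly what the generalizations below replace: swap the squared Euclidean distance for another divergence and the cell structure changes accordingly.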
7. Voronoi diagrams and dual Delaunay simplicial complex
◮ Empty sphere property, max-min angle triangulation, etc.
◮ Voronoi dual ⇔ Delaunay triangulation
→ non-degenerate point set = no (d + 2) points co-spherical
◮ Duality: Voronoi k-face ⇔ Delaunay (d − k)-simplex
◮ Bisector Bi(P, Q) perpendicular ⊥ to segment [PQ]
8. Voronoi & Delaunay: Complexity and algorithms
◮ Combinatorial complexity: Θ(n^⌈d/2⌉) (→ quadratic in 3D), matched for points on the moment curve t ↦ (t, t², ..., t^d)
◮ Construction: Θ(n log n + n^⌈d/2⌉), optimal
◮ some output-sensitive algorithms but...
◮ O(n log n + f): not yet optimal output-sensitive algorithms.
9. Modeling population spaces in information geometry
Population space {P(x)} interpreted as a smooth manifold equipped with the Fisher Information Matrix (FIM):
◮ Riemannian modeling: metric length space with the FIM as metric tensor (orthogonality), and the Levi-Civita metric connection for length-minimizing geodesics
◮ Dual ±1 affine connection modeling: dual geodesics that describe parallel transport, non-metric dual divergences induced by dual potential Legendre convex functions. Dual ±α connections.
→ Algorithmic considerations of these two approaches
Population space, parameter space, object-oriented geometry, etc.
11. Population spaces: Hotelling (1930) [12], Rao (1945) [33]
Birth of differential-geometric methods in statistics.
◮ The Fisher information matrix (non-degenerate, positive definite) can be used as a (smooth) Riemannian metric tensor g.
◮ Distance between two populations indexed by θ1 and θ2: Riemannian distance (metric length)
First applications in statistics:
◮ Fisher-Hotelling-Rao (FHR) geodesic distance used in classification: find the closest population to a given set of populations
◮ Used in tests of significance (null versus alternative hypothesis) and the power of a test: P(reject H0 | H0 is false)
→ define surfaces in population spaces
12. Rao's distance (1945, introduced by Hotelling 1930 [12])
◮ Infinitesimal squared length element:
ds² = Σ_{i,j} g_ij(θ) dθi dθj = dθ⊤ I(θ) dθ
◮ Geodesics and distances are hard to calculate explicitly:
ρ(p(x; θ1), p(x; θ2)) = min_{θ(s): θ(0)=θ1, θ(1)=θ2} ∫₀¹ √((dθ/ds)⊤ I(θ) (dθ/ds)) ds
Rao's distance is not known in closed form for multivariate normals
◮ Advantages: metric property of ρ + many tools of differential geometry [1]: Riemannian Log/Exp tangent/manifold mappings
13. Extrinsic Computational Geometry on tangent planes
◮ A tensor g = Q(x) ≻ 0 defines a smooth inner product ⟨p, q⟩_x = p⊤Q(x)q that induces a normed distance:
d_x(p, q) = ‖p − q‖_x = √((p − q)⊤Q(x)(p − q))
◮ Mahalanobis metric distance on tangent planes:
Δ(X1, X2) = √((μ1 − μ2)⊤ Σ⁻¹ (μ1 − μ2)) = √(Δμ⊤ Σ⁻¹ Δμ)
◮ Cholesky decomposition Σ = LL⊤:
Δ(X1, X2) = DE(L⁻¹μ1, L⁻¹μ2)
◮ CG on tangent planes = ordinary CG on the transformed points x′ ← L⁻¹x.
Extrinsic vs intrinsic means [10]
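The Cholesky reduction can be checked numerically: a sketch (assuming NumPy) verifying that the Mahalanobis distance coincides with the ordinary Euclidean distance on the whitened points x′ ← L⁻¹x, with Σ = LL⊤:

```python
import numpy as np

def mahalanobis(mu1, mu2, cov):
    d = mu1 - mu2
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def mahalanobis_whitened(mu1, mu2, cov):
    # Cholesky factor: cov = L L^T; map points by x' <- L^{-1} x,
    # then the Mahalanobis distance is the ordinary Euclidean one.
    L = np.linalg.cholesky(cov)
    y1 = np.linalg.solve(L, mu1)
    y2 = np.linalg.solve(L, mu2)
    return float(np.linalg.norm(y1 - y2))

cov = np.array([[2.0, 0.5], [0.5, 1.0]])
a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert abs(mahalanobis(a, b, cov) - mahalanobis_whitened(a, b, cov)) < 1e-12
```

The identity holds because d⊤(LL⊤)⁻¹d = (L⁻¹d)⊤(L⁻¹d), which is why ordinary CG algorithms apply unchanged after the linear transform.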
14. Mahalanobis Voronoi diagrams on tangent planes (extrinsic)
In statistics, the covariance matrix accounts for both correlation and dimension (feature) scaling
⇔
Dual structure ≡ anisotropic Delaunay triangulation
⇒ empty circumellipse property (Cholesky decomposition)
16. Riemannian statistical Voronoi diagrams
... for statistical population spaces:
◮ Location-scale 2D families have constant non-positive curvature (Hotelling, 1930): Riemannian statistical Voronoi diagrams amount to hyperbolic Voronoi diagrams, or to Euclidean diagrams (location families only, like isotropic Gaussians)
◮ The multinomial family has spherical geometry on the positive orthant: spherical Voronoi diagram (compute via stereographic projection ∝ Euclidean Voronoi diagrams)
But for arbitrary families p(x|θ): geodesics are not in closed form → limited computational framework in practice (ray shooting, etc.)
17. Normal/Gaussian family and 2D location-scale families
◮ Fisher Information Matrix (FIM):
I(θ) = [I_{i,j}(θ)] = E[ (∂/∂θi) log p(x|θ) · (∂/∂θj) log p(x|θ) ]
◮ FIM for univariate normal / multivariate spherical distributions:
I(μ, σ) = [[1/σ², 0], [0, 2/σ²]] = (1/σ²) [[1, 0], [0, 2]],  I(μ, σ) = diag(1/σ², ..., 1/σ², 2/σ²)
◮ → amounts to the Poincaré metric (dx² + dy²)/y², hyperbolic geometry in the upper half-plane/half-space.
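Under the standard identification (μ, σ) ↦ (μ/√2, σ) that turns the FIM above into the Poincaré half-plane metric, the Fisher-Rao distance between univariate normals is √2 times the half-plane distance; a sketch under that assumption:

```python
import math

def halfplane_dist(x1, y1, x2, y2):
    # Poincare upper half-plane distance for the metric (dx^2 + dy^2)/y^2
    return math.acosh(1.0 + ((x2 - x1) ** 2 + (y2 - y1) ** 2) / (2.0 * y1 * y2))

def fisher_rao_normal(mu1, s1, mu2, s2):
    # FIM = diag(1/s^2, 2/s^2): rescale mu by 1/sqrt(2), scale the distance by sqrt(2)
    r2 = math.sqrt(2.0)
    return r2 * halfplane_dist(mu1 / r2, s1, mu2 / r2, s2)

# Same mean: the distance reduces to sqrt(2) * |log(s2/s1)|
assert abs(fisher_rao_normal(0.0, 1.0, 0.0, math.e) - math.sqrt(2.0)) < 1e-9
```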
19. Matrix SPD spaces and hyperbolic geometry
Symmetric Positive Definite matrices M: ∀x ≠ 0, x⊤Mx > 0.
◮ The 2D SPD(2) matrix space has dimension d = 3: a positive cone.
SPD(2) = {(a, b, c) ∈ R³ : a > 0, ab − c² > 0}
◮ It can be peeled into sheets of dimension 2, each sheet corresponding to a constant value of the determinant of its elements [8]:
SPD(2) = SSPD(2) × R⁺, where SSPD(2) = {(a, b, c) : a > 0, ab − c² = 1}
◮ Mapping M(a, b, c) → H²:
◮ x₀ = (a + b)/2 ≥ 1, x₁ = (a − b)/2, x₂ = c in the hyperboloid model [28]
◮ z = (a − b + 2ic)/(2 + a + b) in the Poincaré disk [28].
20. Riemannian manifolds: Choice of equivalent models?
Many equivalent models of hyperbolic geometry:
◮ Conformal models (good for visualization since we can measure angles) versus non-conformal models (computationally friendly for geodesics).
◮ Convert equivalently to other models of hyperbolic geometry: Poincaré disk, upper half-space, hyperboloid, Beltrami hemisphere, etc.
Two questions:
◮ Given a metric tensor g and its induced metric distance ρg(p, q), what are the equivalent metric tensors g′ ∼ g such that ρg(p, q) = ρg′(p′, q′)? Is one metric tensor better for a computing space?
◮ Metrics yielding straight geodesics are fully characterized in 2D, but what about higher dimensions?
21. Riemannian Poincaré disk metric tensor (conformal)
→ often used in Human-Computer Interfaces, network routing (embedding trees), etc.
22. Riemannian Klein disk metric tensor (non-conformal)
◮ recommended for a computing space since geodesics are straight line segments
◮ Klein is also conformal at the origin (so we can perform translations from and back to the origin)
◮ Geodesics passing through O in the Poincaré disk are straight (so we can perform translations from and back to the origin)
23. Hyperbolic Voronoi diagrams [25, 29]
In arbitrary dimension, H^d:
◮ In the Klein disk, the hyperbolic Voronoi diagram amounts to a clipped affine Voronoi diagram, or a clipped power diagram with an efficient clipping algorithm [5].
◮ then convert to the other models of hyperbolic geometry: Poincaré disk, upper half-space, hyperboloid, Beltrami hemisphere, etc.
◮ Conformal (good for visualization) versus non-conformal (good for computing) models.
24. Hyperbolic Voronoi diagrams [25, 29]
Hyperbolic Voronoi diagram in the Klein disk = clipped power diagram.
Power distance: ‖x − p‖² − w_p
→ additively weighted ordinary Voronoi = ordinary CG
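The assignment step of a power diagram is a one-liner: a sketch (assuming NumPy; the weights here are illustrative inputs, not the specific weights produced by the Klein-model reduction of [5]) locating the power cell of a query point:

```python
import numpy as np

def power_cell_index(sites, weights, x):
    """Cell of the power diagram containing x: argmin_i ||x - p_i||^2 - w_i."""
    vals = np.sum((sites - x) ** 2, axis=1) - weights
    return int(np.argmin(vals))

sites = np.array([[0.0, 0.0], [2.0, 0.0]])
x = np.array([1.0, 0.0])  # equidistant from both sites
assert power_cell_index(sites, np.array([0.0, 0.0]), x) in (0, 1)
assert power_cell_index(sites, np.array([0.5, 0.0]), x) == 0  # weight enlarges cell 0
```

With all weights equal this degenerates to the ordinary Voronoi assignment, which is the "ordinary CG" reduction claimed above.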
25. Hyperbolic Voronoi diagrams [25, 29]
The 5 common models of the abstract hyperbolic geometry:
https://www.youtube.com/watch?v=i9IUzNxeH4o (5 min. video)
ACM Symposium on Computational Geometry (SoCG'14)
26. Dually affine connection computational information geometry
27. Dually flat space construction from convex functions F
◮ A strictly convex and differentiable function F(θ) admits a Legendre-Fenchel convex conjugate F*(η):
F*(η) = sup_θ (θ⊤η − F(θ)), ∇F(θ) = η = (∇F*)⁻¹(θ)
◮ Young's inequality gives rise to the canonical divergence [15]:
F(θ) + F*(η′) ≥ θ⊤η′ ⇒ A_{F,F*}(θ : η′) = F(θ) + F*(η′) − θ⊤η′
◮ Writing it in a single coordinate system, we get the dual Bregman divergences:
B_F(θp : θq) = F(θp) − F(θq) − (θp − θq)⊤∇F(θq)
= B_{F*}(ηq : ηp) = A_{F,F*}(θp, ηq) = A_{F*,F}(ηq : θp)
◮ dual affine coordinate systems with straight geodesics:
η = ∇F(θ) ⇔ θ = ∇F*(η). Tensor g(θ) = g*(η)
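The dual Bregman identity B_F(θp : θq) = B_{F*}(ηq : ηp) can be verified numerically; a sketch (assuming NumPy) using the generator F(θ) = Σ θi log θi − θi on positive vectors, whose Bregman divergence is the extended Kullback-Leibler divergence and whose Legendre conjugate is F*(η) = Σ e^{ηi}:

```python
import numpy as np

def bregman(F, gradF, x, y):
    return F(x) - F(y) - float((x - y) @ gradF(y))

F = lambda t: float(np.sum(t * np.log(t) - t))
gradF = lambda t: np.log(t)                  # eta = grad F(theta)
Fstar = lambda e: float(np.sum(np.exp(e)))   # Legendre conjugate
gradFstar = lambda e: np.exp(e)              # theta = grad F*(eta)

tp = np.array([0.5, 1.5]); tq = np.array([1.0, 0.5])
ep, eq = gradF(tp), gradF(tq)
lhs = bregman(F, gradF, tp, tq)              # B_F(theta_p : theta_q)
rhs = bregman(Fstar, gradFstar, eq, ep)      # B_F*(eta_q : eta_p)
assert abs(lhs - rhs) < 1e-12
assert np.allclose(gradFstar(gradF(tp)), tp)  # grad F* inverts grad F
```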
28. Dual divergence/Bregman dual bisectors [6, 24, 26]
Bregman sided (reference) bisectors related by convex duality:
Bi_F(θ1, θ2) = {θ ∈ Θ | B_F(θ : θ1) = B_F(θ : θ2)}
Bi_{F*}(η1, η2) = {η ∈ H | B_{F*}(η : η1) = B_{F*}(η : η2)}
Right-sided bisector: → θ-hyperplane, η-hypersurface
H_F(p, q) = {x ∈ X | B_F(x : p) = B_F(x : q)}
H_F: ⟨∇F(p) − ∇F(q), x⟩ + (F(p) − F(q) + ⟨q, ∇F(q)⟩ − ⟨p, ∇F(p)⟩) = 0
Left-sided bisector: → θ-hypersurface, η-hyperplane
H′_F(p, q) = {x ∈ X | B_F(p : x) = B_F(q : x)}
H′_F: ⟨∇F(x), q − p⟩ + F(p) − F(q) = 0
A hyperplane is an autoparallel submanifold of dimension d − 1.
29. Visualizing Bregman bisectors
Primal coordinates θ (natural parameters) versus dual coordinates η (expectation parameters).
[Figure: Itakura-Saito source space with p(0.52977081, 0.72041688), q(0.85824458, 0.29083834), D(p,q) = 0.66969016, D(q,p) = 0.44835617; Itakura-Saito dual gradient space with p′(−1.88760873, −1.38808518), q′(−1.16516903, −3.43833618), D*(p′,q′) = 0.44835617, D*(q′,p′) = 0.66969016.]
Bi(P, Q) and Bi*(P, Q) can be expressed in either the θ- or the η-coordinate system
30. Spaces of spheres: 1-to-1 mapping between d-spheres and (d + 1)-hyperplanes using potential functions
31. Space of Bregman spheres and Bregman balls [6]
Dual sided Bregman balls (bounding Bregman spheres):
Ball^r_F(c, r) = {x ∈ X | B_F(x : c) ≤ r}
Ball^l_F(c, r) = {x ∈ X | B_F(c : x) ≤ r}
Legendre duality:
Ball^l_F(c, r) = (∇F)⁻¹(Ball^r_{F*}(∇F(c), r))
Illustration for the Itakura-Saito divergence, F(x) = −log x
32. Space of Bregman spheres: Lifting map [6]
F: x ↦ x̂ = (x, F(x)), a hypersurface in R^{d+1} (the potential function graph)
Hp: tangent hyperplane at p̂: z = Hp(x) = ⟨x − p, ∇F(p)⟩ + F(p)
◮ Bregman sphere σ ⟶ σ̂ with supporting hyperplane H: z = ⟨x − c, ∇F(c)⟩ + F(c) + r (parallel to Hc and shifted vertically by r); σ̂ = F ∩ H.
◮ The intersection of any hyperplane H with F projects onto X as a Bregman sphere:
H: z = ⟨x, a⟩ + b → σ: Ball_F(c = (∇F)⁻¹(a), r = ⟨a, c⟩ − F(c) + b)
34. Space of Bregman spheres: Algorithmic applications [6]
◮ Union/intersection of Bregman d-spheres from the representational (d + 1)-polytope [6]
◮ The radical axis of two Bregman balls is a hyperplane: applications to nearest neighbor search trees like Bregman ball trees or Bregman vantage point trees [31].
35. Bregman proximity data structures [31]
Vantage point trees: partition space according to Bregman balls.
Partitioning space with intersections of Kullback-Leibler balls
→ efficient nearest neighbour queries in information spaces
36. Application: Minimum Enclosing Ball [23, 32]
To a hyperplane H = H(a, b): z = ⟨a, x⟩ + b in R^{d+1} corresponds a ball σ = Ball(c, r) in R^d with center c = ∇F*(a) and radius:
r = ⟨a, c⟩ − F(c) + b = ⟨a, ∇F*(a)⟩ − F(∇F*(a)) + b = F*(a) + b
since F(∇F*(a)) = ⟨∇F*(a), a⟩ − F*(a) (Young equality).
SEB: find the halfspace H(a, b)⁻: z ≤ ⟨a, x⟩ + b that contains all the lifted points:
min_{a,b} r = F*(a) + b, such that ∀i ∈ {1, ..., n}, ⟨a, xi⟩ + b − F(xi) ≥ 0
→ Convex Program (CP) with linear inequality constraints.
F(θ) = F*(η) = (1/2) x⊤x: CP → Quadratic Programming (QP) [11], as used in SVMs. The smallest enclosing ball is used as a primitive in SVMs [34].
37. Smallest Bregman enclosing balls [32, 22]
Algorithm 1: BBCA(P, l).
c1 ← choose a point of P uniformly at random;
for i = 2 to l − 1 do
// farthest point from ci wrt B_F
si ← argmax_{j=1..n} B_F(ci : pj);
// update the center: walk on the η-segment [ci, p_si]
c_{i+1} ← ∇F⁻¹(∇F(ci) #_{1/(i+1)} ∇F(p_si));
end
// Return the SEBB approximation
return Ball(cl, rl = B_F(cl : X));
θ-, η-geodesic segments in dually flat geometry.
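For F(x) = (1/2)⟨x, x⟩ the θ- and η-coordinates coincide, and BBCA reduces to the classical Bădoiu-Clarkson core-set iteration for the Euclidean minimum enclosing ball; a sketch of that special case (assuming NumPy):

```python
import numpy as np

def meb_badoiu_clarkson(P, iters=1000):
    """Approximate the minimum enclosing ball of the rows of P:
    repeatedly step 1/(i+1) of the way toward the current farthest point."""
    c = P[0].copy()
    for i in range(1, iters):
        far = P[np.argmax(np.sum((P - c) ** 2, axis=1))]  # farthest point from c
        c += (far - c) / (i + 1.0)                        # walk on the segment [c, far]
    r = float(np.max(np.linalg.norm(P - c, axis=1)))
    return c, r

P = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.5]])
c, r = meb_badoiu_clarkson(P)
assert np.allclose(c, [0.0, 0.0], atol=1e-2) and abs(r - 1.0) < 1e-2
```

For a general Bregman generator the same loop runs in the gradient (η) coordinates, exactly as in the pseudo-code above.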
38. Smallest enclosing balls: Core-sets [32]
Core-set C ⊆ S: SOL(S) ≤ SOL(C) ≤ (1 + ε)SOL(S)
[Figure: core-sets for the extended Kullback-Leibler and Itakura-Saito divergences.]
39. InSphere predicates wrt Bregman divergences [6]
Implicit representation of Bregman spheres/balls: consider d + 1 support points on the boundary
◮ Is x inside the Bregman ball defined by d + 1 support points?
◮ InSphere(x; p0, ..., pd) = sign of a (d + 2) × (d + 2) matrix determinant
◮ InSphere(x; p0, ..., pd) is negative, null or positive depending on whether x lies inside, on, or outside σ.
52. Smallest enclosing ball in Riemannian manifolds [2]
c = a #^M_t b: the point γ(t) on the geodesic line segment [ab] wrt M such that ρM(a, c) = t × ρM(a, b) (with ρM the metric distance on manifold M)
Algorithm 2: GeoA
c1 ← choose a point of P uniformly at random;
for i = 2 to l do
// farthest point from ci
si ← argmax_{j=1..n} ρ(ci, pj);
// update the center: walk on the geodesic line segment [ci, p_si]
c_{i+1} ← ci #^M_{1/(i+1)} p_si;
end
// Return the SEB approximation
return Ball(cl, rl = ρ(cl, P));
53. Approximating the smallest enclosing ball in hyperbolic space
[Figure: initialization, first to fourth iterations, and the result after 10^4 iterations.]
http://www.sonycsl.co.jp/person/nielsen/infogeo/RiemannMinimax/
54. Bregman dual regular/Delaunay triangulations
Embedded geodesic Delaunay triangulations + empty Bregman balls
[Figure: ordinary Delaunay, exponential Delaunay, and Hellinger-like Delaunay triangulations.]
◮ empty Bregman sphere property,
◮ geodesic triangles: embedded Delaunay.
55. Dually orthogonal Bregman Voronoi triangulations
The ordinary Voronoi diagram is perpendicular to the Delaunay triangulation: Voronoi k-face ⊥ Delaunay (d − k)-face
Bi(P, Q) ⊥ γ*(P, Q)
γ(P, Q) ⊥ Bi*(P, Q)
56. Synthetic geometry: Exact characterization of the Bayesian error exponent, but no closed form known
57. Bayesian hypothesis testing, MAP rule and probability of error Pe
◮ Mixture p(x) = Σ_i wi pi(x). Task = classify x: which component?
◮ Prior probabilities: wi = P(X ∼ Pi) > 0 (with Σ_{i=1}^n wi = 1)
◮ Conditional probabilities: P(X = x | X ∼ Pi).
P(X = x) = Σ_{i=1}^n P(X ∼ Pi) P(X = x | X ∼ Pi) = Σ_{i=1}^n wi P(X|Pi)
◮ Best rule = Maximum a posteriori probability (MAP) rule:
map(x) = argmax_{i∈{1,...,n}} wi pi(x)
where pi(x) = P(X = x | X ∼ Pi) are the conditional probabilities.
◮ For w1 = w2 = 1/2, the probability of error is
Pe = (1/2) ∫ min(p1(x), p2(x)) dx ≤ (1/2) ∫ p1^α(x) p2^{1−α}(x) dx, for α ∈ (0, 1).
Best exponent α*
58. Error exponent for exponential families
◮ Exponential families have finite-dimensional sufficient statistics: → reduce the n data to D statistics.
∀x ∈ X, P(x|θ) = exp(θ⊤t(x) − F(θ) + k(x))
F(·): log-normalizer/cumulant/partition function; k(x): auxiliary term for the carrier measure.
◮ Maximum likelihood estimator (MLE): ∇F(θ̂) = (1/n) Σ_i t(Xi) = η̂
◮ Bijection between exponential families and Bregman divergences:
log p(x|θ) = −B_{F*}(t(x) : η) + F*(t(x)) + k(x)
Exponential families are log-concave
59. Geometry of the best error exponent
On the exponential family manifold, the Chernoff α-coefficient [7]:
c_α(P1 : P2) = ∫ p1^α(x) p2^{1−α}(x) dμ(x) = exp(−J_F^{(α)}(θ1 : θ2))
Skew Jensen divergence [20] on the natural parameters:
J_F^{(α)}(θ1 : θ2) = αF(θ1) + (1 − α)F(θ2) − F(θ12^{(α)}), with θ12^{(α)} = αθ1 + (1 − α)θ2
Chernoff information = Bregman divergence for exponential families:
C(P1 : P2) = B(θ1 : θ12^{(α*)}) = B(θ2 : θ12^{(α*)})
Finding the best error exponent α*?
60. Geometry of the best error exponent: binary hypothesis [17]
Chernoff distribution P*:
P* = P*12 = Ge(P1, P2) ∩ Bim(P1, P2)
e-geodesic:
Ge(P1, P2) = { E12^{(λ)} | θ(E12^{(λ)}) = (1 − λ)θ1 + λθ2, λ ∈ [0, 1] }
m-bisector:
Bim(P1, P2): { P | F(θ1) − F(θ2) + η(P)⊤(θ2 − θ1) = 0 }
Optimal natural parameter of P*:
θ* = θ12^{(α*)} = argmin_{θ∈Ge} B(θ1 : θ) = argmin_{θ∈Ge} B(θ2 : θ).
→ closed form for order-1 families, or efficient bisection search.
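The search is easy to sketch for the Poisson family, an order-1 exponential family with F(θ) = e^θ (a closed form exists in that case; the search is shown only for illustration). Since the skew Jensen divergence is concave in α, a ternary search finds α*, and at the optimum θ* is Bregman-equidistant from θ1 and θ2:

```python
import math

F = math.exp        # Poisson log-normalizer: F(theta) = e^theta, theta = log(rate)
gradF = math.exp

def bregman(t1, t2):
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def skew_jensen(alpha, t1, t2):
    return alpha * F(t1) + (1 - alpha) * F(t2) - F(alpha * t1 + (1 - alpha) * t2)

def chernoff_alpha(t1, t2, iters=200):
    """Ternary search maximizing the (concave in alpha) skew Jensen divergence."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if skew_jensen(a, t1, t2) < skew_jensen(b, t1, t2):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)

t1, t2 = math.log(1.0), math.log(5.0)   # two Poisson rates: 1 and 5
astar = chernoff_alpha(t1, t2)
tstar = astar * t1 + (1 - astar) * t2   # optimal natural parameter theta*
# At the optimum, theta* is Bregman-equidistant from theta1 and theta2
assert abs(bregman(t1, tstar) - bregman(t2, tstar)) < 1e-6
```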
61. Geometry of the best error exponent: binary hypothesis
P* = P*12 = Ge(P1, P2) ∩ Bim(P1, P2)
[Figure: in the η-coordinate system, the e-geodesic Ge(P1, P2) crosses the m-bisector Bim(P1, P2) at P*12, with C(θ1 : θ2) = B(θ1 : θ*12).]
Binary hypothesis testing: Pe is bounded using the Bregman divergence between the Chernoff distribution and the class-conditional distributions.
62. Clustering and learning finite statistical mixtures
63. α-divergences
For α ∈ R, α ≠ ±1, the α-divergences [9] on positive arrays [36]:
◮ D_α(p : q) := (4/(1 − α²)) Σ_{i=1}^d ( ((1 − α)/2) pi + ((1 + α)/2) qi − pi^{(1−α)/2} qi^{(1+α)/2} )
with D_α(p : q) = D_{−α}(q : p), and in the limit cases D_{−1}(p : q) = KL(p : q) and D_1(p : q) = KL(q : p), where KL is the extended Kullback-Leibler divergence KL(p : q) := Σ_{i=1}^d (pi log(pi/qi) + qi − pi)
◮ α-divergences belong to the class of Csiszár f-divergences I_f(p : q) := Σ_{i=1}^d qi f(pi/qi) with the following generator:
f(t) = (4/(1 − α²))(1 − t^{(1+α)/2}), if α ≠ ±1;  t ln t, if α = 1;  −ln t, if α = −1
Information monotonicity
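A direct implementation (plain Python), checking the limit case D_{−1}(p : q) = KL(p : q) numerically with α slightly above −1 to avoid the removable singularity:

```python
import math

def alpha_div(p, q, alpha):
    """alpha-divergence on positive arrays, alpha != +/-1."""
    s = 0.0
    for pi, qi in zip(p, q):
        s += (4.0 / (1 - alpha ** 2)) * (
            (1 - alpha) / 2 * pi + (1 + alpha) / 2 * qi
            - pi ** ((1 - alpha) / 2) * qi ** ((1 + alpha) / 2))
    return s

def ext_kl(p, q):  # extended Kullback-Leibler divergence
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

p, q = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]
# Limit case: D_{-1}(p : q) = extended KL(p : q)
assert abs(alpha_div(p, q, -1 + 1e-6) - ext_kl(p, q)) < 1e-4
assert alpha_div(p, q, 0.5) >= 0  # nonnegativity (weighted AM-GM)
```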
64. Mixed divergences [30]
Defined on three parameters p, q and r:
M_λ(p : q : r) := λD(p : q) + (1 − λ)D(q : r), for λ ∈ [0, 1].
Mixed divergences include:
◮ the sided divergences for λ ∈ {0, 1},
◮ the symmetrized (arithmetic mean) divergence for λ = 1/2, or the skew-symmetrized divergence for λ ≠ 1/2.
65. Symmetrizing α-divergences
S_α(p, q) = (1/2)(D_α(p : q) + D_α(q : p)) = S_{−α}(p, q) = M_{1/2}(p : q : p)
For α = ±1, we get half of Jeffreys divergence:
S_{±1}(p, q) = (1/2) Σ_{i=1}^d (pi − qi) log(pi/qi)
◮ Centroids for the symmetrized α-divergences are usually not in closed form.
◮ How to perform center-based clustering without closed-form centroids?
66. Jeffreys positive centroid [16]
◮ The Jeffreys divergence is the symmetrized α = ±1 divergence.
◮ The Jeffreys positive centroid c = (c1, ..., cd) of a set {h1, ..., hn} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:
ci = ai / W(ai e / gi)
where ai = Σ_{j=1}^n πj hi^j denotes the coordinate-wise arithmetic weighted means and gi = Π_{j=1}^n (hi^j)^{πj} the coordinate-wise geometric weighted means.
◮ The Lambert analytic function W [4] (positive branch) is defined by W(x) e^{W(x)} = x for x ≥ 0.
◮ → Jeffreys k-means clustering. But for α ≠ ±1, how to cluster?
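A sketch of the closed form (plain Python; the Lambert W positive branch is computed by a simple Newton iteration rather than a library call):

```python
import math

def lambert_w(x, iters=50):
    """Positive branch of Lambert W via Newton iteration: W(x) e^{W(x)} = x."""
    w = math.log1p(x)  # crude but safe initial guess for x >= 0
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (w + 1.0))
    return w

def jeffreys_positive_centroid(hists, weights):
    """Component-wise c_i = a_i / W(e * a_i / g_i), with a_i / g_i the weighted
    arithmetic / geometric means, as in the closed form above."""
    d = len(hists[0])
    c = []
    for i in range(d):
        a = sum(w * h[i] for w, h in zip(weights, hists))
        g = math.exp(sum(w * math.log(h[i]) for w, h in zip(weights, hists)))
        c.append(a / lambert_w(math.e * a / g))
    return c

# Sanity check: the centroid of a single histogram is the histogram itself,
# since then a = g = h and W(e) = 1.
h = [0.2, 0.5, 0.3]
c = jeffreys_positive_centroid([h], [1.0])
assert all(abs(ci - hi) < 1e-9 for ci, hi in zip(c, h))
```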
67. Mixed α-divergences / α-Jeffreys symmetrized divergence
◮ Mixed α-divergence between a histogram x and two histograms p and q:
M_{λ,α}(p : x : q) = λD_α(p : x) + (1 − λ)D_α(x : q)
= λD_{−α}(x : p) + (1 − λ)D_{−α}(q : x)
= M_{1−λ,−α}(q : x : p)
◮ The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:
S_α(p, q) = M_{1/2,α}(q : p : q) = M_{1/2,α}(p : q : p)
◮ The skew-symmetrized α-divergence is defined by:
S_{λ,α}(p : q) = λD_α(p : q) + (1 − λ)D_α(q : p)
68. Mixed divergence-based k-means clustering
Initialize with k distinct seeds from the dataset, with li = ri.
Input: weighted histogram set H, divergence D(·, ·), integer k > 0, real λ ∈ [0, 1];
Initialize left-sided/right-sided seeds C = {(li, ri)}_{i=1}^k;
repeat
// Assignment
for i = 1, 2, ..., k do
Ci ← {h ∈ H : i = argmin_j M_λ(lj : h : rj)};
end
// Dual-sided centroid relocation
for i = 1, 2, ..., k do
ri ← argmin_x D(Ci : x) = Σ_{hj∈Ci} wj D(hj : x);
li ← argmin_x D(x : Ci) = Σ_{hj∈Ci} wj D(x : hj);
end
until convergence;
69. Mixed α-hard clustering: MAhC(H, k, λ, α)
Input: weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ R;
Let C = {(li, ri)}_{i=1}^k ← MAS(H, k, λ, α);
repeat
// Assignment
for i = 1, 2, ..., k do
Ai ← {h ∈ H : i = argmin_j M_{λ,α}(lj : h : rj)};
end
// Centroid relocation
for i = 1, 2, ..., k do
ri ← (Σ_{h∈Ai} wh h^{(1−α)/2})^{2/(1−α)};
li ← (Σ_{h∈Ai} wh h^{(1+α)/2})^{2/(1+α)};
end
until convergence;
70. Coupled k-Means++ α-Seeding
Algorithm 3: Mixed α-seeding; MAS(H, k, λ, α)
Input: weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ R;
Let C ← hj with uniform probability;
for i = 2, 3, ..., k do
Pick at random a histogram h ∈ H with probability:
πH(h) := wh M_{λ,α}(ch : h : ch) / Σ_{y∈H} wy M_{λ,α}(cy : y : cy),   (1)
// where (ch, ch) := argmin_{(z,z)∈C} M_{λ,α}(z : h : z);
C ← C ∪ {(h, h)};
end
Output: set of initial cluster centers C;
→ Guaranteed probabilistic bound. Just need to initialize! No centroid computations.
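A sketch of the seeding loop for the symmetric case li = ri, with D taken as the extended Kullback-Leibler divergence, λ = 1/2, and uniform histogram weights (all simplifying assumptions; the histograms are toy data):

```python
import math, random

def kl(p, q):  # extended Kullback-Leibler divergence on positive arrays
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

def mixed(l, h, r, lam=0.5):  # M_lambda(l : h : r) with D = extended KL
    return lam * kl(l, h) + (1 - lam) * kl(h, r)

def mixed_seeding(H, k, lam=0.5, seed=0):
    """k-means++-style seeding: draw each new seed with probability proportional
    to its mixed divergence to the closest current seed pair (l = r here)."""
    rng = random.Random(seed)
    C = [rng.choice(H)]
    while len(C) < k:
        costs = [min(mixed(c, h, c, lam) for c in C) for h in H]
        x, acc = rng.random() * sum(costs), 0.0
        for h, cost in zip(H, costs):
            acc += cost
            if acc > x:
                C.append(h)
                break
    return C

H = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
seeds = mixed_seeding(H, 2)
assert len(seeds) == 2 and all(s in H for s in seeds)
```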
71. Learning MMs: A geometric hard clustering viewpoint
Learn the parameters of a mixture m(x) = Σ_{i=1}^k wi p(x|θi)
Maximize the complete data likelihood = clustering objective function:
max_{W,Θ} lc(W, Θ) = Σ_{i=1}^n Σ_{j=1}^k z_{i,j} log(wj p(xi|θj))
= max_Θ Σ_{i=1}^n max_{j=1..k} log(wj p(xi|θj))
≡ min_{W,Θ} Σ_{i=1}^n min_{j=1..k} Dj(xi),
where cj = (wj, θj) (cluster prototype) and Dj(xi) = −log p(xi|θj) − log wj are potential distance-like functions.
One can further attach to each cluster a different family of probability distributions.
72. Generalized k-MLE for learning statistical mixtures
Model-based clustering: assignment of points to clusters:
D_{wj,θj,Fj}(x) = −log p_{Fj}(x; θj) − log wj
k-GMLE:
1. Initialize the weights W ∈ Δk and the family types (F1, ..., Fk) for each cluster
2. Solve min_Θ Σ_i min_j Dj(xi) (center-based clustering for W fixed) with potential functions Dj(xi) = −log p_{Fj}(xi|θj) − log wj
3. Solve the family types maximizing the MLE in each cluster Cj by choosing the parametric family of distributions Fj = F(γj) that yields the best likelihood: min_{F1=F(γ1),...,Fk=F(γk)∈F(Γ)} Σ_i min_j D_{wj,θj,Fj}(xi), with ∀l, γl = argmax_j F*_j(η̂l = (1/nl) Σ_{x∈Cl} tj(x)) + (1/nl) Σ_{x∈Cl} k(x)
4. Update the weights W as the cluster point proportions
5. Test for convergence, and go to step 2 otherwise.
Drawback = biased, non-consistent estimator due to Voronoi support truncation.
73. Computing f-divergences for a generic f: beyond stochastic numerical integration
74. f-divergences
I_f(X1 : X2) = ∫ x1(x) f(x2(x)/x1(x)) dν(x) ≥ 0

Name of the f-divergence | Formula I_f(P : Q) | Generator f(u) with f(1) = 0
Total variation (metric) | (1/2) ∫ |p(x) − q(x)| dν(x) | (1/2)|u − 1|
Squared Hellinger | ∫ (√p(x) − √q(x))² dν(x) | (√u − 1)²
Pearson χ²_P | ∫ (q(x) − p(x))²/p(x) dν(x) | (u − 1)²
Neyman χ²_N | ∫ (p(x) − q(x))²/q(x) dν(x) | (1 − u)²/u
Pearson-Vajda χ^k_P | ∫ (q(x) − p(x))^k/p^{k−1}(x) dν(x) | (u − 1)^k
Pearson-Vajda |χ|^k_P | ∫ |q(x) − p(x)|^k/p^{k−1}(x) dν(x) | |u − 1|^k
Kullback-Leibler | ∫ p(x) log(p(x)/q(x)) dν(x) | −log u
reverse Kullback-Leibler | ∫ q(x) log(q(x)/p(x)) dν(x) | u log u
α-divergence | (4/(1 − α²))(1 − ∫ p^{(1−α)/2}(x) q^{(1+α)/2}(x) dν(x)) | (4/(1 − α²))(1 − u^{(1+α)/2})
Jensen-Shannon | (1/2) ∫ (p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x)))) dν(x) | −(u + 1) log((1+u)/2) + u log u
75. f-divergences and higher-order Vajda χ^k divergences
I_f(X1 : X2) = Σ_{k=0}^∞ (f^{(k)}(1)/k!) χ^k_P(X1 : X2)
χ^k_P(X1 : X2) = ∫ (x2(x) − x1(x))^k / x1(x)^{k−1} dν(x),
|χ|^k_P(X1 : X2) = ∫ |x2(x) − x1(x)|^k / x1(x)^{k−1} dν(x),
are f-divergences for the generators (u − 1)^k and |u − 1|^k.
◮ When k = 1, χ¹_P(X1 : X2) = ∫ (x1(x) − x2(x)) dν(x) = 0 (never discriminative), and |χ|¹_P(X1, X2) is twice the total variation distance.
◮ χ^k_P is a signed distance
76. Affine exponential families
Canonical decomposition of the probability measure:
p(x) = exp(⟨t(x), θ⟩ − F(θ) + k(x)),
and consider an affine natural parameter space (like multinomials).
Poi(λ): p(x|λ) = λ^x e^{−λ}/x!, λ > 0, x ∈ {0, 1, ...}
NorI(μ): p(x|μ) = (2π)^{−d/2} e^{−(1/2)(x−μ)⊤(x−μ)}, μ ∈ R^d, x ∈ R^d

Family | θ | Θ | F(θ) | k(x) | t(x) | ν
Poisson | log λ | R | e^θ | −log x! | x | νc (counting)
Iso. Gaussian | μ | R^d | (1/2)θ⊤θ | −(d/2) log 2π − (1/2)x⊤x | x | νL (Lebesgue)
77. Higher-order Vajda χ^k divergences
The (signed) χ^k_P distance between members X1 ∼ EF(θ1) and X2 ∼ EF(θ2) of the same affine exponential family is (k ∈ N) always bounded and equal to:
χ^k_P(X1 : X2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{F((1−j)θ1+jθ2)} / e^{(1−j)F(θ1)+jF(θ2)}
For Poisson/Normal distributions, we get closed-form formulas:
χ^k_P(λ1 : λ2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{λ1^{1−j} λ2^j − ((1−j)λ1 + jλ2)},
χ^k_P(μ1 : μ2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{(1/2) j(j−1)(μ1−μ2)⊤(μ1−μ2)}.
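The Poisson closed form can be cross-checked against a direct numerical summation over the support (plain Python; the truncation point of the sum is an arbitrary cutoff chosen large enough for the rates used):

```python
import math

def chi_k_poisson_closed(k, l1, l2):
    """Closed-form signed Pearson-Vajda chi^k between Poisson(l1) and Poisson(l2)."""
    return sum((-1) ** (k - j) * math.comb(k, j)
               * math.exp(l1 ** (1 - j) * l2 ** j - ((1 - j) * l1 + j * l2))
               for j in range(k + 1))

def chi_k_poisson_direct(k, l1, l2, xmax=60):
    """Direct numerical sum of (p2 - p1)^k / p1^(k-1) over a truncated support."""
    def pmf(lam, x):  # Poisson pmf evaluated in log-space to avoid overflow
        return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))
    return sum((pmf(l2, x) - pmf(l1, x)) ** k / pmf(l1, x) ** (k - 1)
               for x in range(xmax))

l1, l2 = 3.0, 4.0
assert abs(chi_k_poisson_closed(1, l1, l2)) < 1e-9  # chi^1 never discriminates
assert abs(chi_k_poisson_closed(2, l1, l2) - chi_k_poisson_direct(2, l1, l2)) < 1e-6
```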
78. f-divergences: Analytic formula [14]
◮ For λ = 1 ∈ int(dom(f^{(i)})), the f-divergence admits the analytic expansion (Theorem 1 of [3]):
I_f(X1 : X2) = Σ_{k≥0} (f^{(k)}(1)/k!) χ^k_P(X1 : X2)
90. Geometrically designed divergences
[Figure: plot of the convex generator F: (x, F(x)), with lifted points (p, F(p)), (q, F(q)) and midpoint (p + q)/2, illustrating the Bregman divergence B(p : q), the Jensen divergence J(p, q), and the total Bregman divergence tB(p : q).]
92. Total Bregman divergences [13]
Conformal divergence with conformal factor ρ:
D′(p : q) = ρ(p, q) D(p : q)
ρ plays the role of a regularizer [35].
Invariance under rotations of the axes of the design space:
tB(p : q) = B(p : q)/√(1 + ⟨∇F(q), ∇F(q)⟩) = ρ_B(q) B(p : q),
ρ_B(q) = 1/√(1 + ⟨∇F(q), ∇F(q)⟩).
For example, the total squared Euclidean divergence:
tE(p, q) = (1/2) ⟨p − q, p − q⟩/√(1 + ⟨q, q⟩).
93. Total skew Jensen divergences [27]
tB(p : q) = ρ_B(q) B(p : q), ρ_B(q) = √(1/(1 + ⟨∇F(q), ∇F(q)⟩))
tJ(p : q) = ρ_J(p, q) J(p : q), ρ_J(p, q) = √(1/(1 + (F(p) − F(q))²/⟨p − q, p − q⟩))
The Jensen-Shannon divergence, whose square root is a metric:
JS(p, q) = (1/2) Σ_{i=1}^d pi log(2pi/(pi + qi)) + (1/2) Σ_{i=1}^d qi log(2qi/(pi + qi))
But the square root of the total Jensen-Shannon divergence is not a metric.
94. Summary: Geometric Computing in Information Spaces
◮ Location-scale families, spherical normals, symmetric positive definite matrices → hyperbolic geometry.
◮ Hyperbolic geometry: CG affine constructions in the Klein disk
◮ Space of spheres in dually affine connection geometry
◮ Synthetic geometry for characterizing the best error exponent in Bayes error
◮ Conformal divergences: total Bregman/total Jensen divergences
◮ Clustering with a pair of centroids per cluster, using mixed divergences for symmetrized α-divergences
◮ Learning statistical mixtures by maximizing the complete likelihood as a sequence of geometric clustering problems: k-GMLE
◮ In search of closed-form solutions: Jeffreys centroid using the Lambert W function, f-divergence approximation for affine exponential families.
95. Computational Information Geometry (edited books) [19] [18]
http://www.springer.com/engineering/signals/book/978-3-642-30231-2
http://www.sonycsl.co.jp/person/nielsen/infogeo/MIG/MIGBOOKWEB/
http://www.springer.com/engineering/signals/book/978-3-319-05316-5
http://www.sonycsl.co.jp/person/nielsen/infogeo/GTI/GeometricTheoryOfInformation.html
96. Geometric Sciences of Information (GSI) 2015
October 28-30th, 2015. Submission deadline: 1st March 2015
http://www.gsi2015.org/
97. Thank you!
98. References
Marc Arnaudon and Frank Nielsen. On approximating the Riemannian 1-center. Computational Geometry: Theory and Applications, 46(1):93-104, January 2013.
N. S. Barnett, P. Cerone, S. S. Dragomir, and A. Sofo. Approximating Csiszár f-divergence by the use of Taylor's formula with integral remainder. Mathematical Inequalities & Applications, 5(3):417-434, 2002.
D. A. Barry, P. J. Culligan-Hensley, and S. J. Barry. Real values of the W-function. ACM Transactions on Mathematical Software, 21(2):161-171, June 1995.
Jean-Daniel Boissonnat and Christophe Delage. Convex hull and Voronoi diagram of additively weighted points. In Gerth Stølting Brodal and Stefano Leonardi, editors, ESA, volume 3669 of Lecture Notes in Computer Science, pages 367-378. Springer, 2005.
Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete and Computational Geometry, 44(2):281-307, April 2010.
Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493-507, 1952.
Pascal Chossat and Olivier P. Faugeras. Hyperbolic planforms in relation to visual edges and textures perception. PLoS Computational Biology, 5(12), 2009.
Andrzej Cichocki, Sergio Cruces, and Shun-ichi Amari. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy, 13(1):134-170, 2011.
P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang C. Joshi. Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging, 23(8):995-1005, 2004.
Bernd Gärtner and Sven Schönherr. An efficient, exact, and generic quadratic programming solver for geometric optimization. In Proceedings of the Sixteenth Annual Symposium on Computational Geometry, pages 110-118. ACM, 2000.
Harold Hotelling.
Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407-2419, 2012.
F. Nielsen and R. Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):10-13, 2014.
Frank Nielsen. Legendre transformation and information geometry. Technical Report CIG-MEMO2, September 2010.
Frank Nielsen. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Processing Letters, PP(99):1-1, 2013.
Frank Nielsen. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognition Letters, 42:25-34, 2014.
Frank Nielsen. Geometric Theory of Information. Springer, 2014.
Frank Nielsen and Rajendra Bhatia, editors. Matrix Information Geometry (Revised Invited Papers). Springer, 2012.
Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455-5466, August 2011.
Frank Nielsen and Richard Nock. On approximating the smallest enclosing Bregman balls. In Proceedings of the Twenty-second Annual Symposium on Computational Geometry (SCG '06), pages 485-486, New York, NY, USA, 2006. ACM.
Frank Nielsen and Richard Nock. On the smallest enclosing information disk. Information Processing Letters (IPL), 105(3):93-97, 2008.
Frank Nielsen and Richard Nock. The dual Voronoi diagrams with respect to representational Bregman divergences. In International Symposium on Voronoi Diagrams (ISVD), pages 71-78, 2009.
Frank Nielsen and Richard Nock. Hyperbolic Voronoi diagrams made easy. In International Conference on Computational Science and its Applications (ICCSA), volume 1, pages 74-80, Los Alamitos, CA, USA, March 2010. IEEE Computer Society.
Frank Nielsen and Richard Nock. Total Jensen divergences: Definition, properties and k-means++ clustering. CoRR, abs/1309.7109, 2013.
Frank Nielsen and Richard Nock. Visualizing hyperbolic Voronoi diagrams. In Proceedings of the Thirtieth Annual Symposium on Computational Geometry (SoCG'14), pages 90:90-90:91, New York, NY, USA, 2014. ACM.
Frank Nielsen, Richard Nock, and Shun-ichi Amari. On clustering histograms with k-means by using mixed α-divergences. Entropy, 16(6):3273-3301, 2014.
Frank Nielsen, Paolo Piro, and Michel Barlaud. Bregman vantage point trees for efficient nearest neighbor queries. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878-881, 2009.
Richard Nock and Frank Nielsen. Fitting the smallest enclosing Bregman ball. In Machine Learning, volume 3720 of Lecture Notes in Computer Science, pages 649-656. Springer Berlin Heidelberg, 2005.
Calyampudi Radhakrishna Rao. Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37:81-89, 1945.
Ivor W. Tsang, Andras Kocsor, and James T. Kwok. Simpler core vector machines with enclosing balls. In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 911-918, New York, NY, USA, 2007. ACM.
Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, pages 475-483, 2011.
Huaiyu Zhu and Richard Rohwer. Measurements of generalisation based on information geometry. In Stephen W. Ellacott, John C. Mason, and Iain J. Anderson, editors, Mathematics of Neural Networks, volume 8 of Operations Research/Computer Science Interfaces Series, pages 394-398. Springer US, 1997.