1. Fundamentals of Algorithms and Data-Structures in Information-Geometric Spaces
Frank Nielsen
École Polytechnique, France
Sony Computer Science Laboratories, Inc.
MEXT-ISM Workshop on Information Geometry for Machine Learning
Brain Science Institute, RIKEN
4th December 2014
2014 Frank Nielsen 1/75
2. Brief historical review of Computational Geometry (CG)
◮ Three research periods:
1. Geometric algorithms: Voronoi/Delaunay, minimum spanning trees, data-structures for proximity queries
2. Geometric computing: robustness, algebraic degree of predicates, programs that work/scale!
3. Computational topology: simplicial complexes, filtrations, input = distance matrix
→ paradigm of Topological Data Analysis (TDA)
◮ Showcasing libraries for CG software:
◮ CGAL http://www.cgal.org/ and Geometry Factory http://geometryfactory.com/
◮ Gudhi https://project.inria.fr/gudhi/ and Ayasdi http://www.ayasdi.com/
3. Outline
◮ Review of the basic algorithmic toolbox in computational geometry: Voronoi diagrams and dual Delaunay complexes, spanning balls
◮ Generalizations of those concepts and of the toolbox to information spaces:
◮ Riemannian computational information geometry
◮ Dually affine connections computational information geometry
◮ Applications to clustering, learning mixtures, etc.
What is a good/friendly geometric computing space?
4. Basics of Euclidean Computational Geometry: Voronoi diagrams and dual Delaunay complexes
5. Euclidean (ordinary) Voronoi diagrams
P = {P1, ..., Pn}: n distinct point generators in Euclidean space E^d
V(Pi) = {X : DE(Pi, X) ≤ DE(Pj, X), ∀j ≠ i}
Voronoi diagram = cell complex of the V(Pi)'s with their faces
6. Voronoi diagrams from bisectors and ∩ halfspaces
Bisectors Bi(P, Q) = {X : DE(P, X) = DE(Q, X)}
→ are hyperplanes in Euclidean geometry
Voronoi cells as halfspace intersections:
V(Pi) = {X : DE(Pi, X) ≤ DE(Pj, X), ∀j ≠ i} = ∩_{j≠i} Bi+(Pi, Pj)
DE(P, Q) = ‖θ(P) − θ(Q)‖₂ = √(Σ_{i=1}^d (θi(P) − θi(Q))²)
θ(P) = p: Cartesian coordinate system with θj(Pi) = p_i^(j).
⇒ Many applications of Voronoï diagrams: crystal growth, codebook/quantization, molecule interfaces/docking, motion planning, etc.
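The defining predicate above translates directly into code. A minimal sketch (assuming NumPy; the function name is illustrative) that locates the Voronoi cell V(P_i) containing a query point as the argmin of squared Euclidean distances to the generators:

```python
import numpy as np

def voronoi_cell_index(generators, x):
    """Index i of the Voronoi cell V(P_i) containing the query point x:
    the generator minimizing the Euclidean distance D_E(P_i, x)."""
    d2 = np.sum((generators - x) ** 2, axis=1)  # squared distances suffice
    return int(np.argmin(d2))

# Three generators in E^2
P = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
assert voronoi_cell_index(P, np.array([0.5, 0.5])) == 0
assert voronoi_cell_index(P, np.array([3.5, 1.0])) == 1
```

This point-location-by-argmin view is exactly what the generalizations below replace: swap the squared Euclidean distance for another divergence and the cell structure changes accordingly.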
7. Voronoi diagrams and dual Delaunay simplicial complex
◮ Empty sphere property, max-min angle triangulation, etc.
◮ Voronoi dual ⇔ Delaunay triangulation
→ non-degenerate point set = no (d + 2) points co-spherical
◮ Duality: Voronoi k-face ⇔ Delaunay (d − k)-simplex
◮ Bisector Bi(P, Q) perpendicular ⊥ to segment [PQ]
8. Voronoi & Delaunay: Complexity and algorithms
◮ Combinatorial complexity: Θ(n^⌈d/2⌉) (→ quadratic in 3D), matched for points on the moment curve t ↦ (t, t², ..., t^d)
◮ Construction: Θ(n log n + n^⌈d/2⌉), optimal
◮ some output-sensitive algorithms but...
◮ O(n log n + f): not yet optimal output-sensitive algorithms.
9. Modeling population spaces in information geometry
Population space {P(x)} interpreted as a smooth manifold equipped with the Fisher Information Matrix (FIM):
◮ Riemannian modeling: metric length space with the FIM as metric tensor (orthogonality), and the Levi-Civita metric connection for length-minimizing geodesics
◮ Dual ±1 affine connection modeling: dual geodesics that describe parallel transport, non-metric dual divergences induced by dual potential Legendre convex functions. Dual ±α connections.
→ Algorithmic considerations of these two approaches
Population space, parameter space, object-oriented geometry, etc.
11. Population spaces: Hotelling (1930) [12], Rao (1945) [33]
Birth of differential-geometric methods in statistics.
◮ The Fisher information matrix (non-degenerate, positive definite) can be used as a (smooth) Riemannian metric tensor g.
◮ Distance between two populations indexed by θ1 and θ2: Riemannian distance (metric length)
First applications in statistics:
◮ Fisher-Hotelling-Rao (FHR) geodesic distance used in classification: find the closest population to a given set of populations
◮ Used in tests of significance (null versus alternative hypothesis) and the power of a test: P(reject H0 | H0 is false)
→ define surfaces in population spaces
12. Rao's distance (1945, introduced by Hotelling 1930 [12])
◮ Infinitesimal squared length element:
ds² = Σ_{i,j} g_ij(θ) dθi dθj = dθ⊤ I(θ) dθ
◮ Geodesics and distances are hard to calculate explicitly:
ρ(p(x; θ1), p(x; θ2)) = min_{θ(s): θ(0)=θ1, θ(1)=θ2} ∫₀¹ √((dθ/ds)⊤ I(θ) (dθ/ds)) ds
Rao's distance is not known in closed form for multivariate normals
◮ Advantages: metric property of ρ + many tools of differential geometry [1]: Riemannian Log/Exp tangent/manifold mappings
13. Extrinsic Computational Geometry on tangent planes
◮ A tensor g = Q(x) ≻ 0 defines a smooth inner product ⟨p, q⟩_x = p⊤Q(x)q that induces a normed distance:
d_x(p, q) = ‖p − q‖_x = √((p − q)⊤Q(x)(p − q))
◮ Mahalanobis metric distance on tangent planes:
Δ(X1, X2) = √((μ1 − μ2)⊤ Σ⁻¹ (μ1 − μ2)) = √(Δμ⊤ Σ⁻¹ Δμ)
◮ Cholesky decomposition Σ = LL⊤:
Δ(X1, X2) = DE(L⁻¹μ1, L⁻¹μ2)
◮ CG on tangent planes = ordinary CG on the transformed points x′ ← L⁻¹x.
Extrinsic vs intrinsic means [10]
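The Cholesky reduction can be checked numerically: a sketch (assuming NumPy) verifying that the Mahalanobis distance coincides with the ordinary Euclidean distance on the whitened points x′ ← L⁻¹x, with Σ = LL⊤:

```python
import numpy as np

def mahalanobis(mu1, mu2, cov):
    d = mu1 - mu2
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def mahalanobis_whitened(mu1, mu2, cov):
    # Cholesky factor: cov = L L^T; map points by x' <- L^{-1} x,
    # then the Mahalanobis distance is the ordinary Euclidean one.
    L = np.linalg.cholesky(cov)
    y1 = np.linalg.solve(L, mu1)
    y2 = np.linalg.solve(L, mu2)
    return float(np.linalg.norm(y1 - y2))

cov = np.array([[2.0, 0.5], [0.5, 1.0]])
a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert abs(mahalanobis(a, b, cov) - mahalanobis_whitened(a, b, cov)) < 1e-12
```

The identity holds because d⊤(LL⊤)⁻¹d = (L⁻¹d)⊤(L⁻¹d), which is why ordinary CG algorithms apply unchanged after the linear transform.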
14. Mahalanobis Voronoi diagrams on tangent planes (extrinsic)
In statistics, the covariance matrix accounts for both correlation and dimension (feature) scaling
⇔
Dual structure ≡ anisotropic Delaunay triangulation
⇒ empty circumellipse property (Cholesky decomposition)
16. Riemannian statistical Voronoi diagrams
... for statistical population spaces:
◮ Location-scale 2D families have constant non-positive curvature (Hotelling, 1930): Riemannian statistical Voronoi diagrams amount to hyperbolic Voronoi diagrams, or to Euclidean diagrams (location families only, like isotropic Gaussians)
◮ The multinomial family has spherical geometry on the positive orthant: spherical Voronoi diagram (compute via stereographic projection ∝ Euclidean Voronoi diagrams)
But for arbitrary families p(x|θ): geodesics are not in closed form → limited computational framework in practice (ray shooting, etc.)
17. Normal/Gaussian family and 2D location-scale families
◮ Fisher Information Matrix (FIM):
I(θ) = [I_{i,j}(θ)] = E[ (∂/∂θi) log p(x|θ) · (∂/∂θj) log p(x|θ) ]
◮ FIM for univariate normal / multivariate spherical distributions:
I(μ, σ) = [[1/σ², 0], [0, 2/σ²]] = (1/σ²) [[1, 0], [0, 2]],  I(μ, σ) = diag(1/σ², ..., 1/σ², 2/σ²)
◮ → amounts to the Poincaré metric (dx² + dy²)/y², hyperbolic geometry in the upper half-plane/half-space.
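Under the standard identification (μ, σ) ↦ (μ/√2, σ) that turns the FIM above into the Poincaré half-plane metric, the Fisher-Rao distance between univariate normals is √2 times the half-plane distance; a sketch under that assumption:

```python
import math

def halfplane_dist(x1, y1, x2, y2):
    # Poincare upper half-plane distance for the metric (dx^2 + dy^2)/y^2
    return math.acosh(1.0 + ((x2 - x1) ** 2 + (y2 - y1) ** 2) / (2.0 * y1 * y2))

def fisher_rao_normal(mu1, s1, mu2, s2):
    # FIM = diag(1/s^2, 2/s^2): rescale mu by 1/sqrt(2), scale the distance by sqrt(2)
    r2 = math.sqrt(2.0)
    return r2 * halfplane_dist(mu1 / r2, s1, mu2 / r2, s2)

# Same mean: the distance reduces to sqrt(2) * |log(s2/s1)|
assert abs(fisher_rao_normal(0.0, 1.0, 0.0, math.e) - math.sqrt(2.0)) < 1e-9
```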
19. Matrix SPD spaces and hyperbolic geometry
Symmetric Positive Definite matrices M: ∀x ≠ 0, x⊤Mx > 0.
◮ The 2D SPD(2) matrix space has dimension d = 3: a positive cone.
SPD(2) = {(a, b, c) ∈ R³ : a > 0, ab − c² > 0}
◮ It can be peeled into sheets of dimension 2, each sheet corresponding to a constant value of the determinant of its elements [8]:
SPD(2) = SSPD(2) × R⁺, where SSPD(2) = {(a, b, c) : a > 0, ab − c² = 1}
◮ Mapping M(a, b, c) → H²:
◮ x₀ = (a + b)/2 ≥ 1, x₁ = (a − b)/2, x₂ = c in the hyperboloid model [28]
◮ z = (a − b + 2ic)/(2 + a + b) in the Poincaré disk [28].
20. Riemannian manifolds: Choice of equivalent models?
Many equivalent models of hyperbolic geometry:
◮ Conformal models (good for visualization since we can measure angles) versus non-conformal models (computationally friendly for geodesics).
◮ Convert equivalently to other models of hyperbolic geometry: Poincaré disk, upper half-space, hyperboloid, Beltrami hemisphere, etc.
Two questions:
◮ Given a metric tensor g and its induced metric distance ρg(p, q), what are the equivalent metric tensors g′ ∼ g such that ρg(p, q) = ρg′(p′, q′)? Is one metric tensor better for a computing space?
◮ Metrics yielding straight geodesics are fully characterized in 2D, but what about higher dimensions?
21. Riemannian Poincaré disk metric tensor (conformal)
→ often used in Human-Computer Interfaces, network routing (embedding trees), etc.
22. Riemannian Klein disk metric tensor (non-conformal)
◮ recommended for a computing space since geodesics are straight line segments
◮ Klein is also conformal at the origin (so we can perform translations from and back to the origin)
◮ Geodesics passing through O in the Poincaré disk are straight (so we can perform translations from and back to the origin)
23. Hyperbolic Voronoi diagrams [25, 29]
In arbitrary dimension, H^d:
◮ In the Klein disk, the hyperbolic Voronoi diagram amounts to a clipped affine Voronoi diagram, or a clipped power diagram with an efficient clipping algorithm [5].
◮ then convert to the other models of hyperbolic geometry: Poincaré disk, upper half-space, hyperboloid, Beltrami hemisphere, etc.
◮ Conformal (good for visualization) versus non-conformal (good for computing) models.
24. Hyperbolic Voronoi diagrams [25, 29]
Hyperbolic Voronoi diagram in the Klein disk = clipped power diagram.
Power distance: ‖x − p‖² − w_p
→ additively weighted ordinary Voronoi = ordinary CG
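The assignment step of a power diagram is a one-liner: a sketch (assuming NumPy; the weights here are illustrative inputs, not the specific weights produced by the Klein-model reduction of [5]) locating the power cell of a query point:

```python
import numpy as np

def power_cell_index(sites, weights, x):
    """Cell of the power diagram containing x: argmin_i ||x - p_i||^2 - w_i."""
    vals = np.sum((sites - x) ** 2, axis=1) - weights
    return int(np.argmin(vals))

sites = np.array([[0.0, 0.0], [2.0, 0.0]])
x = np.array([1.0, 0.0])  # equidistant from both sites
assert power_cell_index(sites, np.array([0.0, 0.0]), x) in (0, 1)
assert power_cell_index(sites, np.array([0.5, 0.0]), x) == 0  # weight enlarges cell 0
```

With all weights equal this degenerates to the ordinary Voronoi assignment, which is the "ordinary CG" reduction claimed above.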
25. Hyperbolic Voronoi diagrams [25, 29]
The 5 common models of the abstract hyperbolic geometry:
https://www.youtube.com/watch?v=i9IUzNxeH4o (5 min. video)
ACM Symposium on Computational Geometry (SoCG'14)
26. Dually affine connection computational information geometry
27. Dually flat space construction from convex functions F
◮ A strictly convex and differentiable function F(θ) admits a Legendre-Fenchel convex conjugate F*(η):
F*(η) = sup_θ (θ⊤η − F(θ)), ∇F(θ) = η = (∇F*)⁻¹(θ)
◮ Young's inequality gives rise to the canonical divergence [15]:
F(θ) + F*(η′) ≥ θ⊤η′ ⇒ A_{F,F*}(θ : η′) = F(θ) + F*(η′) − θ⊤η′
◮ Writing it in a single coordinate system, we get the dual Bregman divergences:
B_F(θp : θq) = F(θp) − F(θq) − (θp − θq)⊤∇F(θq)
= B_{F*}(ηq : ηp) = A_{F,F*}(θp, ηq) = A_{F*,F}(ηq : θp)
◮ dual affine coordinate systems with straight geodesics:
η = ∇F(θ) ⇔ θ = ∇F*(η). Tensor g(θ) = g*(η)
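The dual Bregman identity B_F(θp : θq) = B_{F*}(ηq : ηp) can be verified numerically; a sketch (assuming NumPy) using the generator F(θ) = Σ θi log θi − θi on positive vectors, whose Bregman divergence is the extended Kullback-Leibler divergence and whose Legendre conjugate is F*(η) = Σ e^{ηi}:

```python
import numpy as np

def bregman(F, gradF, x, y):
    return F(x) - F(y) - float((x - y) @ gradF(y))

F = lambda t: float(np.sum(t * np.log(t) - t))
gradF = lambda t: np.log(t)                  # eta = grad F(theta)
Fstar = lambda e: float(np.sum(np.exp(e)))   # Legendre conjugate
gradFstar = lambda e: np.exp(e)              # theta = grad F*(eta)

tp = np.array([0.5, 1.5]); tq = np.array([1.0, 0.5])
ep, eq = gradF(tp), gradF(tq)
lhs = bregman(F, gradF, tp, tq)              # B_F(theta_p : theta_q)
rhs = bregman(Fstar, gradFstar, eq, ep)      # B_F*(eta_q : eta_p)
assert abs(lhs - rhs) < 1e-12
assert np.allclose(gradFstar(gradF(tp)), tp)  # grad F* inverts grad F
```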
28. Dual divergence/Bregman dual bisectors [6, 24, 26]
Bregman sided (reference) bisectors related by convex duality:
Bi_F(θ1, θ2) = {θ ∈ Θ | B_F(θ : θ1) = B_F(θ : θ2)}
Bi_{F*}(η1, η2) = {η ∈ H | B_{F*}(η : η1) = B_{F*}(η : η2)}
Right-sided bisector: → θ-hyperplane, η-hypersurface
H_F(p, q) = {x ∈ X | B_F(x : p) = B_F(x : q)}
H_F: ⟨∇F(p) − ∇F(q), x⟩ + (F(p) − F(q) + ⟨q, ∇F(q)⟩ − ⟨p, ∇F(p)⟩) = 0
Left-sided bisector: → θ-hypersurface, η-hyperplane
H′_F(p, q) = {x ∈ X | B_F(p : x) = B_F(q : x)}
H′_F: ⟨∇F(x), q − p⟩ + F(p) − F(q) = 0
A hyperplane is an autoparallel submanifold of dimension d − 1.
29. Visualizing Bregman bisectors
Primal coordinates θ (natural parameters) versus dual coordinates η (expectation parameters).
[Figure: Itakura-Saito source space with p(0.52977081, 0.72041688), q(0.85824458, 0.29083834), D(p,q) = 0.66969016, D(q,p) = 0.44835617; Itakura-Saito dual gradient space with p′(−1.88760873, −1.38808518), q′(−1.16516903, −3.43833618), D*(p′,q′) = 0.44835617, D*(q′,p′) = 0.66969016.]
Bi(P, Q) and Bi*(P, Q) can be expressed in either the θ- or the η-coordinate system
30. Spaces of spheres: 1-to-1 mapping between d-spheres and (d + 1)-hyperplanes using potential functions
31. Space of Bregman spheres and Bregman balls [6]
Dual sided Bregman balls (bounding Bregman spheres):
Ball^r_F(c, r) = {x ∈ X | B_F(x : c) ≤ r}
Ball^l_F(c, r) = {x ∈ X | B_F(c : x) ≤ r}
Legendre duality:
Ball^l_F(c, r) = (∇F)⁻¹(Ball^r_{F*}(∇F(c), r))
Illustration for the Itakura-Saito divergence, F(x) = −log x
32. Space of Bregman spheres: Lifting map [6]
F: x ↦ x̂ = (x, F(x)), a hypersurface in R^{d+1} (the potential function graph)
Hp: tangent hyperplane at p̂: z = Hp(x) = ⟨x − p, ∇F(p)⟩ + F(p)
◮ Bregman sphere σ ⟶ σ̂ with supporting hyperplane H: z = ⟨x − c, ∇F(c)⟩ + F(c) + r (parallel to Hc and shifted vertically by r); σ̂ = F ∩ H.
◮ The intersection of any hyperplane H with F projects onto X as a Bregman sphere:
H: z = ⟨x, a⟩ + b → σ: Ball_F(c = (∇F)⁻¹(a), r = ⟨a, c⟩ − F(c) + b)
34. Space of Bregman spheres: Algorithmic applications [6]
◮ Union/intersection of Bregman d-spheres from the representational (d + 1)-polytope [6]
◮ The radical axis of two Bregman balls is a hyperplane: applications to nearest neighbor search trees like Bregman ball trees or Bregman vantage point trees [31].
35. Bregman proximity data structures [31]
Vantage point trees: partition space according to Bregman balls.
Partitioning space with intersections of Kullback-Leibler balls
→ efficient nearest neighbour queries in information spaces
36. Application: Minimum Enclosing Ball [23, 32]
To a hyperplane H = H(a, b): z = ⟨a, x⟩ + b in R^{d+1} corresponds a ball σ = Ball(c, r) in R^d with center c = ∇F*(a) and radius:
r = ⟨a, c⟩ − F(c) + b = ⟨a, ∇F*(a)⟩ − F(∇F*(a)) + b = F*(a) + b
since F(∇F*(a)) = ⟨∇F*(a), a⟩ − F*(a) (Young equality).
SEB: find the halfspace H(a, b)⁻: z ≤ ⟨a, x⟩ + b that contains all the lifted points:
min_{a,b} r = F*(a) + b, such that ∀i ∈ {1, ..., n}, ⟨a, xi⟩ + b − F(xi) ≥ 0
→ Convex Program (CP) with linear inequality constraints.
F(θ) = F*(η) = (1/2) x⊤x: CP → Quadratic Programming (QP) [11], as used in SVMs. The smallest enclosing ball is used as a primitive in SVMs [34].
37. Smallest Bregman enclosing balls [32, 22]
Algorithm 1: BBCA(P, l).
c1 ← choose a point of P uniformly at random;
for i = 2 to l − 1 do
// farthest point from ci wrt B_F
si ← argmax_{j=1..n} B_F(ci : pj);
// update the center: walk on the η-segment [ci, p_si]
c_{i+1} ← ∇F⁻¹(∇F(ci) #_{1/(i+1)} ∇F(p_si));
end
// Return the SEBB approximation
return Ball(cl, rl = B_F(cl : X));
θ-, η-geodesic segments in dually flat geometry.
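For F(x) = (1/2)⟨x, x⟩ the θ- and η-coordinates coincide, and BBCA reduces to the classical Bădoiu-Clarkson core-set iteration for the Euclidean minimum enclosing ball; a sketch of that special case (assuming NumPy):

```python
import numpy as np

def meb_badoiu_clarkson(P, iters=1000):
    """Approximate the minimum enclosing ball of the rows of P:
    repeatedly step 1/(i+1) of the way toward the current farthest point."""
    c = P[0].copy()
    for i in range(1, iters):
        far = P[np.argmax(np.sum((P - c) ** 2, axis=1))]  # farthest point from c
        c += (far - c) / (i + 1.0)                        # walk on the segment [c, far]
    r = float(np.max(np.linalg.norm(P - c, axis=1)))
    return c, r

P = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.5]])
c, r = meb_badoiu_clarkson(P)
assert np.allclose(c, [0.0, 0.0], atol=1e-2) and abs(r - 1.0) < 1e-2
```

For a general Bregman generator the same loop runs in the gradient (η) coordinates, exactly as in the pseudo-code above.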
38. Smallest enclosing balls: Core-sets [32]
Core-set C ⊆ S: SOL(S) ≤ SOL(C) ≤ (1 + ε)SOL(S)
[Figure: core-sets for the extended Kullback-Leibler and Itakura-Saito divergences.]
39. InSphere predicates wrt Bregman divergences [6]
Implicit representation of Bregman spheres/balls: consider d + 1 support points on the boundary
◮ Is x inside the Bregman ball defined by d + 1 support points?
◮ InSphere(x; p0, ..., pd) = sign of a (d + 2) × (d + 2) matrix determinant
◮ InSphere(x; p0, ..., pd) is negative, null or positive depending on whether x lies inside, on, or outside σ.
52. Smallest enclosing ball in Riemannian manifolds [2]
c = a #^M_t b: the point γ(t) on the geodesic line segment [ab] wrt M such that ρM(a, c) = t × ρM(a, b) (with ρM the metric distance on manifold M)
Algorithm 2: GeoA
c1 ← choose a point of P uniformly at random;
for i = 2 to l do
// farthest point from ci
si ← argmax_{j=1..n} ρ(ci, pj);
// update the center: walk on the geodesic line segment [ci, p_si]
c_{i+1} ← ci #^M_{1/(i+1)} p_si;
end
// Return the SEB approximation
return Ball(cl, rl = ρ(cl, P));
53. Approximating the smallest enclosing ball in hyperbolic space
[Figure: initialization, first to fourth iterations, and the result after 10^4 iterations.]
http://www.sonycsl.co.jp/person/nielsen/infogeo/RiemannMinimax/
54. Bregman dual regular/Delaunay triangulations
Embedded geodesic Delaunay triangulations + empty Bregman balls
[Figure: ordinary Delaunay, exponential Delaunay, and Hellinger-like Delaunay triangulations.]
◮ empty Bregman sphere property,
◮ geodesic triangles: embedded Delaunay.
55. Dually orthogonal Bregman Voronoi triangulations
The ordinary Voronoi diagram is perpendicular to the Delaunay triangulation: Voronoi k-face ⊥ Delaunay (d − k)-face
Bi(P, Q) ⊥ γ*(P, Q)
γ(P, Q) ⊥ Bi*(P, Q)
56. Synthetic geometry: Exact characterization of the Bayesian error exponent, but no closed form known
57. Bayesian hypothesis testing, MAP rule and probability of error Pe
◮ Mixture p(x) = Σ_i wi pi(x). Task = classify x: which component?
◮ Prior probabilities: wi = P(X ∼ Pi) > 0 (with Σ_{i=1}^n wi = 1)
◮ Conditional probabilities: P(X = x | X ∼ Pi).
P(X = x) = Σ_{i=1}^n P(X ∼ Pi) P(X = x | X ∼ Pi) = Σ_{i=1}^n wi P(X|Pi)
◮ Best rule = Maximum a posteriori probability (MAP) rule:
map(x) = argmax_{i∈{1,...,n}} wi pi(x)
where pi(x) = P(X = x | X ∼ Pi) are the conditional probabilities.
◮ For w1 = w2 = 1/2, the probability of error is
Pe = (1/2) ∫ min(p1(x), p2(x)) dx ≤ (1/2) ∫ p1^α(x) p2^{1−α}(x) dx, for α ∈ (0, 1).
Best exponent α*
58. Error exponent for exponential families
◮ Exponential families have finite-dimensional sufficient statistics: → reduce the n data to D statistics.
∀x ∈ X, P(x|θ) = exp(θ⊤t(x) − F(θ) + k(x))
F(·): log-normalizer/cumulant/partition function; k(x): auxiliary term for the carrier measure.
◮ Maximum likelihood estimator (MLE): ∇F(θ̂) = (1/n) Σ_i t(Xi) = η̂
◮ Bijection between exponential families and Bregman divergences:
log p(x|θ) = −B_{F*}(t(x) : η) + F*(t(x)) + k(x)
Exponential families are log-concave
59. Geometry of the best error exponent
On the exponential family manifold, the Chernoff α-coefficient [7]:
c_α(P1 : P2) = ∫ p1^α(x) p2^{1−α}(x) dμ(x) = exp(−J_F^{(α)}(θ1 : θ2))
Skew Jensen divergence [20] on the natural parameters:
J_F^{(α)}(θ1 : θ2) = αF(θ1) + (1 − α)F(θ2) − F(θ12^{(α)}), with θ12^{(α)} = αθ1 + (1 − α)θ2
Chernoff information = Bregman divergence for exponential families:
C(P1 : P2) = B(θ1 : θ12^{(α*)}) = B(θ2 : θ12^{(α*)})
Finding the best error exponent α*?
60. Geometry of the best error exponent: binary hypothesis [17]
Chernoff distribution P*:
P* = P*12 = Ge(P1, P2) ∩ Bim(P1, P2)
e-geodesic:
Ge(P1, P2) = { E12^{(λ)} | θ(E12^{(λ)}) = (1 − λ)θ1 + λθ2, λ ∈ [0, 1] }
m-bisector:
Bim(P1, P2): { P | F(θ1) − F(θ2) + η(P)⊤(θ2 − θ1) = 0 }
Optimal natural parameter of P*:
θ* = θ12^{(α*)} = argmin_{θ∈Ge} B(θ1 : θ) = argmin_{θ∈Ge} B(θ2 : θ).
→ closed form for order-1 families, or efficient bisection search.
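The search is easy to sketch for the Poisson family, an order-1 exponential family with F(θ) = e^θ (a closed form exists in that case; the search is shown only for illustration). Since the skew Jensen divergence is concave in α, a ternary search finds α*, and at the optimum θ* is Bregman-equidistant from θ1 and θ2:

```python
import math

F = math.exp        # Poisson log-normalizer: F(theta) = e^theta, theta = log(rate)
gradF = math.exp

def bregman(t1, t2):
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def skew_jensen(alpha, t1, t2):
    return alpha * F(t1) + (1 - alpha) * F(t2) - F(alpha * t1 + (1 - alpha) * t2)

def chernoff_alpha(t1, t2, iters=200):
    """Ternary search maximizing the (concave in alpha) skew Jensen divergence."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if skew_jensen(a, t1, t2) < skew_jensen(b, t1, t2):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)

t1, t2 = math.log(1.0), math.log(5.0)   # two Poisson rates: 1 and 5
astar = chernoff_alpha(t1, t2)
tstar = astar * t1 + (1 - astar) * t2   # optimal natural parameter theta*
# At the optimum, theta* is Bregman-equidistant from theta1 and theta2
assert abs(bregman(t1, tstar) - bregman(t2, tstar)) < 1e-6
```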
61. Geometry of the best error exponent: binary hypothesis
P* = P*12 = Ge(P1, P2) ∩ Bim(P1, P2)
[Figure: in the η-coordinate system, the e-geodesic Ge(P1, P2) crosses the m-bisector Bim(P1, P2) at P*12, with C(θ1 : θ2) = B(θ1 : θ*12).]
Binary hypothesis testing: Pe is bounded using the Bregman divergence between the Chernoff distribution and the class-conditional distributions.
62. Clustering and learning finite statistical mixtures
63. α-divergences
For α ∈ R, α ≠ ±1, the α-divergences [9] on positive arrays [36]:
◮ D_α(p : q) := (4/(1 − α²)) Σ_{i=1}^d ( ((1 − α)/2) pi + ((1 + α)/2) qi − pi^{(1−α)/2} qi^{(1+α)/2} )
with D_α(p : q) = D_{−α}(q : p), and in the limit cases D_{−1}(p : q) = KL(p : q) and D_1(p : q) = KL(q : p), where KL is the extended Kullback-Leibler divergence KL(p : q) := Σ_{i=1}^d (pi log(pi/qi) + qi − pi)
◮ α-divergences belong to the class of Csiszár f-divergences I_f(p : q) := Σ_{i=1}^d qi f(pi/qi) with the following generator:
f(t) = (4/(1 − α²))(1 − t^{(1+α)/2}), if α ≠ ±1;  t ln t, if α = 1;  −ln t, if α = −1
Information monotonicity
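A direct implementation (plain Python), checking the limit case D_{−1}(p : q) = KL(p : q) numerically with α slightly above −1 to avoid the removable singularity:

```python
import math

def alpha_div(p, q, alpha):
    """alpha-divergence on positive arrays, alpha != +/-1."""
    s = 0.0
    for pi, qi in zip(p, q):
        s += (4.0 / (1 - alpha ** 2)) * (
            (1 - alpha) / 2 * pi + (1 + alpha) / 2 * qi
            - pi ** ((1 - alpha) / 2) * qi ** ((1 + alpha) / 2))
    return s

def ext_kl(p, q):  # extended Kullback-Leibler divergence
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

p, q = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]
# Limit case: D_{-1}(p : q) = extended KL(p : q)
assert abs(alpha_div(p, q, -1 + 1e-6) - ext_kl(p, q)) < 1e-4
assert alpha_div(p, q, 0.5) >= 0  # nonnegativity (weighted AM-GM)
```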
64. Mixed divergences [30]
Defined on three parameters p, q and r:
M_λ(p : q : r) := λD(p : q) + (1 − λ)D(q : r), for λ ∈ [0, 1].
Mixed divergences include:
◮ the sided divergences for λ ∈ {0, 1},
◮ the symmetrized (arithmetic mean) divergence for λ = 1/2, or the skew-symmetrized divergence for λ ≠ 1/2.
65. Symmetrizing α-divergences
S_α(p, q) = (1/2)(D_α(p : q) + D_α(q : p)) = S_{−α}(p, q) = M_{1/2}(p : q : p)
For α = ±1, we get half of Jeffreys divergence:
S_{±1}(p, q) = (1/2) Σ_{i=1}^d (pi − qi) log(pi/qi)
◮ Centroids for the symmetrized α-divergences are usually not in closed form.
◮ How to perform center-based clustering without closed-form centroids?
66. Jeffreys positive centroid [16]
◮ The Jeffreys divergence is the symmetrized α = ±1 divergence.
◮ The Jeffreys positive centroid c = (c1, ..., cd) of a set {h1, ..., hn} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:
ci = ai / W(ai e / gi)
where ai = Σ_{j=1}^n πj hi^j denotes the coordinate-wise arithmetic weighted means and gi = Π_{j=1}^n (hi^j)^{πj} the coordinate-wise geometric weighted means.
◮ The Lambert analytic function W [4] (positive branch) is defined by W(x) e^{W(x)} = x for x ≥ 0.
◮ → Jeffreys k-means clustering. But for α ≠ ±1, how to cluster?
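A sketch of the closed form (plain Python; the Lambert W positive branch is computed by a simple Newton iteration rather than a library call):

```python
import math

def lambert_w(x, iters=50):
    """Positive branch of Lambert W via Newton iteration: W(x) e^{W(x)} = x."""
    w = math.log1p(x)  # crude but safe initial guess for x >= 0
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (w + 1.0))
    return w

def jeffreys_positive_centroid(hists, weights):
    """Component-wise c_i = a_i / W(e * a_i / g_i), with a_i / g_i the weighted
    arithmetic / geometric means, as in the closed form above."""
    d = len(hists[0])
    c = []
    for i in range(d):
        a = sum(w * h[i] for w, h in zip(weights, hists))
        g = math.exp(sum(w * math.log(h[i]) for w, h in zip(weights, hists)))
        c.append(a / lambert_w(math.e * a / g))
    return c

# Sanity check: the centroid of a single histogram is the histogram itself,
# since then a = g = h and W(e) = 1.
h = [0.2, 0.5, 0.3]
c = jeffreys_positive_centroid([h], [1.0])
assert all(abs(ci - hi) < 1e-9 for ci, hi in zip(c, h))
```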
67. Mixed α-divergences / α-Jeffreys symmetrized divergence
◮ Mixed α-divergence between a histogram x and two histograms p and q:
M_{λ,α}(p : x : q) = λD_α(p : x) + (1 − λ)D_α(x : q)
= λD_{−α}(x : p) + (1 − λ)D_{−α}(q : x)
= M_{1−λ,−α}(q : x : p)
◮ The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:
S_α(p, q) = M_{1/2,α}(q : p : q) = M_{1/2,α}(p : q : p)
◮ The skew-symmetrized α-divergence is defined by:
S_{λ,α}(p : q) = λD_α(p : q) + (1 − λ)D_α(q : p)
68. Mixed divergence-based k-means clustering
Initialize with k distinct seeds from the dataset, with li = ri.
Input: weighted histogram set H, divergence D(·, ·), integer k > 0, real λ ∈ [0, 1];
Initialize left-sided/right-sided seeds C = {(li, ri)}_{i=1}^k;
repeat
// Assignment
for i = 1, 2, ..., k do
Ci ← {h ∈ H : i = argmin_j M_λ(lj : h : rj)};
end
// Dual-sided centroid relocation
for i = 1, 2, ..., k do
ri ← argmin_x D(Ci : x) = Σ_{hj∈Ci} wj D(hj : x);
li ← argmin_x D(x : Ci) = Σ_{hj∈Ci} wj D(x : hj);
end
until convergence;
69. Mixed α-hard clustering: MAhC(H, k, λ, α)
Input: weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ R;
Let C = {(li, ri)}_{i=1}^k ← MAS(H, k, λ, α);
repeat
// Assignment
for i = 1, 2, ..., k do
Ai ← {h ∈ H : i = argmin_j M_{λ,α}(lj : h : rj)};
end
// Centroid relocation
for i = 1, 2, ..., k do
ri ← (Σ_{h∈Ai} wh h^{(1−α)/2})^{2/(1−α)};
li ← (Σ_{h∈Ai} wh h^{(1+α)/2})^{2/(1+α)};
end
until convergence;
70. Coupled k-Means++ α-Seeding
Algorithm 3: Mixed α-seeding; MAS(H, k, λ, α)
Input: weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ R;
Let C ← hj with uniform probability;
for i = 2, 3, ..., k do
Pick at random a histogram h ∈ H with probability:
πH(h) := wh M_{λ,α}(ch : h : ch) / Σ_{y∈H} wy M_{λ,α}(cy : y : cy),   (1)
// where (ch, ch) := argmin_{(z,z)∈C} M_{λ,α}(z : h : z);
C ← C ∪ {(h, h)};
end
Output: set of initial cluster centers C;
→ Guaranteed probabilistic bound. Just need to initialize! No centroid computations.
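A sketch of the seeding loop for the symmetric case li = ri, with D taken as the extended Kullback-Leibler divergence, λ = 1/2, and uniform histogram weights (all simplifying assumptions; the histograms are toy data):

```python
import math, random

def kl(p, q):  # extended Kullback-Leibler divergence on positive arrays
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

def mixed(l, h, r, lam=0.5):  # M_lambda(l : h : r) with D = extended KL
    return lam * kl(l, h) + (1 - lam) * kl(h, r)

def mixed_seeding(H, k, lam=0.5, seed=0):
    """k-means++-style seeding: draw each new seed with probability proportional
    to its mixed divergence to the closest current seed pair (l = r here)."""
    rng = random.Random(seed)
    C = [rng.choice(H)]
    while len(C) < k:
        costs = [min(mixed(c, h, c, lam) for c in C) for h in H]
        x, acc = rng.random() * sum(costs), 0.0
        for h, cost in zip(H, costs):
            acc += cost
            if acc > x:
                C.append(h)
                break
    return C

H = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
seeds = mixed_seeding(H, 2)
assert len(seeds) == 2 and all(s in H for s in seeds)
```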
71. Learning MMs: A geometric hard clustering viewpoint
Learn the parameters of a mixture m(x) = Σ_{i=1}^k wi p(x|θi)
Maximize the complete data likelihood = clustering objective function:
max_{W,Θ} lc(W, Θ) = Σ_{i=1}^n Σ_{j=1}^k z_{i,j} log(wj p(xi|θj))
= max_Θ Σ_{i=1}^n max_{j=1..k} log(wj p(xi|θj))
≡ min_{W,Θ} Σ_{i=1}^n min_{j=1..k} Dj(xi),
where cj = (wj, θj) (cluster prototype) and Dj(xi) = −log p(xi|θj) − log wj are potential distance-like functions.
One can further attach to each cluster a different family of probability distributions.
72. Generalized k-MLE for learning statistical mixtures
Model-based clustering: assignment of points to clusters:
D_{wj,θj,Fj}(x) = −log p_{Fj}(x; θj) − log wj
k-GMLE:
1. Initialize the weights W ∈ Δk and the family types (F1, ..., Fk) for each cluster
2. Solve min_Θ Σ_i min_j Dj(xi) (center-based clustering for W fixed) with potential functions Dj(xi) = −log p_{Fj}(xi|θj) − log wj
3. Solve the family types maximizing the MLE in each cluster Cj by choosing the parametric family of distributions Fj = F(γj) that yields the best likelihood: min_{F1=F(γ1),...,Fk=F(γk)∈F(Γ)} Σ_i min_j D_{wj,θj,Fj}(xi), with ∀l, γl = argmax_j F*_j(η̂l = (1/nl) Σ_{x∈Cl} tj(x)) + (1/nl) Σ_{x∈Cl} k(x)
4. Update the weights W as the cluster point proportions
5. Test for convergence, and go to step 2 otherwise.
Drawback = biased, non-consistent estimator due to Voronoi support truncation.
73. Computing f-divergences for a generic f: beyond stochastic numerical integration
74. f-divergences
I_f(X1 : X2) = ∫ x1(x) f(x2(x)/x1(x)) dν(x) ≥ 0

Name of the f-divergence | Formula I_f(P : Q) | Generator f(u) with f(1) = 0
Total variation (metric) | (1/2) ∫ |p(x) − q(x)| dν(x) | (1/2)|u − 1|
Squared Hellinger | ∫ (√p(x) − √q(x))² dν(x) | (√u − 1)²
Pearson χ²_P | ∫ (q(x) − p(x))²/p(x) dν(x) | (u − 1)²
Neyman χ²_N | ∫ (p(x) − q(x))²/q(x) dν(x) | (1 − u)²/u
Pearson-Vajda χ^k_P | ∫ (q(x) − p(x))^k/p^{k−1}(x) dν(x) | (u − 1)^k
Pearson-Vajda |χ|^k_P | ∫ |q(x) − p(x)|^k/p^{k−1}(x) dν(x) | |u − 1|^k
Kullback-Leibler | ∫ p(x) log(p(x)/q(x)) dν(x) | −log u
reverse Kullback-Leibler | ∫ q(x) log(q(x)/p(x)) dν(x) | u log u
α-divergence | (4/(1 − α²))(1 − ∫ p^{(1−α)/2}(x) q^{(1+α)/2}(x) dν(x)) | (4/(1 − α²))(1 − u^{(1+α)/2})
Jensen-Shannon | (1/2) ∫ (p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x)))) dν(x) | −(u + 1) log((1+u)/2) + u log u
75. f-divergences and higher-order Vajda χ^k divergences
I_f(X1 : X2) = Σ_{k=0}^∞ (f^{(k)}(1)/k!) χ^k_P(X1 : X2)
χ^k_P(X1 : X2) = ∫ (x2(x) − x1(x))^k / x1(x)^{k−1} dν(x),
|χ|^k_P(X1 : X2) = ∫ |x2(x) − x1(x)|^k / x1(x)^{k−1} dν(x),
are f-divergences for the generators (u − 1)^k and |u − 1|^k.
◮ When k = 1, χ¹_P(X1 : X2) = ∫ (x1(x) − x2(x)) dν(x) = 0 (never discriminative), and |χ|¹_P(X1, X2) is twice the total variation distance.
◮ χ^k_P is a signed distance
76. Affine exponential families
Canonical decomposition of the probability measure:
p(x) = exp(⟨t(x), θ⟩ − F(θ) + k(x)),
and consider an affine natural parameter space (like multinomials).
Poi(λ): p(x|λ) = λ^x e^{−λ}/x!, λ > 0, x ∈ {0, 1, ...}
NorI(μ): p(x|μ) = (2π)^{−d/2} e^{−(1/2)(x−μ)⊤(x−μ)}, μ ∈ R^d, x ∈ R^d

Family | θ | Θ | F(θ) | k(x) | t(x) | ν
Poisson | log λ | R | e^θ | −log x! | x | νc (counting)
Iso. Gaussian | μ | R^d | (1/2)θ⊤θ | −(d/2) log 2π − (1/2)x⊤x | x | νL (Lebesgue)
77. Higher-order Vajda χ^k divergences
The (signed) χ^k_P distance between members X1 ∼ EF(θ1) and X2 ∼ EF(θ2) of the same affine exponential family is (k ∈ N) always bounded and equal to:
χ^k_P(X1 : X2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{F((1−j)θ1+jθ2)} / e^{(1−j)F(θ1)+jF(θ2)}
For Poisson/Normal distributions, we get closed-form formulas:
χ^k_P(λ1 : λ2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{λ1^{1−j} λ2^j − ((1−j)λ1 + jλ2)},
χ^k_P(μ1 : μ2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{(1/2) j(j−1)(μ1−μ2)⊤(μ1−μ2)}.
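The Poisson closed form can be cross-checked against a direct numerical summation over the support (plain Python; the truncation point of the sum is an arbitrary cutoff chosen large enough for the rates used):

```python
import math

def chi_k_poisson_closed(k, l1, l2):
    """Closed-form signed Pearson-Vajda chi^k between Poisson(l1) and Poisson(l2)."""
    return sum((-1) ** (k - j) * math.comb(k, j)
               * math.exp(l1 ** (1 - j) * l2 ** j - ((1 - j) * l1 + j * l2))
               for j in range(k + 1))

def chi_k_poisson_direct(k, l1, l2, xmax=60):
    """Direct numerical sum of (p2 - p1)^k / p1^(k-1) over a truncated support."""
    def pmf(lam, x):  # Poisson pmf evaluated in log-space to avoid overflow
        return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))
    return sum((pmf(l2, x) - pmf(l1, x)) ** k / pmf(l1, x) ** (k - 1)
               for x in range(xmax))

l1, l2 = 3.0, 4.0
assert abs(chi_k_poisson_closed(1, l1, l2)) < 1e-9  # chi^1 never discriminates
assert abs(chi_k_poisson_closed(2, l1, l2) - chi_k_poisson_direct(2, l1, l2)) < 1e-6
```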
78. f-divergences: Analytic formula [14]
◮ For λ = 1 ∈ int(dom(f^{(i)})), the f-divergence admits the analytic expansion (Theorem 1 of [3]):
I_f(X1 : X2) = Σ_{k≥0} (f^{(k)}(1)/k!) χ^k_P(X1 : X2)
90. Geometrically designed divergences
[Figure: plot of the convex generator F: (x, F(x)), with lifted points (p, F(p)), (q, F(q)) and midpoint (p + q)/2, illustrating the Bregman divergence B(p : q), the Jensen divergence J(p, q), and the total Bregman divergence tB(p : q).]
92. Total Bregman divergences [13]
Conformal divergence with conformal factor ρ:
D′(p : q) = ρ(p, q) D(p : q)
ρ plays the role of a regularizer [35].
Invariance under rotations of the axes of the design space:
tB(p : q) = B(p : q)/√(1 + ⟨∇F(q), ∇F(q)⟩) = ρ_B(q) B(p : q),
ρ_B(q) = 1/√(1 + ⟨∇F(q), ∇F(q)⟩).
For example, the total squared Euclidean divergence:
tE(p, q) = (1/2) ⟨p − q, p − q⟩/√(1 + ⟨q, q⟩).
93. Total skew Jensen divergences [27]
tB(p : q) = ρ_B(q) B(p : q), ρ_B(q) = √(1/(1 + ⟨∇F(q), ∇F(q)⟩))
tJ(p : q) = ρ_J(p, q) J(p : q), ρ_J(p, q) = √(1/(1 + (F(p) − F(q))²/⟨p − q, p − q⟩))
The Jensen-Shannon divergence, whose square root is a metric:
JS(p, q) = (1/2) Σ_{i=1}^d pi log(2pi/(pi + qi)) + (1/2) Σ_{i=1}^d qi log(2qi/(pi + qi))
But the square root of the total Jensen-Shannon divergence is not a metric.
94. Summary: Geometric Computing in Information Spaces
◮ Location-scale families, spherical normals, symmetric positive definite matrices → hyperbolic geometry.
◮ Hyperbolic geometry: CG affine constructions in the Klein disk
◮ Space of spheres in dually affine connection geometry
◮ Synthetic geometry for characterizing the best error exponent in Bayes error
◮ Conformal divergences: total Bregman/total Jensen divergences
◮ Clustering with a pair of centroids per cluster, using mixed divergences for symmetrized α-divergences
◮ Learning statistical mixtures by maximizing the complete likelihood as a sequence of geometric clustering problems: k-GMLE
◮ In search of closed-form solutions: Jeffreys centroid using the Lambert W function, f-divergence approximation for affine exponential families.
95. Computational Information Geometry (edited books) [19] [18]
http://www.springer.com/engineering/signals/book/978-3-642-30231-2
http://www.sonycsl.co.jp/person/nielsen/infogeo/MIG/MIGBOOKWEB/
http://www.springer.com/engineering/signals/book/978-3-319-05316-5
http://www.sonycsl.co.jp/person/nielsen/infogeo/GTI/GeometricTheoryOfInformation.html
96. Geometric Sciences of Information (GSI) 2015
October 28-30th, 2015. Submission deadline: 1st March 2015
http://www.gsi2015.org/
97. Thank you!
98. References
Marc Arnaudon and Frank Nielsen. On approximating the Riemannian 1-center. Computational Geometry: Theory and Applications, 46(1):93-104, January 2013.
N. S. Barnett, P. Cerone, S. S. Dragomir, and A. Sofo. Approximating Csiszár f-divergence by the use of Taylor's formula with integral remainder. Mathematical Inequalities & Applications, 5(3):417-434, 2002.
D. A. Barry, P. J. Culligan-Hensley, and S. J. Barry. Real values of the W-function. ACM Transactions on Mathematical Software, 21(2):161-171, June 1995.
Jean-Daniel Boissonnat and Christophe Delage. Convex hull and Voronoi diagram of additively weighted points. In Gerth Stølting Brodal and Stefano Leonardi, editors, ESA, volume 3669 of Lecture Notes in Computer Science, pages 367-378. Springer, 2005.
Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete and Computational Geometry, 44(2):281-307, April 2010.
Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493-507, 1952.
Pascal Chossat and Olivier P. Faugeras. Hyperbolic planforms in relation to visual edges and textures perception. PLoS Computational Biology, 5(12), 2009.
Andrzej Cichocki, Sergio Cruces, and Shun-ichi Amari. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy, 13(1):134-170, 2011.
P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang C. Joshi. Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging, 23(8):995-1005, 2004.
Bernd Gärtner and Sven Schönherr. An efficient, exact, and generic quadratic programming solver for geometric optimization. In Proceedings of the Sixteenth Annual Symposium on Computational Geometry, pages 110-118. ACM, 2000.
Harold Hotelling.
Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407-2419, 2012.
F. Nielsen and R. Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):10-13, 2014.
Frank Nielsen. Legendre transformation and information geometry. Technical Report CIG-MEMO2, September 2010.
Frank Nielsen. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Processing Letters, PP(99):1-1, 2013.
Frank Nielsen. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognition Letters, 42:25-34, 2014.
Frank Nielsen. Geometric Theory of Information. Springer, 2014.
Frank Nielsen and Rajendra Bhatia, editors. Matrix Information Geometry (Revised Invited Papers). Springer, 2012.
Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455-5466, August 2011.
Frank Nielsen and Richard Nock. On approximating the smallest enclosing Bregman balls. In Proceedings of the Twenty-second Annual Symposium on Computational Geometry (SCG '06), pages 485-486, New York, NY, USA, 2006. ACM.
Frank Nielsen and Richard Nock. On the smallest enclosing information disk. Information Processing Letters (IPL), 105(3):93-97, 2008.
Frank Nielsen and Richard Nock. The dual Voronoi diagrams with respect to representational Bregman divergences. In International Symposium on Voronoi Diagrams (ISVD), pages 71-78, 2009.
Frank Nielsen and Richard Nock. Hyperbolic Voronoi diagrams made easy. In International Conference on Computational Science and its Applications (ICCSA), volume 1, pages 74-80, Los Alamitos, CA, USA, March 2010. IEEE Computer Society.
Frank Nielsen and Richard Nock. Total Jensen divergences: Definition, properties and k-means++ clustering. CoRR, abs/1309.7109, 2013.
Frank Nielsen and Richard Nock. Visualizing hyperbolic Voronoi diagrams. In Proceedings of the Thirtieth Annual Symposium on Computational Geometry (SoCG'14), pages 90:90-90:91, New York, NY, USA, 2014. ACM.
Frank Nielsen, Richard Nock, and Shun-ichi Amari. On clustering histograms with k-means by using mixed α-divergences. Entropy, 16(6):3273-3301, 2014.
Frank Nielsen, Paolo Piro, and Michel Barlaud. Bregman vantage point trees for efficient nearest neighbor queries. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878-881, 2009.
Richard Nock and Frank Nielsen. Fitting the smallest enclosing Bregman ball. In Machine Learning, volume 3720 of Lecture Notes in Computer Science, pages 649-656. Springer Berlin Heidelberg, 2005.
Calyampudi Radhakrishna Rao. Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37:81-89, 1945.
Ivor W. Tsang, Andras Kocsor, and James T. Kwok. Simpler core vector machines with enclosing balls. In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 911-918, New York, NY, USA, 2007. ACM.
Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, pages 475-483, 2011.
Huaiyu Zhu and Richard Rohwer. Measurements of generalisation based on information geometry. In Stephen W. Ellacott, John C. Mason, and Iain J. Anderson, editors, Mathematics of Neural Networks, volume 8 of Operations Research/Computer Science Interfaces Series, pages 394-398. Springer US, 1997.