Slides: Total Jensen divergences: Definition, Properties and k-Means++ Clustering
1. Total Jensen divergences: Definition, Properties and k-Means++ Clustering
Frank Nielsen (1), Richard Nock (2)
www.informationgeometry.org
(1) Sony Computer Science Laboratories, Inc.
(2) UAG-CEREGMIA
September 2013
© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.
2. Divergences: Distortion measures
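F is a smooth convex function, the generator.
◮ Skew Jensen divergences:
$$J'_\alpha(p : q) = \alpha F(p) + (1-\alpha)F(q) - F(\alpha p + (1-\alpha)q) = (F(p)F(q))_\alpha - F((pq)_\alpha),$$
where $(pq)_\gamma = \gamma p + (1-\gamma)q = q + \gamma(p-q)$ and $(F(p)F(q))_\gamma = \gamma F(p) + (1-\gamma)F(q) = F(q) + \gamma(F(p)-F(q))$.
◮ Bregman divergences:
$$B(p : q) = F(p) - F(q) - \langle p-q, \nabla F(q)\rangle,$$
$$\lim_{\alpha\to 0} J_\alpha(p : q) = B(p : q), \qquad \lim_{\alpha\to 1} J_\alpha(p : q) = B(q : p),$$
where $J_\alpha = J'_\alpha/(\alpha(1-\alpha))$ is the scaled skew Jensen divergence (the scaling under which these limits are finite and nonzero).
◮ Statistical Bhattacharyya divergence:
$$\mathrm{Bhat}(p_1 : p_2) = -\log\int p_1(x)^\alpha\, p_2(x)^{1-\alpha}\, d\nu(x) = J'_\alpha(\theta_1 : \theta_2)$$
for exponential families [5].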
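The definitions on this slide can be checked numerically. The sketch below is only an illustration: the generator F(x) = sum_i (x_i log x_i - x_i), the sample points, and all function names are assumptions, and the scaling J_alpha = J'_alpha / (alpha (1 - alpha)) is assumed because it is what makes the two limits finite.

```python
import math

def F(x):
    """Example generator (an assumption): F(x) = sum_i (x_i log x_i - x_i), x_i > 0."""
    return sum(xi * math.log(xi) - xi for xi in x)

def grad_F(x):
    """Gradient of the example generator: (log x_i)_i."""
    return [math.log(xi) for xi in x]

def interpolate(p, q, gamma):
    """(pq)_gamma = gamma * p + (1 - gamma) * q."""
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(p, q)]

def skew_jensen(p, q, alpha):
    """J'_alpha(p : q) = (F(p)F(q))_alpha - F((pq)_alpha)."""
    return alpha * F(p) + (1.0 - alpha) * F(q) - F(interpolate(p, q, alpha))

def scaled_skew_jensen(p, q, alpha):
    """J_alpha(p : q) = J'_alpha(p : q) / (alpha * (1 - alpha))."""
    return skew_jensen(p, q, alpha) / (alpha * (1.0 - alpha))

def bregman(p, q):
    """B(p : q) = F(p) - F(q) - <p - q, grad F(q)>."""
    g = grad_F(q)
    return F(p) - F(q) - sum((a - b) * gi for a, b, gi in zip(p, q, g))

p, q = [1.0, 2.0], [3.0, 0.5]
near_zero = scaled_skew_jensen(p, q, 1e-6)        # should approach B(p : q)
near_one = scaled_skew_jensen(p, q, 1.0 - 1e-6)   # should approach B(q : p)
print(near_zero, bregman(p, q))
print(near_one, bregman(q, p))
```

With alpha this close to 0 or 1 the two printed pairs agree to several digits, which is consistent with the limits stated above.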
3. Geometrically designed divergences
[Figure: plot of the convex generator F : x ↦ (x, F(x)), marking the points (p, F(p)) and (q, F(q)), the midpoint (p+q)/2 on the horizontal axis, and the quantities J(p, q), B(p : q) and tB(p : q).]
4. Total Bregman divergences
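Conformal divergence with conformal factor ρ:
$$D'(p : q) = \rho(p, q)\, D(p : q)$$
The factor ρ plays the rôle of a "regularizer" [8].
Invariance by rotation of the axes of the design space:
$$tB(p : q) = \frac{B(p : q)}{\sqrt{1 + \langle\nabla F(q), \nabla F(q)\rangle}} = \rho_B(q)\, B(p : q), \qquad \rho_B(q) = \frac{1}{\sqrt{1 + \langle\nabla F(q), \nabla F(q)\rangle}}.$$
Total squared Euclidean divergence:
$$tE(p, q) = \frac{1}{2}\, \frac{\langle p-q, p-q\rangle}{\sqrt{1 + \langle q, q\rangle}}.$$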
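As a quick sanity check of the formulas on this slide, the sketch below computes the total Bregman divergence for a generic generator and confirms that the choice F(x) = (1/2)<x, x> recovers the total squared Euclidean divergence. The function names and sample points are illustrative assumptions.

```python
import math

def dot(u, v):
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))

def total_bregman(p, q, F, grad_F):
    """tB(p : q) = rho_B(q) * B(p : q), with rho_B(q) = 1 / sqrt(1 + <grad F(q), grad F(q)>)."""
    g = grad_F(q)
    diff = [a - b for a, b in zip(p, q)]
    breg = F(p) - F(q) - dot(diff, g)
    rho_B = 1.0 / math.sqrt(1.0 + dot(g, g))
    return rho_B * breg

def half_sq_norm(x):
    """Generator of the squared Euclidean case: F(x) = <x, x> / 2."""
    return 0.5 * dot(x, x)

def identity(x):
    """Gradient of half_sq_norm: grad F(x) = x."""
    return list(x)

def total_sq_euclidean(p, q):
    """tE(p, q) = (1/2) <p - q, p - q> / sqrt(1 + <q, q>)."""
    diff = [a - b for a, b in zip(p, q)]
    return 0.5 * dot(diff, diff) / math.sqrt(1.0 + dot(q, q))

p, q = [1.0, 2.0], [3.0, 0.5]
print(total_bregman(p, q, half_sq_norm, identity), total_sq_euclidean(p, q))
```

Both printed values coincide because B(p : q) reduces to (1/2)<p - q, p - q> and grad F(q) = q for this generator.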
5. Total Jensen divergences
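$$tB(p : q) = \rho_B(q)\, B(p : q), \qquad \rho_B(q) = \frac{1}{\sqrt{1 + \langle\nabla F(q), \nabla F(q)\rangle}}$$
$$tJ_\alpha(p : q) = \rho_J(p, q)\, J_\alpha(p : q), \qquad \rho_J(p, q) = \frac{1}{\sqrt{1 + \frac{(F(p) - F(q))^2}{\langle p-q, p-q\rangle}}}$$
Jensen-Shannon divergence, whose square root is a metric [2]:
$$JS(p, q) = \frac{1}{2}\sum_{i=1}^{d} p_i \log\frac{2p_i}{p_i + q_i} + \frac{1}{2}\sum_{i=1}^{d} q_i \log\frac{2q_i}{p_i + q_i}$$
Lemma
The square root of the total Jensen-Shannon divergence is not a metric.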
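The relation between these quantities can be sketched in a few lines. The code below assumes the negative Shannon entropy F(x) = sum_i x_i log x_i as generator, for which the unscaled skew Jensen divergence at alpha = 1/2 coincides with JS, and it assumes that the total Jensen-Shannon divergence is obtained by multiplying JS by rho_J. Function names and sample distributions are illustrative.

```python
import math

def neg_entropy(x):
    """Assumed generator: F(x) = sum_i x_i log x_i (negative Shannon entropy)."""
    return sum(xi * math.log(xi) for xi in x)

def jensen_shannon(p, q):
    """JS(p, q) as written on the slide."""
    left = sum(a * math.log(2.0 * a / (a + b)) for a, b in zip(p, q))
    right = sum(b * math.log(2.0 * b / (a + b)) for a, b in zip(p, q))
    return 0.5 * left + 0.5 * right

def skew_jensen(p, q, alpha, F):
    """Unscaled skew Jensen divergence: alpha F(p) + (1 - alpha) F(q) - F((pq)_alpha)."""
    mix = [alpha * a + (1.0 - alpha) * b for a, b in zip(p, q)]
    return alpha * F(p) + (1.0 - alpha) * F(q) - F(mix)

def rho_J(p, q, F):
    """Conformal factor 1 / sqrt(1 + (F(p) - F(q))^2 / <p - q, p - q>), for p != q."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return 1.0 / math.sqrt(1.0 + (F(p) - F(q)) ** 2 / d2)

def total_jensen_shannon(p, q):
    """rho_J(p, q) * JS(p, q), under the assumptions stated above."""
    return rho_J(p, q, neg_entropy) * jensen_shannon(p, q)

p, q = [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]
print(jensen_shannon(p, q), skew_jensen(p, q, 0.5, neg_entropy))
print(rho_J(p, q, neg_entropy), total_jensen_shannon(p, q))
```

Note that rho_J is symmetric in its two arguments and never exceeds 1, so the total divergence is a down-weighted version of the ordinary one.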
6. Total Jensen divergence: Illustration
[Figure: construction of the total Jensen divergence tJ'_α(p : q) from J'_α(p : q) on the graph of F, using the points (pq)_α, F((pq)_α), (F(p)F(q))_α and (F(p)F(q))_β, shown for (p, q) and for (p′, q′), each with origin O.]
7. Total Jensen divergence: Illustration
α on graph plot, β on interpolated segment
Two kinds of total Jensen divergences (but one always yields
closed-form)
[Figure: two panels between p and q, each marking F((pq)_α) and (F(p)F(q))_β for the three cases β ∈ [0, 1], β > 1 and β < 0.]
8. Total Jensen divergences/Total Bregman divergences
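Total Jensen is not a generalization of total Bregman.
In the limit cases α ∈ {0, 1}, we have:
$$\lim_{\alpha\to 0} tJ_\alpha(p : q) = \rho_J(p, q)\, B(p : q) \neq \rho_B(q)\, B(p : q),$$
$$\lim_{\alpha\to 1} tJ_\alpha(p : q) = \rho_J(p, q)\, B(q : p) \neq \rho_B(p)\, B(q : p),$$
since $\rho_J(p, q) \neq \rho_B(q)$.
Squared chord slope index in ρ_J, with Δ = p − q and Δ_F = F(p) − F(q):
$$s^2 = \frac{\Delta_F^2}{\|\Delta\|^2} = \frac{\Delta^\top\nabla F(\epsilon)\, \Delta^\top\nabla F(\epsilon)}{\Delta^\top\Delta} = \langle\nabla F(\epsilon), \nabla F(\epsilon)\rangle = \|\nabla F(\epsilon)\|^2.$$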
9. Conformal factor from mean value theorem
When p ≃ q, ρJ (p, q) ≃ ρB (q), and the total Jensen divergence
tends to the total Bregman divergence for any value of α.
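$$\rho_J(p, q) = \frac{1}{\sqrt{1 + \langle\nabla F(\epsilon), \nabla F(\epsilon)\rangle}} = \rho_B(\epsilon), \quad \text{for } \epsilon \in [p, q].$$
For univariate generators, the value of ε is explicit:
$$\epsilon = \nabla F^{-1}\!\left(\frac{\Delta_F}{\Delta}\right) = \nabla F^{*}\!\left(\frac{\Delta_F}{\Delta}\right),$$
where F* is the Legendre convex conjugate [5].
Stolarsky mean [7]:
$$tJ_\alpha(p : q) = \rho_B(\epsilon)\, J_\alpha(p : q)$$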
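For a concrete univariate case, the sketch below takes the generator f(x) = x log x - x (an illustrative choice), for which f'(x) = log x and the inverse of f' is exp. The mean-value point is then eps = exp(Delta_F / Delta), which matches the identric mean, a classical member of the Stolarsky family. Variable names and the sample values of p and q are assumptions.

```python
import math

def f(x):
    """Example univariate generator: f(x) = x log x - x, so f'(x) = log x."""
    return x * math.log(x) - x

def mean_value_point(p, q):
    """eps = (f')^{-1}(Delta_F / Delta) = exp((f(p) - f(q)) / (p - q))."""
    return math.exp((f(p) - f(q)) / (p - q))

def rho_J(p, q):
    """Conformal factor built from the chord slope Delta_F / Delta."""
    slope = (f(p) - f(q)) / (p - q)
    return 1.0 / math.sqrt(1.0 + slope * slope)

def rho_B(x):
    """Conformal factor built from the derivative f'(x) = log x."""
    return 1.0 / math.sqrt(1.0 + math.log(x) ** 2)

p, q = 1.0, 4.0
eps = mean_value_point(p, q)
# Identric mean of p and q: (1/e) * (q^q / p^p)^(1 / (q - p)).
identric = (q ** q / p ** p) ** (1.0 / (q - p)) / math.e
print(eps, identric)
print(rho_J(p, q), rho_B(eps))
```

The point eps lies strictly between p and q, and the two conformal factors agree, as the mean value theorem predicts.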
10. Centroids and statistical robustness
Centroids (barycenters) are minimizers of average (weighted)
divergences:
$$L(x; w) = \sum_{i=1}^{n} w_i\, tJ_\alpha(p_i : x), \qquad c_\alpha = \arg\min_{x \in X} L(x; w).$$
◮ Is it unique?
◮ Is it robust to outliers [3]?
Iterative convex-concave procedure (CCCP) [5]
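To make the objective L(x; w) concrete, here is a brute-force sketch for univariate data. It is not the CCCP iteration mentioned on the slide: it simply evaluates the loss on a grid and keeps the best point. The Burg generator, the restriction of the search to [min p_i, max p_i], the grid size, and all names are assumptions made for illustration.

```python
import math

def burg(x):
    """Example univariate generator (an assumption): F(x) = -log x on x > 0."""
    return -math.log(x)

def total_jensen(p, q, alpha, F):
    """Univariate tJ_alpha(p : q) = rho_J(p, q) * J_alpha(p : q); zero when p == q."""
    if p == q:
        return 0.0
    mix = alpha * p + (1.0 - alpha) * q
    jensen = (alpha * F(p) + (1.0 - alpha) * F(q) - F(mix)) / (alpha * (1.0 - alpha))
    slope = (F(p) - F(q)) / (p - q)
    return jensen / math.sqrt(1.0 + slope * slope)

def loss(x, points, weights, alpha, F):
    """L(x; w) = sum_i w_i * tJ_alpha(p_i : x)."""
    return sum(w * total_jensen(p, x, alpha, F) for p, w in zip(points, weights))

def grid_centroid(points, weights, alpha, F, steps=2000):
    """Grid search for the minimizer of L over [min(points), max(points)] (assumed range)."""
    lo, hi = min(points), max(points)
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda x: loss(x, points, weights, alpha, F))

points = [1.0, 2.0, 4.0]
weights = [1.0 / 3.0] * 3
c = grid_centroid(points, weights, 0.5, burg)
print(c, loss(c, points, weights, 0.5, burg))
```

A grid search says nothing about uniqueness, which is exactly the question the slide raises; it only gives a reference value against which an iterative scheme could be compared.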
11. Robustness of Jensen centroids (univariate generator)
Theorem
The Jensen centroid is robust for a strictly convex and smooth generator f if $|f'(\frac{p+y}{2})|$ is bounded on the domain X for any prescribed p.
◮ Jensen-Shannon: $X = \mathbb{R}_+$, $f(x) = x\log x - x$, $f'(x) = \log x$, $f''(x) = 1/x$.
$|f'(\frac{p+y}{2})| = |\log\frac{p+y}{2}|$ is unbounded when $y \to +\infty$.
The JS centroid is not robust.
◮ Jensen-Burg: $X = \mathbb{R}_+$, $f(x) = -\log x$, $f'(x) = -1/x$, $f''(x) = 1/x^2$.
$|f'(\frac{p+y}{2})| = |\frac{2}{p+y}|$ is always bounded for $y \in (0, +\infty)$.
$$z(y) = 2p^2\left(\frac{1}{p} - \frac{2}{p+y}\right)$$
When $y \to \infty$, we have $|z(y)| \to 2p < \infty$.
The JB centroid is robust.
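The contrast between the two generators is easy to see numerically. The sketch below evaluates |f'((p + y) / 2)| for both cases and the quantity z(y) from the slide as the outlier y grows; the function names, the value of p, and the list of outlier positions are illustrative assumptions.

```python
import math

def shannon_slope(p, y):
    """|f'((p + y) / 2)| for f(x) = x log x - x, i.e. |log((p + y) / 2)|."""
    return abs(math.log((p + y) / 2.0))

def burg_slope(p, y):
    """|f'((p + y) / 2)| for f(x) = -log x, i.e. 2 / (p + y)."""
    return abs(-2.0 / (p + y))

def z_burg(p, y):
    """z(y) = 2 p^2 (1/p - 2/(p + y)), as given on the slide for Jensen-Burg."""
    return 2.0 * p * p * (1.0 / p - 2.0 / (p + y))

p = 1.5
outliers = [1e2, 1e4, 1e8]
for y in outliers:
    # The Shannon column keeps growing; the Burg columns stay bounded.
    print(y, shannon_slope(p, y), burg_slope(p, y), z_burg(p, y))
```

As y increases, the Jensen-Shannon term grows like log y while z(y) for Jensen-Burg levels off near 2p.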
12. Clustering: No closed-form centroid, no cry!
k-means++ [1] picks the seeds at random among the data points, so no centroid computation is needed.
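A minimal sketch of such a seeding for univariate data follows. It assumes that each new seed is drawn with probability proportional to the divergence from a point to its closest already-chosen seed (the divergence taking the place of the squared distance in the original k-means++), and that the divergence is evaluated as tJ_alpha(x : c). The Burg generator, the function names, and the sample data are illustrative assumptions.

```python
import math
import random

def burg(x):
    """Example univariate generator (an assumption): F(x) = -log x on x > 0."""
    return -math.log(x)

def total_jensen(p, q, alpha=0.5, F=burg):
    """Univariate tJ_alpha(p : q) = rho_J(p, q) * J_alpha(p : q); zero when p == q."""
    if p == q:
        return 0.0
    mix = alpha * p + (1.0 - alpha) * q
    jensen = (alpha * F(p) + (1.0 - alpha) * F(q) - F(mix)) / (alpha * (1.0 - alpha))
    slope = (F(p) - F(q)) / (p - q)
    return jensen / math.sqrt(1.0 + slope * slope)

def seed_centers(points, k, divergence, rng):
    """Pick k seeds among the points, k-means++ style, without computing any centroid."""
    centers = [rng.choice(points)]  # first seed: uniform draw
    while len(centers) < k:
        # Cost of each point: divergence to its closest already-chosen seed.
        costs = [min(divergence(x, c) for c in centers) for x in points]
        # Next seed: drawn with probability proportional to that cost.
        centers.append(rng.choices(points, weights=costs, k=1)[0])
    return centers

points = [0.5, 0.6, 0.7, 5.0, 5.5, 6.0, 20.0, 21.0]
seeds = seed_centers(points, 3, total_jensen, random.Random(0))
print(seeds)
```

Already-chosen seeds have zero cost and are therefore never drawn again, so the sketch needs at least k distinct points.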
13. Divergence-based k-means++
Theorem
Suppose there exist some U and V such that, ∀x, y, z:
$$tJ_\alpha(x : z) \le U\,(tJ_\alpha(x : y) + tJ_\alpha(y : z)) \quad \text{(triangular inequality)}$$
$$tJ_\alpha(x : z) \le V\, tJ_\alpha(z : x) \quad \text{(symmetric inequality)}$$
Then the average potential of total Jensen seeding with k clusters satisfies
$$E[tJ_\alpha] \le 2U^2(1 + V)(2 + \log k)\, tJ_{\mathrm{opt},\alpha},$$
where $tJ_{\mathrm{opt},\alpha}$ is the minimal total Jensen potential achieved by a clustering in k clusters.
14. Divergence-based k-means++: Two assumptions H
H:
◮ First, the maximal condition number of the Hessian of F, that is, the ratio between the maximal and minimal eigenvalues (> 0) of the Hessian of F, is upper bounded by $K_1$.
◮ Second, we assume the Lipschitz condition on F that $\Delta_F^2 / \langle\Delta, \Delta\rangle \le K_2$, for some $K_2 > 0$.
Lemma
Assume 0 < α < 1. Then, under assumption H, for any p, q, r ∈ S, there exists ε > 0 such that:
$$tJ_\alpha(p : r) \le \frac{2(1 + K_2)K_1^2}{\epsilon}\left(\frac{1}{1-\alpha}\, tJ_\alpha(p : q) + \frac{1}{\alpha}\, tJ_\alpha(q : r)\right).$$
15. Divergence-based k-means++
Corollary
The total skew Jensen divergence satisfies the following triangular inequality:
$$tJ_\alpha(p : r) \le \frac{2(1 + K_2)K_1^2}{\epsilon\alpha(1-\alpha)}\,(tJ_\alpha(p : q) + tJ_\alpha(q : r)).$$
$$U = \frac{2(1 + K_2)K_1^2}{\epsilon}$$
Lemma
The symmetric inequality condition holds for $V = K_1^2(1 + K_2)/\epsilon$, for some 0 < ε < 1.
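The triangular inequality of the corollary on this slide can be recovered in one step from the lemma stated under assumption H. Writing C for the constant factor of that lemma (a shorthand introduced here only for the derivation), and using 0 < α < 1 together with the non-negativity of the divergences:

```latex
\begin{align*}
tJ_\alpha(p : r)
  &\le C\left(\frac{1}{1-\alpha}\, tJ_\alpha(p : q) + \frac{1}{\alpha}\, tJ_\alpha(q : r)\right) \\
  &\le \frac{C}{\alpha(1-\alpha)}\left(tJ_\alpha(p : q) + tJ_\alpha(q : r)\right),
\end{align*}
% because 1/(1-\alpha) \le 1/(\alpha(1-\alpha)) and 1/\alpha \le 1/(\alpha(1-\alpha)) whenever 0 < \alpha < 1.
```

The bound degrades as α approaches 0 or 1, where the factor 1/(α(1 − α)) blows up.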
16. Total Jensen divergences: Recap
Total Jensen divergence = conformal divergence with
non-separable double-sided conformal factor.
◮ Invariant to axis rotation of the "design space"
◮ Equivalent to total Bregman divergences [8, 4] only when p ≃ q
◮ Square root of total Jensen-Shannon divergence is not a metric (square root of JS is a metric).
◮ Jensen centroids are not always robust (e.g., Jensen-Shannon centroid)
◮ Total Jensen k-means++ does not require centroid computations and has a guaranteed approximation
Interest of conformal divergences in SVM [9] (double-sided separable), in information geometry [6] (flattening).
17. Thank you.
@article{totalJensen-arXiv1309.7109 ,
author="Frank Nielsen and Richard Nock",
title="Total {J}ensen divergences: {D}efinition, Properties and $k$-Means++ Clustering",
year="2013",
eprint="arXiv/1309.7109"
}
www.informationgeometry.org
18. Bibliographic references I
[1] David Arthur and Sergei Vassilvitskii.
k-means++: the advantages of careful seeding.
In Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[2] Bent Fuglede and Flemming Topsoe.
Jensen-Shannon divergence and Hilbert space embedding.
In IEEE International Symposium on Information Theory, pages 31–31, 2004.
[3] F. R. Hampel, P. J. Rousseeuw, E. Ronchetti, and W. A. Stahel.
Robust Statistics: The Approach Based on Influence Functions.
Wiley Series in Probability and Mathematical Statistics, 1986.
[4] Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.
Shape retrieval using hierarchical total Bregman soft clustering.
Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.
[5] Frank Nielsen and Sylvain Boltz.
The Burbea-Rao and Bhattacharyya centroids.
IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.
[6] Atsumi Ohara, Hiroshi Matsuzoe, and Shun-ichi Amari.
A dually flat structure on the space of escort distributions.
Journal of Physics: Conference Series, 201(1):012012, 2010.
19. Bibliographic references II
[7] Kenneth B. Stolarsky.
Generalizations of the logarithmic mean.
Mathematics Magazine, 48(2):87–92, 1975.
[8] Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen.
Total Bregman divergence and its applications to DTI analysis.
IEEE Transactions on Medical Imaging, pages 475–483, 2011.
[9] Si Wu and Shun-ichi Amari.
Conformal transformation of kernel functions: a data dependent way to improve support vector machine classifiers.
Neural Processing Letters, 15(1):59–67, 2002.