This document summarizes Frank Nielsen's talk on divergence-based center clustering and their applications. Some key points:
- Center-based clustering aims to minimize an objective function that assigns data points to their closest cluster centers; the problem is NP-hard as soon as the dimension and the number of clusters both exceed 1.
- Mixed divergences use dual (left- and right-sided) centroids per cluster to define cluster assignments. Total Jensen divergences are proposed as a way to make divergences more robust by incorporating a conformal factor.
- For clustering when centroids do not have closed-form solutions, initialization methods like k-means++ can be used, which randomly select initial seeds without computing centroids. Total Jensen k-means++ seeding keeps a guaranteed approximation bound without any centroid computation.
Divergence clustering
1. Divergence-based center clustering and their applications
Frank Nielsen
joint work with Richard Nock et al.
École Polytechnique
Sony Computer Science Laboratories, Inc.
Talk at Geneva University (Dec. 2015)
Viper group, CS Dept.
2. Background: Center-based clustering [16]
Countless applications of clustering: quantization (coding), finding categories (unsupervised clustering), speeding up computations (e.g., distance computations), and so on.
Minimize the objective/energy/loss function:

E(X = {x_1, ..., x_n}; C = {c_1, ..., c_k}) = min_C Σ_{i=1}^n min_{j∈[k]} D(x_i : c_j)
NP-hard as soon as d > 1 and k > 1.
Dissimilarity D(· : ·): a metric distance vs. a C² divergence (non-metric, e.g., the squared Euclidean distance).
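To make the objective concrete, here is a minimal NumPy sketch (not from the talk) that evaluates E(X; C) for the squared Euclidean divergence:

import numpy as np

def kmeans_cost(X, C):
    # E(X; C) = sum_i min_j D(x_i : c_j), with D the squared Euclidean divergence.
    # X: (n, d) array of data points, C: (k, d) array of cluster centers.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k) pairwise costs
    return d2.min(axis=1).sum()

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
C = np.array([[0.0, 0.0], [5.0, 5.0]])
print(kmeans_cost(X, C))  # 0.01: each point is charged to its closest center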
3. Background: Center-based clustering [16]
Initialization of the k cluster centers (seeds), also known as global k-means heuristics: random (Forgy), global k-means (discrete k-means), randomized k-means++ (expected Õ(log k) approximation guarantee).
Celebrated local k-means heuristics: Lloyd’s batched
allocation (assignment/center relocation), Hartigan’s single
point reassignment.
⇒ Aim at guaranteeing monotone convergence
Continuous vs. discrete k-means, k-center, etc.
Variational k-means: when the centroid arg min_c Σ_{i=1}^n w_i D(x_i : c) is not in closed form, the center relocation just needs to be better (not best) to still guarantee monotone convergence.
4. The trick of mixed divergences [18, 16]: dual centroids per cluster
5. Mixed divergences [16]
Defined on three parameters p, q and r:

M_λ(p : q : r) ≐ λ D(p : q) + (1 − λ) D(q : r),   for λ ∈ [0, 1].

Mixed divergences include:
- the sided divergences for λ ∈ {0, 1},
- the symmetrized (arithmetic mean) divergence for λ = 1/2,
- the skew symmetrized divergences for λ ∈ (0, 1), λ ≠ 1/2.
6. Symmetrizing α-divergences
S_α(p, q) = (1/2) (D_α(p : q) + D_α(q : p)) = S_{−α}(p, q) = M_{1/2}(p : q : p)

For α = ±1, we get half of the Jeffreys divergence:

S_{±1}(p, q) = (1/2) Σ_{i=1}^d (p_i − q_i) log(p_i / q_i)

Same formula for probability and positive measures.
Centroids for the symmetrized α-divergence are usually not in closed form.
How to perform finite center-based clustering without closed-form centroids (beyond the variational approach)?
7. Closed-form formula for Jeffreys positive centroid [8]
Jeffreys divergence (symmetrized KL) = the symmetrized α-divergence for α = ±1.
The Jeffreys positive centroid c = (c^1, ..., c^d) of a set {h_1, ..., h_n} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:

c^i = a^i / W((a^i / g^i) e)

where a^i = Σ_{j=1}^n π_j h_j^i denotes the coordinate-wise arithmetic weighted mean and g^i = Π_{j=1}^n (h_j^i)^{π_j} the coordinate-wise geometric weighted mean.
The Lambert analytic function W (positive branch) is defined by W(x) e^{W(x)} = x for x ≥ 0.
⇒ Finite Jeffreys k-means clustering. But for α ≠ ±1, how to cluster in a finite number of iterations?
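As an illustration, a small SciPy sketch of this closed-form centroid (assuming the weights π_j sum to one; scipy.special.lambertw evaluates the branch W_0 used here):

import numpy as np
from scipy.special import lambertw

def jeffreys_positive_centroid(H, p):
    # H: (n, d) positive histograms, p: (n,) normalized histogram weights.
    a = np.average(H, axis=0, weights=p)                   # arithmetic weighted mean a^i
    g = np.exp(np.average(np.log(H), axis=0, weights=p))   # geometric weighted mean g^i
    # c^i = a^i / W((a^i / g^i) e), coordinate-wise
    return a / np.real(lambertw((a / g) * np.e))

H = np.array([[1.0, 2.0, 3.0], [2.0, 1.0, 3.0]])
print(jeffreys_positive_centroid(H, np.array([0.5, 0.5])))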
8. Mixed α-divergences/α-Jeffreys symmetrized divergence
Mixed α-divergence of a histogram x to two histograms p and q:

M_{λ,α}(p : x : q) = λ D_α(p : x) + (1 − λ) D_α(x : q)
                   = λ D_{−α}(x : p) + (1 − λ) D_{−α}(q : x)
                   = M_{1−λ,−α}(q : x : p)

The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:

S_α(p, q) = M_{1/2,α}(q : p : q) = M_{1/2,α}(p : q : p)

The skew symmetrized α-divergence is defined by:

S_{λ,α}(p : q) = λ D_α(p : q) + (1 − λ) D_α(q : p)
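For intuition, here is a hedged NumPy sketch of these mixed α-divergences on frequency histograms, using the α-divergence generator listed on the f-divergence slide further below (the restriction α ≠ ±1 and this normalization are assumptions of the sketch):

import numpy as np

def alpha_div(p, q, alpha):
    # D_alpha(p : q) = 4/(1 - alpha^2) (1 - sum_i p_i^{(1-alpha)/2} q_i^{(1+alpha)/2}),
    # for frequency histograms and alpha != +/-1 (KL is recovered in the limit).
    return 4.0 / (1.0 - alpha**2) * (1.0 - np.sum(p ** ((1 - alpha) / 2) * q ** ((1 + alpha) / 2)))

def mixed_alpha_div(p, x, q, lam, alpha):
    # M_{lambda,alpha}(p : x : q) = lambda D_alpha(p : x) + (1 - lambda) D_alpha(x : q)
    return lam * alpha_div(p, x, alpha) + (1 - lam) * alpha_div(x, q, alpha)

p = np.array([0.4, 0.6]); x = np.array([0.35, 0.65]); q = np.array([0.3, 0.7])
# Check the reference identity M_{lambda,alpha}(p : x : q) = M_{1-lambda,-alpha}(q : x : p):
print(mixed_alpha_div(p, x, q, 0.3, 0.5), mixed_alpha_div(q, x, p, 0.7, -0.5))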
9. Mixed divergence-based k-means clustering
Initially, pick k distinct seeds from the dataset, with l_i = r_i.
Input: weighted histogram set H, divergence D(·, ·), integer k > 0, real λ ∈ [0, 1];
Initialize left-sided/right-sided seeds C = {(l_i, r_i)}_{i=1}^k;
repeat
  // Assignment (as usual)
  for i = 1, 2, ..., k do
    C_i ← {h ∈ H : i = arg min_j M_λ(l_j : h : r_j)};
  end
  // Dual-sided centroid relocation (the trick!)
  for i = 1, 2, ..., k do
    r_i ← arg min_x D(C_i : x) = arg min_x Σ_{h∈C_i} w_h D(h : x);
    l_i ← arg min_x D(x : C_i) = arg min_x Σ_{h∈C_i} w_h D(x : h);
  end
until convergence;
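For the extended Kullback-Leibler divergence (a Bregman divergence), both sided centroids are known in closed form: the right-sided centroid is the weighted arithmetic mean and the left-sided centroid is the weighted geometric mean. A minimal sketch of this dual relocation step:

import numpy as np

def dual_kl_centroids(H, w):
    # Sided centroids for the extended KL divergence KL(p : q) = sum p log(p/q) - p + q.
    # r = argmin_x sum_h w_h KL(h : x)  ->  weighted arithmetic mean
    # l = argmin_x sum_h w_h KL(x : h)  ->  weighted geometric mean
    r = np.average(H, axis=0, weights=w)
    l = np.exp(np.average(np.log(H), axis=0, weights=w))
    return l, r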
10. Example: Mixed α-hard clustering: MAhC(H, k, λ, α)
Input: weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ ℝ;
Let C = {(l_i, r_i)}_{i=1}^k ← MAS(H, k, λ, α);
repeat
  // Assignment
  for i = 1, 2, ..., k do
    A_i ← {h ∈ H : i = arg min_j M_{λ,α}(l_j : h : r_j)};
  end
  // Centroid relocation (coordinate-wise α-means)
  for i = 1, 2, ..., k do
    r_i ← (Σ_{h∈A_i} w_h h^{(1−α)/2})^{2/(1−α)};
    l_i ← (Σ_{h∈A_i} w_h h^{(1+α)/2})^{2/(1+α)};
  end
until convergence;
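A small sketch of this coordinate-wise relocation (assuming cluster weights renormalized to sum to one, and α ≠ ±1); note that the left centroid is the same α-mean evaluated at −α:

import numpy as np

def alpha_mean(H, w, alpha):
    # Coordinate-wise alpha-mean: ( sum_h w_h h^{(1-alpha)/2} )^{2/(1-alpha)}.
    # alpha = -1 recovers the weighted arithmetic mean.
    w = w / w.sum()
    return (w @ H ** ((1 - alpha) / 2)) ** (2 / (1 - alpha))

# Within a cluster A_i holding histograms Hc with weights wc:
#   r_i = alpha_mean(Hc, wc, alpha)
#   l_i = alpha_mean(Hc, wc, -alpha)   # since (1+alpha)/2 = (1-(-alpha))/2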
11. Coupled k-Means++ α-Seeding (extending k-means++)
Algorithm 1: Mixed α-seeding; MAS(H, k, λ, α)
Input: weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ ℝ;
Let C ← {(h, h)} for a histogram h ∈ H picked with uniform probability;
for i = 2, 3, ..., k do
  Pick at random a histogram h ∈ H with probability

    π_H(h) ≐ w_h M_{λ,α}(c_h : h : c̄_h) / Σ_{y∈H} w_y M_{λ,α}(c_y : y : c̄_y),   (1)

  where (c_h, c̄_h) ≐ arg min_{(z,z̄)∈C} M_{λ,α}(z : h : z̄);
  C ← C ∪ {(h, h)};
end
Output: set of initial cluster centers C;
→ Guaranteed probabilistic bound. Just need to initialize! No centroid computations: further iterations are not theoretically required.
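A minimal NumPy sketch of this seeding (the function and parameter names are illustrative; the mixed divergence is passed as a callback M(l, h, r)):

import numpy as np

def mixed_seeding(H, w, k, mixed_div, rng=np.random.default_rng()):
    # k-means++-style seeding: each center is a pair (l, r), initialized as (h, h).
    n = len(H)
    idx = [int(rng.integers(n))]
    for _ in range(1, k):
        # Cost of h: w_h times the mixed divergence to its closest current center pair.
        cost = np.array([w[j] * min(mixed_div(H[c], H[j], H[c]) for c in idx)
                         for j in range(n)])
        idx.append(int(rng.choice(n, p=cost / cost.sum())))
    return [(H[i], H[i]) for i in idx]

# e.g., with the mixed alpha-divergence sketched earlier:
# seeds = mixed_seeding(H, w, 3, lambda l, h, r: mixed_alpha_div(l, h, r, 0.5, 0.0))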
12. Learning statistical mixtures with hard EM
k-GMLE [7]: fast, guaranteed finite convergence, low memory footprint... but not consistent (biased).
13. Learning MMs: A geometric hard clustering viewpoint
Learn the parameters of a mixture m(x) = Σ_{i=1}^k w_i p(x|θ_i).
Maximize the complete data likelihood = a clustering objective function:

max_{W,Λ} l_c(W, Λ) = Σ_{i=1}^n Σ_{j=1}^k z_{i,j} log(w_j p(x_i|θ_j))
                    = max_{W,Λ} Σ_{i=1}^n max_{j∈[k]} log(w_j p(x_i|θ_j))
                    ≡ min_{W,Λ} Σ_{i=1}^n min_{j∈[k]} D_j(x_i),

where c_j = (w_j, θ_j) (cluster prototype) and D_j(x_i) = −log p(x_i|θ_j) − log w_j are potential distance-like functions.
⇒ one can further attach to each cluster (mixture component) a different family of probability distributions.
14. Generalized k-MLE: learning statistical EF mixtures [7, 23, 22, 21, 9]
Model-based clustering, with assignment of points to clusters driven by:

D_{w_j,θ_j,F_j}(x) = −log p_{F_j}(x; θ_j) − log w_j

1. Initialize the weights W ∈ Δ_k and the family type (F_1, ..., F_k) of each cluster.
2. Solve min_Λ Σ_i min_j D_j(x_i) (center-based clustering for W fixed) with the potential functions D_j(x_i) = −log p_{F_j}(x_i|θ_j) − log w_j.
3. Solve the family types maximizing the MLE in each cluster C_j by choosing the parametric family of distributions F_j = F(γ_j) that yields the best likelihood: min_{F_1=F(γ_1),...,F_k=F(γ_k)∈F(γ)} Σ_i min_j D_{w_j,θ_j,F_j}(x_i), with ∀l, γ_l = arg max_j F_j^*(η̂_l) + (1/n_l) Σ_{x∈C_l} k_j(x), where η̂_l = (1/n_l) Σ_{x∈C_l} t_j(x).
4. Update the weights W as the cluster point proportions.
5. Test for convergence, otherwise go to step 2.
Drawback: a biased, non-consistent estimator due to the Voronoi support truncation.
18. Total Bregman divergences
Conformal divergence, with conformal factor ρ:

D_ρ(p : q) = ρ(p, q) D(p : q)

ρ plays the rôle of a "regularizer" [24] and ensures robustness.
Invariance under rotations of the axes of the design space:

tB(p : q) = B(p : q) / √(1 + ⟨∇F(q), ∇F(q)⟩) = ρ_B(q) B(p : q),   ρ_B(q) = 1 / √(1 + ⟨∇F(q), ∇F(q)⟩).

Total squared Euclidean divergence:

tE(p, q) = (1/2) ⟨p − q, p − q⟩ / √(1 + ⟨q, q⟩).
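In code, the total squared Euclidean instance is a one-liner (a sketch; for F(x) = (1/2)⟨x, x⟩ we have ∇F(q) = q, hence ρ_B(q) = 1/√(1 + ⟨q, q⟩)):

import numpy as np

def total_squared_euclidean(p, q):
    # tE(p, q) = (1/2) <p - q, p - q> / sqrt(1 + <q, q>)
    return 0.5 * np.dot(p - q, p - q) / np.sqrt(1.0 + np.dot(q, q))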
19. Total Jensen divergence: Illustration of the principle
(Figure: geometric construction of the skew Jensen divergence J_α(p : q) and the total Jensen divergence tJ_α(p : q) on the graph of F, shown for points (p, q) and their rotated images (p', q').)
20. Total Jensen divergences [15]
tB(p : q) = ρ_B(q) B(p : q),   ρ_B(q) = 1 / √(1 + ⟨∇F(q), ∇F(q)⟩)

tJ_α(p : q) = ρ_J(p, q) J_α(p : q),   ρ_J(p, q) = 1 / √(1 + (F(p) − F(q))² / ⟨p − q, p − q⟩)

The Jensen-Shannon divergence, whose square root is a metric [3]:

JS(p, q) = (1/2) Σ_{i=1}^d p_i log(2p_i / (p_i + q_i)) + (1/2) Σ_{i=1}^d q_i log(2q_i / (p_i + q_i))

Lemma: The square root of the total Jensen-Shannon divergence is not a metric.
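A sketch of tJ_α under the scaled skew-Jensen convention J_α(p : q) = (αF(p) + (1 − α)F(q) − F(αp + (1 − α)q)) / (α(1 − α)), for α ∈ (0, 1); this normalization, which recovers the sided Bregman divergences in the limits α → 0, 1, is an assumption of the sketch:

import numpy as np

def total_jensen(p, q, F, alpha):
    # Skew Jensen divergence J_alpha, then the conformal factor
    # rho_J(p, q) = 1 / sqrt(1 + (F(p) - F(q))^2 / <p - q, p - q>).
    J = (alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)) / (alpha * (1 - alpha))
    rho = 1.0 / np.sqrt(1.0 + (F(p) - F(q)) ** 2 / np.dot(p - q, p - q))
    return rho * J

# Shannon negentropy generator F(x) = sum x log x:
negent = lambda x: np.sum(x * np.log(x))
print(total_jensen(np.array([0.4, 0.6]), np.array([0.3, 0.7]), negent, 0.5))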
21. Total Jensen divergences/Total Bregman divergences
Total Jensen is not a generalization of total Bregman: in the limit cases α ∈ {0, 1}, we have

lim_{α→0} tJ_α(p : q) = ρ_J(p, q) B(p : q) ≠ ρ_B(q) B(p : q),
lim_{α→1} tJ_α(p : q) = ρ_J(p, q) B(q : p) ≠ ρ_B(p) B(q : p),

since the conformal factors differ: ρ_J(p, q) ≠ ρ_B(q).
22. Conformal factor from mean value theorem
When p ≃ q, ρ_J(p, q) ≃ ρ_B(q), and the total Jensen divergence tends to the total Bregman divergence for any value of α:

ρ_J(p, q) = 1 / √(1 + ⟨∇F(ξ), ∇F(ξ)⟩) = ρ_B(ξ),   for some ξ ∈ [p, q].

For univariate generators, the value of ξ is explicit:

ξ = ∇F^{−1}(ΔF/Δ) = ∇F^*(ΔF/Δ),

where ΔF/Δ = (F(p) − F(q))/(p − q) is the chord slope and F^* is the Legendre convex conjugate [10].
23. Centroids and statistical robustness
Centroids (barycenters) are minimizers of average (weighted) divergences:

L(x; w) = Σ_{i=1}^n w_i tJ_α(p_i : x),   c_α = arg min_{x∈X} L(x; w).

- Is it unique?
- Is it robust to outliers [4]?
- Computed by an iterative convex-concave procedure (CCCP, a gradient method without a learning rate) [10].
24. Clustering: No closed-form centroid, no cry!
k-means++ [1] picks seeds at random: no centroid calculation needed.
Algorithm 2: Total Jensen k-means++ seeding
Input: number of clusters k ≥ 1;
Let C ← {h_j} for a histogram h_j picked with uniform probability;
for i = 2, 3, ..., k do
  Pick at random h ∈ H with probability π_H(h) = tJ_α(c_h : h) / Σ_{y∈H} tJ_α(c_y : y), where c_h = arg min_{z∈C} tJ_α(z : h);
  C ← C ∪ {h};
end
Output: set of initial cluster centers C;
25. Total Jensen divergences: Recap
- Total Jensen divergence = a conformal divergence with a non-separable, double-sided conformal factor.
- Invariant to rotations of the axes of the "design space".
- Equivalent to total Bregman divergences [24, 5] only when p ≃ q.
- The square root of the Jensen-Shannon divergence is a metric, but the square root of the total Jensen-Shannon divergence is not.
- Total Jensen k-means++ does not require centroid computations and has a guaranteed approximation bound.
- Conformal divergences are also of interest in SVMs [25] (double-sided separable factors) and in information geometry [20] (flattening).
26. Novel heuristics for NP-hard center-based clustering: merge-and-split and (k, l)-means [14]
27. The k-means merge-and-split heuristic
Generalizes Hartigan's single-point relocation heuristic...
Consider pairs of clusters (C_i, C_j) with centers c_i and c_j; merge them, then split the merged cluster again into two clusters with new centers c'_i and c'_j. Accept when the sum of the two cluster variances decreases:

Δ(C_i, C_j) = V(C_i, c_i) + V(C_j, c_j) − (V(C'_i, c'_i) + V(C'_j, c'_j)) > 0

How to split the merged cluster again (the best split is NP-hard)? Two options (see the sketch below):
- Discrete 2-means: choose among the n_{i,j} = n_i + n_j points of C_{i,j} the two best centers (naively implemented in O(n_{i,j}³)). This yields a 2-approximation of 2-means.
- 2-means++ heuristic: pick c'_i at random, then pick c'_j randomly according to the normalized distribution of the squared distances of the points of C_{i,j} to c'_i (see k-means++). Repeat this initialization for a given number α of rounds (say, α = 1 + 0.01 (n_{i,j} choose 2)) and keep the best one.
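A hedged sketch of one merge-and-split trial with the 2-means++ re-split (helper names are illustrative; the re-split keeps the best of a few seedings and does not run Lloyd iterations):

import numpy as np

def variance(C, c):
    # Cluster variance V(C, c): sum of squared distances to the center c.
    return ((C - c) ** 2).sum()

def split_2means_pp(P, rounds, rng):
    # Pick c1 uniformly, then c2 proportionally to squared distances to c1; keep the best trial.
    best = None
    for _ in range(rounds):
        c1 = P[rng.integers(len(P))]
        d2 = ((P - c1) ** 2).sum(axis=1)
        c2 = P[rng.choice(len(P), p=d2 / d2.sum())]
        near1 = d2 <= ((P - c2) ** 2).sum(axis=1)
        cost = variance(P[near1], c1) + variance(P[~near1], c2)
        if best is None or cost < best[0]:
            best = (cost, c1, c2)
    return best

def try_merge_and_split(Ci, ci, Cj, cj, rounds=3, rng=np.random.default_rng()):
    # Accept the re-split of the merged cluster only if the sum of variances decreases.
    cost, c1, c2 = split_2means_pp(np.vstack([Ci, Cj]), rounds, rng)
    return (c1, c2) if cost < variance(Ci, ci) + variance(Cj, cj) else None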
28. The k-means merge-and-split heuristic
#ops = number of pivot operations.

Data set                   | Hartigan        | Discrete Hartigan | Merge&Split
                           | cost     #ops   | cost     #ops     | cost     #ops
Iris (d=4, n=150, k=3)     | 112.35   35.11  | 101.69   33.54    | 83.95    31.36
Wine (d=13, n=178, k=3)    | 607303   97.88  | 593319   100.02   | 570283   100.47
Yeast (d=8, n=1484, k=10)  | 47.10    1364.0 | 57.34    807.83   | 50.20    190.58

Data set                   | Hartigan++        | Discrete Hartigan++ | Merge&Split++
                           | cost      #ops    | cost      #ops      | cost      #ops
Iris (d=4, n=150, k=3)     | 101.49    19.40   | 90.48     18.93     | 88.56     8.84
Wine (d=13, n=178, k=3)    | 3152616   18.76   | 2525803   24.61     | 2498107   9.67
Yeast (d=8, n=1484, k=10)  | 47.41     1192.38 | 54.96     640.89    | 51.82     66.30
29. The (k, l)-means heuristic: navigating the local minima!
Associate each p_i to its l nearest cluster centers NN_l(p_i; K) (with iNN_l = the cluster center indexes), and minimize the (k, l)-means objective function (with 1 ≤ l ≤ k):

e(P, K; l) = Σ_{i=1}^n Σ_{a∈iNN_l(p_i;K)} ||p_i − c_a||²

Assignment/relocation guarantees a monotone decrease (a code sketch follows after this list).
A higher l yields fewer local minima in the optimization landscape.
Conversion to k-means:
- (k, l)↓-means: convert a (k, l)-means by assigning each point p_i to its closest center (among the l assigned at the end of the (k, l)-means), then compute the centroids and launch a regular Lloyd's k-means to finalize.
- Cascading conversion of (k, l)-means to k-means: after convergence of the (k, l)-means, initialize a (k, l−1)-means by dropping for each point p_i its farthest cluster, perform a Lloyd's (k, l−1)-means, etc., until we get a (k, 1)-means = k-means.
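A minimal NumPy sketch of the (k, l)-means objective and relocation step (empty clusters are not handled; names are illustrative):

import numpy as np

def kl_means_assign(P, K, l):
    # e(P, K; l): each point is charged to its l nearest centers.
    d2 = ((P[:, None, :] - K[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
    nn = np.argsort(d2, axis=1)[:, :l]                       # indexes of the l nearest centers
    return np.take_along_axis(d2, nn, axis=1).sum(), nn

def kl_means_relocate(P, nn, k):
    # Each center becomes the centroid of all points listing it among their l nearest.
    return np.vstack([P[(nn == j).any(axis=1)].mean(axis=0) for j in range(k)])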
30. The (k, l)-means heuristic: 10000 trials on the Iris data set

k  | win % | k-means        | (k, 2)↓-means
   |       | min     avg    | min     avg
3  | 20.8  | 78.94   92.39  | 78.94   78.94
4  | 24.29 | 57.31   63.15  | 57.31   70.33
5  | 57.76 | 46.53   52.88  | 49.74   51.10
6  | 80.55 | 38.93   45.60  | 38.93   41.63
7  | 76.67 | 34.18   40.00  | 34.29   36.85
8  | 80.36 | 29.87   36.05  | 29.87   32.52
9  | 78.85 | 27.76   32.91  | 27.91   30.15
10 | 79.88 | 25.81   30.24  | 25.97   28.02

k  l | win % | k-means         | (k, l)-means
     |       | min     avg     | min     avg
5  2 | 58.3  | 46.53   52.72   | 49.74   51.24
5  4 | 62.4  | 46.53   52.55   | 49.74   49.74
8  2 | 80.8  | 29.87   36.40   | 29.87   32.54
8  3 | 61.1  | 29.87   36.19   | 32.76   34.04
8  6 | 55.5  | 29.88   36.189  | 32.75   35.26
10 2 | 78.8  | 25.81   30.61   | 25.97   28.23
10 3 | 82.5  | 25.95   30.23   | 26.47   27.76
10 5 | 64.7  | 25.90   30.32   | 26.99   28.61

On average the heuristic achieves a better cost, but the best local minima are found by normal k-means...
34. Space of Bregman spheres: Lifting map [2]
F : x ↦ x̂ = (x, F(x)), the graph of the potential function, a hypersurface of ℝ^{d+1}.
H_p: tangent hyperplane at p̂:

z = H_p(x) = ⟨x − p, ∇F(p)⟩ + F(p)

A Bregman sphere σ lifts to σ̂ with supporting hyperplane

H_σ : z = ⟨x − c, ∇F(c)⟩ + F(c) + r

(parallel to H_c and shifted vertically by r); σ̂ = F ∩ H_σ.
Conversely, the intersection of any hyperplane H with F projects onto X as a Bregman sphere:

H : z = ⟨x, a⟩ + b  →  σ : Ball_F(c = (∇F)^{−1}(a), r = ⟨a, c⟩ − F(c) + b)
35. Space of Bregman spheres: Algorithmic applications
- The Vapnik-Chervonenkis dimension (VC-dim) of the class of Bregman balls is d + 1 [2] (relevant to machine learning).
- Union/intersection of Bregman d-spheres computed from a representational (d + 1)-polytope [2].
- The radical axis of two Bregman balls is a hyperplane: applications to nearest-neighbor search trees such as Bregman ball trees or Bregman vantage point trees [17].
36. Bregman proximity data structures, k-NN queries
Vantage point trees [17]: partition space according to Bregman balls.
Partitioning space with intersections of Kullback-Leibler balls → efficient nearest-neighbor queries in information spaces.
37. Application: Minimum Enclosing Ball [12, 19]
To a hyperplane H_σ = H(a, b) : z = ⟨a, x⟩ + b in ℝ^{d+1} corresponds a ball σ = Ball(c, r) in ℝ^d with center c = ∇F^*(a) and radius:

r = ⟨a, c⟩ − F(c) + b = ⟨a, ∇F^*(a)⟩ − F(∇F^*(a)) + b = F^*(a) + b,

since F(∇F^*(a)) = ⟨∇F^*(a), a⟩ − F^*(a) (Young equality).
SEB: find the halfspace H(a, b)^− : z ≤ ⟨a, x⟩ + b that contains all the lifted points:

min_{a,b} r = F^*(a) + b,   subject to ∀i ∈ {1, ..., n}, ⟨a, x_i⟩ + b − F(x_i) ≥ 0

→ a convex program (CP) with linear inequality constraints.
For the self-dual generator F(θ) = F^*(η) = (1/2) x^⊤x, the CP becomes the quadratic program (QP) used in SVMs; the smallest enclosing ball itself is used as a primitive in SVMs.
38. Approximating the smallest enclosing Bregman balls [19, 11]
Algorithm 3: BBCA(P, l).
c_1 ← choose a point of P at random;
for i = 2 to l − 1 do
  // farthest point from c_i w.r.t. B_F
  s_i ← arg max_{j=1}^n B_F(c_i : p_j);
  // update the center: walk on the η-segment [c_i, p_{s_i}]_η
  c_{i+1} ← ∇F^{−1}(∇F(c_i) #_{1/(i+1)} ∇F(p_{s_i}));
end
// return the SEBB approximation
return Ball(c_l, r_l = B_F(c_l : X));
Uses the θ-/η-geodesic segments of the dually flat geometry.
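A sketch of BBCA with the gradient maps passed as callbacks; for the squared Euclidean generator, ∇F is the identity and the η-segment walk reduces to the classical Badoiu-Clarkson update c + (p_s − c)/(i + 1):

import numpy as np

def bbca(P, T, grad, grad_inv, breg):
    # Approximate the smallest enclosing Bregman ball center by walking on eta-segments.
    c = P[0].copy()
    for i in range(1, T):
        s = np.argmax([breg(c, p) for p in P])   # farthest point w.r.t. B_F(c : .)
        t = 1.0 / (i + 1)
        c = grad_inv((1 - t) * grad(c) + t * grad(P[s]))  # step on [c, p_s]_eta
    return c, max(breg(c, p) for p in P)

# Squared Euclidean case: grad = grad_inv = identity.
sq = lambda x, y: 0.5 * np.dot(x - y, x - y)
P = np.random.default_rng(0).normal(size=(100, 2))
print(bbca(P, 500, lambda x: x, lambda x: x, sq))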
39. Smallest enclosing balls: Core-sets [19]
Core-set C ⊆ S: SOL(S) ≤ SOL(C) ≤ (1 + ε) SOL(S).
(Figures: smallest enclosing balls for the extended Kullback-Leibler and Itakura-Saito divergences.)
40. Programming InSphere predicates
Implicit representation of Bregman spheres/balls [2]: consider d + 1 support points on the boundary.
Is x inside the Bregman ball defined by the d + 1 support points?

InSphere(x; p_0, ..., p_d) = det | 1       ...  1       1    |
                                 | p_0     ...  p_d     x    |
                                 | F(p_0)  ...  F(p_d)  F(x) |

i.e., the sign of a (d + 2) × (d + 2) matrix determinant.
InSphere(x; p_0, ..., p_d) is negative, null or positive depending on whether x lies inside, on, or outside σ.
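A direct NumPy transcription of this predicate (a sketch: the sign convention depends on the orientation of the support points; the example uses F(x) = (1/2)⟨x, x⟩, for which Bregman spheres are ordinary spheres):

import numpy as np

def insphere(x, pts, F):
    # Determinant of the (d+2) x (d+2) matrix with columns (1, p, F(p)) for each support
    # point p, and (1, x, F(x)) as the last column.
    cols = [np.concatenate(([1.0], p, [F(p)])) for p in pts]
    cols.append(np.concatenate(([1.0], x, [F(x)])))
    return np.sign(np.linalg.det(np.column_stack(cols)))

F = lambda v: 0.5 * np.dot(v, v)
pts = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]  # unit circle
print(insphere(np.array([0.0, 0.0]), pts, F))  # -1.0: inside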
41. Computing f-divergences for a generic generator f: beyond stochastic Monte Carlo numerical integration
42. Ali-Silvey-Csiszár f-divergences [6]

I_f(X_1 : X_2) = ∫ x_1(x) f(x_2(x)/x_1(x)) dν(x) ≥ 0   (potentially +∞)

Name of the f-divergence   | Formula I_f(P : Q)                                        | Generator f(u), with f(1) = 0
Total variation (metric)   | (1/2) ∫ |p(x) − q(x)| dν(x)                               | (1/2) |u − 1|
Squared Hellinger          | ∫ (√p(x) − √q(x))² dν(x)                                  | (√u − 1)²
Pearson χ²_P               | ∫ (q(x) − p(x))² / p(x) dν(x)                             | (u − 1)²
Neyman χ²_N                | ∫ (p(x) − q(x))² / q(x) dν(x)                             | (1 − u)² / u
Pearson-Vajda χ^k_P        | ∫ (q(x) − λp(x))^k / p^{k−1}(x) dν(x)                     | (u − 1)^k
Pearson-Vajda |χ|^k_P      | ∫ |q(x) − λp(x)|^k / p^{k−1}(x) dν(x)                     | |u − 1|^k
Kullback-Leibler           | ∫ p(x) log(p(x)/q(x)) dν(x)                               | −log u
Reverse Kullback-Leibler   | ∫ q(x) log(q(x)/p(x)) dν(x)                               | u log u
α-divergence               | (4/(1 − α²)) (1 − ∫ p^{(1−α)/2}(x) q^{(1+α)/2}(x) dν(x))  | (4/(1 − α²)) (1 − u^{(1+α)/2})
Jensen-Shannon             | (1/2) ∫ (p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x)))) dν(x) | −(u + 1) log((1+u)/2) + u log u

Stochastic Monte Carlo estimation (never +∞!):

Î_f(p : q) = (1/n) Σ_{i=1}^n f(x_2(s_i)/x_1(s_i)),   s_1, ..., s_n ∼_iid X_1
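A minimal Monte Carlo estimator along these lines (illustrative SciPy example; KL(N(0,1) : N(1,1)) = 1/2):

import numpy as np
from scipy.stats import norm

def f_divergence_mc(f, sampler1, density1, density2, n=100000):
    # \hat I_f = (1/n) sum_i f(x2(s_i) / x1(s_i)), with s_i drawn iid from X1.
    s = sampler1(n)
    return np.mean(f(density2(s) / density1(s)))

p, q = norm(0, 1), norm(1, 1)
print(f_divergence_mc(lambda u: -np.log(u), p.rvs, p.pdf, q.pdf))  # approx 0.5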
43. Information monotonicity of f -divergences [6]
(Proof in the Ali-Silvey paper.)
Do coarse binning, from d bins to k < d bins: X = ⊎_{i=1}^k A_i.
Let p^A = (p^A_i)_i with p^A_i = Σ_{j∈A_i} p_j.
Information monotonicity:

D(p : q) ≥ D(p^A : q^A)

We can only discriminate less with downgraded (coarse-grained) histograms...
⇒ f-divergences are the only divergences preserving information monotonicity.
44. f-divergences and higher-order Vajda χ^k divergences [6]

I_f(X_1 : X_2) = Σ_{k=0}^∞ (f^{(k)}(1)/k!) χ^k_P(X_1 : X_2)

χ^k_P(X_1 : X_2) = ∫ (x_2(x) − x_1(x))^k / x_1(x)^{k−1} dν(x),
|χ|^k_P(X_1 : X_2) = ∫ |x_2(x) − x_1(x)|^k / x_1(x)^{k−1} dν(x),

are f-divergences for the generators (u − 1)^k and |u − 1|^k.
When k = 1, χ¹_P(X_1 : X_2) = ∫ (x_2(x) − x_1(x)) dν(x) = 0 (never discriminative), and |χ|¹_P(X_1, X_2) is twice the total variation distance.
χ^k_P is a signed distance.
45. Affine exponential families [6]
Canonical decomposition of the probability measure:

p_θ(x) = exp(⟨t(x), θ⟩ − F(θ) + k(x)),

and consider an affine natural parameter space Θ (like the multinomials).

Poi(λ): p(x|λ) = λ^x e^{−λ} / x!,   λ > 0, x ∈ {0, 1, ...}
Nor_I(µ): p(x|µ) = (2π)^{−d/2} e^{−(1/2)(x−µ)^⊤(x−µ)},   µ ∈ ℝ^d, x ∈ ℝ^d

Family        | θ     | Θ   | F(θ)       | k(x)                        | t(x) | ν
Poisson       | log λ | ℝ   | e^θ        | −log x!                     | x    | ν_c (counting measure)
Iso. Gaussian | µ     | ℝ^d | (1/2)θ^⊤θ  | −(d/2) log 2π − (1/2)x^⊤x   | x    | ν_L (Lebesgue measure)
46. Higher-order Vajda χ^k divergences [6]
The (signed) χ^k_P distance between members X_1 ∼ EF(θ_1) and X_2 ∼ EF(θ_2) of the same affine exponential family is (k ∈ ℕ) always bounded and equal to:

χ^k_P(X_1 : X_2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{F((1−j)θ_1 + jθ_2)} / e^{(1−j)F(θ_1) + jF(θ_2)}

For Poisson/Normal distributions, we get closed-form formulas:

χ^k_P(λ_1 : λ_2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{λ_1^{1−j} λ_2^j − ((1−j)λ_1 + jλ_2)},
χ^k_P(µ_1 : µ_2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{(1/2) j(j−1) (µ_1−µ_2)^⊤(µ_1−µ_2)}.
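For instance, the Poisson closed form can be evaluated directly (a sketch; χ¹_P always vanishes, and χ²_P agrees with the Pearson χ² between the two Poisson densities, e^{(λ_2−λ_1)²/λ_1} − 1):

import math

def chi_k_poisson(k, l1, l2):
    # chi^k_P(lambda_1 : lambda_2) = sum_j (-1)^{k-j} C(k, j) e^{l1^{1-j} l2^j - ((1-j) l1 + j l2)}
    return sum((-1) ** (k - j) * math.comb(k, j)
               * math.exp(l1 ** (1 - j) * l2 ** j - ((1 - j) * l1 + j * l2))
               for j in range(k + 1))

print(chi_k_poisson(1, 0.5, 0.3))  # 0 (never discriminative)
print(chi_k_poisson(2, 0.5, 0.3))  # exp((0.3 - 0.5)^2 / 0.5) - 1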
48. Bibliography I
David Arthur and Sergei Vassilvitskii.
k-means++: the advantages of careful seeding.
In Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages
1027–1035. Society for Industrial and Applied Mathematics, 2007.
Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock.
Bregman Voronoi diagrams.
Discrete and Computational Geometry, 44(2):281–307, April 2010.
Bent Fuglede and Flemming Topsoe.
Jensen-Shannon divergence and Hilbert space embedding.
In IEEE International Symposium on Information Theory, pages 31–31, 2004.
F. R. Hampel, P. J. Rousseeuw, E. Ronchetti, and W. A. Stahel.
Robust Statistics: The Approach Based on Influence Functions.
Wiley Series in Probability and Mathematical Statistics, 1986.
Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.
Shape retrieval using hierarchical total Bregman soft clustering.
Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.
F. Nielsen and R. Nock.
On the chi square and higher-order chi distances for approximating f -divergences.
IEEE Signal Processing Letters, 21(1):10–13, 2014.
Frank Nielsen.
k-MLE: A fast algorithm for learning statistical mixture models.
CoRR, 1203.5181, 2012.
49. Bibliography II
Frank Nielsen.
Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation
for frequency histograms.
IEEE Signal Processing Letters, 20(7):657–660, 2013.
Frank Nielsen.
On learning statistical mixtures maximizing the complete likelihood.
Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014),
1641:238–245, 2014.
Frank Nielsen and Sylvain Boltz.
The Burbea-Rao and Bhattacharyya centroids.
IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.
Frank Nielsen and Richard Nock.
On approximating the smallest enclosing Bregman balls.
In Proceedings of the Twenty-second Annual Symposium on Computational Geometry, SCG ’06, pages
485–486, New York, NY, USA, 2006. ACM.
Frank Nielsen and Richard Nock.
On the smallest enclosing information disk.
Information Processing Letters (IPL), 105(3):93–97, 2008.
Frank Nielsen and Richard Nock.
Sided and symmetrized Bregman centroids.
IEEE Transactions on Information Theory, 55(6):2882–2904, 2009.
Frank Nielsen and Richard Nock.
Further heuristics for k-means: The merge-and-split heuristic and the (k, l)-means.
arXiv preprint arXiv:1406.6314, 2014.
50. Bibliography III
Frank Nielsen and Richard Nock.
Total Jensen divergences: Definition, properties and clustering.
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2016–2020,
2015.
Frank Nielsen, Richard Nock, and Shun-ichi Amari.
On clustering histograms with k-means by using mixed α-divergences.
Entropy, 16(6):3273–3301, 2014.
Frank Nielsen, Paolo Piro, and Michel Barlaud.
Bregman vantage point trees for efficient nearest neighbor queries.
In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878–881,
2009.
Richard Nock, Panu Luosto, and Jyrki Kivinen.
Mixed Bregman clustering with approximation guarantees.
In Machine Learning and Knowledge Discovery in Databases, pages 154–169. Springer, 2008.
Richard Nock and Frank Nielsen.
Fitting the smallest enclosing Bregman balls.
In 16th European Conference on Machine Learning (ECML), pages 649–656, October 2005.
Atsumi Ohara, Hiroshi Matsuzoe, and Shun-ichi Amari.
A dually flat structure on the space of escort distributions.
Journal of Physics: Conference Series, 201(1):012012, 2010.
Christophe Saint-Jean and Frank Nielsen.
Hartigan's method for k-MLE: Mixture modeling with Wishart distributions and its application to motion
retrieval.
In Frank Nielsen, editor, Geometric Theory of Information, Signals and Communication Technology, pages
301–330. Springer International Publishing, 2014.
51. Bibliography IV
Olivier Schwander and Frank Nielsen.
Fast learning of gamma mixture models with k-MLE.
In Similarity-Based Pattern Recognition, pages 235–249. Springer, 2013.
Olivier Schwander, Aurelien J Schutz, Frank Nielsen, and Yannick Berthoumieu.
k-MLE for mixtures of generalized Gaussians.
In 21st International Conference on Pattern Recognition (ICPR), pages 2825–2828. IEEE, 2012.
Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen.
Total Bregman divergence and its applications to DTI analysis.
IEEE Transactions on Medical Imaging, pages 475–483, 2011.
Si Wu and Shun-ichi Amari.
Conformal transformation of kernel functions a data dependent way to improve support vector machine
classifiers.
Neural Processing Letters, 15(1):59–67, 2002.