This document summarizes Frank Nielsen's talk on divergence-based center clustering and their applications. Some key points:
- Center-based clustering aims to minimize an objective function that assigns data points to their closest cluster centers; the problem is NP-hard as soon as the dimension and the number of clusters both exceed 1.
- Mixed divergences use dual (left- and right-sided) centroids per cluster to define cluster assignments. Total Jensen divergences are proposed as a way to make divergences more robust by incorporating a conformal factor.
- For clustering when centroids do not have closed-form solutions, initialization methods like k-means++ can be used, which randomly select initial seeds without computing centroids. Total Jensen k-means++ seeding keeps a guaranteed approximation bound without any centroid computation.
Divergence clustering
1. Divergence-based center clustering and their applications
Frank Nielsen
joint work with Richard Nock et al.
École Polytechnique
Sony Computer Science Laboratories, Inc.
Talk at Geneva University (Dec. 2015)
Viper group, CS Dept.
2. Background: Center-based clustering [16]
Countless applications of clustering: quantization (coding), finding categories (unsupervised clustering), speeding up computations (e.g., distance computations), and so on.
Minimize the objective/energy/loss function:

E(X = {x_1, ..., x_n}; C = {c_1, ..., c_k}) = min_C Σ_{i=1}^n min_{j∈[k]} D(x_i : c_j)
NP-hard as soon as d > 1 and k > 1.
Dissimilarity D(· : ·): a metric distance vs. a C² divergence (non-metric, e.g., the squared Euclidean distance).
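To make the objective concrete, here is a minimal NumPy sketch (not from the talk) that evaluates E(X; C) for the squared Euclidean divergence:

import numpy as np

def kmeans_cost(X, C):
    # E(X; C) = sum_i min_j D(x_i : c_j), with D the squared Euclidean divergence.
    # X: (n, d) array of data points, C: (k, d) array of cluster centers.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k) pairwise costs
    return d2.min(axis=1).sum()

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
C = np.array([[0.0, 0.0], [5.0, 5.0]])
print(kmeans_cost(X, C))  # 0.01: each point is charged to its closest center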
3. Background: Center-based clustering [16]
Initialization of the k cluster centers (seeds), also known as global k-means heuristics: random (Forgy), global k-means (discrete k-means), randomized k-means++ (expected Õ(log k) approximation guarantee).
Celebrated local k-means heuristics: Lloyd’s batched
allocation (assignment/center relocation), Hartigan’s single
point reassignment.
⇒ Aim at guaranteeing monotone convergence
Continuous vs. discrete k-means, k-center, etc.
Variational k-means: when the centroid arg min_c Σ_{i=1}^n w_i D(x_i : c) is not in closed form, the center relocation just needs to be better (not best) to still guarantee monotone convergence.
4. The trick of mixed divergences [18, 16]: dual centroids per cluster
5. Mixed divergences [16]
Defined on three parameters p, q and r:

M_λ(p : q : r) ≐ λ D(p : q) + (1 − λ) D(q : r),   for λ ∈ [0, 1].

Mixed divergences include:
- the sided divergences for λ ∈ {0, 1},
- the symmetrized (arithmetic mean) divergence for λ = 1/2,
- the skew symmetrized divergences for λ ∈ (0, 1), λ ≠ 1/2.
6. Symmetrizing α-divergences
S_α(p, q) = (1/2) (D_α(p : q) + D_α(q : p)) = S_{−α}(p, q) = M_{1/2}(p : q : p)

For α = ±1, we get half of the Jeffreys divergence:

S_{±1}(p, q) = (1/2) Σ_{i=1}^d (p_i − q_i) log(p_i / q_i)

Same formula for probability and positive measures.
Centroids for the symmetrized α-divergence are usually not in closed form.
How to perform finite center-based clustering without closed-form centroids (beyond the variational approach)?
7. Closed-form formula for Jeffreys positive centroid [8]
Jeffreys divergence (symmetrized KL) = the symmetrized α-divergence for α = ±1.
The Jeffreys positive centroid c = (c^1, ..., c^d) of a set {h_1, ..., h_n} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:

c^i = a^i / W((a^i / g^i) e)

where a^i = Σ_{j=1}^n π_j h_j^i denotes the coordinate-wise arithmetic weighted mean and g^i = Π_{j=1}^n (h_j^i)^{π_j} the coordinate-wise geometric weighted mean.
The Lambert analytic function W (positive branch) is defined by W(x) e^{W(x)} = x for x ≥ 0.
⇒ Finite Jeffreys k-means clustering. But for α ≠ ±1, how to cluster in a finite number of iterations?
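As an illustration, a small SciPy sketch of this closed-form centroid (assuming the weights π_j sum to one; scipy.special.lambertw evaluates the branch W_0 used here):

import numpy as np
from scipy.special import lambertw

def jeffreys_positive_centroid(H, p):
    # H: (n, d) positive histograms, p: (n,) normalized histogram weights.
    a = np.average(H, axis=0, weights=p)                   # arithmetic weighted mean a^i
    g = np.exp(np.average(np.log(H), axis=0, weights=p))   # geometric weighted mean g^i
    # c^i = a^i / W((a^i / g^i) e), coordinate-wise
    return a / np.real(lambertw((a / g) * np.e))

H = np.array([[1.0, 2.0, 3.0], [2.0, 1.0, 3.0]])
print(jeffreys_positive_centroid(H, np.array([0.5, 0.5])))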
8. Mixed α-divergences/α-Jeffreys symmetrized divergence
Mixed α-divergence of a histogram x to two histograms p and q:

M_{λ,α}(p : x : q) = λ D_α(p : x) + (1 − λ) D_α(x : q)
                   = λ D_{−α}(x : p) + (1 − λ) D_{−α}(q : x)
                   = M_{1−λ,−α}(q : x : p)

The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:

S_α(p, q) = M_{1/2,α}(q : p : q) = M_{1/2,α}(p : q : p)

The skew symmetrized α-divergence is defined by:

S_{λ,α}(p : q) = λ D_α(p : q) + (1 − λ) D_α(q : p)
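For intuition, here is a hedged NumPy sketch of these mixed α-divergences on frequency histograms, using the α-divergence generator listed on the f-divergence slide further below (the restriction α ≠ ±1 and this normalization are assumptions of the sketch):

import numpy as np

def alpha_div(p, q, alpha):
    # D_alpha(p : q) = 4/(1 - alpha^2) (1 - sum_i p_i^{(1-alpha)/2} q_i^{(1+alpha)/2}),
    # for frequency histograms and alpha != +/-1 (KL is recovered in the limit).
    return 4.0 / (1.0 - alpha**2) * (1.0 - np.sum(p ** ((1 - alpha) / 2) * q ** ((1 + alpha) / 2)))

def mixed_alpha_div(p, x, q, lam, alpha):
    # M_{lambda,alpha}(p : x : q) = lambda D_alpha(p : x) + (1 - lambda) D_alpha(x : q)
    return lam * alpha_div(p, x, alpha) + (1 - lam) * alpha_div(x, q, alpha)

p = np.array([0.4, 0.6]); x = np.array([0.35, 0.65]); q = np.array([0.3, 0.7])
# Check the reference identity M_{lambda,alpha}(p : x : q) = M_{1-lambda,-alpha}(q : x : p):
print(mixed_alpha_div(p, x, q, 0.3, 0.5), mixed_alpha_div(q, x, p, 0.7, -0.5))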
9. Mixed divergence-based k-means clustering
Initially, pick k distinct seeds from the dataset, with l_i = r_i.
Input: weighted histogram set H, divergence D(·, ·), integer k > 0, real λ ∈ [0, 1];
Initialize left-sided/right-sided seeds C = {(l_i, r_i)}_{i=1}^k;
repeat
  // Assignment (as usual)
  for i = 1, 2, ..., k do
    C_i ← {h ∈ H : i = arg min_j M_λ(l_j : h : r_j)};
  end
  // Dual-sided centroid relocation (the trick!)
  for i = 1, 2, ..., k do
    r_i ← arg min_x D(C_i : x) = arg min_x Σ_{h∈C_i} w_h D(h : x);
    l_i ← arg min_x D(x : C_i) = arg min_x Σ_{h∈C_i} w_h D(x : h);
  end
until convergence;
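For the extended Kullback-Leibler divergence (a Bregman divergence), both sided centroids are known in closed form: the right-sided centroid is the weighted arithmetic mean and the left-sided centroid is the weighted geometric mean. A minimal sketch of this dual relocation step:

import numpy as np

def dual_kl_centroids(H, w):
    # Sided centroids for the extended KL divergence KL(p : q) = sum p log(p/q) - p + q.
    # r = argmin_x sum_h w_h KL(h : x)  ->  weighted arithmetic mean
    # l = argmin_x sum_h w_h KL(x : h)  ->  weighted geometric mean
    r = np.average(H, axis=0, weights=w)
    l = np.exp(np.average(np.log(H), axis=0, weights=w))
    return l, r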
10. Example: Mixed α-hard clustering: MAhC(H, k, λ, α)
Input: weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ ℝ;
Let C = {(l_i, r_i)}_{i=1}^k ← MAS(H, k, λ, α);
repeat
  // Assignment
  for i = 1, 2, ..., k do
    A_i ← {h ∈ H : i = arg min_j M_{λ,α}(l_j : h : r_j)};
  end
  // Centroid relocation (coordinate-wise α-means)
  for i = 1, 2, ..., k do
    r_i ← (Σ_{h∈A_i} w_h h^{(1−α)/2})^{2/(1−α)};
    l_i ← (Σ_{h∈A_i} w_h h^{(1+α)/2})^{2/(1+α)};
  end
until convergence;
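A small sketch of this coordinate-wise relocation (assuming cluster weights renormalized to sum to one, and α ≠ ±1); note that the left centroid is the same α-mean evaluated at −α:

import numpy as np

def alpha_mean(H, w, alpha):
    # Coordinate-wise alpha-mean: ( sum_h w_h h^{(1-alpha)/2} )^{2/(1-alpha)}.
    # alpha = -1 recovers the weighted arithmetic mean.
    w = w / w.sum()
    return (w @ H ** ((1 - alpha) / 2)) ** (2 / (1 - alpha))

# Within a cluster A_i holding histograms Hc with weights wc:
#   r_i = alpha_mean(Hc, wc, alpha)
#   l_i = alpha_mean(Hc, wc, -alpha)   # since (1+alpha)/2 = (1-(-alpha))/2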
11. Coupled k-Means++ α-Seeding (extending k-means++)
Algorithm 1: Mixed α-seeding; MAS(H, k, λ, α)
Input: weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ ℝ;
Let C ← {(h, h)} for a histogram h ∈ H picked with uniform probability;
for i = 2, 3, ..., k do
  Pick at random a histogram h ∈ H with probability

    π_H(h) ≐ w_h M_{λ,α}(c_h : h : c̄_h) / Σ_{y∈H} w_y M_{λ,α}(c_y : y : c̄_y),   (1)

  where (c_h, c̄_h) ≐ arg min_{(z,z̄)∈C} M_{λ,α}(z : h : z̄);
  C ← C ∪ {(h, h)};
end
Output: set of initial cluster centers C;
→ Guaranteed probabilistic bound. Just need to initialize! No centroid computations: further iterations are not theoretically required.
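A minimal NumPy sketch of this seeding (the function and parameter names are illustrative; the mixed divergence is passed as a callback M(l, h, r)):

import numpy as np

def mixed_seeding(H, w, k, mixed_div, rng=np.random.default_rng()):
    # k-means++-style seeding: each center is a pair (l, r), initialized as (h, h).
    n = len(H)
    idx = [int(rng.integers(n))]
    for _ in range(1, k):
        # Cost of h: w_h times the mixed divergence to its closest current center pair.
        cost = np.array([w[j] * min(mixed_div(H[c], H[j], H[c]) for c in idx)
                         for j in range(n)])
        idx.append(int(rng.choice(n, p=cost / cost.sum())))
    return [(H[i], H[i]) for i in idx]

# e.g., with the mixed alpha-divergence sketched earlier:
# seeds = mixed_seeding(H, w, 3, lambda l, h, r: mixed_alpha_div(l, h, r, 0.5, 0.0))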
12. Learning statistical mixtures with hard EM
k-GMLE [7]: fast, guaranteed finite convergence, low memory footprint... but not consistent (biased).
13. Learning MMs: A geometric hard clustering viewpoint
Learn the parameters of a mixture m(x) = Σ_{i=1}^k w_i p(x|θ_i).
Maximize the complete data likelihood = a clustering objective function:

max_{W,Λ} l_c(W, Λ) = Σ_{i=1}^n Σ_{j=1}^k z_{i,j} log(w_j p(x_i|θ_j))
                    = max_{W,Λ} Σ_{i=1}^n max_{j∈[k]} log(w_j p(x_i|θ_j))
                    ≡ min_{W,Λ} Σ_{i=1}^n min_{j∈[k]} D_j(x_i),

where c_j = (w_j, θ_j) (cluster prototype) and D_j(x_i) = −log p(x_i|θ_j) − log w_j are potential distance-like functions.
⇒ one can further attach to each cluster (mixture component) a different family of probability distributions.
14. Generalized k-MLE: learning statistical EF mixtures [7, 23, 22, 21, 9]
Model-based clustering, with assignment of points to clusters driven by:

D_{w_j,θ_j,F_j}(x) = −log p_{F_j}(x; θ_j) − log w_j

1. Initialize the weights W ∈ Δ_k and the family type (F_1, ..., F_k) of each cluster.
2. Solve min_Λ Σ_i min_j D_j(x_i) (center-based clustering for W fixed) with the potential functions D_j(x_i) = −log p_{F_j}(x_i|θ_j) − log w_j.
3. Solve the family types maximizing the MLE in each cluster C_j by choosing the parametric family of distributions F_j = F(γ_j) that yields the best likelihood: min_{F_1=F(γ_1),...,F_k=F(γ_k)∈F(γ)} Σ_i min_j D_{w_j,θ_j,F_j}(x_i), with ∀l, γ_l = arg max_j F_j^*(η̂_l) + (1/n_l) Σ_{x∈C_l} k_j(x), where η̂_l = (1/n_l) Σ_{x∈C_l} t_j(x).
4. Update the weights W as the cluster point proportions.
5. Test for convergence, otherwise go to step 2.
Drawback: a biased, non-consistent estimator due to the Voronoi support truncation.
18. Total Bregman divergences
Conformal divergence, with conformal factor ρ:

D_ρ(p : q) = ρ(p, q) D(p : q)

ρ plays the rôle of a "regularizer" [24] and ensures robustness.
Invariance under rotations of the axes of the design space:

tB(p : q) = B(p : q) / √(1 + ⟨∇F(q), ∇F(q)⟩) = ρ_B(q) B(p : q),   ρ_B(q) = 1 / √(1 + ⟨∇F(q), ∇F(q)⟩).

Total squared Euclidean divergence:

tE(p, q) = (1/2) ⟨p − q, p − q⟩ / √(1 + ⟨q, q⟩).
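In code, the total squared Euclidean instance is a one-liner (a sketch; for F(x) = (1/2)⟨x, x⟩ we have ∇F(q) = q, hence ρ_B(q) = 1/√(1 + ⟨q, q⟩)):

import numpy as np

def total_squared_euclidean(p, q):
    # tE(p, q) = (1/2) <p - q, p - q> / sqrt(1 + <q, q>)
    return 0.5 * np.dot(p - q, p - q) / np.sqrt(1.0 + np.dot(q, q))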
19. Total Jensen divergence: Illustration of the principle
(Figure: geometric construction of the skew Jensen divergence J_α(p : q) and the total Jensen divergence tJ_α(p : q) on the graph of F, shown for points (p, q) and their rotated images (p', q').)
20. Total Jensen divergences [15]
tB(p : q) = ρ_B(q) B(p : q),   ρ_B(q) = 1 / √(1 + ⟨∇F(q), ∇F(q)⟩)

tJ_α(p : q) = ρ_J(p, q) J_α(p : q),   ρ_J(p, q) = 1 / √(1 + (F(p) − F(q))² / ⟨p − q, p − q⟩)

The Jensen-Shannon divergence, whose square root is a metric [3]:

JS(p, q) = (1/2) Σ_{i=1}^d p_i log(2p_i / (p_i + q_i)) + (1/2) Σ_{i=1}^d q_i log(2q_i / (p_i + q_i))

Lemma: The square root of the total Jensen-Shannon divergence is not a metric.
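A sketch of tJ_α under the scaled skew-Jensen convention J_α(p : q) = (αF(p) + (1 − α)F(q) − F(αp + (1 − α)q)) / (α(1 − α)), for α ∈ (0, 1); this normalization, which recovers the sided Bregman divergences in the limits α → 0, 1, is an assumption of the sketch:

import numpy as np

def total_jensen(p, q, F, alpha):
    # Skew Jensen divergence J_alpha, then the conformal factor
    # rho_J(p, q) = 1 / sqrt(1 + (F(p) - F(q))^2 / <p - q, p - q>).
    J = (alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)) / (alpha * (1 - alpha))
    rho = 1.0 / np.sqrt(1.0 + (F(p) - F(q)) ** 2 / np.dot(p - q, p - q))
    return rho * J

# Shannon negentropy generator F(x) = sum x log x:
negent = lambda x: np.sum(x * np.log(x))
print(total_jensen(np.array([0.4, 0.6]), np.array([0.3, 0.7]), negent, 0.5))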
21. Total Jensen divergences/Total Bregman divergences
Total Jensen is not a generalization of total Bregman: in the limit cases α ∈ {0, 1}, we have

lim_{α→0} tJ_α(p : q) = ρ_J(p, q) B(p : q) ≠ ρ_B(q) B(p : q),
lim_{α→1} tJ_α(p : q) = ρ_J(p, q) B(q : p) ≠ ρ_B(p) B(q : p),

since the conformal factors differ: ρ_J(p, q) ≠ ρ_B(q).
22. Conformal factor from mean value theorem
When p ≃ q, ρ_J(p, q) ≃ ρ_B(q), and the total Jensen divergence tends to the total Bregman divergence for any value of α:

ρ_J(p, q) = 1 / √(1 + ⟨∇F(ξ), ∇F(ξ)⟩) = ρ_B(ξ),   for some ξ ∈ [p, q].

For univariate generators, the value of ξ is explicit:

ξ = ∇F^{−1}(ΔF/Δ) = ∇F^*(ΔF/Δ),

where ΔF/Δ = (F(p) − F(q))/(p − q) is the chord slope and F^* is the Legendre convex conjugate [10].
23. Centroids and statistical robustness
Centroids (barycenters) are minimizers of average (weighted) divergences:

L(x; w) = Σ_{i=1}^n w_i tJ_α(p_i : x),   c_α = arg min_{x∈X} L(x; w).

- Is it unique?
- Is it robust to outliers [4]?
- Computed by an iterative convex-concave procedure (CCCP, a gradient method without a learning rate) [10].
24. Clustering: No closed-form centroid, no cry!
k-means++ [1] picks seeds at random: no centroid calculation needed.
Algorithm 2: Total Jensen k-means++ seeding
Input: number of clusters k ≥ 1;
Let C ← {h_j} for a histogram h_j picked with uniform probability;
for i = 2, 3, ..., k do
  Pick at random h ∈ H with probability π_H(h) = tJ_α(c_h : h) / Σ_{y∈H} tJ_α(c_y : y), where c_h = arg min_{z∈C} tJ_α(z : h);
  C ← C ∪ {h};
end
Output: set of initial cluster centers C;
25. Total Jensen divergences: Recap
- Total Jensen divergence = a conformal divergence with a non-separable, double-sided conformal factor.
- Invariant to rotations of the axes of the "design space".
- Equivalent to total Bregman divergences [24, 5] only when p ≃ q.
- The square root of the Jensen-Shannon divergence is a metric, but the square root of the total Jensen-Shannon divergence is not.
- Total Jensen k-means++ does not require centroid computations and has a guaranteed approximation bound.
- Conformal divergences are also of interest in SVMs [25] (double-sided separable factors) and in information geometry [20] (flattening).
26. Novel heuristics for NP-hard center-based clustering: merge-and-split and (k, l)-means [14]
27. The k-means merge-and-split heuristic
Generalizes Hartigan's single-point relocation heuristic...
Consider pairs of clusters (C_i, C_j) with centers c_i and c_j; merge them, then split the merged cluster again into two clusters with new centers c'_i and c'_j. Accept when the sum of the two cluster variances decreases:

Δ(C_i, C_j) = V(C_i, c_i) + V(C_j, c_j) − (V(C'_i, c'_i) + V(C'_j, c'_j)) > 0

How to split the merged cluster again (the best split is NP-hard)? Two options (see the sketch below):
- Discrete 2-means: choose among the n_{i,j} = n_i + n_j points of C_{i,j} the two best centers (naively implemented in O(n_{i,j}³)). This yields a 2-approximation of 2-means.
- 2-means++ heuristic: pick c'_i at random, then pick c'_j randomly according to the normalized distribution of the squared distances of the points of C_{i,j} to c'_i (see k-means++). Repeat this initialization for a given number α of rounds (say, α = 1 + 0.01 (n_{i,j} choose 2)) and keep the best one.
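A hedged sketch of one merge-and-split trial with the 2-means++ re-split (helper names are illustrative; the re-split keeps the best of a few seedings and does not run Lloyd iterations):

import numpy as np

def variance(C, c):
    # Cluster variance V(C, c): sum of squared distances to the center c.
    return ((C - c) ** 2).sum()

def split_2means_pp(P, rounds, rng):
    # Pick c1 uniformly, then c2 proportionally to squared distances to c1; keep the best trial.
    best = None
    for _ in range(rounds):
        c1 = P[rng.integers(len(P))]
        d2 = ((P - c1) ** 2).sum(axis=1)
        c2 = P[rng.choice(len(P), p=d2 / d2.sum())]
        near1 = d2 <= ((P - c2) ** 2).sum(axis=1)
        cost = variance(P[near1], c1) + variance(P[~near1], c2)
        if best is None or cost < best[0]:
            best = (cost, c1, c2)
    return best

def try_merge_and_split(Ci, ci, Cj, cj, rounds=3, rng=np.random.default_rng()):
    # Accept the re-split of the merged cluster only if the sum of variances decreases.
    cost, c1, c2 = split_2means_pp(np.vstack([Ci, Cj]), rounds, rng)
    return (c1, c2) if cost < variance(Ci, ci) + variance(Cj, cj) else None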
28. The k-means merge-and-split heuristic
#ops = number of pivot operations.

Data set                   | Hartigan        | Discrete Hartigan | Merge&Split
                           | cost     #ops   | cost     #ops     | cost     #ops
Iris (d=4, n=150, k=3)     | 112.35   35.11  | 101.69   33.54    | 83.95    31.36
Wine (d=13, n=178, k=3)    | 607303   97.88  | 593319   100.02   | 570283   100.47
Yeast (d=8, n=1484, k=10)  | 47.10    1364.0 | 57.34    807.83   | 50.20    190.58

Data set                   | Hartigan++        | Discrete Hartigan++ | Merge&Split++
                           | cost      #ops    | cost      #ops      | cost      #ops
Iris (d=4, n=150, k=3)     | 101.49    19.40   | 90.48     18.93     | 88.56     8.84
Wine (d=13, n=178, k=3)    | 3152616   18.76   | 2525803   24.61     | 2498107   9.67
Yeast (d=8, n=1484, k=10)  | 47.41     1192.38 | 54.96     640.89    | 51.82     66.30
29. The (k, l)-means heuristic: navigating the local minima!
Associate each p_i to its l nearest cluster centers NN_l(p_i; K) (with iNN_l = the cluster center indexes), and minimize the (k, l)-means objective function (with 1 ≤ l ≤ k):

e(P, K; l) = Σ_{i=1}^n Σ_{a∈iNN_l(p_i;K)} ||p_i − c_a||²

Assignment/relocation guarantees a monotone decrease (a code sketch follows after this list).
A higher l yields fewer local minima in the optimization landscape.
Conversion to k-means:
- (k, l)↓-means: convert a (k, l)-means by assigning each point p_i to its closest center (among the l assigned at the end of the (k, l)-means), then compute the centroids and launch a regular Lloyd's k-means to finalize.
- Cascading conversion of (k, l)-means to k-means: after convergence of the (k, l)-means, initialize a (k, l−1)-means by dropping for each point p_i its farthest cluster, perform a Lloyd's (k, l−1)-means, etc., until we get a (k, 1)-means = k-means.
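A minimal NumPy sketch of the (k, l)-means objective and relocation step (empty clusters are not handled; names are illustrative):

import numpy as np

def kl_means_assign(P, K, l):
    # e(P, K; l): each point is charged to its l nearest centers.
    d2 = ((P[:, None, :] - K[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
    nn = np.argsort(d2, axis=1)[:, :l]                       # indexes of the l nearest centers
    return np.take_along_axis(d2, nn, axis=1).sum(), nn

def kl_means_relocate(P, nn, k):
    # Each center becomes the centroid of all points listing it among their l nearest.
    return np.vstack([P[(nn == j).any(axis=1)].mean(axis=0) for j in range(k)])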
30. The (k, l)-means heuristic: 10000 trials on the Iris data set

k  | win % | k-means        | (k, 2)↓-means
   |       | min     avg    | min     avg
3  | 20.8  | 78.94   92.39  | 78.94   78.94
4  | 24.29 | 57.31   63.15  | 57.31   70.33
5  | 57.76 | 46.53   52.88  | 49.74   51.10
6  | 80.55 | 38.93   45.60  | 38.93   41.63
7  | 76.67 | 34.18   40.00  | 34.29   36.85
8  | 80.36 | 29.87   36.05  | 29.87   32.52
9  | 78.85 | 27.76   32.91  | 27.91   30.15
10 | 79.88 | 25.81   30.24  | 25.97   28.02

k  l | win % | k-means         | (k, l)-means
     |       | min     avg     | min     avg
5  2 | 58.3  | 46.53   52.72   | 49.74   51.24
5  4 | 62.4  | 46.53   52.55   | 49.74   49.74
8  2 | 80.8  | 29.87   36.40   | 29.87   32.54
8  3 | 61.1  | 29.87   36.19   | 32.76   34.04
8  6 | 55.5  | 29.88   36.189  | 32.75   35.26
10 2 | 78.8  | 25.81   30.61   | 25.97   28.23
10 3 | 82.5  | 25.95   30.23   | 26.47   27.76
10 5 | 64.7  | 25.90   30.32   | 26.99   28.61

On average the heuristic achieves a better cost, but the best local minima are found by normal k-means...
34. Space of Bregman spheres: Lifting map [2]
F : x ↦ x̂ = (x, F(x)), the graph of the potential function, a hypersurface of ℝ^{d+1}.
H_p: tangent hyperplane at p̂:

z = H_p(x) = ⟨x − p, ∇F(p)⟩ + F(p)

A Bregman sphere σ lifts to σ̂ with supporting hyperplane

H_σ : z = ⟨x − c, ∇F(c)⟩ + F(c) + r

(parallel to H_c and shifted vertically by r); σ̂ = F ∩ H_σ.
Conversely, the intersection of any hyperplane H with F projects onto X as a Bregman sphere:

H : z = ⟨x, a⟩ + b  →  σ : Ball_F(c = (∇F)^{−1}(a), r = ⟨a, c⟩ − F(c) + b)
35. Space of Bregman spheres: Algorithmic applications
- The Vapnik-Chervonenkis dimension (VC-dim) of the class of Bregman balls is d + 1 [2] (relevant to machine learning).
- Union/intersection of Bregman d-spheres computed from a representational (d + 1)-polytope [2].
- The radical axis of two Bregman balls is a hyperplane: applications to nearest-neighbor search trees such as Bregman ball trees or Bregman vantage point trees [17].
36. Bregman proximity data structures, k-NN queries
Vantage point trees [17]: partition space according to Bregman balls.
Partitioning space with intersections of Kullback-Leibler balls → efficient nearest-neighbor queries in information spaces.
37. Application: Minimum Enclosing Ball [12, 19]
To a hyperplane H_σ = H(a, b) : z = ⟨a, x⟩ + b in ℝ^{d+1} corresponds a ball σ = Ball(c, r) in ℝ^d with center c = ∇F^*(a) and radius:

r = ⟨a, c⟩ − F(c) + b = ⟨a, ∇F^*(a)⟩ − F(∇F^*(a)) + b = F^*(a) + b,

since F(∇F^*(a)) = ⟨∇F^*(a), a⟩ − F^*(a) (Young equality).
SEB: find the halfspace H(a, b)^− : z ≤ ⟨a, x⟩ + b that contains all the lifted points:

min_{a,b} r = F^*(a) + b,   subject to ∀i ∈ {1, ..., n}, ⟨a, x_i⟩ + b − F(x_i) ≥ 0

→ a convex program (CP) with linear inequality constraints.
For the self-dual generator F(θ) = F^*(η) = (1/2) x^⊤x, the CP becomes the quadratic program (QP) used in SVMs; the smallest enclosing ball itself is used as a primitive in SVMs.
38. Approximating the smallest enclosing Bregman balls [19, 11]
Algorithm 3: BBCA(P, l).
c_1 ← choose a point of P at random;
for i = 2 to l − 1 do
  // farthest point from c_i w.r.t. B_F
  s_i ← arg max_{j=1}^n B_F(c_i : p_j);
  // update the center: walk on the η-segment [c_i, p_{s_i}]_η
  c_{i+1} ← ∇F^{−1}(∇F(c_i) #_{1/(i+1)} ∇F(p_{s_i}));
end
// return the SEBB approximation
return Ball(c_l, r_l = B_F(c_l : X));
Uses the θ-/η-geodesic segments of the dually flat geometry.
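A sketch of BBCA with the gradient maps passed as callbacks; for the squared Euclidean generator, ∇F is the identity and the η-segment walk reduces to the classical Badoiu-Clarkson update c + (p_s − c)/(i + 1):

import numpy as np

def bbca(P, T, grad, grad_inv, breg):
    # Approximate the smallest enclosing Bregman ball center by walking on eta-segments.
    c = P[0].copy()
    for i in range(1, T):
        s = np.argmax([breg(c, p) for p in P])   # farthest point w.r.t. B_F(c : .)
        t = 1.0 / (i + 1)
        c = grad_inv((1 - t) * grad(c) + t * grad(P[s]))  # step on [c, p_s]_eta
    return c, max(breg(c, p) for p in P)

# Squared Euclidean case: grad = grad_inv = identity.
sq = lambda x, y: 0.5 * np.dot(x - y, x - y)
P = np.random.default_rng(0).normal(size=(100, 2))
print(bbca(P, 500, lambda x: x, lambda x: x, sq))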
39. Smallest enclosing balls: Core-sets [19]
Core-set C ⊆ S: SOL(S) ≤ SOL(C) ≤ (1 + ε) SOL(S).
(Figures: smallest enclosing balls for the extended Kullback-Leibler and Itakura-Saito divergences.)
40. Programming InSphere predicates
Implicit representation of Bregman spheres/balls [2]: consider d + 1 support points on the boundary.
Is x inside the Bregman ball defined by the d + 1 support points?

InSphere(x; p_0, ..., p_d) = det | 1       ...  1       1    |
                                 | p_0     ...  p_d     x    |
                                 | F(p_0)  ...  F(p_d)  F(x) |

i.e., the sign of a (d + 2) × (d + 2) matrix determinant.
InSphere(x; p_0, ..., p_d) is negative, null or positive depending on whether x lies inside, on, or outside σ.
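A direct NumPy transcription of this predicate (a sketch: the sign convention depends on the orientation of the support points; the example uses F(x) = (1/2)⟨x, x⟩, for which Bregman spheres are ordinary spheres):

import numpy as np

def insphere(x, pts, F):
    # Determinant of the (d+2) x (d+2) matrix with columns (1, p, F(p)) for each support
    # point p, and (1, x, F(x)) as the last column.
    cols = [np.concatenate(([1.0], p, [F(p)])) for p in pts]
    cols.append(np.concatenate(([1.0], x, [F(x)])))
    return np.sign(np.linalg.det(np.column_stack(cols)))

F = lambda v: 0.5 * np.dot(v, v)
pts = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]  # unit circle
print(insphere(np.array([0.0, 0.0]), pts, F))  # -1.0: inside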
41. Computing f-divergences for a generic generator f: beyond stochastic Monte Carlo numerical integration
42. Ali-Silvey-Csiszár f-divergences [6]

I_f(X_1 : X_2) = ∫ x_1(x) f(x_2(x)/x_1(x)) dν(x) ≥ 0   (potentially +∞)

Name of the f-divergence   | Formula I_f(P : Q)                                        | Generator f(u), with f(1) = 0
Total variation (metric)   | (1/2) ∫ |p(x) − q(x)| dν(x)                               | (1/2) |u − 1|
Squared Hellinger          | ∫ (√p(x) − √q(x))² dν(x)                                  | (√u − 1)²
Pearson χ²_P               | ∫ (q(x) − p(x))² / p(x) dν(x)                             | (u − 1)²
Neyman χ²_N                | ∫ (p(x) − q(x))² / q(x) dν(x)                             | (1 − u)² / u
Pearson-Vajda χ^k_P        | ∫ (q(x) − λp(x))^k / p^{k−1}(x) dν(x)                     | (u − 1)^k
Pearson-Vajda |χ|^k_P      | ∫ |q(x) − λp(x)|^k / p^{k−1}(x) dν(x)                     | |u − 1|^k
Kullback-Leibler           | ∫ p(x) log(p(x)/q(x)) dν(x)                               | −log u
Reverse Kullback-Leibler   | ∫ q(x) log(q(x)/p(x)) dν(x)                               | u log u
α-divergence               | (4/(1 − α²)) (1 − ∫ p^{(1−α)/2}(x) q^{(1+α)/2}(x) dν(x))  | (4/(1 − α²)) (1 − u^{(1+α)/2})
Jensen-Shannon             | (1/2) ∫ (p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x)))) dν(x) | −(u + 1) log((1+u)/2) + u log u

Stochastic Monte Carlo estimation (never +∞!):

Î_f(p : q) = (1/n) Σ_{i=1}^n f(x_2(s_i)/x_1(s_i)),   s_1, ..., s_n ∼_iid X_1
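A minimal Monte Carlo estimator along these lines (illustrative SciPy example; KL(N(0,1) : N(1,1)) = 1/2):

import numpy as np
from scipy.stats import norm

def f_divergence_mc(f, sampler1, density1, density2, n=100000):
    # \hat I_f = (1/n) sum_i f(x2(s_i) / x1(s_i)), with s_i drawn iid from X1.
    s = sampler1(n)
    return np.mean(f(density2(s) / density1(s)))

p, q = norm(0, 1), norm(1, 1)
print(f_divergence_mc(lambda u: -np.log(u), p.rvs, p.pdf, q.pdf))  # approx 0.5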
43. Information monotonicity of f -divergences [6]
(Proof in the Ali-Silvey paper.)
Do coarse binning, from d bins to k < d bins: X = ⊎_{i=1}^k A_i.
Let p^A = (p^A_i)_i with p^A_i = Σ_{j∈A_i} p_j.
Information monotonicity:

D(p : q) ≥ D(p^A : q^A)

We can only discriminate less with downgraded (coarse-grained) histograms...
⇒ f-divergences are the only divergences preserving information monotonicity.
44. f-divergences and higher-order Vajda χ^k divergences [6]

I_f(X_1 : X_2) = Σ_{k=0}^∞ (f^{(k)}(1)/k!) χ^k_P(X_1 : X_2)

χ^k_P(X_1 : X_2) = ∫ (x_2(x) − x_1(x))^k / x_1(x)^{k−1} dν(x),
|χ|^k_P(X_1 : X_2) = ∫ |x_2(x) − x_1(x)|^k / x_1(x)^{k−1} dν(x),

are f-divergences for the generators (u − 1)^k and |u − 1|^k.
When k = 1, χ¹_P(X_1 : X_2) = ∫ (x_2(x) − x_1(x)) dν(x) = 0 (never discriminative), and |χ|¹_P(X_1, X_2) is twice the total variation distance.
χ^k_P is a signed distance.
45. Affine exponential families [6]
Canonical decomposition of the probability measure:

p_θ(x) = exp(⟨t(x), θ⟩ − F(θ) + k(x)),

and consider an affine natural parameter space Θ (like the multinomials).

Poi(λ): p(x|λ) = λ^x e^{−λ} / x!,   λ > 0, x ∈ {0, 1, ...}
Nor_I(µ): p(x|µ) = (2π)^{−d/2} e^{−(1/2)(x−µ)^⊤(x−µ)},   µ ∈ ℝ^d, x ∈ ℝ^d

Family        | θ     | Θ   | F(θ)       | k(x)                        | t(x) | ν
Poisson       | log λ | ℝ   | e^θ        | −log x!                     | x    | ν_c (counting measure)
Iso. Gaussian | µ     | ℝ^d | (1/2)θ^⊤θ  | −(d/2) log 2π − (1/2)x^⊤x   | x    | ν_L (Lebesgue measure)
46. Higher-order Vajda χ^k divergences [6]
The (signed) χ^k_P distance between members X_1 ∼ EF(θ_1) and X_2 ∼ EF(θ_2) of the same affine exponential family is (k ∈ ℕ) always bounded and equal to:

χ^k_P(X_1 : X_2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{F((1−j)θ_1 + jθ_2)} / e^{(1−j)F(θ_1) + jF(θ_2)}

For Poisson/Normal distributions, we get closed-form formulas:

χ^k_P(λ_1 : λ_2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{λ_1^{1−j} λ_2^j − ((1−j)λ_1 + jλ_2)},
χ^k_P(µ_1 : µ_2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{(1/2) j(j−1) (µ_1−µ_2)^⊤(µ_1−µ_2)}.
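For instance, the Poisson closed form can be evaluated directly (a sketch; χ¹_P always vanishes, and χ²_P agrees with the Pearson χ² between the two Poisson densities, e^{(λ_2−λ_1)²/λ_1} − 1):

import math

def chi_k_poisson(k, l1, l2):
    # chi^k_P(lambda_1 : lambda_2) = sum_j (-1)^{k-j} C(k, j) e^{l1^{1-j} l2^j - ((1-j) l1 + j l2)}
    return sum((-1) ** (k - j) * math.comb(k, j)
               * math.exp(l1 ** (1 - j) * l2 ** j - ((1 - j) * l1 + j * l2))
               for j in range(k + 1))

print(chi_k_poisson(1, 0.5, 0.3))  # 0 (never discriminative)
print(chi_k_poisson(2, 0.5, 0.3))  # exp((0.3 - 0.5)^2 / 0.5) - 1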
48. Bibliography I
David Arthur and Sergei Vassilvitskii.
k-means++: the advantages of careful seeding.
In Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages
1027–1035. Society for Industrial and Applied Mathematics, 2007.
Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock.
Bregman Voronoi diagrams.
Discrete and Computational Geometry, 44(2):281–307, April 2010.
Bent Fuglede and Flemming Topsoe.
Jensen-Shannon divergence and Hilbert space embedding.
In IEEE International Symposium on Information Theory, pages 31–31, 2004.
F. R. Hampel, P. J. Rousseeuw, E. Ronchetti, and W. A. Stahel.
Robust Statistics: The Approach Based on Influence Functions.
Wiley Series in Probability and Mathematical Statistics, 1986.
Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.
Shape retrieval using hierarchical total Bregman soft clustering.
Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.
F. Nielsen and R. Nock.
On the chi square and higher-order chi distances for approximating f -divergences.
IEEE Signal Processing Letters, 21(1):10–13, 2014.
Frank Nielsen.
k-MLE: A fast algorithm for learning statistical mixture models.
CoRR, 1203.5181, 2012.
49. Bibliography II
Frank Nielsen.
Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation
for frequency histograms.
IEEE Signal Processing Letters, 20(7):657–660, 2013.
Frank Nielsen.
On learning statistical mixtures maximizing the complete likelihood.
Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014),
1641:238–245, 2014.
Frank Nielsen and Sylvain Boltz.
The Burbea-Rao and Bhattacharyya centroids.
IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.
Frank Nielsen and Richard Nock.
On approximating the smallest enclosing Bregman balls.
In Proceedings of the Twenty-second Annual Symposium on Computational Geometry, SCG ’06, pages
485–486, New York, NY, USA, 2006. ACM.
Frank Nielsen and Richard Nock.
On the smallest enclosing information disk.
Information Processing Letters (IPL), 105(3):93–97, 2008.
Frank Nielsen and Richard Nock.
Sided and symmetrized Bregman centroids.
IEEE Transactions on Information Theory, 55(6):2882–2904, 2009.
Frank Nielsen and Richard Nock.
Further heuristics for k-means: The merge-and-split heuristic and the (k, l)-means.
arXiv preprint arXiv:1406.6314, 2014.
50. Bibliography III
Frank Nielsen and Richard Nock.
Total Jensen divergences: Definition, properties and clustering.
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2016–2020,
2015.
Frank Nielsen, Richard Nock, and Shun-ichi Amari.
On clustering histograms with k-means by using mixed α-divergences.
Entropy, 16(6):3273–3301, 2014.
Frank Nielsen, Paolo Piro, and Michel Barlaud.
Bregman vantage point trees for efficient nearest neighbor queries.
In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878–881,
2009.
Richard Nock, Panu Luosto, and Jyrki Kivinen.
Mixed Bregman clustering with approximation guarantees.
In Machine Learning and Knowledge Discovery in Databases, pages 154–169. Springer, 2008.
Richard Nock and Frank Nielsen.
Fitting the smallest enclosing Bregman balls.
In 16th European Conference on Machine Learning (ECML), pages 649–656, October 2005.
Atsumi Ohara, Hiroshi Matsuzoe, and Shun-ichi Amari.
A dually flat structure on the space of escort distributions.
Journal of Physics: Conference Series, 201(1):012012, 2010.
Christophe Saint-Jean and Frank Nielsen.
Hartigan's method for k-MLE: Mixture modeling with Wishart distributions and its application to motion
retrieval.
In Frank Nielsen, editor, Geometric Theory of Information, Signals and Communication Technology, pages
301–330. Springer International Publishing, 2014.
51. Bibliography IV
Olivier Schwander and Frank Nielsen.
Fast learning of gamma mixture models with k-MLE.
In Similarity-Based Pattern Recognition, pages 235–249. Springer, 2013.
Olivier Schwander, Aurelien J Schutz, Frank Nielsen, and Yannick Berthoumieu.
k-MLE for mixtures of generalized Gaussians.
In 21st International Conference on Pattern Recognition (ICPR), pages 2825–2828. IEEE, 2012.
Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen.
Total Bregman divergence and its applications to DTI analysis.
IEEE Transactions on Medical Imaging, pages 475–483, 2011.
Si Wu and Shun-ichi Amari.
Conformal transformation of kernel functions a data dependent way to improve support vector machine
classifiers.
Neural Processing Letters, 15(1):59–67, 2002.