Divergence-based center clustering and its applications
Frank Nielsen
joint work with Richard Nock et al.
École Polytechnique
Sony Computer Science Laboratories, Inc.
Talk at Geneva University (Dec. 2015)
Viper group, CS Dept.
c 2015 Frank Nielsen 1
Background: Center-based clustering [16]
Countless applications of clustering: quantization (coding), finding
categories (unsupervised-clustering), technique for speeding-up
computations (e.g., distances), and so on.
Minimize objective/energy/loss function:
E(X = {x1, ..., xn}; C = {c1, ..., ck}) = min_C Σ_{i=1}^n min_{j∈[k]} D(xi : cj)
NP-hard as soon as d > 1 and k > 1.
Dissimilarity D(· : ·): metric distance vs. smooth C2 divergence
(non-metric, e.g., the squared Euclidean distance)
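As an illustration (not part of the original slides), a minimal NumPy sketch of this objective for the squared Euclidean divergence; the names energy, X, C are placeholders.

import numpy as np

def energy(X, C):
    # E(X; C) = sum_i min_j D(x_i : c_j), with D the squared Euclidean divergence
    # X: (n, d) data points, C: (k, d) cluster centers
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k) pairwise divergences
    return d2.min(axis=1).sum()

# toy usage
X = np.random.rand(100, 2)
C = X[np.random.choice(100, 3, replace=False)]
print(energy(X, C))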
c 2015 Frank Nielsen 2
Background: Center-based clustering [16]
Initialize k cluster centers (seeds), also known as global
k-means heuristics: random (Forgy), global k-means
(discrete k-means), randomized k-means++ (expected
guarantee Õ(log k))
Celebrated local k-means heuristics: Lloyd’s batched
allocation (assignment/center relocation), Hartigan’s single
point reassignment.
⇒ Aim at guaranteeing monotone convergence
Continuous vs. discrete k-means, k-center, etc.
Variational k-means: When the centroid
arg min_c Σ_{i=1}^n wi D(xi : c) is not available in closed form, the center
relocation just needs to be better (not best) to still guarantee monotone
convergence
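A minimal sketch (not from the slides) of Lloyd's batched heuristic for the squared Euclidean case; the exact centroid is used here, but any merely better relocation would also preserve the monotone decrease, as in variational k-means. Names are illustrative.

import numpy as np

def lloyd(X, C, n_iter=50):
    # Lloyd's batched heuristic: alternate assignment and center relocation;
    # each step cannot increase the k-means energy.
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)            # assignment step
        for j in range(C.shape[0]):           # relocation step
            pts = X[labels == j]
            if len(pts):
                C[j] = pts.mean(axis=0)       # exact centroid (a "better" center would also do)
    return C, labels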
c 2015 Frank Nielsen 3
The trick of mixed
divergences [18, 16]:
Dual centroids per cluster
c 2015 Frank Nielsen 4
Mixed divergences [16]
Defined on three parameters p, q and r:
Mλ(p : q : r) := λ D(p : q) + (1 − λ) D(q : r)
for λ ∈ [0, 1].
Mixed divergences include:
the sided divergences for λ ∈ {0, 1},
the symmetrized (arithmetic mean) divergence for λ = 1/2, or
the skew symmetrized divergences for λ ∈ (0, 1), λ ≠ 1/2.
c 2015 Frank Nielsen 5
Symmetrizing α-divergences
Sα(p, q) = (1/2) (Dα(p : q) + Dα(q : p)) = S−α(p, q)
         = M_{1/2}(p : q : p),
For α = ±1, we get half of the Jeffreys divergence:
S±1(p, q) = (1/2) Σ_{i=1}^d (p^i − q^i) log(p^i / q^i)
Same formula for probability and positive measures.
Centroids for symmetrized α-divergence usually not in closed
form.
How to perform finite center-based clustering without closed
form centroids (beyond variational) ?
c 2015 Frank Nielsen 6
Closed-form formula for Jeffreys positive centroid [8]
Jeffreys divergence (symmetrized KL, SKL) = the symmetrized α-divergence for α = ±1.
The Jeffreys positive centroid c = (c^1, ..., c^d) of a set
{h1, ..., hn} of n weighted positive histograms with d bins can
be calculated component-wise exactly using the Lambert W
analytic function:
c^i = a^i / W(a^i e / g^i)
where a^i = Σ_{j=1}^n πj h^i_j denotes the coordinate-wise weighted arithmetic
mean and g^i = Π_{j=1}^n (h^i_j)^{πj} the coordinate-wise weighted
geometric mean.
The Lambert analytic function W (positive branch) is defined
by W (x)eW (x) = x for x ≥ 0.
⇒ Finite Jeffreys k-means clustering. But for α ≠ ±1,
how to cluster in a finite number of iterations?
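A sketch of the closed-form centroid above using scipy.special.lambertw, assuming the rows of H are positive histograms and the weights π sum to one; the helper name is hypothetical.

import numpy as np
from scipy.special import lambertw

def jeffreys_positive_centroid(H, pi):
    # H: (n, d) positive histograms, pi: (n,) weights summing to 1
    a = (pi[:, None] * H).sum(axis=0)                    # weighted arithmetic means a^i
    g = np.exp((pi[:, None] * np.log(H)).sum(axis=0))    # weighted geometric means g^i
    return a / lambertw(a * np.e / g).real               # c^i = a^i / W(a^i e / g^i)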
c 2015 Frank Nielsen 7
Mixed α-divergences/α-Jeffreys symmetrized divergence
Mixed α-divergence between a histogram x and two
histograms p and q:
Mλ,α(p : x : q) = λDα(p : x) + (1 − λ)Dα(x : q),
= λD−α(x : p) + (1 − λ)D−α(q : x),
= M1−λ,−α(q : x : p),
α-Jeffreys symmetrized divergence is obtained for λ = 1/2:
Sα(p, q) = M_{1/2,α}(q : p : q) = M_{1/2,α}(p : q : p)
skew symmetrized α-divergence is defined by:
Sλ,α(p : q) = λDα(p : q) + (1 − λ)Dα(q : p)
c 2015 Frank Nielsen 8
Mixed divergence-based k-means clustering
Initially, k distinct seeds from the dataset with li = ri .
Input: Weighted histogram set H, divergence D(·, ·), integer
k > 0, real λ ∈ [0, 1];
Initialize left-sided/right-sided seeds C = {(li , ri )}_{i=1}^k;
repeat
// Assignment (as usual)
for i = 1, 2, ..., k do
Ci ← {h ∈ H : i = arg minj Mλ(lj : h : rj )};
end
// Dual-sided centroid relocation (the trick!)
for i = 1, 2, ..., k do
ri ← arg min_x D(Ci : x) = Σ_{h∈Ci} wh D(h : x);
li ← arg min_x D(x : Ci) = Σ_{h∈Ci} wh D(x : h);
end
until convergence;
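A sketch of this loop for the extended Kullback-Leibler divergence, for which both sided centroids are in closed form (cluster-weighted arithmetic mean for the right-sided centroid, weighted geometric mean for the left-sided one); function names are placeholders.

import numpy as np

def ekl(p, q):
    # extended Kullback-Leibler divergence between positive arrays
    return (p * np.log(p / q) - p + q).sum()

def mixed_kl_kmeans(H, w, k, lam=0.5, n_iter=30, rng=np.random.default_rng(0)):
    idx = rng.choice(len(H), k, replace=False)
    L, R = H[idx].copy(), H[idx].copy()          # dual (left, right) centers per cluster
    for _ in range(n_iter):
        # assignment with the mixed divergence M_lam(l_j : h : r_j)
        M = np.array([[lam * ekl(L[j], h) + (1 - lam) * ekl(h, R[j]) for j in range(k)] for h in H])
        labels = M.argmin(axis=1)
        for j in range(k):                        # dual-sided centroid relocation (the trick)
            S, wj = H[labels == j], w[labels == j]
            if len(S) == 0:
                continue
            wj = wj / wj.sum()
            R[j] = (wj[:, None] * S).sum(axis=0)                  # right KL centroid: arithmetic mean
            L[j] = np.exp((wj[:, None] * np.log(S)).sum(axis=0))  # left KL centroid: geometric mean
    return L, R, labels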
c 2015 Frank Nielsen 9
Example: Mixed α-hard clustering: MAhC(H, k, λ, α)
Input: Weighted histogram set H, integer k > 0, real λ ∈ [0, 1],
real α ∈ R;
Let C = {(li , ri )}_{i=1}^k ← MAS(H, k, λ, α);
repeat
// Assignment
for i = 1, 2, ..., k do
Ai ← {h ∈ H : i = arg minj Mλ,α(lj : h : rj )};
end
// Centroid relocation
for i = 1, 2, ..., k do
ri ← ( Σ_{h∈Ai} wh h^{(1−α)/2} )^{2/(1−α)} ;
li ← ( Σ_{h∈Ai} wh h^{(1+α)/2} )^{2/(1+α)} ;
end
until convergence;
c 2015 Frank Nielsen 10
Coupled k-Means++ α-Seeding (extending k-means++)
Algorithm 1: Mixed α-seeding; MAS(H, k, λ, α)
Input: Weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1],
real α ∈ R;
Let C ← {(hj , hj )} with hj picked uniformly at random;
for i = 2, 3, ..., k do
Pick at random histogram h ∈ H with probability:
πH(h) := wh Mλ,α(lh : h : rh) / Σ_{y∈H} wy Mλ,α(ly : y : ry),   (1)
// where (lh, rh) := arg min_{(z,z')∈C} Mλ,α(z : h : z');
C ← C ∪ {(h, h)};
end
Output: Set of initial cluster centers C;
→ Guaranteed probabilistic bound. Just initialize! No
centroid computations are needed, since further iterations are not theoretically required.
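A sketch of the seeding above, with M a caller-supplied function computing Mλ,α(l : h : r); names are placeholders and the degenerate case of all-zero costs is not handled.

import numpy as np

def mixed_seeding(H, w, k, M, rng=np.random.default_rng(0)):
    # k-means++-style seeding for a mixed divergence; returns dual seed pairs.
    j = rng.integers(len(H))
    C = [(H[j], H[j])]                                   # first pair picked uniformly at random
    for _ in range(1, k):
        # weighted cost of each histogram to its closest seed pair
        cost = np.array([w[i] * min(M(l, h, r) for (l, r) in C) for i, h in enumerate(H)])
        j = rng.choice(len(H), p=cost / cost.sum())
        C.append((H[j], H[j]))
    return C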
c 2015 Frank Nielsen 11
Learning statistical
mixtures with hard EM
k-GMLE [7]: fast,
guaranteed finite
convergence, low memory
footprint... but not
consistent (biased)
c 2015 Frank Nielsen 12
Learning MMs: A geometric hard clustering viewpoint
Learn the parameters of a mixture m(x) = Σ_{i=1}^k wi p(x|θi)
Maximize the complete data likelihood = clustering objective function:
max_{W,Λ} lc(W, Λ) = Σ_{i=1}^n Σ_{j=1}^k z_{i,j} log(wj p(xi|θj))
                   = max_Λ Σ_{i=1}^n max_{j∈[k]} log(wj p(xi|θj))
                   ≡ min_{W,Λ} Σ_{i=1}^n min_{j∈[k]} Dj(xi),
where cj = (wj, θj) (cluster prototype) and
Dj(xi) = − log p(xi|θj) − log wj are potential distance-like
functions.
⇒ further attach to each cluster (mixture component) a different
family of probability distributions.
c 2015 Frank Nielsen 13
Generalized k-MLE: learning statistical EF
mixtures [7, 23, 22, 21, 9]
Model-based clustering: assignment of points to clusters:
D_{wj,θj,Fj}(x) = − log p_{Fj}(x; θj) − log wj
1. Initialize the weights W ∈ ∆k and a family type (F1, ..., Fk) for each
cluster
2. Solve min_Λ Σ_i min_j Dj(xi) (center-based clustering for W
fixed) with potential functions:
Dj(xi) = − log p_{Fj}(xi|θj) − log wj
3. Solve for the family types maximizing the MLE in each cluster Cj by
choosing the parametric family of distributions Fj = F(γj)
that yields the best likelihood:
min_{F1=F(γ1),...,Fk=F(γk)∈F(γ)} Σ_i min_j D_{wj,θj,Fj}(xi),
∀l, γl = max_j F*_j(η̂l = (1/nl) Σ_{x∈Cl} tj(x)) + (1/nl) Σ_{x∈Cl} k(x).
4. Update weight W as the cluster point proportion
5. Test for convergence and go to step 2) otherwise.
Drawback = biased, non-consistent estimator due to Voronoi
support truncation.
c 2015 Frank Nielsen 14
Conformal divergences and
clustering. (by analogy to
Riemannian tensor metric)
c 2015 Frank Nielsen 15
Geometrically designed divergences
Plot of the convex generator F: Bregman [13], Jensen
(Burbea-Rao [10]), total Bregman [5].
[Figure: graph of the convex generator F : (x, F(x)) with lifted points (p, F(p)), (q, F(q)) and the midpoint (p+q)/2, illustrating the Bregman divergence B(p : q), the Jensen divergence J(p, q), and the total Bregman divergence tB(p : q).]
c 2015 Frank Nielsen 16
Divergences: Distortion measures
F a smooth convex function, the generator.
Skew Jensen divergences:
Jα(p : q) = αF(p) + (1 − α)F(q) − F(αp + (1 − α)q),
= (F(p)F(q))α − F((pq)α),
where (pq)γ = γp + (1 − γ)q = q + γ(p − q) and
(F(p)F(q))γ = γF(p)+(1−γ)F(q) = F(q)+γ(F(p)−F(q)).
Bregman divergences = limit cases of (scaled) skew Jensen divergences:
B(p : q) = F(p) − F(q) − ⟨p − q, ∇F(q)⟩,
lim_{α→0} (1/α) Jα(p : q) = B(p : q),
lim_{α→1} (1/(1−α)) Jα(p : q) = B(q : p).
Statistical Bhattacharyya divergence = Jensen divergence for exponential
families [10]:
Bhat(p1 : p2) = − log ∫ p1(x)^α p2(x)^{1−α} dν(x) = Jα(θ1 : θ2)
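A small numeric illustration (not from the slides) of the scaled-limit relation between skew Jensen and Bregman divergences, using the Shannon negentropy generator; names are placeholders.

import numpy as np

def jensen_skew(F, p, q, alpha):
    # J_alpha(p : q) = alpha F(p) + (1 - alpha) F(q) - F(alpha p + (1 - alpha) q)
    return alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)

def bregman(F, gradF, p, q):
    # B(p : q) = F(p) - F(q) - <p - q, grad F(q)>
    return F(p) - F(q) - np.dot(p - q, gradF(q))

# Shannon negentropy generator: the Bregman divergence is the (extended) KL divergence
F = lambda x: np.sum(x * np.log(x))
gradF = lambda x: np.log(x) + 1.0
p, q = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
alpha = 1e-4
print(jensen_skew(F, p, q, alpha) / alpha, bregman(F, gradF, p, q))  # nearly equal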
c 2015 Frank Nielsen 17
Total Bregman divergences
Conformal divergence with conformal factor ρ:
D'(p : q) = ρ(p, q) D(p : q)
ρ plays the role of a "regularizer" [24] and ensures robustness
Invariance by rotation of the axes of the design space
tB(p : q) = B(p : q) / sqrt(1 + ⟨∇F(q), ∇F(q)⟩) = ρB(q) B(p : q),
ρB(q) = 1 / sqrt(1 + ⟨∇F(q), ∇F(q)⟩).
Total squared Euclidean divergence:
tE(p, q) = (1/2) ⟨p − q, p − q⟩ / sqrt(1 + ⟨q, q⟩)
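A sketch of the conformal total Bregman divergence and a check against the total squared Euclidean formula above; names are placeholders.

import numpy as np

def total_bregman(F, gradF, p, q):
    # tB(p : q) = B(p : q) / sqrt(1 + ||grad F(q)||^2)
    B = F(p) - F(q) - np.dot(p - q, gradF(q))
    return B / np.sqrt(1.0 + np.dot(gradF(q), gradF(q)))

# total squared Euclidean divergence: F(x) = <x, x>/2, grad F(x) = x
p, q = np.array([1.0, 2.0]), np.array([0.5, 0.0])
tE = 0.5 * np.dot(p - q, p - q) / np.sqrt(1.0 + np.dot(q, q))
print(total_bregman(lambda x: 0.5 * np.dot(x, x), lambda x: x, p, q), tE)  # identical values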
c 2015 Frank Nielsen 18
Total Jensen divergence: Illustration of the principle
[Figure: graph of F with points p, q, the interpolated point (pq)α and chord value (F(p)F(q))α, illustrating Jα(p : q), F((pq)α) and the total Jensen divergence tJα(p : q); the same construction is shown for rotated points p', q', for which tJα(p' : q') is unchanged.]
c 2015 Frank Nielsen 19
Total Jensen divergences [15]
tB(p : q) = ρB(q) B(p : q),   ρB(q) = 1 / sqrt(1 + ⟨∇F(q), ∇F(q)⟩)
tJα(p : q) = ρJ(p, q) Jα(p : q),   ρJ(p, q) = 1 / sqrt(1 + (F(p) − F(q))² / ⟨p − q, p − q⟩)
Jensen-Shannon divergence, square root is a metric [3]:
JS(p, q) = (1/2) Σ_{i=1}^d ( p^i log(2p^i / (p^i + q^i)) + q^i log(2q^i / (p^i + q^i)) )
Lemma
The square root of the total Jensen-Shannon divergence is not a
metric.
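A sketch of tJα from the conformal factor ρJ above (the generator F is passed as a callable; names are placeholders).

import numpy as np

def total_jensen(F, p, q, alpha):
    # tJ_alpha(p : q) = rho_J(p, q) * J_alpha(p : q), with
    # rho_J(p, q) = 1 / sqrt(1 + (F(p) - F(q))^2 / <p - q, p - q>)
    J = alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)
    rho = 1.0 / np.sqrt(1.0 + (F(p) - F(q)) ** 2 / np.dot(p - q, p - q))
    return rho * J

F = lambda x: np.sum(x * np.log(x))  # Shannon negentropy generator
print(total_jensen(F, np.array([0.2, 0.8]), np.array([0.6, 0.4]), 0.5))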
c 2015 Frank Nielsen 20
Total Jensen divergences/Total Bregman divergences
Total Jensen is not a generalization of total Bregman.
For the limit cases α ∈ {0, 1}, we have:
lim_{α→0} (1/α) tJα(p : q) = ρJ(p, q) B(p : q) ≠ ρB(q) B(p : q),
lim_{α→1} (1/(1−α)) tJα(p : q) = ρJ(p, q) B(q : p) ≠ ρB(p) B(q : p),
since in general the conformal factors differ: ρJ(p, q) ≠ ρB(q).
ρB(q) = 1 / sqrt(1 + ⟨∇F(q), ∇F(q)⟩),   tJα(p : q) = ρJ(p, q) Jα(p : q),
ρJ(p, q) = 1 / sqrt(1 + (F(p) − F(q))² / ⟨p − q, p − q⟩)
c 2015 Frank Nielsen 21
Conformal factor from mean value theorem
When p ≃ q, ρJ(p, q) ≃ ρB(q), and the total Jensen divergence
tends to the total Bregman divergence for any value of α.
ρJ(p, q) = 1 / sqrt(1 + ⟨∇F(ξ), ∇F(ξ)⟩) = ρB(ξ),
for some ξ ∈ [p, q].
For univariate generators, the value of ξ is explicit:
ξ = (F')^{−1}(∆F / ∆) = (F*)'(∆F / ∆),
where F* is the Legendre convex conjugate [10].
c 2015 Frank Nielsen 22
Centroids and statistical robustness
Centroids (barycenters) are minimizers of average (weighted)
divergences:
L(x; w) = Σ_{i=1}^n wi tJα(pi : x),
cα = arg min_{x∈X} L(x; w),
Is it unique?
Is it robust to outliers [4]?
Iterative convex-concave procedure (CCCP, gradient method
without learning rate) [10]
c 2015 Frank Nielsen 23
Clustering: No closed-form centroid, no cry!
k-means++ [1] picks seeds at random; no centroid calculation is required.
Algorithm 2: Total Jensen k-means++ seeding
Input: Number of clusters k ≥ 1;
Let C ← {hj } with uniform probability ;
for i = 2, 3, ..., k do
Pick at random h ∈ H with probability:
πH(h) = tJα(ch : h) / Σ_{y∈H} tJα(cy : y)
where ch = arg min_{z∈C} tJα(z : h);
C ← C ∪ {h};
end
Output: Set of initial cluster centers C;
c 2015 Frank Nielsen 24
Total Jensen divergences: Recap
Total Jensen divergence = conformal divergence with
non-separable double-sided conformal factor.
Invariant to axis rotation of the "design space"
Equivalent to total Bregman divergences [24, 5] only when
p ≃ q
The square root of the total Jensen-Shannon divergence is not a
metric, but the square root of the (plain) Jensen-Shannon divergence is a metric.
Total Jensen k-means++ does not require centroid
computations and comes with a guaranteed approximation bound
Interest of conformal divergences in SVM [25] (double-sided
separable), in information geometry [20] (flattening).
c 2015 Frank Nielsen 25
Novel heuristics for
NP-hard center-based
clustering: merge-and-split
and (k, l)-means [14]
c 2015 Frank Nielsen 26
The k-means merge-and-split heuristic
Generalize Hartigan’s single-point relocation heuristic...
Consider pairs of clusters (Ci , Cj ) with centers ci and cj,
merge them, and split them again into two clusters with new
centers c'i and c'j. Accept when the sum of the two cluster
variances decreases:
∆(Ci , Cj ) = V(Ci , ci ) + V(Cj , cj ) − (V(C'i , c'i ) + V(C'j , c'j ))
How to split the merged cluster again into two (optimal splitting is
NP-hard)?
a discrete 2-means: we choose among the ni,j = ni + nj points
of Ci,j the two best centers (naively implemented in O(n³)).
This yields a 2-approximation of 2-means.
a 2-means++ heuristic: we pick c'i at random, then pick c'j
randomly according to the normalized distribution of the
squared distances of the points in Ci,j to c'i (see k-means++).
We repeat this initialization a given number α of rounds (say,
α = 1 + 0.01 n²_{i,j}) and keep the best one (a small sketch of this test follows).
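A sketch of the merge-and-split acceptance test with the 2-means++ re-splitting, assuming squared Euclidean cluster variances; names and the number of rounds are illustrative.

import numpy as np

def variance(C, c):
    return ((C - c) ** 2).sum()

def try_merge_and_split(Ci, Cj, rounds=10, rng=np.random.default_rng(0)):
    # Merge two clusters, re-split them with repeated 2-means++-style picks,
    # and report whether the summed cluster variances decrease.
    merged = np.vstack([Ci, Cj])
    before = variance(Ci, Ci.mean(axis=0)) + variance(Cj, Cj.mean(axis=0))
    best = None
    for _ in range(rounds):
        a = merged[rng.integers(len(merged))]                   # first center uniformly at random
        d2a = ((merged - a) ** 2).sum(axis=1)
        b = merged[rng.choice(len(merged), p=d2a / d2a.sum())]  # second center with D^2 weighting
        labels = (d2a > ((merged - b) ** 2).sum(axis=1)).astype(int)
        parts = [merged[labels == 0], merged[labels == 1]]
        if min(len(P) for P in parts) == 0:
            continue
        after = sum(variance(P, P.mean(axis=0)) for P in parts)
        if best is None or after < best:
            best = after
    return (best is not None) and (before - best > 0), best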
c 2015 Frank Nielsen 27
The k-means merge-and-split heuristic
ops=number of pivot operations
Data set                   | Hartigan         | Discrete Hartigan | Merge&Split
                           | cost     #ops    | cost     #ops     | cost     #ops
Iris (d=4, n=150, k=3)     | 112.35   35.11   | 101.69   33.54    | 83.95    31.36
Wine (d=13, n=178, k=3)    | 607303   97.88   | 593319   100.02   | 570283   100.47
Yeast (d=8, n=1484, k=10)  | 47.10    1364.0  | 57.34    807.83   | 50.20    190.58

Data set                   | Hartigan++        | Discrete Hartigan++ | Merge&Split++
                           | cost     #ops     | cost     #ops       | cost     #ops
Iris (d=4, n=150, k=3)     | 101.49   19.40    | 90.48    18.93      | 88.56    8.84
Wine (d=13, n=178, k=3)    | 3152616  18.76    | 2525803  24.61      | 2498107  9.67
Yeast (d=8, n=1484, k=10)  | 47.41    1192.38  | 54.96    640.89     | 51.82    66.30
c 2015 Frank Nielsen 28
The (k, l)-means heuristic: navigating on the local minima!
Associate each pi with its l nearest cluster centers
NNl(pi ; K) (with iNNl = the cluster center indexes), and
minimize the (k, l)-means objective function (with 1 ≤ l ≤ k;
a small code sketch follows the two conversion schemes below):
e(P, K; l) = Σ_{i=1}^n Σ_{a ∈ iNNl(pi ; K)} ||pi − ca||²
Assignment/relocation guarantees a monotone decrease.
Higher l changes the local optima of the optimization landscape.
Conversion back to k-means:
(k, l) ↓-means: convert a (k, l)-means by assigning to each
point pi its closest center (among the l assigned at the end
of the (k, l)-means), then compute the centroids and
launch a regular Lloyd k-means to finalize.
(k, l)-means cascading conversion to k-means: after
convergence of the (k, l)-means, initialize a
(k, l − 1)-means by dropping for each point pi its farthest
cluster and perform a Lloyd (k, l − 1)-means, etc., until we get
a (k, 1)-means = k-means.
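A sketch of the (k, l)-means objective and one assignment/relocation pass for the squared Euclidean case; names are placeholders.

import numpy as np

def kl_means_energy(P, K, l):
    # e(P, K; l): each point contributes the squared distances to its l nearest centers
    d2 = ((P[:, None, :] - K[None, :, :]) ** 2).sum(axis=2)   # (n, k)
    return np.sort(d2, axis=1)[:, :l].sum()

def kl_means_step(P, K, l):
    # one assignment/relocation pass: a center is relocated to the mean of the
    # points that list it among their l nearest centers
    d2 = ((P[:, None, :] - K[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d2, axis=1)[:, :l]                       # l nearest center indexes per point
    for j in range(len(K)):
        members = P[(idx == j).any(axis=1)]
        if len(members):
            K[j] = members.mean(axis=0)
    return K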
c 2015 Frank Nielsen 29
The (k, l)-means heuristic: 10000 trials
Data-set: Iris
(k, l) ↓-means: convert a (k, l)-means by assigning to each point pi its closest neighbor (among the l
assigned at the end of the (k, l)-means), and then compute the centroids and launch a regular Lloyd’s
k-means to finalize.
(k, l)-means: cascading conversion of (k, l)-means to k-means: After convergence of (k, l)-means,
initialize a (k, l − 1) means by dropping for each point pi its farthest cluster and perform a Lloyd’s
(k, l − 1)-means, etc until we get a (k, 1)-means=k-means. .
k   | win   | k-means (min, avg) | (k, 2)↓-means (min, avg)
3   | 20.8  | 78.94, 92.39       | 78.94, 78.94
4   | 24.29 | 57.31, 63.15       | 57.31, 70.33
5   | 57.76 | 46.53, 52.88       | 49.74, 51.10
6   | 80.55 | 38.93, 45.60       | 38.93, 41.63
7   | 76.67 | 34.18, 40.00       | 34.29, 36.85
8   | 80.36 | 29.87, 36.05       | 29.87, 32.52
9   | 78.85 | 27.76, 32.91       | 27.91, 30.15
10  | 79.88 | 25.81, 30.24       | 25.97, 28.02

k   | l | win  | k-means (min, avg) | (k, l)-means (min, avg)
5   | 2 | 58.3 | 46.53, 52.72       | 49.74, 51.24
5   | 4 | 62.4 | 46.53, 52.55       | 49.74, 49.74
8   | 2 | 80.8 | 29.87, 36.40       | 29.87, 32.54
8   | 3 | 61.1 | 29.87, 36.19       | 32.76, 34.04
8   | 6 | 55.5 | 29.88, 36.189      | 32.75, 35.26
10  | 2 | 78.8 | 25.81, 30.61       | 25.97, 28.23
10  | 3 | 82.5 | 25.95, 30.23       | 26.47, 27.76
10  | 5 | 64.7 | 25.90, 30.32       | 26.99, 28.61
On average better cost, but better local minima found by normal
k-means...
c 2015 Frank Nielsen 30
Geometry: Space of
Bregman spheres,
potential function and
polarity
c 2015 Frank Nielsen 31
Space of Bregman spheres and Bregman balls [2]
Dual sided Bregman balls (bounding Bregman spheres):
Ball^r_F(c, r) = {x ∈ X | BF(x : c) ≤ r}
Ball^l_F(c, r) = {x ∈ X | BF(c : x) ≤ r}
Legendre duality:
Ball^l_F(c, r) = (∇F)^{−1}(Ball^r_{F*}(∇F(c), r))
Illustration for Itakura-Saito divergence, F(x) = − log x
c 2015 Frank Nielsen 32
Lifting/Polarity: Potential function graph F
c 2015 Frank Nielsen 33
Space of Bregman spheres: Lifting map [2]
F : x → x̂ = (x, F(x)), a hypersurface in R^{d+1} (graph of the potential function)
Hp: tangent hyperplane at p̂:
z = Hp(x) = ⟨x − p, ∇F(p)⟩ + F(p)
Bregman sphere σ → σ̂ with supporting hyperplane
Hσ : z = ⟨x − c, ∇F(c)⟩ + F(c) + r
(parallel to Hc and shifted vertically by r)
σ̂ = F ∩ Hσ.
The intersection of any hyperplane H with F projects onto X as a
Bregman sphere:
H : z = ⟨x, a⟩ + b → σ : Ball_F(c = (∇F)^{−1}(a), r = ⟨a, c⟩ − F(c) + b)
c 2015 Frank Nielsen 34
Space of Bregman spheres: Algorithmic applications
Vapnik-Chervonenkis dimension [2] (VC-dim) is d + 1 for the
class of Bregman balls (for Machine Learning).
Union/intersection of Bregman d-spheres from
representational (d + 1)-polytope [2]
The radical axis of two Bregman balls is a hyperplane:
Applications to Nearest Neighbor search trees like Bregman
ball trees or Bregman vantage point trees [17].
c 2015 Frank Nielsen 35
Bregman proximity data structures, k-NN queries
Vantage point trees [17]: partition space according to Bregman
balls
Partitioning space with intersections of Kullback-Leibler balls
→ efficient nearest neighbour queries in information spaces
c 2015 Frank Nielsen 36
Application: Minimum Enclosing Ball [12, 19]
To a hyperplane Hσ = H(a, b) : z = ⟨a, x⟩ + b in R^{d+1}
corresponds a ball σ = Ball(c, r) in R^d with center c = ∇F*(a)
and radius:
r = ⟨a, c⟩ − F(c) + b = ⟨a, ∇F*(a)⟩ − F(∇F*(a)) + b = F*(a) + b
since F(∇F*(a)) = ⟨∇F*(a), a⟩ − F*(a) (Young equality)
SEB: find the halfspace H(a, b)^− : z ≤ ⟨a, x⟩ + b that contains all
lifted points:
min_{a,b} r = F*(a) + b,
∀i ∈ {1, ..., n}, ⟨a, xi⟩ + b − F(xi) ≥ 0
→ Convex Program (CP) with linear inequality constraints.
For F(θ) = F*(η) = (1/2) x⊤x, the CP becomes a Quadratic Program (QP)
as used in SVMs. The smallest enclosing ball is used as a primitive in
SVMs.
c 2015 Frank Nielsen 37
Approximating the smallest Bregman enclosing
balls [19, 11]
Algorithm 3: BBCA(P, l).
c1 ← choose a point of P uniformly at random;
for i = 1 to l − 1 do
// farthest point from ci w.r.t. BF
si ← arg max_{j=1..n} BF(ci : pj);
// update the center: walk on the η-segment [ci , p_{si}]_η
c_{i+1} ← ∇F^{−1}( ∇F(ci) #_{1/(i+1)} ∇F(p_{si}) );
end
// Return the SEBB approximation
return Ball(cl , rl = BF(cl : P)) ;
θ-, η-geodesic segments in dually flat geometry.
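A sketch of the walk for the squared Euclidean generator, where ∇F is the identity and the #-interpolation on the η-segment reduces to an ordinary convex combination; for a general generator one would interpolate ∇F(ci) and ∇F(p_si) and map back with (∇F)^{-1}. Names are placeholders.

import numpy as np

def bbca_euclidean(P, l, rng=np.random.default_rng(0)):
    # Badoiu-Clarkson-style approximation of the smallest enclosing ball
    c = P[rng.integers(len(P))]
    for i in range(1, l):
        d2 = ((P - c) ** 2).sum(axis=1)
        s = d2.argmax()                      # farthest point from the current center
        c = c + (P[s] - c) / (i + 1)         # walk 1/(i+1) of the way toward it
    r = ((P - c) ** 2).sum(axis=1).max()     # divergence to the farthest point
    return c, r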
c 2015 Frank Nielsen 38
Smallest enclosing balls: Core-sets [19]
Core-set C ⊆ S: SOL(S) ≤ SOL(C) ≤ (1 + ε) SOL(S)
[Figure: smallest enclosing balls and core-sets for the extended Kullback-Leibler and Itakura-Saito divergences]
c 2015 Frank Nielsen 39
Programming InSphere predicates
Implicit representation of Bregman spheres/balls [2]: consider
d + 1 support points on the boundary
Is x inside the Bregman ball defined by d + 1 support points?
InSphere(x; p0, ..., pd) = sign of the (d + 2) × (d + 2) matrix determinant

| 1       ...   1       1    |
| p0      ...   pd      x    |
| F(p0)   ...   F(pd)   F(x) |
InSphere(x; p0, ..., pd ) is negative, null or positive depending
on whether x lies inside, on, or outside σ.
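A sketch of the predicate with NumPy (the generator F is passed as a callable; the sign convention depends on the orientation of the support points p0, ..., pd).

import numpy as np

def insphere(x, P, F):
    # Build the lifted-point matrix (rows: points, columns: 1, coordinates, F value);
    # its determinant equals that of the (d+2)x(d+2) matrix shown above (transpose).
    pts = np.vstack([P, x])                                  # (d+2, d)
    M = np.column_stack([np.ones(len(pts)), pts, np.array([F(p) for p in pts])])
    return np.sign(np.linalg.det(M))

# Example: squared Euclidean generator in the plane (ordinary circumcircle test)
F = lambda p: 0.5 * np.dot(p, p)
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(insphere(np.array([0.4, 0.4]), P, F), insphere(np.array([2.0, 2.0]), P, F))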
c 2015 Frank Nielsen 40
Computing f -divergences
for generic f :
Beyond stochastic
Monte-Carlo numerical
integration
c 2015 Frank Nielsen 41
Ali-Silvey-Csiszár f-divergences [6]
I_f(X1 : X2) = ∫ x1(x) f( x2(x) / x1(x) ) dν(x) ≥ 0 (potentially +∞)
Name of the f-divergence     | Formula I_f(P : Q)                                                              | Generator f(u) with f(1) = 0
Total variation (metric)     | (1/2) ∫ |p(x) − q(x)| dν(x)                                                     | (1/2) |u − 1|
Squared Hellinger            | ∫ (√p(x) − √q(x))² dν(x)                                                        | (√u − 1)²
Pearson χ²_P                 | ∫ (q(x) − p(x))² / p(x) dν(x)                                                   | (u − 1)²
Neyman χ²_N                  | ∫ (p(x) − q(x))² / q(x) dν(x)                                                   | (1 − u)² / u
Pearson-Vajda χ^k_P          | ∫ (q(x) − λp(x))^k / p^{k−1}(x) dν(x)                                           | (u − 1)^k
Pearson-Vajda |χ|^k_P        | ∫ |q(x) − λp(x)|^k / p^{k−1}(x) dν(x)                                           | |u − 1|^k
Kullback-Leibler             | ∫ p(x) log(p(x)/q(x)) dν(x)                                                     | − log u
reverse Kullback-Leibler     | ∫ q(x) log(q(x)/p(x)) dν(x)                                                     | u log u
α-divergence                 | (4/(1 − α²)) (1 − ∫ p^{(1−α)/2}(x) q^{(1+α)/2}(x) dν(x))                        | (4/(1 − α²)) (1 − u^{(1+α)/2})
Jensen-Shannon               | (1/2) ∫ ( p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x))) ) dν(x)     | −(u + 1) log((1+u)/2) + u log u
I_f(p : q) ≈ (1/n) Σ_i f( x2(si) / x1(si) ),  with s1, ..., sn ∼ i.i.d. X1 (never +∞!)
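A sketch of this Monte Carlo estimator, checked on the Kullback-Leibler divergence between two unit-variance Gaussians (exact value (µ1 − µ2)²/2 = 0.5); names are placeholders.

import numpy as np
from scipy.stats import norm

def mc_f_divergence(f, x1_pdf, x2_pdf, x1_sampler, n=100000, rng=np.random.default_rng(0)):
    # I_f(X1 : X2) ~ (1/n) sum_i f(x2(s_i) / x1(s_i)) with s_1..s_n drawn i.i.d. from X1
    s = x1_sampler(n, rng)
    return np.mean(f(x2_pdf(s) / x1_pdf(s)))

# KL divergence: generator f(u) = -log u, X1 = N(0,1), X2 = N(1,1)
est = mc_f_divergence(lambda u: -np.log(u), norm(0, 1).pdf, norm(1, 1).pdf,
                      lambda n, rng: rng.normal(0, 1, n))
print(est)   # close to 0.5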
c 2015 Frank Nielsen 42
Information monotonicity of f -divergences [6]
(Proof in Ali-Silvey paper)
Do coarse binning: from d bins to k < d bins:
X = ∪_{i=1}^k Ai
Let p^A = (p^A_i)_i with p^A_i = Σ_{j∈Ai} pj.
Information monotonicity:
D(p : q) ≥ D(p^A : q^A)
We should be less able to distinguish downgraded (coarse-grained) histograms...
⇒ f -divergences are the only divergences preserving the
information monotonicity.
c 2015 Frank Nielsen 43
f-divergences and higher-order Vajda χ^k divergences [6]
I_f(X1 : X2) = Σ_{k=0}^∞ (f^{(k)}(1) / k!) χ^k_P(X1 : X2)
χ^k_P(X1 : X2) = ∫ (x2(x) − x1(x))^k / x1(x)^{k−1} dν(x),
|χ|^k_P(X1 : X2) = ∫ |x2(x) − x1(x)|^k / x1(x)^{k−1} dν(x),
are f-divergences for the generators (u − 1)^k and |u − 1|^k.
When k = 1, χ^1_P(X1 : X2) = ∫ (x1(x) − x2(x)) dν(x) = 0
(never discriminative), and |χ|^1_P(X1, X2) is twice the total
variation distance.
χ^k_P is a signed distance
c 2015 Frank Nielsen 44
Affine exponential families [6]
Canonical decomposition of the probability measure:
pθ(x) = exp(⟨t(x), θ⟩ − F(θ) + k(x)),
consider an affine natural parameter space Θ (like multinomials).
Poi(λ): p(x|λ) = λ^x e^{−λ} / x!,  λ > 0, x ∈ {0, 1, ...}
Nor_I(µ): p(x|µ) = (2π)^{−d/2} e^{−(1/2)(x−µ)⊤(x−µ)},  µ ∈ R^d, x ∈ R^d

Family         | θ      | Θ    | F(θ)        | k(x)                          | t(x) | ν
Poisson        | log λ  | R    | e^θ         | − log x!                      | x    | νc
Iso. Gaussian  | µ      | R^d  | (1/2) θ⊤θ   | −(d/2) log 2π − (1/2) x⊤x     | x    | νL
c 2015 Frank Nielsen 45
Higher-order Vajda χ^k divergences [6]
The (signed) χ^k_P distance between members X1 ∼ EF(θ1) and
X2 ∼ EF(θ2) of the same affine exponential family is (k ∈ N)
always bounded and equal to:
χ^k_P(X1 : X2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{F((1−j)θ1 + jθ2)} / e^{(1−j)F(θ1) + jF(θ2)}
For Poisson/Normal distributions, we get closed-form formulas:
χ^k_P(λ1 : λ2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{λ1^{1−j} λ2^j − ((1−j)λ1 + jλ2)},
χ^k_P(µ1 : µ2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{(1/2) j(j−1) (µ1−µ2)⊤(µ1−µ2)}
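A quick numeric evaluation of the Poisson closed form above; the helper name chi_k_poisson is hypothetical.

from math import comb, exp

def chi_k_poisson(k, lam1, lam2):
    # Closed-form (signed) chi^k_P distance between Poisson(lam1) and Poisson(lam2)
    return sum((-1) ** (k - j) * comb(k, j)
               * exp(lam1 ** (1 - j) * lam2 ** j - ((1 - j) * lam1 + j * lam2))
               for j in range(k + 1))

print(chi_k_poisson(2, 0.6, 0.3))   # Pearson chi^2 between Poisson(0.6) and Poisson(0.3)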
c 2015 Frank Nielsen 46
Thank you!
c 2015 Frank Nielsen 47
Bibliography I
David Arthur and Sergei Vassilvitskii.
k-means++: the advantages of careful seeding.
In Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages
1027–1035. Society for Industrial and Applied Mathematics, 2007.
Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock.
Bregman Voronoi diagrams.
Discrete and Computational Geometry, 44(2):281–307, April 2010.
Bent Fuglede and Flemming Topsoe.
Jensen-Shannon divergence and Hilbert space embedding.
In IEEE International Symposium on Information Theory, pages 31–31, 2004.
F. R. Hampel, P. J. Rousseeuw, E. Ronchetti, and W. A. Stahel.
Robust Statistics: The Approach Based on Influence Functions.
Wiley Series in Probability and Mathematical Statistics, 1986.
Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.
Shape retrieval using hierarchical total Bregman soft clustering.
Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.
F. Nielsen and R. Nock.
On the chi square and higher-order chi distances for approximating f -divergences.
Signal Processing Letters, IEEE, 21(1):10–13, 2014.
Frank Nielsen.
k-MLE: A fast algorithm for learning statistical mixture models.
CoRR, 1203.5181, 2012.
c 2015 Frank Nielsen 48
Bibliography II
Frank Nielsen.
Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation
for frequency histograms.
Signal Processing Letters, IEEE, 20(7):657–660, 2013.
Frank Nielsen.
On learning statistical mixtures maximizing the complete likelihood.
Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014),
1641:238–245, 2014.
Frank Nielsen and Sylvain Boltz.
The Burbea-Rao and Bhattacharyya centroids.
IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.
Frank Nielsen and Richard Nock.
On approximating the smallest enclosing Bregman balls.
In Proceedings of the Twenty-second Annual Symposium on Computational Geometry, SCG ’06, pages
485–486, New York, NY, USA, 2006. ACM.
Frank Nielsen and Richard Nock.
On the smallest enclosing information disk.
Information Processing Letters (IPL), 105(3):93–97, 2008.
Frank Nielsen and Richard Nock.
Sided and symmetrized Bregman centroids.
Information Theory, IEEE Transactions on, 55(6):2882–2904, 2009.
Frank Nielsen and Richard Nock.
Further heuristics for k-means: The merge-and-split heuristic and the (k, l)-means.
arXiv preprint arXiv:1406.6314, 2014.
c 2015 Frank Nielsen 49
Bibliography III
Frank Nielsen and Richard Nock.
Total Jensen divergences: Definition, properties and clustering.
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2016–2020,
2015.
Frank Nielsen, Richard Nock, and Shun-ichi Amari.
On clustering histograms with k-means by using mixed α-divergences.
Entropy, 16(6):3273–3301, 2014.
Frank Nielsen, Paolo Piro, and Michel Barlaud.
Bregman vantage point trees for efficient nearest neighbor queries.
In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878–881,
2009.
Richard Nock, Panu Luosto, and Jyrki Kivinen.
Mixed Bregman clustering with approximation guarantees.
In Machine Learning and Knowledge Discovery in Databases, pages 154–169. Springer, 2008.
Richard Nock and Frank Nielsen.
Fitting the smallest enclosing Bregman balls.
In 16th European Conference on Machine Learning (ECML), pages 649–656, October 2005.
Atsumi Ohara, Hiroshi Matsuzoe, and Shun-ichi Amari.
A dually flat structure on the space of escort distributions.
Journal of Physics: Conference Series, 201(1):012012, 2010.
Christophe Saint-Jean and Frank Nielsen.
Hartigan's method for k-MLE: Mixture modeling with Wishart distributions and its application to motion
retrieval.
In Frank Nielsen, editor, Geometric Theory of Information, Signals and Communication Technology, pages
301–330. Springer International Publishing, 2014.
c 2015 Frank Nielsen 50
Bibliography IV
Olivier Schwander and Frank Nielsen.
Fast learning of gamma mixture models with k-MLE.
In Similarity-Based Pattern Recognition, pages 235–249. Springer, 2013.
Olivier Schwander, Aurelien J Schutz, Frank Nielsen, and Yannick Berthoumieu.
k-MLE for mixtures of generalized Gaussians.
In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 2825–2828. IEEE, 2012.
Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen.
Total Bregman divergence and its applications to DTI analysis.
IEEE Transactions on Medical Imaging, pages 475–483, 2011.
Si Wu and Shun-ichi Amari.
Conformal transformation of kernel functions: a data-dependent way to improve support vector machine
classifiers.
Neural Processing Letters, 15(1):59–67, 2002.
c 2015 Frank Nielsen 51

More Related Content

What's hot

Clustering in Hilbert simplex geometry
Clustering in Hilbert simplex geometryClustering in Hilbert simplex geometry
Clustering in Hilbert simplex geometryFrank Nielsen
 
Tailored Bregman Ball Trees for Effective Nearest Neighbors
Tailored Bregman Ball Trees for Effective Nearest NeighborsTailored Bregman Ball Trees for Effective Nearest Neighbors
Tailored Bregman Ball Trees for Effective Nearest NeighborsFrank Nielsen
 
Clustering in Hilbert geometry for machine learning
Clustering in Hilbert geometry for machine learningClustering in Hilbert geometry for machine learning
Clustering in Hilbert geometry for machine learningFrank Nielsen
 
On learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodOn learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodFrank Nielsen
 
Density theorems for Euclidean point configurations
Density theorems for Euclidean point configurationsDensity theorems for Euclidean point configurations
Density theorems for Euclidean point configurationsVjekoslavKovac1
 
A T(1)-type theorem for entangled multilinear Calderon-Zygmund operators
A T(1)-type theorem for entangled multilinear Calderon-Zygmund operatorsA T(1)-type theorem for entangled multilinear Calderon-Zygmund operators
A T(1)-type theorem for entangled multilinear Calderon-Zygmund operatorsVjekoslavKovac1
 
Density theorems for anisotropic point configurations
Density theorems for anisotropic point configurationsDensity theorems for anisotropic point configurations
Density theorems for anisotropic point configurationsVjekoslavKovac1
 
ABC convergence under well- and mis-specified models
ABC convergence under well- and mis-specified modelsABC convergence under well- and mis-specified models
ABC convergence under well- and mis-specified modelsChristian Robert
 
Estimates for a class of non-standard bilinear multipliers
Estimates for a class of non-standard bilinear multipliersEstimates for a class of non-standard bilinear multipliers
Estimates for a class of non-standard bilinear multipliersVjekoslavKovac1
 
Mesh Processing Course : Active Contours
Mesh Processing Course : Active ContoursMesh Processing Course : Active Contours
Mesh Processing Course : Active ContoursGabriel Peyré
 

What's hot (20)

Clustering in Hilbert simplex geometry
Clustering in Hilbert simplex geometryClustering in Hilbert simplex geometry
Clustering in Hilbert simplex geometry
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Appli...
 Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Appli... Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Appli...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Appli...
 
Tailored Bregman Ball Trees for Effective Nearest Neighbors
Tailored Bregman Ball Trees for Effective Nearest NeighborsTailored Bregman Ball Trees for Effective Nearest Neighbors
Tailored Bregman Ball Trees for Effective Nearest Neighbors
 
Clustering in Hilbert geometry for machine learning
Clustering in Hilbert geometry for machine learningClustering in Hilbert geometry for machine learning
Clustering in Hilbert geometry for machine learning
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
On learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodOn learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihood
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Density theorems for Euclidean point configurations
Density theorems for Euclidean point configurationsDensity theorems for Euclidean point configurations
Density theorems for Euclidean point configurations
 
A T(1)-type theorem for entangled multilinear Calderon-Zygmund operators
A T(1)-type theorem for entangled multilinear Calderon-Zygmund operatorsA T(1)-type theorem for entangled multilinear Calderon-Zygmund operators
A T(1)-type theorem for entangled multilinear Calderon-Zygmund operators
 
Density theorems for anisotropic point configurations
Density theorems for anisotropic point configurationsDensity theorems for anisotropic point configurations
Density theorems for anisotropic point configurations
 
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
ABC convergence under well- and mis-specified models
ABC convergence under well- and mis-specified modelsABC convergence under well- and mis-specified models
ABC convergence under well- and mis-specified models
 
Estimates for a class of non-standard bilinear multipliers
Estimates for a class of non-standard bilinear multipliersEstimates for a class of non-standard bilinear multipliers
Estimates for a class of non-standard bilinear multipliers
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Mesh Processing Course : Active Contours
Mesh Processing Course : Active ContoursMesh Processing Course : Active Contours
Mesh Processing Course : Active Contours
 

Similar to Divergence clustering

On Clustering Histograms with k-Means by Using Mixed α-Divergences
 On Clustering Histograms with k-Means by Using Mixed α-Divergences On Clustering Histograms with k-Means by Using Mixed α-Divergences
On Clustering Histograms with k-Means by Using Mixed α-DivergencesFrank Nielsen
 
Optimal interval clustering: Application to Bregman clustering and statistica...
Optimal interval clustering: Application to Bregman clustering and statistica...Optimal interval clustering: Application to Bregman clustering and statistica...
Optimal interval clustering: Application to Bregman clustering and statistica...Frank Nielsen
 
Slides: Total Jensen divergences: Definition, Properties and k-Means++ Cluste...
Slides: Total Jensen divergences: Definition, Properties and k-Means++ Cluste...Slides: Total Jensen divergences: Definition, Properties and k-Means++ Cluste...
Slides: Total Jensen divergences: Definition, Properties and k-Means++ Cluste...Frank Nielsen
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingFrank Nielsen
 
Automatic Bayesian method for Numerical Integration
Automatic Bayesian method for Numerical Integration Automatic Bayesian method for Numerical Integration
Automatic Bayesian method for Numerical Integration Jagadeeswaran Rathinavel
 
Slides: Jeffreys centroids for a set of weighted histograms
Slides: Jeffreys centroids for a set of weighted histogramsSlides: Jeffreys centroids for a set of weighted histograms
Slides: Jeffreys centroids for a set of weighted histogramsFrank Nielsen
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking componentsChristian Robert
 
Convergence of ABC methods
Convergence of ABC methodsConvergence of ABC methods
Convergence of ABC methodsChristian Robert
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixturesChristian Robert
 
k-MLE: A fast algorithm for learning statistical mixture models
k-MLE: A fast algorithm for learning statistical mixture modelsk-MLE: A fast algorithm for learning statistical mixture models
k-MLE: A fast algorithm for learning statistical mixture modelsFrank Nielsen
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applicationsFrank Nielsen
 
Zeros of orthogonal polynomials generated by a Geronimus perturbation of meas...
Zeros of orthogonal polynomials generated by a Geronimus perturbation of meas...Zeros of orthogonal polynomials generated by a Geronimus perturbation of meas...
Zeros of orthogonal polynomials generated by a Geronimus perturbation of meas...Edmundo José Huertas Cejudo
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Christian Robert
 
On approximating the Riemannian 1-center
On approximating the Riemannian 1-centerOn approximating the Riemannian 1-center
On approximating the Riemannian 1-centerFrank Nielsen
 

Similar to Divergence clustering (20)

On Clustering Histograms with k-Means by Using Mixed α-Divergences
 On Clustering Histograms with k-Means by Using Mixed α-Divergences On Clustering Histograms with k-Means by Using Mixed α-Divergences
On Clustering Histograms with k-Means by Using Mixed α-Divergences
 
Optimal interval clustering: Application to Bregman clustering and statistica...
Optimal interval clustering: Application to Bregman clustering and statistica...Optimal interval clustering: Application to Bregman clustering and statistica...
Optimal interval clustering: Application to Bregman clustering and statistica...
 
Slides: Total Jensen divergences: Definition, Properties and k-Means++ Cluste...
Slides: Total Jensen divergences: Definition, Properties and k-Means++ Cluste...Slides: Total Jensen divergences: Definition, Properties and k-Means++ Cluste...
Slides: Total Jensen divergences: Definition, Properties and k-Means++ Cluste...
 
QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...
QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...
QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processing
 
Automatic Bayesian method for Numerical Integration
Automatic Bayesian method for Numerical Integration Automatic Bayesian method for Numerical Integration
Automatic Bayesian method for Numerical Integration
 
cswiercz-general-presentation
cswiercz-general-presentationcswiercz-general-presentation
cswiercz-general-presentation
 
Slides: Jeffreys centroids for a set of weighted histograms
Slides: Jeffreys centroids for a set of weighted histogramsSlides: Jeffreys centroids for a set of weighted histograms
Slides: Jeffreys centroids for a set of weighted histograms
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
Convergence of ABC methods
Convergence of ABC methodsConvergence of ABC methods
Convergence of ABC methods
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixtures
 
Igv2008
Igv2008Igv2008
Igv2008
 
k-MLE: A fast algorithm for learning statistical mixture models
k-MLE: A fast algorithm for learning statistical mixture modelsk-MLE: A fast algorithm for learning statistical mixture models
k-MLE: A fast algorithm for learning statistical mixture models
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applications
 
Zeros of orthogonal polynomials generated by a Geronimus perturbation of meas...
Zeros of orthogonal polynomials generated by a Geronimus perturbation of meas...Zeros of orthogonal polynomials generated by a Geronimus perturbation of meas...
Zeros of orthogonal polynomials generated by a Geronimus perturbation of meas...
 
Vancouver18
Vancouver18Vancouver18
Vancouver18
 
2018 MUMS Fall Course - Statistical Representation of Model Input (EDITED) - ...
2018 MUMS Fall Course - Statistical Representation of Model Input (EDITED) - ...2018 MUMS Fall Course - Statistical Representation of Model Input (EDITED) - ...
2018 MUMS Fall Course - Statistical Representation of Model Input (EDITED) - ...
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
 
On approximating the Riemannian 1-center
On approximating the Riemannian 1-centerOn approximating the Riemannian 1-center
On approximating the Riemannian 1-center
 

Recently uploaded

Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
 
Microbial Type Culture Collection (MTCC)
Microbial Type Culture Collection (MTCC)Microbial Type Culture Collection (MTCC)
Microbial Type Culture Collection (MTCC)abhishekdhamu51
 
THYROID-PARATHYROID medical surgical nursing
THYROID-PARATHYROID medical surgical nursingTHYROID-PARATHYROID medical surgical nursing
THYROID-PARATHYROID medical surgical nursingJocelyn Atis
 
National Biodiversity protection initiatives and Convention on Biological Di...
National Biodiversity protection initiatives and  Convention on Biological Di...National Biodiversity protection initiatives and  Convention on Biological Di...
National Biodiversity protection initiatives and Convention on Biological Di...PABOLU TEJASREE
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxmuralinath2
 
Shuaib Y-basedComprehensive mahmudj.pptx
Shuaib Y-basedComprehensive mahmudj.pptxShuaib Y-basedComprehensive mahmudj.pptx
Shuaib Y-basedComprehensive mahmudj.pptxMdAbuRayhan16
 
Seminar on Halal AGriculture and Fisheries.pptx
Seminar on Halal AGriculture and Fisheries.pptxSeminar on Halal AGriculture and Fisheries.pptx
Seminar on Halal AGriculture and Fisheries.pptxRUDYLUMAPINET2
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationanitaento25
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rockskumarmathi863
 
Transport in plants G1.pptx Cambridge IGCSE
Transport in plants G1.pptx Cambridge IGCSETransport in plants G1.pptx Cambridge IGCSE
Transport in plants G1.pptx Cambridge IGCSEjordanparish425
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsYOGESH DOGRA
 
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...Subhajit Sahu
 
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...Sérgio Sacani
 
The importance of continents, oceans and plate tectonics for the evolution of...
The importance of continents, oceans and plate tectonics for the evolution of...The importance of continents, oceans and plate tectonics for the evolution of...
The importance of continents, oceans and plate tectonics for the evolution of...Sérgio Sacani
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSELF-EXPLANATORY
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...Health Advances
 
Detectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a TechnosignatureDetectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a TechnosignatureSérgio Sacani
 
electrochemical gas sensors and their uses.pptx
electrochemical gas sensors and their uses.pptxelectrochemical gas sensors and their uses.pptx
electrochemical gas sensors and their uses.pptxHusna Zaheer
 
Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...
Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...
Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...Sérgio Sacani
 
Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...Sérgio Sacani
 

Recently uploaded (20)

Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
 
Microbial Type Culture Collection (MTCC)
Microbial Type Culture Collection (MTCC)Microbial Type Culture Collection (MTCC)
Microbial Type Culture Collection (MTCC)
 
THYROID-PARATHYROID medical surgical nursing
THYROID-PARATHYROID medical surgical nursingTHYROID-PARATHYROID medical surgical nursing
THYROID-PARATHYROID medical surgical nursing
 
National Biodiversity protection initiatives and Convention on Biological Di...
National Biodiversity protection initiatives and  Convention on Biological Di...National Biodiversity protection initiatives and  Convention on Biological Di...
National Biodiversity protection initiatives and Convention on Biological Di...
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
Shuaib Y-basedComprehensive mahmudj.pptx
Shuaib Y-basedComprehensive mahmudj.pptxShuaib Y-basedComprehensive mahmudj.pptx
Shuaib Y-basedComprehensive mahmudj.pptx
 
Seminar on Halal AGriculture and Fisheries.pptx
Seminar on Halal AGriculture and Fisheries.pptxSeminar on Halal AGriculture and Fisheries.pptx
Seminar on Halal AGriculture and Fisheries.pptx
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
 
Transport in plants G1.pptx Cambridge IGCSE
Transport in plants G1.pptx Cambridge IGCSETransport in plants G1.pptx Cambridge IGCSE
Transport in plants G1.pptx Cambridge IGCSE
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
 
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
 
The importance of continents, oceans and plate tectonics for the evolution of...
The importance of continents, oceans and plate tectonics for the evolution of...The importance of continents, oceans and plate tectonics for the evolution of...
The importance of continents, oceans and plate tectonics for the evolution of...
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
 
Detectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a TechnosignatureDetectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a Technosignature
 
electrochemical gas sensors and their uses.pptx
electrochemical gas sensors and their uses.pptxelectrochemical gas sensors and their uses.pptx
electrochemical gas sensors and their uses.pptx
 
Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...
Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...
Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...
 
Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...
 

Divergence clustering

  • 1. Divergence-based center clustering and their applications Frank Nielsen joint work with Richard Nock et al. ´Ecole Polytechnique Sony Computer Science Laboratories, Inc Talk at Geneva University (Dec. 2015) Viper group, CS Dept. c 2015 Frank Nielsen 1
  • 2. Background: Center-based clustering [16] Countless applications of clustering: quantization (coding), finding categories (unsupervised-clustering), technique for speeding-up computations (e.g., distances), and so on. Minimize objective/energy/loss function: E(X = {x1, ..., xk}; C = {c1, ..., ck}) = min C n i=1 min j∈[k] D(xi : cj ) NP-hard as soon as d, n > 1. Dissimilarity D(· : ·): Metric distance vs. C2 divergence (non-metric, squared Euclidean distance) c 2015 Frank Nielsen 2
  • 3. Background: Center-based clustering [16] Initialize k cluster centers (seeds) as known as global k-means heuristics: random (Forgy), global k-means (discrete k-means), randomized k-means++ (expected guarantee ˜O(log k)) Celebrated local k-means heuristics: Lloyd’s batched allocation (assignment/center relocation), Hartigan’s single point reassignment. ⇒ Aim at guaranteeing monotone convergence Continuous vs. discrete k-means, k-center, etc. Variational k-means: When centroids arg min n i=1 wi D(xi : c) not in closed form, center relocation just need to be better (not best) to still guarantee monotone convergence c 2015 Frank Nielsen 3
  • 4. The trick of mixed divergences [18, 16]: Dual centroids per cluster c 2015 Frank Nielsen 4
  • 5. Mixed divergences [16] Defined on three parameters p, q and r: Mλ(p : q : r) eq = λD(p : q) + (1 − λ)D(q : r) for λ ∈ [0, 1]. Mixed divergences include: the sided divergences for λ ∈ {0, 1}, the symmetrized (arithmetic mean) divergence for λ = 1 2, or skew symmetrized for λ ∈ (0, 1), λ = 1 2. c 2015 Frank Nielsen 5
  • 6. Symmetrizing α-divergences Sα(p, q) = 1 2 (Dα(p : q) + Dα(q : p)) = S−α(p, q), = M1 2 (p : q : p), For α = ±1, we get half of Jeffreys divergence: S±1(p, q) = 1 2 d i=1 (pi − qi ) log pi qi Same formula for probability and positive measures. Centroids for symmetrized α-divergence usually not in closed form. How to perform finite center-based clustering without closed form centroids (beyond variational) ? c 2015 Frank Nielsen 6
  • 7. Closed-form formula for Jeffreys positive centroid [8] Jeffreys divergence (SKL) = symmetrized α = ±1 divergences. The Jeffreys positive centroid c = (c1, ..., cd ) of a set {h1, ..., hn} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function: ci = ai W ai gi e where ai = n j=1 πj hi j denotes the coordinate-wise arithmetic weighted means and gi = n j=1(hi j )πj the coordinate-wise geometric weighted means. The Lambert analytic function W (positive branch) is defined by W (x)eW (x) = x for x ≥ 0. Finite Jeffreys k-means clustering . But for α = 1, how to cluster in finite number of iterations? c 2015 Frank Nielsen 7
  • 8. Mixed α-divergences/α-Jeffreys symmetrized divergence Mixed α-divergence between a histogram x to two histograms p and q: Mλ,α(p : x : q) = λDα(p : x) + (1 − λ)Dα(x : q), = λD−α(x : p) + (1 − λ)D−α(q : x), = M1−λ,−α(q : x : p), α-Jeffreys symmetrized divergence is obtained for λ = 1 2: Sα(p, q) = M1 2 ,α(q : p : q) = M1 2 ,α(p : q : p) skew symmetrized α-divergence is defined by: Sλ,α(p : q) = λDα(p : q) + (1 − λ)Dα(q : p) c 2015 Frank Nielsen 8
  • 9. Mixed divergence-based k-means clustering Initially, k distinct seeds from the dataset with li = ri . Input: Weighted histogram set H, divergence D(·, ·), integer k > 0, real λ ∈ [0, 1]; Initialize left-sided/right-sided seeds C = {(li , ri )}k i=1; repeat // Assignment (as usual) for i = 1, 2, ..., k do Ci ← {h ∈ H : i = arg minj Mλ(lj : h : rj )}; end // Dual-sided centroid relocation (the trick!) for i = 1, 2, ..., k do ri ← arg minx D(Ci : x) = h∈Ci wj D(h : x); li ← arg minx D(x : Ci ) = h∈Ci wj D(x : h); end until convergence; c 2015 Frank Nielsen 9
  • 10. Example: Mixed α-hard clustering: MAhC(H, k, λ, α) Input: Weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ R; Let C = {(li , ri )}k i=1 ← MAS(H, k, λ, α); repeat // Assignment for i = 1, 2, ..., k do Ai ← {h ∈ H : i = arg minj Mλ,α(lj : h : rj )}; end // Centroid relocation for i = 1, 2, ..., k do ri ← h∈Ai wi h 1−α 2 2 1−α ; li ← h∈Ai wi h 1+α 2 2 1+α ; end until convergence; c 2015 Frank Nielsen 10
  • 11. Coupled k-Means++ α-Seeding (extending k-means++) Algorithm 1: Mixed α-seeding; MAS(H, k, λ, α) Input: Weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ R; Let C ← hj with uniform probability ; for i = 2, 3, ..., k do Pick at random histogram h ∈ H with probability: πH(h) eq = whMλ,α(ch : h : ch) y∈H wy Mλ,α(cy : y : cy ) , (1) // where (ch, ch) eq = arg min(z,z)∈C Mλ,α(z : h : z); C ← C ∪ {(h, h)}; end Output: Set of initial cluster centers C; → Guaranteed probabilistic bound. Just need to initialize! No centroid computations as iterations not theoretically required c 2015 Frank Nielsen 11
  • 12. Learning statistical mixtures with hard EM k-GMLE [7]: fast, guaranteed finite convergence, low memory footprint... but not consistent (biased) c 2015 Frank Nielsen 12
  • 13. Learning MMs: A geometric hard clustering viewpoint Learn the parameters of a mixture m(x) = k i=1 wi p(x|θi ) Maximize the complete data likelihood=clustering objective function max W ,Λ lc(W , Λ) = n i=1 k j=1 zi,j log(wj p(xi |θj )) = max Λ n i=1 max j∈[k] log(wj p(xi |θj )) ≡ min W ,Λ n i=1 min j∈[k] Dj (xi ) , where cj = (wj , θj ) (cluster prototype) and Dj (xi ) = − log p(xi |θj ) − log wj are potential distance-like functions. ⇒ further attach to each cluster (mixture component) a different family of probability distributions. c 2015 Frank Nielsen 13
  • 14. Generalized k-MLE: learning statistical EF mixtures [7, 23, 22, 21, 9] Model-based clustering: Assignment of points to clusters: Dwj ,θj ,Fj (x) = − log pFj (x; θj ) − log wj 1. Initialize weight W ∈ ∆k and family type (F1, ..., Fk) for each cluster 2. Solve minΛ i minj Dj (xi ) (center-based clustering for W fixed) with potential functions: Dj (xi ) = − log pFj (xi |θj ) − log wj 3. Solve family types maximizing the MLE in each cluster Cj by choosing the parametric family of distributions Fj = F(γj ) that yields the best likelihood: minF1=F(γ1),...,Fk =F(γk )∈F(γ) i minj Dwj ,θj ,Fj (xi ). ∀l, γl = maxj F∗ j (ˆηl = 1 nl x∈Cl tj (x)) + 1 nl x∈Cl k(x). 4. Update weight W as the cluster point proportion 5. Test for convergence and go to step 2) otherwise. Drawback = biased, non-consistent estimator due to Voronoi support truncation.c 2015 Frank Nielsen 14
• 15. Conformal divergences and clustering (by analogy with a conformal Riemannian metric tensor). c 2015 Frank Nielsen 15
• 16. Geometrically designed divergences
Figure: plot of the convex generator F, F: x ↦ (x, F(x)), with the lifted points (p, F(p)), (q, F(q)) and the midpoint (p+q)/2, illustrating the Bregman divergence B(p : q) [13], the Jensen (Burbea-Rao) divergence J(p, q) [10], and the total Bregman divergence tB(p : q) [5] as gaps read on the graph of F.
c 2015 Frank Nielsen 16
• 17. Divergences: Distortion measures
F: a smooth convex function, the generator.
Skew Jensen divergences:
  J_α(p : q) = αF(p) + (1−α)F(q) − F(αp + (1−α)q) = (F(p)F(q))_α − F((pq)_α),
where (pq)_γ = γp + (1−γ)q = q + γ(p − q) and (F(p)F(q))_γ = γF(p) + (1−γ)F(q) = F(q) + γ(F(p) − F(q)).
Bregman divergences = limit cases of skew Jensen divergences (after rescaling by 1/(α(1−α))):
  B(p : q) = F(p) − F(q) − ⟨p − q, ∇F(q)⟩,
  lim_{α→0} J_α(p : q)/(α(1−α)) = B(p : q),  lim_{α→1} J_α(p : q)/(α(1−α)) = B(q : p).
Statistical Bhattacharyya divergence = Jensen divergence for exponential families [10]:
  Bhat(p_1 : p_2) = −log ∫ p_1(x)^α p_2(x)^{1−α} dν(x) = J_α(θ_1 : θ_2).
c 2015 Frank Nielsen 17
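A small Python sketch of these two definitions for a univariate generator; the choice F(x) = x log x is just an example (any smooth convex generator would do), and the names are illustrative.

import numpy as np

F = lambda x: x * np.log(x)          # example generator
gradF = lambda x: np.log(x) + 1.0

def jensen_skew(p, q, alpha):
    # J_alpha(p:q) = alpha*F(p) + (1-alpha)*F(q) - F(alpha*p + (1-alpha)*q)
    return alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)

def bregman(p, q):
    # B_F(p:q) = F(p) - F(q) - <p - q, grad F(q)>
    return F(p) - F(q) - (p - q) * gradF(q)

p, q = 0.7, 0.2
print(jensen_skew(p, q, 0.5), bregman(p, q), bregman(q, p))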
• 18. Total Bregman divergences
Conformal divergence with conformal factor ρ: D'(p : q) = ρ(p, q) D(p : q); the factor plays the role of a "regularizer" [24] and ensures robustness.
Invariance under rotations of the axes of the design space.
  tB(p : q) = B(p : q) / √(1 + ⟨∇F(q), ∇F(q)⟩) = ρ_B(q) B(p : q),  ρ_B(q) = 1 / √(1 + ⟨∇F(q), ∇F(q)⟩).
Total squared Euclidean divergence:
  tE(p, q) = (1/2) ⟨p − q, p − q⟩ / √(1 + ⟨q, q⟩).
c 2015 Frank Nielsen 18
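A short Python sketch of a total Bregman divergence: the Bregman divergence rescaled by the conformal factor ρ_B(q). With the assumed generator F(x) = ½⟨x, x⟩ it reduces to the total squared Euclidean divergence above; names are illustrative.

import numpy as np

F = lambda x: 0.5 * np.dot(x, x)
gradF = lambda x: x

def total_bregman(p, q):
    b = F(p) - F(q) - np.dot(p - q, gradF(q))              # B_F(p : q)
    rho = 1.0 / np.sqrt(1.0 + np.dot(gradF(q), gradF(q)))  # conformal factor rho_B(q)
    return rho * b

p, q = np.array([1.0, 2.0]), np.array([0.5, 0.5])
print(total_bregman(p, q))   # = 0.5*||p - q||^2 / sqrt(1 + <q, q>) for this generator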
• 19. Total Jensen divergence: Illustration of the principle
Figure: the generator graph F with lifted points F(p), F(q), the interpolated point (pq)_α and the chord value (F(p)F(q))_α, showing the Jensen gap J_α(p : q) and its conformally scaled version tJ_α(p : q); the same construction is repeated for rotated points p', q' (about the origin O) to illustrate the rotational invariance of the total Jensen divergence.
c 2015 Frank Nielsen 19
• 20. Total Jensen divergences [15]
  tB(p : q) = ρ_B(q) B(p : q),  ρ_B(q) = 1 / √(1 + ⟨∇F(q), ∇F(q)⟩),
  tJ_α(p : q) = ρ_J(p, q) J_α(p : q),  ρ_J(p, q) = 1 / √(1 + (F(p) − F(q))² / ⟨p − q, p − q⟩).
Jensen-Shannon divergence (its square root is a metric [3]):
  JS(p, q) = (1/2) Σ_{i=1}^d p^i log(2p^i / (p^i + q^i)) + (1/2) Σ_{i=1}^d q^i log(2q^i / (p^i + q^i)).
Lemma: The square root of the total Jensen-Shannon divergence is not a metric.
c 2015 Frank Nielsen 20
• 21. Total Jensen divergences vs. total Bregman divergences
Total Jensen is not a generalization of total Bregman. In the limit cases α ∈ {0, 1} (after rescaling), we have:
  lim_{α→0} tJ_α(p : q) = ρ_J(p, q) B(p : q) ≠ ρ_B(q) B(p : q),
  lim_{α→1} tJ_α(p : q) = ρ_J(p, q) B(q : p) ≠ ρ_B(p) B(q : p),
since in general the conformal factor ρ_J(p, q) ≠ ρ_B(q), where
  ρ_B(q) = 1 / √(1 + ⟨∇F(q), ∇F(q)⟩),  tJ_α(p : q) = ρ_J(p, q) J_α(p : q).
c 2015 Frank Nielsen 21
• 22. Conformal factor from the mean value theorem
When p ≃ q, ρ_J(p, q) ≃ ρ_B(q), and the total Jensen divergence tends to the total Bregman divergence for any value of α.
  ρ_J(p, q) = 1 / √(1 + ⟨∇F(ξ), ∇F(ξ)⟩) = ρ_B(ξ),  for some ξ ∈ [p, q].
For univariate generators, the value of ξ is explicit:
  ξ = (F')^{-1}(ΔF/Δ) = ∇F*(ΔF/Δ),
where F* is the Legendre convex conjugate [10].
c 2015 Frank Nielsen 22
• 23. Centroids and statistical robustness
Centroids (barycenters) are minimizers of average (weighted) divergences:
  L(x; w) = Σ_{i=1}^n w_i · tJ_α(p_i : x),  c_α = arg min_{x∈X} L(x; w).
Is it unique? Is it robust to outliers [4]?
Iterative convex-concave procedure (CCCP, a gradient method without a learning rate) [10].
c 2015 Frank Nielsen 23
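As an illustration of a CCCP-style fixed-point update, here is a Python sketch for the plain (non-total) skew Jensen centroid with the Shannon generator F(x) = Σ x log x, for which each update is in closed form; this is an assumed simplification for illustration, not the total-Jensen centroid routine of [15].

import numpy as np

def jensen_centroid(P, alpha=0.5, w=None, iters=50):
    # CCCP update: c <- gradF^{-1}( sum_i w_i gradF(alpha*p_i + (1-alpha)*c) ),
    # which for F(x) = sum x*log(x) is the weighted geometric mean of the mixtures.
    n, d = P.shape
    w = np.full(n, 1.0 / n) if w is None else w / w.sum()
    c = P.mean(axis=0)                     # initialize at the arithmetic mean
    for _ in range(iters):
        mix = alpha * P + (1 - alpha) * c  # alpha*p_i + (1-alpha)*c, row-wise
        c = np.exp(w @ np.log(mix))        # gradF^{-1} of the averaged gradients
    return c

P = np.random.default_rng(0).dirichlet(np.ones(5), size=10)
print(jensen_centroid(P, alpha=0.5))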
• 24. Clustering: No closed-form centroid, no cry!
k-means++ [1] picks its seeds at random: no centroid calculation is needed.
Algorithm 2: Total Jensen k-means++ seeding
Input: number of clusters k ≥ 1;
Let C ← {h_j} for an index j chosen with uniform probability;
for i = 2, 3, ..., k do
  Pick at random h ∈ H with probability π_H(h) = tJ_α(c_h : h) / Σ_{y∈H} tJ_α(c_y : y), where c_h = arg min_{z∈C} tJ_α(z : h);
  C ← C ∪ {h};
Output: set of initial cluster centers C;
c 2015 Frank Nielsen 24
• 25. Total Jensen divergences: Recap
Total Jensen divergence = conformal divergence with a non-separable, double-sided conformal factor.
Invariant under rotations of the axes of the "design space".
Equivalent to total Bregman divergences [24, 5] only when p ≃ q.
The square root of the total Jensen-Shannon divergence is not a metric, although the square root of the (plain) Jensen-Shannon divergence is a metric.
Total Jensen k-means++ does not require centroid computations and comes with a guaranteed approximation bound.
Interest of conformal divergences in SVMs [25] (double-sided separable factors) and in information geometry [20] (flattening).
c 2015 Frank Nielsen 25
  • 26. Novel heuristics for NP-hard center-based clustering: merge-and-split and (k, l)-means [14] c 2015 Frank Nielsen 26
• 27. The k-means merge-and-split heuristic
Generalizes Hartigan's single-point relocation heuristic...
Consider pairs of clusters (C_i, C_j) with centers c_i and c_j, merge them, and split the merged set again into two clusters C_i', C_j' with new centers c_i' and c_j'. Accept the move when the sum of the two cluster variances decreases:
  Δ(C_i, C_j) = V(C_i, c_i) + V(C_j, c_j) − (V(C_i', c_i') + V(C_j', c_j')) > 0.
How to split the merged cluster again (the best splitting is NP-hard)?
- Discrete 2-means: choose among the n_{i,j} = n_i + n_j points of C_{i,j} the two best centers (naively implemented in O(n³)). This yields a 2-approximation of 2-means.
- 2-means++ heuristic: pick c_i' at random, then pick c_j' at random according to the normalized distribution of the squared distances of the points of C_{i,j} to c_i' (cf. k-means++). Repeat this initialization for a given number α of rounds (say, α = 1 + 0.01 C(n_{i,j}, 2)) and keep the best one.
c 2015 Frank Nielsen 27
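A minimal Python sketch of one merge-and-split move under the squared Euclidean cost, using a single 2-means++ seeding round for the re-split; variable names and the acceptance test are illustrative, not the implementation of [14].

import numpy as np

def variance(C, c):
    return ((C - c) ** 2).sum()

def merge_and_split(Ci, Cj, seed=0):
    rng = np.random.default_rng(seed)
    merged = np.vstack([Ci, Cj])
    old = variance(Ci, Ci.mean(0)) + variance(Cj, Cj.mean(0))
    # 2-means++ split: first center uniform, second proportional to squared distance
    c1 = merged[rng.integers(len(merged))]
    d2 = ((merged - c1) ** 2).sum(1)
    c2 = merged[rng.choice(len(merged), p=d2 / d2.sum())]
    lab = (((merged - c2) ** 2).sum(1) < ((merged - c1) ** 2).sum(1)).astype(int)
    A, B = merged[lab == 0], merged[lab == 1]
    if len(A) == 0 or len(B) == 0:
        return Ci, Cj, False
    new = variance(A, A.mean(0)) + variance(B, B.mean(0))
    # accept only if the summed within-cluster variance decreases
    return (A, B, True) if new < old else (Ci, Cj, False)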
• 28. The k-means merge-and-split heuristic (ops = number of pivot operations)
Data set                     Hartigan           Discrete Hartigan   Merge&Split
                             cost      #ops     cost      #ops      cost      #ops
Iris (d=4, n=150, k=3)       112.35    35.11    101.69    33.54     83.95     31.36
Wine (d=13, n=178, k=3)      607303    97.88    593319    100.02    570283    100.47
Yeast (d=8, n=1484, k=10)    47.10     1364.0   57.34     807.83    50.20     190.58

Data set                     Hartigan++         Discrete Hartigan++  Merge&Split++
                             cost      #ops     cost      #ops       cost      #ops
Iris (d=4, n=150, k=3)       101.49    19.40    90.48     18.93      88.56     8.84
Wine (d=13, n=178, k=3)      3152616   18.76    2525803   24.61      2498107   9.67
Yeast (d=8, n=1484, k=10)    47.41     1192.38  54.96     640.89     51.82     66.30
c 2015 Frank Nielsen 28
• 29. The (k, l)-means heuristic: navigating the local minima!
Associate to each point p_i its l nearest cluster centers NN_l(p_i; K) (with iNN_l = the corresponding set of center indexes), and minimize the (k, l)-means objective function (1 ≤ l ≤ k):
  e(P, K; l) = Σ_{i=1}^n Σ_{a ∈ iNN_l(p_i; K)} ‖p_i − c_a‖².
The assignment/relocation steps guarantee a monotone decrease. A higher l alters the local optima of the optimization landscape.
Conversion to k-means:
- (k, l)↓-means: convert a (k, l)-means by assigning to each point p_i its closest center (among the l assigned at the end of the (k, l)-means), then compute the centroids and run a regular Lloyd k-means to finalize.
- Cascading conversion of (k, l)-means to k-means: after convergence of the (k, l)-means, initialize a (k, l−1)-means by dropping for each point p_i its farthest cluster, run Lloyd's (k, l−1)-means, and so on until a (k, 1)-means = k-means is reached.
c 2015 Frank Nielsen 29
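A short Python sketch of one Lloyd-style round of (k, l)-means: each point is assigned to its l nearest centers, and each center is relocated to the mean of all points that selected it (so a point contributes to l centroids). Illustrative only, under the squared Euclidean cost.

import numpy as np

def kl_means_round(P, C, l):
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(-1)     # n x k squared distances
    nn = np.argsort(d2, axis=1)[:, :l]                      # l nearest centers per point
    cost = np.take_along_axis(d2, nn, axis=1).sum()         # (k, l)-means objective e(P, K; l)
    newC = C.copy()
    for j in range(len(C)):
        members = P[(nn == j).any(axis=1)]                  # all points that picked center j
        if len(members):
            newC[j] = members.mean(axis=0)
    return newC, cost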
• 30. The (k, l)-means heuristic: 10000 trials on the Iris data-set
(Conversions (k, l)↓-means and cascading (k, l)-means-to-k-means as defined on the previous slide.)
k    win     k-means (min / avg)     (k,2)↓-means (min / avg)
3    20.8    78.94 / 92.39           78.94 / 78.94
4    24.29   57.31 / 63.15           57.31 / 70.33
5    57.76   46.53 / 52.88           49.74 / 51.10
6    80.55   38.93 / 45.60           38.93 / 41.63
7    76.67   34.18 / 40.00           34.29 / 36.85
8    80.36   29.87 / 36.05           29.87 / 32.52
9    78.85   27.76 / 32.91           27.91 / 30.15
10   79.88   25.81 / 30.24           25.97 / 28.02

k    l    win     k-means (min / avg)     (k,l)-means (min / avg)
5    2    58.3    46.53 / 52.72           49.74 / 51.24
5    4    62.4    46.53 / 52.55           49.74 / 49.74
8    2    80.8    29.87 / 36.40           29.87 / 32.54
8    3    61.1    29.87 / 36.19           32.76 / 34.04
8    6    55.5    29.88 / 36.189          32.75 / 35.26
10   2    78.8    25.81 / 30.61           25.97 / 28.23
10   3    82.5    25.95 / 30.23           26.47 / 27.76
10   5    64.7    25.90 / 30.32           26.99 / 28.61
On average the converted (k, l)-means reaches a better cost, but the best local minima are found by the plain k-means...
c 2015 Frank Nielsen 30
  • 31. Geometry: Space of Bregman spheres, potential function and polarity c 2015 Frank Nielsen 31
• 32. Space of Bregman spheres and Bregman balls [2]
Dual sided Bregman balls (bounding Bregman spheres):
  Ball^r_F(c, r) = {x ∈ X | B_F(x : c) ≤ r},
  Ball^l_F(c, r) = {x ∈ X | B_F(c : x) ≤ r}.
Legendre duality: Ball^l_F(c, r) = (∇F)^{-1}( Ball^r_{F*}(∇F(c), r) ).
Illustration for the Itakura-Saito divergence, F(x) = −log x.
c 2015 Frank Nielsen 32
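A small Python sketch of the two sided ball membership tests, using the Itakura-Saito divergence B_F(p : q) = Σ (p/q − log(p/q) − 1) generated by F(x) = −Σ log x (the slide's illustration); names are illustrative.

import numpy as np

def itakura_saito(p, q):
    return np.sum(p / q - np.log(p / q) - 1.0)

def in_right_ball(x, c, r):      # Ball^r_F(c, r) = {x : B_F(x : c) <= r}
    return itakura_saito(x, c) <= r

def in_left_ball(x, c, r):       # Ball^l_F(c, r) = {x : B_F(c : x) <= r}
    return itakura_saito(c, x) <= r

c = np.array([1.0, 2.0]); x = np.array([1.2, 1.8])
print(in_right_ball(x, c, 0.05), in_left_ball(x, c, 0.05))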
  • 33. Lifting/Polarity: Potential function graph F c 2015 Frank Nielsen 33
• 34. Space of Bregman spheres: Lifting map [2]
Lifting: x ↦ x̂ = (x, F(x)), a point of the hypersurface F = {(x, F(x))} in R^{d+1} (graph of the potential function).
H_p: tangent hyperplane at p̂:  z = H_p(x) = ⟨x − p, ∇F(p)⟩ + F(p).
Bregman sphere σ ↦ σ̂ with supporting hyperplane H_σ: z = ⟨x − c, ∇F(c)⟩ + F(c) + r (parallel to H_c and shifted vertically by r); σ̂ = F ∩ H_σ.
Conversely, the intersection of any hyperplane H with F projects onto X as a Bregman sphere:
  H: z = ⟨x, a⟩ + b  →  σ: Ball_F( c = (∇F)^{-1}(a), r = ⟨a, c⟩ − F(c) + b ).
c 2015 Frank Nielsen 34
• 35. Space of Bregman spheres: Algorithmic applications
The Vapnik-Chervonenkis dimension (VC-dim) of the class of Bregman balls is d + 1 [2] (useful in machine learning).
Union/intersection of Bregman d-spheres computed from a representational (d + 1)-polytope [2].
The radical axis of two Bregman balls is a hyperplane: applications to nearest-neighbor search trees such as Bregman ball trees and Bregman vantage point trees [17].
c 2015 Frank Nielsen 35
• 36. Bregman proximity data structures, k-NN queries
Vantage point trees [17]: partition the space according to Bregman balls.
Partitioning the space with intersections of Kullback-Leibler balls → efficient nearest-neighbor queries in information spaces.
c 2015 Frank Nielsen 36
• 37. Application: Minimum Enclosing Ball [12, 19]
To a hyperplane H_σ = H(a, b): z = ⟨a, x⟩ + b in R^{d+1} corresponds a ball σ = Ball(c, r) in R^d with center c = ∇F*(a) and radius
  r = ⟨a, c⟩ − F(c) + b = ⟨a, ∇F*(a)⟩ − F(∇F*(a)) + b = F*(a) + b,
since F(∇F*(a)) = ⟨∇F*(a), a⟩ − F*(a) (Young equality).
SEB: find the halfspace H(a, b)^−: z ≤ ⟨a, x⟩ + b that contains all the lifted points:
  min_{a,b} r = F*(a) + b   s.t.   ⟨a, x_i⟩ + b − F(x_i) ≥ 0, ∀i ∈ {1, ..., n}
→ a Convex Program (CP) with linear inequality constraints.
For F(θ) = F*(η) = ½ x⊤x, the CP becomes a Quadratic Program (QP), as used in SVMs; the smallest enclosing ball is used as a primitive in SVMs.
c 2015 Frank Nielsen 37
• 38. Approximating the smallest enclosing Bregman ball [19, 11]
Algorithm 3: BBCA(P, l).
  c_1 ← a point chosen at random in P;
  for i = 2 to l − 1 do
    // farthest point from c_i w.r.t. B_F
    s_i ← arg max_{j=1..n} B_F(c_i : p_j);
    // update the center: walk on the η-segment [c_i, p_{s_i}]_η
    c_{i+1} ← ∇F^{-1}( ∇F(c_i) #_{1/(i+1)} ∇F(p_{s_i}) );
  // return the SEBB approximation
  return Ball(c_l, r_l = B_F(c_l : X));
θ- and η-geodesic segments in dually flat geometry.
c 2015 Frank Nielsen 38
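A Python sketch of this core-set style center walk, here for the assumed squared Euclidean case F(x) = ½⟨x, x⟩ (so ∇F is the identity and the η-segment walk is a plain convex combination); illustrative only, not the general-F implementation of [19, 11].

import numpy as np

def bregman(p, q):
    # B_F(p:q) for F(x) = 0.5<x,x>, i.e. half the squared Euclidean distance
    return 0.5 * np.sum((p - q) ** 2)

def bbca(P, T, seed=0):
    rng = np.random.default_rng(seed)
    c = P[rng.integers(len(P))].copy()
    for i in range(1, T):
        far = max(range(len(P)), key=lambda j: bregman(c, P[j]))  # farthest point w.r.t. B_F(c : .)
        c = c + (P[far] - c) / (i + 1)                            # walk 1/(i+1) of the way toward it
    radius = max(bregman(c, p) for p in P)
    return c, radius

P = np.random.default_rng(1).normal(size=(200, 2))
print(bbca(P, T=100))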
• 39. Smallest enclosing balls: Core-sets [19]
Core-set C ⊆ S: SOL(S) ≤ SOL(C) ≤ (1 + ε) SOL(S).
(Illustrations: extended Kullback-Leibler and Itakura-Saito cases.)
c 2015 Frank Nielsen 39
• 40. Programming InSphere predicates
Implicit representation of Bregman spheres/balls [2]: store d + 1 support points on the boundary.
Is x inside the Bregman ball defined by the d + 1 support points?
  InSphere(x; p_0, ..., p_d) = sign of the (d + 2) × (d + 2) determinant
    | 1       ...  1       1    |
    | p_0     ...  p_d     x    |
    | F(p_0)  ...  F(p_d)  F(x) |
InSphere(x; p_0, ..., p_d) is negative, null, or positive depending on whether x lies inside, on, or outside σ.
c 2015 Frank Nielsen 40
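A small Python sketch of this determinant predicate; with the assumed generator F(x) = ⟨x, x⟩ it is the classical Euclidean in-sphere test, and other generators give Bregman balls. The sign convention depends on the orientation of the support points, so the example ordering is chosen so that negative means inside.

import numpy as np

def insphere(x, supports, F=lambda v: np.dot(v, v)):
    pts = list(supports) + [x]                   # p_0, ..., p_d, x
    M = np.array([[1.0] * len(pts),              # row of ones
                  *np.array(pts).T,              # one row per coordinate
                  [F(p) for p in pts]])          # lifted values F(p_0), ..., F(x)
    return np.sign(np.linalg.det(M))             # sign of the (d+2) x (d+2) determinant

supports = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
print(insphere(np.array([0.0, 0.0]), supports))  # negative: inside the unit circle
print(insphere(np.array([2.0, 2.0]), supports))  # positive: outside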
  • 41. Computing f -divergences for generic f : Beyond stochastic Monte-Carlo numerical integration c 2015 Frank Nielsen 41
• 42. Ali-Silvey-Csiszár f-divergences [6]
  I_f(X_1 : X_2) = ∫ x_1(x) f(x_2(x)/x_1(x)) dν(x) ≥ 0   (potentially +∞)

Name of the f-divergence      Formula I_f(P : Q)                                                        Generator f(u), with f(1) = 0
Total variation (metric)      (1/2) ∫ |p(x) − q(x)| dν(x)                                               (1/2) |u − 1|
Squared Hellinger             ∫ (√p(x) − √q(x))² dν(x)                                                  (√u − 1)²
Pearson χ²_P                  ∫ (q(x) − p(x))² / p(x) dν(x)                                             (u − 1)²
Neyman χ²_N                   ∫ (p(x) − q(x))² / q(x) dν(x)                                             (1 − u)²/u
Pearson-Vajda χ^k_P           ∫ (q(x) − λp(x))^k / p^{k−1}(x) dν(x)                                     (u − 1)^k
Pearson-Vajda |χ|^k_P         ∫ |q(x) − λp(x)|^k / p^{k−1}(x) dν(x)                                     |u − 1|^k
Kullback-Leibler              ∫ p(x) log(p(x)/q(x)) dν(x)                                               −log u
reverse Kullback-Leibler      ∫ q(x) log(q(x)/p(x)) dν(x)                                               u log u
α-divergence                  (4/(1 − α²)) (1 − ∫ p^{(1−α)/2}(x) q^{(1+α)/2}(x) dν(x))                  (4/(1 − α²)) (1 − u^{(1+α)/2})
Jensen-Shannon                (1/2) ∫ (p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x)))) dν(x) −(u + 1) log((1 + u)/2) + u log u

Stochastic Monte-Carlo estimator: Î_f(p : q) = (1/n) Σ_{i=1}^n f(x_2(s_i)/x_1(s_i)), with s_1, ..., s_n ∼ i.i.d. X_1 (never +∞!).
c 2015 Frank Nielsen 42
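A short Python sketch of this stochastic estimator, for the Kullback-Leibler generator f(u) = −log u and two unit-variance univariate Gaussians, so that the estimate can be checked against the exact value (μ_1 − μ_2)²/2; the densities and sample size are illustrative.

import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, n = 0.0, 1.0, 200_000

def pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

s = rng.normal(mu1, 1.0, size=n)                        # samples from X_1 = N(mu1, 1)
estimate = np.mean(-np.log(pdf(s, mu2) / pdf(s, mu1)))  # (1/n) sum f(x_2(s_i)/x_1(s_i))
print(estimate, 0.5 * (mu1 - mu2) ** 2)                 # Monte-Carlo vs. exact KL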
• 43. Information monotonicity of f-divergences [6] (proof in the Ali-Silvey paper)
Do coarse binning, from d bins to k < d bins: X = ∪_{i=1}^k A_i.
Let p^A = (p_i^A)_i with p_i^A = Σ_{j∈A_i} p_j.
Information monotonicity: D(p : q) ≥ D(p^A : q^A).
Coarse-grained (downgraded) histograms should be less distinguishable...
⇒ f-divergences are the only divergences preserving information monotonicity.
c 2015 Frank Nielsen 43
• 44. f-divergences and higher-order Vajda χ^k divergences [6]
  I_f(X_1 : X_2) = Σ_{k=0}^∞ (f^{(k)}(1)/k!) χ^k_P(X_1 : X_2)
  χ^k_P(X_1 : X_2) = ∫ (x_2(x) − x_1(x))^k / x_1(x)^{k−1} dν(x),
  |χ|^k_P(X_1 : X_2) = ∫ |x_2(x) − x_1(x)|^k / x_1(x)^{k−1} dν(x),
are f-divergences for the generators (u − 1)^k and |u − 1|^k.
When k = 1, χ^1_P(X_1 : X_2) = ∫ (x_2(x) − x_1(x)) dν(x) = 0 (never discriminative), and |χ|^1_P(X_1, X_2) is twice the total variation distance.
χ^k_P is a signed distance.
c 2015 Frank Nielsen 44
• 45. Affine exponential families [6]
Canonical decomposition of the probability measure:
  p_θ(x) = exp( ⟨t(x), θ⟩ − F(θ) + k(x) ),
with an affine natural parameter space Θ (like multinomials).
Poi(λ): p(x|λ) = λ^x e^{−λ}/x!, λ > 0, x ∈ {0, 1, ...}
Nor_I(μ): p(x|μ) = (2π)^{−d/2} e^{−(1/2)(x−μ)⊤(x−μ)}, μ ∈ R^d, x ∈ R^d

Family          θ        Θ      F(θ)        k(x)                          t(x)   ν
Poisson         log λ    R      e^θ         −log x!                       x      ν_c (counting)
Iso. Gaussian   μ        R^d    (1/2) θ⊤θ   −(d/2) log 2π − (1/2) x⊤x     x      ν_L (Lebesgue)
c 2015 Frank Nielsen 45
• 46. Higher-order Vajda χ^k divergences [6]
The (signed) χ^k_P distance between members X_1 ∼ EF(θ_1) and X_2 ∼ EF(θ_2) of the same affine exponential family is (for k ∈ N) always bounded and equal to:
  χ^k_P(X_1 : X_2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{F((1−j)θ_1 + jθ_2)} / e^{(1−j)F(θ_1) + jF(θ_2)}.
For Poisson/Normal distributions, we get closed-form formulas:
  χ^k_P(λ_1 : λ_2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{λ_1^{1−j} λ_2^j − ((1−j)λ_1 + jλ_2)},
  χ^k_P(μ_1 : μ_2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{(1/2) j(j−1)(μ_1−μ_2)⊤(μ_1−μ_2)}.
c 2015 Frank Nielsen 46
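A short Python sketch evaluating the Poisson closed form and sanity-checking it against a direct (truncated) summation over the Poisson support; the truncation point and parameter values are illustrative.

import math

def chi_k_poisson_closed_form(k, lam1, lam2):
    # Signed Pearson-Vajda chi^k between Poisson(lam1) and Poisson(lam2),
    # using the closed form for affine exponential families.
    return sum((-1) ** (k - j) * math.comb(k, j)
               * math.exp(lam1 ** (1 - j) * lam2 ** j - ((1 - j) * lam1 + j * lam2))
               for j in range(k + 1))

def chi_k_poisson_numeric(k, lam1, lam2, xmax=30):
    # Direct truncated summation over {0, ..., xmax-1} as a sanity check.
    def pmf(lam, x): return math.exp(-lam) * lam ** x / math.factorial(x)
    return sum((pmf(lam2, x) - pmf(lam1, x)) ** k / pmf(lam1, x) ** (k - 1)
               for x in range(xmax))

print(chi_k_poisson_closed_form(3, 0.6, 0.3))
print(chi_k_poisson_numeric(3, 0.6, 0.3))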
  • 47. Thank you! c 2015 Frank Nielsen 47
  • 48. Bibliography I David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035. Society for Industrial and Applied Mathematics, 2007. Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete and Computational Geometry, 44(2):281–307, April 2010. Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In IEEE International Symposium on Information Theory, pages 31–31, 2004. F. R. Hampel, P. J. Rousseeuw, E. Ronchetti, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Mathematical Statistics, 1986. Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012. F. Nielsen and R. Nock. On the chi square and higher-order chi distances for approximating f -divergences. Signal Processing Letters, IEEE, 21(1):10–13, 2014. Frank Nielsen. k-MLE: A fast algorithm for learning statistical mixture models. CoRR, 1203.5181, 2012. c 2015 Frank Nielsen 48
  • 49. Bibliography II Frank Nielsen. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. Signal Processing Letters, IEEE, 20(7):657–660, 2013. Frank Nielsen. On learning statistical mixtures maximizing the complete likelihood. Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), 1641:238–245, 2014. Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011. Frank Nielsen and Richard Nock. On approximating the smallest enclosing Bregman balls. In Proceedings of the Twenty-second Annual Symposium on Computational Geometry, SCG ’06, pages 485–486, New York, NY, USA, 2006. ACM. Frank Nielsen and Richard Nock. On the smallest enclosing information disk. Information Processing Letters (IPL), 105(3):93–97, 2008. Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. Information Theory, IEEE Transactions on, 55(6):2882–2904, 2009. Frank Nielsen and Richard Nock. Further heuristics for k-means: The merge-and-split heuristic and the (k, l)-means. arXiv preprint arXiv:1406.6314, 2014. c 2015 Frank Nielsen 49
• 50. Bibliography III Frank Nielsen and Richard Nock. Total Jensen divergences: Definition, properties and clustering. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2016–2020, 2015. Frank Nielsen, Richard Nock, and Shun-ichi Amari. On clustering histograms with k-means by using mixed α-divergences. Entropy, 16(6):3273–3301, 2014. Frank Nielsen, Paolo Piro, and Michel Barlaud. Bregman vantage point trees for efficient nearest neighbor queries. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878–881, 2009. Richard Nock, Panu Luosto, and Jyrki Kivinen. Mixed Bregman clustering with approximation guarantees. In Machine Learning and Knowledge Discovery in Databases, pages 154–169. Springer, 2008. Richard Nock and Frank Nielsen. Fitting the smallest enclosing Bregman balls. In 16th European Conference on Machine Learning (ECML), pages 649–656, October 2005. Atsumi Ohara, Hiroshi Matsuzoe, and Shun-ichi Amari. A dually flat structure on the space of escort distributions. Journal of Physics: Conference Series, 201(1):012012, 2010. Christophe Saint-Jean and Frank Nielsen. Hartigan's method for k-MLE: Mixture modeling with Wishart distributions and its application to motion retrieval. In Frank Nielsen, editor, Geometric Theory of Information, Signals and Communication Technology, pages 301–330. Springer International Publishing, 2014. c 2015 Frank Nielsen 50
• 51. Bibliography IV Olivier Schwander and Frank Nielsen. Fast learning of gamma mixture models with k-MLE. In Similarity-Based Pattern Recognition, pages 235–249. Springer, 2013. Olivier Schwander, Aurelien J. Schutz, Frank Nielsen, and Yannick Berthoumieu. k-MLE for mixtures of generalized Gaussians. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 2825–2828. IEEE, 2012. Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, pages 475–483, 2011. Si Wu and Shun-ichi Amari. Conformal transformation of kernel functions: a data-dependent way to improve support vector machine classifiers. Neural Processing Letters, 15(1):59–67, 2002. c 2015 Frank Nielsen 51