Information-theoretic clustering with applications
Frank Nielsen
Sony Computer Science Laboratories Inc
École Polytechnique
Kyoto University Information Seminar
9th June 2014
Clustering: Data exploratory science
◮ Clustering: Find homogeneous groups in data
◮ Clustering: Separate data into groups
Iris data set from UCI: d = 4, k = 3, n = 150
Clustering: What to see? When to see it?
◮ Distance or similarity measure between data?
◮ Data membership: Hard or soft clustering?
◮ How many groups?
◮ Outliers?
◮ “Shapes” of groups?
◮ Representation of data and cleaning,
◮ Missing data, truncated data, ...
◮ Etc.
Outline
Clustering n data in d dimensions into k clusters
Usual case: d << k << n, but all cases possible
◮ Quick tutorial: “The essentials”
◮ Ordinary k-means,
◮ Gaussian mixtures models,
◮ Model selection,
◮ Other kinds of clustering, etc.
◮ Optimal 1D contiguous clustering
◮ Clustering histograms with Jeffreys divergence
Part I
Quick tutorial
k-means clustering
Partition the data set X = {x1, ..., xn} into k groups
P = {G1, ..., Gk },
each group Gj has a center cj .
Minimize the objective function (cost, energy, loss):

e(X, C) = \sum_{i=1}^n \min_{j=1}^k \|x_i - c_j\|^2

◮ NP-hard when d > 1 and k > 1
◮ For k = 1, the center c_1 is the centroid and e_1(X) = e(X, {c_1}) = var(X), the "unnormalized" variance:

var(X) = \sum_i \|x_i\|^2 - n \|c_1\|^2
Rewriting the k-means cost to minimize
e(X, C) = \sum_{i=1}^n \min_{j=1}^k \|x_i - c_j\|^2
        = \sum_{j=1}^k \frac{1}{2 n_j} \sum_{x, x' \in G_j} \|x - x'\|^2
        = \sum_{j=1}^k n_j \, var(G_j), with n_j = |G_j| the cluster sizes and var(G_j) the normalized cluster variances
        = var(X) - var(Y) (unnormalized variances), where Y = \{(n_j, c_j)\}_{j=1}^k is the set of weighted cluster centroids.

Interpreted as: minimize the sum of intra-cluster pairwise distances, maximize the sum of inter-cluster pairwise distances, minimize the sum of cluster variances, minimize the quantization loss.
Heuristics for k-means
◮ Local heuristics:
◮ Lloyd (batched): assign data to closest center, remap cluster
centers to centroids, reiterate until convergence
◮ MacQueen (incremental, online): add a point at a time, assign
x to closest cluster, update that cluster centroid, reiterate until
convergence
◮ Hartigan (single-point): find a point x ∈ Gi and a cluster Gj so
that relocating x ∈ Gj decreases k-means cost, reiterate until
convergence
◮ Global heuristics:
◮ Forgy: random seeding. The best Forgy run is a 2-approximation via the variance-bias decomposition:

D(x, X) = \sum_{i=1}^n \|x - x_i\|^2 = var(X) + n \|x - c_1\|^2.

◮ Farthest-first traversal: choose c_1 at random, then pick c_i as the point x farthest from {c_1, ..., c_{i-1}}, and repeat.
◮ Randomized k-means++ (probabilistic Õ(log k) approximation bound)
◮ Global k-means
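To make these heuristics concrete, here is a minimal NumPy sketch of the batched Lloyd heuristic seeded by a farthest-first traversal; the function names, the handling of empty clusters and the stopping test are illustrative choices under stated assumptions, not part of the slides.

import numpy as np

def farthest_first_seeding(X, k, rng):
    # Global seeding: pick c1 at random, then each next center as the point
    # farthest from the centers chosen so far.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[np.argmax(d2)])
    return np.asarray(centers)

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    # Batched Lloyd heuristic: assign each point to its closest center,
    # remap each center to its cluster centroid, and repeat until convergence.
    rng = np.random.default_rng(seed)
    C = farthest_first_seeding(X, k, rng)
    for _ in range(n_iter):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        newC = np.asarray([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                           for j in range(k)])
        if np.allclose(newC, C):
            break
        C = newC
    labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    cost = ((X - C[labels]) ** 2).sum()   # k-means objective e(X, C)
    return C, labels, cost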
Probabilistic model-based clustering
Maximize likelihood for the mixture density:
m(x) = \sum_{i=1}^k w_i \, p(x; \mu_i, \Sigma_i)
Statistical mixture models: generative models
Statistical Gaussian Mixtures Models
Universal smooth density estimator.
Gaussian distribution:
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} e^{-\frac{1}{2} D_{\Sigma^{-1}}(x-\mu, \, x-\mu)}

Related to the squared Mahalanobis distance:

D_Q(x, y) = (x - y)^\top Q (x - y), \quad x, y \in R^d

→ log-concave density
Expectation-maximization (EM) algorithm: maximize incomplete
likelihood.
EM tends to k-means when Σj = λI with λ → 0 (or any fixed λ,
see [34]).
Sampling from a Gaussian Mixture Model
To sample a variate x from a GMM:
◮ Choose a component l according to the weight distribution
w1, ..., wk ,
◮ Draw a variate x according to N(µl , Σl ).
→ Sampling is a doubly stochastic process:
◮ throw a biased die with k faces to choose the component: l ∼ Multinomial(w_1, ..., w_k) (a normalized histogram),
◮ then draw a variate x from the l-th component: x ∼ Normal(\mu_l, \Sigma_l), i.e.,

x = \mu_l + C z, with the Cholesky factorization \Sigma_l = C C^\top and z = [z_1 ... z_d]^\top a standard normal random vector, e.g. via Box-Muller: z_i = \sqrt{-2 \log U_1} \cos(2\pi U_2).
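A minimal NumPy sketch of this doubly stochastic sampling (component choice, then Cholesky factor times Box-Muller normals); names are illustrative:

import numpy as np

def sample_gmm(weights, mus, Sigmas, n, seed=0):
    # Doubly stochastic sampling: choose a component, then draw from its Gaussian.
    rng = np.random.default_rng(seed)
    mus = np.asarray(mus, dtype=float)
    k, d = mus.shape
    labels = rng.choice(k, size=n, p=weights)          # l ~ Multinomial(w_1, ..., w_k)
    Ls = [np.linalg.cholesky(np.asarray(S, dtype=float)) for S in Sigmas]  # Sigma_l = C C^T
    X = np.empty((n, d))
    for i, l in enumerate(labels):
        U1, U2 = 1.0 - rng.random(d), rng.random(d)    # uniforms in (0, 1], avoids log(0)
        z = np.sqrt(-2.0 * np.log(U1)) * np.cos(2.0 * np.pi * U2)  # Box-Muller standard normals
        X[i] = mus[l] + Ls[l] @ z                      # x = mu_l + C z
    return X, labels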
GMMs: Generative models, sampling [15]
A pixel (x,y,R,G,B) : A data point in 5D
Model selection: Choosing k
The more parameters, the better the fit, but the model overfits.
Solution: Penalized likelihood to maximize.
Bayesian Information Criterion (BIC):
\max_\theta \; l(\theta) - \frac{\#\theta}{2} \log n
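A sketch of BIC-based selection of k, assuming scikit-learn's GaussianMixture is available (its bic() method returns -2 l(θ) + #θ log n, so minimizing it maximizes the penalized likelihood above); the helper name is illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

def pick_k_by_bic(X, k_max=10, seed=0):
    # Fit a GMM for each k and keep the k maximizing l(theta) - (#theta / 2) log n,
    # i.e. minimizing sklearn's BIC = -2 l(theta) + #theta log n.
    best_k, best_bic = None, np.inf
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=seed).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k, best_bic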
Clustering analysis
◮ Parametric clustering (k given, cost-based)
◮ center-based hard clustering (k-means, k-medians, min
diameter, single linkage, etc.)
◮ model-based soft clustering
◮ Non-parametric clustering (k adjusts with data)
◮ Kernel density estimators (mean shift)
◮ Dirichlet process mixture models
◮ Hierarchical clustering
◮ Graph-based clustering (normalized cut)
◮ Affinity propagation
◮ Subspace/manifold clustering
◮ Etc!
Cluster validation (distance between clustering, confusion matrix,
NMI, Rand index, etc.)
Part II
Optimal 1D contiguous clustering
Hard clustering: Partitioning the data set
◮ Partition X = {x1, ..., xn} ⊂ X into k pairwise disjoint clusters
C1 ⊂ X, ..., Ck ⊂ X:
X = \bigcup_{i=1}^k C_i
◮ Center-based hard clustering (center=prototype):
k-means [1], k-medians, k-center, ℓr -center [21], etc.
◮ Model-based hard clustering: statistical mixtures maximizing
the complete likelihood (prototype=model parameter).
k-means clustering
Minimize the sum of intra-cluster variances:
\min_{p_1, ..., p_k} \sum_{i=1}^n \min_{j=1}^k \|x_i - p_j\|^2
◮ k-means: NP-hard when d > 1 and k > 1 [24, 12]. k-medians
and k-centers also NP-hard [25] (1984)
◮ Global heuristics (for initialization) (Forgy [14], global
k-means [40], k-means++ [4], etc.) and local search
heuristics (incremental Voronoi MacQueen [23], batched
Voronoi Lloyd [22], single-point swap Hartigan [17], etc.)
◮ In 1D, k-means is polynomial [3, 39]: O(n^2 k).
Euclidean 1D k-means
◮ 1D k-means [13] has contiguous partition.
◮ Solved by enumerating all \binom{n-1}{k-1} contiguous partitions in 1D (1958).
Better than the Stirling numbers of the second kind S(n, k), which count all (non-contiguous) partitions.
◮ Polynomial, in time O(n^2 k), using Dynamic Programming (DP) [3] (sketched in 1973 in two pages).
◮ R package Ckmeans.1d.dp [39] (2011).
(Sketch of the) DP solution [3] (Bellman, 1973)
Interval clustering: Structure
Sort X ∈ X with respect to total order < on X in O(n log n).
Output represented by:
◮ k intervals I_i = [x_{l_i}, x_{r_i}] such that C_i = I_i \cap X,
◮ or better, k - 1 delimiters l_i (i ∈ {2, ..., k}), since r_i = l_{i+1} - 1 (for i < k), r_k = n, and l_1 = 1:

C_1 = [x_1 ... x_{l_2 - 1}], \; C_2 = [x_{l_2} ... x_{l_3 - 1}], \; ..., \; C_k = [x_{l_k} ... x_n]
Objective function for interval clustering
Scalars x1 < ... < xn are partitioned contiguously
into k clusters: C1 < ... < Ck.
Clustering objective function:
\min \; e_k(X) = \bigoplus_{j=1}^k e_1(C_j)

e_1(\cdot): intra-cluster cost/energy,
⊕: inter-cluster cost/energy (a commutative, associative operator).
Examples of objective functions
In arbitrary dimension X = R^d:
◮ ℓ_r-clustering (r ≥ 1): ⊕ = +,

e_1(C_j) = \min_{p \in X} \sum_{x \in C_j} d(x, p)^r

(the argmin prototype p_j is the same whether or not we take the power 1/r of the sum)
Euclidean ℓ_r-clustering: r = 1 medians, r = 2 means.
◮ k-center (limit r → ∞): ⊕ = max,

e_1(C_j) = \min_{p \in X} \max_{x \in C_j} d(x, p)

◮ Discrete clustering: the search space in the min is C_j instead of X.
Note that in 1D, the ℓ_s-norm distance is always d(p, q) = |p - q|, independent of s ≥ 1.
Optimal interval clustering by Dynamic Programming
X_{j,i} = \{x_j, ..., x_i\} (j ≤ i), \quad X_i = X_{1,i} = \{x_1, ..., x_i\}

E = [e_{i,m}]: n × k cost matrix, O(n × k) memory, with e_{i,m} = e_m(X_i).

Optimality equation:

e_{i,m} = \min_{m \le j \le i} \{ e_{j-1,m-1} \oplus e_1(X_{j,i}) \}

with ⊕ an associative/commutative operator (+ or max).
Initialize with e_{i,1} = e_1(X_i).
Fill E column by column, from left to right, and each column from bottom to top.
The cost of the best clustering is at e_{n,k}.
Time: n × k cells × O(n) candidate splits × T_1(n) per evaluation = O(n^2 k \, T_1(n)), using O(nk) memory.
Retrieving the solution: Backtracking
Use an auxiliary matrix S = [s_{i,m}] to store the argmin indexes.
Backtrack in O(k) time:
◮ the left index l_k of C_k is stored at s_{n,k}: l_k = s_{n,k},
◮ iteratively retrieve the previous left interval indexes as l_{j-1} = s_{l_j - 1, j-1}, for j = k, ..., 2 (and l_1 = 1).
Note that l_j - 1 = n - \sum_{l=j}^k n_l = \sum_{l=1}^{j-1} n_l.
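A sketch of the generic DP with backtracking for the sum case (⊕ = +), written in plain Python/NumPy; interval_cost(j, i) stands for e_1(X_{j,i}) and is supplied by the caller (an O(1) version based on summed area tables is sketched after the Bregman slide below). Indexing is 1-based as on the slides; the function name is illustrative.

import numpy as np

def dp_interval_clustering(n, k, interval_cost):
    # Optimal contiguous clustering of x_1 < ... < x_n into k intervals.
    # interval_cost(j, i) returns e_1(X_{j,i}) for 1 <= j <= i <= n (1-based, inclusive).
    # Returns the optimal cost e_{n,k} and the left delimiters l_1 = 1 < l_2 < ... < l_k.
    INF = float("inf")
    E = np.full((n + 1, k + 1), INF)           # E[i, m] = e_{i,m}
    S = np.zeros((n + 1, k + 1), dtype=int)    # S[i, m] = argmin j (left index of the last cluster)
    for i in range(1, n + 1):                  # initialization: a single cluster X_{1,i}
        E[i, 1], S[i, 1] = interval_cost(1, i), 1
    for m in range(2, k + 1):                  # fill column by column
        for i in range(m, n + 1):
            for j in range(m, i + 1):          # the last cluster is X_{j,i}
                c = E[j - 1, m - 1] + interval_cost(j, i)
                if c < E[i, m]:
                    E[i, m], S[i, m] = c, j
    lefts, i = [0] * (k + 1), n                # backtracking: l_k = S[n,k], l_{m-1} = S[l_m - 1, m-1]
    for m in range(k, 0, -1):
        lefts[m] = S[i, m]
        i = lefts[m] - 1
    return E[n, k], lefts[1:]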
Optimizing time with a Look Up Table (LUT)
Save time when computing e_1(X_{j,i}), since we perform n × k × O(n) such computations.
Look-Up Table (LUT): add an extra n × n matrix E_1 with E_1[j][i] = e_1(X_{j,i}).
Build it in O(n^2 T_1(n)) time...
Then the DP runs in O(n^2 k), for a total of O(n^2 T_1(n) + n^2 k).
→ but a quadratic amount of memory (prohibitive for n > 10000...)
DP solver with cluster size constraints
n_i^- and n_i^+: lower/upper bound constraints on n_i = |C_i|,
with \sum_{l=1}^k n_l^- ≤ n and \sum_{l=1}^k n_l^+ ≥ n.
When there are no constraints, add the dummy constraints n_i^- = 1 and n_i^+ = n - k + 1.
n_m = |C_m| = i - j + 1 must satisfy n_m^- ≤ n_m ≤ n_m^+
→ j ≤ i + 1 - n_m^- and j ≥ i + 1 - n_m^+.

e_{i,m} = \min_{\max\{1 + \sum_{l=1}^{m-1} n_l^-, \; i + 1 - n_m^+\} \; \le \; j \; \le \; i + 1 - n_m^-} \{ e_{j-1,m-1} \oplus e_1(X_{j,i}) \}
Model selection from the DP table
m(k) = e_k(X)
e_k(X) decreases with k and reaches its minimum when k = n.
Model selection: a trade-off, choosing the best model among all the models with k ∈ {1, ..., n}.
Regularized objective function: e'_k(X) = e_k(X) + f(k), where f(k) is related to the model complexity.
Computing the DP table for k = n covers all k = n, ..., 1 and avoids redundant computations.
Then compute the criterion on the last row (indexed by n) and choose the argmin of e'_k.
A Voronoi cell condition for DP optimality
elements → interval clusters → prototypes
interval clusters ← prototypes
Partition X wrt. P = \{p_1, ..., p_k\}. Voronoi cell:

V(p_j) = \{x ∈ X : d^r(x, p_j) ≤ d^r(x, p_l) \; ∀ l ∈ \{1, ..., k\}\}.

Since x^r is a monotonically increasing function on R^+, this is equivalent to

V'(p_j) = \{x ∈ X : d(x, p_j) < d(x, p_l)\}.

The DP guarantees an optimal clustering when, for all P, each V'(p_j) is an interval.
Optimal 1D Bregman k-means
Bregman information [1] e_1 (generalizes the cluster variance):

e_1(C_j) = \min_{p_j} \sum_{x_l \in C_j} w_l B_F(x_l : p_j). (1)

Expressed as [35]:

e_1(C_j) = \sum_{x_l \in C_j} w_l (p_j F'(p_j) - F(p_j)) + \sum_{x_l \in C_j} w_l F(x_l) - F'(p_j) \sum_{x_l \in C_j} w_l x_l

Preprocess using Summed Area Tables (SATs) [10]: S_1(j) = \sum_{l=1}^j w_l, S_2(j) = \sum_{l=1}^j w_l x_l, and S_3(j) = \sum_{l=1}^j w_l F(x_l), in O(n) time at the preprocessing stage.
Then evaluate the Bregman information e_1(X_{j,i}) in constant time O(1).
For example, \sum_{l=j}^i w_l F(x_l) = S_3(i) - S_3(j - 1), with S_3(0) = 0.
Bregman Voronoi diagrams have connected cells [6], thus the DP yields the optimal interval clustering.
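A sketch of the summed-area-table trick for the weighted squared Euclidean case F(x) = x^2, for which the Bregman information of an interval is its weighted unnormalized variance; names are illustrative.

import numpy as np

def make_interval_cost(x, w=None):
    # Summed area tables S1 = cumsum(w), S2 = cumsum(w x), S3 = cumsum(w x^2), so that
    # e_1(X_{j,i}) = sum w_l x_l^2 - (sum w_l x_l)^2 / sum w_l is evaluated in O(1).
    x = np.asarray(x, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    S1 = np.concatenate(([0.0], np.cumsum(w)))
    S2 = np.concatenate(([0.0], np.cumsum(w * x)))
    S3 = np.concatenate(([0.0], np.cumsum(w * x * x)))

    def interval_cost(j, i):   # weighted unnormalized variance of x_j, ..., x_i (1-based)
        sw = S1[i] - S1[j - 1]
        swx = S2[i] - S2[j - 1]
        swx2 = S3[i] - S3[j - 1]
        return swx2 - swx * swx / sw

    return interval_cost

Combined with the DP sketch above, dp_interval_clustering(len(x), k, make_interval_cost(np.sort(x))) would return the optimal 1D k-means cost and delimiters.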
Exponential families in statistics
Family of probability distributions:

F = \{p_F(x; θ) : θ ∈ Θ\}

Exponential families [30]:

p_F(x; θ) = \exp(t(x) θ - F(θ) + k(x)).

For example, the univariate Rayleigh distribution R(σ): t(x) = x^2, k(x) = \log x, θ = -\frac{1}{2σ^2},
η = -\frac{1}{θ}, F(θ) = \log\left(-\frac{1}{2θ}\right) and F^*(η) = -1 + \log\frac{2}{η}.
Uniorder exponential families: MLE
Maximum Likelihood Estimator (MLE) [30]:
e_1(X_{j,i}) = \hat{l}(x_j, ..., x_i) = F^*(\hat{η}_{j,i}) + \frac{1}{i - j + 1} \sum_{l=j}^i k(x_l),

with \hat{η}_{j,i} = \frac{1}{i - j + 1} \sum_{l=j}^i t(x_l).

By making the change of variable y_l = t(x_l), and not accounting for the k(x_l) terms that are constant for any clustering, we get

e_1(X_{j,i}) ≡ F^*\left( \frac{1}{i - j + 1} \sum_{l=j}^i y_l \right)
Hard clustering for learning statistical mixtures
Expectation-Maximization learns monotonically from an
initialization by maximizing the incomplete log-likelihood.
Mixture maximizing the complete log-likelihood:
l_c(X; L, Ω) = \sum_{i=1}^n \log(α_{l_i} \, p(x_i; θ_{l_i})),

with L = \{l_i\}_i the hidden labels.

\max l_c ≡ \min_{θ_1, ..., θ_k} \sum_{i=1}^n \min_{j=1}^k (-\log p(x_i; θ_j) - \log α_j).

For fixed α, -\log p_F(x; θ) amounts to a dual Bregman divergence [1].
Running this Bregman k-means by DP yields the optimal partition, since additively-weighted Bregman Voronoi diagrams have interval cells [6].
Hard clustering for learning statistical mixtures
Location families:
F = \left\{ f(x; μ) = \frac{1}{σ} f_0\left(\frac{x - μ}{σ}\right) : μ ∈ R \right\}

with f_0 a standard density and σ > 0 fixed. Cauchy or Laplacian families have density graphs intersecting pairwise in exactly one point
→ singly-connected maximum-likelihood Voronoi cells.
Model selection: Akaike Information Criterion (AIC) [7]:

AIC(x_1, ..., x_n) = -2 \, l(x_1, ..., x_n) + 2k + \frac{2k(k + 1)}{n - k - 1}
Experiments with: Gaussian Mixture Models (GMMs)
gmm1 score = −3.0754314021966658 (Euclidean k-means)
gmm2 score = −3.038795325884112 (Bregman k-means, better)
1-mean (centroid): O(n) time
\min_p \sum_{i=1}^n (x_i - p)^2

D(x, p) = (x - p)^2, \quad \frac{∂}{∂p} D(x, p) = -2(x - p), \quad \frac{∂^2}{∂p^2} D(x, p) = 2

Convex optimization (existence and uniqueness of the solution):

\sum_{i=1}^n \frac{∂}{∂p} D(x_i, p) = 0 \; ⇒ \; \sum_{i=1}^n x_i - n p = 0

Center of mass p = \frac{1}{n} \sum_{i=1}^n x_i (barycenter).
Extends to Bregman divergences:

D_F(x, p) = F(x) - F(p) - (x - p) F'(p)
2-means: O(n log n) time
Find the delimiter l_2 (n - 1 potential locations: from x_2 to x_n):

\min_{l_2} \{ e_1(C_1) + e_1(C_2) \}

Browse from left to right, l_2 = 2, ..., n, where E_2(l) = e_2(x_1 ... x_{l-1} \,|\, x_l ... x_n).
Update the cost E_2(l + 1) from E_2(l) in constant time (SATs also give O(1) evaluations):

μ_1(l + 1) = \frac{(l - 1) μ_1(l) + x_l}{l}, \quad μ_2(l + 1) = \frac{(n - l + 1) μ_2(l) - x_l}{n - l}

v_1(l + 1) = \sum_{i=1}^{l} (x_i - μ_1(l + 1))^2 = \sum_{i=1}^{l} x_i^2 - l \, μ_1^2(l + 1)

v_1(l + 1) = v_1(l) + \frac{l - 1}{l} (μ_1(l) - x_l)^2, \quad v_2(l + 1) = v_2(l) - \frac{n - l + 1}{n - l} (μ_2(l) - x_l)^2

with E_2(l) = v_1(l) + v_2(l).
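A sketch of the incremental left-to-right sweep for 1D 2-means, maintaining the means and unnormalized variances of the two groups with the constant-time updates above (0-based Python indexing, x_l = x[l-1]; the function name is illustrative).

import numpy as np

def two_means_1d(x):
    # Optimal 1D 2-means: sort, then sweep the separator l = 2..n, maintaining the means
    # and unnormalized variances of C1 = {x_1..x_{l-1}} and C2 = {x_l..x_n} in O(1) per step.
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    mu1, v1 = x[0], 0.0                         # C1 = {x_1}
    mu2 = x[1:].mean()
    v2 = ((x[1:] - mu2) ** 2).sum()             # C2 = {x_2, ..., x_n}
    best_cost, best_l = v1 + v2, 2
    for l in range(2, n):                       # move x_l from C2 to C1 (separator l -> l+1)
        xl = x[l - 1]
        v1 += (l - 1) / l * (xl - mu1) ** 2
        mu1 = ((l - 1) * mu1 + xl) / l
        v2 -= (n - l + 1) / (n - l) * (xl - mu2) ** 2
        mu2 = ((n - l + 1) * mu2 - xl) / (n - l)
        if v1 + v2 < best_cost:
            best_cost, best_l = v1 + v2, l + 1
    return best_cost, best_l                    # optimal split: {x_1..x_{l-1}} | {x_l..x_n}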
2-means: Experiments
Timings on an Intel Win7 i7-4800:

n        | Brute force | SAT   | Incremental
300000   | 155.022     | 0.010 | 0.0091
1000000  | 1814.44     | 0.018 | 0.015

Do we need sorting, and hence Ω(n log n) time? (k = 1 is linear time.)
Note that MaxGap does not yield the separator (because the centroid is the minimizer of the sum of squared distances).
Conclusion
◮ Generic DP for solving interval clustering:
  ◮ O(n^2 k T_1(n)) time using O(nk) memory,
  ◮ O(n^2 T_1(n)) time using O(n^2) memory.
◮ Refined DP by adding minimum/maximum cluster size constraints.
◮ Model selection from the DP table.
◮ Two applications:
  ◮ 1D Bregman ℓ_r-clustering: 1D Bregman k-means in O(n^2 k) time using O(nk) memory, thanks to Summed Area Tables (SATs),
  ◮ Mixture learning maximizing the complete likelihood:
    ◮ for uni-order exponential families, it amounts to a dual Bregman k-means on Y = \{y_i = t(x_i)\}_i,
    ◮ for location families with density graphs intersecting pairwise in one point (Cauchy, Laplacian: not exponential families).
Part III
Clustering histograms with Jeffreys divergence
Why histogram clustering?
Task: Classify documents into categories:
Bag-of-Word (BoW) modeling paradigm [5, 11].
◮ Define a word dictionary, and
◮ Represent each document by a word count histogram.
Centroid-based k-means clustering [1]:
◮ Cluster document histograms to learn categories,
◮ Build visual vocabularies by quantizing image features:
Compressed Histogram of Gradient descriptors [8].
→ need to compute histogram centroids.
w_h = \sum_{i=1}^d h^i: cumulative sum of the bin values,
˜ : normalization operator (\tilde{h} = h / w_h).
Why Jeffreys divergence?
Distance between two frequency histograms \tilde{p} and \tilde{q}: the Kullback-Leibler divergence, or relative entropy:

KL(\tilde{p} : \tilde{q}) = H^×(\tilde{p} : \tilde{q}) - H(\tilde{p}),

H^×(\tilde{p} : \tilde{q}) = \sum_{i=1}^d \tilde{p}^i \log \frac{1}{\tilde{q}^i} \quad (cross-entropy)

H(\tilde{p}) = H^×(\tilde{p} : \tilde{p}) = \sum_{i=1}^d \tilde{p}^i \log \frac{1}{\tilde{p}^i} \quad (Shannon entropy)

→ the expected extra number of bits per datum that must be transmitted when using the "wrong" distribution \tilde{q} instead of the true distribution \tilde{p}.
\tilde{p} is hidden by nature (and hypothesized), \tilde{q} is estimated.
Why Jeffreys divergence?
When clustering histograms, all histograms play the same role → Jeffreys divergence [18]:

J(p, q) = KL(p : q) + KL(q : p),

J(p, q) = \sum_{i=1}^d (p^i - q^i) \log \frac{p^i}{q^i} = J(q, p).

→ symmetrizes the KL divergence.
(a.k.a. the J-divergence or symmetrical Kullback-Leibler divergence, etc.)
Here KL is extended to positive arrays:

KL(p : q) = \sum_i p^i \log \frac{p^i}{q^i} + q^i - p^i
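For reference, minimal NumPy helpers for the (extended) Kullback-Leibler and Jeffreys divergences between positive histograms with nonzero bins; names are illustrative.

import numpy as np

def kl(p, q):
    # Extended Kullback-Leibler divergence between positive arrays; reduces to the usual
    # relative entropy when p and q are frequency histograms (bins summing to one).
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q) + q - p))

def jeffreys(p, q):
    # Jeffreys divergence J(p, q) = KL(p:q) + KL(q:p) = sum_i (p_i - q_i) log(p_i / q_i).
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum((p - q) * np.log(p / q)))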
Jeffreys centroids: frequency and positive centroids
A set H = \{h_1, ..., h_n\} of weighted histograms, with positive weights π_j (\sum_{j=1}^n π_j = 1):

c = \arg\min_x \sum_{j=1}^n π_j J(h_j, x).

◮ Jeffreys positive centroid c:

c = \arg\min_{x ∈ R^d_+} \sum_{j=1}^n π_j J(h_j, x),

◮ Jeffreys frequency centroid \tilde{c}:

\tilde{c} = \arg\min_{x ∈ Δ_d} \sum_{j=1}^n π_j J(\tilde{h}_j, x),

where Δ_d is the (d - 1)-dimensional probability simplex.
Prior work
◮ Histogram clustering wrt. χ2 distance [20]
◮ Histogram clustering wrt. Bhattacharyya distance [26, 33]
◮ Histogram clustering wrt. Kullback-Leibler distance as
Bregman k-means clustering [1]
◮ Jeffreys frequency centroid [38] (Newton numerical
optimization)
◮ Jeffreys frequency centroid as an equivalent symmetrized Bregman centroid [36]
◮ Mixed Bregman clustering [37]
◮ Smooth family of KL symmetrized centroids including
Jensen-Shannon centroids and Jeffreys centroids in limit
case [29]
Jeffreys positive centroid
c = \arg\min_{x ∈ R^d_+} J(H, x) = \arg\min_{x ∈ R^d_+} \sum_{j=1}^n π_j J(h_j, x).

Theorem (Theorem 1)
The Jeffreys positive centroid c = (c^1, ..., c^d) of a set \{h_1, ..., h_n\} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:

c^i = \frac{a^i}{W\left(\frac{a^i}{g^i} e\right)},

where a^i = \sum_{j=1}^n π_j h^i_j denotes the coordinate-wise weighted arithmetic mean and g^i = \prod_{j=1}^n (h^i_j)^{π_j} the coordinate-wise weighted geometric mean.

The Lambert analytic function W [2] satisfies W(x) e^{W(x)} = x for x ≥ 0.
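A sketch of Theorem 1 in code, assuming SciPy's scipy.special.lambertw for the Lambert W function (principal branch); rows of H are positive histograms, weights defaults to uniform, and the function name is illustrative.

import numpy as np
from scipy.special import lambertw

def jeffreys_positive_centroid(H, weights=None):
    # Closed-form Jeffreys positive centroid (Theorem 1): c^i = a^i / W(a^i e / g^i),
    # with a and g the weighted arithmetic and geometric coordinate-wise means.
    H = np.asarray(H, dtype=float)                  # rows: positive histograms h_1, ..., h_n
    n = len(H)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    a = np.einsum("j,ji->i", w, H)                  # arithmetic mean per bin
    g = np.exp(np.einsum("j,ji->i", w, np.log(H)))  # geometric mean per bin
    return a / lambertw(a * np.e / g).real          # principal branch W_0; real for positive input

Normalizing this centroid, c / c.sum(), gives the approximation \tilde{c}' of the Jeffreys frequency centroid studied next, with a guaranteed approximation factor of at most 1/w_c.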
Jeffreys positive centroid (proof)
\min_x \sum_{j=1}^n π_j J(h_j, x)

= \min_x \sum_{j=1}^n π_j \sum_{i=1}^d (h^i_j - x^i)(\log h^i_j - \log x^i)

≡ \min_x \sum_{i=1}^d \sum_{j=1}^n π_j (x^i \log x^i - x^i \log h^i_j - h^i_j \log x^i)

= \min_x \sum_{i=1}^d \left( x^i \log x^i - x^i \log \underbrace{\prod_{j=1}^n (h^i_j)^{π_j}}_{g^i} - \underbrace{\sum_{j=1}^n π_j h^i_j}_{a^i} \log x^i \right)

≡ \min_x \sum_{i=1}^d \left( x^i \log \frac{x^i}{g^i} - a^i \log x^i \right)
Jeffreys positive centroid (proof)
Minimize coordinate-wise:

\min_x \; x \log \frac{x}{g} - a \log x

Setting the derivative to zero, we solve

\log \frac{x}{g} + 1 - \frac{a}{x} = 0

and get

x = \frac{a}{W\left(\frac{a}{g} e\right)}
Jeffreys frequency centroid: A guaranteed approximation
\tilde{c} = \arg\min_{x ∈ Δ_d} \sum_{j=1}^n π_j J(\tilde{h}_j, x).

Relaxing x from the probability simplex Δ_d to R^d_+, we get

\tilde{c}' = \frac{c}{w_c}, \quad c^i = \frac{a^i}{W\left(\frac{a^i}{g^i} e\right)}, \quad w_c = \sum_i c^i.

Lemma (Lemma 1)
The cumulative sum w_c of the bin values of the Jeffreys positive centroid c of a set of frequency histograms is less than or equal to one: 0 < w_c ≤ 1.
Proof of Lemma 1
From Theorem 1:

w_c = \sum_{i=1}^d c^i = \sum_{i=1}^d \frac{a^i}{W\left(\frac{a^i}{g^i} e\right)}.

By the arithmetic-geometric mean inequality, a^i ≥ g^i.
Therefore W\left(\frac{a^i}{g^i} e\right) ≥ W(e) = 1, and c^i ≤ a^i. Thus

w_c = \sum_{i=1}^d c^i ≤ \sum_{i=1}^d a^i = 1,

the last equality holding since the \tilde{h}_j are frequency histograms and \sum_j π_j = 1.
Lemma 2
Lemma (Lemma 2)
For any positive histogram x and frequency histogram \tilde{h}, we have J(x, \tilde{h}) = J(\tilde{x}, \tilde{h}) + (w_x - 1)(KL(\tilde{x} : \tilde{h}) + \log w_x), where w_x denotes the normalization factor (w_x = \sum_{i=1}^d x^i).

By linearity, for a set \tilde{H} of frequency histograms:

J(x, \tilde{H}) = J(\tilde{x}, \tilde{H}) + (w_x - 1)(KL(\tilde{x} : \tilde{H}) + \log w_x),

where J(x, \tilde{H}) = \sum_{j=1}^n π_j J(x, \tilde{h}_j) and KL(\tilde{x} : \tilde{H}) = \sum_{j=1}^n π_j KL(\tilde{x} : \tilde{h}_j) (with \sum_{j=1}^n π_j = 1).
Proof of Lemma 2
Write x^i = w_x \tilde{x}^i. Then

J(x, \tilde{h}) = \sum_{i=1}^d (w_x \tilde{x}^i - \tilde{h}^i) \log \frac{w_x \tilde{x}^i}{\tilde{h}^i}

= \sum_{i=1}^d \left( w_x \tilde{x}^i \log \frac{\tilde{x}^i}{\tilde{h}^i} + w_x \tilde{x}^i \log w_x + \tilde{h}^i \log \frac{\tilde{h}^i}{\tilde{x}^i} - \tilde{h}^i \log w_x \right)

= (w_x - 1) \log w_x + J(\tilde{x}, \tilde{h}) + (w_x - 1) \sum_{i=1}^d \tilde{x}^i \log \frac{\tilde{x}^i}{\tilde{h}^i}

= J(\tilde{x}, \tilde{h}) + (w_x - 1)(KL(\tilde{x} : \tilde{h}) + \log w_x),

since \sum_{i=1}^d \tilde{h}^i = \sum_{i=1}^d \tilde{x}^i = 1.
Guaranteed approximation of ˜c
Theorem (Theorem 2)
Let \tilde{c} denote the Jeffreys frequency centroid and \tilde{c}' = \frac{c}{w_c} the normalized Jeffreys positive centroid. Then the approximation factor

α_{\tilde{c}'} = \frac{J(\tilde{c}', \tilde{H})}{J(\tilde{c}, \tilde{H})}

is such that 1 ≤ α_{\tilde{c}'} ≤ \frac{1}{w_c} (with w_c ≤ 1).
Proof of Theorem 2
J(c, \tilde{H}) ≤ J(\tilde{c}, \tilde{H}) ≤ J(\tilde{c}', \tilde{H}).

From Lemma 2, since J(\tilde{c}', \tilde{H}) = J(c, \tilde{H}) + (1 - w_c)(KL(\tilde{c}' : \tilde{H}) + \log w_c) and J(c, \tilde{H}) ≤ J(\tilde{c}, \tilde{H}),

1 ≤ α_{\tilde{c}'} ≤ 1 + \frac{(1 - w_c)(KL(\tilde{c}' : \tilde{H}) + \log w_c)}{J(\tilde{c}, \tilde{H})}.

Since KL(\tilde{c}' : \tilde{H}) = \frac{1}{w_c} KL(c : \tilde{H}) - \log w_c,

α_{\tilde{c}'} ≤ 1 + \frac{(1 - w_c) KL(c : \tilde{H})}{w_c \, J(\tilde{c}, \tilde{H})}.

Since J(\tilde{c}, \tilde{H}) ≥ J(c, \tilde{H}) and KL(c : \tilde{H}) ≤ J(c, \tilde{H}), we get

α_{\tilde{c}'} ≤ \frac{1}{w_c}.

When w_c = 1, the bound is tight.
In practice...
c is available in closed form → compute w_c, KL(c : \tilde{H}) and J(c : \tilde{H}).
Bound the approximation factor α_{\tilde{c}'} as:

α_{\tilde{c}'} ≤ 1 + \left(\frac{1}{w_c} - 1\right) \frac{KL(c : \tilde{H})}{J(c : \tilde{H})} ≤ \frac{1}{w_c}
Fine approximation
From [38, 36], minimizing the Jeffreys frequency centroid objective is equivalent to:

\tilde{c} = \arg\min_{\tilde{x} ∈ Δ_d} KL(\tilde{a} : \tilde{x}) + KL(\tilde{x} : \tilde{g})

The Lagrangian function enforcing \sum_i c^i = 1 gives:

\log \frac{\tilde{c}^i}{\tilde{g}^i} + 1 - \frac{\tilde{a}^i}{\tilde{c}^i} + λ = 0

\tilde{c}^i = \frac{\tilde{a}^i}{W\left(\frac{\tilde{a}^i}{\tilde{g}^i} e^{λ+1}\right)}

with λ = -KL(\tilde{c} : \tilde{g}) ≤ 0.
Fine approximation: Bisection search
c^i ≤ 1 ⇒ c^i = \frac{\tilde{a}^i}{W\left(\frac{\tilde{a}^i}{\tilde{g}^i} e^{λ+1}\right)} ≤ 1

⇒ λ ≥ \log(e^{\tilde{a}^i} \tilde{g}^i) - 1 \; ∀ i, \quad λ ∈ \left[\max_i \log(e^{\tilde{a}^i} \tilde{g}^i) - 1, \; 0\right]

s(λ) = \sum_i c^i(λ) = \sum_{i=1}^d \frac{\tilde{a}^i}{W\left(\frac{\tilde{a}^i}{\tilde{g}^i} e^{λ+1}\right)}

The function s is monotonically decreasing, with s(0) ≤ 1.
→ Bisection search for s(λ^*) ≃ 1, to arbitrary precision.
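A sketch of the bisection search, again assuming scipy.special.lambertw; the bracket for λ uses the lower bound max_i log(e^{ã_i} g̃_i) - 1 from the slide, and s(λ) is evaluated vectorially (the function name is illustrative).

import numpy as np
from scipy.special import lambertw

def jeffreys_frequency_centroid(H, weights=None, tol=1e-10):
    # Bisection on the Lagrange multiplier lambda: c_i(lambda) = a_i / W(a_i e^{lambda+1} / g_i),
    # with lambda* such that s(lambda*) = sum_i c_i(lambda*) = 1.
    H = np.asarray(H, dtype=float)
    H = H / H.sum(axis=1, keepdims=True)                    # work with frequency histograms
    n = len(H)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    a = np.einsum("j,ji->i", w, H)                          # weighted arithmetic mean
    g = np.exp(np.einsum("j,ji->i", w, np.log(H)))          # weighted geometric mean

    def s(lam):
        return float(np.sum(a / lambertw(a * np.exp(lam + 1.0) / g).real))

    lo, hi = np.max(a + np.log(g)) - 1.0, 0.0               # bracket: s(lo) >= 1 >= s(0)
    while hi - lo > tol:                                    # s is monotonically decreasing
        mid = 0.5 * (lo + hi)
        if s(mid) >= 1.0:
            lo = mid
        else:
            hi = mid
    c = a / lambertw(a * np.exp(0.5 * (lo + hi) + 1.0) / g).real
    return c / c.sum()                                      # renormalize for numerical safety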
Experiments: Caltech-256
Caltech-256 [16]: 30607 images labeled into 256 categories (256
Jeffreys centroids).
Arbitrary floating-point precision: http://www.apfloat.org/
Veldhuis' approximation: \tilde{c}'' = \frac{\tilde{a} + \tilde{g}}{2}

     α_c (optimal positive) | α_{\tilde{c}'} (normalized approx.) | w_c ≤ 1 (normalizing coeff.) | α_{\tilde{c}''} (Veldhuis' approx.)
avg  0.9648680345638155     | 1.0002205080964255                  | 0.9338228644308926           | 1.065590178484613
min  0.906414219584823      | 1.0000005079528809                  | 0.8342819488534723           | 1.0027707382095195
max  0.9956399220678585     | 1.0000031489541772                  | 0.9931975105809021           | 1.3582296675397754
Experiments: Synthetic data-sets
Random binary histograms:

α = \frac{J(\tilde{c}')}{J(\tilde{c})} ≥ 1

Performance: \bar{α} ≈ 1.0000009, α_max ≈ 1.00181506, α_min = 1.000000.
Can we prove a better worst-case upper bound on the performance?
Summary and conclusion
◮ Jeffreys positive centroid c in closed form.
◮ The normalized Jeffreys positive centroid \tilde{c}' is within an approximation factor \frac{1}{w_c}.
◮ Bisection search for arbitrarily fine approximations of \tilde{c}.
→ Variational Jeffreys k-means clustering
Other Kullback-Leibler symmetrizations:
◮ Jensen-Shannon divergence [19]
◮ Chernoff divergence [9]
◮ Family of symmetrized centroids including Jensen-Shannon
and Jeffreys centroids [29]
◮ Infinitely many symmetrizations using quasi-arithmetic means
Clustering: A never-ending story...
An old problem with many recent new results!
The 21st century is witnessing the revolution of data science.
Computational Information Geometry
[32] [31]
http://www.springer.com/engineering/signals/book/978-3-642-30231-2
http://www.sonycsl.co.jp/person/nielsen/infogeo/MIG/MIGBOOKWEB/
http://www.springer.com/engineering/signals/book/978-3-319-05316-5
http://www.sonycsl.co.jp/person/nielsen/infogeo/GTI/GeometricTheoryOfInformation.html
Textbooks: Visual computing
[27] [28]
http://www.sonycsl.co.jp/person/nielsen/visualcomputing/
http://www.lix.polytechnique.fr/~nielsen/JavaProgramming/
Bibliography I
Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh.
Clustering with Bregman divergences.
Journal of Machine Learning Research, 6:1705–1749, 2005.
D. A. Barry, P. J. Culligan-Hensley, and S. J. Barry.
Real values of the W -function.
ACM Trans. Math. Softw., 21(2):161–171, June 1995.
Richard Bellman.
A note on cluster analysis and dynamic programming.
Mathematical Biosciences, 18(3-4):311 – 312, 1973.
Anup Bhattacharya, Ragesh Jaiswal, and Nir Ailon.
A tight lower bound instance for k-means++ in constant dimension.
CoRR, abs/1401.2912, 2014.
Brigitte Bigi.
Using Kullback-Leibler distance for text categorization.
In Proceedings of the 25th European conference on IR research (ECIR), ECIR’03, pages 305–319, Berlin,
Heidelberg, 2003. Springer-Verlag.
Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock.
Bregman Voronoi diagrams.
Discrete Computational Geometry, 44(2):281–307, September 2010.
J. Cavanaugh.
Unifying the derivations for the Akaike and corrected Akaike information criteria.
Statistics & Probability Letters, 33(2):201–208, April 1997.
Bibliography II
Vijay Chandrasekhar, Gabriel Takacs, David M. Chen, Sam S. Tsai, Yuriy A. Reznik, Radek Grzeszczuk, and
Bernd Girod.
Compressed histogram of gradients: A low-bitrate descriptor.
International Journal of Computer Vision, 96(3):384–399, 2012.
Herman Chernoff.
A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.
Annals of Mathematical Statistics, 23:493–507, 1952.
Franklin C. Crow.
Summed-area tables for texture mapping.
In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques,
SIGGRAPH ’84, pages 207–212, New York, NY, USA, 1984. ACM.
G. Csurka, C. Bray, C. Dance, and L. Fan.
Visual categorization with bags of keypoints.
Workshop on Statistical Learning in Computer Vision (ECCV), pages 1–22, 2004.
Sanjoy Dasgupta.
The hardness of k-means clustering.
Technical Report CS2008-0916.
Walter D Fisher.
On grouping for maximum homogeneity.
Journal of the American Statistical Association, 53(284):789–798, 1958.
Edward W. Forgy.
Cluster analysis of multivariate data: efficiency vs interpretability of classifications.
Biometrics, 1965.
Bibliography III
Vincent Garcia and Frank Nielsen.
Simplification and hierarchical representations of mixtures of exponential families.
Signal Processing, 90(12):3197–3212, 2010.
G. Griffin, A. Holub, and P. Perona.
Caltech-256 object category dataset.
Technical Report 7694, California Institute of Technology, 2007.
John A. Hartigan.
Clustering Algorithms.
John Wiley & Sons, Inc., New York, NY, USA, 1975.
Harold Jeffreys.
An invariant form for the prior probability in estimation problems.
Proceedings of the Royal Society of London, 186(1007):453–461, March 1946.
Jianhua Lin.
Divergence measures based on the Shannon entropy.
IEEE Transactions on Information Theory, 37:145–151, 1991.
Huan Liu and Rudy Setiono.
Chi2: Feature selection and discretization of numeric attributes.
In Proceedings of the Seventh International Conference on Tools with Artificial Intelligence (TAI), pages
88–, Washington, DC, USA, 1995. IEEE Computer Society.
Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.
Shape retrieval using hierarchical total Bregman soft clustering.
IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2407–2419, 2012.
Bibliography IV
Stuart P. Lloyd.
Least squares quantization in PCM.
Technical report, Bell Laboratories, 1957.
James B. MacQueen.
Some methods of classification and analysis of multivariate observations.
In L. M. Le Cam and J. Neyman, editors, Proceedings of the Fifth Berkeley Symposium on Mathematical
Statistics and Probability. University of California Press, Berkeley, CA, USA, 1967.
Meena Mahajan, Prajakta Nimbhorkar, and Kasturi R. Varadarajan.
The planar k-means problem is NP-hard.
Theoretical Computer Science, 442:13–21, 2012.
Nimrod Megiddo and Kenneth J Supowit.
On the complexity of some common geometric location problems.
SIAM journal on computing, 13(1):182–196, 1984.
Max Mignotte.
Segmentation by fusion of histogram-based k-means clusters in different color spaces.
IEEE Transactions on Image Processing (TIP), 17(5):780–787, 2008.
Frank Nielsen.
Visual Computing: Geometry, Graphics, and Vision.
Charles River Media / Thomson Delmar Learning, 2005.
Frank Nielsen.
A Concise and Practical Introduction to Programming Algorithms in Java.
Undergraduate Topics in Computer Science (UTiCS). Springer Verlag, 2009.
http://www.springer.com/computer/programming/book/978-1-84882-338-9.
Bibliography V
Frank Nielsen.
A family of statistical symmetric divergences based on Jensen’s inequality.
CoRR, abs/1009.4004, 2010.
Frank Nielsen.
k-mle: A fast algorithm for learning statistical mixture models.
CoRR, abs/1203.5181, 2012.
Frank Nielsen.
Geometric Theory of Information.
Springer, 2014.
Frank Nielsen and Rajendra Bhatia, editors.
Matrix Information Geometry (Revised Invited Papers). Springer, 2012.
Frank Nielsen and Sylvain Boltz.
The Burbea-Rao and Bhattacharyya centroids.
IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.
Frank Nielsen and Richard Nock.
Clustering multivariate normal distributions.
In Emerging Trends in Visual Computing, pages 164–174. Springer, 2009.
Frank Nielsen and Richard Nock.
Sided and symmetrized Bregman centroids.
IEEE Transactions on Information Theory, 55(6):2882–2904, 2009.
Frank Nielsen and Richard Nock.
Sided and symmetrized Bregman centroids.
IEEE Transactions on Information Theory, 55(6):2048–2059, June 2009.
Bibliography VI
Richard Nock, Panu Luosto, and Jyrki Kivinen.
Mixed Bregman clustering with approximation guarantees.
In Proceedings of the European conference on Machine Learning and Knowledge Discovery in Databases,
pages 154–169, Berlin, Heidelberg, 2008. Springer-Verlag.
Raymond N. J. Veldhuis.
The centroid of the symmetrical Kullback-Leibler distance.
IEEE signal processing letters, 9(3):96–99, March 2002.
Haizhou Wang and Mingzhou Song.
Ckmeans.1d.dp: Optimal k-means clustering in one dimension by dynamic programming.
R Journal, 3(2), 2011.
Juanying Xie, Shuai Jiang, Weixin Xie, and Xinbo Gao.
An efficient global k-means clustering algorithm.
Journal of computers, 6(2), 2011.
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 

Information-theoretic clustering with applications

  • 11. Sampling from a Gaussian Mixture Model
To sample a variate x from a GMM:
◮ Choose a component l according to the weight distribution w1, ..., wk,
◮ Draw a variate x according to N(µl, Σl).
→ Sampling is a doubly stochastic process:
◮ throw a biased dice with k faces to choose the component: l ∼ Multinomial(w1, ..., wk) (normalized histogram),
◮ then draw at random a variate x from the l-th component: x ∼ Normal(µl, Σl),
x = µl + Cl z, with the Cholesky factorization Σl = Cl Cl^⊤ and z = [z1 ... zd]^⊤ a standard normal random vector: z_i = \sqrt{-2 \log U_1} \cos(2\pi U_2).
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 11/68
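A minimal sketch of this doubly stochastic sampling process, assuming NumPy and illustrative names (weights, mus, Sigmas are not from the slides). The explicit Box-Muller formula above is replaced by the library's standard normal generator.

```python
import numpy as np

def sample_gmm(weights, mus, Sigmas, n_samples, seed=None):
    """Draw n_samples variates from a GMM; weights must sum to 1."""
    rng = np.random.default_rng(seed)
    k, d = len(weights), len(mus[0])
    # 1) throw the biased k-faced dice to choose the component of each sample
    labels = rng.choice(k, size=n_samples, p=weights)
    # 2) draw x = mu_l + C_l z with Sigma_l = C_l C_l^T (Cholesky factor)
    chols = [np.linalg.cholesky(S) for S in Sigmas]
    z = rng.standard_normal((n_samples, d))
    x = np.array([mus[l] + chols[l] @ z[i] for i, l in enumerate(labels)])
    return x, labels
```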
  • 12. GMMs: Generative models, sampling [15] A pixel (x,y,R,G,B) : A data point in 5D c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 12/68
  • 13. Model selection: Choosing k
The more parameters, the better the fit, but the model overfits.
Solution: penalized likelihood to maximize.
Bayesian Information Criterion (BIC): \max_\theta\; l(\theta) - \frac{\#\theta}{2} \log n
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 13/68
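A hedged sketch of BIC-based selection of k, here using scikit-learn's GaussianMixture (an assumed dependency, not used in the slides). Note that sklearn reports BIC as -2 log-likelihood + #params log n, so minimizing it is equivalent, up to a factor of -2, to maximizing the penalized likelihood above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_max=10, seed=0):
    """Fit GMMs with k = 1..k_max and return the k minimizing BIC."""
    bics = []
    for k in range(1, k_max + 1):
        gm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        bics.append(gm.bic(X))
    return int(np.argmin(bics)) + 1, bics
```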
  • 14. Clustering analysis ◮ Parametric clustering (k given, cost-based) ◮ center-based hard clustering (k-means, k-medians, min diameter, single linkage, etc.) ◮ model-based soft clustering ◮ Non-parametric clustering (k adjusts with data) ◮ Kernel density estimators (mean shift) ◮ Dirichlet process mixture models ◮ Hierarchical clustering ◮ Graph-based clustering (normalized cut) ◮ Affinity propagation ◮ Subspace/manifold clustering ◮ Etc! Cluster validation (distance between clustering, confusion matrix, NMI, Rand index, etc.) c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 14/68
  • 15. Part II Optimal 1D contiguous clustering c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 15/68
  • 16. Hard clustering: Partitioning the data set
◮ Partition X = {x1, ..., xn} ⊂ \mathcal{X} into k pairwise disjoint clusters C1 ⊂ X, ..., Ck ⊂ X: X = \bigcup_{i=1}^k C_i
◮ Center-based hard clustering (center = prototype): k-means [1], k-medians, k-center, ℓr-center [21], etc.
◮ Model-based hard clustering: statistical mixtures maximizing the complete likelihood (prototype = model parameter).
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 16/68
  • 17. k-means clustering
Minimize the sum of intra-cluster variances: \min_{p_1,...,p_k} \sum_{i=1}^n \min_{j=1}^k \|x_i - p_j\|^2
◮ k-means: NP-hard when d > 1 and k > 1 [24, 12]. k-medians and k-centers are also NP-hard [25] (1984)
◮ Global heuristics (for initialization): Forgy [14], global k-means [40], k-means++ [4], etc.; local search heuristics: incremental Voronoi MacQueen [23], batched Voronoi Lloyd [22], single-point swap Hartigan [17], etc.
◮ In 1D, k-means is polynomial [3, 39]: O(n^2 k).
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 17/68
  • 18. Euclidean 1D k-means
◮ 1D k-means [13] has a contiguous partition.
◮ Solved by enumerating all \binom{n-1}{k-1} contiguous partitions in 1D (1958). Better than the Stirling numbers of the second kind S(n, k) that count all partitions.
◮ Polynomial in time O(n^2 k) using Dynamic Programming (DP) [3] (sketched in 1973 in two pages).
◮ R package Ckmeans.1d.dp [39] (2011).
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 18/68
  • 19. (Sketch of the) DP solution [3] (Bellman, 1973) c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 19/68
  • 20. Interval clustering: Structure
Sort X ⊂ \mathcal{X} with respect to the total order < on \mathcal{X} in O(n log n).
Output represented by:
◮ k intervals I_i = [x_{l_i}, x_{r_i}] such that C_i = I_i ∩ X,
◮ or better, k − 1 delimiters l_i (i ∈ {2, ..., k}), since r_i = l_{i+1} − 1 (for i < k, with r_k = n) and l_1 = 1:
\underbrace{[x_1, ..., x_{l_2-1}]}_{C_1}\; \underbrace{[x_{l_2}, ..., x_{l_3-1}]}_{C_2}\; ...\; \underbrace{[x_{l_k}, ..., x_n]}_{C_k}
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 20/68
  • 21. Objective function for interval clustering
Scalars x_1 < ... < x_n are partitioned contiguously into k clusters: C_1 < ... < C_k.
Clustering objective function: \min\; e_k(X) = \bigoplus_{j=1}^k e_1(C_j)
e_1(·): intra-cluster cost/energy
⊕: inter-cluster cost/energy (commutative, associative)
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 21/68
  • 22. Examples of objective functions
In arbitrary dimension \mathcal{X} = R^d:
◮ ℓr-clustering (r ≥ 1): ⊕ = +, e_1(C_j) = \min_{p \in \mathcal{X}} \sum_{x \in C_j} d(x, p)^r
(the argmin/prototype p_j is the same whether we take the power \frac{1}{r} of the sum or not)
Euclidean ℓr-clustering: r = 1 medians, r = 2 means.
◮ k-center (limit r → ∞): ⊕ = max, e_1(C_j) = \min_{p \in \mathcal{X}} \max_{x \in C_j} d(x, p)
◮ Discrete clustering: the search space in the min is C_j instead of \mathcal{X}.
Note that in 1D, the ℓs-norm distance is always d(p, q) = |p − q|, independent of s ≥ 1.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 22/68
  • 23. Optimal interval clustering by Dynamic Programming
X_{j,i} = {x_j, ..., x_i} (j ≤ i), X_i = X_{1,i} = {x_1, ..., x_i}
E = [e_{i,m}]: n × k cost matrix, O(n × k) memory, with e_{i,m} = e_m(X_i).
Optimality equation: e_{i,m} = \min_{m \le j \le i} \{ e_{j-1,m-1} \oplus e_1(X_{j,i}) \}
Associative/commutative operator ⊕ (+ or max). Initialize with e_{i,1} = e_1(X_i).
E is computed column by column from left to right, each column from bottom to top.
The cost of the best clustering is at e_{n,k}.
Time: n × k × O(n) × T_1(n) = O(n^2 k T_1(n)), O(nk) memory.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 23/68
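A minimal sketch of this generic DP, assuming 0-based indices, ⊕ = +, and a user-supplied intra-cluster cost e1(j, i) over the sorted points x[j..i]; all names are illustrative, not from the slides.

```python
import numpy as np

def interval_clustering_dp(n, k, e1):
    """Optimal contiguous clustering of n sorted points into k intervals."""
    E = np.full((n, k), np.inf)      # E[i, m] = cost of clustering x[0..i] into m+1 intervals
    S = np.zeros((n, k), dtype=int)  # argmin: left index of the last interval
    for i in range(n):
        E[i, 0] = e1(0, i)
    for m in range(1, k):
        for i in range(m, n):
            for j in range(m, i + 1):
                cost = E[j - 1, m - 1] + e1(j, i)
                if cost < E[i, m]:
                    E[i, m], S[i, m] = cost, j
    # backtrack the k-1 left delimiters from the auxiliary argmin matrix
    splits, i = [], n - 1
    for m in range(k - 1, 0, -1):
        splits.append(S[i, m])
        i = S[i, m] - 1
    return E[n - 1, k - 1], splits[::-1]
```

With an O(1) interval cost (see the SAT sketch further below), the total time is O(n^2 k) as stated.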
  • 24. Retrieving the solution: Backtracking
Use an auxiliary matrix S = [s_{i,j}] for storing the argmin. Backtrack in O(k) time.
◮ The left index l_k of C_k is stored at s_{n,k}: l_k = s_{n,k}.
◮ Iteratively retrieve the previous left interval indexes: l_{j-1} = s_{l_j − 1, j−1} for j = k, ..., 2.
Note that l_j − 1 = n − \sum_{l=j}^k n_l and l_j − 1 = \sum_{l=1}^{j-1} n_l.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 24/68
  • 25. Optimizing time with a Look Up Table (LUT)
Save time when computing e_1(X_{j,i}), since we perform n × k × O(n) such computations.
Look Up Table (LUT): add an extra n × n matrix E_1 with E_1[j][i] = e_1(X_{j,i}).
Built in O(n^2 T_1(n)); the DP then runs in O(n^2 k).
→ quadratic amount of memory (prohibitive for n > 10000...)
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 25/68
  • 26. DP solver with cluster size constraints
n_i^- and n_i^+: lower/upper bound constraints on n_i = |C_i|, with \sum_{l=1}^k n_l^- ≤ n and \sum_{l=1}^k n_l^+ ≥ n.
When there are no constraints: add the dummy constraints n_i^- = 1 and n_i^+ = n − k + 1.
n_m = |C_m| = i − j + 1 must satisfy n_m^- ≤ n_m ≤ n_m^+
→ j ≤ i + 1 − n_m^- and j ≥ i + 1 − n_m^+.
e_{i,m} = \min_{\max\{1 + \sum_{l=1}^{m-1} n_l^-,\; i+1-n_m^+\} \;\le\; j \;\le\; i+1-n_m^-} \{ e_{j-1,m-1} \oplus e_1(X_{j,i}) \}
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 26/68
  • 27. Model selection from the DP table
m(k) = \frac{e_k(X)}{e_1(X)} decreases with k and reaches its minimum when k = n.
Model selection: trade-off, choose the best model among all the models with k ∈ [1, n].
Regularized objective function: e'_k(X) = e_k(X) + f(k), with f(k) related to the model complexity.
Computing the DP table once for k = n covers all k = n, ..., 1 and avoids redundant computations.
Then evaluate the criterion on the last row (indexed by n) and choose the argmin of e'_k.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 27/68
  • 28. A Voronoi cell condition for DP optimality
elements → interval clusters → prototypes; interval clusters ← prototypes
Partition X with respect to P = {p_1, ..., p_k}. Voronoi cell:
V(p_j) = {x ∈ \mathcal{X} : d^r(x, p_j) ≤ d^r(x, p_l)\; ∀ l ∈ {1, ..., k}}.
Since x^r is a monotonically increasing function on R^+, this is equivalent to
V'(p_j) = {x ∈ \mathcal{X} : d(x, p_j) < d(x, p_l)}.
DP guarantees an optimal clustering when, for every P, V'(p_j) is an interval.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 28/68
  • 29. Optimal 1D Bregman k-means
Bregman information [1] e_1 (generalizes the cluster variance):
e_1(C_j) = \min_{p_j} \sum_{x_l \in C_j} w_l B_F(x_l : p_j). (1)
Expressed as [35]:
e_1(C_j) = \sum_{x_l \in C_j} w_l (p_j F'(p_j) − F(p_j)) + \sum_{x_l \in C_j} w_l F(x_l) − F'(p_j) \sum_{x_l \in C_j} w_l x_l
Preprocess using Summed Area Tables [10] (SATs): S_1(j) = \sum_{l=1}^j w_l, S_2(j) = \sum_{l=1}^j w_l x_l, and S_3(j) = \sum_{l=1}^j w_l F(x_l), in O(n) time.
Then evaluate the Bregman information e_1(X_{j,i}) in constant time O(1): for example, \sum_{l=j}^i w_l F(x_l) = S_3(i) − S_3(j − 1) with S_3(0) = 0.
Bregman Voronoi diagrams have connected cells [6], thus DP yields an optimal interval clustering.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 29/68
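A sketch of the Summed Area Table trick under the assumptions above: after O(n) prefix sums, the Bregman information of any interval x[j..i] costs O(1) (here using the identity e_1(C) = Σ w_l F(x_l) − W F(μ) with μ the weighted mean). The generator F(x) = x^2 recovers the k-means interval variance; names are illustrative.

```python
import numpy as np

def make_interval_cost(x, w, F):
    """Return e1(j, i): Bregman information of sorted x[j..i] (0-based, inclusive)."""
    S1 = np.concatenate(([0.0], np.cumsum(w)))          # prefix sums of weights
    S2 = np.concatenate(([0.0], np.cumsum(w * x)))      # weighted prefix sums
    S3 = np.concatenate(([0.0], np.cumsum(w * F(x))))   # weighted prefix sums of F
    def e1(j, i):
        W  = S1[i + 1] - S1[j]
        mu = (S2[i + 1] - S2[j]) / W
        return (S3[i + 1] - S3[j]) - W * F(mu)
    return e1

# Example: exact 1D k-means with the DP sketch above, interval cost in O(1).
# x = np.sort(x); e1 = make_interval_cost(x, np.ones_like(x), lambda t: t * t)
```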
  • 30. Exponential families in statistics
Family of probability distributions: F = {p_F(x; θ) : θ ∈ Θ}
Exponential families [30]: p_F(x|θ) = \exp(t(x)θ − F(θ) + k(x)).
For example, the univariate Rayleigh R(σ): t(x) = x^2, k(x) = \log x, θ = −\frac{1}{2σ^2}, η = −\frac{1}{θ}, F(θ) = \log(−\frac{1}{2θ}) and F^*(η) = −1 + \log\frac{2}{η}.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 30/68
  • 31. Uniorder exponential families: MLE
Maximum Likelihood Estimator (MLE) [30]:
e_1(X_{j,i}) = \hat{l}(x_j, ..., x_i) = F^*(\hat{η}_{j,i}) + \frac{1}{i−j+1} \sum_{l=j}^i k(x_l), with \hat{η}_{j,i} = \frac{1}{i−j+1} \sum_{l=j}^i t(x_l).
By making the change of variable y_l = t(x_l), and dropping the k(x_l) terms that are constant for any clustering, we get
e_1(X_{j,i}) ≡ F^*\left(\frac{1}{i−j+1} \sum_{l=j}^i y_l\right)
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 31/68
  • 32. Hard clustering for learning statistical mixtures
Expectation-Maximization learns monotonically from an initialization by maximizing the incomplete log-likelihood.
Mixture maximizing the complete log-likelihood:
l_c(X; L, Ω) = \sum_{i=1}^n \log(α_{l_i} p(x_i; θ_{l_i})), where L = {l_i}_i are the hidden labels.
\max l_c ≡ \min_{θ_1,...,θ_k} \sum_{i=1}^n \min_{j=1}^k (−\log p(x_i; θ_j) − \log α_j).
For fixed α, −\log p_F(x; θ) amounts to a dual Bregman divergence [1].
Running Bregman k-means with the DP yields the optimal partition, since additively-weighted Bregman Voronoi diagrams are interval [6].
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 32/68
  • 33. Hard clustering for learning statistical mixtures
Location families: F = {f(x; µ) = \frac{1}{σ} f_0(\frac{x − µ}{σ}), µ ∈ R}, with f_0 a standard density and σ > 0 fixed.
Cauchy or Laplacian families have density graphs intersecting in exactly one point
→ singly-connected Maximum Likelihood Voronoi cells.
Model selection: Akaike Information Criterion [7] (AIC):
AIC(x_1, ..., x_n) = −2 l(x_1, ..., x_n) + 2k + \frac{2k(k + 1)}{n − k − 1}
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 33/68
  • 34. Experiments with: Gaussian Mixture Models (GMMs) gmm1 score = −3.0754314021966658 (Euclidean k-means) gmm2 score = −3.038795325884112 (Bregman k-means, better) c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 34/68
  • 35. 1-mean (centroid): O(n) time
\min_p \sum_{i=1}^n (x_i − p)^2
D(x, p) = (x − p)^2, D'(x, p) = 2(x − p), D''(x, p) = 2
Convex optimization (existence and uniqueness of the solution):
\sum_{i=1}^n D'(x_i, p) = 0 ⇒ \sum_{i=1}^n x_i − np = 0
Center of mass p = \frac{1}{n} \sum_{i=1}^n x_i (barycenter)
Extends to Bregman divergences: D_F(x, p) = F(x) − F(p) − (x − p) F'(p)
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 35/68
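A small numerical check of the claim above, under the assumption that SciPy is available and with illustrative names: for a convex generator F, the minimizer of the sum of (right-sided) Bregman divergences is the arithmetic mean, whatever F is.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def bregman_div(F, dF, x, p):
    """B_F(x : p) = F(x) - F(p) - (x - p) F'(p)."""
    return F(x) - F(p) - (x - p) * dF(p)

x = np.array([0.5, 1.0, 2.0, 4.0])
F, dF = lambda t: t * np.log(t), lambda t: np.log(t) + 1.0   # Shannon-type generator on R+
res = minimize_scalar(lambda p: bregman_div(F, dF, x, p).sum(),
                      bounds=(1e-6, 10.0), method="bounded")
print(res.x, x.mean())   # both are approximately the arithmetic mean 1.875
```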
  • 36. 2-means: O(n log n) time
Find the separator x_{l_2} (n − 1 potential locations, from x_2 to x_n):
\min_{l_2} \{ e_1(C_1) + e_1(C_2) \}
Browse from left to right, l_2 = 2, ..., n. Update the cost from E_2(l) to E_2(l + 1) in constant time (SATs also give O(1)):
E_2(l) = e_2(x_1 ... x_{l−1} \,|\, x_l ... x_n)
µ_1(l + 1) = \frac{(l − 1)µ_1(l) + x_l}{l}, \quad µ_2(l + 1) = \frac{(n − l + 1)µ_2(l) − x_l}{n − l}
v_1(l + 1) = \sum_{i=1}^l (x_i − µ_1(l + 1))^2 = \sum_{i=1}^l x_i^2 − l µ_1^2(l + 1)
∆E_2(l) = \frac{l − 1}{l} (µ_1(l) − x_l)^2 + \frac{n − l + 1}{n − l} (µ_2(l) − x_l)^2
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 36/68
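A hedged sketch of the exact 1D 2-means: sort the points, then scan the n − 1 candidate separators, evaluating each split cost in O(1) from prefix sums (the SAT variant of the update above). Names are illustrative.

```python
import numpy as np

def two_means_1d(x):
    """Exact 1D 2-means: returns (cost, separator index l), left cluster = x[:l]."""
    x = np.sort(np.asarray(x, dtype=float))
    s1 = np.cumsum(x)          # prefix sums
    s2 = np.cumsum(x * x)      # prefix sums of squares
    n = len(x)
    best_cost, best_l = np.inf, None
    for l in range(1, n):      # left = x[:l], right = x[l:]
        left  = s2[l - 1] - s1[l - 1] ** 2 / l
        right = (s2[-1] - s2[l - 1]) - (s1[-1] - s1[l - 1]) ** 2 / (n - l)
        if left + right < best_cost:
            best_cost, best_l = left + right, l
    return best_cost, best_l
```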
  • 37. 2-means: Experiments
Intel Win7 i7-4800, running times:

n         Brute force   SAT     Incremental
300000    155.022       0.010   0.0091
1000000   1814.44       0.018   0.015

Do we need sorting and Ω(n log n) time? (k = 1 is linear time)
Note that MaxGap does not yield the separator (because the centroid is the minimizer of the sum of squared distances).
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 37/68
  • 38. Conclusion
◮ Generic DP for solving interval clustering:
◮ O(n^2 k T_1(n)) time using O(nk) memory
◮ O(n^2 T_1(n)) time using O(n^2) memory (with the LUT)
◮ Refine the DP by adding minimum/maximum cluster size constraints
◮ Model selection from the DP table
◮ Two applications:
◮ 1D Bregman ℓr-clustering; 1D Bregman k-means in O(n^2 k) time using O(nk) memory via Summed Area Tables (SATs)
◮ Mixture learning maximizing the complete likelihood:
◮ for uni-order exponential families, amounts to a dual Bregman k-means on Y = {y_i = t(x_i)}_i
◮ for location families with density graphs intersecting pairwise in one point (Cauchy, Laplacian: not exponential families)
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 38/68
  • 39. Part III Clustering histograms with Jeffreys divergence c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 39/68
  • 40. Why histogram clustering?
Task: classify documents into categories, using the Bag-of-Words (BoW) modeling paradigm [5, 11]:
◮ Define a word dictionary, and
◮ Represent each document by a word-count histogram.
Centroid-based k-means clustering [1]:
◮ Cluster document histograms to learn categories,
◮ Build visual vocabularies by quantizing image features: Compressed Histogram of Gradient descriptors [8].
→ histogram centroids
w_h = \sum_{i=1}^d h^i: cumulative sum of the bin values
~ : normalization operator
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 40/68
  • 41. Why Jeffreys divergence?
Distance between two frequency histograms p̃ and q̃: the Kullback-Leibler divergence or relative entropy:
KL(p̃ : q̃) = H^×(p̃ : q̃) − H(p̃),
H^×(p̃ : q̃) = \sum_{i=1}^d p̃^i \log \frac{1}{q̃^i} (cross-entropy),
H(p̃) = H^×(p̃ : p̃) = \sum_{i=1}^d p̃^i \log \frac{1}{p̃^i} (Shannon entropy).
→ expected extra number of bits per datum that must be transmitted when using the "wrong" distribution q̃ instead of the true distribution p̃.
p̃ is hidden by nature (and hypothesized), q̃ is estimated.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 41/68
  • 42. Why Jeffreys divergence?
When clustering histograms, all histograms play the same role → Jeffreys [18] divergence:
J(p, q) = KL(p : q) + KL(q : p) = \sum_{i=1}^d (p^i − q^i) \log \frac{p^i}{q^i} = J(q, p)
→ symmetrizes the KL divergence (aka. J-divergence or symmetrical Kullback-Leibler divergence, etc.).
Extended to positive arrays: KL(p : q) = \sum_i p^i \log \frac{p^i}{q^i} + q^i − p^i
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 42/68
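A minimal sketch of the two divergences above for arrays of strictly positive bin values (the extended KL reduces to the usual relative entropy on frequency histograms); names are illustrative.

```python
import numpy as np

def kl(p, q):
    """Extended Kullback-Leibler divergence: sum p log(p/q) + q - p."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q) + q - p)

def jeffreys(p, q):
    """Jeffreys divergence J(p, q) = sum (p_i - q_i) log(p_i / q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum((p - q) * np.log(p / q))
```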
  • 43. Jeffreys centroids: frequency and positive centroids
A set H = {h_1, ..., h_n} of weighted histograms:
c = \arg\min_x \sum_{j=1}^n π_j J(h_j, x), with π_j > 0 the histogram weights, \sum_{j=1}^n π_j = 1.
◮ Jeffreys positive centroid c: c = \arg\min_{x \in R_+^d} \sum_{j=1}^n π_j J(h_j, x),
◮ Jeffreys frequency centroid c̃: c̃ = \arg\min_{x \in ∆_d} \sum_{j=1}^n π_j J(h̃_j, x),
∆_d: the probability (d − 1)-dimensional simplex.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 43/68
  • 44. Prior work
◮ Histogram clustering wrt. the χ2 distance [20]
◮ Histogram clustering wrt. the Bhattacharyya distance [26, 33]
◮ Histogram clustering wrt. the Kullback-Leibler distance as Bregman k-means clustering [1]
◮ Jeffreys frequency centroid [38] (Newton numerical optimization)
◮ Jeffreys frequency centroid as an equivalent symmetrized Bregman centroid [36]
◮ Mixed Bregman clustering [37]
◮ Smooth family of symmetrized KL centroids including Jensen-Shannon centroids and Jeffreys centroids in the limit case [29]
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 44/68
  • 45. Jeffreys positive centroid
c = \arg\min_{x \in R_+^d} J(H, x) = \arg\min_{x \in R_+^d} \sum_{j=1}^n π_j J(h_j, x).
Theorem (Theorem 1). The Jeffreys positive centroid c = (c^1, ..., c^d) of a set {h_1, ..., h_n} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:
c^i = \frac{a^i}{W(\frac{a^i}{g^i} e)},
where a^i = \sum_{j=1}^n π_j h_j^i denotes the coordinate-wise arithmetic weighted mean and g^i = \prod_{j=1}^n (h_j^i)^{π_j} the coordinate-wise geometric weighted mean.
The Lambert analytic function W [2] satisfies W(x) e^{W(x)} = x for x ≥ 0.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 45/68
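A sketch of the closed form of Theorem 1, assuming SciPy for the Lambert W function (principal branch, real part taken) and illustrative variable names.

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_positive_centroid(H, pi):
    """H: (n, d) array of positive histograms; pi: (n,) weights summing to 1."""
    a = np.average(H, axis=0, weights=pi)                    # arithmetic weighted mean per bin
    g = np.exp(np.average(np.log(H), axis=0, weights=pi))    # geometric weighted mean per bin
    return a / np.real(lambertw(a / g * np.e))
```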
  • 46. Jeffreys positive centroid (proof)
\min_x \sum_{j=1}^n π_j J(h_j, x)
= \min_x \sum_{j=1}^n π_j \sum_{i=1}^d (h_j^i − x^i)(\log h_j^i − \log x^i)
≡ \min_x \sum_{i=1}^d \sum_{j=1}^n π_j (x^i \log x^i − x^i \log h_j^i − h_j^i \log x^i)
= \min_x \sum_{i=1}^d x^i \log x^i − x^i \log \underbrace{\prod_{j=1}^n (h_j^i)^{π_j}}_{g^i} − \underbrace{\sum_{j=1}^n π_j h_j^i}_{a^i} \log x^i
= \min_x \sum_{i=1}^d x^i \log \frac{x^i}{g^i} − a^i \log x^i
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 46/68
  • 47. Jeffreys positive centroid (proof)
Coordinate-wise, minimize: \min_x\; x \log \frac{x}{g} − a \log x
Setting the derivative to zero, we solve \log \frac{x}{g} + 1 − \frac{a}{x} = 0 and get x = \frac{a}{W(\frac{a}{g} e)}
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 47/68
  • 48. Jeffreys frequency centroid: A guaranteed approximation
c̃ = \arg\min_{x \in ∆_d} \sum_{j=1}^n π_j J(h̃_j, x).
Relaxing x from the probability simplex ∆_d to R_+^d, we get c̃' = \frac{c}{w_c}, with c^i = \frac{a^i}{W(\frac{a^i}{g^i} e)} and w_c = \sum_i c^i.
Lemma (Lemma 1). The cumulative sum w_c of the bin values of the Jeffreys positive centroid c of a set of frequency histograms is less than or equal to one: 0 < w_c ≤ 1.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 48/68
  • 49. Proof of Lemma 1
From Theorem 1: w_c = \sum_{i=1}^d c^i = \sum_{i=1}^d \frac{a^i}{W(\frac{a^i}{g^i} e)}.
Arithmetic-geometric mean inequality: a^i ≥ g^i.
Therefore W(\frac{a^i}{g^i} e) ≥ 1 and c^i ≤ a^i. Thus w_c = \sum_{i=1}^d c^i ≤ \sum_{i=1}^d a^i = 1.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 49/68
  • 50. Lemma 2
Lemma (Lemma 2). For any histogram x and frequency histogram h̃, we have
J(x, h̃) = J(x̃, h̃) + (w_x − 1)(KL(x̃ : h̃) + \log w_x),
where w_x denotes the normalization factor (w_x = \sum_{i=1}^d x^i).
For a set of histograms:
J(x, H̃) = J(x̃, H̃) + (w_x − 1)(KL(x̃ : H̃) + \log w_x),
where J(x, H̃) = \sum_{j=1}^n π_j J(x, h̃_j) and KL(x̃ : H̃) = \sum_{j=1}^n π_j KL(x̃ : h̃_j) (with \sum_{j=1}^n π_j = 1).
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 50/68
  • 51. Proof of Lemma 2
With x^i = w_x x̃^i:
J(x, h̃) = \sum_{i=1}^d (w_x x̃^i − h̃^i) \log \frac{w_x x̃^i}{h̃^i}
= \sum_{i=1}^d \left( w_x x̃^i \log \frac{x̃^i}{h̃^i} + w_x x̃^i \log w_x + h̃^i \log \frac{h̃^i}{x̃^i} − h̃^i \log w_x \right)
= (w_x − 1) \log w_x + J(x̃, h̃) + (w_x − 1) \sum_{i=1}^d x̃^i \log \frac{x̃^i}{h̃^i}
= J(x̃, h̃) + (w_x − 1)(KL(x̃ : h̃) + \log w_x),
since \sum_{i=1}^d h̃^i = \sum_{i=1}^d x̃^i = 1.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 51/68
  • 52. Guaranteed approximation of c̃
Theorem (Theorem 2). Let c̃ denote the Jeffreys frequency centroid and c̃' = \frac{c}{w_c} the normalized Jeffreys positive centroid. Then the approximation factor α_{c̃'} = \frac{J(c̃', H̃)}{J(c̃, H̃)} is such that 1 ≤ α_{c̃'} ≤ \frac{1}{w_c} (with w_c ≤ 1).
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 52/68
  • 53. Proof of Theorem 2
J(c, H̃) ≤ J(c̃, H̃) ≤ J(c̃', H̃).
From Lemma 2, since J(c̃', H̃) = J(c, H̃) + (1 − w_c)(KL(c̃' : H̃) + \log w_c) and J(c, H̃) ≤ J(c̃, H̃):
1 ≤ α_{c̃'} ≤ 1 + \frac{(1 − w_c)(KL(c̃' : H̃) + \log w_c)}{J(c̃, H̃)}
KL(c̃' : H̃) = \frac{1}{w_c} KL(c : H̃) − \log w_c
α_{c̃'} ≤ 1 + \frac{(1 − w_c) KL(c : H̃)}{w_c J(c̃, H̃)}
Since J(c̃, H̃) ≥ J(c, H̃) and KL(c : H̃) ≤ J(c, H̃), we get α_{c̃'} ≤ \frac{1}{w_c}. When w_c = 1 the bound is tight.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 53/68
  • 54. In practice...
c is available in closed form → compute w_c, KL(c : H̃), J(c, H̃).
Bound the approximation factor α_{c̃'} as: α_{c̃'} ≤ 1 + \left(\frac{1}{w_c} − 1\right) \frac{KL(c : H̃)}{J(c, H̃)} ≤ \frac{1}{w_c}
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 54/68
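Putting the pieces together, a sketch that reuses the jeffreys_positive_centroid, kl, and jeffreys helpers from the earlier sketches (all assumptions, not from the slides) to evaluate this data-dependent bound.

```python
import numpy as np

def approximation_bound(H_freq, pi):
    """H_freq: (n, d) frequency histograms; pi: (n,) weights summing to 1."""
    c = jeffreys_positive_centroid(H_freq, pi)
    wc = c.sum()
    KLcH = sum(w * kl(c, h) for w, h in zip(pi, H_freq))        # KL(c : H~)
    JcH  = sum(w * jeffreys(c, h) for w, h in zip(pi, H_freq))  # J(c, H~)
    return min(1.0 + (1.0 / wc - 1.0) * KLcH / JcH, 1.0 / wc)
```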
  • 55. Fine approximation
From [38, 36], the minimization for the Jeffreys frequency centroid is equivalent to:
c̃ = \arg\min_{x̃ \in ∆_d} KL(ã : x̃) + KL(x̃ : g̃)
Lagrangian function enforcing \sum_i c^i = 1:
\log \frac{c̃^i}{g̃^i} + 1 − \frac{ã^i}{c̃^i} + λ = 0
c̃^i = \frac{ã^i}{W\left(\frac{ã^i e^{λ+1}}{g̃^i}\right)}
λ = −KL(c̃ : g̃) ≤ 0
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 55/68
  • 56. Fine approximation: Bisection search
c^i ≤ 1 ⇒ c^i = \frac{ã^i}{W\left(\frac{ã^i e^{λ+1}}{g̃^i}\right)} ≤ 1 ⇒ λ ≥ \log(e^{ã^i} g̃^i) − 1\; ∀i, so λ ∈ [\max_i \log(e^{ã^i} g̃^i) − 1,\; 0].
s(λ) = \sum_i c^i(λ) = \sum_{i=1}^d \frac{ã^i}{W\left(\frac{ã^i e^{λ+1}}{g̃^i}\right)}
The function s is monotonically decreasing with s(0) ≤ 1.
→ Bisection search for s(λ^*) ≃ 1 to arbitrary precision.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 56/68
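A sketch of this bisection, under the same SciPy/NumPy assumptions and illustrative names as before; ã and g̃ are the weighted arithmetic and geometric means of the frequency histograms, and the bracket comes from the bound λ ≥ ã^i + log g̃^i − 1.

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_frequency_centroid(H_freq, pi, tol=1e-12):
    """Approximate the Jeffreys frequency centroid by bisection on lambda."""
    a = np.average(H_freq, axis=0, weights=pi)
    g = np.exp(np.average(np.log(H_freq), axis=0, weights=pi))
    c_of = lambda lam: a / np.real(lambertw(a * np.exp(lam + 1.0) / g))
    lo, hi = np.max(a + np.log(g)) - 1.0, 0.0   # s(lo) >= 1 >= s(0)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # s(lambda) is monotonically decreasing: shrink the bracket toward s = 1
        if c_of(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    c = c_of(0.5 * (lo + hi))
    return c / c.sum()
```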
  • 57. Experiments: Caltech-256
Caltech-256 [16]: 30607 images labeled into 256 categories (256 Jeffreys centroids).
Arbitrary floating-point precision: http://www.apfloat.org/
Veldhuis' approximation: c̃'' = \frac{ã + g̃}{2}

        α_c (optimal positive)   α_{c̃'} (normalized approx.)   w_c ≤ 1 (normalizing coeff.)   α_{c̃''} (Veldhuis' approx.)
avg     0.9648680345638155       1.0002205080964255             0.9338228644308926             1.065590178484613
min     0.906414219584823        1.0000005079528809             0.8342819488534723             1.0027707382095195
max     0.9956399220678585       1.0000031489541772             0.9931975105809021             1.3582296675397754
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 57/68
  • 58. Experiments: Synthetic data-sets
Random binary histograms, α = \frac{J(c̃')}{J(c̃)} ≥ 1.
Performance: ᾱ ∼ 1.0000009, α_max ∼ 1.00181506, α_min = 1.000000.
Open question: can a better worst-case upper bound be expressed?
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 58/68
  • 59. Summary and conclusion
◮ Jeffreys positive centroid c in closed form
◮ normalized Jeffreys positive centroid c̃' within approximation factor \frac{1}{w_c}
◮ Bisection search for an arbitrarily fine approximation of c̃
→ Variational Jeffreys k-means clustering
Other Kullback-Leibler symmetrizations:
◮ Jensen-Shannon divergence [19]
◮ Chernoff divergence [9]
◮ Family of symmetrized centroids including Jensen-Shannon and Jeffreys centroids [29]
◮ Infinitely many symmetrizations using quasi-arithmetic means
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 59/68
  • 60. Clustering: A never-ending story... An old problem with many new recent results! 21st century is the revolution of data science c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 60/68
  • 61. Computational Information Geometry [32] [31] http://www.springer.com/engineering/signals/book/978-3-642-30231-2 http://www.sonycsl.co.jp/person/nielsen/infogeo/MIG/MIGBOOKWEB/ http://www.springer.com/engineering/signals/book/978-3-319-05316-5 http://www.sonycsl.co.jp/person/nielsen/infogeo/GTI/GeometricTheoryOfInformation.html c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 61/68
  • 62. Textbooks: Visual computing [27] [28] http://www.sonycsl.co.jp/person/nielsen/visualcomputing/ http://www.lix.polytechnique.fr/~nielsen/JavaProgramming/ c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 62/68
  • 63. Bibliography I Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005. D. A. Barry, P. J. Culligan-Hensley, and S. J. Barry. Real values of the W -function. ACM Trans. Math. Softw., 21(2):161–171, June 1995. Richard Bellman. A note on cluster analysis and dynamic programming. Mathematical Biosciences, 18(3-4):311 – 312, 1973. Anup Bhattacharya, Ragesh Jaiswal, and Nir Ailon. A tight lower bound instance for k-means++ in constant dimension. CoRR, abs/1401.2912, 2014. Brigitte Bigi. Using Kullback-Leibler distance for text categorization. In Proceedings of the 25th European conference on IR research (ECIR), ECIR’03, pages 305–319, Berlin, Heidelberg, 2003. Springer-Verlag. Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete Computational Geometry, 44(2):281–307, September 2010. J. Cavanaugh. Unifying the derivations for the Akaike and corrected Akaike information criteria. Statistics & Probability Letters, 33(2):201–208, April 1997. c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 63/68
  • 64. Bibliography II Vijay Chandrasekhar, Gabriel Takacs, David M. Chen, Sam S. Tsai, Yuriy A. Reznik, Radek Grzeszczuk, and Bernd Girod. Compressed histogram of gradients: A low-bitrate descriptor. International Journal of Computer Vision, 96(3):384–399, 2012. Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493–507, 1952. Franklin C. Crow. Summed-area tables for texture mapping. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’84, pages 207–212, New York, NY, USA, 1984. ACM. G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. Workshop on Statistical Learning in Computer Vision (ECCV), pages 1–22, 2004. Sanjoy Dasgupta. The hardness of k-means clustering. Technical Report CS2008-0916. Walter D Fisher. On grouping for maximum homogeneity. Journal of the American Statistical Association, 53(284):789–798, 1958. Edward W. Forgy. Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 1965. c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 64/68
  • 65. Bibliography III Vincent Garcia and Frank Nielsen. Simplification and hierarchical representations of mixtures of exponential families. Signal Processing, 90(12):3197–3212, 2010. G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007. John A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 99th edition, 1975. Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, 186(1007):453–461, March 1946. Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37:145–151, 1991. Huan Liu and Rudy Setiono. Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the Seventh International Conference on Tools with Artificial Intelligence (TAI), pages 88–, Washington, DC, USA, 1995. IEEE Computer Society. Meizhu Liu, Baba C. Vemuri, Shun ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2407–2419, 2012. c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 65/68
  • 66. Bibliography IV Stuart P. Lloyd. Least squares quantization in PCM. Technical report, Bell Laboratories, 1957. James B. MacQueen. Some methods of classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley, CA, USA, 1967. Meena Mahajan, Prajakta Nimbhorkar, and Kasturi R. Varadarajan. The planar k-means problem is NP-hard. Theoretical Computer Science, 442:13–21, 2012. Nimrod Megiddo and Kenneth J Supowit. On the complexity of some common geometric location problems. SIAM journal on computing, 13(1):182–196, 1984. Max Mignotte. Segmentation by fusion of histogram-based k-means clusters in different color spaces. IEEE Transactions on Image Processing (TIP), 17(5):780–787, 2008. Frank Nielsen. Visual Computing: Geometry, Graphics, and Vision. Charles River Media / Thomson Delmar Learning, 2005. Frank Nielsen. A Concise and Practical Introduction to Programming Algorithms in Java. Undergraduate Topics in Computer Science (UTiCS). Springer Verlag, 2009. http://www.springer.com/computer/programming/book/978-1-84882-338-9. c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 66/68
  • 67. Bibliography V Frank Nielsen. A family of statistical symmetric divergences based on Jensen’s inequality. CoRR, abs/1009.4004, 2010. Frank Nielsen. k-mle: A fast algorithm for learning statistical mixture models. CoRR, abs/1203.5181, 2012. Frank Nielsen. Geometric Theory of Information. Springer, 2014. Frank Nielsen and Rajendra Bhatia, editors. Matrix Information Geometry (Revised Invited Papers). Springer, 2012. Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011. Frank Nielsen and Richard Nock. Clustering multivariate normal distributions. In Emerging Trends in Visual Computing, pages 164–174. Springer, 2009. Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882–2904, 2009. Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2048–2059, June 2009. c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 67/68
  • 68. Bibliography VI Richard Nock, Panu Luosto, and Jyrki Kivinen. Mixed Bregman clustering with approximation guarantees. In Proceedings of the European conference on Machine Learning and Knowledge Discovery in Databases, pages 154–169, Berlin, Heidelberg, 2008. Springer-Verlag. Raymond N. J. Veldhuis. The centroid of the symmetrical Kullback-Leibler distance. IEEE signal processing letters, 9(3):96–99, March 2002. Haizhou Wang and Mingzhou Song. Ckmeans.1d.dp: Optimal k-means clustering in one dimension by dynamic programming. R Journal, 3(2), 2011. Juanying Xie, Shuai Jiang, Weixin Xie, and Xinbo Gao. An efficient global k-means clustering algorithm. Journal of computers, 6(2), 2011. c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 68/68