Non-exhaustive, overlapping K-means clustering
David F. Gleich
Purdue University
Real-world graph and point data have overlapping clusters.
[Figure: gene-expression heatmap over a few hundred gene probes; gene IDs omitted.]
Social networks have overlapping clusters because of social circles.
Genes have overlapping clusters due to their role in multiple functions.
Overlapping research projects are what got me here too!
•  PhD Thesis on Google's PageRank
•  MSR Intern and Overlapping Clusters for Distributed Computation
•  Accelerated NCP plots and locally minimal communities
•  Neighborhood inflated seed expansion for overlapping communities
•  Non-exhaustive overlapping K-means

1.  NISE Clustering - Whang, Gleich, Dhillon, CIKM 2013
2.  NEO-K-means - Whang, Gleich, Dhillon, SDM 2015
3.  NEO-K-means SDP - Hou, Whang, Gleich, Dhillon, KDD 2015
4.  Multiplier Methods for Overlapping K-Means - Hou, Whang, Gleich, Dhillon, Submitted
Overlapping communities via seed set expansion works nicely.
The pipeline: Filtering Phase → Seeding Phase → Seed Set Expansion Phase → Propagation Phase.
[Figure 2 from the CIKM 2013 paper: maximum conductance vs. graph coverage (percentage) on (a) AstroPh and (d) Flickr, for the seeding strategies egonet, graclus centers, spread hubs, random, and bigclam; "graclus centers" outperforms the other seeding strategies.]
We can cover 95% of the network with communities of conductance ~0.15.
Flickr social network: 2M vertices, 22M edges.
cond(S) = cut(S) / "size"(S)
We wanted a more principled
approach to achieve these results.
The state of the art for clustering: K-Means
Problem 1 😀  Problem 2 😊  Problem 3 😟  Problem 4 😢
K-Means handles the first two problems well, but not the last two.

The state of the art for clustering: K-Means and NEO-K-Means
Problem 1 😀  Problem 2 😊  Problem 3 😊  Problem 4 😊
NEO-K-Means handles the cases where K-Means struggles.
K-means as optimization.
[Diagram: a point $x_i$ with distances $\|x_i - m_1\|$ and $\|x_i - m_2\|$ to centroids $m_1$ and $m_2$.]
Input: points $x_1, \dots, x_n$. Find an assignment matrix $U$ that gives cluster assignments to minimize the objective.

The K-means objective:
$$\text{minimize} \sum_{ij} U_{ij} \|x_i - m_j\|^2 \quad \text{subject to } U \text{ is an assignment to clusters},\; m_j = \tfrac{1}{\sum_i U_{ij}} \sum_i U_{ij} x_i$$

For example, with $x_1, x_2$ in cluster $c_1$ and $x_3, x_4$ in cluster $c_2$:
$$U = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}$$

K-means' objective with overlap? Simply allow multi-assignment:
$$\text{minimize} \sum_{ij} U_{ij} \|x_i - m_j\|^2 \quad \text{subject to } U \text{ is a multi-assignment to clusters},\; m_j = \tfrac{1}{\sum_i U_{ij}} \sum_i U_{ij} x_i$$
Overlap is not a natural addition to optimization-based clustering.
The NEO-K-means objective balances overlap and outliers.

$$\begin{aligned} \text{minimize} \quad & \sum_{ij} U_{ij} \|x_i - m_j\|^2 \\ \text{subject to} \quad & U_{ij} \text{ is binary}, \\ & \mathrm{trace}(U^T U) = (1+\alpha)n \quad (\alpha n \text{ overlap}), \\ & e^T \mathrm{Ind}[Ue] \ge (1-\beta)n \quad (\text{up to } \beta n \text{ outliers}), \\ & m_j = \tfrac{1}{\sum_i U_{ij}} \sum_i U_{ij} x_i \end{aligned}$$

•  If $\alpha, \beta = 0$, then we get back to K-means.
•  Automatically choose $\alpha, \beta$ based on K-means. 😊

1. Make $(1+\alpha)n$ total assignments.
2. Allow up to $\beta n$ outliers.
[Figure: synthetic points labeled Cluster 1, Cluster 2, Cluster 1 & 2, or not assigned.]
Lloyd's algorithm for NEO-K-means is just a wee bit more complex.

Until done:
1. Update centroids.
2. Assign the $(1-\beta)n$ points nearest a centroid to their closest centroid.
3. Make the remaining $(\alpha+\beta)n$ assignments by minimizing distance.

[Figure: the resulting assignments on the example, with Cluster 1, Cluster 2, Cluster 1 & 2, and unassigned points.]
This algorithm correctly assigns our example case and even determines the overlap and outlier parameters!
THEOREM. Lloyd's algorithm decreases the objective monotonically.
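To make the loop concrete, here is a minimal NumPy sketch of one way to implement this Lloyd-style iteration; the initialization, tie-breaking, and fixed iteration count are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def neo_kmeans_lloyd(X, k, alpha, beta, iters=100, seed=0):
    # Minimal sketch of a Lloyd-style NEO-K-means iteration. Makes
    # (1 + alpha) * n assignments in total and leaves up to beta * n points
    # unassigned. Initialization, ties, and the fixed iteration count are
    # illustrative choices.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    M = X[rng.choice(n, k, replace=False)]            # initial centroids
    n_total = int((1 + alpha) * n)                    # total assignments
    n_sure = n - int(beta * n)                        # must-assign points
    U = np.zeros((n, k), dtype=bool)
    for _ in range(iters):
        D = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1)  # sq. distances
        U[:] = False
        # Step 2: assign the (1 - beta)n points nearest any centroid
        closest = D.min(axis=1)
        sure = np.argsort(closest)[:n_sure]
        U[sure, D[sure].argmin(axis=1)] = True
        # Step 3: make the remaining (alpha + beta)n assignments greedily
        # over the unused (point, cluster) pairs with smallest distance
        D_left = D.copy()
        D_left[U] = np.inf
        flat = np.argsort(D_left, axis=None)[: n_total - n_sure]
        U[np.unravel_index(flat, D.shape)] = True
        # Step 1 (next round): update centroids from the multi-assignments
        M = np.array([X[U[:, j]].mean(axis=0) if U[:, j].any() else M[j]
                      for j in range(k)])
    return U, M
```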
The non-exhaustiveness is
necessary for assignments.
[Figure: synthetic data (n=1,000, α=0.1, β=0.005); green points indicate overlap.
(b) First extension of k-means: output without the assignment constraint (β = 1).
(c) NEO-K-Means: the correct output.]
The Weighted, Kernel NEO-K-Means objective.
•  Introduce a weight $w_i$ for each data point.
•  Introduce a feature map $\phi(x_i)$ for each data point too.

$$\begin{aligned} \text{minimize} \quad & \sum_{ij} U_{ij} w_i \|\phi(x_i) - m_j\|^2 \\ \text{subject to} \quad & U_{ij} \text{ is binary}, \\ & \mathrm{trace}(U^T U) = (1+\alpha)n \quad (\alpha n \text{ overlap}), \\ & e^T \mathrm{Ind}[Ue] \ge (1-\beta)n \quad (\text{up to } \beta n \text{ outliers}), \\ & m_j = \tfrac{1}{\sum_i U_{ij} w_i} \sum_i U_{ij} w_i \phi(x_i) \end{aligned}$$

The kernel trick eliminates $\phi$:
$$\sum_{ij} U_{ij} w_i \|\phi(x_i) - m_j\|^2 = \sum_{ij} U_{ij} w_i K_{ii} - \sum_j \frac{u_j^T W K W u_j}{u_j^T W u_j}$$

THEOREM. If $K = \sigma D^{-1} + D^{-1} A D^{-1}$, then the NEO-K-Means objective is equivalent to overlapping conductance.

NOTE: This means that NEO-K-Means was the principled objective we were after!
Conductance communities
Conductance is one of the most important community scores [Schaeffer07].
The conductance of a set of vertices is the ratio of edges leaving the set (the cut) to the total edges in the set (the volume):
$$\phi(S) = \frac{\mathrm{cut}(S)}{\min\big(\mathrm{vol}(S), \mathrm{vol}(\bar S)\big)}$$
Equivalently, it's the probability that a random edge leaves the set.
Small conductance ⇔ good community.
Example: $\mathrm{cut}(S) = 7$, $\mathrm{vol}(S) = 33$, $\mathrm{vol}(\bar S) = 11$, so $\phi(S) = 7/11$.
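As a quick worked illustration of the definition, a small Python helper that computes this score (networkx is used here only for convenience and is not part of the original slides):

```python
import networkx as nx  # used only for convenience in this sketch

def conductance(G, S):
    # phi(S) = cut(S) / min(vol(S), vol(S-complement)) on an undirected graph
    S = set(S)
    cut = sum(1 for u, v in G.edges() if (u in S) != (v in S))
    vol_S = sum(d for _, d in G.degree(S))
    vol_rest = 2 * G.number_of_edges() - vol_S
    return cut / min(vol_S, vol_rest)

# e.g., on Zachary's karate club network:
# phi = conductance(nx.karate_club_graph(), range(17))
```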
Our theorem means that NEO-K-Means can optimize the sum-of-conductances objective.

Conductance is bounded by the normalized-cut bi-partition score:
$$\phi(S) \le \frac{\mathrm{cut}(S)}{\mathrm{vol}(S)} + \frac{\mathrm{cut}(\bar S)}{\mathrm{vol}(\bar S)}$$
and the NEO-K-Means objective satisfies
$$\sum_{S \in \mathcal{C}} \frac{\mathrm{cut}(S)}{\mathrm{vol}(S)} = \sum_{S \in \mathcal{C}} \phi(S) \quad \text{if } \mathrm{vol}(S) \le \mathrm{vol}(\bar S).$$

When we use this method to partition the Karate club network, we get reasonable solutions.
•  Inspired by Dhillon et al.'s work on Graclus.
•  We have a multilevel method to optimize the graph case.
We get state-of-the-art clustering performance on vector and graph datasets.

F1 scores on vector datasets from the Mulan repository:

        moc    fuzzy  esp    isp    okm    rokm   NEO
synth1  0.833  0.959  0.977  0.985  0.989  0.969  0.996
synth2  0.836  0.957  0.952  0.973  0.967  0.975  0.996
synth3  0.547  0.919  0.968  0.952  0.970  0.928  0.996
yeast   -      0.308  0.289  0.203  0.311  0.203  0.366
music   0.534  0.533  0.527  0.508  0.527  0.454  0.550
scene   0.467  0.431  0.572  0.586  0.571  0.593  0.626

Dataset statistics:

        n      dim.  avg |C|  outliers  k
synth1  5,000  2     2,750    0         2
synth2  1,000  2     550      5         2
synth3  6,000  2     3,600    6         2
yeast   2,417  103   731.5    0         14
music   593    72    184.7    0         6
scene   2,407  294   430.8    0         6

The Mulan test set has a number of appropriate datasets.
NEO-K-Means with Lloyd's method is fast and usually accurate, but inconsistent.
[Figure: a more complicated overlapping test case with three clusters, and two differing outputs from NEO-K-Means with Lloyd's method; labels: Cluster 1, Cluster 2, Cluster 1 & 2, Cluster 3, not assigned.]
Can we get a more robust method? Yes!
Towards better optimization of the objective
1.  An SDP relaxation of the objective.
2.  A practical low-rank SDP heuristic.
3.  Faster optimization methods for the heuristic.
From assignments to co-occurrence matrices

There are three key variables in our formulation:
1. The co-occurrence matrix
$$Z = \sum_j \frac{W u_j u_j^T W}{u_j^T W u_j}$$
2. The overlap vector $f$
3. The assignment indicator $g$

For example:
$$U = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad f = \begin{bmatrix} 1 \\ 2 \\ 1 \\ 0 \end{bmatrix}, \quad g = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 0 \end{bmatrix}$$
We can convert our objective into a trace minimization problem.

With $K_{ij} = \phi(x_i)^T \phi(x_j)$ and $d_i = w_i K_{ii}$:
$$\sum_{ij} U_{ij} w_i \|\phi(x_i) - m_j\|^2 = \sum_{ij} U_{ij} w_i K_{ii} - \sum_j \frac{u_j^T W K W u_j}{u_j^T W u_j} = f^T d - \mathrm{trace}(KZ)$$

where $Z$ is the normalized co-occurrence matrix, $f$ the overlap count, and $g$ the assignment indicator. This is the objective function.
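The algebra above can be sanity-checked numerically. The following toy sketch (linear kernel, random weights, a hand-picked overlapping $U$; all illustrative assumptions) verifies that the weighted kernel objective equals $f^T d - \mathrm{trace}(KZ)$:

```python
import numpy as np

# Toy check (all data illustrative) that
#   sum_ij U_ij w_i ||phi(x_i) - m_j||^2  ==  f^T d - trace(K Z)
# with Z = sum_j W u_j u_j^T W / (u_j^T W u_j) and d_i = w_i K_ii.
rng = np.random.default_rng(1)
n, k = 6, 2
X = rng.normal(size=(n, 3))
K = X @ X.T                                   # linear kernel, phi(x) = x
w = rng.uniform(1.0, 2.0, n)
W = np.diag(w)
U = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]], float)
f = U.sum(axis=1)                             # overlap counts
d = w * np.diag(K)                            # d_i = w_i K_ii

lhs = 0.0
for j in range(k):
    uj = U[:, j]
    mj = (w * uj) @ X / (w @ uj)              # weighted cluster mean
    lhs += sum(uj[i] * w[i] * np.sum((X[i] - mj) ** 2) for i in range(n))

Z = sum(np.outer(W @ U[:, j], W @ U[:, j]) / (U[:, j] @ W @ U[:, j])
        for j in range(k))
assert np.isclose(lhs, f @ d - np.trace(K @ Z))
```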
There is an SDP-like framework to solve NEO-K-means.

$$\begin{aligned} \underset{Z, f, g}{\text{maximize}} \quad & \mathrm{trace}(KZ) - f^T d \\ \text{subject to} \quad & \mathrm{trace}(W^{-1} Z) = k, && (a) \\ & Z_{ij} \ge 0, && (b) \\ & Z \succeq 0,\; Z = Z^T, && (c) \\ & Ze = Wf, && (d) \\ & e^T f = (1+\alpha)n, && (e) \\ & e^T g \ge (1-\beta)n, && (f) \\ & f \ge g, && (g) \\ & \mathrm{rank}(Z) = k, && (h) \\ & f \in \mathbb{Z}^n_{\ge 0},\; g \in \{0,1\}^n. && (i) \end{aligned}$$

Constraints (a)-(d): $Z$ must come from an assignment matrix. Constraints (e)-(g): overlap and assignment constraints. Constraints (h)-(i): combinatorial constraints.
There is an SDP relaxation to approximate NEO-K-means: keep the assignment-matrix and overlap constraints, and relax the combinatorial ones.

$$\begin{aligned} \underset{Z, f, g}{\text{maximize}} \quad & \mathrm{trace}(KZ) - f^T d \\ \text{subject to} \quad & \mathrm{trace}(W^{-1} Z) = k, && (a) \\ & Z_{ij} \ge 0, && (b) \\ & Z \succeq 0,\; Z = Z^T, && (c) \\ & Ze = Wf, && (d) \\ & e^T f = (1+\alpha)n, && (e) \\ & e^T g \ge (1-\beta)n, && (f) \\ & f \ge g, && (g) \\ & 0 \le g \le 1. && \text{(relaxed)} \end{aligned}$$
This SDP can easily solve simple problems.
[Figure: NEO-K-Means assignments recovered from the SDP solution.]
The solution Z from CVX is even rank 2!
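For concreteness, here is a hedged CVXPY sketch of the relaxed problem above; the function name, the diagonal-$W$ assumption, and the choice of the SCS solver are illustrative, not the authors' CVX setup.

```python
import cvxpy as cp
import numpy as np

def neo_sdp_relaxation(K, W, k, alpha, beta):
    # Sketch of the relaxed NEO-K-Means SDP, constraints (a)-(g) plus the
    # relaxed 0 <= g <= 1. Assumes W is a diagonal weight matrix.
    n = K.shape[0]
    d = np.diag(W) * np.diag(K)                    # d_i = w_i * K_ii
    Z = cp.Variable((n, n), PSD=True)              # (c): symmetric PSD
    f = cp.Variable(n)
    g = cp.Variable(n)
    cons = [
        cp.trace(np.linalg.inv(W) @ Z) == k,       # (a)
        Z >= 0,                                    # (b) entrywise
        Z @ np.ones(n) == W @ f,                   # (d)
        cp.sum(f) == (1 + alpha) * n,              # (e)
        cp.sum(g) >= (1 - beta) * n,               # (f)
        f >= g,                                    # (g)
        f >= 0, g >= 0, g <= 1,                    # relaxed integrality
    ]
    prob = cp.Problem(cp.Maximize(cp.trace(K @ Z) - d @ f), cons)
    prob.solve(solver=cp.SCS)                      # any SDP-capable solver
    return Z.value, f.value, g.value
```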
But SDP methods have a number of issues for large-scale problems.
1.  The number of variables is quadratic in the number of data points.
2.  The best solvers can only handle problems with a few hundred or thousand points.

So, like many before us (e.g., Burer & Monteiro; Kulis, Surendran & Platt 2007; and more), we optimize a low-rank factorization of the solution.
Using the NEO-K-Means Low-Rank SDP, we can find assignments directly.

We factor $Z \approx YY^T$; empirically, $\|Z - YY^T\| = 2.3 \times 10^{-4}$.

The Low-Rank NEO-K-Means SDP:
$$\begin{aligned} \underset{Y, f, g, s, r}{\text{maximize}} \quad & \mathrm{trace}(Y^T K Y) - f^T d \\ \text{subject to} \quad & k = \mathrm{trace}(Y^T W^{-1} Y) \\ & 0 = YY^T e - Wf \\ & 0 = e^T f - (1+\alpha)n \\ & 0 = f - g - s \\ & 0 = e^T g - (1-\beta)n - r \\ & Y_{ij} \ge 0,\; s \ge 0,\; r \ge 0 \\ & 0 \le f \le ke,\; 0 \le g \le 1 \end{aligned}$$

We lose convexity but gain practicality. We introduce slacks ($s$, $r$) at this point. The $YY^T$ term is the icky non-convex piece; everything else is equality constraints plus simple bound constraints.
simple bound constraints
We use an augmented Lagrangian
method to optimize this problem
SILO Seminar
David Gleich · Purdue 
APPENDIX
A. AUGMENTED LAGRANGIANS
The augmented Lagrangian framework is a general strategy for solving nonlinear optimization problems with equality constraints. The low-rank factorization follows previous studies of low-rank SDP approximations [6]. Let $\lambda = [\lambda_1; \lambda_2; \lambda_3]$ be the Lagrange multipliers associated with the three scalar constraints (s), (u), (w), and let $\mu$ and $\gamma$ be the Lagrange multipliers associated with the vector constraints (t) and (v), respectively. Let $\rho \ge 0$ be a penalty parameter. The augmented Lagrangian for (4) is:

$$\begin{aligned} L_A(Y, f, g, s, r; \lambda, \mu, \gamma, \rho) = \; & \underbrace{f^T d - \mathrm{trace}(Y^T K Y)}_{\text{the objective}} \\ & - \lambda_1\big(\mathrm{trace}(Y^T W^{-1} Y) - k\big) + \tfrac{\rho}{2}\big(\mathrm{trace}(Y^T W^{-1} Y) - k\big)^2 \\ & - \mu^T (YY^T e - Wf) + \tfrac{\rho}{2}(YY^T e - Wf)^T (YY^T e - Wf) \\ & - \lambda_2\big(e^T f - (1+\alpha)n\big) + \tfrac{\rho}{2}\big(e^T f - (1+\alpha)n\big)^2 \\ & - \gamma^T (f - g - s) + \tfrac{\rho}{2}(f - g - s)^T (f - g - s) \\ & - \lambda_3\big(e^T g - (1-\beta)n - r\big) + \tfrac{\rho}{2}\big(e^T g - (1-\beta)n - r\big)^2 \quad (5) \end{aligned}$$

At each step in the augmented Lagrangian solution framework, we solve the subproblem: minimize $L_A(Y, f, g, s, r; \lambda, \mu, \gamma, \rho)$ subject to the bound constraints.

B. GRADIENTS FOR NEO-LR
We now describe the analytic form of the gradients for the augmented Lagrangian of the NEO-LR objective and briefly validate that they are correct. Consider the augmented Lagrangian (5). The gradient has five components, one for each set of variables $Y, f, g, s, r$:

$$\begin{aligned} \nabla_Y L_A &= -2KY - e\mu^T Y - \mu e^T Y - 2\big(\lambda_1 - \rho(\mathrm{tr}(Y^T W^{-1} Y) - k)\big) W^{-1} Y \\ &\quad + \rho(YY^T ee^T Y + ee^T YY^T Y) - \rho(Wfe^T Y + ef^T W Y) \\ \nabla_f L_A &= d + W\mu - \rho(WYY^T e - W^2 f) - \lambda_2 e + \rho\big(e^T f - (1+\alpha)n\big)e - \gamma + \rho(f - g - s) \\ \nabla_g L_A &= \gamma - \rho(f - g - s) - \lambda_3 e + \rho\big(e^T g - (1-\beta)n - r\big)e \\ \nabla_s L_A &= \gamma - \rho(f - g - s) \\ \nabla_r L_A &= \lambda_3 - \rho\big(e^T g - (1-\beta)n - r\big) \end{aligned}$$

Using analytic gradients in a black-box solver such as L-BFGS-B is problematic if the gradients are even slightly incorrect. To guarantee that the analytic gradients we derive are correct, we use a forward finite-difference method to get a numerical approximation of the gradients from the objective function. We compare these with our analytic gradients and expect to see small relative differences on the order of $10^{-5}$ or $10^{-6}$; this is exactly what Figure 4 shows.
We use an augmented Lagrangian method to optimize this problem.
•  Use L-BFGS-B to optimize each subproblem.
•  Update the multiplier estimates in the standard way.
•  Pick parameters in a modestly standard way: some variability between problems to show the best results, but only a little variation in time and performance.
•  Faster than the NEOS solvers.

Comparison with solvers on the NEOS Server (state-of-the-art solvers for numerical optimization): our ALM solver is much faster (e.g., than SNOPT, which is suited to large nonlinearly constrained problems with a modest number of degrees of freedom).

         Our ALM solver (obj/time)   SNOPT (obj/time)
MUSIC    79514.130 / 92s             79515.156 / 306s
SCENE    18534.030 / 3798s           18534.021 / 8910s
YEAST    8902.253 / 4331s            Not solved
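The overall loop looks roughly like the following toy sketch on a stand-in equality-constrained problem; the objective, constraint, and parameter choices are placeholders, not the NEO-K-Means augmented Lagrangian (5).

```python
import numpy as np
from scipy.optimize import minimize

# Toy sketch of the ALM loop: minimize f(x) subject to c(x) = 0 and bounds,
# by repeatedly solving a bound-constrained subproblem with L-BFGS-B and
# updating the multiplier estimate in the standard way.
f = lambda x: (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2   # smooth objective
c = lambda x: x[0] - x[1] - 1.0                       # equality constraint

def alm(x0, rho=10.0, outer=20):
    x, lam = np.asarray(x0, float), 0.0
    for _ in range(outer):
        # Augmented Lagrangian for the current multiplier estimate
        L = lambda x: f(x) - lam * c(x) + 0.5 * rho * c(x) ** 2
        x = minimize(L, x, method="L-BFGS-B",
                     bounds=[(-5, 5), (-5, 5)]).x     # bound-constrained step
        lam -= rho * c(x)                             # multiplier update
    return x, lam

x_star, lam_star = alm([0.0, 0.0])    # approaches x = (1, 0) on c(x) = 0
```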
We win with our LRSDP solver vs. the CVX default solver.
•  Dolphins (n=62) and Les Mis (n=77) are graph problems.
•  LRSDP is much faster and just as accurate.

LRSDP is roughly an order of magnitude faster than CVX, and it generates solutions as good as CVX's global optima; the objective values differ only within the solution tolerances.

dolphins: 62 nodes, 159 edges; les miserables: 77 nodes, 254 edges.

                        Objective value           Run time
                        SDP        LRSDP          SDP          LRSDP
dolphins
k=2, α=0.2, β=0         -1.968893  -1.968329      107.03 secs  2.55 secs
k=2, α=0.2, β=0.05      -1.969080  -1.968128      56.99 secs   2.96 secs
k=3, α=0.3, β=0         -2.913601  -2.915384      160.57 secs  5.39 secs
k=3, α=0.3, β=0.05      -2.921634  -2.922252      71.83 secs   8.39 secs
les miserables
k=2, α=0.2, β=0         -1.937268  -1.935365      453.96 secs  7.10 secs
k=2, α=0.3, β=0         -1.949212  -1.945632      447.20 secs  10.24 secs
k=3, α=0.2, β=0.05      -2.845720  -2.845070      261.64 secs  13.53 secs
k=3, α=0.3, β=0.05      -2.859959  -2.859565      267.07 secs  19.31 secs

Dolphins from Lusseau et al., Behavioral Ecology and Sociobiology, 2003; Les Mis from Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley, 1993.
Rounding and improvement are both important.

Input → Relaxed solution → Rounded solution → Improved solution

Rounding: $f$ gives the number of clusters for each point and $g$ gives the set of assignments.
Option 1: use $g$ and $f$ to determine the number of assignments, then assign greedily.
Option 2: just greedily assign based on $W^{-1}Y$.

Improvement: run NEO-K-Means on the output.
Initialization: run NEO-K-Means on the input.
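A minimal sketch of an Option-2-style greedy rounding, assuming a diagonal weight matrix and treating $W^{-1}Y$ as per-point cluster scores; the ordering and tie-breaking details are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def round_lrsdp(Y, W, alpha, beta):
    # Greedy rounding from the relaxed low-rank solution: assign the
    # (1 - beta)n strongest points to their best cluster, then greedily
    # add the remaining ~(alpha + beta)n assignments by score.
    n, k = Y.shape
    S = np.linalg.solve(W, Y)                   # scores W^{-1} Y
    U = np.zeros((n, k), dtype=bool)
    n_sure = n - int(beta * n)
    sure = np.argsort(-S.max(axis=1))[:n_sure]  # strongest memberships first
    U[sure, S[sure].argmax(axis=1)] = True
    S_left = S.copy()
    S_left[U] = -np.inf                         # don't reuse existing pairs
    extra = int((1 + alpha) * n) - n_sure       # remaining assignments
    flat = np.argsort(-S_left, axis=None)[:extra]
    U[np.unravel_index(flat, S.shape)] = True
    return U
```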
The new method is more robust, even in simple tests. Consider clustering a cycle graph.
We use disconnected nodes to measure the cluster quality.
[Plot: number of disconnected nodes vs. noise level for random+onelevel neo, multilevel neo, and lrsdp.]
As we increase the noise, only the LRSDP method can reliably find the true clustering.
We get improved vector and graph clustering results too.

Comparison of NEO-K-Means objective function values on real-world datasets from Mulan (mulan.sourceforge.net/datasets.html). By using the LRSDP solution as the initialization of the iterative algorithm, we can achieve better (smaller) objective function values.

              worst   best    avg.
yeast
kmeans+neo    9611    9495    9549
lrsdp+neo     9440    9280    9364
slrsdp+neo    9471    9231    9367
music
kmeans+neo    87779   70158   77015
lrsdp+neo     82323   70157   75923
slrsdp+neo    82336   70159   75926
scene
kmeans+neo    18905   18745   18806
lrsdp+neo     18904   18759   18811
slrsdp+neo    18895   18760   18810

F1 scores on real-world vector datasets (the larger, the better). NEO-K-Means-based methods outperform the other methods, and the low-rank SDP method improves the clustering results.

        moc    esp    isp    okm    kmeans+neo  lrsdp+neo  slrsdp+neo
yeast
worst   -      0.274  0.232  0.311  0.356       0.390      0.369
best    -      0.289  0.256  0.323  0.366       0.391      0.391
avg.    -      0.284  0.248  0.317  0.360       0.391      0.382
music
worst   0.530  0.514  0.506  0.524  0.526       0.537      0.541
best    0.544  0.539  0.539  0.531  0.551       0.552      0.552
avg.    0.538  0.526  0.517  0.527  0.543       0.545      0.547
scene
worst   0.466  0.569  0.586  0.571  0.597       0.610      0.605
best    0.470  0.582  0.609  0.576  0.627       0.614      0.625
avg.    0.467  0.575  0.598  0.573  0.610       0.613      0.613

We have improved results – impressively so on the yeast dataset – and only slightly worse on the scene data.
We get improved vector and graph clustering results too.

            Facebook1  Facebook2  HepPh   AstroPh
bigclam     0.830      0.640      0.625   0.645
demon       0.495      0.318      0.503   0.570
oslom       0.319      0.445      0.465   0.580
nise        0.297      0.293      0.102   0.153
m-neo       0.285      0.269      0.206   0.190
LRSDP       0.222      0.148      0.091   0.137

            No. of vertices  No. of edges
Facebook1   348              2,866
Facebook2   756              30,780
HepPh       11,204           117,619
AstroPh     17,903           196,972

For these graphs, we dramatically improve the conductance-vs-coverage plots.

Lloyd's iterative method takes O(1 second); the LRSDP method takes O(1 hour).
Now we want to improve the LRSDP time.
We can improve the optimization beyond ALM.

1.  Proximal augmented Lagrangian (PALM): add a regularization term to the augmented Lagrangian,
$$x^{(k+1)} = \arg\min_x \; L_A(x; \lambda^{(k)}, \dots) + \frac{1}{2\tau}\|x - x^{(k)}\|^2,$$
and solve with L-BFGS-B.

2.  ADMM method (5 blocks):
$$\begin{aligned} Y^{k+1} &= \arg\min_Y L_A(Y, f^k, g^k, s^k, r^k; \lambda^k, \mu^k, \gamma^k, \rho) \\ f^{k+1} &= \arg\min_f L_A(Y^{k+1}, f, g^k, s^k, r^k; \lambda^k, \mu^k, \gamma^k, \rho) \\ g^{k+1} &= \arg\min_g L_A(Y^{k+1}, f^{k+1}, g, s^k, r^k; \lambda^k, \mu^k, \gamma^k, \rho) \\ s^{k+1} &= \arg\min_s L_A(Y^{k+1}, f^{k+1}, g^{k+1}, s, r^k; \lambda^k, \mu^k, \gamma^k, \rho) \\ r^{k+1} &= \arg\min_r L_A(Y^{k+1}, f^{k+1}, g^{k+1}, s^{k+1}, r; \lambda^k, \mu^k, \gamma^k, \rho) \end{aligned}$$

The $f$, $g$, $s$, $r$ subproblems are convex 😊; the $Y$ subproblem is non-convex 😟.
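One PALM step can be sketched as a thin wrapper around the ALM subproblem solve; the interface below (a generic callable L_A plus box bounds) is an illustrative assumption, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def palm_step(L_A, x_k, tau, bounds):
    # One proximal augmented Lagrangian (PALM) step:
    #   x_{k+1} = argmin_x  L_A(x) + (1 / (2*tau)) * ||x - x_k||^2
    # solved, as in the ALM case, with L-BFGS-B under bound constraints.
    prox_obj = lambda x: L_A(x) + np.sum((x - x_k) ** 2) / (2.0 * tau)
    return minimize(prox_obj, x_k, method="L-BFGS-B", bounds=bounds).x
```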
We had to get a new convergence result for the proximal method. What results exist for bound-constrained subproblems? Ours is a small adaptation of a general result due to Pennanen (2002).

Convergence analysis of PALM:

THEOREM 1. Let $(\bar x, \bar\lambda)$ be a KKT pair satisfying the strong second-order sufficiency condition, and assume the gradients $\nabla c(\bar x)$ are linearly independent. If the $\{\rho_k\}$ are large enough with $\rho_k \to \bar\rho \le \infty$, and if $\|(x_0, \lambda_0) - (\bar x, \bar\lambda)\|$ is small enough, then there exists a sequence $\{(x_k, \lambda_k)\}$ conforming to Algorithm 1, along with open neighborhoods $C_k$, such that for each $k$, $x_{k+1}$ is the unique solution in $C_k$ to $(P_k)$. Moreover, the sequence $\{(x_k, \lambda_k)\}$ converges linearly and Fejér monotonically to $(\bar x, \bar\lambda)$ with rate $r(\bar\rho) < 1$ that is decreasing in $\bar\rho$, and $r(\bar\rho) \to 0$ as $\bar\rho \to \infty$.
On the yeast dataset, we see no difference in objective value, but faster solves.
[Bar charts: runtimes on YEAST for iterative, ALM, PALM, and ADMM (roughly 0-4500 s); f(x) values on YEAST for ALM, PALM, and ADMM (roughly 8700-9200).]
On yeast, we see much better discrete objectives and F1 scores.
[Bar charts: NEO-K-Means objectives on YEAST (roughly 9000-9700) and F1 scores on YEAST (roughly 0.34-0.39) for iterative, ALM, PALM, and ADMM.]
Recap
For overlapping clustering of data and overlapping community detection in graphs, we have a new objective, plus:
•  a fast Lloyd-like iterative algorithm,
•  an SDP relaxation,
•  a low-rank SDP relaxation,
•  proximal and ADMM acceleration techniques.

1.  NEO-K-means - Whang, Gleich, Dhillon, SDM 2015
2.  NEO-K-means SDP + Aug. Lagrangian - Hou, Whang, Gleich, Dhillon, KDD 2015
3.  Multiplier Methods for Overlapping K-Means - Hou, Whang, Gleich, Dhillon, Submitted
Localized solutions of diffusion equations in large graphs. Joint with Kyle Kloster. WAW2013, KDD2014, WAW2015; J. Internet Math.
[Plots: plot(x) of the solution vector, and accuracy $\|D^{-1}(x - x^*)\|_\infty$ vs. number of nonzeros.]
Crawl of Flickr from 2006: ~800k nodes, 6M edges, $\beta = 1/2$.
$$(I - \beta P)x = (1 - \beta)s, \qquad \mathrm{nnz}(x) \approx 800\mathrm{k}$$
…the answer [5]. Thus, just as in scientific computing, marrying the method to the model is key for the best scientific computing on social networks.

Ultimately, none of these steps differ from the practice of physical scientific computing. The challenges in creating models, devising algorithms, validating results, and comparing models just take on different forms when the problems come from social data instead of physical models. Thus, let us return to our starting question: what does the matrix have to do with the social network? Just as in scientific computing, many interesting problems, models, and methods for social networks boil down to matrix computations. Yet, as in the expander example above, the types of matrix questions change dramatically in order to fit social network models. Let's see what's been done that's enticingly and refreshingly different from the types of matrix computations encountered in physical scientific computing.

EXPANDER GRAPHS AND PARALLEL COMPUTING
Recently, a coalition of folks from academia, national labs, and industry set out to tackle the problems in parallel computing and expander graphs. They established the Graph 500 benchmark (http://www.graph500.org) to measure the performance of a parallel computer on a standard graph computation with an expander graph. Over the past three years, they've seen performance grow by more than 1,000-times.

[Figure captions: diffusion in a plate; movie interest in diffusion. The network, or mesh, from a typical problem in scientific computing sits in a low-dimensional space (think of two or three dimensions). These physical spaces put limits on the size of the boundary or "surface area" of the space given its volume. No such limits exist in social networks, and these two sets are usually about the same size. A network with this property is called an expander network. Size of set ≈ size of boundary.]

"Networks" from PDEs are usually physical; social networks are expanders.
Higher-order organization of complex networks. Joint with Austin Benson and Jure Leskovec.
[Figure: clusters in the C. elegans connectome; neuron labels include CEPDR, CEPVR, IL2R, OLLR, RIAL, RIAR, RIVL, RIVR, RMDDR, RMDL, RMDR, RMDVL, RMFL, SMDDL, SMDDR, SMDVR, URBR.]
By using a new generalization of spectral clustering methods, we are able to find completely novel and relevant structures in complex systems such as the connectome and transport networks.
SIAM Annual Meeting (AN16)
July 11-15, 2016, The Westin Waterfront, Boston, Massachusetts
David Gleich, Purdue; Mary Silber, Northwestern

•  Big Data, Data Science, and Privacy
•  Education, Communication, and Policy
•  Reproducibility and Ethics
•  Efficiency and Optimization
•  Integrating Models and Data (incl. computational social science, PDEs)
•  Dynamic Networks (learning, evolution, adaptation, and cooperation)
•  Applied Math, Statistics, and Machine Learning
•  Earth systems; environmental/ecological applications
•  Epidemiology
Future work
•  Even faster solvers.
•  Understand why the solution seems to be rank 2 (the solution Z from CVX is even rank 2!).
•  Better initialization for Lloyd's algorithm.

1.  NEO-K-means - Whang, Gleich, Dhillon, SDM 2015
2.  NEO-K-means SDP + Aug. Lagrangian - Hou, Whang, Gleich, Dhillon, KDD 2015
3.  Multiplier Methods for Overlapping K-Means - Hou, Whang, Gleich, Dhillon, Submitted
