Clustering of graphs and
search of assemblages
Kirill Rybachuk, DCA
What is an assemblage?
• Assemblage is an intuitive concept and has no common definition
• Global approach: 'such aggregation cannot be for no reason'
• Modularity: comparison with a random graph (almost Erdős–Rényi)
• Local approach: 'more connections inside than outside in the vicinity'
• In weak sense: internal degree > external degree
• In strong sense: internal degree > external degree for each node
• Completely local approach: aggregation of 'similar' nodes
• Jaccard measure: an edge that is part of many triangles is probably in a dense region, i.e. a cluster. Quantify the overlap between adjacency lists: let Adj(i) be the adjacency list of i and Adj(j) the adjacency list of j; the similarity between Adj(i) and Adj(j) is taken as the similarity between i and j themselves:

  Sim(i, j) = |Adj(i) ∩ Adj(j)| / |Adj(i) ∪ Adj(j)|
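A minimal sketch of this similarity in Python (networkx and its built-in karate club graph are used purely for illustration):

```python
import networkx as nx

def jaccard_sim(G, i, j):
    """Sim(i, j) = |Adj(i) & Adj(j)| / |Adj(i) | Adj(j)|."""
    a, b = set(G[i]), set(G[j])
    union = a | b
    return len(a & b) / len(union) if union else 0.0

G = nx.karate_club_graph()
print(jaccard_sim(G, 0, 1))   # overlap of two adjacent nodes' friend lists
```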
Why is this important?
1. DMP Segments
2. Recommendations on goods
3. Centrality within assemblages: actual data flows
4. Comparison to nominal communities (dormitories, VK groups)
5. Compression & Visualization
How does one find
assemblages?
• Graph partitioning: optimal partitioning into a preset number k of parts.
How can one choose that k well?
• Community detection: finding particular dense aggregations
k is not controlled directly
assemblages don't have to cover the whole graph
assemblages may overlap
How does one assess
success?
• Objective functions
• Exact optimization over the whole graph is intractable: only approximate heuristics
• Comparison of results from various algorithms
• Selection of the best k for graph partitioning
• If ground truth is available — use standard classification metrics
• Often 'WTF' is the best metric of all — look at the result with your own eyes
CLIQUES
Cliques
• A clique proper: a complete subgraph
• Everybody knows everybody
• Sometimes maximality is required: no node can be added
• Bron–Kerbosch algorithm (1973): O(n · 3^(d_max/3)), where n is the number of nodes and d_max the maximum degree
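For reference, networkx already ships a maximal-clique enumerator in the Bron–Kerbosch family; a quick illustration:

```python
import networkx as nx

G = nx.karate_club_graph()
maximal_cliques = list(nx.find_cliques(G))   # Bron-Kerbosch-style enumeration
print(len(maximal_cliques))                  # number of maximal cliques
print(max(maximal_cliques, key=len))         # one of the largest cliques
```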
Disadvantages of cliques
• Too strict an assumption
• Big cliques are almost absent
• Small cliques are present even in an Erdős–Rényi graph
• The disappearance of a single edge destroys the whole clique
• No core and periphery within the assemblage
• Symmetry: centrality is meaningless
Generalizations
• n-clique: maximal subgraph in which the distance between any two nodes is at most n
• at n = 1 it reduces to a simple clique
• an n-clique may even be internally disconnected!
• n-club: an n-clique with diameter n; always connected
• k-core: maximal subgraph where every node has at least k internal neighbors
• p-clique: the share of each node's neighbors that are internal is at least p (0 to 1)
Algorithm for finding k-cores
• Batagelj and Zaversnik algorithm (2003): O(m), where m is the number of edges
• Input: graph G(V,E)
• Output: k value for each node
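The pseudocode slide did not survive extraction, but the algorithm itself (iteratively peel off a node of minimum degree, recording the k at which it was removed) is available off the shelf; a sketch with networkx:

```python
import networkx as nx

G = nx.karate_club_graph()
core = nx.core_number(G)   # the k value for each node, computed by O(m) peeling
k3 = nx.k_core(G, k=3)     # maximal subgraph where every node has >= 3 internal neighbors
print(sorted(core.values())[-5:], k3.number_of_nodes())
```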
Metrics and objective
functions
Modularity
• Let k_i denote the degree of node i
• m - number of edges, A - adjacency matrix
• Rewire all edges while preserving the degree distribution
• Probability that i and j are connected (roughly!): k_i k_j / 2m
• Modularity: a measure of the 'non-randomness' of an assemblage:

  Q = (1/2m) Σ_{i,j} [ A_ij − k_i k_j / (2m) ] · I(c_i = c_j)
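A direct, unoptimized translation of this definition into Python (the quadratic double loop is for clarity only; networkx.algorithms.community.modularity does the same job efficiently):

```python
import networkx as nx

def modularity(G, communities):
    """Q = (1/2m) * sum_{i,j} [A_ij - k_i*k_j/(2m)] * I(c_i == c_j)."""
    m = G.number_of_edges()
    label = {v: c for c, nodes in enumerate(communities) for v in nodes}
    Q = 0.0
    for i in G:
        for j in G:
            if label[i] != label[j]:
                continue                  # I(c_i = c_j) = 0
            a_ij = 1.0 if G.has_edge(i, j) else 0.0
            Q += a_ij - G.degree(i) * G.degree(j) / (2 * m)
    return Q / (2 * m)

G = nx.karate_club_graph()
parts = [set(range(17)), set(range(17, 34))]   # an arbitrary split, for illustration
print(modularity(G, parts))
```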
Modularity properties
• m_s: number of edges in assemblage s, d_s: total degree of nodes in s

  Q ≡ Σ_s [ m_s/m − (d_s/2m)² ]

• Maximum value: for S disconnected cliques, Q = 1 − 1/S
• Maximum for a connected graph: S equal subgraphs, each pair joined by a single edge
• In that case Q = 1 − 1/S − S/m
Modularity disadvantages
• Resolution limit!
• Assemblages with m_s < √(2m) merge into one
• If the cliques in the picture have size n_s, they merge once n_s(n_s − 1) + 1 < S
• Cluster sizes get pulled toward this resolution scale rather than the true structure
WCC
• Let S be an assemblage (its node set)
• V - all nodes
• t(x, S): number of triangles within S that contain node x
• vt(x, S): number of nodes of S forming at least one triangle with node x
• Weighted community clustering for one node (0 if x belongs to no triangle at all):

  WCC(x, S) = [ t(x, S) / t(x, V) ] · [ vt(x, V) / ( |S \ {x}| + vt(x, V \ S) ) ]

• Average for the whole assemblage:

  WCC(S) = (1/|S|) Σ_{x∈S} WCC(x, S)
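A direct, unoptimized sketch of these definitions; the exact scoping of the triangle's third vertex follows one reading of the SCD paper, so treat this as illustrative:

```python
import networkx as nx
from itertools import combinations

def t(G, x, nodes):
    """Triangles containing x whose other two vertices lie in `nodes`."""
    nbrs = set(G[x]) & set(nodes)
    return sum(1 for u, v in combinations(nbrs, 2) if G.has_edge(u, v))

def vt(G, x, nodes):
    """Nodes in `nodes` closing at least one triangle with x."""
    nbrs_all = set(G[x])
    cand = nbrs_all & set(nodes)
    return sum(1 for u in cand if any(G.has_edge(u, w) for w in nbrs_all if w != u))

def wcc_node(G, x, S):
    V, S = set(G), set(S)
    t_xV = t(G, x, V)
    if t_xV == 0:
        return 0.0                         # x participates in no triangle at all
    denom = len(S - {x}) + vt(G, x, V - S)
    return (t(G, x, S) / t_xV) * (vt(G, x, V) / denom) if denom else 0.0

def wcc(G, S):
    return sum(wcc_node(G, x, S) for x in S) / len(S)
```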
WCC
• A product of two components
• Triangle = 'a little clique of friends'
• On the left: what share of the cliques involving x lies inside its 'home' community?
• On the right, in the numerator: how many people are involved in cliques with x overall?
• On the right, in the denominator: how many people would have been involved in cliques with x, had S been a clique?
WCC
Algorithms
Newman-Girvan
• Hierarchical divisive algorithm
• Sequentially remove the edges of greatest betweenness
• Stop when the criterion is fulfilled (for instance, k connected components obtained)
• O(n·m) to compute the shortest paths
• O(n+m) to recompute the connected components
• The whole algorithm: at least O(n³)
• Not used on graphs beyond several thousand nodes
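networkx ships this algorithm directly; a usage sketch that keeps splitting until k = 4 components are obtained:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()
for communities in girvan_newman(G):   # each level removes max-betweenness edges
    if len(communities) >= 4:          # stop criterion: k connected components
        break
print([sorted(c) for c in communities])
```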
k-medoids
• k-means requires a normed space
• A graph only provides distances between nodes (for instance, 1 − Jaccard), so k-means is not applicable
• k-medoids: only existing points can act as centroids
• k must be set in advance (graph partitioning)
• the best-known variant is called PAM
k-medoids: PAM
1. Explicitly set k, the number of clusters
2. Initialize: select k random nodes as medoids
3. For each point, find the closest medoid, forming the initial clustering
4. minCost = loss function of the initial configuration
5. For each medoid m:
5.1 For each node v != m within the cluster centered at m:
5.1.1 Shift the medoid to v
5.1.2 Re-assign all nodes to the new medoids
5.1.3 cost = loss function for the whole graph
5.1.4 if cost < minCost:
5.1.4.1 Remember the medoids
5.1.4.2 minCost = cost
5.1.5 Put the medoid back (to m)
6. Perform the best swap among all those found (i.e., change one medoid in one cluster)
7. Repeat steps 5-6 until the medoids are stable (a runnable sketch follows below)
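A runnable Python sketch of this pseudocode; the 1 − Jaccard distance and the karate club graph are illustrative choices, not part of PAM itself:

```python
import random
import networkx as nx

def pam(nodes, dist, k, max_iter=50, seed=0):
    """PAM: per sweep, try swapping each medoid with every non-medoid in
    its cluster, apply the single best swap, stop when medoids are stable."""
    rnd = random.Random(seed)
    medoids = rnd.sample(list(nodes), k)

    def assign(meds):
        return {v: min(meds, key=lambda m: dist(v, m)) for v in nodes}

    def cost(assignment):
        return sum(dist(v, m) for v, m in assignment.items())

    for _ in range(max_iter):
        assignment = assign(medoids)
        min_cost, best = cost(assignment), None
        for m in medoids:
            for v in (u for u, c in assignment.items() if c == m and u != m):
                trial = [v if x == m else x for x in medoids]   # shift medoid to v
                trial_cost = cost(assign(trial))                # re-assign everything
                if trial_cost < min_cost:
                    min_cost, best = trial_cost, trial          # remember the medoids
        if best is None:                                        # medoids are stable
            break
        medoids = best                                          # perform the best swap
    return medoids, assign(medoids)

G = nx.karate_club_graph()
def jaccard_dist(u, v):
    a, b = set(G[u]) | {u}, set(G[v]) | {v}   # include the node itself in its list
    return 1 - len(a & b) / len(a | b)

medoids, clusters = pam(list(G), jaccard_dist, k=2)
```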
k-medoids: new heuristics
1. While true:
1.1 For each medoid m:
1.1.1 Randomly select s points within the cluster centered at m
1.1.2 For each node v of the s:
1.1.2.1 Shift the medoid to v
1.1.2.2 Re-assign all nodes to the new medoids
1.1.2.3 cost = loss function for the whole graph
1.1.2.4 if cost < minCost:
1.1.2.4.1 Remember the medoids
1.1.2.4.2 minCost = cost
1.1.2.5 Put the medoid back (to m)
1.1.3 If the best of the s swaps improves the loss function:
1.1.3.1 Perform the swap
1.1.3.2 StableSequence = 0
1.1.4 Otherwise:
1.1.4.1 StableSequence += 1
1.1.4.2 If StableSequence > threshold:
1.1.4.2.1 Return the current configuration
k-medoids: clara
• Bagging for graph clustering
• Select and cluster a random subsample
• The remaining nodes are simply attached to the nearest medoids at the very end
• Run several times and select the best variant
• Speed-up only when complexity is super-linear in n
• Complexity of PAM: O(k · n² · iterations)
• Complexity of the new heuristic: O(k · n · s · iterations)
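A sketch of the CLARA wrapper around the pam() function from the previous slide's sketch; the sample sizes here are illustrative:

```python
import random

def clara(nodes, dist, k, n_runs=5, sample_size=40, seed=0):
    """Cluster random subsamples with PAM; keep the medoid set whose
    assignment of the FULL node set has the lowest cost."""
    rnd = random.Random(seed)
    nodes = list(nodes)
    best_cost, best = float("inf"), None
    for _ in range(n_runs):
        sample = rnd.sample(nodes, min(sample_size, len(nodes)))
        medoids, _ = pam(sample, dist, k, seed=rnd.randrange(10**9))
        # attach every remaining node to its nearest medoid at the very end
        assignment = {v: min(medoids, key=lambda m: dist(v, m)) for v in nodes}
        c = sum(dist(v, m) for v, m in assignment.items())
        if c < best_cost:
            best_cost, best = c, (medoids, assignment)
    return best
```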
k-medoids for DMP
segments
• Build a graph of domains
• Data: a sample of users and, for each, the set of visited domains
• Let U_x be the set of all users who visited domain x
• Weight of the edge between domains x and y:

  affinity(x, y) = (|U_x ∩ U_y| · |U|) / (|U_x| · |U_y|) = p̂(x, y) / (p̂(x) · p̂(y))

• Noise filtering (see the sketch below):
1. Threshold for nodes (domains): at least 15 visits
2. Threshold for edges: affinity of at least 20
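A sketch of building the weighted domain graph from raw (user, domain) visit pairs; note that the node threshold below counts distinct visitors rather than raw visits, a simplifying assumption:

```python
from collections import defaultdict
from itertools import combinations

def domain_graph(visits, min_users=15, min_affinity=20):
    """visits: iterable of (user, domain) pairs.
    Edge weight: affinity(x, y) = |Ux & Uy| * |U| / (|Ux| * |Uy|)."""
    users_of = defaultdict(set)
    all_users = set()
    for user, domain in visits:
        users_of[domain].add(user)
        all_users.add(user)
    domains = [d for d, us in users_of.items() if len(us) >= min_users]  # node threshold
    n_users = len(all_users)
    edges = {}
    for x, y in combinations(domains, 2):
        common = len(users_of[x] & users_of[y])
        if common:
            aff = common * n_users / (len(users_of[x]) * len(users_of[y]))
            if aff >= min_affinity:                                      # edge threshold
                edges[(x, y)] = aff
    return edges
```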
How much data? How many
clusters?
• 30,000 users yield around 1,200 domains and around 12 interpretable assemblages
• 500,000 users: 10,000 domains, around 30 assemblages
Interpreting the picture
• Node size: number of visits
• Node color: assemblage
• Edge color = assemblage color if internal,
otherwise grey
• Edge thickness: affinity
• Too complicated for networkX!
• Disadvantages of networkX visualization:
1. Inflexible
2. Non-interactive
3. Unstable
4. Slow
• Use graph-tool if possible!
News sites
Movies and TV series
Research papers, cartoons, and cars
Culinary
Kazakhstan
Books & Laws
Pre-processing: Local
sparsification
• Sparsify the graph while retaining the assemblage structure
• Algorithms get faster and the pictures get prettier
• Option 1 (global): sort all edges by Jaccard measure in descending order and remove the tail
• Minus: dense communities remain untouched while sparse ones get destroyed completely
• Option 2 (local): sort the neighbors of each node and retain the top ⌈d_i^e⌉ edges
• d_i - degree of node i, e between 0 and 1; at e = 0.5 sparsification is roughly tenfold
• The power law is retained, and connectedness is almost retained! (see the sketch below)
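A sketch of Option 2; the per-node budget is reconstructed here as ceil(d_i ** e), which is an assumption consistent with the tenfold reduction quoted for e = 0.5:

```python
import math
import networkx as nx

def local_sparsify(G, e=0.5):
    """Keep, for each node i, its top ceil(d_i ** e) edges ranked by
    Jaccard similarity; an edge survives if either endpoint keeps it."""
    def jac(u, v):
        a, b = set(G[u]), set(G[v])
        return len(a & b) / len(a | b)

    keep = set()
    for u in G:
        ranked = sorted(G[u], key=lambda v: jac(u, v), reverse=True)
        budget = max(1, math.ceil(len(ranked) ** e))
        keep.update(frozenset((u, v)) for v in ranked[:budget])

    H = nx.Graph()
    H.add_nodes_from(G)
    H.add_edges_from(tuple(edge) for edge in keep)
    return H
```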
Local sparsification:
demonstration
Stable cores:
• Randomized algorithms are unstable
• Different runs return different results
• Adding or removing 2% of the nodes may completely change the clustering picture
• Stable cores: run the algorithm 100 times and count the share of runs in which each pair of nodes lands in the same cluster (see the sketch below)
• The result is a hierarchical clustering
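A sketch of the co-clustering frequency matrix, using networkx's Louvain implementation (networkx >= 2.8) as the randomized algorithm; feeding 1 − freq into any hierarchical clustering routine then yields the dendrogram:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities

def co_clustering_frequency(G, runs=100):
    """freq[a, b] = share of runs in which nodes a and b share a cluster."""
    nodes = list(G)
    idx = {v: i for i, v in enumerate(nodes)}
    freq = np.zeros((len(nodes), len(nodes)))
    for seed in range(runs):
        for community in louvain_communities(G, seed=seed):
            members = [idx[v] for v in community]
            for a in members:
                freq[a, members] += 1
    return freq / runs
```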
Louvain
• Blondel et al, 2008
• The most celebrated of the modularity-based algorithms
• Multi-level assemblages
• Very fast
• 1. Initialize: every node on its own (n assemblages of 1 node each)
• 2. Iteratively merge the pairs of assemblages that yield the greatest modularity gain
• 3. When the gain ceases, collapse each assemblage into a single node of a new graph
• 4. Repeat steps 2-3 until only 2 assemblages remain
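Usage is a one-liner in modern networkx (>= 2.8), which implements exactly this multi-level scheme:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.karate_club_graph()
parts = louvain_communities(G, seed=42)   # list of node sets, one per assemblage
print(len(parts), modularity(G, parts))
```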
Louvain: illustration
• Cell phone operator from Belgium
• 2.6 million customers
• 260 assemblages with over 100
customers, 36 with over 10,000
• 6 assemblage levels
• French and Dutch segments are
almost independent
MCL
• Markov Cluster Algorithm (van Dongen, 1997-2000)
• Normalize the columns of the adjacency matrix: M = AD⁻¹
• 'A share of money for each friend', or the transition probabilities of a random walk
• Repeat 3 steps in iterations:
• Expand: M ← M·M
• Inflate: raise each entry to the power r and renormalize the columns (the number of clusters grows with r)
• Prune: zero out the smallest elements in each column
• Repeat until M converges
• Complexity: ~n·d² for the first iteration (subsequent ones are faster)
Excerpt:

The transition matrix M of the graph G is also referred to as the flow matrix of G: the i-th column contains the flows out of node i, and correspondingly the i-th row contains the in-flows. Note that while the columns (or out-flows) sum to 1, the rows (or in-flows) are not required to do so. The most common way of deriving a column-stochastic transition matrix for a graph is to simply normalize the columns of the adjacency matrix:

  M(i, j) = A(i, j) / Σ_{k=1}^{n} A(k, j)

In matrix notation, M := AD⁻¹, where D is the diagonal degree matrix with D(i, i) = Σ_{j=1}^{n} A(j, i). One can associate other stochastic matrices with the graph G. Both MCL and the methods introduced in Section 3.2 can be thought of as simulating stochastic flows (or simulating random walks) on graphs according to certain rules. For this reason, we refer to these processes as flow simulations.

3.1.2 Markov Clustering (MCL) Algorithm

The MCL algorithm, proposed by Stijn van Dongen [41], is an iterative process of applying two operators - expansion and inflation - on an initial stochastic matrix M, in alternation, until convergence. Both expansion and inflation are operators that map the space of column-stochastic matrices onto itself. Additionally, a prune step is performed at the end of each inflation step in order to save memory. Each of these steps is defined below:

Expand: Input M, output Mexp.

  Mexp = Expand(M) := M · M

The i-th column of Mexp can be interpreted as the final distribution of a random walk of length 2 starting from vertex v_i, with the transition probabilities of the random walk given by M. One can take higher powers of M instead of a square (corresponding to longer random walks), but this gets computationally prohibitive very quickly.

Inflate: Input M and inflation parameter r, output Minf.

  Minf(i, j) := M(i, j)^r / Σ_{k=1}^{n} M(k, j)^r

Minf corresponds to raising each entry in the matrix M to the power r and then normalizing the columns to sum to 1. By default r = 2. Because the entries in the matrix are all guaranteed to be less than or equal to 1, this operator has the effect of exaggerating the inhomogeneity in each column (as long as r > 1). In other words, flow is strengthened where it is already strong and weakened where it is weak.

Prune: In each column, we remove those entries which have very small values (where "small" is defined in relation to the rest of the entries in the column), and the retained entries are rescaled to sum to 1.
MCL: example

A simple example of the MCL process in action, for a 6-node graph: two triangles {1, 2, 3} and {4, 5, 6} joined by the single edge (3, 4). The initial stochastic matrix M0, obtained by adding self-loops to the graph and normalizing each column, is given below:

  M0 = [ 0.33 0.33 0.25 0    0    0
         0.33 0.33 0.25 0    0    0
         0.33 0.33 0.25 0.25 0    0
         0    0    0.25 0.25 0.33 0.33
         0    0    0    0.25 0.33 0.33
         0    0    0    0.25 0.33 0.33 ]

The result of applying one iteration of the Expansion, Inflation and Prune steps is given below:

  M1 = [ 0.33 0.33 0.2763 0      0    0
         0.33 0.33 0.2763 0      0    0
         0.33 0.33 0.4475 0      0    0
         0    0    0      0.4475 0.33 0.33
         0    0    0      0.2763 0.33 0.33
         0    0    0      0.2763 0.33 0.33 ]

Note that the flow along the lone inter-cluster edge (M0(4, 3)) has evaporated to 0. Applying one more iteration results in convergence:

  M2 = [ 0 0 0 0 0 0
         0 0 0 0 0 0
         1 1 1 0 0 0
         0 0 0 1 1 1
         0 0 0 0 0 0
         0 0 0 0 0 0 ]

Hence, vertices 1, 2 and 3 flow completely to vertex 3, whereas vertices 4, 5 and 6 flow completely to vertex 4. We therefore group 1, 2 and 3 together, with 3 as the "attractor" of the cluster, and similarly for 4, 5 and 6.
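A compact numpy sketch of the expand/inflate/prune loop described above; the attractor read-out at the end is a simplification of van Dongen's cluster interpretation, and the prune threshold is an illustrative choice:

```python
import numpy as np

def mcl(A, r=2.0, prune=1e-4, max_iter=100):
    """Markov Cluster Algorithm on an adjacency matrix A."""
    M = A + np.eye(len(A))                  # add self-loops, as in the example
    M = M / M.sum(axis=0)                   # column-stochastic flow matrix
    for _ in range(max_iter):
        M_new = M @ M                       # expand: random walks of length 2
        M_new = M_new ** r                  # inflate: exaggerate strong flow
        M_new[M_new < prune] = 0.0          # prune tiny entries
        M_new = M_new / M_new.sum(axis=0)   # renormalize columns
        if np.allclose(M, M_new, atol=1e-9):
            break
        M = M_new
    clusters = {}                           # group columns by the row they flow to
    for j in range(M.shape[1]):
        clusters.setdefault(int(M[:, j].argmax()), []).append(j)
    return list(clusters.values())

# The 6-node example above, 0-indexed: triangles {0,1,2} and {3,4,5}, bridge (2,3)
A = np.array([[0,1,1,0,0,0],
              [1,0,1,0,0,0],
              [1,1,0,1,0,0],
              [0,0,1,0,1,1],
              [0,0,0,1,0,1],
              [0,0,0,1,1,0]], float)
print(mcl(A))   # expected: [[0, 1, 2], [3, 4, 5]]
```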
MCL: illustration
MCL: problems and
solutions
• Problems:
• Too many clusters
• Decreasing the inflation parameter r gives fewer clusters, but slower convergence
• Unbalanced clusters: one giant plus a pile of 2-3-node ones
• The culprit is overfitting!
• Try to make the flow distributions of neighboring nodes similar
• Regularization (R-MCL): Mexp = M · M0
SCD: step 1
• Approximate maximization of WCC
• Count the triangles first
• Remove all edges that form no triangles
• Rough partitioning (Algorithm 1):
• 1. Sort the nodes by local clustering coefficient
• 2. First assemblage: first node + all neighbors
• 3. Second assemblage: first node of those
remaining (not visited yet) + all its neighbors
• 4. ...
• Complexity: O(n*d^2+n*log(n))
SCD: step 2
• Iteratively improve the result of Algorithm 1 until WCC ceases improving (Algorithm 2)
• Find the bestMovement for each node (MapReduce)
• bestMovement: add / remove / transfer
• Perform the bestMovement simultaneously for all nodes
• Complexity: O(1) per bestMovement, O(d+1) per node, O(m) for the whole graph
• The whole algorithm: O(m·log n)
SCD: experiment
Spinner
• Based on label propagation
• Implemented in Okapi (Mahout) by Telefonica
• Symmetrize the initial graph: weight w(u,v) = 1 if the edge existed in one direction, 2 if in both
• Regulate the balance: cap the number of edges allowed per partition, C = c·|E|/k (c may vary from 1 to 10-15)
• Load B(l) of a partition: the current number of edges in partition l
• Relative load: B(l)/C
• The number of clusters k is preset, as in k-medoids; labels l = 1, ..., k
• Which label should node v get? The one most common among the neighbors of v
• Correction for balance: penalize nearly full partitions (details in the excerpt below)
Excerpt (from the Spinner paper):

V is the set of vertices in the graph and E is the set of edges, so that an edge e ∈ E is a pair (u, v) with u, v ∈ V. We denote by N(v) = {u : u ∈ V, (u, v) ∈ E} the neighborhood of a vertex v, and by deg(v) = |N(v)| the degree of v. In a k-way partitioning, we define L as a set of labels L = {l1, ..., lk} that essentially correspond to the k partitions. α is the labeling function α : V → L such that α(v) = lj if label lj is assigned to vertex v. The end goal of Spinner is to assign partitions, or labels, to each vertex such that it maximizes edge locality and partitions are balanced.

3.1 K-way Label Propagation. We first describe how to use basic LPA to maximize edge locality and then extend the algorithm to achieve balanced partitions. Initially, each vertex in the graph is assigned a label li at random, with 0 < i ≤ k. Subsequently, every vertex iteratively propagates its label to its neighbors. During this iterative process, a vertex adopts the label that is more frequent among its neighbors. Every vertex assigns a different score to a particular label l, which is equal to the number of neighbors assigned to label l; a vertex shows preference to labels with a high score. More formally:

  score(v, l) = Σ_{u∈N(v)} δ(α(u), l)

where δ is the Kronecker delta. The vertex updates its label to the label lv that maximizes its score according to the update function, keeping the current label if it is among the maximizers; this improves convergence speed [6] and reduces unnecessary network communication (Section 4). The algorithm halts when no vertex updates its label.

The original formulation of LPA assumes undirected graphs; however, very often graphs are directed (e.g. the Web). Even though the programming models of systems like Pregel allow directed graphs, and there are algorithms aware of graph directness, like PageRank, to apply LPA as is we would need to convert a graph to undirected. A naive approach would be to create an undirected edge between vertices u and v whenever at least one directed edge exists between u and v in the directed graph. This approach, though, is agnostic to the communication patterns of the applications running on top: under it, any migration of a vertex to another partition that removes one cut edge is equally likely. Once we consider the directness of the edges in the original graph, not all migrations are equally beneficial; once the graph is loaded into the system and messages are sent across the edges, the decision that cuts fewer directed edges results in less communication.

Here, we consider the case of a homogeneous system, where each machine has equal resources. This setup is often preferred in synchronous graph processing systems like Pregel, to minimize the time spent by faster machines waiting at the synchronization barrier for stragglers. We define the capacity C of a partition as the maximum number of edges it can have so that partitions are balanced:

  C = c · |E| / k

Parameter c > 1 ensures additional capacity to each partition, available for migrations. We define the load of a partition as the actual number of edges in that partition:

  B(l) = Σ_{v∈G} deg(v) · δ(α(v), l)

A larger value of c increases the number of migrations to a partition allowed at each iteration, possibly speeding up convergence, but it may increase unbalance, as more edges are allowed to be assigned to each partition over the ideal value |E|/k. Note that the balance constraint is soft, only encouraging a similar number of edges across different partitions; as we will show, this decision allows a fully decentralized algorithm.

We introduce a penalty function to discourage assigning vertices to nearly full partitions. Given a partition indicated by label l, the penalty function p(l) is defined as follows:

  p(l) = B(l) / C        (7)

To integrate the penalty function we normalize (4) first, and reformulate the score function as follows:

  score''(v, l) = [ Σ_{u∈N(v)} w(u, v) · δ(α(u), l) / Σ_{u∈N(v)} w(u, v) ] − p(l)        (8)
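A toy, single-machine sketch of this penalized label propagation (the real Spinner runs decentralized on Pregel/Giraph); the capacity below uses 2|E|/k because B(l) as defined counts edge endpoints, an adjustment made for this sketch:

```python
import random
import networkx as nx
from collections import Counter

def spinner_like(G, k=4, c=1.1, max_iter=100, seed=0):
    """score''(v, l) = (share of v's neighbors with label l) - B(l)/C."""
    rnd = random.Random(seed)
    labels = {v: rnd.randrange(k) for v in G}
    capacity = c * 2 * G.number_of_edges() / k     # C, counted per edge endpoint
    load = Counter()                               # B(l) = sum of deg(v) with label l
    for v in G:
        load[labels[v]] += G.degree(v)
    for _ in range(max_iter):
        changed = 0
        for v in G:
            counts = Counter(labels[u] for u in G[v])
            deg = max(G.degree(v), 1)
            best = max(range(k),
                       key=lambda l: counts[l] / deg - load[l] / capacity)
            if best != labels[v]:
                load[labels[v]] -= G.degree(v)
                load[best] += G.degree(v)
                labels[v] = best
                changed += 1
        if changed == 0:                           # halt: no vertex updated its label
            break
    return labels

print(Counter(spinner_like(nx.karate_club_graph(), k=3).values()))
```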
Spinner: Scalability
• Computation in Pregel: a perfect fit for label propagation
• Easy to add and remove clusters (1 cluster per worker)
• Easy recalculation when nodes are added or removed
• Scalability of Spinner when clustering a random (Watts-Strogatz) graph
• Resource savings when new edges or new assemblages (workers) are added
(Figure 4: Partitioning of (a) the Twitter graph across 256 partitions and (b) the Yahoo! web graph across 115 partitions. The figure shows the evolution of metrics f, r, and score(G) across iterations.)

(Figure 5: Scalability of Spinner: (a) runtime as a function of the number of vertices, (b) runtime as a function of the number of workers, (c) runtime as a function of the number of partitions k. The experiments were executed on an AWS Hadoop cluster of 116 m2.4xlarge machines; the number of outgoing edges per vertex is fixed at 40, vertices are connected in a ring lattice topology, and 30% of the edges are re-wired randomly per the beta (0.3) parameter of the Watts-Strogatz model.)

(Figure 6: Adapting to dynamic graph changes: (a) cost savings, (b) partitioning stability.)

(Figure 7: Adapting to resource changes: (a) cost savings, (b) partitioning stability.)
Tools
• NetworkX: cliques, k-cores, blockmodels
• graph-tool: very fast blockmodels, visualization
• okapi (mahout): k-cores, Spinner
• GraphX (spark): nothing as yet
• Gephi: MCL, Girvan-Newman, Chinese Whispers
• micans.org: MCL by the creator
• mapequation.org: Infomap
• sites.google.com/site/findcommunities: Louvain from the creators (C++)
• pycluster (coming soon): k-medoids
Further reading
• Mining of Massive Datasets (ch.10 «Mining social networks
graphs», pp 343-402)
• Data Clustering Algorithms and Applications (ch 17 «Network
clustering», pp 415-443)
• SNA Course at Coursera: www.coursera.org/course/sna
• Great review of community finding: snap.stanford.edu/class/cs224w-readings/fortunato10community.pdf
• Articles with analysed methods and relevant cases (to be sent)
THANK YOU

More Related Content

What's hot

CS 354 Bezier Curves
CS 354 Bezier Curves CS 354 Bezier Curves
CS 354 Bezier Curves Mark Kilgard
 
Neural Collaborative Subspace Clustering
Neural Collaborative Subspace ClusteringNeural Collaborative Subspace Clustering
Neural Collaborative Subspace ClusteringTakahiro Hasegawa
 
Lecture 9-online
Lecture 9-onlineLecture 9-online
Lecture 9-onlinelifebreath
 
Perspective in Informatics 3 - Assignment 1 - Answer Sheet
Perspective in Informatics 3 - Assignment 1 - Answer SheetPerspective in Informatics 3 - Assignment 1 - Answer Sheet
Perspective in Informatics 3 - Assignment 1 - Answer SheetHoang Nguyen Phong
 
Dijkstra's Algorithm
Dijkstra's AlgorithmDijkstra's Algorithm
Dijkstra's AlgorithmArijitDhali
 
Perspective in Informatics 3 - Assignment 2 - Answer Sheet
Perspective in Informatics 3 - Assignment 2 - Answer SheetPerspective in Informatics 3 - Assignment 2 - Answer Sheet
Perspective in Informatics 3 - Assignment 2 - Answer SheetHoang Nguyen Phong
 
On the Convex Layers of a Planer Dynamic Set of Points
On the Convex Layers of a Planer Dynamic Set of PointsOn the Convex Layers of a Planer Dynamic Set of Points
On the Convex Layers of a Planer Dynamic Set of PointsKasun Ranga Wijeweera
 
Lecture 14 data structures and algorithms
Lecture 14 data structures and algorithmsLecture 14 data structures and algorithms
Lecture 14 data structures and algorithmsAakash deep Singhal
 
Lec02 03 rasterization
Lec02 03 rasterizationLec02 03 rasterization
Lec02 03 rasterizationMaaz Rizwan
 
Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)
Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)
Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)Mohanlal Sukhadia University (MLSU)
 
Graph Theory,Graph Terminologies,Planar Graph & Graph Colouring
Graph Theory,Graph Terminologies,Planar Graph & Graph ColouringGraph Theory,Graph Terminologies,Planar Graph & Graph Colouring
Graph Theory,Graph Terminologies,Planar Graph & Graph ColouringSaurabh Kaushik
 

What's hot (19)

Backtracking
BacktrackingBacktracking
Backtracking
 
CS 354 Bezier Curves
CS 354 Bezier Curves CS 354 Bezier Curves
CS 354 Bezier Curves
 
Dijkstra c
Dijkstra cDijkstra c
Dijkstra c
 
Neural Collaborative Subspace Clustering
Neural Collaborative Subspace ClusteringNeural Collaborative Subspace Clustering
Neural Collaborative Subspace Clustering
 
Lecture 9-online
Lecture 9-onlineLecture 9-online
Lecture 9-online
 
Perspective in Informatics 3 - Assignment 1 - Answer Sheet
Perspective in Informatics 3 - Assignment 1 - Answer SheetPerspective in Informatics 3 - Assignment 1 - Answer Sheet
Perspective in Informatics 3 - Assignment 1 - Answer Sheet
 
Graph Coloring using Peer-to-Peer Networks
Graph Coloring using Peer-to-Peer NetworksGraph Coloring using Peer-to-Peer Networks
Graph Coloring using Peer-to-Peer Networks
 
Dijkstra's Algorithm
Dijkstra's AlgorithmDijkstra's Algorithm
Dijkstra's Algorithm
 
Perspective in Informatics 3 - Assignment 2 - Answer Sheet
Perspective in Informatics 3 - Assignment 2 - Answer SheetPerspective in Informatics 3 - Assignment 2 - Answer Sheet
Perspective in Informatics 3 - Assignment 2 - Answer Sheet
 
testpang
testpangtestpang
testpang
 
On the Convex Layers of a Planer Dynamic Set of Points
On the Convex Layers of a Planer Dynamic Set of PointsOn the Convex Layers of a Planer Dynamic Set of Points
On the Convex Layers of a Planer Dynamic Set of Points
 
Lecture 14 data structures and algorithms
Lecture 14 data structures and algorithmsLecture 14 data structures and algorithms
Lecture 14 data structures and algorithms
 
Lec02 03 rasterization
Lec02 03 rasterizationLec02 03 rasterization
Lec02 03 rasterization
 
Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)
Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)
Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)
 
Representation
RepresentationRepresentation
Representation
 
Neural Networks - How do they work?
Neural Networks - How do they work?Neural Networks - How do they work?
Neural Networks - How do they work?
 
Graph Theory,Graph Terminologies,Planar Graph & Graph Colouring
Graph Theory,Graph Terminologies,Planar Graph & Graph ColouringGraph Theory,Graph Terminologies,Planar Graph & Graph Colouring
Graph Theory,Graph Terminologies,Planar Graph & Graph Colouring
 
Face recognition using LDA
Face recognition using LDAFace recognition using LDA
Face recognition using LDA
 
Unit 3
Unit 3Unit 3
Unit 3
 

Similar to Clustering of graphs and search of assemblages

Ram minimum spanning tree
Ram   minimum spanning treeRam   minimum spanning tree
Ram minimum spanning treeRama Prasath A
 
CLIQUE Automatic subspace clustering of high dimensional data for data mining...
CLIQUE Automatic subspace clustering of high dimensional data for data mining...CLIQUE Automatic subspace clustering of high dimensional data for data mining...
CLIQUE Automatic subspace clustering of high dimensional data for data mining...Raed Aldahdooh
 
Randomness conductors
Randomness conductorsRandomness conductors
Randomness conductorswtyru1989
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionJinwon Lee
 
Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Grigory Yaroslavtsev
 
Analysis and design of a half hypercube interconnection network topology
Analysis and design of a half hypercube interconnection network topologyAnalysis and design of a half hypercube interconnection network topology
Analysis and design of a half hypercube interconnection network topologyAmir Masoud Sefidian
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012Ted Dunning
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraJason Riedy
 
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01Hemant Jha
 
Mrongraphs acm-sig-2 (1)
Mrongraphs acm-sig-2 (1)Mrongraphs acm-sig-2 (1)
Mrongraphs acm-sig-2 (1)Nima Sarshar
 
IJCAI13 Paper review: Large-scale spectral clustering on graphs
IJCAI13 Paper review: Large-scale spectral clustering on graphsIJCAI13 Paper review: Large-scale spectral clustering on graphs
IJCAI13 Paper review: Large-scale spectral clustering on graphsAkisato Kimura
 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfp_manimozhi
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
convolutional_neural_networks in deep learning
convolutional_neural_networks in deep learningconvolutional_neural_networks in deep learning
convolutional_neural_networks in deep learningssusere5ddd6
 

Similar to Clustering of graphs and search of assemblages (20)

Ram minimum spanning tree
Ram   minimum spanning treeRam   minimum spanning tree
Ram minimum spanning tree
 
09 placement
09 placement09 placement
09 placement
 
CLIQUE Automatic subspace clustering of high dimensional data for data mining...
CLIQUE Automatic subspace clustering of high dimensional data for data mining...CLIQUE Automatic subspace clustering of high dimensional data for data mining...
CLIQUE Automatic subspace clustering of high dimensional data for data mining...
 
Randomness conductors
Randomness conductorsRandomness conductors
Randomness conductors
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
 
Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)
 
Analysis and design of a half hypercube interconnection network topology
Analysis and design of a half hypercube interconnection network topologyAnalysis and design of a half hypercube interconnection network topology
Analysis and design of a half hypercube interconnection network topology
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear Algebra
 
Data Mining Lecture_7.pptx
Data Mining Lecture_7.pptxData Mining Lecture_7.pptx
Data Mining Lecture_7.pptx
 
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
 
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
 
Mrongraphs acm-sig-2 (1)
Mrongraphs acm-sig-2 (1)Mrongraphs acm-sig-2 (1)
Mrongraphs acm-sig-2 (1)
 
IJCAI13 Paper review: Large-scale spectral clustering on graphs
IJCAI13 Paper review: Large-scale spectral clustering on graphsIJCAI13 Paper review: Large-scale spectral clustering on graphs
IJCAI13 Paper review: Large-scale spectral clustering on graphs
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
1535 graph algorithms
1535 graph algorithms1535 graph algorithms
1535 graph algorithms
 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdf
 
Db Scan
Db ScanDb Scan
Db Scan
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
 
convolutional_neural_networks in deep learning
convolutional_neural_networks in deep learningconvolutional_neural_networks in deep learning
convolutional_neural_networks in deep learning
 

More from Data-Centric_Alliance

Оффлайн-данные в онлайн-рекламе
Оффлайн-данные в онлайн-рекламеОффлайн-данные в онлайн-рекламе
Оффлайн-данные в онлайн-рекламеData-Centric_Alliance
 
Mobile Programmatic Platform by Exebid.DCA
Mobile Programmatic Platform by Exebid.DCAMobile Programmatic Platform by Exebid.DCA
Mobile Programmatic Platform by Exebid.DCAData-Centric_Alliance
 
Full-Stack Programmatic Platform Exebid.DCA
Full-Stack Programmatic Platform Exebid.DCAFull-Stack Programmatic Platform Exebid.DCA
Full-Stack Programmatic Platform Exebid.DCAData-Centric_Alliance
 
Exebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформаExebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформаData-Centric_Alliance
 
Exebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформаExebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформаData-Centric_Alliance
 
Exebid.DCA, programmatic-платформа, которой можно доверять
Exebid.DCA, programmatic-платформа, которой можно доверятьExebid.DCA, programmatic-платформа, которой можно доверять
Exebid.DCA, programmatic-платформа, которой можно доверятьData-Centric_Alliance
 
Будущее медиа в эпоху больших данных: ничего личного
Будущее медиа в эпоху больших данных: ничего личногоБудущее медиа в эпоху больших данных: ничего личного
Будущее медиа в эпоху больших данных: ничего личногоData-Centric_Alliance
 
Лучшие рекламные кампании Exebid.DCA
Лучшие рекламные кампании Exebid.DCAЛучшие рекламные кампании Exebid.DCA
Лучшие рекламные кампании Exebid.DCAData-Centric_Alliance
 
Сравнение аудитории популярных автомобилей класса С
Сравнение аудитории популярных автомобилей класса ССравнение аудитории популярных автомобилей класса С
Сравнение аудитории популярных автомобилей класса СData-Centric_Alliance
 
Big Data is the new oil, или почему просто наличия (больших) данных недостато...
Big Data is the new oil, или почему просто наличия (больших) данных недостато...Big Data is the new oil, или почему просто наличия (больших) данных недостато...
Big Data is the new oil, или почему просто наличия (больших) данных недостато...Data-Centric_Alliance
 
Facetz.DCA. Платформа по управлению данными
Facetz.DCA. Платформа по управлению даннымиFacetz.DCA. Платформа по управлению данными
Facetz.DCA. Платформа по управлению даннымиData-Centric_Alliance
 

More from Data-Centric_Alliance (17)

Оффлайн-данные в онлайн-рекламе
Оффлайн-данные в онлайн-рекламеОффлайн-данные в онлайн-рекламе
Оффлайн-данные в онлайн-рекламе
 
Mobile Programmatic Platform by Exebid.DCA
Mobile Programmatic Platform by Exebid.DCAMobile Programmatic Platform by Exebid.DCA
Mobile Programmatic Platform by Exebid.DCA
 
Exebid.DCA Casebook
Exebid.DCA CasebookExebid.DCA Casebook
Exebid.DCA Casebook
 
Full-Stack Programmatic Platform Exebid.DCA
Full-Stack Programmatic Platform Exebid.DCAFull-Stack Programmatic Platform Exebid.DCA
Full-Stack Programmatic Platform Exebid.DCA
 
Mobile Programmatic Exebid.DCA
Mobile Programmatic Exebid.DCAMobile Programmatic Exebid.DCA
Mobile Programmatic Exebid.DCA
 
Exebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформаExebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформа
 
Exebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформаExebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформа
 
Exebid.DCA, programmatic-платформа, которой можно доверять
Exebid.DCA, programmatic-платформа, которой можно доверятьExebid.DCA, programmatic-платформа, которой можно доверять
Exebid.DCA, programmatic-платформа, которой можно доверять
 
Будущее медиа в эпоху больших данных: ничего личного
Будущее медиа в эпоху больших данных: ничего личногоБудущее медиа в эпоху больших данных: ничего личного
Будущее медиа в эпоху больших данных: ничего личного
 
Лучшие рекламные кампании Exebid.DCA
Лучшие рекламные кампании Exebid.DCAЛучшие рекламные кампании Exebid.DCA
Лучшие рекламные кампании Exebid.DCA
 
Сравнение аудитории популярных автомобилей класса С
Сравнение аудитории популярных автомобилей класса ССравнение аудитории популярных автомобилей класса С
Сравнение аудитории популярных автомобилей класса С
 
Big Data is the new oil, или почему просто наличия (больших) данных недостато...
Big Data is the new oil, или почему просто наличия (больших) данных недостато...Big Data is the new oil, или почему просто наличия (больших) данных недостато...
Big Data is the new oil, или почему просто наличия (больших) данных недостато...
 
DCA (Data-Centric Alliance)
DCA (Data-Centric Alliance)DCA (Data-Centric Alliance)
DCA (Data-Centric Alliance)
 
DataLift.DA
DataLift.DADataLift.DA
DataLift.DA
 
2Click.DCA
2Click.DCA2Click.DCA
2Click.DCA
 
Samba.DCA
Samba.DCASamba.DCA
Samba.DCA
 
Facetz.DCA. Платформа по управлению данными
Facetz.DCA. Платформа по управлению даннымиFacetz.DCA. Платформа по управлению данными
Facetz.DCA. Платформа по управлению данными
 

Recently uploaded

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 

Recently uploaded (20)

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 

Clustering of graphs and search of assemblages

  • 1. Clustering of graphs and search of assemblages Kirill Rybachuk, DCA
  • 2. What is assemglage? • Assemblage is an intuitive concept and has no common definition • Global approach: 'such aggregation cannot be for no reason' • Modularity: comparison with random graph (almost Erdös-Renew) • Local approach: 'more connections inside, then outside in the vicinity' • In weak sense: internal degree > external degree • In strong sense: internal degree > external degree for each node • Completely local approach: aggregation of 'similar' nodes • Jaccard measure: j belong to the same cluster. Another way of looking at it is that an edge that is a part of many triangles is probably in a dense region i.e. a cluster. We use the Jaccard measure to quantify the overlap between adjacency lists. Let Adj(i) be the adjacency list of i, and Adj(j) be the adjacency list of j. For simplicity, we will refer to the similarity between Adj(i) and Adj(j) as the similarity between i and j itself. Sim(i, j) = |Adj(i) ∩ Adj(j)| |Adj(i) ∪ Adj(j)| (5.1) Global Sparsification
  • 3. Why is this important? 1. DMP Segments 2. Recommendations on goods 3. Centricity within assemblages: actual data flows 4. Comparison to nominal communities (dormitories, vk groups) 5. Compression & Visualization
  • 4. How does one find assemblages? • Graph partitioning: optimal partitioning into preset number of graphs of k. how can one find that k well? • Community finding: finding of particular aggregations k not controlled directly assemblages don't have to cover completely assemblages may overlay
  • 5. How does one assess success? • Objective functions • Optimization on graph completely: only approximate heuristics • Comparison of results from various algorithms • Selection of the most optimal k for graph partitioning • If ground truth is available — use standard metrics for classifying purposes • Often WTF is the best metric ever — see it with your own eyes
  • 7. Cliques • Just clique: complete sub-graph • Everybody knows everybody • Sometimes maximality required: no one can be added • Brohn-Kerbosch algorithm (1973): 
 n - number of nodes, d_max - maximum degree O(n · 3dmax/3)
  • 8. Disadvantages of cliques • Too austere assumption • Big cliques are almost absent • Small cliques are present even in Erdös-Renew graph • Disappearance of just one edge destroys the whole clique • No center and outskirts of the assemblage • Symmetry: no sense in centricity
  • 9. Generalizations • n-clique: maximum subgraph where distance between any two nodes is no longer than n. • at n=1 reduces to simple clique • n-clique may be even non-connected inside! • n-club: n-clique with diameter n. Always connected. • k-core: maximum subgraph where every node has at least k internal neighbors. • p-clique: each node's internal neighbor share comprises at least p (0 to 1)
  • 10. Algorithm for finding k-cores • Batagelj and Zaversnik algorithm (2003): O(m) where m stands for number of edges • Input: graph G(V,E) • Output: k value for each node
 
 
 
 
 
 
 

  • 12. Modularity • Assume — degree of node i, • m - number of edges, A - connectivity matrix • Stir all edges retaining distribution of degrees • Probability of i and j connected (roughly!): • Modularity: measure of 'non-randomness' of an assemblage:
 
 
 ki kikj 2m Q = 1 2m i,j Aij − kikj 2m I ci = cj
  • 13. Modularity properties • m_s: number of edges in assemblage s, d_s: total degree of nodes in s • maximum value: at S disconnected cliques Q = 1-1/S • Maximum for connected graph: at S equal subgraphs connected by the same edge • In that case, Q=1-1/S-S/m Q ≡ s ms m − ds 2m 2
  • 14. Modularity disadvantages • Resolution limit! • Assemblages with merge into one • If cliques on the picture have dimension of n_s they merge at • Cluster dimension shifting towards fitting ms < √ 2m ns(ns − 1) + 1 < S
  • 15. WCC • Assume S is an assemblage (its nodes) • V - all nodes • : number of triangles within S where node x is present • number of nodes of S forming at least one triangle with node x • Weighted community clustering for one node: • Average for the whole assemblage:
 
 
 
 WCC(S) = 1 |S| x∈S WCC(x, S)
  • 16. WCC • Product of two components • Triangle = 'clique' • To the left: какая доля компашек с участием x находится внутри его «домашнего» сообщества? • To the right in the numerator: how many people overall are involved in cliques with x? • To the right in the denominator: how many people would have benn involved in cliqes with x, had S been a clique?
  • 17. WCC
  • 19. Newman-Girvan • Hierarchic divisive algorithm • Sequentially remove edges of greates betweenness • Stop as the criterion is fulfilled (for instance, obtainment of k connected components) • O(n*m) for calculation of shortest routes • O(n+m) for re-calculation of connected components • The whole algorithm: at least O(n^3) • Not used with graphs exceeding several thousands of nodes
  • 20. k-medoids • k-means: normalized space required • Graph determines clearance between nodes only (for instance, 1 - Jaccard), therefore k- means not suitable • k-medoids: only available points act as centroids • k shall be predetermined (graph partitioning) • the most renowned variant is called PAM
  • 21. k-medoids: PAM 1. Expressedly set k - number of clusters 2. Initialize: select k of random nodes as medoids 3. For each point find the closest medoid thus forming initial clustering 4. minCost = initial configuration losses function 5. For each medoid m: 5.1 For each node v!=m within cluster centered in m: 5.1.1 Shift the medoid to v 5.1.2 Re-distribute all nodes between new medoids 5.1.3 cost=function of losses for the whole graph 5.1.4 if cost<minCost: 5.1.4.1 Remember the medoids 5.1.4.2 minCost=cost 5.1.5 Put the medoid back (to m) 6. Perfor the best substitution of all those found (i.e., change one medoid within one cluster) 7. Repeat items 4-5 until medoids are stable
  • 22. k-medoids: a new heuristic
  1. While true:
     1.1 For each medoid m:
         1.1.1 Randomly pick s points in the cluster centered at m
         1.1.2 For each node v among the s:
               1.1.2.1 Move the medoid to v
               1.1.2.2 Re-assign all nodes to the new medoids
               1.1.2.3 cost = loss function over the whole graph
               1.1.2.4 If cost < minCost:
                       1.1.2.4.1 Remember these medoids
                       1.1.2.4.2 minCost = cost
               1.1.2.5 Put the medoid back (to m)
         1.1.3 If the best of the s substitutions improves the loss function:
               1.1.3.1 Perform the substitution
               1.1.3.2 stableSequence = 0
         1.1.4 Otherwise:
               1.1.4.1 stableSequence += 1
               1.1.4.2 If stableSequence > threshold:
                       1.1.4.2.1 Return the current configuration
  • 23. k-medoids: CLARA • Bagging for graph clustering • Select and cluster a random subsample • The remaining nodes are simply attached to the nearest medoids at the very end • Run several times and keep the best variant • Speedup appears only for algorithms with complexity above O(n) • Complexity of PAM: O(k · n² · iterations) • Complexity of the new heuristic: O(k · n · s · iterations)
  • 24. k-medoids for DMP segments • Build a graph of domains • Data: a sample of users, and the set of domains each of them visited • Let $U_x$ be the set of users who visited domain x • Weight of the edge between domains x and y:

$$\text{affinity}(x, y) = \frac{|U_x \cap U_y| \, |U|}{|U_x| \, |U_y|} = \frac{\hat p(x, y)}{\hat p(x) \, \hat p(y)}$$

  • Noise filtering: 1. Threshold on nodes (domains): at least 15 visits 2. Threshold on edges: affinity of at least 20
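A sketch of building this domain graph from raw (user, domain) pairs; the function name is hypothetical, and reading the node threshold as 15 distinct visitors is an assumption:

```python
from collections import defaultdict
from itertools import combinations

def domain_graph(visits, min_visitors=15, min_affinity=20):
    """Weighted domain graph from (user, domain) pairs, with
    affinity(x, y) = |Ux & Uy| * |U| / (|Ux| * |Uy|)."""
    users_of = defaultdict(set)
    for user, domain in visits:
        users_of[domain].add(user)
    n = len({user for user, _ in visits})                  # |U|
    # node threshold: drop rarely visited domains
    users_of = {d: u for d, u in users_of.items() if len(u) >= min_visitors}
    edges = {}
    for x, y in combinations(users_of, 2):
        overlap = len(users_of[x] & users_of[y])
        if overlap:
            aff = overlap * n / (len(users_of[x]) * len(users_of[y]))
            if aff >= min_affinity:                        # edge threshold
                edges[(x, y)] = aff
    return edges
```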
  • 25. How much data? How many clusters? • 30,000 users yield around 1,200 domains and around 12 interpretable assemblages • 500,000 users: 10,000 domains, around 30 assemblages
  • 26. Interpreting the picture • Node size: number of visits • Node color: assemblage • Edge color = the assemblage color if internal, otherwise grey • Edge thickness: affinity • Too complicated for NetworkX! • Disadvantages of NetworkX visualization: 1. Inflexible 2. Non-interactive 3. Unstable 4. Slow • Use graph-tool if possible!
  • 28. Movies and TV series
  • 33. Pre-processing: local sparsification • Sparsify the graph while preserving the assemblage structure • Algorithms get faster, and pictures get prettier • Option 1 (global): sort all edges by Jaccard measure in descending order and drop the tail • Drawback: dense communities stay untouched, while sparse ones are destroyed completely • Option 2 (local): sort the neighbors of each node and retain the top $\lceil d_i^{\,e} \rceil$ edges • $d_i$ — degree of node i, e ranges from 0 to 1; at e = 0.5 sparsification is roughly tenfold • The power law is preserved, and connectivity is almost preserved!
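A sketch of Option 2, under the assumption that each node keeps its top ⌈d_i^e⌉ neighbors (at least one) ranked by Jaccard similarity:

```python
import math

def local_sparsify(adj, e=0.5):
    """Local sparsification sketch: adj maps node -> set of neighbours;
    returns the set of undirected edges that survive."""
    def jaccard(a, b):
        return len(adj[a] & adj[b]) / len(adj[a] | adj[b])

    kept = set()
    for i in adj:
        ranked = sorted(adj[i], key=lambda j: jaccard(i, j), reverse=True)
        keep = max(1, math.ceil(len(adj[i]) ** e))
        kept.update(frozenset((i, j)) for j in ranked[:keep])
    return kept
```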
  • 35. Stable cores • Randomized algorithms are unstable • Different runs return different results • Adding or removing 2% of nodes may completely change the clustering picture • Stable cores: run the algorithm 100 times and count the fraction of runs in which each pair of nodes lands in the same cluster • The result is a hierarchical clustering
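Stable cores reduce to a co-assignment matrix; a minimal sketch, assuming the runs are stored as equal-length label arrays:

```python
import numpy as np

def coassignment(runs):
    """Share of runs in which each pair of nodes ends up in the same
    cluster; thresholding this matrix yields the stable cores."""
    runs = [np.asarray(r) for r in runs]
    M = sum((r[:, None] == r[None, :]).astype(float) for r in runs)
    return M / len(runs)
```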
  • 36. Louvain • Blondel et al., 2008 • The best known of the modularity-based algorithms • Multi-level assemblages • Very fast • 1. Initialization: every node on its own (n assemblages of 1 node each) • 2. Iteratively merge the pairs of assemblages that yield the greatest modularity gain • 3. When the gain stops, collapse each assemblage into a single node of a new graph • 4. Repeat steps 2–3 until only 2 assemblages remain
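For a quick experiment, recent NetworkX releases bundle a Louvain implementation (the authors' original C++ code is linked on the Tools slide):

```python
import networkx as nx

G = nx.karate_club_graph()
parts = nx.community.louvain_communities(G, seed=42)
print(len(parts), round(nx.community.modularity(G, parts), 3))
```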
  • 37. Louvain: an illustration • A Belgian mobile operator • 2.6 million customers • 260 assemblages with over 100 customers, 36 with over 10,000 • 6 levels of the assemblage hierarchy • The French- and Dutch-speaking segments are almost independent
  • 38. MCL • Markov Cluster algorithm (van Dongen, 1997–2000) • Normalize the columns of the adjacency matrix: $M = AD^{-1}$ • 'A share of money for each friend', i.e. the transition probabilities of a random walk • Iterate three steps: • Expand: $M \leftarrow M \cdot M$ (a two-step random walk) • Inflate: $M_{ij} \leftarrow M_{ij}^{\,r} / \sum_k M_{kj}^{\,r}$ (the number of clusters grows with r; default r = 2) • Prune: zero out the smallest entries in each column, then renormalize • Repeat until M converges • Complexity: ~n·d² for the first iteration (subsequent ones are faster)
  • 39. MCL: an example • Two triangles {1,2,3} and {4,5,6} joined by the edge (3,4); adding self-loops and normalizing the columns gives the initial stochastic matrix:

$$M_0 = \begin{pmatrix} 0.33 & 0.33 & 0.25 & 0 & 0 & 0 \\ 0.33 & 0.33 & 0.25 & 0 & 0 & 0 \\ 0.33 & 0.33 & 0.25 & 0.25 & 0 & 0 \\ 0 & 0 & 0.25 & 0.25 & 0.33 & 0.33 \\ 0 & 0 & 0 & 0.25 & 0.33 & 0.33 \\ 0 & 0 & 0 & 0.25 & 0.33 & 0.33 \end{pmatrix}$$

  • After one iteration of Expand, Inflate and Prune:

$$M_1 = \begin{pmatrix} 0.33 & 0.33 & 0.2763 & 0 & 0 & 0 \\ 0.33 & 0.33 & 0.2763 & 0 & 0 & 0 \\ 0.33 & 0.33 & 0.4475 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.4475 & 0.33 & 0.33 \\ 0 & 0 & 0 & 0.2763 & 0.33 & 0.33 \\ 0 & 0 & 0 & 0.2763 & 0.33 & 0.33 \end{pmatrix}$$

  • The flow along the lone inter-cluster edge ($M_0(4,3)$) has evaporated to 0 • One more iteration converges:

$$M_2 = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$

  • Nodes 1, 2 and 3 flow entirely to node 3, and nodes 4–6 to node 4 • Each 'attractor' together with the nodes flowing into it forms one cluster
  • 41. MCL: problems and solutions • Problems: • Too many clusters • Lowering the inflation parameter r gives fewer clusters, but runs slower • Unbalanced clusters: one giant cluster and a pile of 2–3-node ones • The root cause is overfitting! • Remedy: make the flow distributions of neighboring nodes similar • Regularized MCL (R-MCL): $M_{exp} = M \cdot M_0$
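Putting slides 38 and 41 together, a bare-bones NumPy sketch of the expand/inflate/prune loop (the parameter defaults are illustrative):

```python
import numpy as np

def mcl(A, r=2.0, prune=1e-4, tol=1e-8, max_iter=100):
    """Bare-bones MCL: iterate expand / inflate / prune on a
    column-stochastic flow matrix. A: symmetric 0/1 adjacency matrix."""
    A = np.asarray(A, dtype=float) + np.eye(len(A))   # add self-loops
    M = A / A.sum(axis=0, keepdims=True)              # normalize columns
    for _ in range(max_iter):
        prev = M
        M = M @ M                                     # Expand: two-step walk
        M = M ** r                                    # Inflate: sharpen columns
        M[M < prune] = 0.0                            # Prune: drop tiny flows
        M = M / M.sum(axis=0, keepdims=True)          # renormalize
        if np.abs(M - prev).max() < tol:
            break
    # Non-zero rows mark "attractors"; each attractor plus the columns
    # flowing into it is one cluster. R-MCL would use M @ M0 in the
    # Expand step instead of M @ M.
    return M
```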
  • 42. SCD: step 1 • Approximate maximization of WCC • First, count the triangles • Remove all edges that belong to no triangle • Rough partitioning (algorithm 1; see the sketch after this list): • 1. Sort the nodes by local clustering coefficient • 2. First assemblage: the first node plus all its neighbors • 3. Second assemblage: the first of the remaining (not yet visited) nodes plus all its neighbors • 4. ... • Complexity: O(n·d² + n·log(n))
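A sketch of the rough partitioning pass, assuming the 'local cluster factor' is the usual local clustering coefficient and that already-assigned neighbors stay where they are:

```python
import networkx as nx

def rough_partition(G):
    """Algorithm 1, roughly: scan nodes by decreasing local clustering
    coefficient; each unvisited node grabs its unvisited neighbours
    as a new assemblage."""
    cc = nx.clustering(G)                   # local clustering coefficient
    visited, parts = set(), []
    for v in sorted(G, key=cc.get, reverse=True):
        if v in visited:
            continue
        part = {v} | (set(G[v]) - visited)
        visited |= part
        parts.append(part)
    return parts
```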
  • 43. SCD: step 2 • Iteratively refine the result of algorithm 1 until WCC stops improving (algorithm 2) • Find the bestMovement for each node (MapReduce) • bestMovement: insert / remove / transfer • Apply the bestMovements of all nodes simultaneously • Complexity: O(1) per bestMovement, O(d+1) per node, O(m) for the whole graph • The whole algorithm: O(m·log(n))
  • 45. Spinner • Based on label propagation • Implemented in Okapi (Mahout) by Telefonica • Symmetrize the initial directed graph D: weight w(u,v) = 1 if the edge existed in one direction, 2 if in both • Regulate the balance: cap the number of edges allowed in a partition (c typically ranges from 1 to 10–15):

$$C = c \cdot \frac{|E|}{k}$$

  • Partition load B(l): the current number of edge endpoints labeled l:

$$B(l) = \sum_{v \in G} \deg(v)\,\delta(\alpha(v), l)$$

  • Relative load (penalty): $p(l) = B(l)/C$ • The number of partitions k is preset, as in k-medoids; labels l = l_1, ..., l_k • Which label should node v get? The one most frequent among its neighbors:

$$score(v, l) = \sum_{u \in N(v)} w(u,v)\,\delta(\alpha(u), l)$$

  • Correction for balance — normalize and subtract the penalty:

$$score''(v, l) = \frac{\sum_{u \in N(v)} w(u,v)\,\delta(\alpha(u), l)}{\sum_{u \in N(v)} w(u,v)} - p(l)$$
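A toy version of one scoring step, assuming `adj_w` maps each vertex to a dict of weighted neighbors; all names here are hypothetical:

```python
from collections import defaultdict

def best_label(v, adj_w, label, load, capacity):
    """One Spinner-style update: weighted label frequency among v's
    neighbours, normalized, minus the load penalty p(l) = B(l) / C."""
    scores = defaultdict(float)
    total_w = sum(adj_w[v].values())
    for u, w in adj_w[v].items():
        scores[label[u]] += w
    return max(scores, key=lambda l: scores[l] / total_w - load[l] / capacity)
```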
  • 46. Spinner: scalability • Computation in Pregel — a perfect fit for label propagation • Easy to add and remove partitions (1 partition per worker) • Easy re-computation when nodes are added or removed • Scalability of Spinner when clustering a random (Watts–Strogatz) graph • Resource savings when adding new edges or new partitions (workers)
[Figures from the Spinner paper: Figure 4 — partitioning of the Twitter graph across 256 partitions and of the Yahoo! web graph across 115 partitions, showing the evolution of the metrics f, r and score(G) across iterations; Figure 5 — scalability: runtime vs. graph size, vs. cluster size, and vs. the number of partitions k, measured on an AWS Hadoop cluster of 116 m2.4xlarge machines with Watts–Strogatz graphs (40 edges per vertex, 30% rewired); Figures 6 and 7 — cost savings and partitioning stability when adapting to dynamic graph changes and to resource changes.]
  • 47. Tools • NetworkX: cliques, k-cores, blockmodels • graph-tool: very fast blockmodels, visualization • Okapi (Mahout): k-cores, Spinner • GraphX (Spark): nothing as yet • Gephi: MCL, Girvan–Newman, Chinese Whispers • micans.org: MCL from its creator • mapequation.org: Infomap • sites.google.com/site/findcommunities: Louvain from its creators (C++) • pycluster (coming soon): k-medoids
  • 48. Further reading • Mining of Massive Datasets (ch. 10, «Mining Social-Network Graphs», pp. 343–402) • Data Clustering: Algorithms and Applications (ch. 17, «Network Clustering», pp. 415–443) • SNA course at Coursera: www.coursera.org/course/sna • A great survey of community detection: snap.stanford.edu/class/cs224w-readings/fortunato10community.pdf • Papers on the methods covered, with relevant case studies (to be sent)