Clustering of graphs and
search of assemblages
Kirill Rybachuk, DCA
What is an assemblage?
• Assemblage is an intuitive concept and has no common definition
• Global approach: 'such aggregation cannot be for no reason'
• Modularity: comparison with a random graph (almost Erdős–Rényi)
• Local approach: 'more connections inside than outside in the vicinity'
• In weak sense: internal degree > external degree
• In strong sense: internal degree > external degree for each node
• Completely local approach: aggregation of 'similar' nodes
• Jaccard measure: an edge that is part of many triangles is probably in a dense region, i.e. a cluster. Quantify the overlap between adjacency lists: let Adj(i) be the adjacency list of i and Adj(j) the adjacency list of j; the similarity between Adj(i) and Adj(j) is taken as the similarity between i and j themselves:

  Sim(i, j) = |Adj(i) ∩ Adj(j)| / |Adj(i) ∪ Adj(j)|
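A minimal sketch of this similarity in Python (networkx and its built-in karate club graph are used purely for illustration):

```python
import networkx as nx

def jaccard_sim(G, i, j):
    """Sim(i, j) = |Adj(i) & Adj(j)| / |Adj(i) | Adj(j)|."""
    a, b = set(G[i]), set(G[j])
    union = a | b
    return len(a & b) / len(union) if union else 0.0

G = nx.karate_club_graph()
print(jaccard_sim(G, 0, 1))   # overlap of two adjacent nodes' friend lists
```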
Why is this important?
1. DMP Segments
2. Recommendations on goods
3. Centrality within assemblages: actual data flows
4. Comparison to nominal communities (dormitories, VK groups)
5. Compression & Visualization
How does one find
assemblages?
• Graph partitioning: optimal partitioning into a preset number k of parts.
How can one choose that k well?
• Community detection: finding particular dense aggregations
k is not controlled directly
assemblages don't have to cover the whole graph
assemblages may overlap
How does one assess
success?
• Objective functions
• Exact optimization over the whole graph is intractable: only approximate heuristics
• Comparison of results from various algorithms
• Selection of the best k for graph partitioning
• If ground truth is available — use standard classification metrics
• Often 'WTF' is the best metric of all — look at the result with your own eyes
CLIQUES
Cliques
• A clique proper: a complete subgraph
• Everybody knows everybody
• Sometimes maximality is required: no node can be added
• Bron–Kerbosch algorithm (1973): O(n · 3^(d_max/3)), where n is the number of nodes and d_max the maximum degree
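For reference, networkx already ships a maximal-clique enumerator in the Bron–Kerbosch family; a quick illustration:

```python
import networkx as nx

G = nx.karate_club_graph()
maximal_cliques = list(nx.find_cliques(G))   # Bron-Kerbosch-style enumeration
print(len(maximal_cliques))                  # number of maximal cliques
print(max(maximal_cliques, key=len))         # one of the largest cliques
```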
Disadvantages of cliques
• Too strict an assumption
• Big cliques are almost absent
• Small cliques are present even in an Erdős–Rényi graph
• The disappearance of a single edge destroys the whole clique
• No core and periphery within the assemblage
• Symmetry: centrality is meaningless
Generalizations
• n-clique: maximal subgraph in which the distance between any two nodes is at most n
• at n = 1 it reduces to a simple clique
• an n-clique may even be internally disconnected!
• n-club: an n-clique with diameter n; always connected
• k-core: maximal subgraph where every node has at least k internal neighbors
• p-clique: the share of each node's neighbors that are internal is at least p (0 to 1)
Algorithm for finding k-cores
• Batagelj and Zaversnik algorithm (2003): O(m), where m is the number of edges
• Input: graph G(V,E)
• Output: k value for each node
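The pseudocode slide did not survive extraction, but the algorithm itself (iteratively peel off a node of minimum degree, recording the k at which it was removed) is available off the shelf; a sketch with networkx:

```python
import networkx as nx

G = nx.karate_club_graph()
core = nx.core_number(G)   # the k value for each node, computed by O(m) peeling
k3 = nx.k_core(G, k=3)     # maximal subgraph where every node has >= 3 internal neighbors
print(sorted(core.values())[-5:], k3.number_of_nodes())
```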
Metrics and objective
functions
Modularity
• Let k_i denote the degree of node i
• m - number of edges, A - adjacency matrix
• Rewire all edges while preserving the degree distribution
• Probability that i and j are connected (roughly!): k_i k_j / 2m
• Modularity: a measure of the 'non-randomness' of an assemblage:

  Q = (1/2m) Σ_{i,j} [ A_ij − k_i k_j / (2m) ] · I(c_i = c_j)
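A direct, unoptimized translation of this definition into Python (the quadratic double loop is for clarity only; networkx.algorithms.community.modularity does the same job efficiently):

```python
import networkx as nx

def modularity(G, communities):
    """Q = (1/2m) * sum_{i,j} [A_ij - k_i*k_j/(2m)] * I(c_i == c_j)."""
    m = G.number_of_edges()
    label = {v: c for c, nodes in enumerate(communities) for v in nodes}
    Q = 0.0
    for i in G:
        for j in G:
            if label[i] != label[j]:
                continue                  # I(c_i = c_j) = 0
            a_ij = 1.0 if G.has_edge(i, j) else 0.0
            Q += a_ij - G.degree(i) * G.degree(j) / (2 * m)
    return Q / (2 * m)

G = nx.karate_club_graph()
parts = [set(range(17)), set(range(17, 34))]   # an arbitrary split, for illustration
print(modularity(G, parts))
```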
Modularity properties
• m_s: number of edges in assemblage s, d_s: total degree of nodes in s

  Q ≡ Σ_s [ m_s/m − (d_s/2m)² ]

• Maximum value: for S disconnected cliques, Q = 1 − 1/S
• Maximum for a connected graph: S equal subgraphs, each pair joined by a single edge
• In that case Q = 1 − 1/S − S/m
Modularity disadvantages
• Resolution limit!
• Assemblages with m_s < √(2m) merge into one
• If the cliques in the picture have size n_s, they merge once n_s(n_s − 1) + 1 < S
• Cluster sizes get pulled toward this resolution scale rather than the true structure
WCC
• Let S be an assemblage (its node set)
• V - all nodes
• t(x, S): number of triangles within S that contain node x
• vt(x, S): number of nodes of S forming at least one triangle with node x
• Weighted community clustering for one node (0 if x belongs to no triangle at all):

  WCC(x, S) = [ t(x, S) / t(x, V) ] · [ vt(x, V) / ( |S \ {x}| + vt(x, V \ S) ) ]

• Average for the whole assemblage:

  WCC(S) = (1/|S|) Σ_{x∈S} WCC(x, S)
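A direct, unoptimized sketch of these definitions; the exact scoping of the triangle's third vertex follows one reading of the SCD paper, so treat this as illustrative:

```python
import networkx as nx
from itertools import combinations

def t(G, x, nodes):
    """Triangles containing x whose other two vertices lie in `nodes`."""
    nbrs = set(G[x]) & set(nodes)
    return sum(1 for u, v in combinations(nbrs, 2) if G.has_edge(u, v))

def vt(G, x, nodes):
    """Nodes in `nodes` closing at least one triangle with x."""
    nbrs_all = set(G[x])
    cand = nbrs_all & set(nodes)
    return sum(1 for u in cand if any(G.has_edge(u, w) for w in nbrs_all if w != u))

def wcc_node(G, x, S):
    V, S = set(G), set(S)
    t_xV = t(G, x, V)
    if t_xV == 0:
        return 0.0                         # x participates in no triangle at all
    denom = len(S - {x}) + vt(G, x, V - S)
    return (t(G, x, S) / t_xV) * (vt(G, x, V) / denom) if denom else 0.0

def wcc(G, S):
    return sum(wcc_node(G, x, S) for x in S) / len(S)
```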
WCC
• A product of two components
• Triangle = 'a little clique of friends'
• On the left: what share of the cliques involving x lies inside its 'home' community?
• On the right, in the numerator: how many people are involved in cliques with x overall?
• On the right, in the denominator: how many people would have been involved in cliques with x, had S been a clique?
WCC
Algorithms
Newman-Girvan
• Hierarchical divisive algorithm
• Sequentially remove the edges of greatest betweenness
• Stop when the criterion is fulfilled (for instance, k connected components obtained)
• O(n·m) to compute the shortest paths
• O(n+m) to recompute the connected components
• The whole algorithm: at least O(n³)
• Not used on graphs beyond several thousand nodes
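networkx ships this algorithm directly; a usage sketch that keeps splitting until k = 4 components are obtained:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()
for communities in girvan_newman(G):   # each level removes max-betweenness edges
    if len(communities) >= 4:          # stop criterion: k connected components
        break
print([sorted(c) for c in communities])
```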
k-medoids
• k-means requires a normed space
• A graph only provides distances between nodes (for instance, 1 − Jaccard), so k-means is not applicable
• k-medoids: only existing points can act as centroids
• k must be set in advance (graph partitioning)
• the best-known variant is called PAM
k-medoids: PAM
1. Explicitly set k, the number of clusters
2. Initialize: select k random nodes as medoids
3. For each point, find the closest medoid, forming the initial clustering
4. minCost = loss function of the initial configuration
5. For each medoid m:
5.1 For each node v != m within the cluster centered at m:
5.1.1 Shift the medoid to v
5.1.2 Re-assign all nodes to the new medoids
5.1.3 cost = loss function for the whole graph
5.1.4 if cost < minCost:
5.1.4.1 Remember the medoids
5.1.4.2 minCost = cost
5.1.5 Put the medoid back (to m)
6. Perform the best swap among all those found (i.e., change one medoid in one cluster)
7. Repeat steps 5-6 until the medoids are stable (a runnable sketch follows below)
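A runnable Python sketch of this pseudocode; the 1 − Jaccard distance and the karate club graph are illustrative choices, not part of PAM itself:

```python
import random
import networkx as nx

def pam(nodes, dist, k, max_iter=50, seed=0):
    """PAM: per sweep, try swapping each medoid with every non-medoid in
    its cluster, apply the single best swap, stop when medoids are stable."""
    rnd = random.Random(seed)
    medoids = rnd.sample(list(nodes), k)

    def assign(meds):
        return {v: min(meds, key=lambda m: dist(v, m)) for v in nodes}

    def cost(assignment):
        return sum(dist(v, m) for v, m in assignment.items())

    for _ in range(max_iter):
        assignment = assign(medoids)
        min_cost, best = cost(assignment), None
        for m in medoids:
            for v in (u for u, c in assignment.items() if c == m and u != m):
                trial = [v if x == m else x for x in medoids]   # shift medoid to v
                trial_cost = cost(assign(trial))                # re-assign everything
                if trial_cost < min_cost:
                    min_cost, best = trial_cost, trial          # remember the medoids
        if best is None:                                        # medoids are stable
            break
        medoids = best                                          # perform the best swap
    return medoids, assign(medoids)

G = nx.karate_club_graph()
def jaccard_dist(u, v):
    a, b = set(G[u]) | {u}, set(G[v]) | {v}   # include the node itself in its list
    return 1 - len(a & b) / len(a | b)

medoids, clusters = pam(list(G), jaccard_dist, k=2)
```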
k-medoids: new heuristics
1. While true:
1.1 For each medoid m:
1.1.1 Randomly select s points within the cluster centered at m
1.1.2 For each node v of the s:
1.1.2.1 Shift the medoid to v
1.1.2.2 Re-assign all nodes to the new medoids
1.1.2.3 cost = loss function for the whole graph
1.1.2.4 if cost < minCost:
1.1.2.4.1 Remember the medoids
1.1.2.4.2 minCost = cost
1.1.2.5 Put the medoid back (to m)
1.1.3 If the best of the s swaps improves the loss function:
1.1.3.1 Perform the swap
1.1.3.2 StableSequence = 0
1.1.4 Otherwise:
1.1.4.1 StableSequence += 1
1.1.4.2 If StableSequence > threshold:
1.1.4.2.1 Return the current configuration
k-medoids: clara
• Bagging for graph clustering
• Select and cluster a random subsample
• The remaining nodes are simply attached to the nearest medoids at the very end
• Run several times and select the best variant
• Speed-up only when complexity is super-linear in n
• Complexity of PAM: O(k · n² · iterations)
• Complexity of the new heuristic: O(k · n · s · iterations)
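A sketch of the CLARA wrapper around the pam() function from the previous slide's sketch; the sample sizes here are illustrative:

```python
import random

def clara(nodes, dist, k, n_runs=5, sample_size=40, seed=0):
    """Cluster random subsamples with PAM; keep the medoid set whose
    assignment of the FULL node set has the lowest cost."""
    rnd = random.Random(seed)
    nodes = list(nodes)
    best_cost, best = float("inf"), None
    for _ in range(n_runs):
        sample = rnd.sample(nodes, min(sample_size, len(nodes)))
        medoids, _ = pam(sample, dist, k, seed=rnd.randrange(10**9))
        # attach every remaining node to its nearest medoid at the very end
        assignment = {v: min(medoids, key=lambda m: dist(v, m)) for v in nodes}
        c = sum(dist(v, m) for v, m in assignment.items())
        if c < best_cost:
            best_cost, best = c, (medoids, assignment)
    return best
```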
k-medoids for DMP
segments
• Build a graph of domains
• Data: a sample of users and, for each, the set of visited domains
• Let U_x be the set of all users who visited domain x
• Weight of the edge between domains x and y:

  affinity(x, y) = (|U_x ∩ U_y| · |U|) / (|U_x| · |U_y|) = p̂(x, y) / (p̂(x) · p̂(y))

• Noise filtering (see the sketch below):
1. Threshold for nodes (domains): at least 15 visits
2. Threshold for edges: affinity of at least 20
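A sketch of building the weighted domain graph from raw (user, domain) visit pairs; note that the node threshold below counts distinct visitors rather than raw visits, a simplifying assumption:

```python
from collections import defaultdict
from itertools import combinations

def domain_graph(visits, min_users=15, min_affinity=20):
    """visits: iterable of (user, domain) pairs.
    Edge weight: affinity(x, y) = |Ux & Uy| * |U| / (|Ux| * |Uy|)."""
    users_of = defaultdict(set)
    all_users = set()
    for user, domain in visits:
        users_of[domain].add(user)
        all_users.add(user)
    domains = [d for d, us in users_of.items() if len(us) >= min_users]  # node threshold
    n_users = len(all_users)
    edges = {}
    for x, y in combinations(domains, 2):
        common = len(users_of[x] & users_of[y])
        if common:
            aff = common * n_users / (len(users_of[x]) * len(users_of[y]))
            if aff >= min_affinity:                                      # edge threshold
                edges[(x, y)] = aff
    return edges
```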
How much data? How many
clusters?
• 30,000 users yield around 1,200 domains and around 12 interpretable assemblages
• 500,000 users: 10,000 domains, around 30 assemblages
Interpreting the picture
• Node size: number of visits
• Node color: assemblage
• Edge color = assemblage color if internal,
otherwise grey
• Edge thickness: affinity
• Too complicated for networkX!
• Disadvantages of networkX visualization:
1. Inflexible
2. Non-interactive
3. Unstable
4. Slow
• Use graph-tool if possible!
News sites
Movies and TV series
Research papers, cartoons, and cars
Culinary
Kazakhstan
Books & Laws
Pre-processing: Local
sparsification
• Sparsify the graph while retaining the assemblage structure
• Algorithms get faster and the pictures get prettier
• Option 1 (global): sort all edges by Jaccard measure in descending order and remove the tail
• Minus: dense communities remain untouched while sparse ones get destroyed completely
• Option 2 (local): sort the neighbors of each node and retain the top ⌈d_i^e⌉ edges
• d_i - degree of node i, e between 0 and 1; at e = 0.5 sparsification is roughly tenfold
• The power law is retained, and connectedness is almost retained! (see the sketch below)
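A sketch of Option 2; the per-node budget is reconstructed here as ceil(d_i ** e), which is an assumption consistent with the tenfold reduction quoted for e = 0.5:

```python
import math
import networkx as nx

def local_sparsify(G, e=0.5):
    """Keep, for each node i, its top ceil(d_i ** e) edges ranked by
    Jaccard similarity; an edge survives if either endpoint keeps it."""
    def jac(u, v):
        a, b = set(G[u]), set(G[v])
        return len(a & b) / len(a | b)

    keep = set()
    for u in G:
        ranked = sorted(G[u], key=lambda v: jac(u, v), reverse=True)
        budget = max(1, math.ceil(len(ranked) ** e))
        keep.update(frozenset((u, v)) for v in ranked[:budget])

    H = nx.Graph()
    H.add_nodes_from(G)
    H.add_edges_from(tuple(edge) for edge in keep)
    return H
```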
Local sparsification:
demonstration
Stable cores:
• Randomized algorithms are unstable
• Different runs return different results
• Adding or removing 2% of the nodes may completely change the clustering picture
• Stable cores: run the algorithm 100 times and count the share of runs in which each pair of nodes lands in the same cluster (see the sketch below)
• The result is a hierarchical clustering
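A sketch of the co-clustering frequency matrix, using networkx's Louvain implementation (networkx >= 2.8) as the randomized algorithm; feeding 1 − freq into any hierarchical clustering routine then yields the dendrogram:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities

def co_clustering_frequency(G, runs=100):
    """freq[a, b] = share of runs in which nodes a and b share a cluster."""
    nodes = list(G)
    idx = {v: i for i, v in enumerate(nodes)}
    freq = np.zeros((len(nodes), len(nodes)))
    for seed in range(runs):
        for community in louvain_communities(G, seed=seed):
            members = [idx[v] for v in community]
            for a in members:
                freq[a, members] += 1
    return freq / runs
```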
Louvain
• Blondel et al, 2008
• The most celebrated of the modularity-based algorithms
• Multi-level assemblages
• Very fast
• 1. Initialize: every node on its own (n assemblages of 1 node each)
• 2. Iteratively merge the pairs of assemblages that yield the greatest modularity gain
• 3. When the gain ceases, collapse each assemblage into a single node of a new graph
• 4. Repeat steps 2-3 until only 2 assemblages remain
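Usage is a one-liner in modern networkx (>= 2.8), which implements exactly this multi-level scheme:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.karate_club_graph()
parts = louvain_communities(G, seed=42)   # list of node sets, one per assemblage
print(len(parts), modularity(G, parts))
```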
Louvain: illustration
• Cell phone operator from Belgium
• 2.6 million customers
• 260 assemblages with over 100
customers, 36 with over 10,000
• 6 assemblage levels
• French and Dutch segments are
almost independent
MCL
• Markov Cluster Algorithm (van Dongen, 1997-2000)
• Normalize the columns of the adjacency matrix: M = AD⁻¹
• 'A share of money for each friend', or the transition probabilities of a random walk
• Repeat 3 steps in iterations:
• Expand: M ← M·M
• Inflate: raise each entry to the power r and renormalize the columns (the number of clusters grows with r)
• Prune: zero out the smallest elements in each column
• Repeat until M converges
• Complexity: ~n·d² for the first iteration (subsequent ones are faster)
Excerpt:

The transition matrix M of the graph G is also referred to as the flow matrix of G: the i-th column contains the flows out of node i, and correspondingly the i-th row contains the in-flows. Note that while the columns (or out-flows) sum to 1, the rows (or in-flows) are not required to do so. The most common way of deriving a column-stochastic transition matrix for a graph is to simply normalize the columns of the adjacency matrix:

  M(i, j) = A(i, j) / Σ_{k=1}^{n} A(k, j)

In matrix notation, M := AD⁻¹, where D is the diagonal degree matrix with D(i, i) = Σ_{j=1}^{n} A(j, i). One can associate other stochastic matrices with the graph G. Both MCL and the methods introduced in Section 3.2 can be thought of as simulating stochastic flows (or simulating random walks) on graphs according to certain rules. For this reason, we refer to these processes as flow simulations.

3.1.2 Markov Clustering (MCL) Algorithm

The MCL algorithm, proposed by Stijn van Dongen [41], is an iterative process of applying two operators - expansion and inflation - on an initial stochastic matrix M, in alternation, until convergence. Both expansion and inflation are operators that map the space of column-stochastic matrices onto itself. Additionally, a prune step is performed at the end of each inflation step in order to save memory. Each of these steps is defined below:

Expand: Input M, output Mexp.

  Mexp = Expand(M) := M · M

The i-th column of Mexp can be interpreted as the final distribution of a random walk of length 2 starting from vertex v_i, with the transition probabilities of the random walk given by M. One can take higher powers of M instead of a square (corresponding to longer random walks), but this gets computationally prohibitive very quickly.

Inflate: Input M and inflation parameter r, output Minf.

  Minf(i, j) := M(i, j)^r / Σ_{k=1}^{n} M(k, j)^r

Minf corresponds to raising each entry in the matrix M to the power r and then normalizing the columns to sum to 1. By default r = 2. Because the entries in the matrix are all guaranteed to be less than or equal to 1, this operator has the effect of exaggerating the inhomogeneity in each column (as long as r > 1). In other words, flow is strengthened where it is already strong and weakened where it is weak.

Prune: In each column, we remove those entries which have very small values (where "small" is defined in relation to the rest of the entries in the column), and the retained entries are rescaled to sum to 1.
MCL: example

A simple example of the MCL process in action, for a 6-node graph: two triangles {1, 2, 3} and {4, 5, 6} joined by the single edge (3, 4). The initial stochastic matrix M0, obtained by adding self-loops to the graph and normalizing each column, is given below:

  M0 = [ 0.33 0.33 0.25 0    0    0
         0.33 0.33 0.25 0    0    0
         0.33 0.33 0.25 0.25 0    0
         0    0    0.25 0.25 0.33 0.33
         0    0    0    0.25 0.33 0.33
         0    0    0    0.25 0.33 0.33 ]

The result of applying one iteration of the Expansion, Inflation and Prune steps is given below:

  M1 = [ 0.33 0.33 0.2763 0      0    0
         0.33 0.33 0.2763 0      0    0
         0.33 0.33 0.4475 0      0    0
         0    0    0      0.4475 0.33 0.33
         0    0    0      0.2763 0.33 0.33
         0    0    0      0.2763 0.33 0.33 ]

Note that the flow along the lone inter-cluster edge (M0(4, 3)) has evaporated to 0. Applying one more iteration results in convergence:

  M2 = [ 0 0 0 0 0 0
         0 0 0 0 0 0
         1 1 1 0 0 0
         0 0 0 1 1 1
         0 0 0 0 0 0
         0 0 0 0 0 0 ]

Hence, vertices 1, 2 and 3 flow completely to vertex 3, whereas vertices 4, 5 and 6 flow completely to vertex 4. We therefore group 1, 2 and 3 together, with 3 as the "attractor" of the cluster, and similarly for 4, 5 and 6.
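A compact numpy sketch of the expand/inflate/prune loop described above; the attractor read-out at the end is a simplification of van Dongen's cluster interpretation, and the prune threshold is an illustrative choice:

```python
import numpy as np

def mcl(A, r=2.0, prune=1e-4, max_iter=100):
    """Markov Cluster Algorithm on an adjacency matrix A."""
    M = A + np.eye(len(A))                  # add self-loops, as in the example
    M = M / M.sum(axis=0)                   # column-stochastic flow matrix
    for _ in range(max_iter):
        M_new = M @ M                       # expand: random walks of length 2
        M_new = M_new ** r                  # inflate: exaggerate strong flow
        M_new[M_new < prune] = 0.0          # prune tiny entries
        M_new = M_new / M_new.sum(axis=0)   # renormalize columns
        if np.allclose(M, M_new, atol=1e-9):
            break
        M = M_new
    clusters = {}                           # group columns by the row they flow to
    for j in range(M.shape[1]):
        clusters.setdefault(int(M[:, j].argmax()), []).append(j)
    return list(clusters.values())

# The 6-node example above, 0-indexed: triangles {0,1,2} and {3,4,5}, bridge (2,3)
A = np.array([[0,1,1,0,0,0],
              [1,0,1,0,0,0],
              [1,1,0,1,0,0],
              [0,0,1,0,1,1],
              [0,0,0,1,0,1],
              [0,0,0,1,1,0]], float)
print(mcl(A))   # expected: [[0, 1, 2], [3, 4, 5]]
```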
MCL: illustration
MCL: problems and
solutions
• Problems:
• Too many clusters
• Decreasing the inflation parameter r gives fewer clusters, but slower convergence
• Unbalanced clusters: one giant plus a pile of 2-3-node ones
• The culprit is overfitting!
• Try to make the flow distributions of neighboring nodes similar
• Regularization (R-MCL): Mexp = M · M0
SCD: step 1
• Approximate maximization of WCC
• Count the triangles first
• Remove all edges that form no triangles
• Rough partitioning (Algorithm 1):
• 1. Sort the nodes by local clustering coefficient
• 2. First assemblage: first node + all neighbors
• 3. Second assemblage: first node of those
remaining (not visited yet) + all its neighbors
• 4. ...
• Complexity: O(n*d^2+n*log(n))
SCD: step 2
• Iteratively improve the result of Algorithm 1 until WCC ceases improving (Algorithm 2)
• Find the bestMovement for each node (MapReduce)
• bestMovement: add / remove / transfer
• Perform the bestMovement simultaneously for all nodes
• Complexity: O(1) per bestMovement, O(d+1) per node, O(m) for the whole graph
• The whole algorithm: O(m·log n)
SCD: experiment
Spinner
• Based on label propagation
• Implemented in Okapi (Mahout) by Telefonica
• Symmetrize the initial graph: weight w(u,v) = 1 if the edge existed in one direction, 2 if in both
• Regulate the balance: cap the number of edges allowed per partition, C = c·|E|/k (c may vary from 1 to 10-15)
• Load B(l) of a partition: the current number of edges in partition l
• Relative load: B(l)/C
• The number of clusters k is preset, as in k-medoids; labels l = 1, ..., k
• Which label should node v get? The one most common among the neighbors of v
• Correction for balance: penalize nearly full partitions (details in the excerpt below)
Excerpt (from the Spinner paper):

V is the set of vertices in the graph and E is the set of edges, so that an edge e ∈ E is a pair (u, v) with u, v ∈ V. We denote by N(v) = {u : u ∈ V, (u, v) ∈ E} the neighborhood of a vertex v, and by deg(v) = |N(v)| the degree of v. In a k-way partitioning, we define L as a set of labels L = {l1, ..., lk} that essentially correspond to the k partitions. α is the labeling function α : V → L such that α(v) = lj if label lj is assigned to vertex v. The end goal of Spinner is to assign partitions, or labels, to each vertex such that it maximizes edge locality and partitions are balanced.

3.1 K-way Label Propagation. We first describe how to use basic LPA to maximize edge locality and then extend the algorithm to achieve balanced partitions. Initially, each vertex in the graph is assigned a label li at random, with 0 < i ≤ k. Subsequently, every vertex iteratively propagates its label to its neighbors. During this iterative process, a vertex adopts the label that is more frequent among its neighbors. Every vertex assigns a different score to a particular label l, which is equal to the number of neighbors assigned to label l; a vertex shows preference to labels with a high score. More formally:

  score(v, l) = Σ_{u∈N(v)} δ(α(u), l)

where δ is the Kronecker delta. The vertex updates its label to the label lv that maximizes its score according to the update function, keeping the current label if it is among the maximizers; this improves convergence speed [6] and reduces unnecessary network communication (Section 4). The algorithm halts when no vertex updates its label.

The original formulation of LPA assumes undirected graphs; however, very often graphs are directed (e.g. the Web). Even though the programming models of systems like Pregel allow directed graphs, and there are algorithms aware of graph directness, like PageRank, to apply LPA as is we would need to convert a graph to undirected. A naive approach would be to create an undirected edge between vertices u and v whenever at least one directed edge exists between u and v in the directed graph. This approach, though, is agnostic to the communication patterns of the applications running on top: under it, any migration of a vertex to another partition that removes one cut edge is equally likely. Once we consider the directness of the edges in the original graph, not all migrations are equally beneficial; once the graph is loaded into the system and messages are sent across the edges, the decision that cuts fewer directed edges results in less communication.

Here, we consider the case of a homogeneous system, where each machine has equal resources. This setup is often preferred in synchronous graph processing systems like Pregel, to minimize the time spent by faster machines waiting at the synchronization barrier for stragglers. We define the capacity C of a partition as the maximum number of edges it can have so that partitions are balanced:

  C = c · |E| / k

Parameter c > 1 ensures additional capacity to each partition, available for migrations. We define the load of a partition as the actual number of edges in that partition:

  B(l) = Σ_{v∈G} deg(v) · δ(α(v), l)

A larger value of c increases the number of migrations to a partition allowed at each iteration, possibly speeding up convergence, but it may increase unbalance, as more edges are allowed to be assigned to each partition over the ideal value |E|/k. Note that the balance constraint is soft, only encouraging a similar number of edges across different partitions; as we will show, this decision allows a fully decentralized algorithm.

We introduce a penalty function to discourage assigning vertices to nearly full partitions. Given a partition indicated by label l, the penalty function p(l) is defined as follows:

  p(l) = B(l) / C        (7)

To integrate the penalty function we normalize (4) first, and reformulate the score function as follows:

  score''(v, l) = [ Σ_{u∈N(v)} w(u, v) · δ(α(u), l) / Σ_{u∈N(v)} w(u, v) ] − p(l)        (8)
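A toy, single-machine sketch of this penalized label propagation (the real Spinner runs decentralized on Pregel/Giraph); the capacity below uses 2|E|/k because B(l) as defined counts edge endpoints, an adjustment made for this sketch:

```python
import random
import networkx as nx
from collections import Counter

def spinner_like(G, k=4, c=1.1, max_iter=100, seed=0):
    """score''(v, l) = (share of v's neighbors with label l) - B(l)/C."""
    rnd = random.Random(seed)
    labels = {v: rnd.randrange(k) for v in G}
    capacity = c * 2 * G.number_of_edges() / k     # C, counted per edge endpoint
    load = Counter()                               # B(l) = sum of deg(v) with label l
    for v in G:
        load[labels[v]] += G.degree(v)
    for _ in range(max_iter):
        changed = 0
        for v in G:
            counts = Counter(labels[u] for u in G[v])
            deg = max(G.degree(v), 1)
            best = max(range(k),
                       key=lambda l: counts[l] / deg - load[l] / capacity)
            if best != labels[v]:
                load[labels[v]] -= G.degree(v)
                load[best] += G.degree(v)
                labels[v] = best
                changed += 1
        if changed == 0:                           # halt: no vertex updated its label
            break
    return labels

print(Counter(spinner_like(nx.karate_club_graph(), k=3).values()))
```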
Spinner: Scalability
• Computation in Pregel: a perfect fit for label propagation
• Easy to add and remove clusters (1 cluster per worker)
• Easy recalculation when nodes are added or removed
• Scalability of Spinner when clustering a random (Watts-Strogatz) graph
• Resource savings when new edges or new assemblages (workers) are added
(Figure 4: Partitioning of (a) the Twitter graph across 256 partitions and (b) the Yahoo! web graph across 115 partitions. The figure shows the evolution of metrics f, r, and score(G) across iterations.)

(Figure 5: Scalability of Spinner: (a) runtime as a function of the number of vertices, (b) runtime as a function of the number of workers, (c) runtime as a function of the number of partitions k. The experiments were executed on an AWS Hadoop cluster of 116 m2.4xlarge machines; the number of outgoing edges per vertex is fixed at 40, vertices are connected in a ring lattice topology, and 30% of the edges are re-wired randomly per the beta (0.3) parameter of the Watts-Strogatz model.)

(Figure 6: Adapting to dynamic graph changes: (a) cost savings, (b) partitioning stability.)

(Figure 7: Adapting to resource changes: (a) cost savings, (b) partitioning stability.)
Tools
• NetworkX: cliques, k-cores, blockmodels
• graph-tool: very fast blockmodels, visualization
• okapi (mahout): k-cores, Spinner
• GraphX (spark): nothing as yet
• Gephi: MCL, Girvan-Newman, Chinese Whispers
• micans.org: MCL by the creator
• mapequation.org: Infomap
• sites.google.com/site/findcommunities: Louvain from the creators (C++)
• pycluster (coming soon): k-medoids
Further reading
• Mining of Massive Datasets (ch.10 «Mining social networks
graphs», pp 343-402)
• Data Clustering Algorithms and Applications (ch 17 «Network
clustering», pp 415-443)
• SNA Course at Coursera: www.coursera.org/course/sna
• Great review of community finding: snap.stanford.edu/class/cs224w-readings/fortunato10community.pdf
• Articles with analysed methods and relevant cases (to be sent)
THANK YOU

More Related Content

What's hot

CS 354 Bezier Curves
CS 354 Bezier Curves CS 354 Bezier Curves
CS 354 Bezier Curves Mark Kilgard
 
Neural Collaborative Subspace Clustering
Neural Collaborative Subspace ClusteringNeural Collaborative Subspace Clustering
Neural Collaborative Subspace ClusteringTakahiro Hasegawa
 
Lecture 9-online
Lecture 9-onlineLecture 9-online
Lecture 9-onlinelifebreath
 
Perspective in Informatics 3 - Assignment 1 - Answer Sheet
Perspective in Informatics 3 - Assignment 1 - Answer SheetPerspective in Informatics 3 - Assignment 1 - Answer Sheet
Perspective in Informatics 3 - Assignment 1 - Answer SheetHoang Nguyen Phong
 
Dijkstra's Algorithm
Dijkstra's AlgorithmDijkstra's Algorithm
Dijkstra's AlgorithmArijitDhali
 
Perspective in Informatics 3 - Assignment 2 - Answer Sheet
Perspective in Informatics 3 - Assignment 2 - Answer SheetPerspective in Informatics 3 - Assignment 2 - Answer Sheet
Perspective in Informatics 3 - Assignment 2 - Answer SheetHoang Nguyen Phong
 
On the Convex Layers of a Planer Dynamic Set of Points
On the Convex Layers of a Planer Dynamic Set of PointsOn the Convex Layers of a Planer Dynamic Set of Points
On the Convex Layers of a Planer Dynamic Set of PointsKasun Ranga Wijeweera
 
Lecture 14 data structures and algorithms
Lecture 14 data structures and algorithmsLecture 14 data structures and algorithms
Lecture 14 data structures and algorithmsAakash deep Singhal
 
Lec02 03 rasterization
Lec02 03 rasterizationLec02 03 rasterization
Lec02 03 rasterizationMaaz Rizwan
 
Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)
Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)
Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)Mohanlal Sukhadia University (MLSU)
 
Graph Theory,Graph Terminologies,Planar Graph & Graph Colouring
Graph Theory,Graph Terminologies,Planar Graph & Graph ColouringGraph Theory,Graph Terminologies,Planar Graph & Graph Colouring
Graph Theory,Graph Terminologies,Planar Graph & Graph ColouringSaurabh Kaushik
 

What's hot (19)

Backtracking
BacktrackingBacktracking
Backtracking
 
CS 354 Bezier Curves
CS 354 Bezier Curves CS 354 Bezier Curves
CS 354 Bezier Curves
 
Dijkstra c
Dijkstra cDijkstra c
Dijkstra c
 
Neural Collaborative Subspace Clustering
Neural Collaborative Subspace ClusteringNeural Collaborative Subspace Clustering
Neural Collaborative Subspace Clustering
 
Lecture 9-online
Lecture 9-onlineLecture 9-online
Lecture 9-online
 
Perspective in Informatics 3 - Assignment 1 - Answer Sheet
Perspective in Informatics 3 - Assignment 1 - Answer SheetPerspective in Informatics 3 - Assignment 1 - Answer Sheet
Perspective in Informatics 3 - Assignment 1 - Answer Sheet
 
Graph Coloring using Peer-to-Peer Networks
Graph Coloring using Peer-to-Peer NetworksGraph Coloring using Peer-to-Peer Networks
Graph Coloring using Peer-to-Peer Networks
 
Dijkstra's Algorithm
Dijkstra's AlgorithmDijkstra's Algorithm
Dijkstra's Algorithm
 
Perspective in Informatics 3 - Assignment 2 - Answer Sheet
Perspective in Informatics 3 - Assignment 2 - Answer SheetPerspective in Informatics 3 - Assignment 2 - Answer Sheet
Perspective in Informatics 3 - Assignment 2 - Answer Sheet
 
testpang
testpangtestpang
testpang
 
On the Convex Layers of a Planer Dynamic Set of Points
On the Convex Layers of a Planer Dynamic Set of PointsOn the Convex Layers of a Planer Dynamic Set of Points
On the Convex Layers of a Planer Dynamic Set of Points
 
Lecture 14 data structures and algorithms
Lecture 14 data structures and algorithmsLecture 14 data structures and algorithms
Lecture 14 data structures and algorithms
 
Lec02 03 rasterization
Lec02 03 rasterizationLec02 03 rasterization
Lec02 03 rasterization
 
Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)
Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)
Shortest path (Dijkistra's Algorithm) & Spanning Tree (Prim's Algorithm)
 
Representation
RepresentationRepresentation
Representation
 
Neural Networks - How do they work?
Neural Networks - How do they work?Neural Networks - How do they work?
Neural Networks - How do they work?
 
Graph Theory,Graph Terminologies,Planar Graph & Graph Colouring
Graph Theory,Graph Terminologies,Planar Graph & Graph ColouringGraph Theory,Graph Terminologies,Planar Graph & Graph Colouring
Graph Theory,Graph Terminologies,Planar Graph & Graph Colouring
 
Face recognition using LDA
Face recognition using LDAFace recognition using LDA
Face recognition using LDA
 
Unit 3
Unit 3Unit 3
Unit 3
 

Similar to Clustering of graphs and search of assemblages

Ram minimum spanning tree
Ram   minimum spanning treeRam   minimum spanning tree
Ram minimum spanning treeRama Prasath A
 
CLIQUE Automatic subspace clustering of high dimensional data for data mining...
CLIQUE Automatic subspace clustering of high dimensional data for data mining...CLIQUE Automatic subspace clustering of high dimensional data for data mining...
CLIQUE Automatic subspace clustering of high dimensional data for data mining...Raed Aldahdooh
 
Randomness conductors
Randomness conductorsRandomness conductors
Randomness conductorswtyru1989
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionJinwon Lee
 
Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Grigory Yaroslavtsev
 
Analysis and design of a half hypercube interconnection network topology
Analysis and design of a half hypercube interconnection network topologyAnalysis and design of a half hypercube interconnection network topology
Analysis and design of a half hypercube interconnection network topologyAmir Masoud Sefidian
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012Ted Dunning
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraJason Riedy
 
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01Hemant Jha
 
Mrongraphs acm-sig-2 (1)
Mrongraphs acm-sig-2 (1)Mrongraphs acm-sig-2 (1)
Mrongraphs acm-sig-2 (1)Nima Sarshar
 
IJCAI13 Paper review: Large-scale spectral clustering on graphs
IJCAI13 Paper review: Large-scale spectral clustering on graphsIJCAI13 Paper review: Large-scale spectral clustering on graphs
IJCAI13 Paper review: Large-scale spectral clustering on graphsAkisato Kimura
 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfp_manimozhi
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
convolutional_neural_networks in deep learning
convolutional_neural_networks in deep learningconvolutional_neural_networks in deep learning
convolutional_neural_networks in deep learningssusere5ddd6
 

Similar to Clustering of graphs and search of assemblages (20)

Ram minimum spanning tree
Ram   minimum spanning treeRam   minimum spanning tree
Ram minimum spanning tree
 
09 placement
09 placement09 placement
09 placement
 
CLIQUE Automatic subspace clustering of high dimensional data for data mining...
CLIQUE Automatic subspace clustering of high dimensional data for data mining...CLIQUE Automatic subspace clustering of high dimensional data for data mining...
CLIQUE Automatic subspace clustering of high dimensional data for data mining...
 
Randomness conductors
Randomness conductorsRandomness conductors
Randomness conductors
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
 
Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)
 
Analysis and design of a half hypercube interconnection network topology
Analysis and design of a half hypercube interconnection network topologyAnalysis and design of a half hypercube interconnection network topology
Analysis and design of a half hypercube interconnection network topology
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear Algebra
 
Data Mining Lecture_7.pptx
Data Mining Lecture_7.pptxData Mining Lecture_7.pptx
Data Mining Lecture_7.pptx
 
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
 
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
 
Mrongraphs acm-sig-2 (1)
Mrongraphs acm-sig-2 (1)Mrongraphs acm-sig-2 (1)
Mrongraphs acm-sig-2 (1)
 
IJCAI13 Paper review: Large-scale spectral clustering on graphs
IJCAI13 Paper review: Large-scale spectral clustering on graphsIJCAI13 Paper review: Large-scale spectral clustering on graphs
IJCAI13 Paper review: Large-scale spectral clustering on graphs
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
1535 graph algorithms
1535 graph algorithms1535 graph algorithms
1535 graph algorithms
 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdf
 
Db Scan
Db ScanDb Scan
Db Scan
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
 
convolutional_neural_networks in deep learning
convolutional_neural_networks in deep learningconvolutional_neural_networks in deep learning
convolutional_neural_networks in deep learning
 

More from Data-Centric_Alliance

Оффлайн-данные в онлайн-рекламе
Оффлайн-данные в онлайн-рекламеОффлайн-данные в онлайн-рекламе
Оффлайн-данные в онлайн-рекламеData-Centric_Alliance
 
Mobile Programmatic Platform by Exebid.DCA
Mobile Programmatic Platform by Exebid.DCAMobile Programmatic Platform by Exebid.DCA
Mobile Programmatic Platform by Exebid.DCAData-Centric_Alliance
 
Full-Stack Programmatic Platform Exebid.DCA
Full-Stack Programmatic Platform Exebid.DCAFull-Stack Programmatic Platform Exebid.DCA
Full-Stack Programmatic Platform Exebid.DCAData-Centric_Alliance
 
Exebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформаExebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформаData-Centric_Alliance
 
Exebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформаExebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформаData-Centric_Alliance
 
Exebid.DCA, programmatic-платформа, которой можно доверять
Exebid.DCA, programmatic-платформа, которой можно доверятьExebid.DCA, programmatic-платформа, которой можно доверять
Exebid.DCA, programmatic-платформа, которой можно доверятьData-Centric_Alliance
 
Будущее медиа в эпоху больших данных: ничего личного
Будущее медиа в эпоху больших данных: ничего личногоБудущее медиа в эпоху больших данных: ничего личного
Будущее медиа в эпоху больших данных: ничего личногоData-Centric_Alliance
 
Лучшие рекламные кампании Exebid.DCA
Лучшие рекламные кампании Exebid.DCAЛучшие рекламные кампании Exebid.DCA
Лучшие рекламные кампании Exebid.DCAData-Centric_Alliance
 
Сравнение аудитории популярных автомобилей класса С
Сравнение аудитории популярных автомобилей класса ССравнение аудитории популярных автомобилей класса С
Сравнение аудитории популярных автомобилей класса СData-Centric_Alliance
 
Big Data is the new oil, или почему просто наличия (больших) данных недостато...
Big Data is the new oil, или почему просто наличия (больших) данных недостато...Big Data is the new oil, или почему просто наличия (больших) данных недостато...
Big Data is the new oil, или почему просто наличия (больших) данных недостато...Data-Centric_Alliance
 
Facetz.DCA. Платформа по управлению данными
Facetz.DCA. Платформа по управлению даннымиFacetz.DCA. Платформа по управлению данными
Facetz.DCA. Платформа по управлению даннымиData-Centric_Alliance
 

More from Data-Centric_Alliance (17)

Оффлайн-данные в онлайн-рекламе
Оффлайн-данные в онлайн-рекламеОффлайн-данные в онлайн-рекламе
Оффлайн-данные в онлайн-рекламе
 
Mobile Programmatic Platform by Exebid.DCA
Mobile Programmatic Platform by Exebid.DCAMobile Programmatic Platform by Exebid.DCA
Mobile Programmatic Platform by Exebid.DCA
 
Exebid.DCA Casebook
Exebid.DCA CasebookExebid.DCA Casebook
Exebid.DCA Casebook
 
Full-Stack Programmatic Platform Exebid.DCA
Full-Stack Programmatic Platform Exebid.DCAFull-Stack Programmatic Platform Exebid.DCA
Full-Stack Programmatic Platform Exebid.DCA
 
Mobile Programmatic Exebid.DCA
Mobile Programmatic Exebid.DCAMobile Programmatic Exebid.DCA
Mobile Programmatic Exebid.DCA
 
Exebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформаExebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформа
 
Exebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформаExebid.DCA. Programmatic-платформа
Exebid.DCA. Programmatic-платформа
 
Exebid.DCA, programmatic-платформа, которой можно доверять
Exebid.DCA, programmatic-платформа, которой можно доверятьExebid.DCA, programmatic-платформа, которой можно доверять
Exebid.DCA, programmatic-платформа, которой можно доверять
 
Будущее медиа в эпоху больших данных: ничего личного
Будущее медиа в эпоху больших данных: ничего личногоБудущее медиа в эпоху больших данных: ничего личного
Будущее медиа в эпоху больших данных: ничего личного
 
Лучшие рекламные кампании Exebid.DCA
Лучшие рекламные кампании Exebid.DCAЛучшие рекламные кампании Exebid.DCA
Лучшие рекламные кампании Exebid.DCA
 
Сравнение аудитории популярных автомобилей класса С
Сравнение аудитории популярных автомобилей класса ССравнение аудитории популярных автомобилей класса С
Сравнение аудитории популярных автомобилей класса С
 
Big Data is the new oil, или почему просто наличия (больших) данных недостато...
Big Data is the new oil, или почему просто наличия (больших) данных недостато...Big Data is the new oil, или почему просто наличия (больших) данных недостато...
Big Data is the new oil, или почему просто наличия (больших) данных недостато...
 
DCA (Data-Centric Alliance)
DCA (Data-Centric Alliance)DCA (Data-Centric Alliance)
DCA (Data-Centric Alliance)
 
DataLift.DA
DataLift.DADataLift.DA
DataLift.DA
 
2Click.DCA
2Click.DCA2Click.DCA
2Click.DCA
 
Samba.DCA
Samba.DCASamba.DCA
Samba.DCA
 
Facetz.DCA. Платформа по управлению данными
Facetz.DCA. Платформа по управлению даннымиFacetz.DCA. Платформа по управлению данными
Facetz.DCA. Платформа по управлению данными
 

Recently uploaded

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 

Recently uploaded (20)

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 

Clustering of graphs and search of assemblages

  • 1. Clustering of graphs and search of assemblages Kirill Rybachuk, DCA
  • 2. What is assemglage? • Assemblage is an intuitive concept and has no common definition • Global approach: 'such aggregation cannot be for no reason' • Modularity: comparison with random graph (almost Erdös-Renew) • Local approach: 'more connections inside, then outside in the vicinity' • In weak sense: internal degree > external degree • In strong sense: internal degree > external degree for each node • Completely local approach: aggregation of 'similar' nodes • Jaccard measure: j belong to the same cluster. Another way of looking at it is that an edge that is a part of many triangles is probably in a dense region i.e. a cluster. We use the Jaccard measure to quantify the overlap between adjacency lists. Let Adj(i) be the adjacency list of i, and Adj(j) be the adjacency list of j. For simplicity, we will refer to the similarity between Adj(i) and Adj(j) as the similarity between i and j itself. Sim(i, j) = |Adj(i) ∩ Adj(j)| |Adj(i) ∪ Adj(j)| (5.1) Global Sparsification
  • 3. Why is this important? 1. DMP Segments 2. Recommendations on goods 3. Centricity within assemblages: actual data flows 4. Comparison to nominal communities (dormitories, vk groups) 5. Compression & Visualization
  • 4. How does one find assemblages? • Graph partitioning: optimal partitioning into preset number of graphs of k. how can one find that k well? • Community finding: finding of particular aggregations k not controlled directly assemblages don't have to cover completely assemblages may overlay
  • 5. How does one assess success? • Objective functions • Optimization on graph completely: only approximate heuristics • Comparison of results from various algorithms • Selection of the most optimal k for graph partitioning • If ground truth is available — use standard metrics for classifying purposes • Often WTF is the best metric ever — see it with your own eyes
  • 7. Cliques • Just clique: complete sub-graph • Everybody knows everybody • Sometimes maximality required: no one can be added • Brohn-Kerbosch algorithm (1973): 
 n - number of nodes, d_max - maximum degree O(n · 3dmax/3)
  • 8. Disadvantages of cliques • Too austere assumption • Big cliques are almost absent • Small cliques are present even in Erdös-Renew graph • Disappearance of just one edge destroys the whole clique • No center and outskirts of the assemblage • Symmetry: no sense in centricity
  • 9. Generalizations • n-clique: maximum subgraph where distance between any two nodes is no longer than n. • at n=1 reduces to simple clique • n-clique may be even non-connected inside! • n-club: n-clique with diameter n. Always connected. • k-core: maximum subgraph where every node has at least k internal neighbors. • p-clique: each node's internal neighbor share comprises at least p (0 to 1)
  • 10. Algorithm for finding k-cores • Batagelj and Zaversnik algorithm (2003): O(m) where m stands for number of edges • Input: graph G(V,E) • Output: k value for each node
 
 
 
 
 
 
 

  • 12. Modularity • Assume — degree of node i, • m - number of edges, A - connectivity matrix • Stir all edges retaining distribution of degrees • Probability of i and j connected (roughly!): • Modularity: measure of 'non-randomness' of an assemblage:
 
 
 ki kikj 2m Q = 1 2m i,j Aij − kikj 2m I ci = cj
  • 13. Modularity properties • m_s: number of edges in assemblage s, d_s: total degree of nodes in s • maximum value: at S disconnected cliques Q = 1-1/S • Maximum for connected graph: at S equal subgraphs connected by the same edge • In that case, Q=1-1/S-S/m Q ≡ s ms m − ds 2m 2
  • 14. Modularity disadvantages • Resolution limit! • Assemblages with merge into one • If cliques on the picture have dimension of n_s they merge at • Cluster dimension shifting towards fitting ms < √ 2m ns(ns − 1) + 1 < S
  • 15. WCC • Assume S is an assemblage (its nodes) • V - all nodes • : number of triangles within S where node x is present • number of nodes of S forming at least one triangle with node x • Weighted community clustering for one node: • Average for the whole assemblage:
 
 
 
 WCC(S) = 1 |S| x∈S WCC(x, S)
  • 16. WCC • Product of two components • Triangle = 'clique' • To the left: какая доля компашек с участием x находится внутри его «домашнего» сообщества? • To the right in the numerator: how many people overall are involved in cliques with x? • To the right in the denominator: how many people would have benn involved in cliqes with x, had S been a clique?
  • 17. WCC
  • 19. Newman-Girvan • Hierarchic divisive algorithm • Sequentially remove edges of greates betweenness • Stop as the criterion is fulfilled (for instance, obtainment of k connected components) • O(n*m) for calculation of shortest routes • O(n+m) for re-calculation of connected components • The whole algorithm: at least O(n^3) • Not used with graphs exceeding several thousands of nodes
  • 20. k-medoids • k-means: normalized space required • Graph determines clearance between nodes only (for instance, 1 - Jaccard), therefore k- means not suitable • k-medoids: only available points act as centroids • k shall be predetermined (graph partitioning) • the most renowned variant is called PAM
  • 21. k-medoids: PAM 1. Expressedly set k - number of clusters 2. Initialize: select k of random nodes as medoids 3. For each point find the closest medoid thus forming initial clustering 4. minCost = initial configuration losses function 5. For each medoid m: 5.1 For each node v!=m within cluster centered in m: 5.1.1 Shift the medoid to v 5.1.2 Re-distribute all nodes between new medoids 5.1.3 cost=function of losses for the whole graph 5.1.4 if cost<minCost: 5.1.4.1 Remember the medoids 5.1.4.2 minCost=cost 5.1.5 Put the medoid back (to m) 6. Perfor the best substitution of all those found (i.e., change one medoid within one cluster) 7. Repeat items 4-5 until medoids are stable
  • 22. k-medoids: a new heuristic
  1. While true:
     1.1 For each medoid m:
         1.1.1 Randomly pick s points in the cluster centered at m
         1.1.2 For each node v among the s:
               1.1.2.1 Move the medoid to v
               1.1.2.2 Re-assign all nodes to the new medoids
               1.1.2.3 cost = loss function over the whole graph
               1.1.2.4 If cost < minCost:
                       1.1.2.4.1 Remember these medoids
                       1.1.2.4.2 minCost = cost
               1.1.2.5 Put the medoid back (to m)
         1.1.3 If the best of the s substitutions improves the loss function:
               1.1.3.1 Perform the substitution
               1.1.3.2 stableSequence = 0
         1.1.4 Otherwise:
               1.1.4.1 stableSequence += 1
               1.1.4.2 If stableSequence > threshold:
                       1.1.4.2.1 Return the current configuration
  • 23. k-medoids: CLARA • Bagging for graph clustering • Select and cluster a random subsample • The remaining nodes are simply attached to the nearest medoids at the very end • Run several times and keep the best variant • Speedup appears only for algorithms with complexity above O(n) • Complexity of PAM: O(k · n² · iterations) • Complexity of the new heuristic: O(k · n · s · iterations)
  • 24. k-medoids for DMP segments • Build a graph of domains • Data: a sample of users, and the set of domains each of them visited • Let $U_x$ be the set of users who visited domain x • Weight of the edge between domains x and y:

$$\text{affinity}(x, y) = \frac{|U_x \cap U_y| \, |U|}{|U_x| \, |U_y|} = \frac{\hat p(x, y)}{\hat p(x) \, \hat p(y)}$$

  • Noise filtering: 1. Threshold on nodes (domains): at least 15 visits 2. Threshold on edges: affinity of at least 20
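A sketch of building this domain graph from raw (user, domain) pairs; the function name is hypothetical, and reading the node threshold as 15 distinct visitors is an assumption:

```python
from collections import defaultdict
from itertools import combinations

def domain_graph(visits, min_visitors=15, min_affinity=20):
    """Weighted domain graph from (user, domain) pairs, with
    affinity(x, y) = |Ux & Uy| * |U| / (|Ux| * |Uy|)."""
    users_of = defaultdict(set)
    for user, domain in visits:
        users_of[domain].add(user)
    n = len({user for user, _ in visits})                  # |U|
    # node threshold: drop rarely visited domains
    users_of = {d: u for d, u in users_of.items() if len(u) >= min_visitors}
    edges = {}
    for x, y in combinations(users_of, 2):
        overlap = len(users_of[x] & users_of[y])
        if overlap:
            aff = overlap * n / (len(users_of[x]) * len(users_of[y]))
            if aff >= min_affinity:                        # edge threshold
                edges[(x, y)] = aff
    return edges
```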
  • 25. How much data? How many clusters? • 30,000 users yield around 1,200 domains and around 12 interpretable assemblages • 500,000 users: 10,000 domains, around 30 assemblages
  • 26. Interpreting the picture • Node size: number of visits • Node color: assemblage • Edge color = the assemblage color if internal, otherwise grey • Edge thickness: affinity • Too complicated for NetworkX! • Disadvantages of NetworkX visualization: 1. Inflexible 2. Non-interactive 3. Unstable 4. Slow • Use graph-tool if possible!
  • 28. Movies and TV series
  • 33. Pre-processing: local sparsification • Sparsify the graph while preserving the assemblage structure • Algorithms get faster, and pictures get prettier • Option 1 (global): sort all edges by Jaccard measure in descending order and drop the tail • Drawback: dense communities stay untouched, while sparse ones are destroyed completely • Option 2 (local): sort the neighbors of each node and retain the top $\lceil d_i^{\,e} \rceil$ edges • $d_i$ — degree of node i, e ranges from 0 to 1; at e = 0.5 sparsification is roughly tenfold • The power law is preserved, and connectivity is almost preserved!
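A sketch of Option 2, under the assumption that each node keeps its top ⌈d_i^e⌉ neighbors (at least one) ranked by Jaccard similarity:

```python
import math

def local_sparsify(adj, e=0.5):
    """Local sparsification sketch: adj maps node -> set of neighbours;
    returns the set of undirected edges that survive."""
    def jaccard(a, b):
        return len(adj[a] & adj[b]) / len(adj[a] | adj[b])

    kept = set()
    for i in adj:
        ranked = sorted(adj[i], key=lambda j: jaccard(i, j), reverse=True)
        keep = max(1, math.ceil(len(adj[i]) ** e))
        kept.update(frozenset((i, j)) for j in ranked[:keep])
    return kept
```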
  • 35. Stable cores • Randomized algorithms are unstable • Different runs return different results • Adding or removing 2% of nodes may completely change the clustering picture • Stable cores: run the algorithm 100 times and count the fraction of runs in which each pair of nodes lands in the same cluster • The result is a hierarchical clustering
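Stable cores reduce to a co-assignment matrix; a minimal sketch, assuming the runs are stored as equal-length label arrays:

```python
import numpy as np

def coassignment(runs):
    """Share of runs in which each pair of nodes ends up in the same
    cluster; thresholding this matrix yields the stable cores."""
    runs = [np.asarray(r) for r in runs]
    M = sum((r[:, None] == r[None, :]).astype(float) for r in runs)
    return M / len(runs)
```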
  • 36. Louvain • Blondel et al., 2008 • The best known of the modularity-based algorithms • Multi-level assemblages • Very fast • 1. Initialization: every node on its own (n assemblages of 1 node each) • 2. Iteratively merge the pairs of assemblages that yield the greatest modularity gain • 3. When the gain stops, collapse each assemblage into a single node of a new graph • 4. Repeat steps 2–3 until only 2 assemblages remain
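For a quick experiment, recent NetworkX releases bundle a Louvain implementation (the authors' original C++ code is linked on the Tools slide):

```python
import networkx as nx

G = nx.karate_club_graph()
parts = nx.community.louvain_communities(G, seed=42)
print(len(parts), round(nx.community.modularity(G, parts), 3))
```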
  • 37. Louvain: an illustration • A Belgian mobile operator • 2.6 million customers • 260 assemblages with over 100 customers, 36 with over 10,000 • 6 levels of the assemblage hierarchy • The French- and Dutch-speaking segments are almost independent
  • 38. MCL • Markov Cluster algorithm (van Dongen, 1997–2000) • Normalize the columns of the adjacency matrix: $M = AD^{-1}$ • 'A share of money for each friend', i.e. the transition probabilities of a random walk • Iterate three steps: • Expand: $M \leftarrow M \cdot M$ (a two-step random walk) • Inflate: $M_{ij} \leftarrow M_{ij}^{\,r} / \sum_k M_{kj}^{\,r}$ (the number of clusters grows with r; default r = 2) • Prune: zero out the smallest entries in each column, then renormalize • Repeat until M converges • Complexity: ~n·d² for the first iteration (subsequent ones are faster)
  • 39. MCL: an example • Two triangles {1,2,3} and {4,5,6} joined by the edge (3,4); adding self-loops and normalizing the columns gives the initial stochastic matrix:

$$M_0 = \begin{pmatrix} 0.33 & 0.33 & 0.25 & 0 & 0 & 0 \\ 0.33 & 0.33 & 0.25 & 0 & 0 & 0 \\ 0.33 & 0.33 & 0.25 & 0.25 & 0 & 0 \\ 0 & 0 & 0.25 & 0.25 & 0.33 & 0.33 \\ 0 & 0 & 0 & 0.25 & 0.33 & 0.33 \\ 0 & 0 & 0 & 0.25 & 0.33 & 0.33 \end{pmatrix}$$

  • After one iteration of Expand, Inflate and Prune:

$$M_1 = \begin{pmatrix} 0.33 & 0.33 & 0.2763 & 0 & 0 & 0 \\ 0.33 & 0.33 & 0.2763 & 0 & 0 & 0 \\ 0.33 & 0.33 & 0.4475 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.4475 & 0.33 & 0.33 \\ 0 & 0 & 0 & 0.2763 & 0.33 & 0.33 \\ 0 & 0 & 0 & 0.2763 & 0.33 & 0.33 \end{pmatrix}$$

  • The flow along the lone inter-cluster edge ($M_0(4,3)$) has evaporated to 0 • One more iteration converges:

$$M_2 = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$

  • Nodes 1, 2 and 3 flow entirely to node 3, and nodes 4–6 to node 4 • Each 'attractor' together with the nodes flowing into it forms one cluster
  • 41. MCL: problems and solutions • Problems: • Too many clusters • Lowering the inflation parameter r gives fewer clusters, but runs slower • Unbalanced clusters: one giant cluster and a pile of 2–3-node ones • The root cause is overfitting! • Remedy: make the flow distributions of neighboring nodes similar • Regularized MCL (R-MCL): $M_{exp} = M \cdot M_0$
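Putting slides 38 and 41 together, a bare-bones NumPy sketch of the expand/inflate/prune loop (the parameter defaults are illustrative):

```python
import numpy as np

def mcl(A, r=2.0, prune=1e-4, tol=1e-8, max_iter=100):
    """Bare-bones MCL: iterate expand / inflate / prune on a
    column-stochastic flow matrix. A: symmetric 0/1 adjacency matrix."""
    A = np.asarray(A, dtype=float) + np.eye(len(A))   # add self-loops
    M = A / A.sum(axis=0, keepdims=True)              # normalize columns
    for _ in range(max_iter):
        prev = M
        M = M @ M                                     # Expand: two-step walk
        M = M ** r                                    # Inflate: sharpen columns
        M[M < prune] = 0.0                            # Prune: drop tiny flows
        M = M / M.sum(axis=0, keepdims=True)          # renormalize
        if np.abs(M - prev).max() < tol:
            break
    # Non-zero rows mark "attractors"; each attractor plus the columns
    # flowing into it is one cluster. R-MCL would use M @ M0 in the
    # Expand step instead of M @ M.
    return M
```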
  • 42. SCD: step 1 • Approximate maximization of WCC • First, count the triangles • Remove all edges that belong to no triangle • Rough partitioning (algorithm 1; see the sketch after this list): • 1. Sort the nodes by local clustering coefficient • 2. First assemblage: the first node plus all its neighbors • 3. Second assemblage: the first of the remaining (not yet visited) nodes plus all its neighbors • 4. ... • Complexity: O(n·d² + n·log(n))
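A sketch of the rough partitioning pass, assuming the 'local cluster factor' is the usual local clustering coefficient and that already-assigned neighbors stay where they are:

```python
import networkx as nx

def rough_partition(G):
    """Algorithm 1, roughly: scan nodes by decreasing local clustering
    coefficient; each unvisited node grabs its unvisited neighbours
    as a new assemblage."""
    cc = nx.clustering(G)                   # local clustering coefficient
    visited, parts = set(), []
    for v in sorted(G, key=cc.get, reverse=True):
        if v in visited:
            continue
        part = {v} | (set(G[v]) - visited)
        visited |= part
        parts.append(part)
    return parts
```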
  • 43. SCD: step 2 • Iteratively refine the result of algorithm 1 until WCC stops improving (algorithm 2) • Find the bestMovement for each node (MapReduce) • bestMovement: insert / remove / transfer • Apply the bestMovements of all nodes simultaneously • Complexity: O(1) per bestMovement, O(d+1) per node, O(m) for the whole graph • The whole algorithm: O(m·log(n))
  • 45. Spinner • Based on label propagation • Implemented in Okapi (Mahout) by Telefonica • Symmetrize the initial directed graph D: weight w(u,v) = 1 if the edge existed in one direction, 2 if in both • Regulate the balance: cap the number of edges allowed in a partition (c typically ranges from 1 to 10–15):

$$C = c \cdot \frac{|E|}{k}$$

  • Partition load B(l): the current number of edge endpoints labeled l:

$$B(l) = \sum_{v \in G} \deg(v)\,\delta(\alpha(v), l)$$

  • Relative load (penalty): $p(l) = B(l)/C$ • The number of partitions k is preset, as in k-medoids; labels l = l_1, ..., l_k • Which label should node v get? The one most frequent among its neighbors:

$$score(v, l) = \sum_{u \in N(v)} w(u,v)\,\delta(\alpha(u), l)$$

  • Correction for balance — normalize and subtract the penalty:

$$score''(v, l) = \frac{\sum_{u \in N(v)} w(u,v)\,\delta(\alpha(u), l)}{\sum_{u \in N(v)} w(u,v)} - p(l)$$
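A toy version of one scoring step, assuming `adj_w` maps each vertex to a dict of weighted neighbors; all names here are hypothetical:

```python
from collections import defaultdict

def best_label(v, adj_w, label, load, capacity):
    """One Spinner-style update: weighted label frequency among v's
    neighbours, normalized, minus the load penalty p(l) = B(l) / C."""
    scores = defaultdict(float)
    total_w = sum(adj_w[v].values())
    for u, w in adj_w[v].items():
        scores[label[u]] += w
    return max(scores, key=lambda l: scores[l] / total_w - load[l] / capacity)
```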
  • 46. Spinner: scalability • Computation in Pregel — a perfect fit for label propagation • Easy to add and remove partitions (1 partition per worker) • Easy re-computation when nodes are added or removed • Scalability of Spinner when clustering a random (Watts–Strogatz) graph • Resource savings when adding new edges or new partitions (workers)
[Figures from the Spinner paper: Figure 4 — partitioning of the Twitter graph across 256 partitions and of the Yahoo! web graph across 115 partitions, showing the evolution of the metrics f, r and score(G) across iterations; Figure 5 — scalability: runtime vs. graph size, vs. cluster size, and vs. the number of partitions k, measured on an AWS Hadoop cluster of 116 m2.4xlarge machines with Watts–Strogatz graphs (40 edges per vertex, 30% rewired); Figures 6 and 7 — cost savings and partitioning stability when adapting to dynamic graph changes and to resource changes.]
  • 47. Tools • NetworkX: cliques, k-cores, blockmodels • graph-tool: very fast blockmodels, visualization • Okapi (Mahout): k-cores, Spinner • GraphX (Spark): nothing as yet • Gephi: MCL, Girvan–Newman, Chinese Whispers • micans.org: MCL from its creator • mapequation.org: Infomap • sites.google.com/site/findcommunities: Louvain from its creators (C++) • pycluster (coming soon): k-medoids
  • 48. Further reading • Mining of Massive Datasets (ch. 10, «Mining Social-Network Graphs», pp. 343–402) • Data Clustering: Algorithms and Applications (ch. 17, «Network Clustering», pp. 415–443) • SNA course at Coursera: www.coursera.org/course/sna • A great survey of community detection: snap.stanford.edu/class/cs224w-readings/fortunato10community.pdf • Papers on the methods covered, with relevant case studies (to be sent)