Mining Communities in Networks: A Solution for Consistency and Its Evaluation
Haewoon Kwak Yoonchan Choi* Young-Ho Eom
Hawoong Jeong Sue Moon
KAIST, Korea
*Samsung Advanced Institute of Technology, Korea
Outline
• Introduction to Community Identification
• Inconsistency problem in CI
• Metrics for the inconsistency in CI
• Empirical solution to remove inconsistency
• The case study of AS network
Definitions of community
• “Subsets of nodes characterized by having
more internal connections than external
connections between them”
• “Set of web pages dealing with similar
topics”
• “Functional units”
Community in ...
• Sociology
• Biology
• Epidemiology
• Information theory
• Social network analysis
Community identification
• Graph partitioning based on betweenness
• Clique-based approach
• Link-pattern based approach
• Random walks on network
How do we know whether
‘communities are well identified?’
Quantitative metric
• Modularity, Q [15]: $Q = \sum_i (e_{ii} - a_i^2)$
‣ $e_{ii}$: fraction of all links that connect nodes within community i
‣ $a_i$: fraction of ends of edges that are attached to vertices in community i
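A minimal sketch of how Q could be computed from the two quantities above, assuming an undirected, unweighted edge list. The function name `modularity` and the toy graph of two triangles joined by a bridge are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)  # edges with both ends inside community i
    ends = defaultdict(int)      # edge ends attached to community i
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # e_ii = internal / m, a_i = ends / (2m)
    return sum(internal[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)

# Toy graph (hypothetical): two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_block = {n: "a" for n in range(6)}
q_split = modularity(edges, two_triangles)
q_single = modularity(edges, one_block)
```

Splitting the toy graph into its two triangles scores higher than keeping all nodes in one community, which is what the metric is meant to reward.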
Datasets
• 12 networks: 34 to 11M nodes
‣ Online social network
‣ Biological network
‣ Internet AS network
‣ Wikipedia link network
‣ WWW network
Overview of datasets
[Figure: the 12 networks plotted by # of nodes (log) against # of edges (log): Karate, C. Elegans, Protein, AS Graph, BBS, WWW, Facebook, YouTube, Wikipedia, Flickr, Cyworld, Orkut]
Measurement methodology
• Which of several pairs with the max ∆Q is chosen depends on the input order of nodes
• For each network,
‣ Generate N input sets, each with a different node order
‣ Find communities in each of the N sets
‣ Compare the identified communities
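The three steps above can be sketched as follows. This is an illustration only: `toy_detect` is a hypothetical order-dependent stand-in, not CNM, Wakita or Louvain, and `run_trials` and `as_groups` are assumed helper names.

```python
import random

def toy_detect(order, edges):
    """Hypothetical stand-in detector: each node, visited in input order,
    joins the lowest-labelled community among its already-placed
    neighbours, or starts a new community if none is placed yet."""
    neighbours = {n: set() for n in order}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    community_of, next_label = {}, 0
    for n in order:
        placed = [community_of[x] for x in neighbours[n] if x in community_of]
        if placed:
            community_of[n] = min(placed)
        else:
            community_of[n] = next_label
            next_label += 1
    return community_of

def run_trials(nodes, edges, detect, n_runs, seed=0):
    """Run `detect` on n_runs randomly ordered copies of the node list."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(n_runs):
        order = list(nodes)
        rng.shuffle(order)          # a different input order per run
        partitions.append(detect(order, edges))
    return partitions

def as_groups(community_of):
    """Label-independent form of a partition, so runs can be compared."""
    groups = {}
    for node, label in community_of.items():
        groups.setdefault(label, set()).add(node)
    return frozenset(frozenset(g) for g in groups.values())

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partitions = run_trials(range(6), edges, toy_detect, n_runs=20)
distinct = {as_groups(p) for p in partitions}  # more than one element means inconsistency
```

Passing the detector as a parameter keeps the measurement loop independent of the algorithm under test, so the same loop could wrap any of the three algorithms.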
We learned...
• Louvain algorithm produces the highest Q
• CNM shows the smallest variance
• Only Louvain works in a huge network
[Figure panels: (a) Karate club (b) C.Elegans (c) Protein Interaction (d) BBS (e) AS graph (f) Facebook]
Pairwise membership prob.
• Over N runs of an algorithm, each over a randomly ordered input set, we quantify the likelihood of a pair of nodes resulting in the same community as:
$$p_{ij} = \frac{1}{N} \sum_{k=1}^{N} \delta_k(c_i, c_j)$$
where
$$\delta_k(c_i, c_j) = \begin{cases} 1 & \text{if } c_i = c_j \text{ in the } k\text{th dataset} \\ 0 & \text{otherwise} \end{cases}$$
and i and j are nodes and $c_i$ and $c_j$ represent communities that i and j belong to, respectively. We call this metric pairwise membership probability.
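A small sketch of the pairwise membership probability, computed only between neighbors (that is, per edge). The function name `pairwise_membership_prob` and the four hand-written partitions are assumptions made for illustration.

```python
def pairwise_membership_prob(edges, partitions):
    """Fraction of runs in which both endpoints of each edge share a community."""
    n_runs = len(partitions)
    pmp = {}
    for i, j in edges:
        same = sum(1 for community_of in partitions
                   if community_of[i] == community_of[j])
        pmp[(i, j)] = same / n_runs
    return pmp

# Hypothetical outcome of N = 4 runs on a 4-node path graph.
edges = [(0, 1), (1, 2), (2, 3)]
partitions = [
    {0: "a", 1: "a", 2: "b", 3: "b"},
    {0: "a", 1: "a", 2: "a", 3: "b"},
    {0: "a", 1: "a", 2: "b", 3: "b"},
    {0: "a", 1: "a", 2: "a", 3: "b"},
]
pmp = pairwise_membership_prob(edges, partitions)
```

Here nodes 0 and 1 always end up together, while node 2 switches sides from run to run, so the two edges touching it get a probability strictly between 0 and 1.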
Distribution of p.m.p.
• There are many edges whose pairwise membership prob. is not 0 or 1
[Figure panels: (a) Karate club (b) C.Elegans (c) Protein Interaction (d) BBS (e) AS graph (f) Facebook]
Consistency, C
• To quantify network-wide consistency, we define a metric of consistency, C, for the entire network
‣ First term: weighs the p.m.p. away from 0.5
‣ Second term: normalization
[Equation (4): definition of C]
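The exact form of C is not legible on this slide, so the sketch below rests on an assumption: C is taken as the mean over edges of (p - 0.5)^2, divided by 0.25 so that the result lies in [0, 1]. This matches the two annotations (weighing away from 0.5, then normalizing) but may differ from the definition used in the paper. The function name `consistency` is also an assumption.

```python
def consistency(pmp):
    """Assumed form of C: mean of (p - 0.5)^2 over edges, normalized by 0.25.

    C is 1 when every edge has p.m.p. of exactly 0 or 1,
    and 0 when every edge has p.m.p. of 0.5.
    """
    if not pmp:
        raise ValueError("need at least one edge")
    return sum((p - 0.5) ** 2 for p in pmp.values()) / (0.25 * len(pmp))

fully_consistent = consistency({("a", "b"): 1.0, ("b", "c"): 0.0})
fully_undecided = consistency({("a", "b"): 0.5})
mixed = consistency({("a", "b"): 1.0, ("b", "c"): 0.5})
```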
C in 12 networks
• No one algorithm outperforms the other two
[Figure 6: Consistency (no data available by CNM and Wakita for Wikipedia and Cyworld)]
Intuitions behind our approach
• Every edge has pairwise membership prob.
• High pairwise membership probability
indicates that two nodes are likely to be in
the same community
• In a weighted network, all 3 algorithms place
edges of high weight within a community
Reinforcing p.m.p.
• After a cycle of N runs,
‣ Calculate pairwise membership prob.
‣ Assign p.m.p. as edge weight
• Return to another cycle of N runs
• Continue until C gains no improvement
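The reinforcement cycle above might look like the following sketch. Several things here are assumptions: the detector is passed in as a callable that accepts a node order and a dict of edge weights, `heaviest_neighbour_detect` is a hypothetical stand-in for a weighted community algorithm (not CNM, Wakita or Louvain), and C uses the assumed form "mean of (p - 0.5)^2 over edges, normalized by 0.25".

```python
import random

def pairwise_membership_prob(edges, partitions):
    return {(i, j): sum(p[i] == p[j] for p in partitions) / len(partitions)
            for i, j in edges}

def consistency(pmp):
    # Assumed form of C, see the lead-in.
    return sum((p - 0.5) ** 2 for p in pmp.values()) / (0.25 * len(pmp))

def reinforce(nodes, edges, detect, n_runs=100, max_cycles=10, seed=0, tol=1e-9):
    rng = random.Random(seed)
    weights = {e: 1.0 for e in edges}   # first cycle: all edges weigh the same
    history = []
    for cycle in range(max_cycles):
        partitions = []
        for run in range(n_runs):
            order = list(nodes)
            rng.shuffle(order)
            partitions.append(detect(order, weights))
        # p.m.p. of this cycle becomes the edge weight of the next cycle.
        weights = pairwise_membership_prob(edges, partitions)
        c = consistency(weights)
        improved = not history or c - history[-1] > tol
        history.append(c)
        if not improved:                # C gains no improvement: stop
            break
    return weights, history

def heaviest_neighbour_detect(order, weights):
    """Hypothetical stand-in: a node joins the community of the placed
    neighbour it shares the heaviest positive-weight edge with."""
    adj = {n: [] for n in order}
    for (u, v), w in weights.items():
        adj[u].append((w, v))
        adj[v].append((w, u))
    community_of, next_label = {}, 0
    for n in order:
        placed = [(w, community_of[x]) for w, x in adj[n]
                  if x in community_of and w > 0]
        if placed:
            community_of[n] = max(placed)[1]
        else:
            community_of[n] = next_label
            next_label += 1
    return community_of

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
weights, history = reinforce(range(6), edges, heaviest_neighbour_detect,
                             n_runs=20, max_cycles=5)
```

`history` records C after each cycle, so it can be plotted to check whether C levels off.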
Convergence of C
• Except Orkut & Cyworld, C converges to 1 within 5 cycles
[Figure 8: Convergence of consistency]
Agreement btwn. trials
• Communities of independent trials are almost identical
[Figure 10: Comparison of community size distribution in 4 trials]
For non-converging case
• Is N = 100 not enough?
• Resolution limit in community detection?
We are building an analytical framework
to explain inconsistency problems
Summary
• We identify inconsistency
in community identification
• We define new metrics
for measuring inconsistency
• We propose empirical solutions
reinforcing pairwise membership probability
• We present preliminary analysis
of communities in AS graph
Hi, I’m Haewoon Kwak, a Ph.D. student at KAIST, Korea.
Today I’m going to talk about the inconsistency problem in community identification and its empirical solution. This work is a collaboration with ...
There are many methods to find communities in a network.
If a network becomes more complex, we are not sure which partitioning is better.
The modularity, Q, is a quality measure of partitioned communities.
For each community i, we calculate the difference between the fraction of links that fall within the community and the square of the fraction of edge ends attached to its nodes, and then sum over all communities. The value of modularity lies between -1 and 1, and values approaching the upper bound of 1 indicate strong community structure.
Finding the partition with the highest modularity is an NP-hard problem,
so approximation algorithms are used.
The CNM algorithm begins with each node as a separate community in a network.
Then the algorithm finds the pair of communities with the global maximum ΔQ, that is, the pair whose merge yields the maximum gain of modularity. In this example there are two pairs with the maximum ΔQ. The algorithm chooses one of them according to its implementation.
It then merges the chosen pair and updates the ΔQ values that correspond to any neighboring community of the newly merged community.
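To make the tie-breaking step concrete, here is a naive sketch of the greedy merge rule. It is not the CNM implementation: it recomputes every ΔQ from scratch at each step instead of updating only the neighbors of the merged community, and the random choice among tied pairs stands in for whatever an implementation happens to do. The name `greedy_merge` and the toy graph are assumptions, and node ids are assumed to be comparable (e.g. integers).

```python
import random

def greedy_merge(nodes, edges, seed=0, tol=1e-12):
    """Repeatedly merge the community pair with the largest positive Delta Q."""
    rng = random.Random(seed)
    m = len(edges)
    owner = {n: n for n in nodes}      # node -> community id
    members = {n: {n} for n in nodes}  # community id -> its nodes
    degree = {n: 0 for n in nodes}     # community id -> sum of node degrees
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    while True:
        # Count the edges running between each pair of distinct communities.
        between = {}
        for u, v in edges:
            a, b = owner[u], owner[v]
            if a != b:
                key = (min(a, b), max(a, b))
                between[key] = between.get(key, 0) + 1
        # Delta Q of merging a and b: E_ab / m - d_a * d_b / (2 m^2).
        gains = {k: e / m - degree[k[0]] * degree[k[1]] / (2 * m * m)
                 for k, e in between.items()}
        if not gains:
            break
        best = max(gains.values())
        if best <= tol:
            break                      # no merge improves Q any further
        ties = sorted(k for k, g in gains.items() if best - g <= tol)
        a, b = rng.choice(ties)        # the implementation-dependent choice
        for n in members[b]:
            owner[n] = a
        members[a] |= members.pop(b)
        degree[a] += degree.pop(b)
    return owner

# Toy graph (hypothetical): two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
owner = greedy_merge(range(6), edges)
```

On this small graph the very first step already has two tied pairs, (0, 1) and (4, 5), yet either choice leads to the same two triangles. On larger networks the choice can change the final communities, which is the inconsistency discussed next.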
In the first phase, the Louvain algorithm starts with single-node communities, like CNM and Wakita. Each node is moved to the adjacent community that maximizes ΔQ. If ΔQ is negative, the node stays in its original community.
In the second phase, the algorithm rebuilds the network with communities as nodes and the sum of weights between nodes as link weights, and returns to the first phase.
So far we have presented the process of the three algorithms.
Some of you may feel unclear about one part of the algorithms: choosing one of the maximum ΔQ.
Now we move on to the problem of inconsistency.
In this figure, there is a great difference between the two partitionings.
On the left, 7 communities are identified, and on the right only three communities are identified. This network has only 34 nodes. Thus, if a network becomes larger, we can predict that the problem becomes more serious. In the next section, we quantitatively show the significance of the inconsistency problem.
Now we move on to measuring inconsistency.
The AS graph is from the work by Oliveira, "Quantifying the Completeness of the Observed Internet AS-level Structure".
Social network data except Cyworld are from the work by Alan Mislove and Meeyoung Cha.
First, we compare the distribution of modularity.
Remember that modularity is the quality measure of a partitioning.
From the distribution of modularity, we see how much the partitionings differ.
The pairwise membership probability represents the empirical probability that two nodes belong to the same community across multiple runs of the same algorithm.
If two nodes are always in the same community, the value becomes 1;
if two nodes are always in different communities, the value is 0.
We consider pairwise membership probability only between neighbors.
The larger the proportion of 0 or 1 is,
the more consistent the communities are
No one algorithm outperforms the other two in all networks, and
there is no consistent correlation between the consistency and the topological characteristics of a network, such as average degree, link density, and average clustering coefficient.
We discuss plausible reasons later.
This plots the community size distributions from independent trials.
All 4 plots almost completely overlap and are very close to each other.
Our choice of N = 100 is to make sure that we break ties in choosing max ΔQ, but 100 might not be large enough to break all possible ties in Cyworld and Orkut.
Fortunato and Barthélemy report that communities below a certain size may not be resolved and are grouped into a larger loose community. The resolution limit is the threshold community size, and is a function of the total number of links, not nodes.
So far we have seen how to identify communities in a consistent manner. Now we need to check whether the identified communities are meaningful.
Here we apply consistent community identification to the AS graph.
Out of 48 communities, we found interesting communities: the largest community, a geographically concentrated community, and a star-shaped community.
The layers of strongly connected tier-1 ASes at the core and other tier-1 ASes remind us of the Internet Jellyfish model. We leave in-depth mapping of our communities to the jellyfish model for future work.
Next, we draw a geographically concentrated community. For manual inspection, we choose the community with the top Korean ISPs. This community has 658 ASes, and 97.4% of the ASes are in Korea. HK! The interesting point is that physical constraints such as transatlantic and Pacific lines somehow manifest through the grouping.
We found a star-shaped community. All leaf ASes connect only to the hub AS and no other. They are single-homed stub ASes. One notable observation is that in this community of a star topology there is no peer-peer relation.
This is preliminary.
We found many interesting things for future work.
We color the 100 by 100 grids according to the number of links with the corresponding pairwise membership probabilities in two consecutive cycles.