Discovering Overlapping Community Structure in Networks through Co-clustering
1. Scalable Multiple Clustering of High-dimensional Data
Submitted by: Sahil Kakkar, M. Tech. Candidate, Reg. No. 14011018
Under the Supervision of: Mrs. Sunita Beniwal, Assistant Professor, Department of CSE
Dissertation, GJUS&T, Hisar-125001, Haryana, India, 17th September, 2016
2. Contents
• Introduction
• Equivalence of Multiple and Overlapping Clustering
• Problem Description
• Scalability Issues
• Motivation for Community Detection
• Problem Formulation
• CDCC Algorithm
• Simulation Results
3. Introduction
• A clustered view of the dataset defines a “partition”. It is also called a
“clustering” in the literature.
• The idea of Multiple Clustering is based on the notion that a dataset can
admit more than one such partition. These multiple clusterings provide
multiple views of the dataset.
[Figure: A Single Partition]
4. Partition based on algorithm-types Partition based on applications
[Figure: two partitions of the same set of topics.
Labels, first group: Computer Vision, Anomaly Detection, Pattern Recognition, Profiling (Segmentation), Advertising/Recommenders.
Labels, second group: Deep Learning, Neural Network, Bayesian, Clustering, Dimensionality Reduction.]
5. Equivalence of Multiple and Overlapping Clustering
[Figure: Overlapping Clustering Solution; Partition A; Partition B]
6. Problem Description
• Co-clustering refers to the joint clustering of samples and features.
• Formally, given a set O of objects and a set F of features, with |O| = n
and |F| = d, a co-cluster C is a triple (O’, F’, R), with O’ ⊆ O, F’ ⊆ F and
R ⊆ O × F, that can be described as:
C(O’, F’, R) = {O’ × F’ | o ∈ O’, f ∈ F’, (o, f) ∈ R},
where the relation R defines the structure type of the co-cluster.
• Note that augmenting the relation R extends the similarity-based
measure to a more general and flexible “pattern-based framework”.
7. Scalability Issues
• A pair-wise distance metric makes the running cost of a
clustering algorithm prohibitively expensive, for two
reasons:
• High dimensionality of the samples
• Large number of sample pairs to compare, C(n, 2) = n(n − 1)/2
• To address high dimensionality, randomized dimensionality-reduction
techniques such as MinHash or Weighted Minwise Sampling
are used. They summarize the m-dimensional features as a much smaller
k-dimensional (k ≪ m) feature hash such that inter-sample
similarity is approximately preserved.
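The reduction described above can be illustrated with a minimal MinHash sketch. It assumes each sample is a non-empty set of non-negative integer feature indices and uses a simple linear hash family; the function names, the prime modulus, and the seed are illustrative choices, not taken from the dissertation.

```python
import random

# Mersenne prime 2^31 - 1; assumed larger than any feature index used.
PRIME = 2_147_483_647


def make_hash_params(k, seed=0):
    """Draw k (a, b) pairs for the hash family h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(features, params):
    """Summarize a non-empty set of feature indices as a k-value signature."""
    return [min((a * x + b) % PRIME for x in features) for a, b in params]


def estimate_jaccard(sig_x, sig_y):
    """Fraction of signature positions that agree; estimates the Jaccard index."""
    matches = sum(1 for u, v in zip(sig_x, sig_y) if u == v)
    return matches / len(sig_x)


def jaccard(x, y):
    """Exact Jaccard index of two sets, for comparison."""
    return len(x & y) / len(x | y)


params = make_hash_params(k=200, seed=42)
x = set(range(0, 60))
y = set(range(30, 90))
exact = jaccard(x, y)  # 30 shared features out of 90 in the union
approx = estimate_jaccard(minhash_signature(x, params), minhash_signature(y, params))
```

Here the 200-value signature stands in for the k-dimensional feature hash, and `approx` is only an estimate of `exact`; a larger k tightens the estimate at the cost of more hashing.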
8. Theory of Locality Sensitive Hashing
• Use the Jaccard Index (𝕁) as the similarity measure for Locality
Sensitive Hashing.
• Given the randomized hash h(x) (MH¹ or WMS²) of a
sample x, for any pair of samples x and y the probability of a
hash collision is given by:
Pr[h(x) = h(y)] = f(𝕁(x, y))
where f is monotonically increasing.
• Comparing samples this way uses a hash table and is
linear in the number of samples, hence avoids the C(n, 2) pair-wise
comparisons.
¹ MH = MinHash signature
² WMS = Weighted Minwise Sampling signature
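The hash-table idea can be sketched with the common banding scheme for LSH. This is an assumption about how signatures might be bucketed, not the dissertation's stated procedure: the names `bands` and `rows_per_band` are hypothetical, and the slides do not say whether they correspond to the tuning parameters B and K.

```python
from collections import defaultdict


def lsh_buckets(signatures, bands, rows_per_band):
    """Group sample ids whose signatures agree on at least one full band.

    signatures: dict mapping sample id -> list of bands * rows_per_band values.
    Returns the buckets that hold two or more samples.
    """
    table = defaultdict(set)
    for sample_id, sig in signatures.items():
        if len(sig) != bands * rows_per_band:
            raise ValueError("signature length must equal bands * rows_per_band")
        for b in range(bands):
            chunk = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            # One hash-table insert per band, so the work is linear in the
            # number of samples rather than in the number of sample pairs.
            table[(b, chunk)].add(sample_id)
    return [members for members in table.values() if len(members) > 1]


def collision_probability(j, bands, rows_per_band):
    """For MinHash: chance that two samples with Jaccard index j share a band."""
    return 1.0 - (1.0 - j ** rows_per_band) ** bands


example = {"x": [1, 2, 3, 4], "y": [1, 2, 9, 9], "z": [7, 7, 7, 7]}
candidates = lsh_buckets(example, bands=2, rows_per_band=2)
```

In this toy input, `x` and `y` agree on their first band and land in the same bucket, while `z` never collides. `collision_probability` is one concrete instance of a monotonically increasing f for the banded MinHash case.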
9. Motivation of Community Detection through Co-clustering
• Community = Dense interaction among nodes
= Co-cluster formed by densely connected subgraph
• Current community detection algorithms have at least one of the
following problems:
• Size/number of communities to be specified in advance
• Partition-based (Communities cannot overlap)
• Fuzzy membership method does not scale to large networks
• Community structure heavily depends on choice of seed nodes
• Limited scalability due to pair-wise comparisons
10. Problem Formulation
• Let S be the set of nodes interacting in a network. For any S’ ⊆ S, a
community C can be described as
C(S’, S’, R) = {S’ × S’ ∣ s, t ∈ S’, (s, t) ∈ R}
where the relation R encodes binary connectivity (cliquishness) among
the nodes of the community.
Two Communities: {b, c, d} and {a, e}
Adjacency matrix:
    a b c d e
a   1 0 0 0 1
b   0 1 1 1 0
c   0 1 1 1 0
d   0 1 1 1 0
e   1 0 0 0 1
11. Density relaxation for practical purposes
• In real-life applications, completely connected subgraphs are rare, so a
community is expressed as a subgraph whose density is at least ρ
(a lower bound):
C(S’, S’, R’) = {S’ × S’ ∣ s ∈ S’, ∣R’∣ ≥ ρ ⋅ ∣S’ × S’∣}
where
R’ = {(s, t) ∈ R ∣ s, t ∈ S’}
and the fraction ρ denotes the minimum threshold on density.
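The density condition can be checked directly on the five-node adjacency matrix from the Problem Formulation slide. The sketch below is illustrative only: the helper names are hypothetical, and density is taken as the fraction of ones in the S’ × S’ block, diagonal included, matching the matrix as shown.

```python
def density(adjacency, index, nodes):
    """Fraction of the |S'| x |S'| block of the adjacency matrix that is one."""
    ids = [index[v] for v in nodes]
    ones = sum(adjacency[i][j] for i in ids for j in ids)
    return ones / (len(ids) ** 2)


def is_community(adjacency, index, nodes, rho):
    """True when the block induced by `nodes` is at least as dense as rho."""
    return density(adjacency, index, nodes) >= rho


labels = ["a", "b", "c", "d", "e"]
index = {v: i for i, v in enumerate(labels)}
D = [
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
]

dense_block = density(D, index, ["b", "c", "d"])  # all 9 entries are one
mixed_block = density(D, index, ["a", "b"])       # only the 2 diagonal entries are one
```

With ρ = 0.8, for example, {b, c, d} and {a, e} pass the test while {a, b} does not.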
12. CDCC Algorithm
• Input: Binary adjacency matrix D (n × n)
Tuning parameters: B and K (for dimensionality reduction)
Thresholds: Jmin (min density) and Vmin (min size)
• Output: Detected Communities
• Three phases:
1. Generating row-clusters
2. Identifying corresponding column-clusters
3. Extracting communities
13. Phase 1: Generating row-clusters using LSH
[Figure: Hash-table for row-clusters]
15. Phase 3: Extracting Communities from co-clusters
The union of the row-set and column-set of a co-cluster represents a detected
community.
The overlapping columns (nodes) account for the overlap among
communities.
[Figure: Row-clusters ∪ Column-clusters = Overlapping Communities]
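The union step of Phase 3 can be sketched as follows, assuming each co-cluster is already available as a (row-set, column-set) pair of node labels. The function names and the `min_size` filter are hypothetical; `min_size` only loosely mirrors the Vmin threshold, whose exact use the slides do not spell out.

```python
def extract_communities(co_clusters, min_size=1):
    """Turn (row_set, column_set) pairs into communities via set union."""
    communities = []
    for rows, cols in co_clusters:
        community = set(rows) | set(cols)
        # Skip communities that are too small or already recorded.
        if len(community) >= min_size and community not in communities:
            communities.append(community)
    return communities


def overlapping_nodes(communities):
    """Nodes that belong to more than one community."""
    seen, shared = set(), set()
    for community in communities:
        shared |= seen & community
        seen |= community
    return shared


co_clusters = [({"a", "b"}, {"b", "c"}), ({"d"}, {"c", "d", "e"})]
found = extract_communities(co_clusters)
overlap = overlapping_nodes(found)
```

In this toy input, node `c` appears in the column-sets of both co-clusters, so it ends up in both communities and is reported as the overlap.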
16. Runtime Complexity
• n = nodes in the graph
• m = number of co-clusters detected (m << n)
• d = edges per node (avg. non-zeros per row)
• Phase 1: O(BKn + dn)
• Phase 2: O(dn)
• Phase 3: O(m)
• Thus, overall complexity = O((BK+d)n)
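The overall bound follows from summing the three phase costs. A short derivation, assuming B and K are treated as parameters that do not grow with n and that BK + d ≥ 1:

```latex
\begin{aligned}
T(n) &= \underbrace{O(BKn + dn)}_{\text{Phase 1}}
      + \underbrace{O(dn)}_{\text{Phase 2}}
      + \underbrace{O(m)}_{\text{Phase 3}} \\
     &= O(BKn + 2dn + m) \\
     &= O\big((BK + d)\,n\big) \qquad \text{since } m \ll n .
\end{aligned}
```

The O(m) term is absorbed because the number of co-clusters m is bounded by n, which is itself bounded by (BK + d)n.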
17. LFR Benchmark Network Generator
• Generated 11 network graphs, each with 100 nodes and average
degree = 19, with the number of overlapping nodes gradually increasing
from one graph to the next.
• CPN denotes average Class membership Per Node.
18. Simulation results (NMI, F-measure & NVI) on LFR benchmarks
• A network shift (a change in the number of classes) shows up as an abrupt
rise in NMI and F-measure and a corresponding fall in NVI.
[Figure note: the number of classes is shown as yellow bars, scaled on the
right-hand y-axis.]
19. Number of overlapping nodes discovered
against benchmark on
• CDCC recovers the most overlapping nodes just after shifts in
the community structure, as the newly discovered community
compensates for the otherwise degrading quality.
20. Total running time of CDCC with increasing graph size
• The running-time analysis confirms the theoretical complexity derived
earlier, O((BK+d)n).
21. Thank You
PAPER PUBLISHED
Kakkar S. and Beniwal S., “Discovering Overlapping Community Structure in Networks through Co-clustering”, in
IEEE International Conference on Inventive Computation Technologies, Coimbatore, TN, India, 2016. [Accepted]