Discovering Overlapping Community Structure in Networks through Co-clustering

Scalable Multiple Clustering of High-dimensional Data
Under the Supervision of: Mrs. Sunita Beniwal, Assistant Professor, Department of CSE
Submitted by: Sahil Kakkar, M. Tech. Candidate, Reg. No. 14011018
Dissertation, GJUS&T, Hisar-125001, Haryana, India, 17th September, 2016

Contents
• Introduction
• Equivalence of Multiple and Overlapping Clustering
• Problem Description
• Scalability Issues
• Motivation for Community Detection
• Problem Formulation
• CDCC Algorithm
• Simulation Results

Introduction
• A clustered view of the dataset defines a “partition”, also called a “clustering” in the literature.
• Multiple Clustering rests on the notion that a dataset can admit more than one such partition. These multiple clusterings provide multiple views of the dataset.
[Figure: A Single Partition]

[Figure: Partition based on algorithm-types (Deep Learning, Neural Network, Bayesian, Clustering, Dimensionality Reduction) and Partition based on applications (Computer Vision, Anomaly Detection, Pattern Recognition, Profiling (Segmentation), Advertising/Recommenders)]

Equivalence of Multiple and Overlapping Clustering
[Figure: Partition A and Partition B, shown alongside the Overlapping Clustering Solution]

Problem Description
• Co-clustering refers to the joint clustering of samples and features.
• Formally, given a set O of objects and a set F of features, with |O| = n and |F| = d, a co-cluster C is a triple (O’, F’, R), with O’ ⊆ O, F’ ⊆ F and R ⊆ O × F, that can be described as:
C(O’, F’, R) = {O’ × F’ | o ∈ O’, f ∈ F’, (o, f) ∈ R},
where the relation R defines the structure-type of the co-cluster.
• Notice that augmenting the relation R extends the similarity-based measure to a more general and flexible “pattern-based framework”.

Scalability Issues
• Pair-wise distance metrics render the running cost of any clustering algorithm that relies on them prohibitively expensive, for two reasons:
• High dimensionality of the samples
• Large number of sample pairs to compare, nC2
• To address the high-dimensionality problem, randomized dimensionality reduction techniques such as MinHash or Weighted Minwise Sampling are used. They summarize the m-dimensional features into a much smaller k-dimensional (k << m) feature hash such that inter-sample similarity is preserved.
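The reduction from m features to a k-value hash can be illustrated with a minimal MinHash sketch. It assumes binary features and uses explicit random permutations of the feature ids, which is the textbook construction rather than necessarily the variant used in this work; the function names and the two sample sets are hypothetical.

```python
import random

def minhash_signature(features, m, k, seed=0):
    """Summarize a non-empty set of feature ids (0..m-1) as k minimum values."""
    rng = random.Random(seed)  # same seed => same k permutations for every sample
    signature = []
    for _ in range(k):
        perm = list(range(m))
        rng.shuffle(perm)  # one random permutation of the m feature ids
        signature.append(min(perm[f] for f in features))
    return signature

def estimated_jaccard(sig_x, sig_y):
    """Fraction of signature positions on which two samples agree."""
    return sum(u == v for u, v in zip(sig_x, sig_y)) / len(sig_x)

# Hypothetical samples over m = 100 binary features; true Jaccard = 60/100.
x = set(range(0, 80))
y = set(range(20, 100))
sig_x = minhash_signature(x, m=100, k=256)
sig_y = minhash_signature(y, m=100, k=256)
print(estimated_jaccard(sig_x, sig_y))  # close to 0.6, up to sampling noise
```

Storing one permutation per hash value costs O(km) memory, which is acceptable for a demonstration; practical implementations usually replace the permutations with cheap hash functions.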

Theory of Locality Sensitive Hashing
• Use the Jaccard Index (𝕁) as the similarity measure for Locality Sensitive Hashing.
• Given the randomized hash h(x) (MH¹ or WMS²) of a sample x, for any pair of samples x and y, the probability of a hash collision is given by:
Pr[h(x) = h(y)] = f(𝕁(x, y))
where f is monotonically increasing.
• This approach compares samples through a hash table and is linear in the number of samples, hence it avoids the nC2 pairwise comparisons.
¹ MH = MinHash signature
² WMS = Weighted Minwise Sampling signature
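A small sketch of how a hash table replaces pairwise comparison is given here. It assumes the common banding scheme, in which a signature is cut into `bands` groups of `rows` values and two samples collide when any whole group matches; under that assumption f(𝕁) = 1 − (1 − 𝕁^rows)^bands. Whether these two parameters correspond to the B and K used later in the slides is not stated, so that mapping is an assumption, and the signatures are made-up values.

```python
from collections import defaultdict

def lsh_buckets(signatures, bands, rows):
    """Return groups of sample ids that share at least one identical band."""
    table = defaultdict(set)
    for sample_id, sig in signatures.items():  # one pass over the samples
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            table[key].add(sample_id)
    return [ids for ids in table.values() if len(ids) > 1]

def collision_probability(j, bands, rows):
    """f(J) for the banding scheme: monotonically increasing in J."""
    return 1.0 - (1.0 - j ** rows) ** bands

# Hypothetical 4-value signatures, split into 2 bands of 2 values each.
signatures = {"x": [3, 7, 1, 9], "y": [3, 7, 4, 2], "z": [5, 8, 6, 0]}
candidates = lsh_buckets(signatures, bands=2, rows=2)  # x and y share band 0
```

Each sample is inserted into the table a fixed number of times, so the work grows linearly with the number of samples instead of with the number of pairs.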

Motivation for Community Detection through Co-clustering
• Community = Dense interaction among nodes
= Co-cluster formed by densely connected subgraph
• Current community detection algorithms have at least one of the
following problems:
• Size/number of communities must be specified in advance
• Partition-based (communities cannot overlap)
• Fuzzy membership methods do not scale to large networks
• Community structure depends heavily on the choice of seed nodes
• Limited scalability due to pair-wise comparisons

Problem Formulation
• Let S be the set of nodes interacting in a network. For any S’ ⊆ S, a community C can be described as
C(S’, S’, R) = {S’ × S’ ∣ s, t ∈ S’, (s, t) ∈ R}
where the relation R encodes binary connectivity (cliquishness) among the nodes of the community.
Two Communities: {b, c, d} and {a, e}
[Figure: network graph over the nodes a, b, c, d, e]
Adjacency matrix (rows and columns in the order a, b, c, d, e):
  a b c d e
a 1 0 0 0 1
b 0 1 1 1 0
c 0 1 1 1 0
d 0 1 1 1 0
e 1 0 0 0 1
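The five-node example can be checked directly. The sketch below tests whether a node set forms a fully connected block of the adjacency matrix, and reorders the matrix so that the two communities appear as diagonal blocks; the helper names are hypothetical.

```python
nodes = ["a", "b", "c", "d", "e"]
D = [
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
]

def is_clique_community(D, nodes, members):
    """True if every pair (s, t) of members is connected in D."""
    idx = [nodes.index(s) for s in members]
    return all(D[i][j] == 1 for i in idx for j in idx)

def reorder(D, nodes, order):
    """Permute rows and columns of D into the given node order."""
    idx = [nodes.index(s) for s in order]
    return [[D[i][j] for j in idx] for i in idx]

# Grouping community members together exposes the block-diagonal structure.
blocks = reorder(D, nodes, ["b", "c", "d", "a", "e"])
```

After reordering, the 3 × 3 block for {b, c, d} and the 2 × 2 block for {a, e} are all ones and every entry outside them is zero.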

Density relaxation for practical purposes
• In real-life applications, completely connected subgraphs are rare. A community is therefore expressed as a subgraph whose density is at least ρ (a lower bound):
C(S’, S’, R’) = {S’ × S’ ∣ s, t ∈ S’, ∣R’∣ ≥ ρ ⋅ ∣S’ × S’∣}
where
R’ = {(s, t) ∈ R ∣ s, t ∈ S’}
and the fraction ρ denotes the minimum threshold on density.
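The density condition amounts to counting the connected pairs inside S’ × S’ and comparing the fraction with ρ. A minimal sketch follows, assuming a binary adjacency matrix with self-loops counted on the diagonal, as in the earlier five-node example; the matrix and function names are hypothetical.

```python
def density(D, members):
    """|R'| / |S' x S'|: fraction of connected ordered pairs within members."""
    connected = sum(D[i][j] for i in members for j in members)
    return connected / (len(members) ** 2)

def is_community(D, members, rho):
    """True if the subgraph induced by members is at least rho-dense."""
    return density(D, members) >= rho

# Hypothetical 4-node graph: nodes 0, 1, 2 form a near-clique (edge 0-2 missing).
D = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
members = [0, 1, 2]
d = density(D, members)  # 7 connected pairs out of 9
```

With ρ = 0.75 the set {0, 1, 2} qualifies as a community even though it is not a clique, while ρ = 0.9 rejects it.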

CDCC Algorithm
• Input: Binary adjacency matrix D of size n × n
Tuning parameters: B and K (for dimensionality reduction)
Thresholds: Jmin (min density) and Vmin (min size)
• Output: Detected Communities
• Three phases:
1. Generating row-clusters
2. Identifying corresponding column-clusters
3. Extracting communities

Phase 1: Generating row-clusters using LSH
[Figure: hash-table for row-clusters]

Phase 2: Identifying corresponding column-clusters
[Figure: hash-table for column-clusters]

Phase 3: Extracting Communities from co-clusters
• The union of the row-set and the column-set of a co-cluster represents a detected community.
• The overlapping columns (nodes) account for the overlap among communities.
[Figure: Row-clusters ∪ Column-clusters = Overlapping Communities]
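The three phases can be sketched end to end as follows. This is a simplified illustration, not the CDCC implementation: it assumes B bands of K MinHash values each, selects columns by simple counting instead of a second hash table, skips buckets holding a single row, and reads Jmin as the minimum fraction of a row-cluster's rows in which a column must be non-zero. All names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def detect_communities(D, B, K, j_min, v_min, seed=0):
    n = len(D)
    rng = random.Random(seed)
    perms = []  # B*K random permutations of the column ids (MinHash functions)
    for _ in range(B * K):
        p = list(range(n))
        rng.shuffle(p)
        perms.append(p)

    # Phase 1: rows whose signatures agree on a whole band share a bucket.
    row_table = defaultdict(set)
    for r in range(n):
        cols = [c for c in range(n) if D[r][c]]
        if not cols:
            continue
        sig = [min(p[c] for c in cols) for p in perms]
        for b in range(B):
            row_table[(b, tuple(sig[b * K:(b + 1) * K]))].add(r)

    communities = set()
    for rows in row_table.values():
        if len(rows) < 2:  # assumption: a lone row is not a row-cluster
            continue
        # Phase 2: keep columns non-zero in at least j_min of the cluster's rows,
        # so the rows x columns block has density of at least j_min.
        counts = Counter(c for r in rows for c in range(n) if D[r][c])
        cols = {c for c, cnt in counts.items() if cnt / len(rows) >= j_min}
        # Phase 3: a community is the union of the row set and the column set.
        community = frozenset(rows | cols)
        if len(community) >= v_min:
            communities.add(community)
    return communities

# Five-node example (a..e as 0..4): rows a, e are identical, as are b, c, d.
D = [
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
]
found = detect_communities(D, B=4, K=2, j_min=0.8, v_min=2)
```

Identical rows always hash to the same bucket and rows with disjoint neighbourhoods never do, so this toy input yields {a, e} and {b, c, d} regardless of the seed. On real graphs the row-clusters depend on the random hashes, and a node may enter several communities through the column sets.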

Runtime Complexity
• n = nodes in the graph
• m = number of co-clusters detected (m << n)
• d = edges per node (avg. non-zeros per row)
• Phase 1: O(BKn + dn)
• Phase 2: O(dn)
• Phase 3: O(m)
• Thus, overall complexity = O((BK+d)n)

LFR Benchmark Network Generator
• Generated 11 network graphs, each with 100 nodes and average degree = 19, with a gradually increasing number of overlapping nodes.
• CPN denotes average Class membership Per Node.

Simulation results (NMI, F-measure & NVI) on LFR benchmarks
• A network shift (a change in the number of classes) is signalled by an abrupt rise in NMI and F-measure and a corresponding fall in NVI.
[Figure: NMI, F-measure and NVI on the LFR benchmarks; the number of classes is shown as yellow bars, scaled on the right-side y-axis]

Number of overlapping nodes discovered
against benchmark on
• CDCC recovers the largest number of overlapping nodes just after a shift in the community structure, as the newly discovered community compensates for the otherwise degrading quality.

Total running time of CDCC with increasing graph size
• The running-time analysis confirms the theoretical scalability of O((BK+d)n) derived earlier.

Thank You
PAPER PUBLISHED
Kakkar S. and Beniwal S., “Discovering Overlapping Community Structure in Networks through Co-clustering”, in
IEEE International Conference on Inventive Computation Technologies, Coimbatore, TN, India, 2016. [Accepted]
