Community Detection in Social Networks: A Brief Overview

Satyaki Sikdar
Heritage Institute of Technology, Kolkata
8 January 2016

Introduction
Table of Contents
1 Introduction
About Me
Social Networks
Mathematical background
2 Motivation
3 The Hunt for Communities
4 The Need for Speed (and quality)

Introduction About Me
about me
Extremely lazy - I’ve been told
Working with social networks for the past 8 months under the supervision of Prof. Partha Basuchowdhuri
Conversant in Python, C++ and C - an average programmer at best
Vice Chair of Heritage Institute of Technology ACM Student Chapter

Introduction Social Networks
Networks
Networks are everywhere. They crop up wherever there are interactions between actors.
friendship networks
follower networks
neural networks
telecom networks
trade of goods and services
protein-protein interactions (medicine design)
citations and collaborations
power grid networks
predator-prey networks

Citation and Email networks

Telecommunication and Protein networks

Friendship and Les Misérables

High school relationship network
Nearly bipartite
One giant component and a lot of little ones
No cycles, almost tree-like, so information or disease spreads fast

Introduction Mathematical background
Network representation
Networks portray the interactions between different actors.
Actors or individuals are nodes/vertices in the graph
If there’s interaction between two nodes, there’s an edge/link between them
The links can have weights or intensities signifying the strength of connections
The links can be directed, like in the web graph. There’s a directed link from node (page) A to node (page) B if there’s a hyperlink to B from A
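A minimal sketch of one way to store such a network in Python, assuming a plain adjacency map of the form {node: {neighbor: weight}}. The helper name `build_graph` and the example nodes are made up for illustration.

```python
def build_graph(edges, directed=False):
    """Build an adjacency map {node: {neighbor: weight}} from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w   # link u -> v with its weight
        adj.setdefault(v, {})          # make sure v appears even with no outgoing links
        if not directed:
            adj[v][u] = w              # undirected: store the reverse link too
    return adj

# Undirected, weighted friendship network (weights = strength of the tie)
friends = build_graph([("ann", "bob", 3), ("bob", "cat", 1)])

# Directed web graph: page A has a hyperlink to page B
web = build_graph([("A", "B", 1)], directed=True)
```

In the directed case only A lists B as a neighbor; in the undirected case every link is stored in both directions.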

Degree and degree distribution
The degree of a node is the number of edges incident to that node (in a directed graph, the out-degree counts only the outward edges)
The degree distribution of a network gives, for each degree k, the fraction of nodes that have degree k
Node Degree
1 3
2 2
3 4
4 2
5 3
6 3
7 3
8 2
9 2
10 2
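As a small worked example, the degree distribution of the ten-node table above can be computed as follows. The function name `degree_distribution` is an illustrative choice.

```python
from collections import Counter

# Node -> degree, copied from the table above
degrees = {1: 3, 2: 2, 3: 4, 4: 2, 5: 3, 6: 3, 7: 3, 8: 2, 9: 2, 10: 2}

def degree_distribution(degrees):
    """Return {k: fraction of nodes whose degree is k}."""
    counts = Counter(degrees.values())
    n = len(degrees)
    return {k: counts[k] / n for k in sorted(counts)}

distribution = degree_distribution(degrees)

# In an undirected graph every edge contributes to two degrees,
# so the number of edges is half the degree sum.
num_edges = sum(degrees.values()) // 2
```

Half of the nodes have degree 2, four have degree 3 and one has degree 4. The degree sum is 26, which corresponds to 13 undirected edges.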

Motivation
Table of Contents
1 Introduction
2 Motivation
What are they and why do we even care?
Communities!
Justification for the presence of communities

Motivation What are they and why do we even care?
Community Structure: An Informal Definition
In many real networks, the degree distribution follows a power law and is long-tailed
The distribution of edges is inhomogeneous
High concentrations of edges within special groups of vertices, and low concentrations between them. This feature of real networks is called community structure

Degree distributions of real life networks

Motivation Communities!
Why bother about communities?
Communities are groups of vertices which probably share common properties and/or play similar roles within the graph.
Society offers a wide variety of possible group organizations: families, working and friendship circles, villages, towns, nations.
Communities also occur in many networked systems from biology, computer science, engineering, economics, politics, etc.
In protein-protein interaction networks, communities are likely to group proteins having the same specific function within the cell
In the graph of the World Wide Web they may correspond to groups of pages dealing with the same or related topics

Applications of Community Detection
Clustering Web clients who have similar interests and are geographically near to each other improves the performance of services
Identifying clusters of customers with similar interests in the purchase networks of online retailers makes it possible to set up efficient recommendation systems
Clusters of large graphs can be used to create data structures in order to efficiently store the graph data and to handle navigational queries, like path searches
Allocation of tasks to processors in parallel computing. This can be accomplished by splitting the computer cluster into groups with roughly the same number of processors, such that the number of physical connections between processors of different groups is minimal.

A few real world examples
Figure: Zachary’s Karate Club
Figure: Collaboration network between scientists working at the Santa Fe Institute

Motivation Justification for the presence of communities
An Empirical Justification
Figure: Add Health friendship data coded by race: Blue = Black, Yellow = White, Red = Hispanic, Green = Asian, White = Other

Homophily: Birds of a feather flock together
There’s a visible bias in friendships
52% white students, white-white friendships 86%
38% black students, black-black friendships 85%
5% Hispanics, Hispanic-Hispanic friendships 2%
Asymmetric behavior highlights homophily
Results in non-uniform edge distributions
Promotes the formation and maintenance of community structure

The Hunt for Communities
Table of Contents
1 Introduction
2 Motivation
3 The Hunt for Communities
Where to start?
Definitions
A naïve approach - NP hardness
Girvan-Newman Algorithm
Girvan-Newman in Action
Modularity
Louvain Method
Our method - methodical graph sparsification

The Hunt for Communities Where to start?
Formalizing the problem
For a given graph $G(V, E)$, find a cover $\mathcal{C} = \{C_1, C_2, ..., C_k\}$ such that $\bigcup_i C_i = V$
For disjoint communities, $C_i \cap C_j = \emptyset \;\; \forall i \neq j$
For overlapping communities, $C_i \cap C_j \neq \emptyset$ for some $i \neq j$
Figure: Zachary’s Karate Club Network
$\mathcal{C} = \{C_1, C_2, C_3\}$, $C_1$ = yellow nodes, $C_2$ = green, $C_3$ = blue is a disjoint cover
However, $\bar{\mathcal{C}} = \{\bar{C}_1, \bar{C}_2\}$, $\bar{C}_1$ = yellow & green nodes and $\bar{C}_2$ = blue & green nodes is an overlapping cover
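The two conditions can be checked mechanically. Below is a small sketch on a made-up six-node example (not the karate club data). The function names `is_cover` and `is_disjoint` are illustrative.

```python
from itertools import combinations

def is_cover(communities, nodes):
    """True if the union of all communities equals the node set V."""
    return set().union(*communities) == set(nodes)

def is_disjoint(communities):
    """True if no two communities share a node."""
    return all(a.isdisjoint(b) for a, b in combinations(communities, 2))

nodes = {1, 2, 3, 4, 5, 6}                       # hypothetical toy node set
disjoint_cover = [{1, 2}, {3, 4}, {5, 6}]
overlapping_cover = [{1, 2, 3, 4}, {3, 4, 5, 6}]  # nodes 3 and 4 belong to both
```

Both lists are covers of the node set, but only the first one passes the disjointness check.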

The Hunt for Communities Definitions
A few more definitions
Figure: A simple graph with three communities. Intra-community edges are blue and inter-community ones in green
Let $C$ be a community of a graph $G(V, E)$ with $|C| = n_c$, $|V| = n$ and $|E| = m$. We define,
Average link density $\delta(G) = \frac{m}{n(n-1)/2}$
Intra-cluster density $\delta_{int}(C) = \frac{\#\,\text{internal edges of } C}{n_c(n_c-1)/2}$
Inter-cluster density $\delta_{ext}(C) = \frac{\#\,\text{inter-cluster edges of } C}{n_c(n-n_c)}$
For a good community, we expect $\delta_{int}(C) \gg \delta(G)$ and $\delta_{ext}(C) \ll \delta(G)$
We look to maximize $\sum_{C} (\delta_{int}(C) - \delta_{ext}(C))$
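A hedged sketch of the three densities on a toy graph: two triangles joined by a single bridge edge. The graph and the function name `densities` are assumptions made for illustration, and the code assumes 1 < n_c < n so that no denominator is zero.

```python
def densities(adj, community):
    """Return (delta(G), delta_int(C), delta_ext(C)) for an undirected graph."""
    n = len(adj)
    n_c = len(community)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    # Each internal edge is seen from both endpoints, hence the division by 2
    internal = sum(1 for u in community for v in adj[u] if v in community) // 2
    # Edges with exactly one endpoint inside the community
    external = sum(1 for u in community for v in adj[u] if v not in community)
    delta_g = m / (n * (n - 1) / 2)
    delta_int = internal / (n_c * (n_c - 1) / 2)
    delta_ext = external / (n_c * (n - n_c))
    return delta_g, delta_int, delta_ext

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge (2, 3)
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
delta_g, delta_int, delta_ext = densities(adj, {0, 1, 2})
```

For the first triangle, the intra-cluster density is 1 (all three possible internal edges exist), the inter-cluster density is 1/9, and the average link density of the whole graph is 7/15, so the triangle satisfies both conditions for a good community.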

The Hunt for Communities A naïve approach - NP hardness
A Naïve Approach
We have an objective function $f(\mathcal{C}) = \sum_{C \in \mathcal{C}} (\delta_{int}(C) - \delta_{ext}(C))$, where $\mathcal{C}$ is a cover
How do we find a good $\mathcal{C}$?
Exhaustive enumeration, or in simple words, brute force!
Try out all possible covers $\mathcal{C}$, with communities of all possible sizes, and pick the one that maximizes $f(\mathcal{C})$
What’s the problem? Too many choices of $\mathcal{C}$ to pick from - needle in a haystack!
Even for small graphs, brute forcing becomes infeasible
Can we do better?
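To get a feel for the size of the haystack: if we restrict ourselves to disjoint covers, the candidates are exactly the set partitions of the n nodes, and their number is the Bell number B_n. The sketch below computes it with the Bell triangle. Overlapping covers only make the count larger.

```python
def bell(n):
    """Number of ways to partition a set of n elements (Bell number B_n)."""
    row = [1]                        # first row of the Bell triangle
    for _ in range(n):
        new_row = [row[-1]]          # each row starts with the last entry of the previous one
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

# Growth of the search space for a few small node counts
search_space = {n: bell(n) for n in (5, 10, 34)}
```

Already at 10 nodes there are 115,975 disjoint covers to score, and the count grows faster than exponentially, so exhaustive search is out of reach even for a network as small as the 34-node karate club.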

The Hunt for Communities Girvan-Newman Algorithm
A Little Background: Edge Betweenness Centrality
Betweenness centrality of an edge $e$ is the sum, over all pairs of nodes, of the fraction of shortest paths that pass through $e$:
$c_B(e) = \sum_{s,t \in V} \frac{\sigma(s, t|e)}{\sigma(s, t)}$
where $\sigma(s, t)$ is the number of shortest paths from $s$ to $t$ and $\sigma(s, t|e)$ is the number of shortest paths from $s$ to $t$ passing through the edge $e$
Top 6 edges
Edge      cB(e)    type
(10, 13)  0.3      inter
(3, 5)    0.23333  inter
(7, 15)   0.2079   inter
(1, 8)    0.1873   inter
(13, 15)  0.1746   intra
(5, 7)    0.1476   intra
Bottom 6 edges
Edge      cB(e)    type
(8, 11)   0.022    intra
(1, 2)    0.0269   intra
(9, 11)   0.031    intra
(8, 9)    0.0412   intra
(12, 15)  0.052    intra
(3, 4)    0.060    intra
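A rough sketch of how these scores can be computed for an unweighted, undirected graph, using breadth-first search from every node followed by a backward accumulation (the idea behind Brandes' algorithm). The scores here are unnormalised pair counts, so they are not on the same scale as the values in the tables above, and the toy graph is made up for illustration.

```python
from collections import deque

def edge_betweenness(adj):
    """Unnormalised edge betweenness of an unweighted, undirected graph."""
    score = {}
    for s in adj:
        # BFS from s: distances, shortest-path counts and predecessors
        dist, sigma, preds, order = {s: 0}, {s: 1}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:      # u lies on a shortest path to v
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Walk back from the farthest nodes, pushing path fractions onto edges
        delta = dict.fromkeys(order, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                edge = (min(v, w), max(v, w))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    # Every unordered pair {s, t} was counted from both ends
    return {edge: value / 2 for edge, value in score.items()}

# Two triangles joined by the bridge (2, 3)
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = edge_betweenness(adj)
```

The bridge (2, 3) carries all 3 x 3 = 9 shortest paths between the two triangles, so it gets the highest score, which is exactly why inter-community edges dominate the top of the table.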

The Girvan-Newman Algorithm
Proposed by Girvan and Newman in 2002 and improved in 2004.
Based on reachability of nodes - shortest paths
Edges are selected on the basis of their edge betweenness centrality
The algorithm
1 Compute the betweenness centrality of all edges
2 Remove the edge with the largest centrality; ties can be broken randomly
3 Recalculate the centralities on the remaining graph
4 Iterate from step 2, stop when you get clusters of desirable quality
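The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the stopping rule "clusters of desirable quality" is replaced by a hypothetical `target_parts` parameter (stop once the graph has split into that many components), and ties are broken by dictionary order rather than randomly.

```python
from collections import deque

def edge_betweenness(adj):
    """Unnormalised edge betweenness (BFS from every node, then back-propagation)."""
    score = {}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = dict.fromkeys(order, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                edge = (min(v, w), max(v, w))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    return {edge: value / 2 for edge, value in score.items()}

def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in part:
                    part.add(v)
                    stack.append(v)
        seen |= part
        parts.append(part)
    return parts

def girvan_newman(adj, target_parts):
    """Remove highest-betweenness edges until the graph has target_parts components."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}       # work on a copy
    while len(components(adj)) < target_parts and any(adj.values()):
        score = edge_betweenness(adj)                     # steps 1 and 3
        u, v = max(score, key=score.get)                  # step 2
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by the bridge (2, 3)
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
parts = girvan_newman(adj, target_parts=2)
```

On this toy graph the first edge removed is the bridge, which immediately separates the two triangles. Recomputing the centralities after every removal is what makes the method slow on large graphs.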

The Hunt for Communities Girvan-Newman in Action
(a) Best edge: (10, 13)
(b) Best edge: (3, 5)
(c) Best edge: (7, 15)
(d) Best edge: (1, 8)
(e) Best edge: (2, 11)
(f) Final graph

The Hunt for Communities Modularity
Modularity
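For a given graph $G(V, E)$ and a disjoint cover $\mathcal{C} = \{C_1, C_2, ..., C_k\}$, let $A$ be the adjacency matrix, $k_i$ the degree of node $i$, $c_i$ the community of node $i$, $m = |E|$, and $\delta(c_i, c_j) = 1$ if $c_i = c_j$ and $0$ otherwise. We have,
the number of intra-community edges as $\frac{1}{2} \sum_{ij} A_{ij} \, \delta(c_i, c_j)$
the expected number of edges between all pairs of nodes in a community as $\frac{1}{2} \sum_{ij} \frac{k_i k_j}{2m} \, \delta(c_i, c_j)$
the difference of the actual and the expected values is $\frac{1}{2} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$
We define modularity $Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$. $Q \in [-1, 1]$
The higher the modularity, the better is the community structure.
The lower it is, the closer the edge distribution is to random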
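A direct, unoptimised transcription of the modularity formula for an unweighted, undirected graph. The graph (two triangles joined by a bridge) and the dictionary `community`, which maps each node to a community label, are assumptions made for illustration.

```python
def modularity(adj, community):
    """Q = 1/(2m) * sum_ij [A_ij - k_i k_j / (2m)] * delta(c_i, c_j)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())       # sum of degrees = 2m
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:              # delta(c_i, c_j) = 1
                a_ij = 1 if j in adj[i] else 0
                q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m

# Two triangles joined by the bridge (2, 3)
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_block = {node: "a" for node in adj}

q_good = modularity(adj, two_triangles)   # 5/14, about 0.357
q_trivial = modularity(adj, one_block)    # 0: no better than chance
```

Splitting the graph into its two triangles gives Q = 5/14, while putting every node into a single community gives Q = 0, since the actual and expected edge counts then coincide.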

The Hunt for Communities Louvain Method
Louvain Method: A Greedy Approach
Proposed by Blondel et al. in 2008.
Takes a greedy modularity maximization approach
Very fast in practice; the state of the art in disjoint community detection as of 2016.
Performs hierarchical partitioning, stopping when there cannot be any further improvement in modularity
Contracts the graph in each iteration, thereby speeding up the process

The Algorithm
1 Initially each node is in its own community
2 A sequential sweep over the nodes is performed. Given a node i, the gain in weighted modularity (ΔQ) coming from putting i in the community of its neighbor j is computed. i is put in that community for which ΔQ is maximum, provided ΔQ > 0.
3 Communities are replaced by supernodes, and two supernodes are connected by an edge iff there’s at least one edge between vertices of the two communities. The weight of the new edge is the sum of the weights of the edges between the two communities.
4 The above two steps are repeated as long as ΔQ > 0
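Steps 1 and 2 (the local moving phase) can be sketched as below for an unweighted graph. This is a deliberately naive illustration: it recomputes the full modularity for every candidate move, whereas practical implementations update ΔQ incrementally, and it omits the contraction into supernodes (step 3). The names `modularity` and `local_moving` and the toy graph are assumptions.

```python
def modularity(adj, community):
    """Q as a sum over communities: L_c / m - (d_c / 2m)^2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    internal, degree_sum = {}, {}
    for u, nbrs in adj.items():
        c = community[u]
        degree_sum[c] = degree_sum.get(c, 0) + len(nbrs)
        for v in nbrs:
            if community[v] == c:
                internal[c] = internal.get(c, 0) + 0.5    # each edge is seen twice
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

def local_moving(adj):
    """Sweep over the nodes until no single move improves modularity."""
    community = {u: u for u in adj}                       # step 1: singletons
    improved = True
    while improved:
        improved = False
        for u in sorted(adj):                             # step 2: sequential sweep
            original = community[u]
            best_c, best_q = original, modularity(adj, community)
            neighbor_communities = {community[v] for v in adj[u]} - {original}
            for c in sorted(neighbor_communities):
                community[u] = c                          # tentatively move u
                q = modularity(adj, community)
                if q > best_q + 1e-12:                    # keep only strict gains
                    best_c, best_q = c, q
            community[u] = best_c
            if best_c != original:
                improved = True
    return community

# Two triangles joined by the bridge (2, 3)
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community = local_moving(adj)
```

On this toy graph the sweeps settle on the two triangles. Because every accepted move strictly increases Q, the loop is guaranteed to terminate.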

Louvain Method in Action

Figure: Belgian mobile phone network. The red nodes are French speakers and the green ones are Dutch speakers

The Hunt for Communities Our method - methodical graph sparsification
Community Detection by Graph Sparsification
Proposed by Basuchowdhuri, Sikdar, Shreshtha and Majumder in 2015. Accepted at ACM CoDS 2016 as a full paper.
The input graph is methodically sparsified while preserving the community structure. A t-spanner is used for this purpose.
The Louvain method is applied on the reduced graph to obtain the clusters
Very fast in practice. Performance is comparable to the Louvain method both in terms of quality and modularity.

The Algorithm
1 Construct a t-spanner for the given network. Take the complement of the spanner in the original network
2 Form a cover using any fast community detection algorithm on the sparsified graph
3 Run the Louvain method to refine the clusters
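For readers unfamiliar with spanners: a t-spanner is a subgraph in which the distance between any two nodes is at most t times their distance in the original graph. The slides do not say which construction is used, so the sketch below shows the classic greedy construction for unweighted graphs purely as an illustration. The names `hop_distance` and `greedy_spanner` are made up.

```python
from collections import deque

def hop_distance(adj, source, target, cutoff):
    """BFS distance from source to target, or None if it is larger than cutoff."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == cutoff:            # do not explore beyond the cutoff
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == target:
                    return dist[v]
                queue.append(v)
    return None

def greedy_spanner(nodes, edges, t):
    """Keep an edge only if its endpoints are more than t hops apart so far."""
    spanner = {u: set() for u in nodes}
    kept = []
    for u, v in edges:
        if hop_distance(spanner, u, v, t) is None:
            spanner[u].add(v)
            spanner[v].add(u)
            kept.append((u, v))
    return kept

# Complete graph on 4 nodes
nodes = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

kept = greedy_spanner(nodes, edges, t=3)
dropped = [e for e in edges if e not in kept]   # complement of the spanner in the original graph
```

With t = 3 the complete graph on four nodes is reduced to a star with three edges: every dropped edge can still be bridged by a two-hop path through node 0.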

Figure: Original network. n = 115, m = 613
Figure: Sparsified network. n = 115, m = 137
Figure: Final network. n = 115, m = 137

The Need for Speed (and quality)
Table of Contents
1 Introduction
2 Motivation
3 The Hunt for Communities
4 The Need for Speed (and quality)
Performance comparison

The Need for Speed (and quality) Performance comparison
Performance Comparison
                                 Louvain Method      Our Algorithm
Dataset    n        m            Modularity  Time    t  Modularity  Time
Karate     34       78           0.415       0       7  0.589422    0.5
Dolphins   62       159          0.518       0       5  0.6744      0.53
Football   115      613          0.604       0       9  0.8627      0.69
Enron      33,696   180,811      0.596       0.38    3  0.855       13.13
DBLP       317,080  1,049,866    0.819       11      9  0.9589864   78.56

Wrapping Up
Social network analysis is a vibrant, dynamic field that spans sociology, economics, physics and biology, not just CS
Community detection is an active field of research.
Comparatively little work has been done on dynamic networks.

Thank you for listening!
